Skip to content

fix: seed peers being disconnected while seedstrap is in progress#7297

Closed
hansieodendaal wants to merge 1 commit intotari-project:developmentfrom
hansieodendaal:ho_soft_disconnect
Closed

fix: seed peers being disconnected while seedstrap is in progress#7297
hansieodendaal wants to merge 1 commit intotari-project:developmentfrom
hansieodendaal:ho_soft_disconnect

Conversation

@hansieodendaal
Copy link
Copy Markdown
Contributor

@hansieodendaal hansieodendaal commented Jul 9, 2025

Description

  • Fixed seedstrap seeds disconnecting during seedstrap phase. This was caused by the health service that obtained conenctions to all seed peers and disconnected them while seedstrap was busy. This was fixed by adding a soft disconnect alongside the hard disconnect in peer_connection.rs::PeerConnection so that parallel spawned process can politely request connections to be broken if not in use.
  • Also added better logging information for peer disconnects in that the requester is also logged.

Motivation and Context

Seed peer connections are being dropped while still busy streaming peer info via RPC durign seedstrap.

How Has This Been Tested?

System-level testing

Seedstrap can now finish within 3s if there are no seed peer connection issues:

[comms::dht::network_discovery::seed_strap] INFO  SeedStrap: Original early exit conditions met. Total peers added (455) >= needed (15). Successful seed contacts (5) >= min (5). Exiting seed node loop.
[comms::dht::network_discovery::seed_strap] INFO  SeedStrap: discover_peers_via_seeds finished. Attempted to contact: 5/5. Successfully synced from: 5. Total new peers added in this round: 455. Total duplicates processed in this round: 38.
[comms::dht::network_discovery::seed_strap] INFO  SeedStrap: Peer DB counts after seed bootstrap completion: Total peers in DB = 468, Non-seed peers in DB = 455
[comms::dht::network_discovery::seed_strap] INFO  SeedStrap: Added 455 (new) peers via seed nodes (5 successful seed node contacts).
[comms::dht::network_discovery::seed_strap] DEBUG SeedStrap: Round info at completion - new_peers: 455, duplicate_peers: 38, succeeded: 5, sync_peers_contacted: 5
[comms::dht::network_discovery] DEBUG Transition triggered from current state `SeedStrap` by event `DiscoveryComplete(Synced 5/5, num_new_peers = 455, num_duplicate_peers = 38)`
[comms::dht::network_discovery] INFO  [DHT EVENT PUBLISH] Successfully published DhtEvent::NetworkDiscoveryPeersAdded to 3 receiver(s)
[comms::dht::network_discovery] INFO  [DHT BOOTSTRAP] Bootstrap completed via SeedStrap in 2.397606s

Previously when the connections were being closed:

[comms::dht::network_discovery::seed_strap] WARN  SeedStrap: Timeout after 10.00s waiting for next peer from stream for seed 'd4b68aa41844d331149939e1da'. Breaking from peer stream collection.
[comms::dht::network_discovery::seed_strap] WARN  SeedStrap: Timeout after 10.00s waiting for next peer from stream for seed 'd7c289e9e3c8377705ce599a96'. Breaking from peer stream collection.
[comms::dht::network_discovery::seed_strap] WARN  SeedStrap: Timeout after 10.00s waiting for next peer from stream for seed '01dc53343af8e28b86892e1c0d'. Breaking from peer stream collection.
[comms::dht::network_discovery::seed_strap] WARN  SeedStrap: Timeout after 10.00s waiting for next peer from stream for seed '515391f40a2cf72ca362589ccb'. Breaking from peer stream collection.
[comms::dht::network_discovery::seed_strap] WARN  SeedStrap: Timeout after 10.00s waiting for next peer from stream for seed '6b959c029a0bbd3299b96efd78'. Breaking from peer stream collection.

...    

[comms::dht::network_discovery::seed_strap] INFO  SeedStrap: Original early exit conditions met. Total peers added (441) >= needed (15). Successful seed contacts (5) >= min (5). Exiting seed node loop.
[comms::dht::network_discovery::seed_strap] INFO  SeedStrap: discover_peers_via_seeds finished. Attempted to contact: 5/5. Successfully synced from: 5. Total new peers added in this round: 441. Total duplicates processed in this round: 48.
[comms::dht::network_discovery::seed_strap] INFO  SeedStrap: Peer DB counts after seed bootstrap completion: Total peers in DB = 453, Non-seed peers in DB = 440
[comms::dht::network_discovery::seed_strap] INFO  SeedStrap: Added 441 (new) peers via seed nodes (5 successful seed node contacts).
[comms::dht::network_discovery::seed_strap] DEBUG SeedStrap: Round info at completion - new_peers: 441, duplicate_peers: 48, succeeded: 5, sync_peers_contacted: 5
[comms::dht::network_discovery] DEBUG Transition triggered from current state `SeedStrap` by event `DiscoveryComplete(Synced 5/5, num_new_peers = 441, num_duplicate_peers = 48)`
[comms::dht::network_discovery] INFO  [DHT EVENT PUBLISH] Successfully published DhtEvent::NetworkDiscoveryPeersAdded to 3 receiver(s)
[comms::dht::network_discovery] INFO  [DHT BOOTSTRAP] Bootstrap completed via SeedStrap in 12.1362851s

What process can a PR reviewer use to test or verify this change?

Code review
System-level testing

Breaking Changes

  • None
  • Requires data directory on base node to be deleted
  • Requires hard fork
  • Other - Please specify

Summary by CodeRabbit

  • New Features

    • Improved peer discovery by introducing concurrent, batched queries to seed peers, resulting in faster and more efficient network synchronization.
    • Enhanced connection management by allowing disconnections to include requester context, improving traceability in logs.
  • Refactor

    • Simplified and modularized peer discovery logic, making the process more robust and maintainable.
    • Updated disconnect methods across the application to require a reason string for better diagnostics.
  • Bug Fixes

    • Improved handling of peer connections by conditionally disconnecting only when unused, reducing unnecessary disconnects.
  • Chores

    • Removed unused fields related to discovery round tracking to streamline data structures.
    • Updated tests and mocks to align with new disconnect method signatures.

@hansieodendaal hansieodendaal requested a review from a team as a code owner July 9, 2025 07:17
@hansieodendaal hansieodendaal marked this pull request as draft July 9, 2025 07:17
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Jul 9, 2025

Walkthrough

This change propagates a requester/context string through all peer disconnection operations, updating function signatures and method calls to include this parameter for improved logging and traceability. It introduces a conditional disconnect method (disconnect_if_unused), refactors health check peer disconnections to defer and use this method, and updates all relevant tests and mocks accordingly.

Changes

File(s) Change Summary
comms/core/src/connection_manager/peer_connection.rs Added requester string to disconnect methods and logs; introduced disconnect_if_unused; updated enum variant.
comms/core/src/connectivity/manager.rs Added requester parameter to all disconnect helper functions and their call sites.
base_layer/core/src/base_node/tari_pulse_service/mod.rs Refactored health check peer disconnects to defer and use disconnect_if_unused with requester context.
comms/core/src/test_utils/mocks/peer_connection.rs Updated mock to handle new disconnect signature with requester parameter.
comms/core/src/connectivity/test.rs
comms/core/src/protocol/rpc/client/tests.rs
comms/core/tests/tests/rpc.rs
comms/core/src/connection_manager/tests/listener_dialer.rs
Updated test disconnect calls to include requester string.
base_layer/p2p/src/services/liveness/service.rs
base_layer/p2p/src/services/monitor_peers/service.rs
comms/dht/src/network_discovery/seed_strap.rs
Updated disconnect calls to include requester/context string for logging.

Sequence Diagram(s)

sequenceDiagram
    participant Service as Service (e.g., Pulse, Liveness)
    participant ConnMgr as ConnectivityManagerActor
    participant PeerConn as PeerConnection

    Service->>ConnMgr: Request peer disconnect with context string
    ConnMgr->>PeerConn: Call disconnect/minimized/if_unused with requester string
    PeerConn->>PeerConn: Log disconnect with requester context
    PeerConn-->>ConnMgr: Return disconnect result
    ConnMgr-->>Service: Return operation result
Loading

Possibly related PRs

  • tari-project/tari#6921: Also modifies comms/core/src/connectivity/manager.rs to improve peer disconnection handling with requester context and timeouts.
  • tari-project/tari#7011: Introduces disconnect_if_unused and extends disconnect calls to include a requester parameter for better logging and conditional disconnect behavior.

Suggested reviewers

  • SWvheerden

Poem

In the warren of code, with connections so vast,
Now every goodbye is logged—no more shadows cast!
With reasons attached, each disconnect’s clear,
No bunny’s confused when a peer disappears.
🐇✨
Traceable hops, from service to core—
This rabbit’s delighted by clarity galore!


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 75dec99 and 7dda561.

📒 Files selected for processing (11)
  • base_layer/core/src/base_node/tari_pulse_service/mod.rs (2 hunks)
  • base_layer/p2p/src/services/liveness/service.rs (1 hunks)
  • base_layer/p2p/src/services/monitor_peers/service.rs (1 hunks)
  • comms/core/src/connection_manager/peer_connection.rs (7 hunks)
  • comms/core/src/connection_manager/tests/listener_dialer.rs (1 hunks)
  • comms/core/src/connectivity/manager.rs (9 hunks)
  • comms/core/src/connectivity/test.rs (2 hunks)
  • comms/core/src/protocol/rpc/client/tests.rs (1 hunks)
  • comms/core/src/test_utils/mocks/peer_connection.rs (1 hunks)
  • comms/core/tests/tests/rpc.rs (1 hunks)
  • comms/dht/src/network_discovery/seed_strap.rs (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • comms/dht/src/network_discovery/seed_strap.rs
🚧 Files skipped from review as they are similar to previous changes (10)
  • base_layer/p2p/src/services/liveness/service.rs
  • base_layer/p2p/src/services/monitor_peers/service.rs
  • comms/core/tests/tests/rpc.rs
  • comms/core/src/connection_manager/tests/listener_dialer.rs
  • comms/core/src/protocol/rpc/client/tests.rs
  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
  • comms/core/src/connectivity/test.rs
  • comms/core/src/test_utils/mocks/peer_connection.rs
  • comms/core/src/connectivity/manager.rs
  • comms/core/src/connection_manager/peer_connection.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: test (nextnet, nextnet)
  • GitHub Check: test (mainnet, stagenet)
  • GitHub Check: test (testnet, esmeralda)
  • GitHub Check: ci
  • GitHub Check: cargo check with stable
  • GitHub Check: Cucumber tests / Base Layer
  • GitHub Check: Cucumber tests / FFI
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Jul 9, 2025

Test Results (Integration tests)

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit 7dda561.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Jul 9, 2025

Test Results (CI)

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit 7dda561.

♻️ This comment has been updated with latest results.

@hansieodendaal hansieodendaal marked this pull request as ready for review July 9, 2025 08:38
@hansieodendaal hansieodendaal marked this pull request as draft July 9, 2025 08:41
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
comms/dht/src/network_discovery/seed_strap.rs (1)

161-244: Excellent refactoring to concurrent seed peer processing.

The change from sequential to concurrent processing will significantly improve performance when discovering peers from multiple seeds. The dynamic adjustment of permitted_concurrent based on whether seeds return empty peer lists is a clever optimization.

However, consider handling task panics more gracefully:

-                let Ok((peers_from_seed, new_peers_this_seed, duplicates_this_seed)) = result else {
-                    warn!(target: LOG_TARGET, "SeedStrap: get_peers task panicked or was cancelled");
-                    permitted_concurrent += 1;
-                    continue;
-                };
+                match result {
+                    Ok((peers_from_seed, new_peers_this_seed, duplicates_this_seed)) => {
+                        // Process results...
+                    },
+                    Err(e) => {
+                        warn!(target: LOG_TARGET, "SeedStrap: get_peers task failed: {:?}", e);
+                        permitted_concurrent += 1;
+                        continue;
+                    }
+                }

This would provide more diagnostic information if a task fails.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 22bd4b9 and 75dec99.

📒 Files selected for processing (15)
  • base_layer/core/src/base_node/tari_pulse_service/mod.rs (1 hunks)
  • base_layer/core/src/consensus/consensus_manager.rs (1 hunks)
  • base_layer/p2p/src/services/liveness/service.rs (1 hunks)
  • base_layer/p2p/src/services/monitor_peers/service.rs (1 hunks)
  • comms/core/src/connection_manager/peer_connection.rs (7 hunks)
  • comms/core/src/connection_manager/tests/listener_dialer.rs (1 hunks)
  • comms/core/src/connectivity/manager.rs (9 hunks)
  • comms/core/src/connectivity/test.rs (2 hunks)
  • comms/core/src/protocol/rpc/client/tests.rs (1 hunks)
  • comms/core/src/test_utils/mocks/peer_connection.rs (1 hunks)
  • comms/core/tests/tests/rpc.rs (1 hunks)
  • comms/dht/src/network_discovery/discovering.rs (0 hunks)
  • comms/dht/src/network_discovery/on_connect.rs (0 hunks)
  • comms/dht/src/network_discovery/seed_strap.rs (3 hunks)
  • comms/dht/src/network_discovery/state_machine.rs (0 hunks)
💤 Files with no reviewable changes (3)
  • comms/dht/src/network_discovery/discovering.rs
  • comms/dht/src/network_discovery/on_connect.rs
  • comms/dht/src/network_discovery/state_machine.rs
🧰 Additional context used
🧠 Learnings (13)
📓 Common learnings
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.184Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
base_layer/core/src/base_node/tari_pulse_service/mod.rs (8)
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.184Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
Learnt from: hansieodendaal
PR: tari-project/tari#6921
File: comms/core/src/connectivity/manager.rs:470-476
Timestamp: 2025-04-03T15:16:27.829Z
Learning: In the Tari codebase's connectivity manager, connections with uncertain state (like timed-out disconnects) are deliberately kept in the pool rather than forcibly removed. The connection pool refresh function runs on a regular interval, which handles any inconsistencies between the actual connection state and how it's tracked in the pool.
Learnt from: hansieodendaal
PR: tari-project/tari#7123
File: comms/core/src/peer_manager/storage/database.rs:1517-1541
Timestamp: 2025-05-29T09:42:20.881Z
Learning: In the `hard_delete_all_stale_peers` method in `comms/core/src/peer_manager/storage/database.rs`, the SQL query intentionally uses exact equality (`peers.features = ?`) rather than bitwise operations (`peers.features & ? != 0`) when matching `COMMUNICATION_NODE` features. This is the intended behavior to match only peers with exactly the `COMMUNICATION_NODE` feature, excluding those with additional feature flags.
base_layer/p2p/src/services/liveness/service.rs (6)
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.184Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
base_layer/p2p/src/services/monitor_peers/service.rs (7)
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.184Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7123
File: comms/core/src/peer_manager/storage/database.rs:1517-1541
Timestamp: 2025-05-29T09:42:20.881Z
Learning: In the `hard_delete_all_stale_peers` method in `comms/core/src/peer_manager/storage/database.rs`, the SQL query intentionally uses exact equality (`peers.features = ?`) rather than bitwise operations (`peers.features & ? != 0`) when matching `COMMUNICATION_NODE` features. This is the intended behavior to match only peers with exactly the `COMMUNICATION_NODE` feature, excluding those with additional feature flags.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
comms/core/src/connectivity/test.rs (5)
Learnt from: hansieodendaal
PR: tari-project/tari#6921
File: comms/core/src/connectivity/manager.rs:470-476
Timestamp: 2025-04-03T15:16:27.829Z
Learning: In the Tari codebase's connectivity manager, connections with uncertain state (like timed-out disconnects) are deliberately kept in the pool rather than forcibly removed. The connection pool refresh function runs on a regular interval, which handles any inconsistencies between the actual connection state and how it's tracked in the pool.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
Learnt from: hansieodendaal
PR: tari-project/tari#7123
File: comms/core/src/peer_manager/storage/database.rs:1517-1541
Timestamp: 2025-05-29T09:42:20.881Z
Learning: In the `hard_delete_all_stale_peers` method in `comms/core/src/peer_manager/storage/database.rs`, the SQL query intentionally uses exact equality (`peers.features = ?`) rather than bitwise operations (`peers.features & ? != 0`) when matching `COMMUNICATION_NODE` features. This is the intended behavior to match only peers with exactly the `COMMUNICATION_NODE` feature, excluding those with additional feature flags.
comms/core/src/protocol/rpc/client/tests.rs (6)
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.184Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
comms/core/src/connection_manager/tests/listener_dialer.rs (3)
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
base_layer/core/src/consensus/consensus_manager.rs (3)
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
comms/core/src/test_utils/mocks/peer_connection.rs (3)
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
comms/core/tests/tests/rpc.rs (3)
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
comms/core/src/connection_manager/peer_connection.rs (5)
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
comms/core/src/connectivity/manager.rs (7)
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.184Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: hansieodendaal
PR: tari-project/tari#6921
File: comms/core/src/connectivity/manager.rs:470-476
Timestamp: 2025-04-03T15:16:27.829Z
Learning: In the Tari codebase's connectivity manager, connections with uncertain state (like timed-out disconnects) are deliberately kept in the pool rather than forcibly removed. The connection pool refresh function runs on a regular interval, which handles any inconsistencies between the actual connection state and how it's tracked in the pool.
comms/dht/src/network_discovery/seed_strap.rs (10)
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.184Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: hansieodendaal
PR: tari-project/tari#6963
File: comms/core/src/peer_manager/manager.rs:60-68
Timestamp: 2025-05-26T02:40:23.812Z
Learning: PeerDatabaseSql in the Tari codebase has been specifically refactored to handle concurrent access and mitigate blocking I/O concerns on async executor threads. The implementation has been tested under high load at both system level and through unit tests like test_concurrent_add_or_update_and_get_closest_peers which validates concurrent read/write operations.
Learnt from: hansieodendaal
PR: tari-project/tari#7123
File: comms/core/src/peer_manager/storage/database.rs:1517-1541
Timestamp: 2025-05-29T09:42:20.881Z
Learning: In the `hard_delete_all_stale_peers` method in `comms/core/src/peer_manager/storage/database.rs`, the SQL query intentionally uses exact equality (`peers.features = ?`) rather than bitwise operations (`peers.features & ? != 0`) when matching `COMMUNICATION_NODE` features. This is the intended behavior to match only peers with exactly the `COMMUNICATION_NODE` feature, excluding those with additional feature flags.
Learnt from: hansieodendaal
PR: tari-project/tari#7280
File: base_layer/core/src/chain_storage/blockchain_database.rs:361-438
Timestamp: 2025-07-04T10:56:46.039Z
Learning: The combination of `tokio::task::spawn(async move {` with `tokio::task::spawn_blocking().await` in the payref rebuild background task works well and shuts down properly with the tokio environment, as confirmed by testing in the Tari codebase.
Learnt from: hansieodendaal
PR: tari-project/tari#6963
File: comms/dht/src/proto/mod.rs:141-142
Timestamp: 2025-05-02T07:12:23.985Z
Learning: The `PeerFeatures::from_bits_u32_truncate` method truncates a u32 to u8 bits but can still return `None` if the resulting bits don't match any valid flags, making the error handling with `.ok_or_else()` necessary even after truncation.
🧬 Code Graph Analysis (1)
comms/core/src/connection_manager/peer_connection.rs (3)
comms/core/src/test_utils/mocks/peer_connection.rs (1)
  • disconnect (174-181)
comms/core/src/multiplexing/yamux.rs (3)
  • substream_count (114-116)
  • substream_count (173-175)
  • substream_count (478-514)
comms/core/src/peer_manager/node_id.rs (1)
  • short_str (132-134)
🔇 Additional comments (29)
base_layer/core/src/consensus/consensus_manager.rs (1)

23-25: LGTM: Improved conditional import organization.

The change correctly separates the conditional compilation of TryFrom from Arc. Since TryFrom is only used in the new_target_difficulty method (line 140) which is also behind the #[cfg(feature = "base_node")] flag, this conditional import is appropriate and follows Rust best practices of only importing what you need when you need it.

base_layer/p2p/src/services/liveness/service.rs (1)

390-392: LGTM: Requester context added to disconnect operation.

The addition of the requester string "LivenessService disconnect failed peers" provides clear traceability for disconnect operations initiated by the liveness service, which aligns with the PR objective of adding disconnect requester statistics.

comms/core/tests/tests/rpc.rs (1)

320-320: LGTM: Test updated to match new disconnect signature.

The test correctly uses the updated disconnect method signature with the requester string "unit tests", ensuring compatibility with the enhanced disconnect API.

comms/core/src/connectivity/test.rs (2)

424-424: LGTM: Test updated to match new disconnect signature.

The test correctly uses the updated disconnect method signature with the requester string "unit test".


445-448: LGTM: Test updated to match new disconnect signature.

The test correctly uses the updated disconnect method signature with the requester string "unit test".

comms/core/src/protocol/rpc/client/tests.rs (1)

174-174: LGTM: Test updated to match new disconnect signature.

The test correctly uses the updated disconnect method signature with the requester string "unit test".

base_layer/core/src/base_node/tari_pulse_service/mod.rs (1)

376-376: LGTM: Implements soft disconnect with requester context.

This change effectively implements both PR objectives:

  1. Soft disconnect: Uses disconnect_if_unused instead of disconnect to avoid dropping connections that are still in use
  2. Requester context: Adds "Health check" string for disconnect traceability

This is particularly appropriate for health checks where forcible disconnection of active connections should be avoided.

base_layer/p2p/src/services/monitor_peers/service.rs (1)

314-318: LGTM: Good descriptive requester context added.

The addition of the requester parameter with "Monitor peers disconnect unresponsive" provides clear context for the disconnect operation and aligns with the broader refactor to improve traceability of peer disconnections.

comms/core/src/connection_manager/tests/listener_dialer.rs (1)

185-185: LGTM: Test properly updated for new disconnect signature.

The addition of the "unit test" requester parameter correctly updates the test to match the new disconnect method signature.

comms/core/src/test_utils/mocks/peer_connection.rs (1)

223-223: LGTM: Mock correctly updated to match new Disconnect variant signature.

The pattern match has been properly updated to include the new _requester parameter, maintaining compatibility with the updated PeerConnectionRequest::Disconnect variant.

comms/core/src/connectivity/manager.rs (9)

428-435: LGTM: Appropriate requester context added for disconnect all operation.

The requester string "ConnectivityManagerActor disconnect all" clearly identifies this disconnect operation during shutdown cleanup.


556-563: LGTM: Clear context for connection minimization disconnects.

The requester string "ConnectivityManagerActor maintain closest" provides good context for disconnections during connection pool optimization.


615-622: LGTM: Descriptive context for inactive connection cleanup.

The requester string "ConnectivityManagerActor reap inactive" clearly indicates this disconnect is part of the connection reaping process.


852-858: LGTM: Appropriate context for connection failure handling.

The requester string "ConnectivityManagerActor peer connect failed" provides clear context for disconnections triggered by connection failures.


982-988: LGTM: Consistent tie-break context provided.

Both tie-break disconnect operations use the same requester string "ConnectivityManagerActor tie break", providing consistent context for these connection resolution scenarios.

Also applies to: 1004-1010


1188-1188: LGTM: Clear context for peer ban disconnects.

The requester string "ConnectivityManagerActor ban peer" provides appropriate context for disconnections during peer banning operations.


1290-1304: LGTM: Function signature correctly updated with requester parameter.

The disconnect_with_timeout function has been properly updated to accept and pass through the requester parameter to the underlying disconnect call.


1306-1329: LGTM: Function signature correctly updated with requester parameter.

The disconnect_if_unused_with_timeout function has been properly updated to accept and pass through the requester parameter to the underlying disconnect call.


1331-1354: LGTM: Function signature correctly updated with requester parameter.

The disconnect_silent_with_timeout function has been properly updated to accept and pass through the requester parameter to the underlying disconnect call.

comms/core/src/connection_manager/peer_connection.rs (5)

123-128: LGTM! Good addition for disconnect traceability.

Adding the requester string to the Disconnect variant enables better debugging and monitoring of which components are requesting disconnections.


307-337: Excellent logging enhancement for disconnect operations.

The trace log provides comprehensive information about the disconnect request, including who requested it and the current connection state (RPC clients and substreams). This will be invaluable for debugging connection management issues.


339-360: Well-implemented soft disconnect functionality.

The disconnect_if_unused method correctly implements the PR's objective by:

  1. Checking for active RPC clients and substreams
  2. Preserving connections that are in use
  3. Providing clear trace logging to indicate the decision made

This addresses the issue of connections being dropped while still in use.


362-388: Consistent implementation with disconnect method.

The silent disconnect method correctly propagates the requester information while maintaining its silent behavior.


515-515: Proper propagation of requester context through the actor.

All PeerConnectionActor updates correctly handle the requester string:

  • Clear requester identification for internal shutdown
  • Proper extraction and forwarding in request handling
  • Enhanced logging at the point of actual disconnection

These changes maintain consistency throughout the disconnect flow.

Also applies to: 540-551, 646-651, 677-681

comms/dht/src/network_discovery/seed_strap.rs (5)

288-343: Clean extraction of early exit logic.

Extracting the early exit conditions into a dedicated method improves code readability and makes the logic easier to test and maintain. The two distinct exit conditions are clearly documented with informative logging.


351-460: Well-structured peer fetching with proper retry logic.

The get_peers function correctly implements the retry logic as per the learnings - it only retries when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which specifically handles mid-stream disconnections. The disconnect calls properly include descriptive requester strings ("SeedStrap disconnect seed on ok" and "SeedStrap disconnect seed on error").

Good implementation that addresses the known issue of connections being closed during RPC streaming.


461-615: Thorough peer validation and processing.

The peer processing logic includes comprehensive validation:

  • Self peer check
  • Seed peer check
  • Proper peer validation using PeerValidator
  • Graceful error handling that continues processing
  • Detailed logging for debugging

The implementation correctly tracks new vs duplicate peers for accurate statistics.


617-697: Good refactoring for testability and reusability.

Moving fetch_peers_from_connection out of the impl block and making it accept explicit parameters improves:

  • Testability by removing hidden dependencies
  • Reusability by making the function more generic
  • Clarity by making all inputs explicit

The implementation correctly handles the RPC setup and request parameters.


699-809: Excellent stream collection with timeout handling.

The collect_peer_stream function properly addresses the known issue mentioned in the learnings where RPC streams can hang when connections are closed. The per-item timeout of 10 seconds prevents indefinite blocking, and the comprehensive logging helps diagnose stream issues.

The function correctly handles all stream termination cases:

  • Timeout (returns collected peers so far)
  • Error (returns error)
  • Normal completion (returns all collected peers)

This implementation aligns with the documented need for aggressive timeouts in seedstrap operations.

Fixed seedstrap seeds disconnecting during seedstrap phase. This was caused by the
health service that obtained conenctions to all seed peers and disconnected them
while seedstrap was busy. This was fixed by adding a soft disconnect alongside the
hard disconnect in `peer_connection.rs::PeerConnection` so that parallel spawned
process can politely request connections to be broken if not in use. Also added
better logging information for peer disconnects in the the requester is also logged.
@hansieodendaal hansieodendaal changed the title feat: add soft disconnect for peer connections with requester stats fix: seed peers being disconnected while seedstrap is in progress Jul 10, 2025
@hansieodendaal hansieodendaal marked this pull request as ready for review July 10, 2025 12:28
@hansieodendaal
Copy link
Copy Markdown
Contributor Author

#7303 replaces this PR

@hansieodendaal hansieodendaal deleted the ho_soft_disconnect branch July 16, 2025 05:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants