fix: seed peers being disconnected while seedstrap is in progress#7297
fix: seed peers being disconnected while seedstrap is in progress#7297hansieodendaal wants to merge 1 commit intotari-project:developmentfrom
Conversation
WalkthroughThis change propagates a requester/context string through all peer disconnection operations, updating function signatures and method calls to include this parameter for improved logging and traceability. It introduces a conditional disconnect method ( Changes
Sequence Diagram(s)sequenceDiagram
participant Service as Service (e.g., Pulse, Liveness)
participant ConnMgr as ConnectivityManagerActor
participant PeerConn as PeerConnection
Service->>ConnMgr: Request peer disconnect with context string
ConnMgr->>PeerConn: Call disconnect/minimized/if_unused with requester string
PeerConn->>PeerConn: Log disconnect with requester context
PeerConn-->>ConnMgr: Return disconnect result
ConnMgr-->>Service: Return operation result
Possibly related PRs
Suggested reviewers
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (11)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (10)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Test Results (Integration tests)0 tests 0 ✅ 0s ⏱️ Results for commit 7dda561. ♻️ This comment has been updated with latest results. |
Test Results (CI)0 tests 0 ✅ 0s ⏱️ Results for commit 7dda561. ♻️ This comment has been updated with latest results. |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
comms/dht/src/network_discovery/seed_strap.rs (1)
161-244: Excellent refactoring to concurrent seed peer processing.The change from sequential to concurrent processing will significantly improve performance when discovering peers from multiple seeds. The dynamic adjustment of
permitted_concurrentbased on whether seeds return empty peer lists is a clever optimization.However, consider handling task panics more gracefully:
- let Ok((peers_from_seed, new_peers_this_seed, duplicates_this_seed)) = result else { - warn!(target: LOG_TARGET, "SeedStrap: get_peers task panicked or was cancelled"); - permitted_concurrent += 1; - continue; - }; + match result { + Ok((peers_from_seed, new_peers_this_seed, duplicates_this_seed)) => { + // Process results... + }, + Err(e) => { + warn!(target: LOG_TARGET, "SeedStrap: get_peers task failed: {:?}", e); + permitted_concurrent += 1; + continue; + } + }This would provide more diagnostic information if a task fails.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
base_layer/core/src/base_node/tari_pulse_service/mod.rs(1 hunks)base_layer/core/src/consensus/consensus_manager.rs(1 hunks)base_layer/p2p/src/services/liveness/service.rs(1 hunks)base_layer/p2p/src/services/monitor_peers/service.rs(1 hunks)comms/core/src/connection_manager/peer_connection.rs(7 hunks)comms/core/src/connection_manager/tests/listener_dialer.rs(1 hunks)comms/core/src/connectivity/manager.rs(9 hunks)comms/core/src/connectivity/test.rs(2 hunks)comms/core/src/protocol/rpc/client/tests.rs(1 hunks)comms/core/src/test_utils/mocks/peer_connection.rs(1 hunks)comms/core/tests/tests/rpc.rs(1 hunks)comms/dht/src/network_discovery/discovering.rs(0 hunks)comms/dht/src/network_discovery/on_connect.rs(0 hunks)comms/dht/src/network_discovery/seed_strap.rs(3 hunks)comms/dht/src/network_discovery/state_machine.rs(0 hunks)
💤 Files with no reviewable changes (3)
- comms/dht/src/network_discovery/discovering.rs
- comms/dht/src/network_discovery/on_connect.rs
- comms/dht/src/network_discovery/state_machine.rs
🧰 Additional context used
🧠 Learnings (13)
📓 Common learnings
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.184Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
base_layer/core/src/base_node/tari_pulse_service/mod.rs (8)
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.184Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
Learnt from: hansieodendaal
PR: tari-project/tari#6921
File: comms/core/src/connectivity/manager.rs:470-476
Timestamp: 2025-04-03T15:16:27.829Z
Learning: In the Tari codebase's connectivity manager, connections with uncertain state (like timed-out disconnects) are deliberately kept in the pool rather than forcibly removed. The connection pool refresh function runs on a regular interval, which handles any inconsistencies between the actual connection state and how it's tracked in the pool.
Learnt from: hansieodendaal
PR: tari-project/tari#7123
File: comms/core/src/peer_manager/storage/database.rs:1517-1541
Timestamp: 2025-05-29T09:42:20.881Z
Learning: In the `hard_delete_all_stale_peers` method in `comms/core/src/peer_manager/storage/database.rs`, the SQL query intentionally uses exact equality (`peers.features = ?`) rather than bitwise operations (`peers.features & ? != 0`) when matching `COMMUNICATION_NODE` features. This is the intended behavior to match only peers with exactly the `COMMUNICATION_NODE` feature, excluding those with additional feature flags.
base_layer/p2p/src/services/liveness/service.rs (6)
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.184Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
base_layer/p2p/src/services/monitor_peers/service.rs (7)
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.184Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7123
File: comms/core/src/peer_manager/storage/database.rs:1517-1541
Timestamp: 2025-05-29T09:42:20.881Z
Learning: In the `hard_delete_all_stale_peers` method in `comms/core/src/peer_manager/storage/database.rs`, the SQL query intentionally uses exact equality (`peers.features = ?`) rather than bitwise operations (`peers.features & ? != 0`) when matching `COMMUNICATION_NODE` features. This is the intended behavior to match only peers with exactly the `COMMUNICATION_NODE` feature, excluding those with additional feature flags.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
comms/core/src/connectivity/test.rs (5)
Learnt from: hansieodendaal
PR: tari-project/tari#6921
File: comms/core/src/connectivity/manager.rs:470-476
Timestamp: 2025-04-03T15:16:27.829Z
Learning: In the Tari codebase's connectivity manager, connections with uncertain state (like timed-out disconnects) are deliberately kept in the pool rather than forcibly removed. The connection pool refresh function runs on a regular interval, which handles any inconsistencies between the actual connection state and how it's tracked in the pool.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
Learnt from: hansieodendaal
PR: tari-project/tari#7123
File: comms/core/src/peer_manager/storage/database.rs:1517-1541
Timestamp: 2025-05-29T09:42:20.881Z
Learning: In the `hard_delete_all_stale_peers` method in `comms/core/src/peer_manager/storage/database.rs`, the SQL query intentionally uses exact equality (`peers.features = ?`) rather than bitwise operations (`peers.features & ? != 0`) when matching `COMMUNICATION_NODE` features. This is the intended behavior to match only peers with exactly the `COMMUNICATION_NODE` feature, excluding those with additional feature flags.
comms/core/src/protocol/rpc/client/tests.rs (6)
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.184Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
comms/core/src/connection_manager/tests/listener_dialer.rs (3)
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
base_layer/core/src/consensus/consensus_manager.rs (3)
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
comms/core/src/test_utils/mocks/peer_connection.rs (3)
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
comms/core/tests/tests/rpc.rs (3)
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
comms/core/src/connection_manager/peer_connection.rs (5)
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
comms/core/src/connectivity/manager.rs (7)
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.184Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: hansieodendaal
PR: tari-project/tari#6921
File: comms/core/src/connectivity/manager.rs:470-476
Timestamp: 2025-04-03T15:16:27.829Z
Learning: In the Tari codebase's connectivity manager, connections with uncertain state (like timed-out disconnects) are deliberately kept in the pool rather than forcibly removed. The connection pool refresh function runs on a regular interval, which handles any inconsistencies between the actual connection state and how it's tracked in the pool.
comms/dht/src/network_discovery/seed_strap.rs (10)
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.184Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.309Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: hansieodendaal
PR: tari-project/tari#6963
File: comms/core/src/peer_manager/manager.rs:60-68
Timestamp: 2025-05-26T02:40:23.812Z
Learning: PeerDatabaseSql in the Tari codebase has been specifically refactored to handle concurrent access and mitigate blocking I/O concerns on async executor threads. The implementation has been tested under high load at both system level and through unit tests like test_concurrent_add_or_update_and_get_closest_peers which validates concurrent read/write operations.
Learnt from: hansieodendaal
PR: tari-project/tari#7123
File: comms/core/src/peer_manager/storage/database.rs:1517-1541
Timestamp: 2025-05-29T09:42:20.881Z
Learning: In the `hard_delete_all_stale_peers` method in `comms/core/src/peer_manager/storage/database.rs`, the SQL query intentionally uses exact equality (`peers.features = ?`) rather than bitwise operations (`peers.features & ? != 0`) when matching `COMMUNICATION_NODE` features. This is the intended behavior to match only peers with exactly the `COMMUNICATION_NODE` feature, excluding those with additional feature flags.
Learnt from: hansieodendaal
PR: tari-project/tari#7280
File: base_layer/core/src/chain_storage/blockchain_database.rs:361-438
Timestamp: 2025-07-04T10:56:46.039Z
Learning: The combination of `tokio::task::spawn(async move {` with `tokio::task::spawn_blocking().await` in the payref rebuild background task works well and shuts down properly with the tokio environment, as confirmed by testing in the Tari codebase.
Learnt from: hansieodendaal
PR: tari-project/tari#6963
File: comms/dht/src/proto/mod.rs:141-142
Timestamp: 2025-05-02T07:12:23.985Z
Learning: The `PeerFeatures::from_bits_u32_truncate` method truncates a u32 to u8 bits but can still return `None` if the resulting bits don't match any valid flags, making the error handling with `.ok_or_else()` necessary even after truncation.
🧬 Code Graph Analysis (1)
comms/core/src/connection_manager/peer_connection.rs (3)
comms/core/src/test_utils/mocks/peer_connection.rs (1)
disconnect(174-181)comms/core/src/multiplexing/yamux.rs (3)
substream_count(114-116)substream_count(173-175)substream_count(478-514)comms/core/src/peer_manager/node_id.rs (1)
short_str(132-134)
🔇 Additional comments (29)
base_layer/core/src/consensus/consensus_manager.rs (1)
23-25: LGTM: Improved conditional import organization.The change correctly separates the conditional compilation of
TryFromfromArc. SinceTryFromis only used in thenew_target_difficultymethod (line 140) which is also behind the#[cfg(feature = "base_node")]flag, this conditional import is appropriate and follows Rust best practices of only importing what you need when you need it.base_layer/p2p/src/services/liveness/service.rs (1)
390-392: LGTM: Requester context added to disconnect operation.The addition of the requester string
"LivenessService disconnect failed peers"provides clear traceability for disconnect operations initiated by the liveness service, which aligns with the PR objective of adding disconnect requester statistics.comms/core/tests/tests/rpc.rs (1)
320-320: LGTM: Test updated to match new disconnect signature.The test correctly uses the updated disconnect method signature with the requester string
"unit tests", ensuring compatibility with the enhanced disconnect API.comms/core/src/connectivity/test.rs (2)
424-424: LGTM: Test updated to match new disconnect signature.The test correctly uses the updated disconnect method signature with the requester string
"unit test".
445-448: LGTM: Test updated to match new disconnect signature.The test correctly uses the updated disconnect method signature with the requester string
"unit test".comms/core/src/protocol/rpc/client/tests.rs (1)
174-174: LGTM: Test updated to match new disconnect signature.The test correctly uses the updated disconnect method signature with the requester string
"unit test".base_layer/core/src/base_node/tari_pulse_service/mod.rs (1)
376-376: LGTM: Implements soft disconnect with requester context.This change effectively implements both PR objectives:
- Soft disconnect: Uses
disconnect_if_unusedinstead ofdisconnectto avoid dropping connections that are still in use- Requester context: Adds
"Health check"string for disconnect traceabilityThis is particularly appropriate for health checks where forcible disconnection of active connections should be avoided.
base_layer/p2p/src/services/monitor_peers/service.rs (1)
314-318: LGTM: Good descriptive requester context added.The addition of the requester parameter with "Monitor peers disconnect unresponsive" provides clear context for the disconnect operation and aligns with the broader refactor to improve traceability of peer disconnections.
comms/core/src/connection_manager/tests/listener_dialer.rs (1)
185-185: LGTM: Test properly updated for new disconnect signature.The addition of the "unit test" requester parameter correctly updates the test to match the new disconnect method signature.
comms/core/src/test_utils/mocks/peer_connection.rs (1)
223-223: LGTM: Mock correctly updated to match new Disconnect variant signature.The pattern match has been properly updated to include the new
_requesterparameter, maintaining compatibility with the updatedPeerConnectionRequest::Disconnectvariant.comms/core/src/connectivity/manager.rs (9)
428-435: LGTM: Appropriate requester context added for disconnect all operation.The requester string "ConnectivityManagerActor disconnect all" clearly identifies this disconnect operation during shutdown cleanup.
556-563: LGTM: Clear context for connection minimization disconnects.The requester string "ConnectivityManagerActor maintain closest" provides good context for disconnections during connection pool optimization.
615-622: LGTM: Descriptive context for inactive connection cleanup.The requester string "ConnectivityManagerActor reap inactive" clearly indicates this disconnect is part of the connection reaping process.
852-858: LGTM: Appropriate context for connection failure handling.The requester string "ConnectivityManagerActor peer connect failed" provides clear context for disconnections triggered by connection failures.
982-988: LGTM: Consistent tie-break context provided.Both tie-break disconnect operations use the same requester string "ConnectivityManagerActor tie break", providing consistent context for these connection resolution scenarios.
Also applies to: 1004-1010
1188-1188: LGTM: Clear context for peer ban disconnects.The requester string "ConnectivityManagerActor ban peer" provides appropriate context for disconnections during peer banning operations.
1290-1304: LGTM: Function signature correctly updated with requester parameter.The
disconnect_with_timeoutfunction has been properly updated to accept and pass through the requester parameter to the underlying disconnect call.
1306-1329: LGTM: Function signature correctly updated with requester parameter.The
disconnect_if_unused_with_timeoutfunction has been properly updated to accept and pass through the requester parameter to the underlying disconnect call.
1331-1354: LGTM: Function signature correctly updated with requester parameter.The
disconnect_silent_with_timeoutfunction has been properly updated to accept and pass through the requester parameter to the underlying disconnect call.comms/core/src/connection_manager/peer_connection.rs (5)
123-128: LGTM! Good addition for disconnect traceability.Adding the requester string to the Disconnect variant enables better debugging and monitoring of which components are requesting disconnections.
307-337: Excellent logging enhancement for disconnect operations.The trace log provides comprehensive information about the disconnect request, including who requested it and the current connection state (RPC clients and substreams). This will be invaluable for debugging connection management issues.
339-360: Well-implemented soft disconnect functionality.The
disconnect_if_unusedmethod correctly implements the PR's objective by:
- Checking for active RPC clients and substreams
- Preserving connections that are in use
- Providing clear trace logging to indicate the decision made
This addresses the issue of connections being dropped while still in use.
362-388: Consistent implementation with disconnect method.The silent disconnect method correctly propagates the requester information while maintaining its silent behavior.
515-515: Proper propagation of requester context through the actor.All PeerConnectionActor updates correctly handle the requester string:
- Clear requester identification for internal shutdown
- Proper extraction and forwarding in request handling
- Enhanced logging at the point of actual disconnection
These changes maintain consistency throughout the disconnect flow.
Also applies to: 540-551, 646-651, 677-681
comms/dht/src/network_discovery/seed_strap.rs (5)
288-343: Clean extraction of early exit logic.Extracting the early exit conditions into a dedicated method improves code readability and makes the logic easier to test and maintain. The two distinct exit conditions are clearly documented with informative logging.
351-460: Well-structured peer fetching with proper retry logic.The
get_peersfunction correctly implements the retry logic as per the learnings - it only retries whenpeers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which specifically handles mid-stream disconnections. The disconnect calls properly include descriptive requester strings ("SeedStrap disconnect seed on ok" and "SeedStrap disconnect seed on error").Good implementation that addresses the known issue of connections being closed during RPC streaming.
461-615: Thorough peer validation and processing.The peer processing logic includes comprehensive validation:
- Self peer check
- Seed peer check
- Proper peer validation using PeerValidator
- Graceful error handling that continues processing
- Detailed logging for debugging
The implementation correctly tracks new vs duplicate peers for accurate statistics.
617-697: Good refactoring for testability and reusability.Moving
fetch_peers_from_connectionout of the impl block and making it accept explicit parameters improves:
- Testability by removing hidden dependencies
- Reusability by making the function more generic
- Clarity by making all inputs explicit
The implementation correctly handles the RPC setup and request parameters.
699-809: Excellent stream collection with timeout handling.The
collect_peer_streamfunction properly addresses the known issue mentioned in the learnings where RPC streams can hang when connections are closed. The per-item timeout of 10 seconds prevents indefinite blocking, and the comprehensive logging helps diagnose stream issues.The function correctly handles all stream termination cases:
- Timeout (returns collected peers so far)
- Error (returns error)
- Normal completion (returns all collected peers)
This implementation aligns with the documented need for aggressive timeouts in seedstrap operations.
75dec99 to
1c4a9ef
Compare
Fixed seedstrap seeds disconnecting during seedstrap phase. This was caused by the health service that obtained conenctions to all seed peers and disconnected them while seedstrap was busy. This was fixed by adding a soft disconnect alongside the hard disconnect in `peer_connection.rs::PeerConnection` so that parallel spawned process can politely request connections to be broken if not in use. Also added better logging information for peer disconnects in the the requester is also logged.
1c4a9ef to
7dda561
Compare
|
#7303 replaces this PR |
Description
peer_connection.rs::PeerConnectionso that parallel spawned process can politely request connections to be broken if not in use.Motivation and Context
Seed peer connections are being dropped while still busy streaming peer info via RPC durign seedstrap.
How Has This Been Tested?
System-level testing
Seedstrap can now finish within 3s if there are no seed peer connection issues:
Previously when the connections were being closed:
What process can a PR reviewer use to test or verify this change?
Code review
System-level testing
Breaking Changes
Summary by CodeRabbit
New Features
Refactor
Bug Fixes
Chores