fix: race condition in getting seed peers in connectivity manager at startup (#7474)
Conversation
This PR fixes a race condition between adding seed peers at startup and retrieving them from
the peer manager in the connectivity manager.
The warning message `WARN Failed to get seed peers from PeerManager, using empty list`
(`comms/core/src/connectivity/manager.rs:207`) is a timing issue. In this context the seed
peer list is passed to the proactive dialer to exclude seeds from being proactively dialed
when other connections exist. On a node with proper connectivity, seed peer dial requests
come through straight afterwards:
- `2025-09-04 16:28:05.638210600 [p2p::initialization] DEBUG Adding seed peer [PeerFlags(SEED)
[0984896e74022c44] `
- `2025-09-04 16:28:05.648016500 [comms::connectivity::manager] DEBUG ConnectivityManager started`
- `2025-09-04 16:28:05.648239900 [comms::connectivity::manager] WARN Failed to get seed peers
from PeerManager, using empty list`
- `2025-09-04 16:28:05.648850400 [comms::dht::connectivity] DEBUG Adding 5 neighbouring peer(s),
removing 0 peers: 0984896e74022c442c1034852c, `
- `2025-09-04 16:28:05.648912900 [comms::connectivity::manager] TRACE Request (11267568784269984977):
DialPeer { node_id: NodeId(0984896e74022c442c1034852c), reply_tx: None }`
...
- `2025-09-04 16:31:05.655859300 [comms::connectivity::manager] DEBUG (5619097807157888458) Starting
proactive dialing execution - current connections: 10, target: 8`
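The fix defers the seed lookup to the first proactive-dial pass and caches the result, falling back to an empty list on error so a later pass can retry. A minimal, self-contained sketch of that caching behaviour (`SeedCache` and the `fetch` closure are illustrative stand-ins, not the actual `ConnectivityManagerActor`/`PeerManager` types):

```rust
// Sketch of the lazy seed-fetch fix: instead of reading seeds once at startup
// (racing the task that inserts them), cache an empty list and fetch on the
// first proactive-dial pass; on error, fall back to empty and retry next pass.
#[derive(Default)]
struct SeedCache {
    seeds: Vec<String>, // cached seed node ids
}

impl SeedCache {
    /// Returns cached seeds, fetching them lazily from `fetch` on first use.
    /// On error, falls back to an empty list so the next pass retries.
    fn get_or_fetch<F>(&mut self, fetch: F) -> &[String]
    where
        F: FnOnce() -> Result<Vec<String>, String>,
    {
        if self.seeds.is_empty() {
            self.seeds = fetch().unwrap_or_default();
        }
        &self.seeds
    }
}

fn main() {
    let mut cache = SeedCache::default();
    // First pass: peer manager not populated yet -> empty fallback, no failure.
    assert!(cache.get_or_fetch(|| Err("not ready".to_string())).is_empty());
    // Later pass: seeds are available now and get cached.
    let seeds = cache.get_or_fetch(|| Ok(vec!["0984896e".to_string()]));
    assert_eq!(seeds, ["0984896e"]);
    // Subsequent passes reuse the cache without calling fetch again.
    let seeds = cache.get_or_fetch(|| unreachable!("should not re-fetch"));
    assert_eq!(seeds.len(), 1);
    println!("ok");
}
```

The same shape appears in the actual diff reviewed below, where the fallible source is `peer_manager.get_seed_peers().await`.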
Walkthrough

Moves seed initialization from startup to lazy loading during proactive dialing; adds logging and error handling when fetching seed peers; DNS TXT seed parsing now logs records and per-record parse errors. No public API signatures changed.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Scheduler as Scheduler (tick)
    participant CMA as ConnectivityManagerActor
    participant PM as PeerManager
    participant DNS as DnsSeedResolver
    Scheduler->>CMA: execute_proactive_dialing()
    alt self.seeds is empty
        CMA->>PM: get_seed_peers().await
        PM-->>CMA: Result<Vec<Peer>>
        alt Ok
            CMA->>CMA: self.seeds = peers.map(node_id.clone)
        else Err
            Note over CMA: warn: failed to fetch seeds<br/>fallback to empty list
            CMA->>CMA: self.seeds = []
        end
    else seeds cached
        Note over CMA: reuse cached seeds
    end
    CMA->>CMA: proceed with proactive dialing using self.seeds
    Note over DNS: DNS resolver logs TXT records and per-record parse errors during seed parsing
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 0
🧹 Nitpick comments (3)

comms/core/src/peer_manager/peer_storage_sql.rs (1)

268-276: Gate trace logging to avoid unnecessary allocations; include count for readability

`trace!` arguments are evaluated even when the level is disabled; building a Vec for logging does extra work on hot paths. Guard the formatting and include a count.

```diff
-        trace!(
-            target: LOG_TARGET,
-            "Get seed peers: {:?}",
-            seed_peers.iter().map(|p| p.node_id.short_str()).collect::<Vec<_>>(),
-        );
+        if log::log_enabled!(target: LOG_TARGET, log::Level::Trace) {
+            let ids: Vec<_> = seed_peers.iter().map(|p| p.node_id.short_str()).collect();
+            trace!(target: LOG_TARGET, "Get seed peers ({}): [{}]", ids.len(), ids.join(","));
+        }
```

comms/core/src/connectivity/manager.rs (2)
1208-1224: Avoid clones; use into_iter + unwrap_or_default

Minor micro-opt: move NodeIds out of Peers and use `unwrap_or_default()`.

```diff
-        if self.seeds.is_empty() {
-            self.seeds = self
-                .peer_manager
-                .get_seed_peers()
-                .await
-                .inspect_err(|err| {
-                    warn!(
-                        target: LOG_TARGET,
-                        "Failed to get seed peers from PeerManager, using empty list for proactive dialing, seed peers \
-                         will not be excluded as a first pass. ({})", err
-                    );
-                })
-                .unwrap_or(vec![])
-                .iter()
-                .map(|s| s.node_id.clone())
-                .collect();
-        }
+        if self.seeds.is_empty() {
+            self.seeds = self
+                .peer_manager
+                .get_seed_peers()
+                .await
+                .inspect_err(|err| {
+                    warn!(
+                        target: LOG_TARGET,
+                        "Failed to get seed peers from PeerManager, using empty list for proactive dialing; seeds \
+                         won't be excluded on first pass. ({})",
+                        err
+                    );
+                })
+                .unwrap_or_default()
+                .into_iter()
+                .map(|s| s.node_id)
+                .collect();
+        }
```
1208-1224: Throttle re-fetch when there are legitimately zero seeds

As written, if the DB truly has no seeds, this block runs every refresh tick. Consider tracking a boolean (e.g., `seeds_initialized: bool`) or a last-fetch timestamp to avoid repeated DB hits, with a periodic refresh (e.g., every few minutes).
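The timestamp variant of that suggestion can be sketched as follows. This is an illustrative, self-contained sketch, not code from the PR; `ThrottledSeeds` and `REFRESH_INTERVAL` are made-up names:

```rust
use std::time::{Duration, Instant};

// If the database legitimately has zero seeds, remember when we last tried so
// the refresh tick does not hit the DB on every pass; retry after an interval.
const REFRESH_INTERVAL: Duration = Duration::from_secs(300);

struct ThrottledSeeds {
    seeds: Vec<String>,
    last_fetch: Option<Instant>,
}

impl ThrottledSeeds {
    fn new() -> Self {
        Self { seeds: Vec::new(), last_fetch: None }
    }

    /// Fetch only if we have no seeds AND we never fetched before
    /// or the refresh interval has elapsed since the last attempt.
    fn should_fetch(&self, now: Instant) -> bool {
        self.seeds.is_empty()
            && self
                .last_fetch
                .map_or(true, |t| now.duration_since(t) >= REFRESH_INTERVAL)
    }

    fn refresh<F>(&mut self, now: Instant, fetch: F)
    where
        F: FnOnce() -> Result<Vec<String>, String>,
    {
        if self.should_fetch(now) {
            self.last_fetch = Some(now); // record the attempt, even if it fails
            self.seeds = fetch().unwrap_or_default();
        }
    }
}

fn main() {
    let t0 = Instant::now();
    let mut s = ThrottledSeeds::new();
    s.refresh(t0, || Ok(vec![])); // DB truly empty
    // Shortly after, an empty result does not trigger another DB hit.
    assert!(!s.should_fetch(t0 + Duration::from_secs(1)));
    // After the interval, we retry.
    assert!(s.should_fetch(t0 + REFRESH_INTERVAL));
    println!("ok");
}
```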
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- comms/core/src/connectivity/manager.rs (1 hunks)
- comms/core/src/peer_manager/peer_storage_sql.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Applied to files:
comms/core/src/connectivity/manager.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Applied to files:
comms/core/src/connectivity/manager.rs
🧬 Code graph analysis (1)
comms/core/src/peer_manager/peer_storage_sql.rs (3)

- comms/core/src/peer_manager/manager.rs (1): get_seed_peers (127-129)
- comms/core/src/peer_manager/storage/database.rs (1): get_seed_peers (1632-1642)
- comms/core/src/peer_manager/node_id.rs (1): short_str (132-134)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: cargo check with stable
- GitHub Check: test (mainnet, stagenet)
- GitHub Check: test (nextnet, nextnet)
- GitHub Check: test (testnet, esmeralda)
- GitHub Check: wasm build tests
- GitHub Check: Cucumber tests / FFI
- GitHub Check: Cucumber tests / Base Layer
- GitHub Check: ci
- GitHub Check: ledger build tests
🔇 Additional comments (2)
comms/core/src/connectivity/manager.rs (2)
1208-1224: LGTM: Lazy seed fetch fixes the startup race without changing external behavior

Fetching seeds on the first proactive-dial pass eliminates the timing window and still retries on subsequent passes if empty. Nicely scoped.
1208-1224: Drop MSRV inspection suggestion

`Result::inspect_err` is already used throughout the codebase and is supported by the stable toolchain. Likely an incorrect or invalid review comment.
Actionable comments posted: 0
🧹 Nitpick comments (4)
base_layer/p2p/src/peer_seeds.rs (4)
79-79: Cap DNS TXT log verbosity and sanitize values.

Dumping full, untrusted TXT records at trace can be noisy and risks log-forging if records contain control chars. Log count + a truncated sample instead.

```diff
-    trace!(target: LOG_TARGET, "DNS records: {:?}", records);
+    trace!(
+        target: LOG_TARGET,
+        "DNS TXT record count: {}, sample: {:?}",
+        records.len(),
+        records
+            .iter()
+            .take(3)
+            .map(|s| if s.len() > 128 { format!("{}…", &s[..128]) } else { s.clone() })
+            .collect::<Vec<_>>()
+    );
```
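Side note on the suggested diff: `&s[..128]` panics if byte 128 falls inside a multi-byte UTF-8 character. A boundary-safe truncation helper that also defuses control characters could look like this (illustrative helper, not part of the PR):

```rust
// Truncate an untrusted string for logging: cut at a valid UTF-8 char
// boundary at or below `max_bytes`, and replace control characters
// (which enable log-forging) with the replacement character.
fn truncate_for_log(s: &str, max_bytes: usize) -> String {
    let sanitize = |c: char| if c.is_control() { '\u{FFFD}' } else { c };
    if s.len() <= max_bytes {
        return s.chars().map(sanitize).collect();
    }
    // Walk back from max_bytes until we hit a char boundary (never panics).
    let mut end = max_bytes;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    let mut out: String = s[..end].chars().map(sanitize).collect();
    out.push('…');
    out
}

fn main() {
    // Multi-byte char straddling the cut point: no panic, clean boundary.
    let s = format!("{}é", "a".repeat(127)); // 'é' occupies bytes 127..129
    assert!(truncate_for_log(&s, 128).ends_with('…'));
    // Control characters are replaced, defusing log-forging.
    assert_eq!(truncate_for_log("a\nb", 10), "a\u{FFFD}b");
    println!("ok");
}
```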
80-84: Be explicit about types to avoid inference surprises.

Minor clarity: explicitly annotate the vector element type and turbofish the parse.

```diff
-    let peers: Vec<_> = records
+    let peers: Vec<SeedPeer> = records
         .into_iter()
         .filter_map(|txt| {
-            txt.parse()
+            txt.parse::<SeedPeer>()
                 .inspect_err(|err| {
```
93-94: Avoid allocating a Vec just for logging; log count + small sample.

Reduces alloc/formatting overhead on hot paths while keeping visibility.

```diff
-    trace!(target: LOG_TARGET, "Seed peers: {:?}", peers.iter().map(|p| format!("{}", p)).collect::<Vec<_>>());
+    trace!(
+        target: LOG_TARGET,
+        "Parsed {} seed peer(s); sample: {:?}",
+        peers.len(),
+        peers.iter().take(3).map(|p| p.to_string()).collect::<Vec<_>>()
+    );
```
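The per-record pattern these comments discuss (keep valid entries, log and skip invalid ones rather than failing the whole batch) in a generic, self-contained form, parsing u16 ports as a stand-in for `SeedPeer::from_str`:

```rust
// `filter_map` + `inspect_err` + `.ok()`: each bad record is logged and
// dropped; valid records still make it through.
fn parse_records(records: Vec<String>) -> Vec<u16> {
    records
        .into_iter()
        .filter_map(|txt| {
            txt.parse::<u16>()
                .inspect_err(|err| eprintln!("skipping invalid record {:?}: {}", txt, err))
                .ok()
        })
        .collect()
}

fn main() {
    let recs = vec!["8080".to_string(), "not-a-port".to_string(), "443".to_string()];
    // The invalid record is logged to stderr and skipped.
    assert_eq!(parse_records(recs), vec![8080, 443]);
    println!("ok");
}
```

Note `Result::inspect_err` requires Rust 1.76+, which is addressed by the MSRV comment further down.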
143-156: Display/FromStr are not round-trippable (multi-address delimiter mismatch).

Display joins multiple addresses with "::", but FromStr expects exactly one "::" delimiter (public_key::address). This can mislead downstream tooling and logs that assume round-tripping. Consider a different address separator (e.g., comma) in Display, or allow multi-address parsing.

Can you confirm no code path relies on parsing String::from(SeedPeer) back via FromStr?
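One way to make the pair round-trippable per the comma suggestion above. `Seed` is an illustrative stand-in for `SeedPeer`, and this assumes addresses never contain "::" or ",":

```rust
use std::fmt;
use std::str::FromStr;

// Display joins addresses with ',' (not "::"), so FromStr can split on the
// single "::" between key and addresses, then on ',' between addresses.
#[derive(Debug, PartialEq)]
struct Seed {
    public_key: String,
    addresses: Vec<String>,
}

impl fmt::Display for Seed {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}::{}", self.public_key, self.addresses.join(","))
    }
}

impl FromStr for Seed {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let (key, addrs) = s.split_once("::").ok_or("expected key::addresses")?;
        Ok(Seed {
            public_key: key.to_string(),
            addresses: addrs.split(',').map(str::to_string).collect(),
        })
    }
}

fn main() {
    let seed = Seed {
        public_key: "0984896e".to_string(),
        addresses: vec![
            "/ip4/1.2.3.4/tcp/18189".to_string(),
            "/dns4/example.com/tcp/18189".to_string(),
        ],
    };
    // Multi-address value now survives a Display -> FromStr round trip.
    let round_tripped: Seed = seed.to_string().parse().unwrap();
    assert_eq!(round_tripped, seed);
    println!("ok");
}
```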
📒 Files selected for processing (1)
- base_layer/p2p/src/peer_seeds.rs (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Cucumber tests / Base Layer
- GitHub Check: Cucumber tests / FFI
- GitHub Check: test (mainnet, stagenet)
- GitHub Check: test (testnet, esmeralda)
- GitHub Check: test (nextnet, nextnet)
- GitHub Check: cargo check with stable
- GitHub Check: ci
- GitHub Check: ledger build tests
🔇 Additional comments (2)
base_layer/p2p/src/peer_seeds.rs (2)

80-90: Nice: per-record parse error logging via inspect_err.

Good observability improvement while still short-circuiting invalid entries.

80-90: inspect_err supported by current MSRV

`Result::inspect_err` was stabilized in Rust 1.76.0 (Feb 8, 2024), and our root rust-toolchain.toml pins the build to the stable channel (≥1.76.0). No action needed.
Description

This PR fixes a race condition between adding seed peers at startup and retrieving them from the peer manager in the connectivity manager.

Motivation and Context

See the warning message and log excerpt in the conversation above.

How Has This Been Tested?
System-level testing
What process can a PR reviewer use to test or verify this change?
Code review
Breaking Changes