
fix: race condition in getting seed peers in connectivity manager at startup#7474

Merged
SWvheerden merged 2 commits into tari-project:development from hansieodendaal:ho_fix_false_negative_connectivity_manager on Sep 5, 2025

Conversation


@hansieodendaal hansieodendaal commented Sep 5, 2025

Description

This PR fixes a race condition between adding seed peers at startup and retrieving them from the peer manager in the connectivity manager.

Motivation and Context

The warning message `WARN Failed to get seed peers from PeerManager, using empty list // comms/core/src/connectivity/manager.rs:207` is caused by a timing issue. In this context, the seed peer list is passed to the proactive dialer so that seed peers are excluded from proactive dialing when other connections exist. On a node with proper connectivity, seed peer dial requests come through straight afterwards.

  • 2025-09-04 16:28:05.638210600 [p2p::initialization] DEBUG Adding seed peer [PeerFlags(SEED)[0984896e74022c44]
  • 2025-09-04 16:28:05.648016500 [comms::connectivity::manager] DEBUG ConnectivityManager started
  • 2025-09-04 16:28:05.648239900 [comms::connectivity::manager] WARN Failed to get seed peers from PeerManager, using empty list
  • 2025-09-04 16:28:05.648850400 [comms::dht::connectivity] DEBUG Adding 5 neighbouring peer(s), removing 0 peers: 0984896e74022c442c1034852c,
  • 2025-09-04 16:28:05.648912900 [comms::connectivity::manager] TRACE Request (11267568784269984977): DialPeer { node_id: NodeId(0984896e74022c442c1034852c), reply_tx: None } ...
  • 2025-09-04 16:31:05.655859300 [comms::connectivity::manager] DEBUG (5619097807157888458) Starting proactive dialing execution - current connections: 10, target: 8
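The fix described above can be sketched as a minimal lazy-fetch cache. This is a hypothetical, self-contained illustration, not the real implementation: `SeedStore` stands in for the real PeerManager, and the actual code is async and fetches inside `execute_proactive_dialing()`.

```rust
// Sketch of the lazy-fetch pattern: the seed list is no longer read eagerly
// at startup (where it raced against seed-peer insertion); instead it is
// fetched on the first proactive-dialing pass and cached, retrying on later
// passes while the cache is still empty.

struct SeedStore {
    // Simulates the peer DB, which may not be populated yet at startup.
    seeds: Vec<String>,
}

struct ConnectivityManager {
    cached_seeds: Vec<String>,
}

impl ConnectivityManager {
    fn new() -> Self {
        Self { cached_seeds: Vec::new() }
    }

    // Called on every proactive-dialing tick; fetches lazily and caches.
    fn seeds_for_dialing(&mut self, store: &SeedStore) -> &[String] {
        if self.cached_seeds.is_empty() {
            // An empty result is simply retried on the next tick.
            self.cached_seeds = store.seeds.clone();
        }
        &self.cached_seeds
    }
}

fn main() {
    let mut mgr = ConnectivityManager::new();
    // Tick 1: store still empty (the startup race) -> empty list, no failure.
    let empty_store = SeedStore { seeds: vec![] };
    assert!(mgr.seeds_for_dialing(&empty_store).is_empty());
    // Tick 2: seeds have since been added -> lazily fetched and cached.
    let populated = SeedStore { seeds: vec!["0984896e74022c44".to_string()] };
    assert_eq!(mgr.seeds_for_dialing(&populated).len(), 1);
}
```

Because the fetch is retried whenever the cache is empty, a node whose seed peers are added slightly after the connectivity manager starts still picks them up on the next dialing pass.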

How Has This Been Tested?

System-level testing

What process can a PR reviewer use to test or verify this change?

Code review

Breaking Changes

  • None
  • Requires data directory on base node to be deleted
  • Requires hard fork
  • Other - Please specify

Summary by CodeRabbit

  • Refactor

    • Seed peers now load on-demand instead of at startup, reducing startup overhead and improving initial responsiveness.
  • Chores

    • Improved diagnostic logging around seed peer retrieval and DNS seed parsing, including warnings for parse failures and failed seed lookups to aid troubleshooting.


coderabbitai bot commented Sep 5, 2025

Walkthrough

Moves seed initialization from startup to lazy loading during proactive dialing; adds logging and error handling when fetching seed peers; DNS TXT seed parsing now logs records and per-record parse errors. No public API signatures changed.

Changes

  • Connectivity Manager (lazy seed init) (comms/core/src/connectivity/manager.rs): Removed the eager startup fetch of seed peers; added a lazy fetch in execute_proactive_dialing() using get_seed_peers().await; logs a warning on fetch error and falls back to empty seeds; caches node IDs into self.seeds; the GetSeeds handler is unchanged.
  • Peer Storage SQL (trace logging) (comms/core/src/peer_manager/peer_storage_sql.rs): Added trace logging of retrieved seed peers (short node_id strings) via a local seed_peers variable; added a minor doc comment; behaviour and signature unchanged.
  • DNS Seed parsing (logging & per-record errors) (base_layer/p2p/src/peer_seeds.rs): Logs DNS TXT records before parsing and the parsed SeedPeer list after parsing; adds per-record parse error logging via inspect_err and an explicit Vec type annotation; parsing/filtering logic unchanged.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Scheduler as Scheduler (tick)
  participant CMA as ConnectivityManagerActor
  participant PM as PeerManager
  participant DNS as DnsSeedResolver

  Scheduler->>CMA: execute_proactive_dialing()
  alt self.seeds is empty
    CMA->>PM: get_seed_peers().await
    PM-->>CMA: Result<Vec<Peer>>
    alt Ok
      CMA->>CMA: self.seeds = peers.map(node_id.clone)
    else Err
      Note over CMA: warn: failed to fetch seeds\nfallback to empty list
      CMA->>CMA: self.seeds = []
    end
  else seeds cached
    Note over CMA: reuse cached seeds
  end
  CMA->>CMA: proceed with proactive dialing using self.seeds
  Note over DNS: DNS resolver logs TXT records and per-record parse errors during seed parsing

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • MCozhusheck
  • SWvheerden

Poem

"A rabbit sniffs the network breeze,
Delays the seeds until they please.
Logs the lines, each TXT and bit,
If parsing trips, it notes the hit.
Hops on—dialing starts, seeds in paw, a quiet tease." 🐇


@hansieodendaal hansieodendaal changed the title fix: false negative retrieving seed peers in the connectivity manager at starup fix: false negative retrieving seed peers in connectivity manager at startup Sep 5, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
comms/core/src/peer_manager/peer_storage_sql.rs (1)

268-276: Gate trace logging to avoid unnecessary allocations; include count for readability

trace! arguments are evaluated even when the level is disabled; building a Vec for logging does extra work on hot paths. Guard the formatting and include a count.

-        trace!(
-            target: LOG_TARGET,
-            "Get seed peers: {:?}",
-            seed_peers.iter().map(|p| p.node_id.short_str()).collect::<Vec<_>>(),
-        );
+        if log::log_enabled!(target: LOG_TARGET, log::Level::Trace) {
+            let ids: Vec<_> = seed_peers.iter().map(|p| p.node_id.short_str()).collect();
+            trace!(target: LOG_TARGET, "Get seed peers ({}): [{}]", ids.len(), ids.join(","));
+        }
comms/core/src/connectivity/manager.rs (2)

1208-1224: Avoid clones; use into_iter + unwrap_or_default

Minor micro-opt: move NodeIds out of Peers and use unwrap_or_default().

-        if self.seeds.is_empty() {
-            self.seeds = self
-                .peer_manager
-                .get_seed_peers()
-                .await
-                .inspect_err(|err| {
-                    warn!(
-                        target: LOG_TARGET,
-                        "Failed to get seed peers from PeerManager, using empty list for proactive dialing, seed peers \
-                        will not be excluded as a first pass. ({})", err
-                    );
-                })
-                .unwrap_or(vec![])
-                .iter()
-                .map(|s| s.node_id.clone())
-                .collect();
-        }
+        if self.seeds.is_empty() {
+            self.seeds = self
+                .peer_manager
+                .get_seed_peers()
+                .await
+                .inspect_err(|err| {
+                    warn!(
+                        target: LOG_TARGET,
+                        "Failed to get seed peers from PeerManager, using empty list for proactive dialing; seeds \
+                         won't be excluded on first pass. ({})",
+                        err
+                    );
+                })
+                .unwrap_or_default()
+                .into_iter()
+                .map(|s| s.node_id)
+                .collect();
+        }

1208-1224: Throttle re-fetch when there are legitimately zero seeds

As written, if the DB truly has no seeds, this block runs every refresh tick. Consider tracking a boolean (e.g., seeds_initialized: bool) or a last-fetch timestamp to avoid repeated DB hits, with a periodic refresh (e.g., every few minutes).
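The suggested throttle can be sketched with a last-attempt timestamp. This is a minimal, hypothetical illustration: `SeedCache`, `should_fetch`, and the 3-minute interval are assumptions for the sketch, not code from this PR.

```rust
use std::time::{Duration, Instant};

// Tracks when seeds were last fetched so an empty DB is not re-queried on
// every refresh tick, only after the retry interval has elapsed.
struct SeedCache {
    seeds: Vec<String>,
    last_attempt: Option<Instant>,
    retry_interval: Duration,
}

impl SeedCache {
    fn new(retry_interval: Duration) -> Self {
        Self { seeds: Vec::new(), last_attempt: None, retry_interval }
    }

    // Decide whether this tick should hit the DB again.
    fn should_fetch(&self, now: Instant) -> bool {
        if !self.seeds.is_empty() {
            return false; // already populated; nothing to do
        }
        match self.last_attempt {
            None => true, // never tried yet
            Some(t) => now.duration_since(t) >= self.retry_interval,
        }
    }

    fn record_fetch(&mut self, now: Instant, fetched: Vec<String>) {
        self.last_attempt = Some(now);
        self.seeds = fetched;
    }
}

fn main() {
    let t0 = Instant::now();
    let mut cache = SeedCache::new(Duration::from_secs(180));
    assert!(cache.should_fetch(t0)); // first tick always fetches
    cache.record_fetch(t0, vec![]); // DB legitimately has zero seeds
    // Shortly after: throttled, no repeated DB hit.
    assert!(!cache.should_fetch(t0 + Duration::from_secs(5)));
    // Once the retry interval elapses, try again.
    assert!(cache.should_fetch(t0 + Duration::from_secs(200)));
}
```

A plain `seeds_initialized: bool` would also stop the repeated DB hits, but the timestamp variant additionally gives the periodic refresh the comment asks for.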

📥 Commits

Reviewing files that changed from the base of the PR and between b0cfe30 and 1b96af5.

📒 Files selected for processing (2)
  • comms/core/src/connectivity/manager.rs (1 hunks)
  • comms/core/src/peer_manager/peer_storage_sql.rs (1 hunks)
🔇 Additional comments (2)
comms/core/src/connectivity/manager.rs (2)

1208-1224: LGTM: Lazy seed fetch fixes the startup race without changing external behavior

Fetching seeds on first proactive-dial pass eliminates the timing window and still retries on subsequent passes if empty. Nicely scoped.


1208-1224: Drop MSRV inspection suggestion
Result::inspect_err is already used throughout the codebase and is supported by the stable toolchain.

Likely an incorrect or invalid review comment.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
base_layer/p2p/src/peer_seeds.rs (4)

79-79: Cap DNS TXT log verbosity and sanitize values.

Dumping full, untrusted TXT records at trace can be noisy and risk log-forging if records contain control chars. Log count + a truncated sample instead.

-        trace!(target: LOG_TARGET, "DNS records: {:?}", records);
+        trace!(
+            target: LOG_TARGET,
+            "DNS TXT record count: {}, sample: {:?}",
+            records.len(),
+            records
+                .iter()
+                .take(3)
+                .map(|s| if s.len() > 128 { format!("{}…", &s[..128]) } else { s.clone() })
+                .collect::<Vec<_>>()
+        );

80-84: Be explicit about types to avoid inference surprises.

Minor clarity: explicitly annotate the vector element type and turbofish parse.

-        let peers: Vec<_> = records
+        let peers: Vec<SeedPeer> = records
             .into_iter()
             .filter_map(|txt| {
-                txt.parse()
+                txt.parse::<SeedPeer>()
                     .inspect_err(|err| {

93-94: Avoid allocating a Vec just for logging; log count + small sample.

Reduces alloc/formatting overhead on hot paths while keeping visibility.

-        trace!(target: LOG_TARGET, "Seed peers: {:?}", peers.iter().map(|p| format!("{}", p)).collect::<Vec<_>>());
+        trace!(
+            target: LOG_TARGET,
+            "Parsed {} seed peer(s); sample: {:?}",
+            peers.len(),
+            peers.iter().take(3).map(|p| p.to_string()).collect::<Vec<_>>()
+        );

143-156: Display/FromStr are not round-trippable (multi-address delimiter mismatch).

Display joins multiple addresses with "::", but FromStr expects exactly one "::" delimiter (public_key::address). This can mislead downstream tooling and logs that assume round-tripping. Consider a different address separator (e.g., comma) in Display or allowing multi-address parsing.

Can you confirm no code path relies on parsing String::from(SeedPeer) back via FromStr?
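The delimiter mismatch flagged above can be demonstrated with a toy model (not the real SeedPeer type; `display` and `parse` are stand-ins for the Display and FromStr impls under the assumption that FromStr expects exactly one "::"):

```rust
// Display joins multiple addresses with "::", while parsing expects exactly
// "public_key::address" with a single "::" delimiter, so a peer with more
// than one address does not round-trip.

fn display(pk: &str, addrs: &[&str]) -> String {
    format!("{}::{}", pk, addrs.join("::"))
}

fn parse(s: &str) -> Option<(String, String)> {
    // Expects exactly two "::"-separated parts.
    let parts: Vec<&str> = s.split("::").collect();
    if parts.len() == 2 {
        Some((parts[0].to_string(), parts[1].to_string()))
    } else {
        None
    }
}

fn main() {
    // A single address round-trips fine.
    assert!(parse(&display("pk", &["/onion3/abc:18141"])).is_some());
    // Two addresses produce three "::"-separated parts and fail to parse.
    assert!(parse(&display("pk", &["/ip4/1.2.3.4/tcp/18141", "/onion3/abc:18141"])).is_none());
}
```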


📥 Commits

Reviewing files that changed from the base of the PR and between 1b96af5 and 2520c6c.

📒 Files selected for processing (1)
  • base_layer/p2p/src/peer_seeds.rs (2 hunks)
🔇 Additional comments (2)
base_layer/p2p/src/peer_seeds.rs (2)

80-90: Nice: per-record parse error logging via inspect_err.

Good observability improvement while still short-circuiting invalid entries.


80-90: inspect_err supported by current MSRV
Result::inspect_err was stabilized in Rust 1.76.0 (Feb 8, 2024), and our root rust-toolchain.toml pins the build to the stable channel (≥1.76.0). No action needed.

@SWvheerden SWvheerden changed the title fix: false negative retrieving seed peers in connectivity manager at startup fix: race condition in getting seed peers in connectivity manager at startup Sep 5, 2025
@SWvheerden SWvheerden merged commit 36fd93f into tari-project:development Sep 5, 2025
38 of 48 checks passed
@hansieodendaal hansieodendaal deleted the ho_fix_false_negative_connectivity_manager branch September 5, 2025 15:12