
feat: add timeouts to health check#7482

Merged
SWvheerden merged 6 commits into tari-project:development from hansieodendaal:ho_health_check
Sep 17, 2025
Conversation

@hansieodendaal
Contributor

@hansieodendaal hansieodendaal commented Sep 8, 2025

Description

  • This PR fixes an interlock in the health check whereby either discover-peer or ping-peer locks up and does not return. This culminated in the health check being interlocked for the remainder of the base node's uptime.
  • The discover-peer and ping-peer health checks are now run in parallel.
  • The ping-peer health check is now preceded by a dial-peer to ensure we have a connection before we proceed.
  • Changed the gRPC liveness_results in the GetNetworkState method to return ms instead of s and documented it as such; this was previously undocumented. When the network is well connected, discovery and ping latencies are typically less than 1 s, which previously resulted in zero values being returned.

Fixes #7380.
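The unit change described above can be sketched as follows. This is a minimal illustration of the seconds-to-milliseconds fix with a `u64::MAX` sentinel for missing results; `to_ms_or_max` is a hypothetical helper name, not necessarily the identifier used in the PR.

```rust
use std::time::Duration;

// Convert an optional latency to whole milliseconds, using u64::MAX as the
// sentinel for "no result / timed out" (mirrors the described gRPC fallback).
fn to_ms_or_max(latency: Option<Duration>) -> u64 {
    latency
        .map(|d| u64::try_from(d.as_millis()).unwrap_or(u64::MAX))
        .unwrap_or(u64::MAX)
}

fn main() {
    // A sub-second latency now yields a non-zero value instead of truncating to 0 s.
    assert_eq!(to_ms_or_max(Some(Duration::from_millis(250))), 250);
    // A missing result maps to the sentinel.
    assert_eq!(to_ms_or_max(None), u64::MAX);
}
```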

Motivation and Context

See #7380.

How Has This Been Tested?

  • System-level testing (with c::bn::tari_pulse logging at trace level or higher)
  • When running the gRPC GetNetworkState method, liveness_results should not be an empty array if seed peers are contactable.

What process can a PR reviewer use to test or verify this change?

Code review.
System-level testing (to verify in the log files, change c::bn::tari_pulse to at least debug level).

Breaking Changes

  • None
  • Requires data directory on base node to be deleted
  • Requires hard fork
  • Other - Please specify

BREAKING CHANGE: Changed the gRPC liveness_results in the GetNetworkState method to return ms instead of s and documented it as such; this was previously undocumented.

Summary by CodeRabbit

  • New Features

    • Health-check interval is now optional and disabled by default.
  • Bug Fixes

    • More reliable per-peer health checks with clearer timeouts, improved connection cleanup, and fewer hangs/false negatives.
    • Service startup now waits for the first liveness event to ensure initial health state.
  • Refactor

    • Discovery and ping checks run in parallel for faster reporting.
    • Latency measurements standardized to milliseconds.
  • Documentation

    • Config comment updated to indicate health checks default to disabled.

@hansieodendaal hansieodendaal requested a review from a team as a code owner September 8, 2025 13:05
@coderabbitai
Contributor

coderabbitai bot commented Sep 8, 2025

Walkthrough

Clarified proto comments for LivenessResult; gRPC now reports discover and ping latencies in milliseconds with u64 conversion/fallback; TariPulseService refactored to make health-check interval optional, wait for first liveness event on startup, run per-peer discovery/ping in parallel with explicit timeouts, aggregate results, and perform timed disconnects.

Changes

Cohort / File(s) — Summary

  • Proto comment updates — applications/minotari_app_grpc/proto/base_node.proto: Adjusted LivenessResult field comments to specify milliseconds and the timeout sentinel; no schema/field changes.
  • gRPC latency conversion (ms) — applications/minotari_node/src/grpc/base_node_grpc_server.rs: Latency extraction switched from seconds to milliseconds via Duration::as_millis() and u64::try_from, falling back to u64::MAX on overflow/None; no public signature changes.
  • Pulse health-check refactor & config — base_layer/core/src/base_node/tari_pulse_service/mod.rs, applications/minotari_node/src/config.rs, common/config/presets/c_base_node_c.toml: Made health-check interval Option<Duration>, defaulted to None, updated serde to optional_seconds; TariPulseService waits for first liveness event, uses spawn_discovery/spawn_ping helpers with timeouts, aggregates per-peer results, adds disconnect_peer, expands imports and logging; public API changed for TariPulseConfig/initializer signatures.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Pulse as TariPulseService
  participant Seeds as SeedPeers
  participant Dht as DhtDiscoveryRequester
  participant Conn as ConnectivityRequester
  participant Live as LivenessService

  Note over Pulse: Startup waits for first liveness event
  Pulse->>Seeds: fetch_seed_peers (with timeout)
  alt seeds fetched
    loop per-seed
      par Discovery (<=90s)
        Pulse->>Dht: discover(peer_pk) (timeout)
        Dht-->>Pulse: discovery_duration | timeout
      and Ping (<=60s window)
        Pulse->>Conn: dial(peer) (<=60s)
        Pulse->>Live: send_ping (<=15s)
        Live-->>Pulse: matching_pong (<=60s)
      end
      Pulse->>Pulse: aggregate {discover_ms?, ping_ms?}
      Pulse->>Conn: disconnect_peer (timed)
    end
  else fetch failed
    Note over Pulse: abort health-check tick
  end
  Note over Pulse: record/log results
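The parallel, deadline-bounded structure in the diagram can be sketched with a simplified, standard-library analogue. The real service uses tokio tasks and tokio::time::timeout; here plain threads and a channel receive timeout stand in for them, and all durations are illustrative.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

// Run one "check" (stand-in for discover or ping) with its own deadline.
// A slow check yields None instead of hanging the whole health check.
fn bounded_check(work: Duration, deadline: Duration) -> Option<Duration> {
    let (tx, rx) = mpsc::channel();
    let start = Instant::now();
    thread::spawn(move || {
        thread::sleep(work); // stand-in for the discover/ping round-trip
        let _ = tx.send(());
    });
    match rx.recv_timeout(deadline) {
        Ok(()) => Some(start.elapsed()),
        Err(_) => None, // deadline exceeded: report "no result", don't block
    }
}

fn main() {
    // Discovery and ping run concurrently, each with an independent deadline.
    let discovery = thread::spawn(|| bounded_check(Duration::from_millis(10), Duration::from_millis(500)));
    let ping = thread::spawn(|| bounded_check(Duration::from_millis(300), Duration::from_millis(20)));

    let discover_latency = discovery.join().unwrap();
    let ping_latency = ping.join().unwrap();

    assert!(discover_latency.is_some()); // finished within its window
    assert!(ping_latency.is_none());     // timed out, but the tick still completes
}
```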

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • SWvheerden
  • MCozhusheck

Poem

In burrows where the pings now play,
I count the millis and hop away.
Timeouts keep each check in line,
Discovery, ping—both finished fine.
A tidy disconnect, and all is well. 🐇

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Out of Scope Changes Check — ⚠️ Warning: While the core timeout and logging work is in scope, the PR also introduces public API and config signature changes (e.g., TariPulseConfig.liveness_interval: Duration → Option, TariPulseServiceInitializer::new signature change, and BaseNodeConfig.tari_pulse_health_check becoming Option) and a documented gRPC unit change; these are broader, breaking-surface changes beyond the narrow scope of simply adding timeouts around awaits. Such API/config changes affect downstream consumers and should be treated as separate, documented breaking changes rather than incidental edits to the health-check fix. Resolution: Split or isolate the API and default-config signature changes into a separate PR, or clearly document them as breaking changes in this PR (update callers/tests, add migration notes and changelog entries, and highlight the gRPC unit change) so reviewers and downstream consumers can assess and adapt to the wider impact.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 8.33%, which is below the required threshold of 80.00%. Resolution: Run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: The title "feat: add timeouts to health check" succinctly and accurately captures the primary change in the PR—adding timeout guards to the base-node health check to prevent lockups—and aligns with the linked issue (#7380) and implemented changes. It is concise, specific, and readable for a teammate scanning history.
  • Linked Issues Check — ✅ Passed: The changes implement bounded per-peer operations (spawn_discovery with a ~90s timeout, spawn_ping with dial/ping/pong timeouts, and disconnect with its own timeouts), run discovery and ping in parallel, and add increased debug logging; these directly address the linked issue's goals to add timeout guards, surface stuck calls via logs, and ensure health-check runs complete within the intended timeout budget (per-peer worst-case ≈135s, which is < 180s). The gRPC latency unit change to milliseconds and the refactor to isolated helpers further support reliable timing and observability. Based on the provided summaries, the coding requirements from issue #7380 appear satisfied.


@hansieodendaal hansieodendaal marked this pull request as draft September 8, 2025 13:05
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (2)

392-422: Consider logging the received unrelated pong details for debugging.

When receiving unrelated pongs (lines 394-398), it might be helpful to log which peer the pong came from to aid in debugging connectivity issues.

 if let LivenessEvent::ReceivedPong(pong) = &*event {
     if pong.node_id == peer_node_id && pong.nonce == nonce {
         return Some(start.elapsed());
+    } else {
+        trace!(
+            target: LOG_TARGET,
+            "check_health: received pong from {} (nonce: {}) while waiting for {} (nonce: {})",
+            pong.node_id,
+            pong.nonce,
+            peer_node_id,
+            nonce
+        );
     }
 }

332-507: Consider adding a top-level timeout as a safety net.

While individual operations have timeouts, consider wrapping the entire check_health function with an overall timeout (e.g., 3 minutes) to guarantee it never exceeds the expected duration, even if there are unexpected delays in task spawning or joining.

At the beginning of the function, wrap the entire logic:

 async fn check_health(
     mut node_comms: ConnectivityRequester,
     liveness_handle: LivenessHandle,
     node_discovery: DhtDiscoveryRequester,
     notify_comms_health: watch::Sender<Vec<LivenessCheckResult>>,
 ) {
+    // Ensure the entire health check completes within 3 minutes
+    match tokio::time::timeout(Duration::from_secs(180), check_health_inner(
+        node_comms,
+        liveness_handle,
+        node_discovery,
+        notify_comms_health,
+    )).await {
+        Ok(()) => {},
+        Err(_) => {
+            warn!(target: LOG_TARGET, "check_health: overall timeout exceeded (3 minutes)");
+            notify_comms_health.send(vec![]).expect("Channel should be open");
+        }
+    }
+}
+
+async fn check_health_inner(
+    mut node_comms: ConnectivityRequester,
+    liveness_handle: LivenessHandle,
+    node_discovery: DhtDiscoveryRequester,
+    notify_comms_health: watch::Sender<Vec<LivenessCheckResult>>,
+) {
     let results = Arc::new(RwLock::new(Vec::new()));
     // ... rest of the current implementation
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ab82dd1 and d2aef08.

📒 Files selected for processing (1)
  • base_layer/core/src/base_node/tari_pulse_service/mod.rs (2 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.206Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.
📚 Learning: 2025-07-12T03:43:22.545Z
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:13:37.206Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.206Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: test (mainnet, stagenet)
  • GitHub Check: ci
  • GitHub Check: ledger build tests
  • GitHub Check: test (testnet, esmeralda)
  • GitHub Check: test (nextnet, nextnet)
  • GitHub Check: wasm build tests
  • GitHub Check: cargo check with stable
  • GitHub Check: Cucumber tests / FFI
  • GitHub Check: Cucumber tests / Base Layer
🔇 Additional comments (7)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (7)

325-325: Good addition of the lint attribute for function length.

The #[allow(clippy::too_many_lines)] attribute is appropriate here as the function's complexity is justified by the need for comprehensive timeout handling and error logging.


334-344: Excellent implementation of timeout for seed peer retrieval.

The 10-second timeout for get_seeds() with proper error handling prevents the health check from hanging indefinitely. The early return pattern is appropriate when seed peers cannot be retrieved.


365-381: Consider the previous learning about discovery_peer timeouts.

Based on the retrieved learnings, discovery_peer has built-in timeout mechanisms. However, given the specific issue #7380 about health check locks not being released, adding an explicit 90-second timeout is justified as a safeguard. This aligns with the PR objectives to ensure the health check completes within bounded time.


388-437: Well-structured ping/pong timeout implementation.

The implementation correctly addresses the core issue from #7380:

  • 15-second timeout for sending the ping
  • 15-second timeout for receiving the pong
  • Protection against waiting indefinitely for unrelated liveness events (max 5 events)

This ensures that pongs that are never returned don't cause the health check to hang.
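The "matching pong with an overall deadline" pattern described above can be sketched with a standard-library analogue. The real code awaits tokio broadcast events with tokio::time::timeout; here a channel of nonces and recv_timeout stand in, and all timings are illustrative.

```rust
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError};
use std::thread;
use std::time::{Duration, Instant};

// Wait for the pong carrying `want_nonce`: each receive has a short per-event
// timeout, unrelated pongs are skipped, and a total deadline bounds the wait
// so an unanswered ping can never hang the health check.
fn wait_for_pong(rx: &Receiver<u64>, want_nonce: u64, total: Duration) -> bool {
    let start = Instant::now();
    loop {
        match rx.recv_timeout(Duration::from_millis(50)) {
            Ok(nonce) if nonce == want_nonce => return true, // matching pong
            Ok(_) => {}                          // unrelated pong: keep waiting
            Err(RecvTimeoutError::Timeout) => {} // no event yet: keep waiting
            Err(RecvTimeoutError::Disconnected) => return false,
        }
        if start.elapsed() >= total {
            return false; // give up after the total window
        }
    }
}

fn main() {
    let (tx, rx) = channel();
    thread::spawn(move || {
        tx.send(1).unwrap(); // unrelated pong is skipped
        tx.send(42).unwrap(); // the one we are waiting for
    });
    assert!(wait_for_pong(&rx, 42, Duration::from_millis(500)));

    // No pong ever arrives: the wait still terminates within the window.
    let (_tx2, rx2) = channel::<u64>();
    assert!(!wait_for_pong(&rx2, 7, Duration::from_millis(120)));
}
```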


440-462: Good use of concurrent task execution.

Using tokio::join! to run discovery and ping tasks in parallel improves efficiency and ensures both complete or timeout independently. The error handling for task join failures is comprehensive.


465-499: Connection cleanup implementation aligns with known requirements.

The disconnect logic correctly uses parameters (0, 2) as per the retrieved learning that tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks. The 15-second timeouts for both getting the connection and disconnecting ensure bounded execution time.


334-344: Verify that the cumulative timeout bounds meet the 3-minute requirement.

The PR objectives state that each health check should complete within ≤ 3 minutes. Let's verify the maximum possible execution time:

  • Seed retrieval: 10s
  • Per peer (in parallel):
    • Discovery: 90s
    • Ping/pong: 15s + 15s = 30s
    • Connection cleanup: 15s + 15s = 30s

Since discovery and ping run in parallel, the maximum per-peer time is max(90s, 30s) + 30s = 120s. This is within the 3-minute (180s) requirement.

Also applies to: 365-381, 388-437, 465-499
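The reviewer's worst-case arithmetic can be checked mechanically under its own stated assumptions (90s discovery in parallel with a 15s + 15s ping path, followed by 15s + 15s cleanup, plus a one-off 10s seed fetch):

```rust
// Verify the per-peer timeout budget quoted in the review comment above.
fn main() {
    let seed_fetch: u64 = 10; // one-off, per tick
    let discovery: u64 = 90; // runs in parallel with the ping path
    let ping_path: u64 = 15 + 15; // send_ping + pong wait
    let cleanup: u64 = 15 + 15; // get connection + disconnect

    // Discovery and ping overlap, so the slower of the two dominates.
    let per_peer = discovery.max(ping_path) + cleanup;
    assert_eq!(per_peer, 120);

    // Even with the seed fetch added, a tick stays under the 3-minute budget.
    assert!(seed_fetch + per_peer < 180);
}
```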

@hansieodendaal hansieodendaal marked this pull request as ready for review September 9, 2025 06:11
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (5)
applications/minotari_node/src/grpc/base_node_grpc_server.rs (1)

560-570: Avoid duplication: extract “Option -> u64 ms (or u64::MAX)” helper.

Same mapping is repeated for both fields; a small helper improves readability and keeps proto/server semantics in one place.

-                discover_latency: data
-                    .discovery_latency
-                    .map(|v| u64::try_from(v.as_millis()).unwrap_or(u64::MAX))
-                    .unwrap_or_else(|| u64::MAX),
-                ping_latency: data
-                    .ping_latency
-                    .map(|v| u64::try_from(v.as_millis()).unwrap_or(u64::MAX))
-                    .unwrap_or_else(|| u64::MAX),
+                discover_latency: to_ms_or_max(data.discovery_latency),
+                ping_latency: to_ms_or_max(data.ping_latency),

with the helper defined once alongside the handler (rather than repeating the closure per field):

fn to_ms_or_max(d: Option<std::time::Duration>) -> u64 {
    d.map(|v| u64::try_from(v.as_millis()).unwrap_or(u64::MAX)).unwrap_or(u64::MAX)
}
base_layer/core/src/base_node/tari_pulse_service/mod.rs (4)

127-128: Startup delay should be configurable.

Hard-coding 30s can be problematic across environments. Prefer a config knob with a sensible default.

-        tokio::time::sleep(Duration::from_secs(30)).await; // Wait for the node to start up properly
+        tokio::time::sleep(self.config.startup_delay.unwrap_or(Duration::from_secs(30))).await;

Language-only support (outside this hunk):

  • Add pub startup_delay: Option<Duration> to TariPulseConfig defaulting to None.

336-347: Seeds fetch timeout: good guard, but consider reporting an empty snapshot instead of early return.

Returning early drops the whole health snapshot for this tick. Optionally publish an empty liveness_checks to unblock consumers and make behavior explicit.

-            warn!(target: LOG_TARGET, "check_health: timeout getting seed peers");
-            return;
+            warn!(target: LOG_TARGET, "check_health: timeout getting seed peers");
+            let _ = notify_comms_health.send(vec![]);
+            return;

349-401: Bound fan-out to avoid spawning N tasks unbounded on large seed sets.

A semaphore (e.g., 32–64 permits) prevents bursty load on connectivity/liveness subsystems.

-    let mut handles = vec![];
+    let mut handles = vec![];
+    let semaphore = Arc::new(tokio::sync::Semaphore::new(64)); // cap concurrency

     for peer in &peers {
+        let permit = semaphore.clone().acquire_owned().await.unwrap();
         let result_clone = results.clone();
         let mut result = LivenessCheckResult { /* ... */ };

         handles.push(task::spawn(async move {
+            let _permit = permit;
             let discovery_handle = spawn_discovery(discovery, dest_key);
             let ping_handle = spawn_ping(comms_clone.clone(), liveness, liveness_events, result.peer.clone());
             // Await both inner tasks
             let (discovery_result, ping_result) = tokio::join!(discovery_handle, ping_handle);
             /* ... */
         }));
     }

409-432: Unify discovery timeouts
DhtDiscoveryRequester::discover_peer already applies its configured discovery_timeout, so wrapping it again in spawn_discovery with a hard-coded 90 s is redundant and risks them falling out of sync. Replace the Duration::from_secs(90) here with the requester’s own timeout (or remove the outer wrapper) so all discovery calls share the same configured deadline.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ab82dd1 and 94dd061.

📒 Files selected for processing (3)
  • applications/minotari_app_grpc/proto/base_node.proto (1 hunks)
  • applications/minotari_node/src/grpc/base_node_grpc_server.rs (1 hunks)
  • base_layer/core/src/base_node/tari_pulse_service/mod.rs (5 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.206Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.
Learnt from: hansieodendaal
PR: tari-project/tari#6963
File: comms/core/src/peer_manager/manager.rs:60-68
Timestamp: 2025-05-26T02:40:23.812Z
Learning: PeerDatabaseSql in the Tari codebase has been specifically refactored to handle concurrent access and mitigate blocking I/O concerns on async executor threads. The implementation has been tested under high load at both system level and through unit tests like test_concurrent_add_or_update_and_get_closest_peers which validates concurrent read/write operations.
Learnt from: MCozhusheck
PR: tari-project/tari#7262
File: applications/minotari_node/src/grpc/base_node_grpc_server.rs:533-536
Timestamp: 2025-06-30T06:54:18.170Z
Learning: In the Tari codebase, readiness status reporting is split between two gRPC servers: base_node_grpc_server's GetNetworkState method always reports READY when it can respond (indicating network-level readiness), while readiness_grpc_server handles granular readiness states like database initialization and migration progress via ReadinessStatusHandler.
📚 Learning: 2025-08-26T06:28:13.374Z
Learnt from: hansieodendaal
PR: tari-project/tari#7432
File: applications/minotari_app_grpc/proto/wallet.proto:1933-1948
Timestamp: 2025-08-26T06:28:13.374Z
Learning: In PR #7432 "feat!: remove comms from wallet", the HttpPeer message in the wallet.proto should only include url, last_latency, and is_online fields. The base node public_key and node_id fields mentioned in the original PR summary were removed from the scope and should not be included.

Applied to files:

  • applications/minotari_app_grpc/proto/base_node.proto
📚 Learning: 2025-07-12T03:43:22.545Z
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
🧬 Code graph analysis (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (3)
base_layer/p2p/src/services/liveness/service.rs (2)
  • connectivity (346-355)
  • connectivity (359-368)
comms/core/src/builder/comms_node.rs (2)
  • connectivity (257-259)
  • connectivity (325-327)
comms/core/src/connectivity/requester.rs (1)
  • peers (170-173)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: test (testnet, esmeralda)
  • GitHub Check: test (mainnet, stagenet)
  • GitHub Check: test (nextnet, nextnet)
  • GitHub Check: cargo check with stable
  • GitHub Check: ci
  • GitHub Check: Cucumber tests / Base Layer
  • GitHub Check: Cucumber tests / FFI
🔇 Additional comments (2)
applications/minotari_app_grpc/proto/base_node.proto (1)

645-651: Docs align with server sentinel semantics (u64::MAX).

Comment updates match the server-side fallback to u64::MAX for timeouts/overflow. No wire/schema changes introduced.

base_layer/core/src/base_node/tari_pulse_service/mod.rs (1)

503-539: Disconnect parameters are correct for health-check usage.

Using (Minimized::No, 0, 2) matches the two substreams/zero RPC sessions created by the health check, per prior guidance.

Comment on lines +434 to +501
fn spawn_ping(
    comms: ConnectivityRequester,
    mut liveness: LivenessHandle,
    mut liveness_events: broadcast::Receiver<Arc<LivenessEvent>>,
    peer_node_id: NodeId,
) -> JoinHandle<Option<std::time::Duration>> {
    task::spawn(async move {
        // Ensure we have a connection first
        match tokio::time::timeout(Duration::from_secs(60), comms.dial_peer(peer_node_id.clone())).await {
            Ok(Ok(_)) => {},
            Ok(Err(err)) => {
                // This error is not treated as a warning or error as the result will be passed back to the caller
                debug!(target: LOG_TARGET, "check_health: dial {} failed: {}", peer_node_id, err);
                return None;
            },
            Err(_) => {
                // This error is not treated as a warning or error as the result will be passed back to the caller
                debug!(target: LOG_TARGET, "check_health: dial {} timed out", peer_node_id);
                return None;
            },
        }

        // Now send the ping
        let start = Instant::now();
        let nonce = match tokio::time::timeout(Duration::from_secs(15), liveness.send_ping(peer_node_id.clone())).await
        {
            Ok(Ok(n)) => n,
            Ok(Err(err)) => {
                // This error is not treated as a warning or error as the result will be passed back to the caller
                debug!(target: LOG_TARGET, "check_health: send_ping {} failed: {}", peer_node_id, err);
                return None;
            },
            Err(_) => {
                // This error is not treated as a warning or error as the result will be passed back to the caller
                debug!(target: LOG_TARGET, "check_health: send_ping {} timed out", peer_node_id);
                return None;
            },
        };

        // Wait for matching pong with rolling timeouts
        loop {
            match tokio::time::timeout(Duration::from_secs(30), liveness_events.recv()).await {
                Ok(Ok(event)) => {
                    if let LivenessEvent::ReceivedPong(pong) = &*event {
                        if pong.node_id == peer_node_id && pong.nonce == nonce {
                            return Some(start.elapsed());
                        }
                    }
                    // Give up after ~60s total waiting
                    if start.elapsed() >= Duration::from_secs(60) {
                        debug!(target: LOG_TARGET, "check_health: waited too long for pong from {}", peer_node_id);
                        return None;
                    }
                },
                Ok(Err(RecvError::Closed)) => {
                    // This error is not treated as a warning or error as the result will be passed back to the caller
                    return None;
                },
                Ok(Err(RecvError::Lagged(_))) => { /* keep waiting within total window */ },
                Err(_) => {
                    // This error is not treated as a warning or error as the result will be passed back to the caller
debug!(target: LOG_TARGET, "check_health: recv pong {} timed out", peer_node_id);
return None;
},
}
}
})
}

🛠️ Refactor suggestion

Pong wait exits after the first 30s timeout (contrary to “~60s total” intent).

If no events arrive, the Err(_) branch returns immediately after 30s. Keep waiting up to the 60s overall window.

-        loop {
-            match tokio::time::timeout(Duration::from_secs(30), liveness_events.recv()).await {
+        loop {
+            match tokio::time::timeout(Duration::from_secs(30), liveness_events.recv()).await {
                 Ok(Ok(event)) => {
                     if let LivenessEvent::ReceivedPong(pong) = &*event {
                         if pong.node_id == peer_node_id && pong.nonce == nonce {
                             return Some(start.elapsed());
                         }
                     }
-                    // Give up after ~60s total waiting
-                    if start.elapsed() >= Duration::from_secs(60) {
+                    if start.elapsed() >= Duration::from_secs(60) {
                         debug!(target: LOG_TARGET, "check_health: waited too long for pong from {}", peer_node_id);
                         return None;
                     }
                 },
                 Ok(Err(RecvError::Closed)) => {
                     // This error is not treated as a warning or error as the result will be passed back to the caller
                     return None;
                 },
-                Ok(Err(RecvError::Lagged(_))) => { /* keep waiting within total window */ },
-                Err(_) => {
-                    // This error is not treated as a warning or error as the result will be passed back to the caller
-                    debug!(target: LOG_TARGET, "check_health: recv pong {} timed out", peer_node_id);
-                    return None;
-                },
+                Ok(Err(RecvError::Lagged(_))) => {
+                    // keep waiting within total window
+                    if start.elapsed() >= Duration::from_secs(60) {
+                        debug!(target: LOG_TARGET, "check_health: waited too long for pong from {}", peer_node_id);
+                        return None;
+                    }
+                },
+                Err(_) => {
+                    // per-recv timeout; continue until overall deadline
+                    if start.elapsed() >= Duration::from_secs(60) {
+                        debug!(target: LOG_TARGET, "check_health: recv pong {} timed out (overall)", peer_node_id);
+                        return None;
+                    }
+                },
             }
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
fn spawn_ping(
comms: ConnectivityRequester,
mut liveness: LivenessHandle,
mut liveness_events: broadcast::Receiver<Arc<LivenessEvent>>,
peer_node_id: NodeId,
) -> JoinHandle<Option<std::time::Duration>> {
task::spawn(async move {
// Ensure we have a connection first
match tokio::time::timeout(Duration::from_secs(60), comms.dial_peer(peer_node_id.clone())).await {
Ok(Ok(_)) => {},
Ok(Err(err)) => {
// This error is not treated as a warning or error as the result will be passed back to the caller
debug!(target: LOG_TARGET, "check_health: dial {} failed: {}", peer_node_id, err);
return None;
},
Err(_) => {
// This error is not treated as a warning or error as the result will be passed back to the caller
debug!(target: LOG_TARGET, "check_health: dial {} timed out", peer_node_id);
return None;
},
}
// Now send the ping
let start = Instant::now();
let nonce = match tokio::time::timeout(Duration::from_secs(15), liveness.send_ping(peer_node_id.clone())).await
{
Ok(Ok(n)) => n,
Ok(Err(err)) => {
// This error is not treated as a warning or error as the result will be passed back to the caller
debug!(target: LOG_TARGET, "check_health: send_ping {} failed: {}", peer_node_id, err);
return None;
},
Err(_) => {
// This error is not treated as a warning or error as the result will be passed back to the caller
debug!(target: LOG_TARGET, "check_health: send_ping {} timed out", peer_node_id);
return None;
},
};
// Wait for matching pong with rolling timeouts
loop {
match tokio::time::timeout(Duration::from_secs(30), liveness_events.recv()).await {
Ok(Ok(event)) => {
if let LivenessEvent::ReceivedPong(pong) = &*event {
if pong.node_id == peer_node_id && pong.nonce == nonce {
return Some(start.elapsed());
}
}
// Give up after ~60s total waiting
if start.elapsed() >= Duration::from_secs(60) {
debug!(target: LOG_TARGET, "check_health: waited too long for pong from {}", peer_node_id);
return None;
}
},
Ok(Err(RecvError::Closed)) => {
// This error is not treated as a warning or error as the result will be passed back to the caller
return None;
},
Ok(Err(RecvError::Lagged(_))) => { /* keep waiting within total window */ },
Err(_) => {
// This error is not treated as a warning or error as the result will be passed back to the caller
debug!(target: LOG_TARGET, "check_health: recv pong {} timed out", peer_node_id);
return None;
},
}
}
})
}
// Wait for matching pong with rolling timeouts
loop {
match tokio::time::timeout(Duration::from_secs(30), liveness_events.recv()).await {
Ok(Ok(event)) => {
if let LivenessEvent::ReceivedPong(pong) = &*event {
if pong.node_id == peer_node_id && pong.nonce == nonce {
return Some(start.elapsed());
}
}
// Give up after ~60s total waiting
if start.elapsed() >= Duration::from_secs(60) {
debug!(target: LOG_TARGET, "check_health: waited too long for pong from {}", peer_node_id);
return None;
}
},
Ok(Err(RecvError::Closed)) => {
// This error is not treated as a warning or error as the result will be passed back to the caller
return None;
},
Ok(Err(RecvError::Lagged(_))) => {
// keep waiting within total window
if start.elapsed() >= Duration::from_secs(60) {
debug!(target: LOG_TARGET, "check_health: waited too long for pong from {}", peer_node_id);
return None;
}
},
Err(_) => {
// per-recv timeout; continue until overall deadline
if start.elapsed() >= Duration::from_secs(60) {
debug!(target: LOG_TARGET, "check_health: recv pong {} timed out (overall)", peer_node_id);
return None;
}
},
}
}
🤖 Prompt for AI Agents
In base_layer/core/src/base_node/tari_pulse_service/mod.rs around lines 434-501,
the loop currently returns None immediately when the per-iteration timeout (the
Err(_) branch of tokio::time::timeout for liveness_events.recv()) fires after
30s, which violates the intended "~60s total" wait; change that Err(_) branch to
check if start.elapsed() >= Duration::from_secs(60) and only then log and return
None, otherwise continue the loop (don’t return) so the code keeps waiting until
the overall ~60s window elapses.
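Stripped of the async machinery, the intended pattern — per-attempt timeouts bounded by a single overall deadline — reduces to a loop like this synchronous sketch (names are hypothetical; `recv_attempt` returning `None` models a per-recv timeout or lagged event):

```rust
use std::time::{Duration, Instant};

// Sketch of the rolling-timeout pattern: each receive attempt may fail on
// its own, but the loop only gives up once the overall window has elapsed.
fn wait_for_pong<F>(overall: Duration, mut recv_attempt: F) -> Option<u64>
where
    F: FnMut() -> Option<u64>, // None models a per-recv timeout
{
    let start = Instant::now();
    loop {
        if let Some(nonce) = recv_attempt() {
            return Some(nonce); // matching pong arrived
        }
        // Per-attempt failure: keep waiting until the overall deadline.
        if start.elapsed() >= overall {
            return None;
        }
    }
}
```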

Collaborator

@SWvheerden SWvheerden left a comment


Can we add a config option to disable this and make it disabled by default?

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (2)

409-411: Don’t panic if receiver dropped.

Gracefully handle closed watch receivers to avoid crashing the health check task. (Same concern raised previously.)

-    notify_comms_health.send(inner_result).expect("Channel should be open");
+    if let Err(e) = notify_comms_health.send(inner_result) {
+        warn!(target: LOG_TARGET, "check_health: liveness notify failed: {}", e);
+    }

478-501: Pong wait exits early; honor the full ~60s overall window.

On per-iteration timeout (Err(_)), the loop returns immediately after 30s instead of continuing until the 60s cap. (Same concern raised previously.)

-                Err(_) => {
-                    // This error is not treated as a warning or error as the result will be passed back to the caller
-                    debug!(target: LOG_TARGET, "check_health: recv pong {} timed out", peer_node_id);
-                    return None;
-                },
+                Err(_) => {
+                    // per-recv timeout; continue until overall deadline
+                    if start.elapsed() >= Duration::from_secs(60) {
+                        debug!(target: LOG_TARGET, "check_health: recv pong {} timed out (overall)", peer_node_id);
+                        return None;
+                    }
+                },
🧹 Nitpick comments (5)
applications/minotari_node/src/config.rs (1)

157-161: Clarify docs: explicitly state how to disable and tighten wording.

Make it clear that None disables the health check and fix phrasing.

-    /// Interval to check if the seed nodes comms responses are healthy. (Recommended '60 * 10 = 600 s' if you need
-    /// this)
+    /// Interval to check if seed peers' comms responses are healthy.
+    /// Set to `None` to disable.
+    /// Recommended: 600 s (10 minutes) if enabled.
base_layer/core/src/base_node/tari_pulse_service/mod.rs (4)

127-129: Avoid hardcoded startup delay; make it a constant or configurable.

A fixed 30s sleep can be surprising in different environments.

-        tokio::time::sleep(Duration::from_secs(30)).await; // Wait for the node to start up properly
+        const STARTUP_DELAY: Duration = Duration::from_secs(30);
+        tokio::time::sleep(STARTUP_DELAY).await; // Wait for the node to start up properly

Optionally, consider plumbing this via config for tuning.


134-139: Typo in comment.

“Health chack” → “Health check”.

-        // Health chack interval (optional)
+        // Health check interval (optional)

202-213: DNS fetch timeout may be too aggressive.

A 1s DNS TXT query timeout can be flaky on congested networks. Consider 3–5s or making it configurable; you already bound overall cadence via dns_check_interval.


353-371: Cap per-tick concurrency to protect connection manager.

If seed lists get large, spawning unbounded tasks can cause bursts. Consider a small semaphore (e.g., 8–16).

Example sketch (not a drop-in):

use tokio::sync::Semaphore; // assumes tokio's async semaphore

let sem = Arc::new(Semaphore::new(16));
for peer in &peers {
    // Acquire before spawning so at most 16 per-peer tasks run at once
    let permit = sem.clone().acquire_owned().await.unwrap();
    handles.push(task::spawn(async move {
        let _permit = permit; // held until the task completes
        // existing per-peer logic...
    }));
}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 94dd061 and 78ba8c1.

📒 Files selected for processing (3)
  • applications/minotari_node/src/config.rs (3 hunks)
  • base_layer/core/src/base_node/tari_pulse_service/mod.rs (8 hunks)
  • common/config/presets/c_base_node_c.toml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • common/config/presets/c_base_node_c.toml
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.
Learnt from: MCozhusheck
PR: tari-project/tari#7262
File: applications/minotari_node/src/grpc/base_node_grpc_server.rs:533-536
Timestamp: 2025-06-30T06:54:18.170Z
Learning: In the Tari codebase, readiness status reporting is split between two gRPC servers: base_node_grpc_server's GetNetworkState method always reports READY when it can respond (indicating network-level readiness), while readiness_grpc_server handles granular readiness states like database initialization and migration progress via ReadinessStatusHandler.
Learnt from: MCozhusheck
PR: tari-project/tari#7456
File: applications/minotari_node/src/lib.rs:134-145
Timestamp: 2025-09-01T07:22:09.742Z
Learning: ReadinessGrpcServer in the Tari codebase implements the tari_rpc::base_node_server::BaseNode trait, providing a readiness-focused implementation that returns "unavailable" status for most methods while properly handling get_network_state to report readiness status during node startup.
Learnt from: hansieodendaal
PR: tari-project/tari#6963
File: comms/core/src/peer_manager/manager.rs:60-68
Timestamp: 2025-05-26T02:40:23.812Z
Learning: PeerDatabaseSql in the Tari codebase has been specifically refactored to handle concurrent access and mitigate blocking I/O concerns on async executor threads. The implementation has been tested under high load at both system level and through unit tests like test_concurrent_add_or_update_and_get_closest_peers which validates concurrent read/write operations.
📚 Learning: 2025-08-11T10:49:01.479Z
Learnt from: hansieodendaal
PR: tari-project/tari#7401
File: base_layer/core/src/base_node/sync/config.rs:57-60
Timestamp: 2025-08-11T10:49:01.479Z
Learning: In the Tari codebase, the BlockchainSyncConfig struct works correctly without #[serde(default)] attributes on new fields when used within BaseNodeStateMachineConfig. The configuration loading mechanism or usage patterns ensure that missing fields don't cause deserialization failures in practice.

Applied to files:

  • applications/minotari_node/src/config.rs
📚 Learning: 2025-07-12T03:43:22.545Z
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:13:37.206Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.206Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-04-16T07:06:53.981Z
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
🧬 Code graph analysis (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (4)
comms/core/src/builder/comms_node.rs (2)
  • connectivity (257-259)
  • connectivity (325-327)
base_layer/p2p/src/peer_seeds.rs (1)
  • new (107-109)
comms/core/src/connectivity/requester.rs (1)
  • peers (170-173)
comms/dht/examples/memory_net/utilities.rs (1)
  • discovery (134-192)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: Cucumber tests / Base Layer
  • GitHub Check: Cucumber tests / FFI
  • GitHub Check: wasm build tests
  • GitHub Check: test (testnet, esmeralda)
  • GitHub Check: ledger build tests
  • GitHub Check: test (nextnet, nextnet)
  • GitHub Check: test (mainnet, stagenet)
  • GitHub Check: ci
  • GitHub Check: cargo check with stable
🔇 Additional comments (6)
applications/minotari_node/src/config.rs (2)

34-35: Serializer import for optional durations: LGTM.

Using serializers::optional_seconds is the right choice for Option<Duration> in config.


224-225: Backward-compatible default: LGTM.

Defaulting tari_pulse_health_check to None avoids breaking existing configs and matches the new optional semantics.
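For operators, a matching preset entry could look like the fragment below (sketched here; the key name is taken from the config struct, the value is illustrative):

```toml
# Interval in seconds for the seed-peer comms health check.
# Leave commented out (the default) to disable the health check.
#tari_pulse_health_check = 600
```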

base_layer/core/src/base_node/tari_pulse_service/mod.rs (4)

52-57: Optional liveness interval: LGTM.

Making liveness_interval an Option<Duration> cleanly disables health checks when unset.


147-166: Single-flight guard for health checks: LGTM.

Using a try_lock on Arc<Mutex<()>> to skip overlapping runs is appropriate.


507-543: Disconnect parameters: LGTM.

disconnect_if_unused(Minimized::No, 0, 2, ...) matches the health-check’s 0 RPC sessions and 2 substreams design.


545-552: tick_optional helper: LGTM.

A neat way to disable the select branch without extra branching in the loop.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (2)

451-453: Don’t panic if receiver dropped (duplicate of prior review).

Handle watch send errors gracefully to avoid crashing the health task.

-    notify_comms_health.send(inner_result).expect("Channel should be open");
+    if let Err(e) = notify_comms_health.send(inner_result) {
+        warn!(target: LOG_TARGET, "check_health: liveness notify failed: {}", e);
+    }

521-544: Pong wait exits after first 30s timeout; keep waiting up to 60s total (duplicate of prior review).

Currently returns None on the first per‑recv timeout; violates the “~60s total” intent and may miss late pongs.

-                Ok(Err(RecvError::Lagged(_))) => { /* keep waiting within total window */ },
-                Err(_) => {
-                    // This error is not treated as a warning or error as the result will be passed back to the caller
-                    debug!(target: LOG_TARGET, "check_health: recv pong {} timed out", peer_node_id);
-                    return None;
-                },
+                Ok(Err(RecvError::Lagged(_))) => {
+                    if start.elapsed() >= Duration::from_secs(60) {
+                        debug!(target: LOG_TARGET, "check_health: waited too long for pong from {}", peer_node_id);
+                        return None;
+                    }
+                },
+                Err(_) => {
+                    // per-recv timeout; continue until overall deadline
+                    if start.elapsed() >= Duration::from_secs(60) {
+                        debug!(target: LOG_TARGET, "check_health: recv pong {} timed out (overall)", peer_node_id);
+                        return None;
+                    }
+                },
🧹 Nitpick comments (6)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (6)

137-143: Nit: fix typo in comment.

“Health chack interval” → “Health check interval”.

-        // Health chack interval (optional)
+        // Health check interval (optional)

380-381: Prefer Mutex over RwLock for push-only writes.

This path only performs writes (push) and a single final read; RwLock adds overhead without benefit.

-    let results = Arc::new(RwLock::new(Vec::new()));
+    let results = Arc::new(Mutex::new(Vec::new()));
@@
-            (*result_clone).write().await.push(result);
+            result_clone.lock().await.push(result);

Also drop RwLock from the imports if unused.

Also applies to: 446-447


439-445: Optional: log latencies in ms for readability and alignment with gRPC.

Formatting Option<Duration> with {:?} is noisy. Consider logging ms integers.

-            debug!(
-                target: LOG_TARGET,
-                "check_health: peer {} discovery: {:.2?}, ping: {:.2?}",
-                result.peer,
-                result.discovery_latency,
-                result.ping_latency
-            );
+            let d_ms = result.discovery_latency.map(|d| d.as_millis() as u64);
+            let p_ms = result.ping_latency.map(|d| d.as_millis() as u64);
+            debug!(target: LOG_TARGET,
+                   "check_health: peer {} discovery_ms: {:?}, ping_ms: {:?}",
+                   result.peer, d_ms, p_ms);

462-465: Unnecessary into() on CommsPublicKey.

NodeDestination::PublicKey already takes the same key type.

-            discovery.discover_peer(dest_key.clone(), NodeDestination::PublicKey(dest_key.clone().into())),
+            discovery.discover_peer(dest_key.clone(), NodeDestination::PublicKey(dest_key.clone())),

488-500: Double timeouts around discover/dial — verify intended behavior.

discover_peer/dial_peer may have built‑in timeouts. Wrapping them can obscure underlying errors or trigger earlier-than-expected timeouts.

Would you confirm the current internal timeouts of these APIs and whether the added 90s/60s outer bounds are stricter and required for this interlock? If redundant, consider relying on the inner timeout and only applying an overall per‑peer budget at the orchestration layer.

Also applies to: 461-465


409-414: Avoid extra task::spawn; use async helpers + join! for lower overhead.

You already spawn one task per peer; spawning two more per peer is unnecessary.

-            let discovery_handle = spawn_discovery(discovery, dest_key);
-            let ping_handle = spawn_ping(comms_clone.clone(), liveness, liveness_events, result.peer.clone());
-            // Await both inner tasks
-            let (discovery_result, ping_result) = tokio::join!(discovery_handle, ping_handle);
+            let (discovery_result, ping_result) = tokio::join!(
+                do_discovery(discovery, dest_key),
+                do_ping(comms_clone.clone(), liveness, liveness_events, result.peer.clone())
+            );

Add helpers (replace spawn_* with these):

async fn do_discovery(
    mut discovery: DhtDiscoveryRequester,
    dest_key: CommsPublicKey,
) -> Option<Duration> { /* same body as before, without outer task::spawn */ }

async fn do_ping(
    comms: ConnectivityRequester,
    mut liveness: LivenessHandle,
    mut liveness_events: broadcast::Receiver<Arc<LivenessEvent>>,
    peer_node_id: NodeId,
) -> Option<Duration> { /* same body as before, without outer task::spawn */ }

Also applies to: 455-478, 480-547

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 78ba8c1 and 526c108.

📒 Files selected for processing (4)
  • applications/minotari_node/src/config.rs (3 hunks)
  • applications/minotari_node/src/grpc/base_node_grpc_server.rs (1 hunks)
  • base_layer/core/src/base_node/tari_pulse_service/mod.rs (9 hunks)
  • common/config/presets/c_base_node_c.toml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • common/config/presets/c_base_node_c.toml
🚧 Files skipped from review as they are similar to previous changes (2)
  • applications/minotari_node/src/config.rs
  • applications/minotari_node/src/grpc/base_node_grpc_server.rs
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.
📚 Learning: 2025-07-12T03:43:22.545Z
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:13:37.206Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.206Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-04-16T07:06:53.981Z
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
🧬 Code graph analysis (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (4)
base_layer/p2p/src/services/liveness/service.rs (2)
  • connectivity (346-355)
  • connectivity (359-368)
comms/core/src/builder/comms_node.rs (2)
  • connectivity (257-259)
  • connectivity (325-327)
base_layer/p2p/src/peer_seeds.rs (1)
  • new (107-109)
comms/core/src/connectivity/requester.rs (1)
  • peers (170-173)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: test (nextnet, nextnet)
  • GitHub Check: test (mainnet, stagenet)
  • GitHub Check: test (testnet, esmeralda)
  • GitHub Check: file licenses
  • GitHub Check: wasm build tests
  • GitHub Check: cargo check with stable
  • GitHub Check: ci
  • GitHub Check: ledger build tests
  • GitHub Check: Cucumber tests / FFI
  • GitHub Check: Cucumber tests / Base Layer
🔇 Additional comments (2)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (2)

553-555: Disconnect parameters match expected usage. LGTM.

disconnect_if_unused(Minimized::No, 0, 2, ...) aligns with health check opening 0 RPC sessions and 2 substreams.

If this changes in future (e.g., more substreams), update the counts to avoid disconnecting active peers.
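The guard condition behind those counts can be sketched as follows. This is a minimal illustrative sketch, not the Tari connectivity API: `should_disconnect` and its constants are hypothetical names for the check that `disconnect_if_unused(Minimized::No, 0, 2, ...)` expresses.

```rust
// Hypothetical sketch of the disconnect-if-unused guard described above.
// The real method lives on the Tari connectivity layer; names here are
// illustrative only.
fn should_disconnect(open_rpc_sessions: usize, open_substreams: usize) -> bool {
    // The health check itself holds 0 RPC sessions and 2 substreams, so
    // anything at or below those counts means no other user is active.
    const EXPECTED_RPC_SESSIONS: usize = 0;
    const EXPECTED_SUBSTREAMS: usize = 2;
    open_rpc_sessions <= EXPECTED_RPC_SESSIONS && open_substreams <= EXPECTED_SUBSTREAMS
}

fn main() {
    // The health check's own footprint is safe to disconnect...
    println!("{}", should_disconnect(0, 2));
    // ...but any extra RPC session or substream indicates another user.
    println!("{}", should_disconnect(1, 2));
}
```

This is why the review note matters: if the health check ever opens more substreams, the hard-coded counts would start disconnecting peers that are actually in use.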


320-334: Confirm consumers are updated to the new `Option` parameters

  • Found one call site: applications/minotari_node/src/bootstrap.rs calls TariPulseServiceInitializer::new(base_node_config.tari_pulse_interval, base_node_config.tari_pulse_health_check). Verify those two fields are now `Option` types in the config struct; if they are still Duration/bool, either update the call site to pass an `Option` (e.g., Some(...)) or adjust the config to match the new API.
  • Also ensure any other callers (none found by the search) are updated.

Comment on lines +127 to +131

// Wait for the node comms to start up properly
if let Ok(false) = self.wait_for_first_liveness_event().await {
return;
}


⚠️ Potential issue

Don’t block startup indefinitely waiting for a liveness event. Add an overall timeout.

If no liveness events are emitted (e.g., liveness service not started/quiet), run() can stall forever. Bound the wait and proceed.

Apply this diff:

-        // Wait for the node comms to start up properly
-        if let Ok(false) = self.wait_for_first_liveness_event().await {
-            return;
-        }
+        // Wait for the node comms to start up properly (bounded)
+        match tokio::time::timeout(Duration::from_secs(60), self.wait_for_first_liveness_event()).await {
+            Ok(Ok(true)) => {},
+            Ok(Ok(false)) => return, // shutdown requested
+            Ok(Err(e)) => {
+                warn!(target: LOG_TARGET, "wait_for_first_liveness_event error: {}", e);
+            },
+            Err(_) => {
+                info!(target: LOG_TARGET, "Timeout waiting for first liveness event; continuing startup");
+            },
+        }
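The bounded-wait pattern in the diff above can be illustrated with std-only Rust, using `mpsc::Receiver::recv_timeout` in place of `tokio::time::timeout`. This is an analogy, not the actual service code: `StartupDecision` and `wait_for_first_event` are hypothetical names, and the channel stands in for the liveness event stream.

```rust
use std::sync::mpsc;
use std::time::Duration;

// Outcome of waiting for the first liveness event, mirroring the match
// arms in the suggested diff above.
#[derive(Debug, PartialEq)]
enum StartupDecision {
    Proceed,  // event arrived (Ok(true)) or the wait timed out
    Shutdown, // shutdown requested (Ok(false))
}

// Bounded wait: never block startup longer than `timeout`.
fn wait_for_first_event(rx: &mpsc::Receiver<bool>, timeout: Duration) -> StartupDecision {
    match rx.recv_timeout(timeout) {
        Ok(true) => StartupDecision::Proceed,   // liveness event received
        Ok(false) => StartupDecision::Shutdown, // shutdown signal
        Err(_) => StartupDecision::Proceed,     // timed out or sender gone: continue
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    tx.send(true).unwrap();
    // Event already queued: returns immediately with Proceed.
    assert_eq!(
        wait_for_first_event(&rx, Duration::from_millis(50)),
        StartupDecision::Proceed
    );
    // No further events: the wait times out instead of blocking forever.
    assert_eq!(
        wait_for_first_event(&rx, Duration::from_millis(10)),
        StartupDecision::Proceed
    );
    println!("bounded wait ok");
}
```

The key property is the third match arm: a silent liveness service degrades to a short delay at startup rather than an indefinite stall.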
🤖 Prompt for AI Agents
In base_layer/core/src/base_node/tari_pulse_service/mod.rs around lines 127 to
131, the current await on wait_for_first_liveness_event() can block forever.
Wrap the await in a timeout (e.g., tokio::time::timeout with a sensible
duration), then match the result: if the timeout elapses or the call returns
Err, log a warning and proceed rather than staying blocked; if it returns
Ok(false), return (shutdown was requested); if it returns Ok(true), continue as
before. Ensure the timeout future is awaited correctly and any errors are
handled and logged.

@SWvheerden SWvheerden merged commit a5cb9c0 into tari-project:development Sep 17, 2025
15 of 16 checks passed
@hansieodendaal hansieodendaal deleted the ho_health_check branch September 17, 2025 13:07
sdbondi added a commit to sdbondi/tari that referenced this pull request Sep 18, 2025
* development:
  fix: always cancel transactions (tari-project#7500)
  feat: throttle background rebuild tasks (tari-project#7502)
  chore: clearer logic for is_micro_tari (tari-project#7499)
  feat: add sqlite concurrency and pool hardening (tari-project#7492)
  feat: add timeouts to health check (tari-project#7482)
  feat: performance improvement of wallet output (tari-project#7490)
  fix: discovery syncing regression bug (tari-project#7488)
  fix: update peer database after re-initializing node identity config file (tari-project#7497)
  fix: peer_db null update where none should be written as null (tari-project#7489)
  fix: filter node identity onion address when running TCP node (tari-project#7486)
  chore: refactor multi sig and offline sign (tari-project#7487)
  feat: add http cache config (tari-project#7477)

Development

Successfully merging this pull request may close these issues.

Fix seed peer health check lock not being released
