
feat: add timeouts to health check#7482

Merged
SWvheerden merged 6 commits into tari-project:development from hansieodendaal:ho_health_check
Sep 17, 2025
Conversation

@hansieodendaal
Contributor

@hansieodendaal hansieodendaal commented Sep 8, 2025

Description

  • This PR fixes an interlock in the health check whereby either discover-peer or ping-peer locks up and does not return. This culminated in the health check being interlocked for the remainder of the base node's uptime.
  • The discover-peer and ping-peer health checks are now run in parallel.
  • The ping-peer health check is now preceded by a dial-peer to ensure we have a connection before we proceed.
  • Changed the gRPC liveness_results in the GetNetworkState method to return ms instead of s and documented it as such; this was previously undocumented. When the network is well connected, discovery and ping latencies are typically less than 1 s, which previously resulted in zero values being returned.

Fixes #7380.
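The unit change described above can be sketched as follows. This is a minimal illustration of the seconds-to-milliseconds fix with a `u64::MAX` sentinel for missing results; `to_ms_or_max` is a hypothetical helper name, not necessarily the identifier used in the PR.

```rust
use std::time::Duration;

// Convert an optional latency to whole milliseconds, using u64::MAX as the
// sentinel for "no result / timed out" (mirrors the described gRPC fallback).
fn to_ms_or_max(latency: Option<Duration>) -> u64 {
    latency
        .map(|d| u64::try_from(d.as_millis()).unwrap_or(u64::MAX))
        .unwrap_or(u64::MAX)
}

fn main() {
    // A sub-second latency now yields a non-zero value instead of truncating to 0 s.
    assert_eq!(to_ms_or_max(Some(Duration::from_millis(250))), 250);
    // A missing result maps to the sentinel.
    assert_eq!(to_ms_or_max(None), u64::MAX);
}
```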

Motivation and Context

See #7380.

How Has This Been Tested?

  • System-level testing (with c::bn::tari_pulse logging at trace level or higher)
  • When running the gRPC GetNetworkState method, liveness_results should not be an empty array if seed peers are contactable.

What process can a PR reviewer use to test or verify this change?

Code review.
System-level testing (to verify in the log files, change c::bn::tari_pulse to at least debug level).

Breaking Changes

  • None
  • Requires data directory on base node to be deleted
  • Requires hard fork
  • Other - Please specify

BREAKING CHANGE: Changed the gRPC liveness_results in the GetNetworkState method to return ms instead of s and documented it as such; this was previously undocumented.

Summary by CodeRabbit

  • New Features

    • Health-check interval is now optional and disabled by default.
  • Bug Fixes

    • More reliable per-peer health checks with clearer timeouts, improved connection cleanup, and fewer hangs/false negatives.
    • Service startup now waits for the first liveness event to ensure initial health state.
  • Refactor

    • Discovery and ping checks run in parallel for faster reporting.
    • Latency measurements standardized to milliseconds.
  • Documentation

    • Config comment updated to indicate health checks default to disabled.

@hansieodendaal hansieodendaal requested a review from a team as a code owner September 8, 2025 13:05
@coderabbitai
Contributor

coderabbitai bot commented Sep 8, 2025

Walkthrough

Clarified proto comments for LivenessResult; gRPC now reports discover and ping latencies in milliseconds with u64 conversion/fallback; TariPulseService refactored to make health-check interval optional, wait for first liveness event on startup, run per-peer discovery/ping in parallel with explicit timeouts, aggregate results, and perform timed disconnects.

Changes

Cohort / File(s) — Summary

  • Proto comment updates — applications/minotari_app_grpc/proto/base_node.proto: Adjusted LivenessResult field comments to specify milliseconds and the timeout sentinel; no schema/field changes.
  • gRPC latency conversion (ms) — applications/minotari_node/src/grpc/base_node_grpc_server.rs: Latency extraction switched from seconds to milliseconds via Duration::as_millis() and u64::try_from, falling back to u64::MAX on overflow/None; no public signature changes.
  • Pulse health-check refactor & config — base_layer/core/src/base_node/tari_pulse_service/mod.rs, applications/minotari_node/src/config.rs, common/config/presets/c_base_node_c.toml: Made health-check interval Option<Duration>, defaulted to None, updated serde to optional_seconds; TariPulseService waits for first liveness event, uses spawn_discovery/spawn_ping helpers with timeouts, aggregates per-peer results, adds disconnect_peer, expands imports and logging; public API changed for TariPulseConfig/initializer signatures.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Pulse as TariPulseService
  participant Seeds as SeedPeers
  participant Dht as DhtDiscoveryRequester
  participant Conn as ConnectivityRequester
  participant Live as LivenessService

  Note over Pulse: Startup waits for first liveness event
  Pulse->>Seeds: fetch_seed_peers (with timeout)
  alt seeds fetched
    loop per-seed
      par Discovery (<=90s)
        Pulse->>Dht: discover(peer_pk) (timeout)
        Dht-->>Pulse: discovery_duration | timeout
      and Ping (<=60s window)
        Pulse->>Conn: dial(peer) (<=60s)
        Pulse->>Live: send_ping (<=15s)
        Live-->>Pulse: matching_pong (<=60s)
      end
      Pulse->>Pulse: aggregate {discover_ms?, ping_ms?}
      Pulse->>Conn: disconnect_peer (timed)
    end
  else fetch failed
    Note over Pulse: abort health-check tick
  end
  Note over Pulse: record/log results
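The parallel, deadline-bounded structure in the diagram can be sketched with a simplified, standard-library analogue. The real service uses tokio tasks and tokio::time::timeout; here plain threads and a channel receive timeout stand in for them, and all durations are illustrative.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

// Run one "check" (stand-in for discover or ping) with its own deadline.
// A slow check yields None instead of hanging the whole health check.
fn bounded_check(work: Duration, deadline: Duration) -> Option<Duration> {
    let (tx, rx) = mpsc::channel();
    let start = Instant::now();
    thread::spawn(move || {
        thread::sleep(work); // stand-in for the discover/ping round-trip
        let _ = tx.send(());
    });
    match rx.recv_timeout(deadline) {
        Ok(()) => Some(start.elapsed()),
        Err(_) => None, // deadline exceeded: report "no result", don't block
    }
}

fn main() {
    // Discovery and ping run concurrently, each with an independent deadline.
    let discovery = thread::spawn(|| bounded_check(Duration::from_millis(10), Duration::from_millis(500)));
    let ping = thread::spawn(|| bounded_check(Duration::from_millis(300), Duration::from_millis(20)));

    let discover_latency = discovery.join().unwrap();
    let ping_latency = ping.join().unwrap();

    assert!(discover_latency.is_some()); // finished within its window
    assert!(ping_latency.is_none());     // timed out, but the tick still completes
}
```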

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • SWvheerden
  • MCozhusheck

Poem

In burrows where the pings now play,
I count the millis and hop away.
Timeouts keep each check in line,
Discovery, ping—both finished fine.
A tidy disconnect, and all is well. 🐇

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Out of Scope Changes Check — ⚠️ Warning: While the core timeout and logging work is in scope, the PR also introduces public API and config signature changes (e.g., TariPulseConfig.liveness_interval: Duration → Option, TariPulseServiceInitializer::new signature change, and BaseNodeConfig.tari_pulse_health_check becoming Option) and a documented gRPC unit change; these are broader, breaking-surface changes beyond the narrow scope of simply adding timeouts around awaits. Such API/config changes affect downstream consumers and should be treated as separate, documented breaking changes rather than incidental edits to the health-check fix. Resolution: Split or isolate the API and default-config signature changes into a separate PR, or clearly document them as breaking changes in this PR (update callers/tests, add migration notes and changelog entries, and highlight the gRPC unit change) so reviewers and downstream consumers can assess and adapt to the wider impact.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 8.33%, which is below the required threshold of 80.00%. Resolution: Run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: The title "feat: add timeouts to health check" succinctly and accurately captures the primary change in the PR—adding timeout guards to the base-node health check to prevent lockups—and aligns with the linked issue (#7380) and implemented changes. It is concise, specific, and readable for a teammate scanning history.
  • Linked Issues Check — ✅ Passed: The changes implement bounded per-peer operations (spawn_discovery with a ~90s timeout, spawn_ping with dial/ping/pong timeouts, and disconnect with its own timeouts), run discovery and ping in parallel, and add increased debug logging; these directly address the linked issue's goals to add timeout guards, surface stuck calls via logs, and ensure health-check runs complete within the intended timeout budget (per-peer worst-case ≈135s, which is < 180s). The gRPC latency unit change to milliseconds and the refactor to isolated helpers further support reliable timing and observability. Based on the provided summaries, the coding requirements from issue #7380 appear satisfied.


@hansieodendaal hansieodendaal marked this pull request as draft September 8, 2025 13:05
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (2)

392-422: Consider logging the received unrelated pong details for debugging.

When receiving unrelated pongs (lines 394-398), it might be helpful to log which peer the pong came from to aid in debugging connectivity issues.

 if let LivenessEvent::ReceivedPong(pong) = &*event {
     if pong.node_id == peer_node_id && pong.nonce == nonce {
         return Some(start.elapsed());
+    } else {
+        trace!(
+            target: LOG_TARGET,
+            "check_health: received pong from {} (nonce: {}) while waiting for {} (nonce: {})",
+            pong.node_id,
+            pong.nonce,
+            peer_node_id,
+            nonce
+        );
     }
 }

332-507: Consider adding a top-level timeout as a safety net.

While individual operations have timeouts, consider wrapping the entire check_health function with an overall timeout (e.g., 3 minutes) to guarantee it never exceeds the expected duration, even if there are unexpected delays in task spawning or joining.

At the beginning of the function, wrap the entire logic:

 async fn check_health(
     mut node_comms: ConnectivityRequester,
     liveness_handle: LivenessHandle,
     node_discovery: DhtDiscoveryRequester,
     notify_comms_health: watch::Sender<Vec<LivenessCheckResult>>,
 ) {
+    // Ensure the entire health check completes within 3 minutes
+    match tokio::time::timeout(Duration::from_secs(180), check_health_inner(
+        node_comms,
+        liveness_handle,
+        node_discovery,
+        notify_comms_health,
+    )).await {
+        Ok(()) => {},
+        Err(_) => {
+            warn!(target: LOG_TARGET, "check_health: overall timeout exceeded (3 minutes)");
+            notify_comms_health.send(vec![]).expect("Channel should be open");
+        }
+    }
+}
+
+async fn check_health_inner(
+    mut node_comms: ConnectivityRequester,
+    liveness_handle: LivenessHandle,
+    node_discovery: DhtDiscoveryRequester,
+    notify_comms_health: watch::Sender<Vec<LivenessCheckResult>>,
+) {
     let results = Arc::new(RwLock::new(Vec::new()));
     // ... rest of the current implementation
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ab82dd1 and d2aef08.

📒 Files selected for processing (1)
  • base_layer/core/src/base_node/tari_pulse_service/mod.rs (2 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.206Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.
📚 Learning: 2025-07-12T03:43:22.545Z
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:13:37.206Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.206Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: test (mainnet, stagenet)
  • GitHub Check: ci
  • GitHub Check: ledger build tests
  • GitHub Check: test (testnet, esmeralda)
  • GitHub Check: test (nextnet, nextnet)
  • GitHub Check: wasm build tests
  • GitHub Check: cargo check with stable
  • GitHub Check: Cucumber tests / FFI
  • GitHub Check: Cucumber tests / Base Layer
🔇 Additional comments (7)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (7)

325-325: Good addition of the lint attribute for function length.

The #[allow(clippy::too_many_lines)] attribute is appropriate here as the function's complexity is justified by the need for comprehensive timeout handling and error logging.


334-344: Excellent implementation of timeout for seed peer retrieval.

The 10-second timeout for get_seeds() with proper error handling prevents the health check from hanging indefinitely. The early return pattern is appropriate when seed peers cannot be retrieved.


365-381: Consider the previous learning about discovery_peer timeouts.

Based on the retrieved learnings, discovery_peer has built-in timeout mechanisms. However, given the specific issue #7380 about health check locks not being released, adding an explicit 90-second timeout is justified as a safeguard. This aligns with the PR objectives to ensure the health check completes within bounded time.


388-437: Well-structured ping/pong timeout implementation.

The implementation correctly addresses the core issue from #7380:

  • 15-second timeout for sending the ping
  • 15-second timeout for receiving the pong
  • Protection against waiting indefinitely for unrelated liveness events (max 5 events)

This ensures that pongs that are never returned don't cause the health check to hang.
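The "matching pong with an overall deadline" pattern described above can be sketched with a standard-library analogue. The real code awaits tokio broadcast events with tokio::time::timeout; here a channel of nonces and recv_timeout stand in, and all timings are illustrative.

```rust
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError};
use std::thread;
use std::time::{Duration, Instant};

// Wait for the pong carrying `want_nonce`: each receive has a short per-event
// timeout, unrelated pongs are skipped, and a total deadline bounds the wait
// so an unanswered ping can never hang the health check.
fn wait_for_pong(rx: &Receiver<u64>, want_nonce: u64, total: Duration) -> bool {
    let start = Instant::now();
    loop {
        match rx.recv_timeout(Duration::from_millis(50)) {
            Ok(nonce) if nonce == want_nonce => return true, // matching pong
            Ok(_) => {}                          // unrelated pong: keep waiting
            Err(RecvTimeoutError::Timeout) => {} // no event yet: keep waiting
            Err(RecvTimeoutError::Disconnected) => return false,
        }
        if start.elapsed() >= total {
            return false; // give up after the total window
        }
    }
}

fn main() {
    let (tx, rx) = channel();
    thread::spawn(move || {
        tx.send(1).unwrap(); // unrelated pong is skipped
        tx.send(42).unwrap(); // the one we are waiting for
    });
    assert!(wait_for_pong(&rx, 42, Duration::from_millis(500)));

    // No pong ever arrives: the wait still terminates within the window.
    let (_tx2, rx2) = channel::<u64>();
    assert!(!wait_for_pong(&rx2, 7, Duration::from_millis(120)));
}
```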


440-462: Good use of concurrent task execution.

Using tokio::join! to run discovery and ping tasks in parallel improves efficiency and ensures both complete or timeout independently. The error handling for task join failures is comprehensive.


465-499: Connection cleanup implementation aligns with known requirements.

The disconnect logic correctly uses parameters (0, 2) as per the retrieved learning that tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks. The 15-second timeouts for both getting the connection and disconnecting ensure bounded execution time.


334-344: Verify that the cumulative timeout bounds meet the 3-minute requirement.

The PR objectives state that each health check should complete within ≤ 3 minutes. Let's verify the maximum possible execution time:

  • Seed retrieval: 10s
  • Per peer (in parallel):
    • Discovery: 90s
    • Ping/pong: 15s + 15s = 30s
    • Connection cleanup: 15s + 15s = 30s

Since discovery and ping run in parallel, the maximum per-peer time is max(90s, 30s) + 30s = 120s. This is within the 3-minute (180s) requirement.

Also applies to: 365-381, 388-437, 465-499
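The reviewer's worst-case arithmetic can be checked mechanically under its own stated assumptions (90s discovery in parallel with a 15s + 15s ping path, followed by 15s + 15s cleanup, plus a one-off 10s seed fetch):

```rust
// Verify the per-peer timeout budget quoted in the review comment above.
fn main() {
    let seed_fetch: u64 = 10; // one-off, per tick
    let discovery: u64 = 90; // runs in parallel with the ping path
    let ping_path: u64 = 15 + 15; // send_ping + pong wait
    let cleanup: u64 = 15 + 15; // get connection + disconnect

    // Discovery and ping overlap, so the slower of the two dominates.
    let per_peer = discovery.max(ping_path) + cleanup;
    assert_eq!(per_peer, 120);

    // Even with the seed fetch added, a tick stays under the 3-minute budget.
    assert!(seed_fetch + per_peer < 180);
}
```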

@hansieodendaal hansieodendaal marked this pull request as ready for review September 9, 2025 06:11
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (5)
applications/minotari_node/src/grpc/base_node_grpc_server.rs (1)

560-570: Avoid duplication: extract “Option -> u64 ms (or u64::MAX)” helper.

Same mapping is repeated for both fields; a small helper improves readability and keeps proto/server semantics in one place.

-                discover_latency: data
-                    .discovery_latency
-                    .map(|v| u64::try_from(v.as_millis()).unwrap_or(u64::MAX))
-                    .unwrap_or_else(|| u64::MAX),
-                ping_latency: data
-                    .ping_latency
-                    .map(|v| u64::try_from(v.as_millis()).unwrap_or(u64::MAX))
-                    .unwrap_or_else(|| u64::MAX),
+                discover_latency: to_ms_or_max(data.discovery_latency),
+                ping_latency: to_ms_or_max(data.ping_latency),

with the helper defined once alongside the handler (rather than repeating the closure per field):

fn to_ms_or_max(d: Option<std::time::Duration>) -> u64 {
    d.map(|v| u64::try_from(v.as_millis()).unwrap_or(u64::MAX)).unwrap_or(u64::MAX)
}
base_layer/core/src/base_node/tari_pulse_service/mod.rs (4)

127-128: Startup delay should be configurable.

Hard-coding 30s can be problematic across environments. Prefer a config knob with a sensible default.

-        tokio::time::sleep(Duration::from_secs(30)).await; // Wait for the node to start up properly
+        tokio::time::sleep(self.config.startup_delay.unwrap_or(Duration::from_secs(30))).await;

Language-only support (outside this hunk):

  • Add pub startup_delay: Option<Duration> to TariPulseConfig defaulting to None.

336-347: Seeds fetch timeout: good guard, but consider reporting an empty snapshot instead of early return.

Returning early drops the whole health snapshot for this tick. Optionally publish an empty liveness_checks to unblock consumers and make behavior explicit.

-            warn!(target: LOG_TARGET, "check_health: timeout getting seed peers");
-            return;
+            warn!(target: LOG_TARGET, "check_health: timeout getting seed peers");
+            let _ = notify_comms_health.send(vec![]);
+            return;

349-401: Bound fan-out to avoid spawning N tasks unbounded on large seed sets.

A semaphore (e.g., 32–64 permits) prevents bursty load on connectivity/liveness subsystems.

-    let mut handles = vec![];
+    let mut handles = vec![];
+    let semaphore = Arc::new(tokio::sync::Semaphore::new(64)); // cap concurrency

     for peer in &peers {
+        let permit = semaphore.clone().acquire_owned().await.unwrap();
         let result_clone = results.clone();
         let mut result = LivenessCheckResult { /* ... */ };

         handles.push(task::spawn(async move {
+            let _permit = permit;
             let discovery_handle = spawn_discovery(discovery, dest_key);
             let ping_handle = spawn_ping(comms_clone.clone(), liveness, liveness_events, result.peer.clone());
             // Await both inner tasks
             let (discovery_result, ping_result) = tokio::join!(discovery_handle, ping_handle);
             /* ... */
         }));
     }

409-432: Unify discovery timeouts
DhtDiscoveryRequester::discover_peer already applies its configured discovery_timeout, so wrapping it again in spawn_discovery with a hard-coded 90 s is redundant and risks them falling out of sync. Replace the Duration::from_secs(90) here with the requester’s own timeout (or remove the outer wrapper) so all discovery calls share the same configured deadline.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ab82dd1 and 94dd061.

📒 Files selected for processing (3)
  • applications/minotari_app_grpc/proto/base_node.proto (1 hunks)
  • applications/minotari_node/src/grpc/base_node_grpc_server.rs (1 hunks)
  • base_layer/core/src/base_node/tari_pulse_service/mod.rs (5 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.206Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.
Learnt from: hansieodendaal
PR: tari-project/tari#6963
File: comms/core/src/peer_manager/manager.rs:60-68
Timestamp: 2025-05-26T02:40:23.812Z
Learning: PeerDatabaseSql in the Tari codebase has been specifically refactored to handle concurrent access and mitigate blocking I/O concerns on async executor threads. The implementation has been tested under high load at both system level and through unit tests like test_concurrent_add_or_update_and_get_closest_peers which validates concurrent read/write operations.
Learnt from: MCozhusheck
PR: tari-project/tari#7262
File: applications/minotari_node/src/grpc/base_node_grpc_server.rs:533-536
Timestamp: 2025-06-30T06:54:18.170Z
Learning: In the Tari codebase, readiness status reporting is split between two gRPC servers: base_node_grpc_server's GetNetworkState method always reports READY when it can respond (indicating network-level readiness), while readiness_grpc_server handles granular readiness states like database initialization and migration progress via ReadinessStatusHandler.
📚 Learning: 2025-08-26T06:28:13.374Z
Learnt from: hansieodendaal
PR: tari-project/tari#7432
File: applications/minotari_app_grpc/proto/wallet.proto:1933-1948
Timestamp: 2025-08-26T06:28:13.374Z
Learning: In PR #7432 "feat!: remove comms from wallet", the HttpPeer message in the wallet.proto should only include url, last_latency, and is_online fields. The base node public_key and node_id fields mentioned in the original PR summary were removed from the scope and should not be included.

Applied to files:

  • applications/minotari_app_grpc/proto/base_node.proto
📚 Learning: 2025-07-12T03:43:22.545Z
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
🧬 Code graph analysis (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (3)
base_layer/p2p/src/services/liveness/service.rs (2)
  • connectivity (346-355)
  • connectivity (359-368)
comms/core/src/builder/comms_node.rs (2)
  • connectivity (257-259)
  • connectivity (325-327)
comms/core/src/connectivity/requester.rs (1)
  • peers (170-173)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: test (testnet, esmeralda)
  • GitHub Check: test (mainnet, stagenet)
  • GitHub Check: test (nextnet, nextnet)
  • GitHub Check: cargo check with stable
  • GitHub Check: ci
  • GitHub Check: Cucumber tests / Base Layer
  • GitHub Check: Cucumber tests / FFI
🔇 Additional comments (2)
applications/minotari_app_grpc/proto/base_node.proto (1)

645-651: Docs align with server sentinel semantics (u64::MAX).

Comment updates match the server-side fallback to u64::MAX for timeouts/overflow. No wire/schema changes introduced.

base_layer/core/src/base_node/tari_pulse_service/mod.rs (1)

503-539: Disconnect parameters are correct for health-check usage.

Using (Minimized::No, 0, 2) matches the two substreams/zero RPC sessions created by the health check, per prior guidance.

Comment on lines +434 to +501
fn spawn_ping(
    comms: ConnectivityRequester,
    mut liveness: LivenessHandle,
    mut liveness_events: broadcast::Receiver<Arc<LivenessEvent>>,
    peer_node_id: NodeId,
) -> JoinHandle<Option<std::time::Duration>> {
    task::spawn(async move {
        // Ensure we have a connection first
        match tokio::time::timeout(Duration::from_secs(60), comms.dial_peer(peer_node_id.clone())).await {
            Ok(Ok(_)) => {},
            Ok(Err(err)) => {
                // This error is not treated as a warning or error as the result will be passed back to the caller
                debug!(target: LOG_TARGET, "check_health: dial {} failed: {}", peer_node_id, err);
                return None;
            },
            Err(_) => {
                // This error is not treated as a warning or error as the result will be passed back to the caller
                debug!(target: LOG_TARGET, "check_health: dial {} timed out", peer_node_id);
                return None;
            },
        }

        // Now send the ping
        let start = Instant::now();
        let nonce = match tokio::time::timeout(Duration::from_secs(15), liveness.send_ping(peer_node_id.clone())).await
        {
            Ok(Ok(n)) => n,
            Ok(Err(err)) => {
                // This error is not treated as a warning or error as the result will be passed back to the caller
                debug!(target: LOG_TARGET, "check_health: send_ping {} failed: {}", peer_node_id, err);
                return None;
            },
            Err(_) => {
                // This error is not treated as a warning or error as the result will be passed back to the caller
                debug!(target: LOG_TARGET, "check_health: send_ping {} timed out", peer_node_id);
                return None;
            },
        };

        // Wait for matching pong with rolling timeouts
        loop {
            match tokio::time::timeout(Duration::from_secs(30), liveness_events.recv()).await {
                Ok(Ok(event)) => {
                    if let LivenessEvent::ReceivedPong(pong) = &*event {
                        if pong.node_id == peer_node_id && pong.nonce == nonce {
                            return Some(start.elapsed());
                        }
                    }
                    // Give up after ~60s total waiting
                    if start.elapsed() >= Duration::from_secs(60) {
                        debug!(target: LOG_TARGET, "check_health: waited too long for pong from {}", peer_node_id);
                        return None;
                    }
                },
                Ok(Err(RecvError::Closed)) => {
                    // This error is not treated as a warning or error as the result will be passed back to the caller
                    return None;
                },
                Ok(Err(RecvError::Lagged(_))) => { /* keep waiting within total window */ },
                Err(_) => {
                    // This error is not treated as a warning or error as the result will be passed back to the caller
debug!(target: LOG_TARGET, "check_health: recv pong {} timed out", peer_node_id);
return None;
},
}
}
})
}

🛠️ Refactor suggestion

Pong wait exits after the first 30s timeout (contrary to “~60s total” intent).

If no events arrive, the Err(_) branch returns immediately after 30s. Keep waiting up to the 60s overall window.

-        loop {
-            match tokio::time::timeout(Duration::from_secs(30), liveness_events.recv()).await {
+        loop {
+            match tokio::time::timeout(Duration::from_secs(30), liveness_events.recv()).await {
                 Ok(Ok(event)) => {
                     if let LivenessEvent::ReceivedPong(pong) = &*event {
                         if pong.node_id == peer_node_id && pong.nonce == nonce {
                             return Some(start.elapsed());
                         }
                     }
-                    // Give up after ~60s total waiting
-                    if start.elapsed() >= Duration::from_secs(60) {
+                    if start.elapsed() >= Duration::from_secs(60) {
                         debug!(target: LOG_TARGET, "check_health: waited too long for pong from {}", peer_node_id);
                         return None;
                     }
                 },
                 Ok(Err(RecvError::Closed)) => {
                     // This error is not treated as a warning or error as the result will be passed back to the caller
                     return None;
                 },
-                Ok(Err(RecvError::Lagged(_))) => { /* keep waiting within total window */ },
-                Err(_) => {
-                    // This error is not treated as a warning or error as the result will be passed back to the caller
-                    debug!(target: LOG_TARGET, "check_health: recv pong {} timed out", peer_node_id);
-                    return None;
-                },
+                Ok(Err(RecvError::Lagged(_))) => {
+                    // keep waiting within total window
+                    if start.elapsed() >= Duration::from_secs(60) {
+                        debug!(target: LOG_TARGET, "check_health: waited too long for pong from {}", peer_node_id);
+                        return None;
+                    }
+                },
+                Err(_) => {
+                    // per-recv timeout; continue until overall deadline
+                    if start.elapsed() >= Duration::from_secs(60) {
+                        debug!(target: LOG_TARGET, "check_health: recv pong {} timed out (overall)", peer_node_id);
+                        return None;
+                    }
+                },
             }
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
fn spawn_ping(
comms: ConnectivityRequester,
mut liveness: LivenessHandle,
mut liveness_events: broadcast::Receiver<Arc<LivenessEvent>>,
peer_node_id: NodeId,
) -> JoinHandle<Option<std::time::Duration>> {
task::spawn(async move {
// Ensure we have a connection first
match tokio::time::timeout(Duration::from_secs(60), comms.dial_peer(peer_node_id.clone())).await {
Ok(Ok(_)) => {},
Ok(Err(err)) => {
// This error is not treated as a warning or error as the result will be passed back to the caller
debug!(target: LOG_TARGET, "check_health: dial {} failed: {}", peer_node_id, err);
return None;
},
Err(_) => {
// This error is not treated as a warning or error as the result will be passed back to the caller
debug!(target: LOG_TARGET, "check_health: dial {} timed out", peer_node_id);
return None;
},
}
// Now send the ping
let start = Instant::now();
let nonce = match tokio::time::timeout(Duration::from_secs(15), liveness.send_ping(peer_node_id.clone())).await
{
Ok(Ok(n)) => n,
Ok(Err(err)) => {
// This error is not treated as a warning or error as the result will be passed back to the caller
debug!(target: LOG_TARGET, "check_health: send_ping {} failed: {}", peer_node_id, err);
return None;
},
Err(_) => {
// This error is not treated as a warning or error as the result will be passed back to the caller
debug!(target: LOG_TARGET, "check_health: send_ping {} timed out", peer_node_id);
return None;
},
};
// Wait for matching pong with rolling timeouts
loop {
match tokio::time::timeout(Duration::from_secs(30), liveness_events.recv()).await {
Ok(Ok(event)) => {
if let LivenessEvent::ReceivedPong(pong) = &*event {
if pong.node_id == peer_node_id && pong.nonce == nonce {
return Some(start.elapsed());
}
}
// Give up after ~60s total waiting
if start.elapsed() >= Duration::from_secs(60) {
debug!(target: LOG_TARGET, "check_health: waited too long for pong from {}", peer_node_id);
return None;
}
},
Ok(Err(RecvError::Closed)) => {
// This error is not treated as a warning or error as the result will be passed back to the caller
return None;
},
Ok(Err(RecvError::Lagged(_))) => { /* keep waiting within total window */ },
Err(_) => {
// This error is not treated as a warning or error as the result will be passed back to the caller
debug!(target: LOG_TARGET, "check_health: recv pong {} timed out", peer_node_id);
return None;
},
}
}
})
}
// Wait for matching pong with rolling timeouts
loop {
match tokio::time::timeout(Duration::from_secs(30), liveness_events.recv()).await {
Ok(Ok(event)) => {
if let LivenessEvent::ReceivedPong(pong) = &*event {
if pong.node_id == peer_node_id && pong.nonce == nonce {
return Some(start.elapsed());
}
}
// Give up after ~60s total waiting
if start.elapsed() >= Duration::from_secs(60) {
debug!(target: LOG_TARGET, "check_health: waited too long for pong from {}", peer_node_id);
return None;
}
},
Ok(Err(RecvError::Closed)) => {
// This error is not treated as a warning or error as the result will be passed back to the caller
return None;
},
Ok(Err(RecvError::Lagged(_))) => {
// keep waiting within total window
if start.elapsed() >= Duration::from_secs(60) {
debug!(target: LOG_TARGET, "check_health: waited too long for pong from {}", peer_node_id);
return None;
}
},
Err(_) => {
// per-recv timeout; continue until overall deadline
if start.elapsed() >= Duration::from_secs(60) {
debug!(target: LOG_TARGET, "check_health: recv pong {} timed out (overall)", peer_node_id);
return None;
}
},
}
}
🤖 Prompt for AI Agents
In base_layer/core/src/base_node/tari_pulse_service/mod.rs around lines 434-501,
the loop currently returns None immediately when the per-iteration timeout (the
Err(_) branch of tokio::time::timeout for liveness_events.recv()) fires after
30s, which violates the intended "~60s total" wait; change that Err(_) branch to
check if start.elapsed() >= Duration::from_secs(60) and only then log and return
None, otherwise continue the loop (don’t return) so the code keeps waiting until
the overall ~60s window elapses.
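Stripped of the async machinery, the intended pattern — per-attempt timeouts bounded by a single overall deadline — reduces to a loop like this synchronous sketch (names are hypothetical; `recv_attempt` returning `None` models a per-recv timeout or lagged event):

```rust
use std::time::{Duration, Instant};

// Sketch of the rolling-timeout pattern: each receive attempt may fail on
// its own, but the loop only gives up once the overall window has elapsed.
fn wait_for_pong<F>(overall: Duration, mut recv_attempt: F) -> Option<u64>
where
    F: FnMut() -> Option<u64>, // None models a per-recv timeout
{
    let start = Instant::now();
    loop {
        if let Some(nonce) = recv_attempt() {
            return Some(nonce); // matching pong arrived
        }
        // Per-attempt failure: keep waiting until the overall deadline.
        if start.elapsed() >= overall {
            return None;
        }
    }
}
```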

Collaborator

@SWvheerden SWvheerden left a comment


Can we add a config option to disable this and make it disabled by default?

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (2)

409-411: Don’t panic if receiver dropped.

Gracefully handle closed watch receivers to avoid crashing the health check task. (Same concern raised previously.)

-    notify_comms_health.send(inner_result).expect("Channel should be open");
+    if let Err(e) = notify_comms_health.send(inner_result) {
+        warn!(target: LOG_TARGET, "check_health: liveness notify failed: {}", e);
+    }

478-501: Pong wait exits early; honor the full ~60s overall window.

On per-iteration timeout (Err(_)), the loop returns immediately after 30s instead of continuing until the 60s cap. (Same concern raised previously.)

-                Err(_) => {
-                    // This error is not treated as a warning or error as the result will be passed back to the caller
-                    debug!(target: LOG_TARGET, "check_health: recv pong {} timed out", peer_node_id);
-                    return None;
-                },
+                Err(_) => {
+                    // per-recv timeout; continue until overall deadline
+                    if start.elapsed() >= Duration::from_secs(60) {
+                        debug!(target: LOG_TARGET, "check_health: recv pong {} timed out (overall)", peer_node_id);
+                        return None;
+                    }
+                },
🧹 Nitpick comments (5)
applications/minotari_node/src/config.rs (1)

157-161: Clarify docs: explicitly state how to disable and tighten wording.

Make it clear that None disables the health check and fix phrasing.

-    /// Interval to check if the seed nodes comms responses are healthy. (Recommended '60 * 10 = 600 s' if you need
-    /// this)
+    /// Interval to check if seed peers' comms responses are healthy.
+    /// Set to `None` to disable.
+    /// Recommended: 600 s (10 minutes) if enabled.
base_layer/core/src/base_node/tari_pulse_service/mod.rs (4)

127-129: Avoid hardcoded startup delay; make it a constant or configurable.

A fixed 30s sleep can be surprising in different environments.

-        tokio::time::sleep(Duration::from_secs(30)).await; // Wait for the node to start up properly
+        const STARTUP_DELAY: Duration = Duration::from_secs(30);
+        tokio::time::sleep(STARTUP_DELAY).await; // Wait for the node to start up properly

Optionally, consider plumbing this via config for tuning.


134-139: Typo in comment.

“Health chack” → “Health check”.

-        // Health chack interval (optional)
+        // Health check interval (optional)

202-213: DNS fetch timeout may be too aggressive.

A 1s DNS TXT query timeout can be flaky on congested networks. Consider 3–5s or making it configurable; you already bound overall cadence via dns_check_interval.


353-371: Cap per-tick concurrency to protect connection manager.

If seed lists get large, spawning unbounded tasks can cause bursts. Consider a small semaphore (e.g., 8–16).

Example sketch (not a drop-in):

use tokio::sync::Semaphore; // assumes tokio's async semaphore

let sem = Arc::new(Semaphore::new(16));
for peer in &peers {
    // Acquire before spawning so at most 16 per-peer tasks run at once
    let permit = sem.clone().acquire_owned().await.unwrap();
    handles.push(task::spawn(async move {
        let _permit = permit; // held until the task completes
        // existing per-peer logic...
    }));
}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 94dd061 and 78ba8c1.

📒 Files selected for processing (3)
  • applications/minotari_node/src/config.rs (3 hunks)
  • base_layer/core/src/base_node/tari_pulse_service/mod.rs (8 hunks)
  • common/config/presets/c_base_node_c.toml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • common/config/presets/c_base_node_c.toml
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.
Learnt from: MCozhusheck
PR: tari-project/tari#7262
File: applications/minotari_node/src/grpc/base_node_grpc_server.rs:533-536
Timestamp: 2025-06-30T06:54:18.170Z
Learning: In the Tari codebase, readiness status reporting is split between two gRPC servers: base_node_grpc_server's GetNetworkState method always reports READY when it can respond (indicating network-level readiness), while readiness_grpc_server handles granular readiness states like database initialization and migration progress via ReadinessStatusHandler.
Learnt from: MCozhusheck
PR: tari-project/tari#7456
File: applications/minotari_node/src/lib.rs:134-145
Timestamp: 2025-09-01T07:22:09.742Z
Learning: ReadinessGrpcServer in the Tari codebase implements the tari_rpc::base_node_server::BaseNode trait, providing a readiness-focused implementation that returns "unavailable" status for most methods while properly handling get_network_state to report readiness status during node startup.
Learnt from: hansieodendaal
PR: tari-project/tari#6963
File: comms/core/src/peer_manager/manager.rs:60-68
Timestamp: 2025-05-26T02:40:23.812Z
Learning: PeerDatabaseSql in the Tari codebase has been specifically refactored to handle concurrent access and mitigate blocking I/O concerns on async executor threads. The implementation has been tested under high load at both system level and through unit tests like test_concurrent_add_or_update_and_get_closest_peers which validates concurrent read/write operations.
📚 Learning: 2025-08-11T10:49:01.479Z
Learnt from: hansieodendaal
PR: tari-project/tari#7401
File: base_layer/core/src/base_node/sync/config.rs:57-60
Timestamp: 2025-08-11T10:49:01.479Z
Learning: In the Tari codebase, the BlockchainSyncConfig struct works correctly without #[serde(default)] attributes on new fields when used within BaseNodeStateMachineConfig. The configuration loading mechanism or usage patterns ensure that missing fields don't cause deserialization failures in practice.

Applied to files:

  • applications/minotari_node/src/config.rs
📚 Learning: 2025-07-12T03:43:22.545Z
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:13:37.206Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.206Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-04-16T07:06:53.981Z
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
🧬 Code graph analysis (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (4)
comms/core/src/builder/comms_node.rs (2)
  • connectivity (257-259)
  • connectivity (325-327)
base_layer/p2p/src/peer_seeds.rs (1)
  • new (107-109)
comms/core/src/connectivity/requester.rs (1)
  • peers (170-173)
comms/dht/examples/memory_net/utilities.rs (1)
  • discovery (134-192)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: Cucumber tests / Base Layer
  • GitHub Check: Cucumber tests / FFI
  • GitHub Check: wasm build tests
  • GitHub Check: test (testnet, esmeralda)
  • GitHub Check: ledger build tests
  • GitHub Check: test (nextnet, nextnet)
  • GitHub Check: test (mainnet, stagenet)
  • GitHub Check: ci
  • GitHub Check: cargo check with stable
🔇 Additional comments (6)
applications/minotari_node/src/config.rs (2)

34-35: Serializer import for optional durations: LGTM.

Using serializers::optional_seconds is the right choice for Option<Duration> in config.


224-225: Backward-compatible default: LGTM.

Defaulting tari_pulse_health_check to None avoids breaking existing configs and matches the new optional semantics.
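For operators, a matching preset entry could look like the fragment below (sketched here; the key name is taken from the config struct, the value is illustrative):

```toml
# Interval in seconds for the seed-peer comms health check.
# Leave commented out (the default) to disable the health check.
#tari_pulse_health_check = 600
```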

base_layer/core/src/base_node/tari_pulse_service/mod.rs (4)

52-57: Optional liveness interval: LGTM.

Making liveness_interval an Option<Duration> cleanly disables health checks when unset.


147-166: Single-flight guard for health checks: LGTM.

Using a try_lock on Arc<Mutex<()>> to skip overlapping runs is appropriate.


507-543: Disconnect parameters: LGTM.

disconnect_if_unused(Minimized::No, 0, 2, ...) matches the health-check’s 0 RPC sessions and 2 substreams design.


545-552: tick_optional helper: LGTM.

A neat way to disable the select branch without extra branching in the loop.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (2)

451-453: Don’t panic if receiver dropped (duplicate of prior review).

Handle watch send errors gracefully to avoid crashing the health task.

-    notify_comms_health.send(inner_result).expect("Channel should be open");
+    if let Err(e) = notify_comms_health.send(inner_result) {
+        warn!(target: LOG_TARGET, "check_health: liveness notify failed: {}", e);
+    }

521-544: Pong wait exits after first 30s timeout; keep waiting up to 60s total (duplicate of prior review).

Currently returns None on the first per‑recv timeout; violates the “~60s total” intent and may miss late pongs.

-                Ok(Err(RecvError::Lagged(_))) => { /* keep waiting within total window */ },
-                Err(_) => {
-                    // This error is not treated as a warning or error as the result will be passed back to the caller
-                    debug!(target: LOG_TARGET, "check_health: recv pong {} timed out", peer_node_id);
-                    return None;
-                },
+                Ok(Err(RecvError::Lagged(_))) => {
+                    if start.elapsed() >= Duration::from_secs(60) {
+                        debug!(target: LOG_TARGET, "check_health: waited too long for pong from {}", peer_node_id);
+                        return None;
+                    }
+                },
+                Err(_) => {
+                    // per-recv timeout; continue until overall deadline
+                    if start.elapsed() >= Duration::from_secs(60) {
+                        debug!(target: LOG_TARGET, "check_health: recv pong {} timed out (overall)", peer_node_id);
+                        return None;
+                    }
+                },
🧹 Nitpick comments (6)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (6)

137-143: Nit: fix typo in comment.

“Health chack interval” → “Health check interval”.

-        // Health chack interval (optional)
+        // Health check interval (optional)

380-381: Prefer Mutex over RwLock for push-only writes.

This path only performs writes (push) and a single final read; RwLock adds overhead without benefit.

-    let results = Arc::new(RwLock::new(Vec::new()));
+    let results = Arc::new(Mutex::new(Vec::new()));
@@
-            (*result_clone).write().await.push(result);
+            result_clone.lock().await.push(result);

Also drop RwLock from the imports if unused.

Also applies to: 446-447


439-445: Optional: log latencies in ms for readability and alignment with gRPC.

Formatting Option<Duration> with {:?} is noisy. Consider logging ms integers.

-            debug!(
-                target: LOG_TARGET,
-                "check_health: peer {} discovery: {:.2?}, ping: {:.2?}",
-                result.peer,
-                result.discovery_latency,
-                result.ping_latency
-            );
+            let d_ms = result.discovery_latency.map(|d| d.as_millis() as u64);
+            let p_ms = result.ping_latency.map(|d| d.as_millis() as u64);
+            debug!(target: LOG_TARGET,
+                   "check_health: peer {} discovery_ms: {:?}, ping_ms: {:?}",
+                   result.peer, d_ms, p_ms);

462-465: Unnecessary into() on CommsPublicKey.

NodeDestination::PublicKey already takes the same key type.

-            discovery.discover_peer(dest_key.clone(), NodeDestination::PublicKey(dest_key.clone().into())),
+            discovery.discover_peer(dest_key.clone(), NodeDestination::PublicKey(dest_key.clone())),

488-500: Double timeouts around discover/dial — verify intended behavior.

discover_peer/dial_peer may have built‑in timeouts. Wrapping them can obscure underlying errors or trigger earlier-than-expected timeouts.

Would you confirm the current internal timeouts of these APIs and whether the added 90s/60s outer bounds are stricter and required for this interlock? If redundant, consider relying on the inner timeout and only applying an overall per‑peer budget at the orchestration layer.

Also applies to: 461-465


409-414: Avoid extra task::spawn; use async helpers + join! for lower overhead.

You already spawn one task per peer; spawning two more per peer is unnecessary.

-            let discovery_handle = spawn_discovery(discovery, dest_key);
-            let ping_handle = spawn_ping(comms_clone.clone(), liveness, liveness_events, result.peer.clone());
-            // Await both inner tasks
-            let (discovery_result, ping_result) = tokio::join!(discovery_handle, ping_handle);
+            let (discovery_result, ping_result) = tokio::join!(
+                do_discovery(discovery, dest_key),
+                do_ping(comms_clone.clone(), liveness, liveness_events, result.peer.clone())
+            );

Add helpers (replace spawn_* with these):

async fn do_discovery(
    mut discovery: DhtDiscoveryRequester,
    dest_key: CommsPublicKey,
) -> Option<Duration> { /* same body as before, without outer task::spawn */ }

async fn do_ping(
    comms: ConnectivityRequester,
    mut liveness: LivenessHandle,
    mut liveness_events: broadcast::Receiver<Arc<LivenessEvent>>,
    peer_node_id: NodeId,
) -> Option<Duration> { /* same body as before, without outer task::spawn */ }

Also applies to: 455-478, 480-547

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 78ba8c1 and 526c108.

📒 Files selected for processing (4)
  • applications/minotari_node/src/config.rs (3 hunks)
  • applications/minotari_node/src/grpc/base_node_grpc_server.rs (1 hunks)
  • base_layer/core/src/base_node/tari_pulse_service/mod.rs (9 hunks)
  • common/config/presets/c_base_node_c.toml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • common/config/presets/c_base_node_c.toml
🚧 Files skipped from review as they are similar to previous changes (2)
  • applications/minotari_node/src/config.rs
  • applications/minotari_node/src/grpc/base_node_grpc_server.rs
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.
📚 Learning: 2025-07-12T03:43:22.545Z
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:13:37.206Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.206Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-04-16T07:06:53.981Z
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.

Applied to files:

  • base_layer/core/src/base_node/tari_pulse_service/mod.rs
🧬 Code graph analysis (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (4)
base_layer/p2p/src/services/liveness/service.rs (2)
  • connectivity (346-355)
  • connectivity (359-368)
comms/core/src/builder/comms_node.rs (2)
  • connectivity (257-259)
  • connectivity (325-327)
base_layer/p2p/src/peer_seeds.rs (1)
  • new (107-109)
comms/core/src/connectivity/requester.rs (1)
  • peers (170-173)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: test (nextnet, nextnet)
  • GitHub Check: test (mainnet, stagenet)
  • GitHub Check: test (testnet, esmeralda)
  • GitHub Check: file licenses
  • GitHub Check: wasm build tests
  • GitHub Check: cargo check with stable
  • GitHub Check: ci
  • GitHub Check: ledger build tests
  • GitHub Check: Cucumber tests / FFI
  • GitHub Check: Cucumber tests / Base Layer
🔇 Additional comments (2)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (2)

553-555: Disconnect parameters match expected usage. LGTM.

disconnect_if_unused(Minimized::No, 0, 2, ...) aligns with health check opening 0 RPC sessions and 2 substreams.

If this changes in future (e.g., more substreams), update the counts to avoid disconnecting active peers.
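The guard condition behind those counts can be sketched as follows. This is a minimal illustrative sketch, not the Tari connectivity API: `should_disconnect` and its constants are hypothetical names for the check that `disconnect_if_unused(Minimized::No, 0, 2, ...)` expresses.

```rust
// Hypothetical sketch of the disconnect-if-unused guard described above.
// The real method lives on the Tari connectivity layer; names here are
// illustrative only.
fn should_disconnect(open_rpc_sessions: usize, open_substreams: usize) -> bool {
    // The health check itself holds 0 RPC sessions and 2 substreams, so
    // anything at or below those counts means no other user is active.
    const EXPECTED_RPC_SESSIONS: usize = 0;
    const EXPECTED_SUBSTREAMS: usize = 2;
    open_rpc_sessions <= EXPECTED_RPC_SESSIONS && open_substreams <= EXPECTED_SUBSTREAMS
}

fn main() {
    // The health check's own footprint is safe to disconnect...
    println!("{}", should_disconnect(0, 2));
    // ...but any extra RPC session or substream indicates another user.
    println!("{}", should_disconnect(1, 2));
}
```

This is why the review note matters: if the health check ever opens more substreams, the hard-coded counts would start disconnecting peers that are actually in use.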


320-334: Confirm consumers are updated to the new `Option` parameters

  • Found one call site: applications/minotari_node/src/bootstrap.rs calls TariPulseServiceInitializer::new(base_node_config.tari_pulse_interval, base_node_config.tari_pulse_health_check). Verify those two fields are now `Option` types in the config struct; if they are still Duration/bool, either update the call site to pass an `Option` (e.g., Some(...)) or adjust the config to match the new API.
  • Also ensure any other callers (none found by the search) are updated.

Comment on lines +127 to +131

// Wait for the node comms to start up properly
if let Ok(false) = self.wait_for_first_liveness_event().await {
return;
}


⚠️ Potential issue

Don’t block startup indefinitely waiting for a liveness event. Add an overall timeout.

If no liveness events are emitted (e.g., liveness service not started/quiet), run() can stall forever. Bound the wait and proceed.

Apply this diff:

-        // Wait for the node comms to start up properly
-        if let Ok(false) = self.wait_for_first_liveness_event().await {
-            return;
-        }
+        // Wait for the node comms to start up properly (bounded)
+        match tokio::time::timeout(Duration::from_secs(60), self.wait_for_first_liveness_event()).await {
+            Ok(Ok(true)) => {},
+            Ok(Ok(false)) => return, // shutdown requested
+            Ok(Err(e)) => {
+                warn!(target: LOG_TARGET, "wait_for_first_liveness_event error: {}", e);
+            },
+            Err(_) => {
+                info!(target: LOG_TARGET, "Timeout waiting for first liveness event; continuing startup");
+            },
+        }
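The bounded-wait pattern in the diff above can be illustrated with std-only Rust, using `mpsc::Receiver::recv_timeout` in place of `tokio::time::timeout`. This is an analogy, not the actual service code: `StartupDecision` and `wait_for_first_event` are hypothetical names, and the channel stands in for the liveness event stream.

```rust
use std::sync::mpsc;
use std::time::Duration;

// Outcome of waiting for the first liveness event, mirroring the match
// arms in the suggested diff above.
#[derive(Debug, PartialEq)]
enum StartupDecision {
    Proceed,  // event arrived (Ok(true)) or the wait timed out
    Shutdown, // shutdown requested (Ok(false))
}

// Bounded wait: never block startup longer than `timeout`.
fn wait_for_first_event(rx: &mpsc::Receiver<bool>, timeout: Duration) -> StartupDecision {
    match rx.recv_timeout(timeout) {
        Ok(true) => StartupDecision::Proceed,   // liveness event received
        Ok(false) => StartupDecision::Shutdown, // shutdown signal
        Err(_) => StartupDecision::Proceed,     // timed out or sender gone: continue
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    tx.send(true).unwrap();
    // Event already queued: returns immediately with Proceed.
    assert_eq!(
        wait_for_first_event(&rx, Duration::from_millis(50)),
        StartupDecision::Proceed
    );
    // No further events: the wait times out instead of blocking forever.
    assert_eq!(
        wait_for_first_event(&rx, Duration::from_millis(10)),
        StartupDecision::Proceed
    );
    println!("bounded wait ok");
}
```

The key property is the third match arm: a silent liveness service degrades to a short delay at startup rather than an indefinite stall.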
🤖 Prompt for AI Agents
In base_layer/core/src/base_node/tari_pulse_service/mod.rs around lines 127 to
131, the current await on wait_for_first_liveness_event() can block forever.
Wrap the await in a timeout (e.g., tokio::time::timeout with a sensible
duration), then match the result: if the timeout elapses or the call returns
Err, log a warning and proceed rather than staying blocked; if it returns
Ok(false), return (shutdown was requested); if it returns Ok(true), continue as
before. Ensure the timeout future is awaited correctly and any errors are
handled and logged.

@SWvheerden SWvheerden merged commit a5cb9c0 into tari-project:development Sep 17, 2025
15 of 16 checks passed
@hansieodendaal hansieodendaal deleted the ho_health_check branch September 17, 2025 13:07
sdbondi added a commit to sdbondi/tari that referenced this pull request Sep 18, 2025
* development:
  fix: always cancel transactions (tari-project#7500)
  feat: throttle background rebuild tasks (tari-project#7502)
  chore: clearer logic for is_micro_tari (tari-project#7499)
  feat: add sqlite concurrency and pool hardening (tari-project#7492)
  feat: add timeouts to health check (tari-project#7482)
  feat: performance improvement of wallet output (tari-project#7490)
  fix: discovery syncing regression bug (tari-project#7488)
  fix: update peer database after re-initializing node identity config file (tari-project#7497)
  fix: peer_db null update where none should be written as null (tari-project#7489)
  fix: filter node identity onion address when running TCP node (tari-project#7486)
  chore: refactor multi sig and offline sign (tari-project#7487)
  feat: add http cache config (tari-project#7477)

Development

Successfully merging this pull request may close these issues.

Fix seed peer health check lock not being released
