feat: add timeouts to health check #7482
Walkthrough

Clarified proto comments for LivenessResult; gRPC now reports discover and ping latencies in milliseconds with u64 conversion/fallback; TariPulseService refactored to make the health-check interval optional, wait for the first liveness event on startup, run per-peer discovery/ping in parallel with explicit timeouts, aggregate results, and perform timed disconnects.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Pulse as TariPulseService
    participant Seeds as SeedPeers
    participant Dht as DhtDiscoveryRequester
    participant Conn as ConnectivityRequester
    participant Live as LivenessService
    Note over Pulse: Startup waits for first liveness event
    Pulse->>Seeds: fetch_seed_peers (with timeout)
    alt seeds fetched
        loop per-seed
            par Discovery (<=90s)
                Pulse->>Dht: discover(peer_pk) (timeout)
                Dht-->>Pulse: discovery_duration | timeout
            and Ping (<=60s window)
                Pulse->>Conn: dial(peer) (<=60s)
                Pulse->>Live: send_ping (<=15s)
                Live-->>Pulse: matching_pong (<=60s)
            end
            Pulse->>Pulse: aggregate {discover_ms?, ping_ms?}
            Pulse->>Conn: disconnect_peer (timed)
        end
    else fetch failed
        Note over Pulse: abort health-check tick
    end
    Note over Pulse: record/log results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks: ❌ Failed checks (2 warnings) · ✅ Passed checks (3 passed)
Actionable comments posted: 0
🧹 Nitpick comments (2)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (2)
392-422: Consider logging the received unrelated pong details for debugging.

When receiving unrelated pongs (lines 394-398), it might be helpful to log which peer the pong came from to aid in debugging connectivity issues.
```diff
 if let LivenessEvent::ReceivedPong(pong) = &*event {
     if pong.node_id == peer_node_id && pong.nonce == nonce {
         return Some(start.elapsed());
+    } else {
+        trace!(
+            target: LOG_TARGET,
+            "check_health: received pong from {} (nonce: {}) while waiting for {} (nonce: {})",
+            pong.node_id,
+            pong.nonce,
+            peer_node_id,
+            nonce
+        );
     }
 }
```
332-507: Consider adding a top-level timeout as a safety net.

While individual operations have timeouts, consider wrapping the entire `check_health` function with an overall timeout (e.g., 3 minutes) to guarantee it never exceeds the expected duration, even if there are unexpected delays in task spawning or joining.

At the beginning of the function, wrap the entire logic (note the `clone()` so the sender can still be used on timeout):

```diff
 async fn check_health(
     mut node_comms: ConnectivityRequester,
     liveness_handle: LivenessHandle,
     node_discovery: DhtDiscoveryRequester,
     notify_comms_health: watch::Sender<Vec<LivenessCheckResult>>,
 ) {
+    // Ensure the entire health check completes within 3 minutes
+    match tokio::time::timeout(Duration::from_secs(180), check_health_inner(
+        node_comms,
+        liveness_handle,
+        node_discovery,
+        notify_comms_health.clone(),
+    )).await {
+        Ok(()) => {},
+        Err(_) => {
+            warn!(target: LOG_TARGET, "check_health: overall timeout exceeded (3 minutes)");
+            notify_comms_health.send(vec![]).expect("Channel should be open");
+        },
+    }
+}
+
+async fn check_health_inner(
+    mut node_comms: ConnectivityRequester,
+    liveness_handle: LivenessHandle,
+    node_discovery: DhtDiscoveryRequester,
+    notify_comms_health: watch::Sender<Vec<LivenessCheckResult>>,
+) {
     let results = Arc::new(RwLock::new(Vec::new()));
     // ... rest of the current implementation
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (2 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.206Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.
📚 Learning: 2025-07-12T03:43:22.545Z
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.
Applied to files:
base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:13:37.206Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.206Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
Applied to files:
base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Applied to files:
base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Applied to files:
base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.
Applied to files:
base_layer/core/src/base_node/tari_pulse_service/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: test (mainnet, stagenet)
- GitHub Check: ci
- GitHub Check: ledger build tests
- GitHub Check: test (testnet, esmeralda)
- GitHub Check: test (nextnet, nextnet)
- GitHub Check: wasm build tests
- GitHub Check: cargo check with stable
- GitHub Check: Cucumber tests / FFI
- GitHub Check: Cucumber tests / Base Layer
🔇 Additional comments (7)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (7)
325-325: Good addition of the lint attribute for function length.

The `#[allow(clippy::too_many_lines)]` attribute is appropriate here as the function's complexity is justified by the need for comprehensive timeout handling and error logging.
334-344: Excellent implementation of timeout for seed peer retrieval.

The 10-second timeout for `get_seeds()` with proper error handling prevents the health check from hanging indefinitely. The early return pattern is appropriate when seed peers cannot be retrieved.
365-381: Consider the previous learning about discovery_peer timeouts.

Based on the retrieved learnings, `discovery_peer` has built-in timeout mechanisms. However, given the specific issue #7380 about health check locks not being released, adding an explicit 90-second timeout is justified as a safeguard. This aligns with the PR objectives to ensure the health check completes within bounded time.
388-437: Well-structured ping/pong timeout implementation.The implementation correctly addresses the core issue from #7380:
- 15-second timeout for sending the ping
- 15-second timeout for receiving the pong
- Protection against waiting indefinitely for unrelated liveness events (max 5 events)
This ensures that pongs that are never returned don't cause the health check to hang.
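The bounded-wait shape described above (short per-receive timeouts, a hard overall cap, and tolerance for unrelated events) can be sketched without tokio. A minimal std-only illustration of the pattern, using an `mpsc` channel in place of the liveness event stream and a plain `u64` nonce (`wait_for_matching` is a hypothetical helper, not code from this PR):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Wait for a message equal to `want`, using a short per-receive timeout and
/// a hard overall deadline. Unrelated messages and per-receive timeouts do
/// not abort the wait; only the overall deadline (or a closed channel) does.
fn wait_for_matching(
    rx: &mpsc::Receiver<u64>,
    want: u64,
    per_recv: Duration,
    overall: Duration,
) -> Option<Duration> {
    let start = Instant::now();
    loop {
        if start.elapsed() >= overall {
            return None; // overall deadline exceeded
        }
        match rx.recv_timeout(per_recv) {
            Ok(nonce) if nonce == want => return Some(start.elapsed()),
            Ok(_) => continue, // unrelated event: keep waiting
            Err(mpsc::RecvTimeoutError::Timeout) => continue, // re-check deadline
            Err(mpsc::RecvTimeoutError::Disconnected) => return None,
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        tx.send(1).unwrap(); // unrelated "pong"
        thread::sleep(Duration::from_millis(50));
        tx.send(42).unwrap(); // matching "pong"
    });
    let got = wait_for_matching(&rx, 42, Duration::from_millis(20), Duration::from_secs(2));
    assert!(got.is_some());
    println!("matched after {:?}", got.unwrap());
}
```

The key design point, echoed in the review below, is that a per-receive timeout must loop back to the deadline check rather than return, otherwise the effective wait collapses to one per-receive window.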
440-462: Good use of concurrent task execution.

Using `tokio::join!` to run the discovery and ping tasks in parallel improves efficiency and ensures both complete or time out independently. The error handling for task join failures is comprehensive.
465-499: Connection cleanup implementation aligns with known requirements.

The disconnect logic correctly uses parameters (0, 2) as per the retrieved learning that tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks. The 15-second timeouts for both getting the connection and disconnecting ensure bounded execution time.
334-344: Verify that the cumulative timeout bounds meet the 3-minute requirement.

The PR objectives state that each health check should complete within ≤ 3 minutes. Let's verify the maximum possible execution time:
- Seed retrieval: 10s
- Per peer (in parallel):
- Discovery: 90s
- Ping/pong: 15s + 15s = 30s
- Connection cleanup: 15s + 15s = 30s
Since discovery and ping run in parallel, the maximum per-peer time is max(90s, 30s) + 30s = 120s. This is within the 3-minute (180s) requirement.
Also applies to: 365-381, 388-437, 465-499
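The bound above is easy to sanity-check mechanically. A minimal sketch, assuming the timeout constants quoted in this comment (10 s seed fetch, 90 s discovery, 15 s + 15 s ping/pong, 15 s + 15 s cleanup):

```rust
use std::time::Duration;

fn main() {
    let seed_fetch = Duration::from_secs(10);
    let discovery = Duration::from_secs(90);
    let ping_pong = Duration::from_secs(15 + 15); // send ping + wait for pong
    let cleanup = Duration::from_secs(15 + 15);   // get connection + disconnect

    // Discovery and ping/pong run in parallel; cleanup follows sequentially.
    let per_peer = discovery.max(ping_pong) + cleanup;
    // Per-peer work runs concurrently across peers, so the worst case for a
    // tick is the seed fetch plus the slowest peer.
    let worst_case = seed_fetch + per_peer;

    assert_eq!(per_peer, Duration::from_secs(120));
    assert!(worst_case <= Duration::from_secs(180)); // within the 3-minute budget
    println!("worst case per tick: {:?}", worst_case);
}
```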
- This PR fixes an interlock in the health check whereby either discover-peer or ping-peer locks up and never returns. This culminated in the health check being interlocked for the remainder of the base node's uptime.
- The discover-peer and ping-peer tests are now run in parallel.
- The ping-peer test is now preceded by a dial-peer to ensure we have a connection before we proceed.
- Changed the gRPC result to return `ms` instead of `s` and documented it as such; this was previously undocumented. When the network is well connected, discovery and ping latencies would be less than `1 s`, resulting in zero values being returned.
Force-pushed 98ff2ee to 94dd061 (compare)
Actionable comments posted: 2
🧹 Nitpick comments (5)
applications/minotari_node/src/grpc/base_node_grpc_server.rs (1)
560-570: Avoid duplication: extract an "Option -> u64 ms (or u64::MAX)" helper.

The same mapping is repeated for both fields; a small helper improves readability and keeps proto/server semantics in one place.

```diff
-discover_latency: data
-    .discovery_latency
-    .map(|v| u64::try_from(v.as_millis()).unwrap_or(u64::MAX))
-    .unwrap_or_else(|| u64::MAX),
-ping_latency: data
-    .ping_latency
-    .map(|v| u64::try_from(v.as_millis()).unwrap_or(u64::MAX))
-    .unwrap_or_else(|| u64::MAX),
+discover_latency: {
+    let to_ms_or_max = |d: Option<std::time::Duration>| {
+        d.map(|v| u64::try_from(v.as_millis()).unwrap_or(u64::MAX)).unwrap_or(u64::MAX)
+    };
+    to_ms_or_max(data.discovery_latency)
+},
+ping_latency: {
+    let to_ms_or_max = |d: Option<std::time::Duration>| {
+        d.map(|v| u64::try_from(v.as_millis()).unwrap_or(u64::MAX)).unwrap_or(u64::MAX)
+    };
+    to_ms_or_max(data.ping_latency)
+},
```

base_layer/core/src/base_node/tari_pulse_service/mod.rs (4)
127-128: Startup delay should be configurable.

Hard-coding 30s can be problematic across environments. Prefer a config knob with a sensible default.

```diff
-tokio::time::sleep(Duration::from_secs(30)).await; // Wait for the node to start up properly
+tokio::time::sleep(self.config.startup_delay.unwrap_or(Duration::from_secs(30))).await;
```

Supporting change (outside this hunk):

- Add `pub startup_delay: Option<Duration>` to `TariPulseConfig`, defaulting to `None`.
336-347: Seeds fetch timeout: good guard, but consider reporting an empty snapshot instead of returning early.

Returning early drops the whole health snapshot for this tick. Optionally publish an empty `liveness_checks` to unblock consumers and make the behavior explicit.

```diff
-warn!(target: LOG_TARGET, "check_health: timeout getting seed peers");
-return;
+warn!(target: LOG_TARGET, "check_health: timeout getting seed peers");
+let _ = notify_comms_health.send(vec![]);
+return;
```
349-401: Bound fan-out to avoid spawning N tasks unbounded on large seed sets.

A semaphore (e.g., 32–64 permits) prevents bursty load on the connectivity/liveness subsystems.

```diff
-let mut handles = vec![];
+let mut handles = vec![];
+let semaphore = Arc::new(tokio::sync::Semaphore::new(64)); // cap concurrency
 for peer in &peers {
+    let permit = semaphore.clone().acquire_owned().await.unwrap();
     let result_clone = results.clone();
     let mut result = LivenessCheckResult { /* ... */ };
     handles.push(task::spawn(async move {
+        let _permit = permit;
         let discovery_handle = spawn_discovery(discovery, dest_key);
         let ping_handle = spawn_ping(comms_clone.clone(), liveness, liveness_events, result.peer.clone());
         // Await both inner tasks
         let (discovery_result, ping_result) = tokio::join!(discovery_handle, ping_handle);
         /* ... */
     }));
 }
```
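The permit idea above can be illustrated without tokio at all. A minimal std-only sketch of the same bounded-fan-out pattern, using a channel of permits as a stand-in for `tokio::sync::Semaphore` (the `run_bounded` helper and the `job * 2` body are hypothetical, purely for illustration):

```rust
use std::sync::mpsc;
use std::thread;

/// Run `jobs` with at most `cap` in flight at once, using a channel of
/// `()` permits: a job starts only after taking a permit, and returns it
/// when done. This is the same shape as acquiring an owned semaphore
/// permit before spawning each per-peer task.
fn run_bounded(jobs: Vec<u32>, cap: usize) -> Vec<u32> {
    let (permit_tx, permit_rx) = mpsc::channel::<()>();
    for _ in 0..cap {
        permit_tx.send(()).unwrap(); // preload `cap` permits
    }
    let (out_tx, out_rx) = mpsc::channel();
    let n = jobs.len();
    for job in jobs {
        permit_rx.recv().unwrap(); // blocks while `cap` jobs are in flight
        let permit_tx = permit_tx.clone();
        let out_tx = out_tx.clone();
        thread::spawn(move || {
            out_tx.send(job * 2).unwrap(); // stand-in for discover + ping work
            permit_tx.send(()).unwrap();   // release the permit
        });
    }
    (0..n).map(|_| out_rx.recv().unwrap()).collect()
}

fn main() {
    let mut out = run_bounded((1..=8).collect(), 3);
    out.sort();
    assert_eq!(out, vec![2, 4, 6, 8, 10, 12, 14, 16]);
}
```

With a semaphore the release happens automatically when the owned permit is dropped; the channel version makes the acquire/release pair explicit.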
409-432: Unify discovery timeouts.

`DhtDiscoveryRequester::discover_peer` already applies its configured `discovery_timeout`, so wrapping it again in `spawn_discovery` with a hard-coded 90 s is redundant and risks the two falling out of sync. Replace the `Duration::from_secs(90)` here with the requester's own timeout (or remove the outer wrapper) so all discovery calls share the same configured deadline.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
applications/minotari_app_grpc/proto/base_node.proto (1 hunks)
applications/minotari_node/src/grpc/base_node_grpc_server.rs (1 hunks)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (5 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.206Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.
Learnt from: hansieodendaal
PR: tari-project/tari#6963
File: comms/core/src/peer_manager/manager.rs:60-68
Timestamp: 2025-05-26T02:40:23.812Z
Learning: PeerDatabaseSql in the Tari codebase has been specifically refactored to handle concurrent access and mitigate blocking I/O concerns on async executor threads. The implementation has been tested under high load at both system level and through unit tests like test_concurrent_add_or_update_and_get_closest_peers which validates concurrent read/write operations.
Learnt from: MCozhusheck
PR: tari-project/tari#7262
File: applications/minotari_node/src/grpc/base_node_grpc_server.rs:533-536
Timestamp: 2025-06-30T06:54:18.170Z
Learning: In the Tari codebase, readiness status reporting is split between two gRPC servers: base_node_grpc_server's GetNetworkState method always reports READY when it can respond (indicating network-level readiness), while readiness_grpc_server handles granular readiness states like database initialization and migration progress via ReadinessStatusHandler.
📚 Learning: 2025-08-26T06:28:13.374Z
Learnt from: hansieodendaal
PR: tari-project/tari#7432
File: applications/minotari_app_grpc/proto/wallet.proto:1933-1948
Timestamp: 2025-08-26T06:28:13.374Z
Learning: In PR #7432 "feat!: remove comms from wallet", the HttpPeer message in the wallet.proto should only include url, last_latency, and is_online fields. The base node public_key and node_id fields mentioned in the original PR summary were removed from the scope and should not be included.
Applied to files:
applications/minotari_app_grpc/proto/base_node.proto
📚 Learning: 2025-07-12T03:43:22.545Z
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.
Applied to files:
base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Applied to files:
base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Applied to files:
base_layer/core/src/base_node/tari_pulse_service/mod.rs
🧬 Code graph analysis (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (3)
base_layer/p2p/src/services/liveness/service.rs (2)
- connectivity (346-355), connectivity (359-368)

comms/core/src/builder/comms_node.rs (2)
- connectivity (257-259), connectivity (325-327)

comms/core/src/connectivity/requester.rs (1)
- peers (170-173)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: test (testnet, esmeralda)
- GitHub Check: test (mainnet, stagenet)
- GitHub Check: test (nextnet, nextnet)
- GitHub Check: cargo check with stable
- GitHub Check: ci
- GitHub Check: Cucumber tests / Base Layer
- GitHub Check: Cucumber tests / FFI
🔇 Additional comments (2)
applications/minotari_app_grpc/proto/base_node.proto (1)
645-651: Docs align with server sentinel semantics (u64::MAX).

Comment updates match the server-side fallback to u64::MAX for timeouts/overflow. No wire/schema changes introduced.
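For clarity, the sentinel semantics can be captured in a tiny std-only helper. A minimal sketch mirroring the server-side mapping (the `to_ms_or_max` name is illustrative, not from the codebase):

```rust
use std::time::Duration;

/// Map an optional latency to whole milliseconds, using u64::MAX as the
/// documented sentinel for a missing (timed-out) or overflowing value.
fn to_ms_or_max(latency: Option<Duration>) -> u64 {
    latency
        .map(|v| u64::try_from(v.as_millis()).unwrap_or(u64::MAX))
        .unwrap_or(u64::MAX)
}

fn main() {
    // Sub-second latencies now survive as non-zero millisecond values.
    assert_eq!(to_ms_or_max(Some(Duration::from_millis(850))), 850);
    // A timed-out check (None) maps to the sentinel.
    assert_eq!(to_ms_or_max(None), u64::MAX);
    // An absurdly large duration overflows u64 ms and also maps to the sentinel.
    assert_eq!(to_ms_or_max(Some(Duration::from_secs(u64::MAX))), u64::MAX);
    println!("sentinel semantics hold");
}
```

This is exactly why the proto switched from seconds to milliseconds: with seconds, a well-connected network would report `0` for every healthy peer.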
base_layer/core/src/base_node/tari_pulse_service/mod.rs (1)
503-539: Disconnect parameters are correct for health-check usage.

Using `(Minimized::No, 0, 2)` matches the two substreams/zero RPC sessions created by the health check, per prior guidance.
```rust
fn spawn_ping(
    comms: ConnectivityRequester,
    mut liveness: LivenessHandle,
    mut liveness_events: broadcast::Receiver<Arc<LivenessEvent>>,
    peer_node_id: NodeId,
) -> JoinHandle<Option<std::time::Duration>> {
    task::spawn(async move {
        // Ensure we have a connection first
        match tokio::time::timeout(Duration::from_secs(60), comms.dial_peer(peer_node_id.clone())).await {
            Ok(Ok(_)) => {},
            Ok(Err(err)) => {
                // This error is not treated as a warning or error as the result will be passed back to the caller
                debug!(target: LOG_TARGET, "check_health: dial {} failed: {}", peer_node_id, err);
                return None;
            },
            Err(_) => {
                // This error is not treated as a warning or error as the result will be passed back to the caller
                debug!(target: LOG_TARGET, "check_health: dial {} timed out", peer_node_id);
                return None;
            },
        }

        // Now send the ping
        let start = Instant::now();
        let nonce = match tokio::time::timeout(Duration::from_secs(15), liveness.send_ping(peer_node_id.clone())).await
        {
            Ok(Ok(n)) => n,
            Ok(Err(err)) => {
                // This error is not treated as a warning or error as the result will be passed back to the caller
                debug!(target: LOG_TARGET, "check_health: send_ping {} failed: {}", peer_node_id, err);
                return None;
            },
            Err(_) => {
                // This error is not treated as a warning or error as the result will be passed back to the caller
                debug!(target: LOG_TARGET, "check_health: send_ping {} timed out", peer_node_id);
                return None;
            },
        };

        // Wait for matching pong with rolling timeouts
        loop {
            match tokio::time::timeout(Duration::from_secs(30), liveness_events.recv()).await {
                Ok(Ok(event)) => {
                    if let LivenessEvent::ReceivedPong(pong) = &*event {
                        if pong.node_id == peer_node_id && pong.nonce == nonce {
                            return Some(start.elapsed());
                        }
                    }
                    // Give up after ~60s total waiting
                    if start.elapsed() >= Duration::from_secs(60) {
                        debug!(target: LOG_TARGET, "check_health: waited too long for pong from {}", peer_node_id);
                        return None;
                    }
                },
                Ok(Err(RecvError::Closed)) => {
                    // This error is not treated as a warning or error as the result will be passed back to the caller
                    return None;
                },
                Ok(Err(RecvError::Lagged(_))) => { /* keep waiting within total window */ },
                Err(_) => {
                    // This error is not treated as a warning or error as the result will be passed back to the caller
                    debug!(target: LOG_TARGET, "check_health: recv pong {} timed out", peer_node_id);
                    return None;
                },
            }
        }
    })
}
```
🛠️ Refactor suggestion
Pong wait exits after the first 30s timeout (contrary to “~60s total” intent).
If no events arrive, the Err(_) branch returns immediately after 30s. Keep waiting up to the 60s overall window.
```diff
         loop {
             match tokio::time::timeout(Duration::from_secs(30), liveness_events.recv()).await {
                 Ok(Ok(event)) => {
                     if let LivenessEvent::ReceivedPong(pong) = &*event {
                         if pong.node_id == peer_node_id && pong.nonce == nonce {
                             return Some(start.elapsed());
                         }
                     }
                     // Give up after ~60s total waiting
                     if start.elapsed() >= Duration::from_secs(60) {
                         debug!(target: LOG_TARGET, "check_health: waited too long for pong from {}", peer_node_id);
                         return None;
                     }
                 },
                 Ok(Err(RecvError::Closed)) => {
                     // This error is not treated as a warning or error as the result will be passed back to the caller
                     return None;
                 },
-                Ok(Err(RecvError::Lagged(_))) => { /* keep waiting within total window */ },
-                Err(_) => {
-                    // This error is not treated as a warning or error as the result will be passed back to the caller
-                    debug!(target: LOG_TARGET, "check_health: recv pong {} timed out", peer_node_id);
-                    return None;
-                },
+                Ok(Err(RecvError::Lagged(_))) => {
+                    // keep waiting within total window
+                    if start.elapsed() >= Duration::from_secs(60) {
+                        debug!(target: LOG_TARGET, "check_health: waited too long for pong from {}", peer_node_id);
+                        return None;
+                    }
+                },
+                Err(_) => {
+                    // per-recv timeout; continue until overall deadline
+                    if start.elapsed() >= Duration::from_secs(60) {
+                        debug!(target: LOG_TARGET, "check_health: recv pong {} timed out (overall)", peer_node_id);
+                        return None;
+                    }
+                },
             }
         }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
fn spawn_ping(
    comms: ConnectivityRequester,
    mut liveness: LivenessHandle,
    mut liveness_events: broadcast::Receiver<Arc<LivenessEvent>>,
    peer_node_id: NodeId,
) -> JoinHandle<Option<std::time::Duration>> {
    task::spawn(async move {
        // Ensure we have a connection first
        match tokio::time::timeout(Duration::from_secs(60), comms.dial_peer(peer_node_id.clone())).await {
            Ok(Ok(_)) => {},
            Ok(Err(err)) => {
                // This error is not treated as a warning or error as the result will be passed back to the caller
                debug!(target: LOG_TARGET, "check_health: dial {} failed: {}", peer_node_id, err);
                return None;
            },
            Err(_) => {
                // This error is not treated as a warning or error as the result will be passed back to the caller
                debug!(target: LOG_TARGET, "check_health: dial {} timed out", peer_node_id);
                return None;
            },
        }
        // Now send the ping
        let start = Instant::now();
        let nonce = match tokio::time::timeout(Duration::from_secs(15), liveness.send_ping(peer_node_id.clone())).await
        {
            Ok(Ok(n)) => n,
            Ok(Err(err)) => {
                // This error is not treated as a warning or error as the result will be passed back to the caller
                debug!(target: LOG_TARGET, "check_health: send_ping {} failed: {}", peer_node_id, err);
                return None;
            },
            Err(_) => {
                // This error is not treated as a warning or error as the result will be passed back to the caller
                debug!(target: LOG_TARGET, "check_health: send_ping {} timed out", peer_node_id);
                return None;
            },
        };
        // Wait for matching pong with rolling timeouts
        loop {
            match tokio::time::timeout(Duration::from_secs(30), liveness_events.recv()).await {
                Ok(Ok(event)) => {
                    if let LivenessEvent::ReceivedPong(pong) = &*event {
                        if pong.node_id == peer_node_id && pong.nonce == nonce {
                            return Some(start.elapsed());
                        }
                    }
                    // Give up after ~60s total waiting
                    if start.elapsed() >= Duration::from_secs(60) {
                        debug!(target: LOG_TARGET, "check_health: waited too long for pong from {}", peer_node_id);
                        return None;
                    }
                },
                Ok(Err(RecvError::Closed)) => {
                    // This error is not treated as a warning or error as the result will be passed back to the caller
                    return None;
                },
                Ok(Err(RecvError::Lagged(_))) => {
                    // keep waiting within total window
                    if start.elapsed() >= Duration::from_secs(60) {
                        debug!(target: LOG_TARGET, "check_health: waited too long for pong from {}", peer_node_id);
                        return None;
                    }
                },
                Err(_) => {
                    // per-recv timeout; continue until overall deadline
                    if start.elapsed() >= Duration::from_secs(60) {
                        debug!(target: LOG_TARGET, "check_health: recv pong {} timed out (overall)", peer_node_id);
                        return None;
                    }
                },
            }
        }
    })
}
```
🤖 Prompt for AI Agents
In base_layer/core/src/base_node/tari_pulse_service/mod.rs around lines 434-501,
the loop currently returns None immediately when the per-iteration timeout (the
Err(_) branch of tokio::time::timeout for liveness_events.recv()) fires after
30s, which violates the intended "~60s total" wait; change that Err(_) branch to
check if start.elapsed() >= Duration::from_secs(60) and only then log and return
None, otherwise continue the loop (don’t return) so the code keeps waiting until
the overall ~60s window elapses.
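The policy the prompt above describes (give up only once the overall budget is exhausted, never on the first per-attempt timeout) can be sketched without tokio. This is a simplified simulation, not the service code: the helper names are hypothetical, and the durations mirror the review's 30 s per-attempt and 60 s overall values.

```rust
use std::time::Duration;

// Hypothetical helpers illustrating the intended retry policy: a
// per-attempt recv timeout only aborts the wait once the overall
// budget is spent; earlier per-attempt timeouts just loop again.
fn should_give_up(elapsed: Duration, overall_budget: Duration) -> bool {
    elapsed >= overall_budget
}

fn attempts_until_giving_up(per_attempt: Duration, overall_budget: Duration) -> u32 {
    // Simulate repeated per-attempt timeouts against the overall budget.
    let mut elapsed = Duration::ZERO;
    let mut attempts = 0;
    while !should_give_up(elapsed, overall_budget) {
        elapsed += per_attempt;
        attempts += 1;
    }
    attempts
}
```

With a 30 s per-attempt timeout and a 60 s budget this yields two attempts, whereas the buggy version gave up after the first.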
SWvheerden
left a comment
Can we add a config option to disable this and make it disabled by default
Actionable comments posted: 0
♻️ Duplicate comments (2)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (2)
409-411: Don’t panic if receiver dropped. Gracefully handle closed watch receivers to avoid crashing the health check task. (Same concern raised previously.)

- notify_comms_health.send(inner_result).expect("Channel should be open");
+ if let Err(e) = notify_comms_health.send(inner_result) {
+     warn!(target: LOG_TARGET, "check_health: liveness notify failed: {}", e);
+ }
478-501: Pong wait exits early; honor the full ~60s overall window. On per-iteration timeout (Err(_)), the loop returns immediately after 30s instead of continuing until the 60s cap. (Same concern raised previously.)

- Err(_) => {
-     // This error is not treated as a warning or error as the result will be passed back to the caller
-     debug!(target: LOG_TARGET, "check_health: recv pong {} timed out", peer_node_id);
-     return None;
- },
+ Err(_) => {
+     // per-recv timeout; continue until overall deadline
+     if start.elapsed() >= Duration::from_secs(60) {
+         debug!(target: LOG_TARGET, "check_health: recv pong {} timed out (overall)", peer_node_id);
+         return None;
+     }
+ },
🧹 Nitpick comments (5)
applications/minotari_node/src/config.rs (1)
157-161: Clarify docs: explicitly state how to disable and tighten wording. Make it clear that None disables the health check and fix phrasing.

- /// Interval to check if the seed nodes comms responses are healthy. (Recommended '60 * 10 = 600 s' if you need
- /// this)
+ /// Interval to check if seed peers' comms responses are healthy.
+ /// Set to `None` to disable.
+ /// Recommended: 600 s (10 minutes) if enabled.

base_layer/core/src/base_node/tari_pulse_service/mod.rs (4)
127-129: Avoid hardcoded startup delay; make it a constant or configurable. A fixed 30s sleep can be surprising in different environments.

- tokio::time::sleep(Duration::from_secs(30)).await; // Wait for the node to start up properly
+ const STARTUP_DELAY: Duration = Duration::from_secs(30);
+ tokio::time::sleep(STARTUP_DELAY).await; // Wait for the node to start up properly

Optionally, consider plumbing this via config for tuning.
134-139: Typo in comment. “Health chack” → “Health check”.

- // Health chack interval (optional)
+ // Health check interval (optional)
202-213: DNS fetch timeout may be too aggressive. A 1s DNS TXT query timeout can be flaky on congested networks. Consider 3–5s or making it configurable; you already bound overall cadence via dns_check_interval.
353-371: Cap per-tick concurrency to protect the connection manager. If seed lists get large, spawning unbounded tasks can cause bursts. Consider a small semaphore (e.g., 8–16).
Example sketch (not a drop-in):
let sem = Arc::new(Semaphore::new(16));
for peer in &peers {
    let permit = sem.clone().acquire_owned().await.unwrap();
    handles.push(task::spawn(async move {
        let _p = permit;
        // existing per-peer logic...
    }));
}
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- applications/minotari_node/src/config.rs (3 hunks)
- base_layer/core/src/base_node/tari_pulse_service/mod.rs (8 hunks)
- common/config/presets/c_base_node_c.toml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- common/config/presets/c_base_node_c.toml
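The skipped preset change presumably adds an entry along these lines. This is a hedged sketch: the key name `tari_pulse_health_check` appears elsewhere in this review, but the commented-out, disabled-by-default layout and the 600 s value are assumptions.

```toml
# Hypothetical preset entry for the seed-peer health check.
# Leave commented out (the default) to disable the check;
# uncomment to run it periodically.
#tari_pulse_health_check = 600 # seconds
```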
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.
Learnt from: MCozhusheck
PR: tari-project/tari#7262
File: applications/minotari_node/src/grpc/base_node_grpc_server.rs:533-536
Timestamp: 2025-06-30T06:54:18.170Z
Learning: In the Tari codebase, readiness status reporting is split between two gRPC servers: base_node_grpc_server's GetNetworkState method always reports READY when it can respond (indicating network-level readiness), while readiness_grpc_server handles granular readiness states like database initialization and migration progress via ReadinessStatusHandler.
Learnt from: MCozhusheck
PR: tari-project/tari#7456
File: applications/minotari_node/src/lib.rs:134-145
Timestamp: 2025-09-01T07:22:09.742Z
Learning: ReadinessGrpcServer in the Tari codebase implements the tari_rpc::base_node_server::BaseNode trait, providing a readiness-focused implementation that returns "unavailable" status for most methods while properly handling get_network_state to report readiness status during node startup.
Learnt from: hansieodendaal
PR: tari-project/tari#6963
File: comms/core/src/peer_manager/manager.rs:60-68
Timestamp: 2025-05-26T02:40:23.812Z
Learning: PeerDatabaseSql in the Tari codebase has been specifically refactored to handle concurrent access and mitigate blocking I/O concerns on async executor threads. The implementation has been tested under high load at both system level and through unit tests like test_concurrent_add_or_update_and_get_closest_peers which validates concurrent read/write operations.
📚 Learning: 2025-08-11T10:49:01.479Z
Learnt from: hansieodendaal
PR: tari-project/tari#7401
File: base_layer/core/src/base_node/sync/config.rs:57-60
Timestamp: 2025-08-11T10:49:01.479Z
Learning: In the Tari codebase, the BlockchainSyncConfig struct works correctly without #[serde(default)] attributes on new fields when used within BaseNodeStateMachineConfig. The configuration loading mechanism or usage patterns ensure that missing fields don't cause deserialization failures in practice.
Applied to files:
applications/minotari_node/src/config.rs
📚 Learning: 2025-07-12T03:43:22.545Z
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.
Applied to files:
base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:13:37.206Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.206Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
Applied to files:
base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-04-16T07:06:53.981Z
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Applied to files:
base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Applied to files:
base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Applied to files:
base_layer/core/src/base_node/tari_pulse_service/mod.rs
🧬 Code graph analysis (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (4)
comms/core/src/builder/comms_node.rs (2)
  connectivity (257-259), connectivity (325-327)
base_layer/p2p/src/peer_seeds.rs (1)
  new (107-109)
comms/core/src/connectivity/requester.rs (1)
  peers (170-173)
comms/dht/examples/memory_net/utilities.rs (1)
  discovery (134-192)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: Cucumber tests / Base Layer
- GitHub Check: Cucumber tests / FFI
- GitHub Check: wasm build tests
- GitHub Check: test (testnet, esmeralda)
- GitHub Check: ledger build tests
- GitHub Check: test (nextnet, nextnet)
- GitHub Check: test (mainnet, stagenet)
- GitHub Check: ci
- GitHub Check: cargo check with stable
🔇 Additional comments (6)
applications/minotari_node/src/config.rs (2)
34-35: Serializer import for optional durations: LGTM. Using serializers::optional_seconds is the right choice for Option in config.
224-225: Backward-compatible default: LGTM. Defaulting tari_pulse_health_check to None avoids breaking existing configs and matches the new optional semantics.
base_layer/core/src/base_node/tari_pulse_service/mod.rs (4)
52-57: Optional liveness interval: LGTM. Making liveness_interval an Option cleanly disables health checks when unset.
147-166: Single-flight guard for health checks: LGTM. Using a try_lock on Arc<Mutex<()>> to skip overlapping runs is appropriate.
507-543: Disconnect parameters: LGTM. disconnect_if_unused(Minimized::No, 0, 2, ...) matches the health-check’s 0 RPC sessions and 2 substreams design.
545-552: tick_optional helper: LGTM. A neat way to disable the select branch without extra branching in the loop.
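The single-flight guard noted in the comments above can be illustrated with std-only types. Only the try_lock-on-Arc<Mutex<()>> pattern is taken from the review; the `run_if_idle` helper and its signature are hypothetical.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical sketch of the single-flight pattern: a new health-check
// run is skipped if the previous one is still holding the guard.
fn run_if_idle(guard: &Arc<Mutex<()>>, mut health_check: impl FnMut()) -> bool {
    match guard.try_lock() {
        Ok(_held) => {
            health_check(); // runs while the lock is held
            true
        }
        Err(_) => false, // a previous run is still in flight; skip this tick
    }
}
```

The guard is released when `_held` goes out of scope, so the next tick proceeds normally once a run completes.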
Actionable comments posted: 1
♻️ Duplicate comments (2)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (2)
451-453: Don’t panic if receiver dropped (duplicate of prior review). Handle watch send errors gracefully to avoid crashing the health task.

- notify_comms_health.send(inner_result).expect("Channel should be open");
+ if let Err(e) = notify_comms_health.send(inner_result) {
+     warn!(target: LOG_TARGET, "check_health: liveness notify failed: {}", e);
+ }
521-544: Pong wait exits after first 30s timeout; keep waiting up to 60s total (duplicate of prior review). Currently returns None on the first per-recv timeout; violates the “~60s total” intent and may miss late pongs.

- Ok(Err(RecvError::Lagged(_))) => { /* keep waiting within total window */ },
- Err(_) => {
-     // This error is not treated as a warning or error as the result will be passed back to the caller
-     debug!(target: LOG_TARGET, "check_health: recv pong {} timed out", peer_node_id);
-     return None;
- },
+ Ok(Err(RecvError::Lagged(_))) => {
+     if start.elapsed() >= Duration::from_secs(60) {
+         debug!(target: LOG_TARGET, "check_health: waited too long for pong from {}", peer_node_id);
+         return None;
+     }
+ },
+ Err(_) => {
+     // per-recv timeout; continue until overall deadline
+     if start.elapsed() >= Duration::from_secs(60) {
+         debug!(target: LOG_TARGET, "check_health: recv pong {} timed out (overall)", peer_node_id);
+         return None;
+     }
+ },
🧹 Nitpick comments (6)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (6)
137-143: Nit: fix typo in comment. “Health chack interval” → “Health check interval”.

- // Health chack interval (optional)
+ // Health check interval (optional)
380-381: Prefer Mutex over RwLock for push-only writes. This path only performs writes (push) and a single final read; RwLock adds overhead without benefit.

- let results = Arc::new(RwLock::new(Vec::new()));
+ let results = Arc::new(Mutex::new(Vec::new()));
@@
- (*result_clone).write().await.push(result);
+ result_clone.lock().await.push(result);

Also drop RwLock from the imports if unused.

Also applies to: 446-447
439-445: Optional: log latencies in ms for readability and alignment with gRPC. Formatting Option<Duration> with {:?} is noisy. Consider logging ms integers.

- debug!(
-     target: LOG_TARGET,
-     "check_health: peer {} discovery: {:.2?}, ping: {:.2?}",
-     result.peer,
-     result.discovery_latency,
-     result.ping_latency
- );
+ let d_ms = result.discovery_latency.map(|d| d.as_millis() as u64);
+ let p_ms = result.ping_latency.map(|d| d.as_millis() as u64);
+ debug!(target: LOG_TARGET,
+     "check_health: peer {} discovery_ms: {:?}, ping_ms: {:?}",
+     result.peer, d_ms, p_ms);
462-465: Unnecessary into() on CommsPublicKey. NodeDestination::PublicKey already takes the same key type.

- discovery.discover_peer(dest_key.clone(), NodeDestination::PublicKey(dest_key.clone().into())),
+ discovery.discover_peer(dest_key.clone(), NodeDestination::PublicKey(dest_key.clone())),
488-500: Double timeouts around discover/dial — verify intended behavior.
discover_peer/dial_peer may have built-in timeouts. Wrapping them can obscure underlying errors or trigger earlier-than-expected timeouts. Would you confirm the current internal timeouts of these APIs and whether the added 90s/60s outer bounds are stricter and required for this interlock? If redundant, consider relying on the inner timeout and only applying an overall per-peer budget at the orchestration layer.
Also applies to: 461-465
409-414: Avoid extra task::spawn; use async helpers + join! for lower overhead. You already spawn one task per peer; spawning two more per peer is unnecessary.

- let discovery_handle = spawn_discovery(discovery, dest_key);
- let ping_handle = spawn_ping(comms_clone.clone(), liveness, liveness_events, result.peer.clone());
- // Await both inner tasks
- let (discovery_result, ping_result) = tokio::join!(discovery_handle, ping_handle);
+ let (discovery_result, ping_result) = tokio::join!(
+     do_discovery(discovery, dest_key),
+     do_ping(comms_clone.clone(), liveness, liveness_events, result.peer.clone())
+ );

Add helpers (replace spawn_* with these):

async fn do_discovery(
    mut discovery: DhtDiscoveryRequester,
    dest_key: CommsPublicKey,
) -> Option<Duration> {
    /* same body as before, without outer task::spawn */
}

async fn do_ping(
    comms: ConnectivityRequester,
    mut liveness: LivenessHandle,
    mut liveness_events: broadcast::Receiver<Arc<LivenessEvent>>,
    peer_node_id: NodeId,
) -> Option<Duration> {
    /* same body as before, without outer task::spawn */
}

Also applies to: 455-478, 480-547
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- applications/minotari_node/src/config.rs (3 hunks)
- applications/minotari_node/src/grpc/base_node_grpc_server.rs (1 hunks)
- base_layer/core/src/base_node/tari_pulse_service/mod.rs (9 hunks)
- common/config/presets/c_base_node_c.toml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- common/config/presets/c_base_node_c.toml
🚧 Files skipped from review as they are similar to previous changes (2)
- applications/minotari_node/src/config.rs
- applications/minotari_node/src/grpc/base_node_grpc_server.rs
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.
📚 Learning: 2025-07-12T03:43:22.545Z
Learnt from: hansieodendaal
PR: tari-project/tari#7307
File: comms/core/src/connection_manager/peer_connection.rs:356-357
Timestamp: 2025-07-12T03:43:22.545Z
Learning: In base_layer/core/src/base_node/tari_pulse_service/mod.rs, the disconnect_if_unused call uses parameters (0, 2) because the tari_pulse_service creates exactly 2 substreams and 0 RPC sessions during health checks.
Applied to files:
base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:13:37.206Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.206Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.
Applied to files:
base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-04-16T07:06:53.981Z
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Applied to files:
base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.
Applied to files:
base_layer/core/src/base_node/tari_pulse_service/mod.rs
📚 Learning: 2025-07-09T08:33:29.320Z
Learnt from: hansieodendaal
PR: tari-project/tari#7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.
Applied to files:
base_layer/core/src/base_node/tari_pulse_service/mod.rs
🧬 Code graph analysis (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (4)
base_layer/p2p/src/services/liveness/service.rs (2)
  connectivity (346-355), connectivity (359-368)
comms/core/src/builder/comms_node.rs (2)
  connectivity (257-259), connectivity (325-327)
base_layer/p2p/src/peer_seeds.rs (1)
  new (107-109)
comms/core/src/connectivity/requester.rs (1)
  peers (170-173)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: test (nextnet, nextnet)
- GitHub Check: test (mainnet, stagenet)
- GitHub Check: test (testnet, esmeralda)
- GitHub Check: file licenses
- GitHub Check: wasm build tests
- GitHub Check: cargo check with stable
- GitHub Check: ci
- GitHub Check: ledger build tests
- GitHub Check: Cucumber tests / FFI
- GitHub Check: Cucumber tests / Base Layer
🔇 Additional comments (2)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (2)
553-555: Disconnect parameters match expected usage. LGTM.
disconnect_if_unused(Minimized::No, 0, 2, ...) aligns with the health check opening 0 RPC sessions and 2 substreams. If this changes in future (e.g., more substreams), update the counts to avoid disconnecting active peers.
320-334: Confirm consumers updated to Option
- Found one call site: applications/minotari_node/src/bootstrap.rs calling TariPulseServiceInitializer::new(base_node_config.tari_pulse_interval, base_node_config.tari_pulse_health_check). Verify those two fields are now Option in the config struct; if they are still Duration/bool, update the call site to pass Option (e.g., Some(...)) or adjust the config to match the new API.
- Also ensure any other callers (none found by the search) are updated.
// Wait for the node comms to start up properly
if let Ok(false) = self.wait_for_first_liveness_event().await {
    return;
}
Don’t block startup indefinitely waiting for a liveness event. Add an overall timeout.
If no liveness events are emitted (e.g., liveness service not started/quiet), run() can stall forever. Bound the wait and proceed.
Apply this diff:
- // Wait for the node comms to start up properly
- if let Ok(false) = self.wait_for_first_liveness_event().await {
- return;
- }
+ // Wait for the node comms to start up properly (bounded)
+ match tokio::time::timeout(Duration::from_secs(60), self.wait_for_first_liveness_event()).await {
+ Ok(Ok(true)) => {},
+ Ok(Ok(false)) => return, // shutdown requested
+ Ok(Err(e)) => {
+ warn!(target: LOG_TARGET, "wait_for_first_liveness_event error: {}", e);
+ },
+ Err(_) => {
+ info!(target: LOG_TARGET, "Timeout waiting for first liveness event; continuing startup");
+ },
+ }🤖 Prompt for AI Agents
In base_layer/core/src/base_node/tari_pulse_service/mod.rs around lines 127 to
131, the current await on wait_for_first_liveness_event() can block forever;
wrap that await in a timeout (e.g., tokio::time::timeout with a sensible
duration like a few seconds), then match the result: if timeout occurs or the
call returns Err or Ok(false), log a warning and proceed rather than
returning/blocked; if it returns Ok(true) continue as before. Ensure the timeout
future is awaited correctly and any errors are handled and logged.
* development: fix: always cancel transactions (tari-project#7500) feat: throttle background rebuild tasks (tari-project#7502) chore: clearer logic for is_micro_tari (tari-project#7499) feat: add sqlite concurrency and pool hardening (tari-project#7492) feat: add timeouts to health check (tari-project#7482) feat: performance improvement of wallet output (tari-project#7490) fix: discovery syncing regression bug (tari-project#7488) fix: update peer database after re-initializing node identity config file (tari-project#7497) fix: peer_db null update where none should be written as null (tari-project#7489) fix: filter node identity onion address when running TCP node (tari-project#7486) chore: refactor multi sig and offline sign (tari-project#7487) feat: add http cache config (tari-project#7477)
Description
Changed the gRPC liveness_results in the GetNetworkState method to return ms instead of s and documented it as such; this was previously undocumented. When the network is well connected, discovery and ping latencies would be less than 1 s, resulting in zero values being returned.

Fixes #7380.
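A minimal sketch of the unit conversion described above: sub-second latencies survive as non-zero values once reported in whole milliseconds. The helper name `latency_ms` is hypothetical; the u64 conversion with a fallback follows the walkthrough's "u64 conversion/fallback" description.

```rust
use std::convert::TryFrom;
use std::time::Duration;

// Hypothetical helper mirroring the gRPC change: report a latency as
// whole milliseconds in a u64, saturating if the value cannot fit.
fn latency_ms(latency: Duration) -> u64 {
    u64::try_from(latency.as_millis()).unwrap_or(u64::MAX)
}
```

A 750 ms ping now reports 750 rather than the 0 that whole-second rounding produced.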
Motivation and Context
See #7380.
How Has This Been Tested?
(c::bn::tari_pulse logged at least trace level)
In the GetNetworkState method, liveness_results should not be an empty array if seed peers are contactable.

What process can a PR reviewer use to test or verify this change?
Code review.
System-level testing (to verify in the log files, change c::bn::tari_pulse to at least debug level).
BREAKING CHANGE: Changed the gRPC liveness_results in the GetNetworkState method to return ms instead of s and documented it as such; this was previously undocumented.
New Features
Bug Fixes
Refactor
Documentation