feat: peer health check #6951
Walkthrough: This change introduces a new liveness health check mechanism to the Tari Pulse service, enabling the system to periodically measure and report discovery and dial latencies for seed peers. The configuration and service initialization are updated to support separate intervals for DNS checkpoint checks and liveness checks. The results of these liveness checks are made available via a new watch channel and are included in the gRPC `GetNetworkStateResponse`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant BaseNodeGrpcServer
    participant TariPulseService
    participant ConnectivityManager
    participant PeerManager
    Client->>BaseNodeGrpcServer: get_network_state()
    BaseNodeGrpcServer->>TariPulseService: get_liveness_checks()
    TariPulseService->>ConnectivityManager: get_seeds()
    ConnectivityManager->>PeerManager: get_seed_peers()
    PeerManager-->>ConnectivityManager: seed peers list
    ConnectivityManager-->>TariPulseService: seed peers list
    TariPulseService->>TariPulseService: measure discovery/dial latencies
    TariPulseService-->>BaseNodeGrpcServer: liveness results
    BaseNodeGrpcServer-->>Client: GetNetworkStateResponse (with liveness_results)
```
Actionable comments posted: 1
🧹 Nitpick comments (3)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (3)
57-63: Consider storing additional failure information

Currently, liveness checks store only latencies. If an operation fails, there's no dedicated place for error details. Including an optional field for error messages or result codes might help diagnose connectivity issues.
138-156: Lock-based concurrency

Skipping the entire tick if the lock is already held might cause health checks to be repeatedly missed under high load. Evaluate if queuing or a limited concurrency approach is more appropriate.
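To make the tick-skipping behaviour discussed here concrete, a minimal std-only sketch (the function name and counter are invented for illustration; the service itself uses async tasks and timers):

```rust
use std::sync::{Arc, Mutex};

// Run the check only if no previous check still holds the guard;
// otherwise skip this tick entirely, as the service does.
fn run_if_idle(guard: &Arc<Mutex<()>>, runs: &mut u32) -> bool {
    match guard.try_lock() {
        Ok(_held) => {
            *runs += 1; // stand-in for the actual health-check work
            true
        }
        // Lock already held: a previous check is still in flight.
        Err(_) => false,
    }
}
```

The concern above is visible in this sketch: every skipped tick is work that is simply dropped, not deferred.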
318-361: Parallel checks and debug statements

- Unrestricted parallel tasks: Spawning a separate task for each seed peer could become expensive if the seed list is large. Consider adding concurrency limits or pooling to avoid resource contention.
- `dbg!` usage: Relying on `dbg!` is typically discouraged in production. Consider using the `log` macros (`debug!`, `warn!`, etc.) for consistency and configurability.
- More detailed error recording: You may store or record the specific failure reasons (discovery error, dial timeout, etc.) in `LivenessCheckResult` for improved diagnostics.

Example diff for removing `dbg!`:

```diff
- dbg!("failed discovery");
+ debug!(target: LOG_TARGET, "Failed to discover peer");
- dbg!("failed dial");
+ debug!(target: LOG_TARGET, "Failed to dial peer");
```
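As a rough illustration of the bounded-concurrency suggestion above, a std-threads-only sketch (the function name, batch size, and use of OS threads are all assumptions; the service would use async tasks):

```rust
use std::thread;

// Process items with at most `max_concurrent` workers at a time:
// spawn one batch, join it fully, then start the next batch.
fn run_in_batches<T, R>(items: Vec<T>, max_concurrent: usize, f: fn(T) -> R) -> Vec<R>
where
    T: Send + 'static,
    R: Send + 'static,
{
    assert!(max_concurrent > 0);
    let mut results = Vec::new();
    let mut iter = items.into_iter().peekable();
    while iter.peek().is_some() {
        let handles: Vec<_> = iter
            .by_ref()
            .take(max_concurrent)
            .map(|item| thread::spawn(move || f(item)))
            .collect();
        for handle in handles {
            results.push(handle.join().expect("worker panicked"));
        }
    }
    results
}
```

With a large seed list this caps peak resource use at `max_concurrent` in-flight checks instead of one task per peer.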
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- `applications/minotari_app_grpc/proto/base_node.proto` (1 hunks)
- `applications/minotari_node/src/bootstrap.rs` (1 hunks)
- `applications/minotari_node/src/config.rs` (2 hunks)
- `applications/minotari_node/src/grpc/base_node_grpc_server.rs` (2 hunks)
- `base_layer/core/src/base_node/tari_pulse_service/mod.rs` (8 hunks)
- `base_layer/core/src/blocks/pre_mine/mod.rs` (1 hunks)
- `common/config/presets/c_base_node_c.toml` (1 hunks)
- `comms/core/src/connection_manager/listener.rs` (1 hunks)
- `comms/core/src/connectivity/manager.rs` (1 hunks)
- `comms/core/src/connectivity/requester.rs` (2 hunks)
- `comms/core/src/peer_manager/manager.rs` (1 hunks)
- `comms/core/src/peer_manager/peer_storage.rs` (2 hunks)
- `comms/core/src/test_utils/mocks/connectivity_manager.rs` (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
comms/core/src/peer_manager/manager.rs (1)
comms/core/src/peer_manager/peer_storage.rs (1)
`get_seed_peers` (347-352)
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: test (nextnet, nextnet)
- GitHub Check: test (mainnet, stagenet)
- GitHub Check: test (testnet, esmeralda)
- GitHub Check: cargo check with stable
- GitHub Check: ci
- GitHub Check: Cucumber tests / Base Layer
- GitHub Check: Cucumber tests / FFI
🔇 Additional comments (31)
base_layer/core/src/blocks/pre_mine/mod.rs (1)
1359-1359: Marking the test as ignored is appropriate.

The use of `#[ignore]` on this async test is standard practice for tests that are slow, resource-intensive, or require special conditions. No issues found.

common/config/presets/c_base_node_c.toml (1)
51-52: Clean and well-documented configuration addition.

The new `tari_pulse_health_check` configuration parameter is well-placed next to the existing `tari_pulse_interval` parameter, with a clear description and a reasonable default value (10 minutes). This aligns with the PR's objective of implementing a peer health check mechanism.

comms/core/src/connection_manager/listener.rs (1)
290-296: Appropriate log level adjustment.

Changing the log level from `warn` to `debug` for invalid wire format bytes makes sense, as these are often not critical issues that require warning-level visibility. This will help reduce noise in the logs while maintaining the information for debugging purposes.

applications/minotari_node/src/bootstrap.rs (1)
173-177: Proper integration of the new health check parameter.

The addition of `base_node_config.tari_pulse_health_check` to the `TariPulseServiceInitializer::new` constructor correctly integrates the new configuration parameter with the service initializer, enabling the periodic discovery and dialing of seed peers as part of the new health check mechanism.

comms/core/src/peer_manager/manager.rs (1)
122-125: Clean implementation of seed peer retrieval function.

The new `get_seed_peers()` method follows the established pattern in the codebase for getting filtered lists of peers. The implementation correctly uses a read lock on the peer storage and delegates to the underlying storage implementation.

comms/core/src/connectivity/manager.rs (1)
335-341: Robust seed peer retrieval with appropriate error handling.

The implementation handles errors gracefully by returning an empty vector when the peer manager fails to retrieve seed peers, which prevents cascading failures while properly logging the error. This is consistent with the error handling pattern used throughout the file.
applications/minotari_node/src/config.rs (2)
147-150: Health check interval configuration correctly implemented.

The new `tari_pulse_health_check` configuration parameter is properly annotated with the seconds serializer, maintaining consistency with the existing `tari_pulse_interval` field. This allows separate configuration of the DNS checkpoint interval and the new liveness health check interval.
189-189: Reasonable default value for health check interval.

The default value of 10 minutes (600 seconds) for the health check interval provides a good balance: frequent enough to detect network issues in a timely manner, but not so frequent as to generate excessive network traffic.
applications/minotari_app_grpc/proto/base_node.proto (2)
545-547: Good addition of liveness results to network state response.

Adding the liveness check results to the existing GetNetworkStateResponse is a sensible approach. This avoids creating a new endpoint just for liveness information and allows clients to get a complete view of the network state, including connectivity health, in a single request.
549-556: Well-structured LivenessResult message definition.

The LivenessResult message contains exactly the information needed to assess peer connectivity health:
- The peer's node ID for identification
- Discovery latency for DNS resolution performance
- Dial latency for connection establishment performance
This captures the key metrics for monitoring and diagnosing network connectivity issues.
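On the Rust side, the review describes the corresponding result type as a peer ID plus two optional latencies. A hypothetical sketch of that shape (field types here are stand-ins; the PR's actual struct is `LivenessCheckResult` with a real `NodeId`):

```rust
use std::time::Duration;

// Illustrative analogue of the liveness result described above.
// `None` in a latency field means that operation failed or timed out.
#[derive(Clone, Debug)]
struct LivenessCheckResult {
    peer: u64,                           // stand-in for the peer's NodeId
    discovery_latency: Option<Duration>, // time to discover the peer
    dial_latency: Option<Duration>,      // time to establish a connection
}
```

Using `Option<Duration>` rather than a sentinel value keeps "failed" distinct from "very slow"; the gRPC layer then maps the missing case explicitly (the review notes `u64::MAX` is used there).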
applications/minotari_node/src/grpc/base_node_grpc_server.rs (2)
465-474: Appropriate handling of liveness checks for the GetNetworkStateResponse

The code correctly fetches liveness check results from the TariPulse service and transforms them into the format required by the gRPC response. A nice touch is handling the potential missing latency measurements with a default value of `u64::MAX`.
485-485: Liveness results are properly included in the network state response

This correctly adds the liveness results to the GetNetworkStateResponse struct, ensuring the health check data will be accessible to API clients.
comms/core/src/connectivity/requester.rs (2)
107-107: Appropriate addition of GetSeeds variant

The new GetSeeds variant of the ConnectivityRequest enum follows the existing pattern, with a oneshot channel for response delivery.
301-308: Well-implemented get_seeds method

The `get_seeds` method follows the established pattern for async communication with the connectivity manager. It properly handles error cases by mapping send errors to `ActorDisconnected` and receive errors to `ActorResponseCancelled`, consistent with other methods in this struct.
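For readers unfamiliar with this pattern, here is a simplified synchronous analogue of the request/one-shot-reply flow that `get_seeds` follows (all names and types are invented for illustration; the real code uses async channels and dedicated error variants):

```rust
use std::sync::mpsc;
use std::thread;

// Each request carries its own one-shot reply channel.
enum Request {
    GetSeeds(mpsc::Sender<Vec<String>>),
}

fn spawn_manager(seeds: Vec<String>) -> mpsc::Sender<Request> {
    let (tx, rx) = mpsc::channel::<Request>();
    thread::spawn(move || {
        for req in rx {
            match req {
                Request::GetSeeds(reply) => {
                    // The requester may have gone away; ignore send errors,
                    // mirroring the ActorResponseCancelled case.
                    let _ = reply.send(seeds.clone());
                }
            }
        }
    });
    tx
}

fn get_seeds(manager: &mpsc::Sender<Request>) -> Option<Vec<String>> {
    let (reply_tx, reply_rx) = mpsc::channel();
    // A send failure here corresponds to the ActorDisconnected error.
    manager.send(Request::GetSeeds(reply_tx)).ok()?;
    reply_rx.recv().ok()
}
```

The two failure points (sending the request, awaiting the reply) map cleanly onto the two error variants the review mentions.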
comms/core/src/peer_manager/peer_storage.rs (2)
347-352: Clear implementation of get_seed_peers

The `get_seed_peers` method efficiently filters the peer database for seed peers, consistent with the other filtering methods in this module. It follows the standard result pattern, wrapping database errors properly.
948-976: Comprehensive test for get_seed_peers

This test thoroughly validates the get_seed_peers functionality by:
- Creating a mix of seed and non-seed peers
- Verifying the correct number of seed peers is returned
- Confirming each returned peer is actually a seed peer
The test is well-structured and covers the essential functionality.
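The behaviour under test can be sketched as a plain filter (an illustrative stand-in with invented types; the real `get_seed_peers` queries the peer database under a read lock):

```rust
// Hypothetical in-memory stand-in for a stored peer record.
#[derive(Clone, Debug, PartialEq)]
struct Peer {
    node_id: u64,
    is_seed: bool,
}

// Keep only the peers flagged as seeds.
fn get_seed_peers(peers: &[Peer]) -> Vec<Peer> {
    peers.iter().filter(|p| p.is_seed).cloned().collect()
}
```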
base_layer/core/src/base_node/tari_pulse_service/mod.rs (15)
22-26: No major concerns

These added imports appear correctly used for concurrency (`Arc`), time measurement (`Instant`), and range checking (`cmp::min`).

32-32: Import usage

The introduction of `ConnectivityRequester` and `NodeId` is consistent with the liveness health check feature.

67-68: Verify default intervals

The default of 120 seconds for DNS checks and 600 seconds for liveness checks may or may not be optimal for your use case. Please confirm these intervals align with operational expectations.

Would you like to run tests in different scenarios or search external references to confirm these intervals?

89-90: Connectivity and discovery handles

Adding `node_comms` and `node_discovery` allows you to perform peer connectivity and discovery within the service.

94-99: Service constructor looks good

The constructor seamlessly integrates the new fields and configuration parameters.

105-106: Field assignment

Assigning `node_comms` and `node_discovery` here cleanly maintains the service state.

120-121: New watch sender

Introducing a `notify_comms_health` channel to broadcast liveness results provides a robust mechanism for asynchronous notifications.

122-129: Separate interval tasks

Using two separate timers and pinning them with `MissedTickBehavior::Delay` helps avoid overlapping scheduling for DNS checks and health checks.

134-134: `Arc<Mutex<()>>` concurrency guard

This mutex-based approach effectively blocks overlapping checks, but consider monitoring for unexpected lock contention or ensuring tasks complete promptly.

174-174: Check for zero intervals

A zero-valued `dns_check_interval` could cause integer division by zero. Verify that the configuration never provides a zero or near-zero value here.

253-253: Watch channel for liveness results

Publishing the results through a watch channel is a clean design that enables other services or UI components to consume real-time updates.

260-263: Getter usage

Returning a `watch::Ref` from `get_liveness_checks` ensures concurrency-safe reads of the current liveness check results without blocking.
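As a minimal sketch of why a watch-style channel makes such a getter concurrency-safe, a std-only analogue (this is not the tokio `watch` API; names and methods are invented):

```rust
use std::sync::{Arc, RwLock};

// One writer (the service), many cheap cloned readers (the handles);
// every reader always observes the latest published snapshot.
#[derive(Clone)]
struct Watch<T: Clone>(Arc<RwLock<T>>);

impl<T: Clone> Watch<T> {
    fn new(initial: T) -> Self {
        Watch(Arc::new(RwLock::new(initial)))
    }

    // Service side: publish a new snapshot.
    fn send(&self, value: T) {
        *self.0.write().unwrap() = value;
    }

    // Handle side: read the most recent snapshot.
    fn borrow(&self) -> T {
        self.0.read().unwrap().clone()
    }
}
```

The real tokio `watch::Ref` avoids even the clone by holding the read lock for the borrow's lifetime, which is why the getter can be non-blocking and allocation-free.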
267-268: TariPulseServiceInitializer enhancements

Passing in both the DNS and liveness intervals clearly separates the two functionalities. The addition of `network` is consistent with the rest of the service.

Also applies to: 273-278

288-288: Creating watch channels

Pairing a dedicated watch channel for checkpoint failures and another for liveness checks decouples the responsibilities and simplifies consumption.

Also applies to: 291-292, 295-296

302-309: Service setup steps

Acquiring the handles and constructing the `TariPulseService` with the provided requesters is consistent with the service framework pattern.
Test Results (CI): 3 files, 129 suites, 54m 5s ⏱️. Results for commit e814d7a. ♻️ This comment has been updated with latest results.
Test Results (Integration tests): 2 files (+2), 11 suites (+11), 1h 35m 19s ⏱️ (+1h 35m 19s). For more details on these failures, see this check. Results for commit e814d7a. ± Comparison against base commit 2a8ea56. ♻️ This comment has been updated with latest results.
Actionable comments posted: 1
🧹 Nitpick comments (4)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (4)
319-356: Simplify pattern matching in error handling

The error handling pattern in the health check can be simplified for better readability.
Consider these changes:
```diff
- if let Ok(_) = discovery
-     .discover_peer(dest_key.clone(), NodeDestination::PublicKey(dest_key.into()))
-     .await
- {
+ if discovery
+     .discover_peer(dest_key.clone(), NodeDestination::PublicKey(dest_key.into()))
+     .await
+     .is_ok()
+ {
      result.discovery_latency = Some(start.elapsed());
  }

- if let Ok(_) = comms.dial_peer(result.peer.clone()).await {
+ if comms.dial_peer(result.peer.clone()).await.is_ok() {
      result.dial_latency = Some(start2.elapsed());
  }
```

Also, line 354 has some unnecessary dereferencing:

```diff
- let inner_result = (*(*results).read().await).clone();
+ let inner_result = results.read().await.clone();
```
339-344: Consider logging discovery failures

The code silently ignores discovery failures without logging them, which could make troubleshooting difficult.

Consider adding debug logging when discovery fails:

```diff
  if let Ok(_) = discovery
      .discover_peer(dest_key.clone(), NodeDestination::PublicKey(dest_key.into()))
      .await
  {
      result.discovery_latency = Some(start.elapsed());
+ } else {
+     debug!(
+         target: LOG_TARGET,
+         "Failed to discover peer: {}", result.peer
+     );
  }
```
346-349: Consider logging dial failures

Similar to discovery failures, dial failures are not logged, which could make troubleshooting difficult.

Consider adding debug logging when dialing fails:

```diff
  if let Ok(_) = comms.dial_peer(result.peer.clone()).await {
      result.dial_latency = Some(start2.elapsed());
+ } else {
+     debug!(
+         target: LOG_TARGET,
+         "Failed to dial peer: {}", result.peer
+     );
  }
```
324-325: Consider handling empty seed list

The current implementation handles the case where `get_seeds()` fails by defaulting to an empty vector, but doesn't log this scenario.

Consider logging when no seeds are available:

```diff
  let results = Arc::new(RwLock::new(Vec::new()));
- let peers = node_comms.get_seeds().await.unwrap_or_else(|_| vec![]);
+ let peers = match node_comms.get_seeds().await {
+     Ok(peers) => {
+         if peers.is_empty() {
+             debug!(target: LOG_TARGET, "No seed peers available for health check");
+         }
+         peers
+     },
+     Err(e) => {
+         warn!(target: LOG_TARGET, "Failed to get seed peers: {}", e);
+         vec![]
+     }
+ };
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `base_layer/core/src/base_node/tari_pulse_service/mod.rs` (8 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (7)
base_layer/p2p/src/initialization.rs (2)
- `future` (514-536)
- `new` (460-476)

comms/core/src/connectivity/requester.rs (2)
- `new` (120-122)
- `peers` (170-173)

base_layer/core/src/base_node/service/initializer.rs (4)
- `new` (79-97)
- `handles` (190-190), `handles` (191-191), `handles` (194-194)

base_layer/core/src/base_node/comms_interface/inbound_handlers.rs (2)
- `new` (99-118)
- `clone` (1030-1041)

comms/dht/examples/memory_net/utilities.rs (1)
- `discovery` (133-191)

applications/minotari_node/src/bootstrap.rs (6)
- `handles` (181-182), (238-238), (239-239), (254-254), (258-258), (259-259)

base_layer/core/src/base_node/chain_metadata_service/initializer.rs (3)
- `handles` (46-46), (47-47), (48-48)
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: test (mainnet, stagenet)
- GitHub Check: cargo check with stable
- GitHub Check: test (testnet, esmeralda)
- GitHub Check: test (nextnet, nextnet)
- GitHub Check: ci
- GitHub Check: Cucumber tests / FFI
- GitHub Check: Cucumber tests / Base Layer
🔇 Additional comments (14)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (14)
52-53: LGTM: Well-structured configuration split

The separation of DNS and liveness check intervals gives appropriate control over the frequency of these distinct operations, allowing for independent tuning based on their different resource impacts and importance.

57-62: LGTM: Clear and concise result structure

The `LivenessCheckResult` struct effectively captures the peer health metrics with optional duration fields to handle potential failures in either discovery or dialing operations.

67-69: LGTM: Reasonable default intervals

The default intervals (2 minutes for DNS checks and 10 minutes for liveness checks) provide a good balance between timely information and minimizing network overhead.

89-91: LGTM: Appropriate service dependencies

Adding connectivity and discovery requesters as service dependencies properly encapsulates the functionality needed for peer health checks.

94-109: LGTM: Well-structured constructor

The updated constructor properly initializes the service with the necessary components for both DNS checkpoint verification and health checks.

122-129: LGTM: Proper interval configuration

Good implementation of interval timers with appropriate missed tick behavior, ensuring both health check mechanisms operate reliably even under system load.

134-134: LGTM: Mutex for concurrency control

Using a mutex to prevent overlapping health checks is an effective way to avoid resource contention and ensure accurate measurements.

138-156: LGTM: Efficient async health check execution

Spawning the health check in a separate task and using `try_lock` ensures the main service loop remains responsive while preventing overlapping checks.

174-174: LGTM: Updated skip tick calculation

The skip tick calculation now correctly uses the new `dns_check_interval` configuration parameter.

253-253: LGTM: Extended handle for liveness data

Adding the liveness checks to the handle provides a clean API for external components to access health data.

261-263: LGTM: Accessor method for liveness data

Providing a getter method for the liveness checks follows the established pattern in this codebase and maintains encapsulation.

273-279: LGTM: Updated initializer constructor

The constructor properly accepts and stores the separate interval configurations.

287-293: LGTM: Proper handle registration

Setting up separate watch channels and registering the handle with both receivers ensures consumers can access both types of status updates.

302-309: LGTM: Comprehensive service initialization

The service is initialized with all necessary dependencies obtained from the service context, providing a clean dependency injection approach.
Actionable comments posted: 0
♻️ Duplicate comments (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (1)
338-351: Health check implementation respects built-in timeouts.

The health check implementation correctly relies on the built-in timeout mechanisms in the `discover_peer` and `dial_peer` methods, as mentioned in the learnings. This avoids redundant timeout wrappers and potential race conditions.
🧹 Nitpick comments (3)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (3)
325-325: Consider more explicit error handling when retrieving seeds.

The current implementation uses `unwrap_or_else` to handle errors when retrieving seed peers, but it silently falls back to an empty vector.

Consider logging a warning when seed retrieval fails to make troubleshooting easier:

```diff
- let peers = node_comms.get_seeds().await.unwrap_or_else(|_| vec![]);
+ let peers = match node_comms.get_seeds().await {
+     Ok(peers) => peers,
+     Err(err) => {
+         warn!(target: LOG_TARGET, "Failed to retrieve seed peers for health check: {}", err);
+         vec![]
+     }
+ };
```
300-309: Consider refactoring the service creation for better readability.

The service creation and initialization uses a chained pattern of method calls and awaits that affects readability.

For better readability, consider splitting the service creation and initialization:

```diff
- let mut tari_pulse_service =
-     TariPulseService::new(config, node_comms, node_discovery, shutdown_signal.clone())
-         .await
-         .expect("Should be able to get the service");
- let tari_pulse_service = tari_pulse_service.run(base_node_service, dns_sender, liveness_sender);
+ let mut tari_pulse_service = TariPulseService::new(
+     config,
+     node_comms,
+     node_discovery,
+     shutdown_signal.clone(),
+ )
+ .await
+ .expect("Should be able to get the service");
+
+ let tari_pulse_service = tari_pulse_service.run(
+     base_node_service,
+     dns_sender,
+     liveness_sender,
+ );
```
353-355: Consider error handling in result collection.

When collecting and sending the results, the code uses `expect`, which could potentially panic if the channel is closed.

For better error handling, consider using a more graceful approach:

```diff
- notify_comms_health.send(inner_result).expect("Channel should be open");
+ if let Err(err) = notify_comms_health.send(inner_result) {
+     warn!(target: LOG_TARGET, "Failed to send health check results: {:?}", err);
+ }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `base_layer/core/src/base_node/tari_pulse_service/mod.rs` (8 hunks)
🧰 Additional context used
🧠 Learnings (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (2)
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.933Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: Cucumber tests / Base Layer
- GitHub Check: Cucumber tests / FFI
- GitHub Check: test (testnet, esmeralda)
- GitHub Check: cargo check with stable
- GitHub Check: test (mainnet, stagenet)
- GitHub Check: ci
🔇 Additional comments (5)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (5)
51-55: Configuration separation looks good.

The split of the configuration into separate intervals for DNS checkpoint verification and liveness health checks is a clean approach. The default values (2 minutes for DNS checks and 10 minutes for liveness checks) seem reasonable for balancing freshness of data with network overhead.

Also applies to: 67-68
57-62: Clean struct design for health check results.

The `LivenessCheckResult` struct is well-designed, capturing just the essential information needed: the peer ID and optional latency measurements for both discovery and dialing operations.
126-129: Effective implementation of the health check mechanism.

The health check interval setup and execution is well-implemented. Using `MissedTickBehavior::Delay` is appropriate to prevent tick pileup if health checks take longer than expected.

I like the use of a mutex to prevent overlapping health checks, which could otherwise cause resource contention. The error message when skipping a tick due to lock contention is also helpful for debugging.

Also applies to: 138-156
319-356: Well-structured concurrent health check implementation.

The `check_health` function is well-designed for concurrent operation, spawning a task for each peer and collecting results asynchronously. This allows for efficient parallel execution of discovery and dialing operations.
144-153: Good use of non-blocking mutex check.

Using `try_lock` instead of a blocking `lock` is a good approach here, as it prevents a health check from blocking if another one is already in progress. The debug log message provides clarity about why a check might be skipped.
Looking good - some observations:
- I think it will be prudent if we can supply custom base nodes to include in the liveness check alongside the seed peers. Use case: A user runs two base nodes, on the same or different computers. Each base node is configured to report on the other base node. When liveness for each other fails, it is time to restart tor with an empty cache.
- We should swap out `dial-peer` for `ping-peer` in the liveness test. Successful discovery indicates a connected network, whereas a successful ping indicates a healthy connected peer. Both metrics will be useful. Dial always succeeds after discovery, but ping does not always succeed after dial, which is inherent in ping.
- We should think about some ways in which to restart tor, which will be confirmed by the discovery metrics.
You can already add custom "seed" nodes via the config. Swapping over to ping-peer is a good suggestion. But we cannot restart tor from the base node, as in a lot of cases we run tor completely outside of the base node; in the case of the seed nodes, docker runs on a separate instance, and there is no way the base node application can delete files on another docker application.
Actionable comments posted: 0
♻️ Duplicate comments (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (1)
144-163: 💡 Verification agent — 🧩 Analysis chain
Consider synchronizing with review comment suggesting parameter removal
The code implements the health check task properly, but there seems to be a discrepancy with a past review comment.
hansieodendaal suggested removing the `liveness_handle` parameter from the `check_health` function call.
🏁 Script executed:
```shell
#!/bin/bash
# Check if the suggestion to modify the check_health function call has been addressed
echo "Current implementation at line 161:"
grep -n "check_health" base_layer/core/src/base_node/tari_pulse_service/mod.rs | grep -A 1 161
```

Length of output: 301
Action Required: Update the `check_health` Call to Remove `liveness_handle`

The health check implementation is working correctly; however, as noted by the previous review comment from hansieodendaal, the call to `check_health` still passes the `liveness_handle` parameter, which should be removed. Please update the invocation on line 161 (and adjust `check_health`'s signature if needed) so that it no longer includes `liveness_handle`.

Example diff suggestion:

```diff
- check_health(comms, liveness_handle, discovery, notify_channel).await;
+ check_health(comms, discovery, notify_channel).await;
```
🧹 Nitpick comments (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (1)
327-381: Peer health check implementation is robust but could benefit from a few improvements

The health check implementation follows good async practices and properly handles errors. However, consider these improvements:

1. The function spawns a separate task for each seed peer without limiting concurrency, which could become a resource issue if there are many seed peers.
2. The ping response waiting loop doesn't have an explicit timeout; it relies on the liveness service's implementation for timeouts. An explicit timeout might be beneficial.
3. The results vector is cloned before sending, which could be inefficient for a large number of peers.
Consider adding a concurrency limit to prevent resource exhaustion with a large number of seed peers:
```diff
- futures::future::join_all(handles).await;
+ let max_concurrent = 10; // Adjust based on expected peer count
+ for chunk in handles.chunks(max_concurrent) {
+     futures::future::join_all(chunk).await;
+ }
```

Consider adding an explicit timeout for the ping wait loop:

```diff
+ let timeout = tokio::time::timeout(Duration::from_secs(30), async {
      loop {
          match liveness_events.recv().await {
              // ... existing match logic
          }
      }
+ }).await;
+
+ if timeout.is_err() {
+     // Handle timeout case
+ }
```

Consider avoiding the full clone of results:

```diff
- let inner_result = (*(*results).read().await).clone();
- notify_comms_health.send(inner_result).expect("Channel should be open");
+ notify_comms_health.send_modify(|current_results| {
+     current_results.clear();
+     current_results.extend((*(*results).read().await).iter().cloned());
+ });
```
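The explicit-timeout idea can also be sketched with std primitives (the channel type, nonce matching, and function name are assumptions for illustration; the real loop is async over liveness events):

```rust
use std::sync::mpsc;
use std::time::{Duration, Instant};

// Wait for a pong matching `nonce`, giving up once `timeout` elapses
// overall, even if unrelated events keep arriving in the meantime.
fn wait_for_pong(rx: &mpsc::Receiver<u64>, nonce: u64, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            return false; // overall deadline reached
        }
        match rx.recv_timeout(remaining) {
            Ok(n) if n == nonce => return true, // our pong arrived
            Ok(_) => continue,                  // unrelated event: keep waiting
            Err(_) => return false,             // timed out or channel closed
        }
    }
}
```

Tracking a single deadline rather than restarting the timeout per event is the key detail: a stream of unrelated pongs cannot extend the wait indefinitely.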
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- `applications/minotari_app_grpc/proto/base_node.proto` (1 hunks)
- `applications/minotari_node/src/config.rs` (2 hunks)
- `applications/minotari_node/src/grpc/base_node_grpc_server.rs` (2 hunks)
- `base_layer/core/src/base_node/tari_pulse_service/mod.rs` (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- applications/minotari_node/src/config.rs
- applications/minotari_app_grpc/proto/base_node.proto
- applications/minotari_node/src/grpc/base_node_grpc_server.rs
🧰 Additional context used
🧠 Learnings (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (2)
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.933Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: test (mainnet, stagenet)
- GitHub Check: test (nextnet, nextnet)
- GitHub Check: cargo check with stable
- GitHub Check: ci
- GitHub Check: Cucumber tests / FFI
- GitHub Check: Cucumber tests / Base Layer
🔇 Additional comments (5)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (5)
22-47: Well-organized imports

The import organization is clean and properly grouped by crate source, which improves readability.

54-58: Good separation of interval configurations

Splitting the configuration into separate intervals for DNS checks and liveness checks is a good design decision, allowing for independent configuration of these different health check mechanisms.

60-65: Clear result structure for health checks

The `LivenessCheckResult` struct provides a clean way to encapsulate health check metrics with appropriate fields for the peer ID and both types of latency measurements.

256-270: Good extension of the handle to expose liveness check results

The `TariPulseHandle` has been properly extended to include access to liveness check results through a watch channel, with a clean getter method.

273-324: Properly updated initializer to support new health check parameters

The initializer changes support the separate intervals for DNS and liveness checks, and all the necessary handles are properly acquired and passed to the service.
hansieodendaal left a comment
The previous comment, "We should think about some ways in which to restart tor with empty caches, which will be confirmed by the discovery metrics," still holds, even if it is not done from within the base node. This problem should not only be solved for Tari Universe or the seed nodes.
One additional comment below.
```rust
if let Ok(Some(mut conn)) = comms.get_connection(result.peer.clone()).await {
    if let Err(err) = conn.disconnect(Minimized::No).await {
        warn!(target: LOG_TARGET, "Failed to disconnect peer {} ({})", result.peer, err);
    }
}
```
This connection should not be dropped here, as there are other processes running to get rid of unused outbound connections.
An incoming connection on a seed peer from this enquiring node is zero cost to the seed node after this conversation is finished, but the connection may have been established previously as a legitimate outbound connection.
Another consideration is that a consecutive discovery request will still be a proper discovery, although it should be faster, thus coming back to this code while the outbound connection is still open does not invalidate the tests.
Disagree. These connections are primarily to the seed peers. The reason our nodes use upwards of 1000 connections is that we keep them open because we might need them later.

And it's super important not to keep connections open to the seed peers. You are not supposed to use them as a connection hub. If another process wants to use the connection, it can reopen the connection.
If we insist on closing the connection, it should only be done if a new connection was established for this purpose, otherwise, a legitimate connection established by another process will be closed.
this has the added benefit of doing this as well: #6907
I don't think this is a big issue to disconnect from the seed peers. They will reopen connections if needed. They are there to seed peers, not seed the network.
What if the connection we are closing here is from a custom seed peer we are syncing from? Those should not be disconnected. The code here does not differentiate between DNS seeds and custom seeds. DNS seed connections may be closed, but not custom seeds that are not DNS seeds (one can still add a DNS seed as a custom seed).
On the restart point: if someone is watching the application itself, they will detect it's broken and needs a restart. But the node itself should not do this.
We should create an issue so we can provide tooling to do this for MacOS, Linux and Windows. We cannot only cater to Tari Universe users. This is a known problem.
Description
This adds a health check whereby the node checks whether it can discover a peer and dial a peer. It discovers, then dials, the seed peers.
fixes: #6944
Chores