
feat: peer health check#6951

Merged
SWvheerden merged 10 commits into tari-project:development from SWvheerden:sw_peer_health_check
Apr 17, 2025

Conversation

@SWvheerden
Collaborator

@SWvheerden SWvheerden commented Apr 15, 2025

Description

This adds a health check whereby the node verifies that it can discover a peer and dial a peer.
It discovers, then dials, each of the seed peers.

fixes: #6944
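The discover-then-dial check described above can be sketched with std-only Rust. This is a hypothetical simplification: the real service is async and the `discover_peer`/`dial_peer` stand-ins below replace the actual DHT and comms calls.

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-ins for the real async discovery and dial calls.
fn discover_peer(_peer: &str) -> Result<(), ()> { Ok(()) }
fn dial_peer(_peer: &str) -> Result<(), ()> { Ok(()) }

/// Latencies recorded for one seed peer; `None` means that step failed.
#[derive(Debug, Default)]
struct LivenessCheckResult {
    discovery_latency: Option<Duration>,
    dial_latency: Option<Duration>,
}

fn check_peer(peer: &str) -> LivenessCheckResult {
    let mut result = LivenessCheckResult::default();
    // Measure discovery latency; record it only on success.
    let start = Instant::now();
    if discover_peer(peer).is_ok() {
        result.discovery_latency = Some(start.elapsed());
    }
    // Then measure dial latency the same way.
    let start = Instant::now();
    if dial_peer(peer).is_ok() {
        result.dial_latency = Some(start.elapsed());
    }
    result
}
```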

Summary by CodeRabbit

  • New Features

    • Introduced liveness health checks that monitor and report peer discovery and dial latencies for seed peers.
    • Added detailed liveness metrics to the network state gRPC response, providing enhanced visibility into peer connectivity.
  • Configuration

    • Added a new configuration option for setting the liveness health check interval, with a default of 10 minutes.
  • Bug Fixes

    • Improved logging by lowering the log level for certain connection errors to reduce unnecessary warnings.
  • Tests

    • Added unit tests for seed peer retrieval logic.
    • Marked a specific block creation test to be ignored by default during test runs.
  • Chores

    • Updated documentation and configuration presets to reflect new liveness health check settings.
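The new interval option sits alongside the existing one in the base node config. A minimal sketch of the two fields and their defaults (field names are taken from the change summary; the struct shape here is illustrative, not the actual config code):

```rust
use std::time::Duration;

// Illustrative mirror of the two Tari Pulse intervals in the base node config.
struct BaseNodeConfig {
    tari_pulse_interval: Duration,     // DNS checkpoint check interval
    tari_pulse_health_check: Duration, // liveness health check interval
}

impl Default for BaseNodeConfig {
    fn default() -> Self {
        Self {
            tari_pulse_interval: Duration::from_secs(120),
            tari_pulse_health_check: Duration::from_secs(600), // 10 minutes
        }
    }
}
```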

@SWvheerden SWvheerden requested a review from a team as a code owner April 15, 2025 14:48
@coderabbitai
Contributor

coderabbitai bot commented Apr 15, 2025

Walkthrough

This change introduces a new liveness health check mechanism to the Tari Pulse service, enabling the system to periodically measure and report discovery and dial latencies for seed peers. The configuration and service initialization are updated to support separate intervals for DNS checkpoint checks and liveness checks. The results of these liveness checks are made available via a new watch channel and are included in the gRPC GetNetworkStateResponse. Supporting changes are made throughout the configuration, connectivity, and peer management modules to enable querying and reporting of seed peer information and their health metrics.
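The watch channel mentioned in the walkthrough lets consumers always read the latest liveness results. The real code uses `tokio::sync::watch`; as a rough std-only approximation of its semantics, a shared slot behind a lock behaves similarly for polling readers:

```rust
use std::sync::{Arc, RwLock};

// Minimal stand-in for tokio::sync::watch: a writer publishes the latest
// value and any reader observes the most recent one.
#[derive(Clone)]
struct Watch<T>(Arc<RwLock<T>>);

impl<T: Clone> Watch<T> {
    fn new(initial: T) -> Self {
        Watch(Arc::new(RwLock::new(initial)))
    }
    // Overwrite the stored value; older values are never queued.
    fn send(&self, value: T) {
        *self.0.write().unwrap() = value;
    }
    // Return a clone of whatever was published most recently.
    fn latest(&self) -> T {
        self.0.read().unwrap().clone()
    }
}
```

The real watch channel additionally notifies awaiting receivers on change, which this sketch omits.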

Changes

Files/Paths Change Summary
applications/minotari_app_grpc/proto/base_node.proto Added LivenessResult message and liveness_results field to GetNetworkStateResponse to report per-peer discovery and dial latencies.
applications/minotari_node/src/bootstrap.rs
applications/minotari_node/src/config.rs
common/config/presets/c_base_node_c.toml
Added tari_pulse_health_check configuration to base node config and initialization, with a default of 600 seconds. Updated config struct and default implementation.
applications/minotari_node/src/grpc/base_node_grpc_server.rs Updated get_network_state to include liveness results from Tari Pulse in the gRPC response.
base_layer/core/src/base_node/tari_pulse_service/mod.rs Major extension: Added liveness check logic, new config fields, concurrent interval handling, new result struct, notification channel, and updated service/initializer signatures.
comms/core/src/connection_manager/listener.rs Changed log level for invalid wire format byte from warn to debug.
comms/core/src/connectivity/manager.rs
comms/core/src/connectivity/requester.rs
Added GetSeeds request variant and handler to connectivity manager and requester, enabling querying of seed peers.
comms/core/src/peer_manager/manager.rs
comms/core/src/peer_manager/peer_storage.rs
Added get_seed_peers method to PeerManager and PeerStorage for retrieving seed peers. Included a new unit test for seed peer filtering.
comms/core/src/test_utils/mocks/connectivity_manager.rs Added unimplemented handler for ConnectivityRequest::GetSeeds in the connectivity manager mock.
base_layer/core/src/blocks/pre_mine/mod.rs Marked test_create_genesis_block_info test as ignored by default.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant BaseNodeGrpcServer
    participant TariPulseService
    participant ConnectivityManager
    participant PeerManager

    Client->>BaseNodeGrpcServer: get_network_state()
    BaseNodeGrpcServer->>TariPulseService: get_liveness_checks()
    TariPulseService->>ConnectivityManager: get_seeds()
    ConnectivityManager->>PeerManager: get_seed_peers()
    PeerManager-->>ConnectivityManager: seed peers list
    ConnectivityManager-->>TariPulseService: seed peers list
    TariPulseService->>TariPulseService: measure discovery/dial latencies
    TariPulseService-->>BaseNodeGrpcServer: liveness results
    BaseNodeGrpcServer-->>Client: GetNetworkStateResponse (with liveness_results)

Poem

In the warren of code where the seed peers reside,
Now a pulse checks their health with metrics supplied.
Latencies measured, results on display,
Through configs and channels, they hop on their way.
With every new tick, the network grows wise—
A rabbit’s delight as the liveness checks rise! 🐇✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cd7bad7 and e814d7a.

📒 Files selected for processing (1)
  • base_layer/core/src/base_node/tari_pulse_service/mod.rs (7 hunks)
🧰 Additional context used
🧠 Learnings (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (2)
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.933Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: ci
🔇 Additional comments (32)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (32)

22-26: Additional imports and clean organization.

The imports are well-organized and appropriately expanded to support the new liveness health check functionality. Good use of standard library components for time measurement and concurrency.


32-38: Necessary imports added for peer connectivity and liveness checks.

The imports correctly include the required components for implementing peer health checks: ConnectivityRequester for peer communication, LivenessHandle for ping operations, and DhtDiscoveryRequester for peer discovery.


42-47: Good use of Tokio synchronization primitives.

The code makes appropriate use of Tokio's synchronization primitives, including watch channels for state broadcasting, Mutex for concurrency control, and RwLock for safe concurrent access to shared state.


55-56: Configuration split into separate intervals for different health checks.

Good design decision to separate the DNS check interval from the liveness check interval, allowing for independent configuration of these two distinct health check mechanisms.


60-65: Well-structured LivenessCheckResult for reporting health metrics.

The struct effectively captures both discovery and ping latencies, providing valuable metrics for monitoring peer connectivity health.


70-71: Reasonable default intervals for health checks.

The defaults of 120 seconds for DNS checks and 600 seconds (10 minutes) for liveness checks are reasonable starting points, balancing responsiveness with system load.


92-94: New fields added to TariPulseService to support health checks.

The service now correctly stores the necessary components: connectivity requester, liveness handle, and discovery requester, to perform peer health checks.


98-114: Constructor updated to accept necessary components.

The constructor has been properly updated to accept and store the components needed for peer health checks.


126-126: Additional watch channel for health check results.

Good addition of a watch channel to broadcast liveness check results to interested components.


128-135: Well-structured timer setup with appropriate behavior.

Both intervals are properly configured with MissedTickBehavior::Delay to ensure consistent timing even if ticks are missed. The use of tokio::pin! is correct for the mutable interval references.


140-140: Concurrency control for health checks.

Good use of a Mutex to ensure only one health check runs at a time, preventing resource contention and potential race conditions.
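The "only one check at a time" behaviour described here comes from a non-blocking lock attempt: if the lock is already held, the tick is skipped rather than queued. A std-only sketch of the pattern (the real code uses tokio's async mutex):

```rust
use std::sync::{Arc, Mutex};

// Run the check only if no other check is in flight; return whether it ran.
fn run_if_idle(guard: &Arc<Mutex<()>>, mut check: impl FnMut()) -> bool {
    match guard.try_lock() {
        Ok(_lock) => {
            // Lock is held for the duration of the check.
            check();
            true
        },
        // A check is already running; skip this tick instead of blocking.
        Err(_) => false,
    }
}
```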


144-162: Asynchronous health check execution.

The health check is properly executed in a separate task to avoid blocking the main service loop. The mutex acquisition is handled correctly with early return if it can't be acquired.

Consider accepting the suggestion from hansieodendaal to simplify the function call:

-                        check_health(comms, liveness_handle, discovery, notify_channel).await;
+                        check_health(comms, discovery, notify_channel).await;

Please confirm if the suggestion from hansieodendaal is still applicable or if the current implementation is preferred.


164-167: DNS checkpoint verification maintained.

The existing DNS checkpoint verification functionality has been well-preserved alongside the new health check functionality.


181-181: Improved skip calculation for DNS checks.

Good improvement to calculate the maximum skip count based on the configured interval, ensuring consistent behavior regardless of the configuration.
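The skip calculation divides the configured interval by the base tick length, which is also where the zero-interval concern raised later in the review applies. A hedged sketch of such a helper (names are illustrative, not the PR's actual identifiers), guarding against a zero divisor:

```rust
use std::time::Duration;

// Derive the maximum number of ticks to skip from the configured interval,
// clamping both values to at least 1 second to avoid division by zero.
fn max_skips(dns_check_interval: Duration, base_tick: Duration) -> u64 {
    let interval = dns_check_interval.as_secs().max(1);
    let tick = base_tick.as_secs().max(1);
    interval / tick
}
```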


260-260: TariPulseHandle updated with liveness check results.

The handle is correctly updated to provide access to liveness check results via a watch channel.


268-270: Accessor method for liveness check results.

Good practice to provide a clean accessor method for obtaining the liveness check results.


274-276: Initializer updated for dual intervals.

The service initializer correctly accepts both intervals for configuring the service.


280-285: Constructor updated for initializer.

The constructor properly accepts and stores both intervals.


294-295: Setup of watch channels for both health check types.

Both watch channels are properly initialized with appropriate default values.


298-299: Handle correctly registers both watch receivers.

The TariPulseHandle is properly updated to include both watch receivers.


302-303: Configuration correctly uses both intervals.

The TariPulseConfig is correctly initialized with both intervals from the initializer.


309-312: Retrieval of necessary handles from context.

All required handles are correctly retrieved from the service context.


313-317: Improved service initialization with all components.

The service is properly initialized with all required components and configuration.


327-332: Well-structured check_health function signature.

The function accepts all necessary components: communication requester, liveness handle, discovery requester, and notification channel.


333-337: Shared results collection with proper concurrency control.

Good use of Arc<RwLock> for thread-safe collection of results from multiple concurrent tasks.


334-334: Graceful error handling for seed retrieval.

The code properly handles the case where seed retrieval fails by providing an empty vector as a fallback.


338-342: Clear initialization of result structure.

The LivenessCheckResult is properly initialized with the peer ID and null latencies.


349-356: Efficient peer discovery with latency measurement.

The code correctly measures the discovery latency and only records it if discovery succeeds.

Note: As per the learning from SWvheerden, the discovery_peer method already has built-in timeout mechanisms, so no explicit timeout is needed.


357-375: Effective peer ping implementation with response handling.

The ping operation is well-implemented with proper handling of the response event stream. The code correctly identifies matching pong responses by checking both node ID and nonce.
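Matching a pong to the ping that was sent, as described above, comes down to comparing both the responder's node ID and the ping nonce. A small illustrative sketch (the `Pong` type here is a stand-in; the real code consumes a liveness event stream):

```rust
// Minimal stand-in for a pong event carrying the responder id and nonce.
#[derive(Debug, Clone, PartialEq)]
struct Pong {
    node_id: String,
    nonce: u64,
}

// A pong belongs to our ping only if both the node id and nonce match,
// so pongs for other concurrent pings are not miscounted.
fn is_our_pong(event: &Pong, expected_node: &str, expected_nonce: u64) -> bool {
    event.node_id == expected_node && event.nonce == expected_nonce
}
```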


376-380: Resource cleanup after health check.

Good practice to disconnect from peers after the health check to prevent resource leaks.


381-382: Thread-safe result collection.

Results are properly collected in a thread-safe manner using the write lock.


384-386: Efficient task coordination and result publishing.

The code properly awaits all tasks to complete before publishing the results through the watch channel.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (3)

57-63: Consider storing additional failure information

Currently, liveness checks store only latencies. If an operation fails, there's no dedicated place for error details. Including an optional field for error messages or result codes might help diagnose connectivity issues.


138-156: Lock-based concurrency

Skipping the entire tick if the lock is already held might cause health checks to be repeatedly missed under high load. Evaluate if queuing or a limited concurrency approach is more appropriate.


318-361: Parallel checks and debug statements

  1. Unrestricted parallel tasks: Spawning a separate task for each seed peer could become expensive if the seed list is large. Consider adding concurrency limits or pooling to avoid resource contention.
  2. dbg! usage: Relying on dbg! is typically discouraged in production. Consider using the log macros (debug!, warn!, etc.) for consistency and configurability.
  3. More detailed error recording: You may store or record the specific failure reasons (discovery error, dial timeout, etc.) in LivenessCheckResult for improved diagnostics.

Example diff for removing dbg!:

- dbg!("failed discovery");
+ debug!(target: LOG_TARGET, "Failed to discover peer");
- dbg!("failed dial");
+ debug!(target: LOG_TARGET, "Failed to dial peer");
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2a8ea56 and 796e4dd.

📒 Files selected for processing (13)
  • applications/minotari_app_grpc/proto/base_node.proto (1 hunks)
  • applications/minotari_node/src/bootstrap.rs (1 hunks)
  • applications/minotari_node/src/config.rs (2 hunks)
  • applications/minotari_node/src/grpc/base_node_grpc_server.rs (2 hunks)
  • base_layer/core/src/base_node/tari_pulse_service/mod.rs (8 hunks)
  • base_layer/core/src/blocks/pre_mine/mod.rs (1 hunks)
  • common/config/presets/c_base_node_c.toml (1 hunks)
  • comms/core/src/connection_manager/listener.rs (1 hunks)
  • comms/core/src/connectivity/manager.rs (1 hunks)
  • comms/core/src/connectivity/requester.rs (2 hunks)
  • comms/core/src/peer_manager/manager.rs (1 hunks)
  • comms/core/src/peer_manager/peer_storage.rs (2 hunks)
  • comms/core/src/test_utils/mocks/connectivity_manager.rs (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
comms/core/src/peer_manager/manager.rs (1)
comms/core/src/peer_manager/peer_storage.rs (1)
  • get_seed_peers (347-352)
⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: test (nextnet, nextnet)
  • GitHub Check: test (mainnet, stagenet)
  • GitHub Check: test (testnet, esmeralda)
  • GitHub Check: cargo check with stable
  • GitHub Check: ci
  • GitHub Check: Cucumber tests / Base Layer
  • GitHub Check: Cucumber tests / FFI
🔇 Additional comments (31)
base_layer/core/src/blocks/pre_mine/mod.rs (1)

1359-1359: Marking the test as ignored is appropriate.

The use of #[ignore] on this async test is standard practice for tests that are slow, resource-intensive, or require special conditions. No issues found.

common/config/presets/c_base_node_c.toml (1)

51-52: Clean and well-documented configuration addition.

The new tari_pulse_health_check configuration parameter is well-placed next to the existing tari_pulse_interval parameter, with a clear description and reasonable default value (10 minutes). This aligns perfectly with the PR's objective of implementing a peer health check mechanism.

comms/core/src/connection_manager/listener.rs (1)

290-296: Appropriate log level adjustment.

Changing the log level from warn to debug for invalid wire format bytes makes sense, as these are often not critical issues that require warning-level visibility. This will help reduce noise in the logs while maintaining the information for debugging purposes.

applications/minotari_node/src/bootstrap.rs (1)

173-177: Proper integration of the new health check parameter.

The addition of base_node_config.tari_pulse_health_check to the TariPulseServiceInitializer::new constructor correctly integrates the new configuration parameter with the service initializer, enabling the periodic discovery and dialing of seed peers as part of the new health check mechanism.

comms/core/src/peer_manager/manager.rs (1)

122-125: Clean implementation of seed peer retrieval function.

The new get_seed_peers() method follows the established pattern in the codebase for getting filtered lists of peers. The implementation correctly uses a read lock on the peer storage and delegates to the underlying storage implementation.

comms/core/src/connectivity/manager.rs (1)

335-341: Robust seed peer retrieval with appropriate error handling.

The implementation handles errors gracefully by returning an empty vector when the peer manager fails to retrieve seed peers, which prevents cascading failures while properly logging the error. This is consistent with the error handling pattern used throughout the file.

applications/minotari_node/src/config.rs (2)

147-150: Health check interval configuration correctly implemented.

The new tari_pulse_health_check configuration parameter is properly annotated with the seconds serializer, maintaining consistency with the existing tari_pulse_interval field. This allows separate configuration of the DNS checkpoint interval and the new liveness health check interval.


189-189: Reasonable default value for health check interval.

The default value of 10 minutes (600 seconds) for the health check interval provides a good balance - frequent enough to detect network issues in a timely manner but not so frequent as to generate excessive network traffic.

applications/minotari_app_grpc/proto/base_node.proto (2)

545-547: Good addition of liveness results to network state response.

Adding the liveness check results to the existing GetNetworkStateResponse is a sensible approach. This avoids creating a new endpoint just for liveness information and allows clients to get a complete view of the network state including connectivity health in a single request.


549-556: Well-structured LivenessResult message definition.

The LivenessResult message contains exactly the information needed to assess peer connectivity health:

  1. The peer's node ID for identification
  2. Discovery latency for DNS resolution performance
  3. Dial latency for connection establishment performance

This captures the key metrics for monitoring and diagnosing network connectivity issues.

applications/minotari_node/src/grpc/base_node_grpc_server.rs (2)

465-474: Appropriate handling of liveness checks for the GetNetworkStateResponse

The code correctly fetches liveness check results from the TariPulse service and transforms them into the format required by the gRPC response. A nice touch is handling the potential missing latency measurements with a default value of u64::MAX.
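The "missing measurement" handling noted here maps an absent latency to `u64::MAX` as a sentinel. A sketch of that conversion (millisecond units are an assumption for illustration; the actual proto field units are defined in `base_node.proto`):

```rust
use std::time::Duration;

// Convert an optional latency measurement into a gRPC numeric field,
// using u64::MAX as the "no measurement available" sentinel.
fn latency_ms(latency: Option<Duration>) -> u64 {
    latency.map(|d| d.as_millis() as u64).unwrap_or(u64::MAX)
}
```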


485-485: Liveness results are properly included in the network state response

This correctly adds the liveness results to the GetNetworkStateResponse struct, ensuring the health check data will be accessible to API clients.

comms/core/src/connectivity/requester.rs (2)

107-107: Appropriate addition of GetSeeds variant

The new GetSeeds variant to the ConnectivityRequest enum follows the existing pattern with a oneshot channel for response delivery.


301-308: Well-implemented get_seeds method

The get_seeds method follows the established pattern for async communication with the connectivity manager. It properly handles error cases by mapping send errors to ActorDisconnected and receive errors to ActorResponseCancelled, consistent with other methods in this struct.
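The request/response pattern behind `GetSeeds` — an enum variant that carries a one-shot reply channel — can be modelled with std channels. This is a conceptual sketch only; the real code uses tokio's mpsc plus a oneshot channel and returns `Peer` values, not strings:

```rust
use std::sync::mpsc;

// Illustrative request enum: the reply channel travels with the request.
enum ConnectivityRequest {
    GetSeeds(mpsc::Sender<Vec<String>>),
}

// The handler answers by sending the seed list back on the reply channel;
// a dropped receiver is ignored, mirroring the actor pattern.
fn handle(req: ConnectivityRequest, seeds: &[String]) {
    match req {
        ConnectivityRequest::GetSeeds(reply) => {
            let _ = reply.send(seeds.to_vec());
        },
    }
}
```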

comms/core/src/peer_manager/peer_storage.rs (2)

347-352: Clear implementation of get_seed_peers

The get_seed_peers method efficiently filters the peer database for seed peers, consistent with other filtering methods in this class. It follows the standard result pattern, wrapping database errors properly.
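The filtering described here reduces the full peer list to its seed entries. A minimal stand-in (the `Peer` struct below is a simplification of the comms `Peer` type, which tracks seed status via peer flags):

```rust
// Simplified peer record; the real type carries addresses, flags, etc.
#[derive(Debug, Clone, PartialEq)]
struct Peer {
    node_id: String,
    is_seed: bool,
}

// Keep only the peers marked as seeds, cloning them into a new Vec,
// mirroring what get_seed_peers does over the peer database.
fn get_seed_peers(peers: &[Peer]) -> Vec<Peer> {
    peers.iter().filter(|p| p.is_seed).cloned().collect()
}
```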


948-976: Comprehensive test for get_seed_peers

This test thoroughly validates the get_seed_peers functionality by:

  1. Creating a mix of seed and non-seed peers
  2. Verifying the correct number of seed peers is returned
  3. Confirming each returned peer is actually a seed peer

The test is well-structured and covers the essential functionality.

base_layer/core/src/base_node/tari_pulse_service/mod.rs (15)

22-26: No major concerns

These added imports appear correctly used for concurrency (Arc), time measurement (Instant), and range checking (cmp::min).


32-32: Import usage

The introduction of ConnectivityRequester and NodeId is consistent with the liveness health check feature.


67-68: Verify default intervals

The default of 120 seconds for DNS checks and 600 seconds for liveness checks may or may not be optimal for your use case. Please confirm these intervals align with operational expectations.

Would you like to run tests in different scenarios or search external references to confirm these intervals?


89-90: Connectivity and discovery handles

Adding node_comms and node_discovery allows you to perform peer connectivity and discovery within the service.


94-99: Service constructor looks good

The constructor seamlessly integrates the new fields and configuration parameters.


105-106: Field assignment

Assigning node_comms and node_discovery here cleanly maintains the service state.


120-121: New watch sender

Introducing a notify_comms_health channel to broadcast liveness results provides a robust mechanism for asynchronous notifications.


122-129: Separate interval tasks

Using two separate timers and pinning them with MissedTickBehavior::Delay helps avoid overlapping scheduling for DNS checks and health checks.


134-134: Arc<Mutex<()>> concurrency guard

This mutex-based approach effectively blocks overlapping checks, but consider monitoring for unexpected lock contention or ensuring tasks complete promptly.


174-174: Check for zero intervals

A zero-valued dns_check_interval could cause integer division by zero. Verify that the configuration never provides a zero or near-zero value here.


253-253: Watch channel for liveness results

Publishing the results through a watch channel is a clean design that enables other services or UI components to consume real-time updates.


260-263: Getter usage

Returning a watch::Ref from get_liveness_checks ensures concurrency-safe reads of the current liveness check results without blocking.


267-268: TariPulseServiceInitializer enhancements

Passing in both the DNS and liveness intervals clearly separates the two functionalities. The addition of network is consistent with the rest of the service.

Also applies to: 273-278


288-288: Creating watch channels

Pairing a dedicated watch channel for checkpoint failures and another for liveness checks decouples the responsibilities and simplifies consumption.

Also applies to: 291-292, 295-296


302-309: Service setup steps

Acquiring the handles and constructing the TariPulseService with the provided requesters is consistent with the service framework pattern.

@github-actions

github-actions bot commented Apr 15, 2025

Test Results (CI)

    3 files    129 suites   54m 5s ⏱️
1 335 tests 1 335 ✅ 0 💤 0 ❌
4 003 runs  4 003 ✅ 0 💤 0 ❌

Results for commit e814d7a.

♻️ This comment has been updated with latest results.

@github-actions

github-actions bot commented Apr 15, 2025

Test Results (Integration tests)

 2 files  + 2  11 suites  +11   1h 35m 19s ⏱️ + 1h 35m 19s
36 tests +36  31 ✅ +31  0 💤 ±0   5 ❌ + 5 
45 runs  +45  32 ✅ +32  0 💤 ±0  13 ❌ +13 

For more details on these failures, see this check.

Results for commit e814d7a. ± Comparison against base commit 2a8ea56.

♻️ This comment has been updated with latest results.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (4)

319-356: Simplify pattern matching in error handling

The error handling pattern in the health check can be simplified for better readability.

Consider these changes:

-            if let Ok(_) = discovery
-                .discover_peer(dest_key.clone(), NodeDestination::PublicKey(dest_key.into()))
-                .await
-            {
+            if discovery
+                .discover_peer(dest_key.clone(), NodeDestination::PublicKey(dest_key.into()))
+                .await
+                .is_ok()
+            {
                 result.discovery_latency = Some(start.elapsed());
             }
             
-            if let Ok(_) = comms.dial_peer(result.peer.clone()).await {
+            if comms.dial_peer(result.peer.clone()).await.is_ok() {
                 result.dial_latency = Some(start2.elapsed());
             }

Also, line 354 has some unnecessary dereferencing:

-    let inner_result = (*(*results).read().await).clone();
+    let inner_result = results.read().await.clone();

339-344: Consider logging discovery failures

The code silently ignores discovery failures without logging them, which could make troubleshooting difficult.

Consider adding debug logging when discovery fails:

             if let Ok(_) = discovery
                 .discover_peer(dest_key.clone(), NodeDestination::PublicKey(dest_key.into()))
                 .await
             {
                 result.discovery_latency = Some(start.elapsed());
+            } else {
+                debug!(
+                    target: LOG_TARGET,
+                    "Failed to discover peer: {}", result.peer
+                );
             }

346-349: Consider logging dial failures

Similar to discovery failures, dial failures are not logged, which could make troubleshooting difficult.

Consider adding debug logging when dialing fails:

             if let Ok(_) = comms.dial_peer(result.peer.clone()).await {
                 result.dial_latency = Some(start2.elapsed());
+            } else {
+                debug!(
+                    target: LOG_TARGET,
+                    "Failed to dial peer: {}", result.peer
+                );
             }

324-325: Consider handling empty seed list

The current implementation handles the case where get_seeds() fails by defaulting to an empty vector, but doesn't log this scenario.

Consider logging when no seeds are available:

     let results = Arc::new(RwLock::new(Vec::new()));
-    let peers = node_comms.get_seeds().await.unwrap_or_else(|_| vec![]);
+    let peers = match node_comms.get_seeds().await {
+        Ok(peers) => {
+            if peers.is_empty() {
+                debug!(target: LOG_TARGET, "No seed peers available for health check");
+            }
+            peers
+        },
+        Err(e) => {
+            warn!(target: LOG_TARGET, "Failed to get seed peers: {}", e);
+            vec![]
+        }
+    };
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 796e4dd and 184a6bc.

📒 Files selected for processing (1)
  • base_layer/core/src/base_node/tari_pulse_service/mod.rs (8 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (7)
base_layer/p2p/src/initialization.rs (2)
  • future (514-536)
  • new (460-476)
comms/core/src/connectivity/requester.rs (2)
  • new (120-122)
  • peers (170-173)
base_layer/core/src/base_node/service/initializer.rs (4)
  • new (79-97)
  • handles (190-190)
  • handles (191-191)
  • handles (194-194)
base_layer/core/src/base_node/comms_interface/inbound_handlers.rs (2)
  • new (99-118)
  • clone (1030-1041)
comms/dht/examples/memory_net/utilities.rs (1)
  • discovery (133-191)
applications/minotari_node/src/bootstrap.rs (6)
  • handles (181-182)
  • handles (238-238)
  • handles (239-239)
  • handles (254-254)
  • handles (258-258)
  • handles (259-259)
base_layer/core/src/base_node/chain_metadata_service/initializer.rs (3)
  • handles (46-46)
  • handles (47-47)
  • handles (48-48)
⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: test (mainnet, stagenet)
  • GitHub Check: cargo check with stable
  • GitHub Check: test (testnet, esmeralda)
  • GitHub Check: test (nextnet, nextnet)
  • GitHub Check: ci
  • GitHub Check: Cucumber tests / FFI
  • GitHub Check: Cucumber tests / Base Layer
🔇 Additional comments (14)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (14)

52-53: LGTM: Well-structured configuration split

The separation of DNS and liveness check intervals gives appropriate control over the frequency of these distinct operations, allowing for independent tuning based on their different resource impacts and importance.


57-62: LGTM: Clear and concise result structure

The LivenessCheckResult struct effectively captures the peer health metrics with optional duration fields to handle potential failures in either discovery or dialing operations.


67-69: LGTM: Reasonable default intervals

The default intervals (2 minutes for DNS checks and 10 minutes for liveness checks) provide a good balance between timely information and minimizing network overhead.


89-91: LGTM: Appropriate service dependencies

Adding connectivity and discovery requesters as service dependencies properly encapsulates the functionality needed for peer health checks.


94-109: LGTM: Well-structured constructor

The updated constructor properly initializes the service with the necessary components for both DNS checkpoint verification and health checks.


122-129: LGTM: Proper interval configuration

Good implementation of interval timers with appropriate missed tick behavior, ensuring both health check mechanisms operate reliably even under system load.


134-134: LGTM: Mutex for concurrency control

Using a mutex to prevent overlapping health checks is an effective way to avoid resource contention and ensure accurate measurements.


138-156: LGTM: Efficient async health check execution

Spawning the health check in a separate task and using try_lock ensures the main service loop remains responsive while preventing overlapping checks.
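The non-blocking guard described here can be sketched with a plain `std::sync::Mutex` standing in for the tokio mutex the service uses; the function name is illustrative:

```rust
use std::sync::Mutex;

// Returns true if a new health check may start; a held guard means a
// previous check is still running, so this tick should be skipped
// instead of queueing behind it.
fn can_start_check(lock: &Mutex<()>) -> bool {
    lock.try_lock().is_ok()
}
```

The guard returned by a successful `try_lock` would be held for the duration of the check, so a tick that fires while a check is in flight simply observes the lock as taken and skips.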


174-174: LGTM: Updated skip tick calculation

The skip tick calculation now correctly uses the new dns_check_interval configuration parameter.


253-253: LGTM: Extended handle for liveness data

Adding the liveness checks to the handle provides a clean API for external components to access health data.


261-263: LGTM: Accessor method for liveness data

Providing a getter method for the liveness checks follows the established pattern in this codebase and maintains encapsulation.
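A minimal sketch of such a handle, using `Arc<RwLock<..>>` in place of the tokio watch channel; the type and field names here are assumed for illustration and are not the actual Tari types:

```rust
use std::sync::{Arc, RwLock};

#[derive(Clone, Debug, PartialEq)]
struct LivenessCheckResult {
    peer: String,
    discovery_latency_ms: Option<u64>,
    dial_latency_ms: Option<u64>,
}

// Handle cloned out to consumers: the service writes new results,
// and consumers read the latest snapshot through the getter.
#[derive(Clone, Default)]
struct PulseHandle {
    liveness_results: Arc<RwLock<Vec<LivenessCheckResult>>>,
}

impl PulseHandle {
    fn set_liveness_results(&self, results: Vec<LivenessCheckResult>) {
        *self.liveness_results.write().unwrap() = results;
    }

    fn get_liveness_results(&self) -> Vec<LivenessCheckResult> {
        self.liveness_results.read().unwrap().clone()
    }
}
```

A watch channel additionally notifies subscribers when the value changes, which this shared-lock sketch omits.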


273-279: LGTM: Updated initializer constructor

The constructor properly accepts and stores the separate interval configurations.


287-293: LGTM: Proper handle registration

Setting up separate watch channels and registering the handle with both receivers ensures consumers can access both types of status updates.


302-309: LGTM: Comprehensive service initialization

The service is initialized with all necessary dependencies obtained from the service context, providing a clean dependency injection approach.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (1)

338-351: Health check implementation respects built-in timeouts.

The health check implementation correctly relies on the built-in timeout mechanisms in the discover_peer and dial_peer methods, as mentioned in the learnings. This avoids redundant timeout wrappers and potential race conditions.

🧹 Nitpick comments (3)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (3)

325-325: Consider more explicit error handling when retrieving seeds.

The current implementation uses unwrap_or_else to handle errors when retrieving seed peers, but it silently falls back to an empty vector.

Consider logging a warning when seed retrieval fails to make troubleshooting easier:

- let peers = node_comms.get_seeds().await.unwrap_or_else(|_| vec![]);
+ let peers = match node_comms.get_seeds().await {
+     Ok(peers) => peers,
+     Err(err) => {
+         warn!(target: LOG_TARGET, "Failed to retrieve seed peers for health check: {}", err);
+         vec![]
+     }
+ };

300-309: Consider refactoring the service creation for better readability.

The service creation and initialization uses a chained pattern of method calls and awaits that affects readability.

For better readability, consider splitting the service creation and initialization:

- let mut tari_pulse_service =
-     TariPulseService::new(config, node_comms, node_discovery, shutdown_signal.clone())
-         .await
-         .expect("Should be able to get the service");
- let tari_pulse_service = tari_pulse_service.run(base_node_service, dns_sender, liveness_sender);
+ let mut tari_pulse_service = TariPulseService::new(
+     config, 
+     node_comms, 
+     node_discovery, 
+     shutdown_signal.clone()
+ )
+ .await
+ .expect("Should be able to get the service");
+ 
+ let tari_pulse_service = tari_pulse_service.run(
+     base_node_service, 
+     dns_sender, 
+     liveness_sender
+ );

353-355: Consider error handling in result collection.

When collecting and sending the results, the code uses expect which could potentially panic if the channel is closed.

For better error handling, consider using a more graceful approach:

- notify_comms_health.send(inner_result).expect("Channel should be open");
+ if let Err(err) = notify_comms_health.send(inner_result) {
+     warn!(target: LOG_TARGET, "Failed to send health check results: {:?}", err);
+ }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 184a6bc and bd9d7cc.

📒 Files selected for processing (1)
  • base_layer/core/src/base_node/tari_pulse_service/mod.rs (8 hunks)
🧰 Additional context used
🧠 Learnings (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (2)
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.933Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.933Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: Cucumber tests / Base Layer
  • GitHub Check: Cucumber tests / FFI
  • GitHub Check: test (testnet, esmeralda)
  • GitHub Check: cargo check with stable
  • GitHub Check: test (mainnet, stagenet)
  • GitHub Check: ci
🔇 Additional comments (5)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (5)

51-55: Configuration separation looks good.

The split of the configuration into separate intervals for DNS checkpoint verification and liveness health checks is a clean approach. The default values (2 minutes for DNS checks and 10 minutes for liveness checks) seem reasonable for balancing freshness of data with network overhead.

Also applies to: 67-68
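The split described here can be sketched as a config struct carrying the stated defaults; the field names are assumed from the review text, not copied from the codebase:

```rust
use std::time::Duration;

#[derive(Debug, Clone)]
struct TariPulseConfig {
    // How often to verify the DNS checkpoint.
    dns_check_interval: Duration,
    // How often to run the peer liveness health check.
    liveness_check_interval: Duration,
}

impl Default for TariPulseConfig {
    fn default() -> Self {
        Self {
            dns_check_interval: Duration::from_secs(2 * 60),
            liveness_check_interval: Duration::from_secs(10 * 60),
        }
    }
}
```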


57-62: Clean struct design for health check results.

The LivenessCheckResult struct is well-designed, capturing just the essential information needed: the peer ID and optional latency measurements for both discovery and dialing operations.


126-129: Effective implementation of the health check mechanism.

The health check interval setup and execution is well-implemented. Using MissedTickBehavior::Delay is appropriate to prevent tick pileup if health checks take longer than expected.

I like the use of a mutex to prevent overlapping health checks, which could otherwise cause resource contention. The error message when skipping a tick due to lock contention is also helpful for debugging.

Also applies to: 138-156


319-356: Well-structured concurrent health check implementation.

The check_health function is well-designed for concurrent operation, spawning a task for each peer and collecting results asynchronously. This allows for efficient parallel execution of discovery and dialing operations.


144-153: Good use of non-blocking mutex check.

Using try_lock instead of a blocking lock is a good approach here, as it prevents a health check from blocking if another one is already in progress. The debug log message provides clarity about why a check might be skipped.

Contributor

@hansieodendaal hansieodendaal left a comment


Looking good - some observations:

  • I think it will be prudent if we can supply custom base nodes to include in the liveness check alongside the seed peers. Use case: A user runs two base nodes, on the same or different computers. Each base node is configured to report on the other base node. When liveness for each other fails, it is time to restart tor with an empty cache.
  • We should swap out dial-peer for ping-peer in the liveness test. Successful discovery indicates a connected network, whereas a successful ping indicates a healthy connected peer. Both metrics will be useful. Dial always succeeds after discovery, but ping does not always succeed after dial, which is inherent in ping.
  • We should think about some ways in which to restart tor, which will be confirmed by the discovery metrics.

@SWvheerden
Collaborator Author

You can already add custom "seed" nodes to use via the config.

Swapping over to ping-peer is a good suggestion.

But we cannot restart tor from the base node, as in many cases tor runs completely outside of the base node. In the case of the seed nodes, docker runs on a separate instance, and there is no way the base node application can delete files in another docker container.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (1)

144-163: 💡 Verification agent

🧩 Analysis chain

Consider synchronizing with review comment suggesting parameter removal

The code implements the health check task properly, but there seems to be a discrepancy with a past review comment. hansieodendaal suggested removing the liveness_handle parameter from the check_health function call.


🏁 Script executed:

#!/bin/bash
# Check if the suggestion to modify the check_health function call has been addressed
echo "Current implementation at line 161:"
grep -n "check_health" base_layer/core/src/base_node/tari_pulse_service/mod.rs | grep -A 1 161

Length of output: 301


Action Required: Update check_health Call to Remove liveness_handle

The health check implementation is working correctly; however, as noted by the previous review comment from hansieodendaal, the call to check_health still passes the liveness_handle parameter, which should be removed. Please update the invocation on line 161 (and adjust check_health’s signature if needed) so that it no longer includes liveness_handle.

Example diff suggestion:

- check_health(comms, liveness_handle, discovery, notify_channel).await;
+ check_health(comms, discovery, notify_channel).await;
🧹 Nitpick comments (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (1)

327-381: Peer health check implementation is robust but could benefit from a few improvements

The health check implementation follows good async practices and properly handles errors. However, consider these improvements:

  1. The function spawns a separate task for each seed peer without limiting concurrency, which could become a resource issue if there are many seed peers.

  2. The ping response waiting loop doesn't have an explicit timeout - it relies on the liveness service's implementation for timeouts. An explicit timeout might be beneficial.

  3. The results vector is cloned before sending, which could be inefficient for a large number of peers.

Consider adding a concurrency limit to prevent resource exhaustion with a large number of seed peers:

- futures::future::join_all(handles).await;
+ let max_concurrent = 10; // Adjust based on expected peer count
+ for chunk in handles.chunks(max_concurrent) {
+     futures::future::join_all(chunk).await;
+ }

Consider adding an explicit timeout for the ping wait loop:

+ let timeout = tokio::time::timeout(Duration::from_secs(30), async {
     loop {
         match liveness_events.recv().await {
             // ... existing match logic
         }
     }
+ }).await;
+ 
+ if timeout.is_err() {
+     // Handle timeout case
+ }

Consider avoiding the full clone of results:

- let inner_result = (*(*results).read().await).clone();
- notify_comms_health.send(inner_result).expect("Channel should be open");
+ notify_comms_health.send_modify(|current_results| {
+     current_results.clear();
+     current_results.extend((*(*results).read().await).iter().cloned());
+ });
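The explicit-timeout idea above can be sketched with a std channel standing in for the liveness event stream (`tokio::time::timeout` plays the same role in the async code); the names and message shape are illustrative only:

```rust
use std::sync::mpsc;
use std::time::{Duration, Instant};

// Wait for a pong carrying the expected nonce, giving up once the overall
// deadline passes. Events for other nonces are skipped, not treated as errors.
fn wait_for_pong(rx: &mpsc::Receiver<u64>, nonce: u64, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            return false; // overall timeout elapsed
        }
        match rx.recv_timeout(remaining) {
            Ok(n) if n == nonce => return true,
            Ok(_) => continue,      // pong for a different request
            Err(_) => return false, // timed out or sender dropped
        }
    }
}
```

Recomputing the remaining time on each iteration bounds the whole wait, even when unrelated events keep arriving.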
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd9d7cc and cd7bad7.

📒 Files selected for processing (4)
  • applications/minotari_app_grpc/proto/base_node.proto (1 hunks)
  • applications/minotari_node/src/config.rs (2 hunks)
  • applications/minotari_node/src/grpc/base_node_grpc_server.rs (2 hunks)
  • base_layer/core/src/base_node/tari_pulse_service/mod.rs (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • applications/minotari_node/src/config.rs
  • applications/minotari_app_grpc/proto/base_node.proto
  • applications/minotari_node/src/grpc/base_node_grpc_server.rs
🧰 Additional context used
🧠 Learnings (1)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (2)
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.933Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.933Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: test (mainnet, stagenet)
  • GitHub Check: test (nextnet, nextnet)
  • GitHub Check: cargo check with stable
  • GitHub Check: ci
  • GitHub Check: Cucumber tests / FFI
  • GitHub Check: Cucumber tests / Base Layer
🔇 Additional comments (5)
base_layer/core/src/base_node/tari_pulse_service/mod.rs (5)

22-47: Well-organized imports

The import organization is clean and properly grouped by crate source, which improves readability.


54-58: Good separation of interval configurations

Splitting the configuration into separate intervals for DNS checks and liveness checks is a good design decision, allowing for independent configuration of these different health check mechanisms.


60-65: Clear result structure for health checks

The LivenessCheckResult struct provides a clean way to encapsulate health check metrics with appropriate fields for the peer ID and both types of latency measurements.


256-270: Good extension of the handle to expose liveness check results

The TariPulseHandle has been properly extended to include access to liveness check results through a watch channel, with a clean getter method.


273-324: Properly updated initializer to support new health check parameters

The initializer changes support the separate intervals for DNS and liveness checks, and all the necessary handles are properly acquired and passed to the service.

Contributor

@hansieodendaal hansieodendaal left a comment


The previous comment, "We should think about some ways in which to restart tor with empty caches, which will be confirmed by the discovery metrics", still holds, even if it is not done from within the base node. This problem should not only be solved for Tari Universe or the seed nodes.

One additional comment below.

Comment on lines +376 to +380
if let Ok(Some(mut conn)) = comms.get_connection(result.peer.clone()).await {
if let Err(err) = conn.disconnect(Minimized::No).await {
warn!(target: LOG_TARGET, "Failed to disconnect peer {} ({})", result.peer, err);
}
}
Contributor


This connection should not be dropped here, as there are other processes running to get rid of unused outbound connections.

An incoming connection on a seed peer from this enquiring node is zero cost to the seed node after this conversation is finished, but the connection may have been established previously as a legitimate outbound connection.

Another consideration is that a consecutive discovery request will still be a proper discovery, although it should be faster, thus coming back to this code while the outbound connection is still open does not invalidate the tests.

Collaborator Author


Disagree. These connections are primarily to the seed peers. The reason our nodes use upwards of 1000 connections is that we keep them open because we might need them later.
And it's super important not to keep connections open to the seed peers. You are not supposed to use them as a connection hub. If another process wants to use the connection, it can reopen the connection.

Contributor


If we insist on closing the connection, it should only be done if a new connection was established for this purpose, otherwise, a legitimate connection established by another process will be closed.

Collaborator Author


This has the added benefit of addressing #6907 as well.
I don't think disconnecting from the seed peers is a big issue. They will reopen connections if needed. They are there to seed peers, not to seed the network.

Contributor


What if the connection we are closing here is from a custom seed peer we are syncing from? Those should not be disconnected. The code here does not differentiate between DNS seeds and custom seeds. DNS seed connections may be closed, but not custom seeds that are not DNS seeds (one can still add a DNS seed as a custom seed).

@SWvheerden
Collaborator Author

On the comment:

"The previous comment 'We should think about some ways in which to restart tor with empty caches, which will be confirmed by the discovery metrics' still holds, even if it is not done from within the base node. This problem should not only be solved for Tari Universe or the seed nodes."

I agree, but it is beyond the scope of this PR. This PR only provides a health check system to use. You need an external application to check this, whether a CRON job or another application.

If someone is watching the application itself, they will detect that it is broken and needs a restart. But the node itself should not do this.

@hansieodendaal
Contributor

We should create an issue so we can provide tooling to do this for MacOS, Linux and Windows. We cannot only cater to Tari Universe users. This is a known problem.

@SWvheerden
Collaborator Author

#6962

@SWvheerden SWvheerden merged commit 484ce45 into tari-project:development Apr 17, 2025
17 of 18 checks passed
@SWvheerden SWvheerden deleted the sw_peer_health_check branch April 22, 2025 05:45
