feat: new bootstrap process#7121
Conversation
…into bootstrapper-dev
WalkthroughThis update introduces a new explicit seed peer bootstrap phase ("SeedStrap") into the DHT network discovery process, with detailed event signaling, configuration, and state tracking. The base node state machine is refactored to subscribe to and respond to DHT bootstrap events, updating UI state accordingly. Multiple configuration and event types are extended, and peer flag handling is improved. Changes
Sequence Diagram(s)sequenceDiagram
participant Node as BaseNodeStateMachine
participant DHT as Dht Service
participant UI as UI State
participant PeerMgr as PeerManager
participant Seeds as Seed Peers
Node->>DHT: Subscribe to DHT events
DHT-->>Node: BootstrapMethodDetermined / PrimaryBootstrapComplete
alt Bootstrap via SeedStrap
Node->>DHT: Initiate SeedStrap bootstrap
DHT->>Seeds: Query for peers
Seeds-->>DHT: Respond with peer lists
DHT-->>Node: NetworkDiscoveryPeersAdded (with round info)
Node->>UI: Update bootstrap phase (progress/rounds)
DHT-->>Node: PrimaryBootstrapComplete
Node->>UI: Clear bootstrap phase, mark as complete
else Bootstrap via ExistingPeers
DHT-->>Node: BootstrapMethodDetermined(ExistingPeers)
Node->>UI: Mark bootstrap complete, skip SeedStrap
end
Node->>DHT: Listen for further DHT events
DHT-->>Node: Other DHT events (ignored or handled as needed)
Suggested reviewers
Poem
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Test Results (Integration tests)0 tests 0 ✅ 0s ⏱️ For more details on these parsing errors, see this check. Results for commit df24c02. ♻️ This comment has been updated with latest results. |
Test Results (CI) 3 files 112 suites 42m 43s ⏱️ For more details on these failures, see this check. Results for commit 7b5c0d5. ♻️ This comment has been updated with latest results. |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (7)
base_layer/core/src/base_node/sync/config.rs (1)
59-64: Consider enforcing configuration consistency with DhtConfig.The comment indicates this value should ideally match
DhtConfig.network_discovery.max_seed_peer_sync_count, but there's no mechanism to ensure consistency. This could lead to configuration mismatches where the base node and DHT have different expectations about bootstrap rounds.Consider adding validation during configuration loading or documenting this relationship more prominently in the configuration files to help users maintain consistency.
comms/dht/src/network_discovery/initializing.rs (1)
64-100: Excellent peer categorization implementation!The code provides comprehensive peer analysis with clear categorization logic. The error handling for database failures is appropriate - logging the error while continuing with bootstrap is the right approach.
Consider extracting the peer suitability check into a helper method for better readability:
+ fn is_suitable_peer(peer: &Peer) -> bool { + !peer.is_seed() && + !peer.is_banned() && + peer.deleted_at.is_none() && + !peer.is_offline() && + !peer.all_addresses_failed() && + peer.features == PeerFeatures::COMMUNICATION_NODE + } for peer in &all_peers { total_peers += 1; if peer.is_seed() { seed_peers += 1; } else if peer.is_banned() { banned_peers += 1; } else if peer.deleted_at.is_some() { deleted_peers += 1; } else if peer.is_offline() { offline_peers += 1; } else if peer.all_addresses_failed() { failed_address_peers += 1; } else if peer.features != PeerFeatures::COMMUNICATION_NODE { non_communication_node_peers += 1; } else { suitable_peers += 1; } }comms/dht/src/network_discovery/ready.rs (1)
235-241: Consider adding debug logging for the edge case.The comment mentions a scenario where
last_round_info_optionis None butcurrent_num_rounds > 0, which "should not happen if SeedStrap always sets last_round_info". While the code handles this gracefully, consider adding a debug log to detect if this edge case ever occurs in practice.} // Fallthrough: continue discovery if: // - last_round_info_option is None (but current_num_rounds > 0 - should not happen if SeedStrap always sets // last_round_info, this path is more for re-entry from Idle/OnConnect where last_round might be old/cleared) // OR // - last_round_info_option showed new peers or failed (and the new SeedStrap success condition above wasn't // met), AND // - idle_after_num_rounds not yet reached. + +// Debug log for edge case detection +if last_round_info_option.is_none() && current_num_rounds > 0 { + debug!( + target: LOG_TARGET, + "Unexpected state: last_round_info is None but current_num_rounds = {}. This might indicate a state inconsistency.", + current_num_rounds + ); +} + let excluded_peers = self.context.all_attempted_peers.read().await.clone();comms/dht/src/network_discovery/seed_strap.rs (3)
478-496: Early exit optimization looks good, but round number update may be misleading.The early exit condition when sufficient peers are found is a good optimization. However, setting
round_info.round_number = round_info.total_roundson line 494 might be misleading since not all seed peers were actually contacted.Consider adding a flag to indicate early exit instead of manipulating the round number.
// If we early exit because we found enough peers, make the round_number reflect completion // of the seed strap phase. -round_info.round_number = round_info.total_rounds; +// Keep the actual round number and add a comment in the log +info!( + target: LOG_TARGET, + "SeedStrap: Early exit after {}/{} seed contacts due to finding sufficient peers", + round_info.round_number.unwrap_or(idx + 1), + round_info.total_rounds.unwrap_or(num_seeds_to_try) +);
457-464: Counter management after banning could be more robust.While the current implementation checks for > 0 before decrementing, managing counters after the fact could lead to inconsistencies. Consider tracking whether this seed was counted as successful before incrementing initially.
// If we banned the seed, we don't count it as a successful sync +// Only decrement if we had previously counted this as successful +let was_counted_as_successful = round_info.num_succeeded > successful_seed_contacts.saturating_sub(1); if round_info.num_succeeded > 0 { round_info.num_succeeded -= 1; } if successful_seed_contacts > 0 { successful_seed_contacts -= 1; } +if !was_counted_as_successful { + warn!(target: LOG_TARGET, "Counter mismatch detected during seed ban handling"); +}
44-45: Consider reducing the stream item timeout.The current 10-second timeout for individual stream items might be too conservative and could slow down the bootstrap process when dealing with unresponsive seeds.
// Define a timeout for individual stream items -const STREAM_ITEM_TIMEOUT: Duration = Duration::from_secs(10); +const STREAM_ITEM_TIMEOUT: Duration = Duration::from_secs(5);Also applies to: 691-691
comms/dht/src/network_discovery/state_machine.rs (1)
432-448: Consider logging when bootstrap timeout monitoring begins.The timeout mechanism is well-implemented, but it would be helpful to log when the bootstrap timeout monitoring starts for better observability.
Add a log statement before the timeout monitoring begins:
let next_event = if !bootstrap_completed { + debug!( + target: LOG_TARGET, + "Bootstrap timeout monitoring active (timeout: {:?})", + bootstrap_timeout_duration + ); // Create a separate context to avoid borrow issues let context_clone = self.context.clone();
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
base_layer/core/src/base_node/state_machine_service/initializer.rs(2 hunks)base_layer/core/src/base_node/state_machine_service/state_machine.rs(4 hunks)base_layer/core/src/base_node/state_machine_service/states/events_and_states.rs(2 hunks)base_layer/core/src/base_node/state_machine_service/states/listening.rs(5 hunks)base_layer/core/src/base_node/state_machine_service/states/starting_state.rs(2 hunks)base_layer/core/src/base_node/sync/config.rs(2 hunks)base_layer/p2p/src/peer_seeds.rs(2 hunks)comms/dht/src/event.rs(2 hunks)comms/dht/src/lib.rs(1 hunks)comms/dht/src/network_discovery/config.rs(2 hunks)comms/dht/src/network_discovery/discovering.rs(3 hunks)comms/dht/src/network_discovery/error.rs(1 hunks)comms/dht/src/network_discovery/initializing.rs(3 hunks)comms/dht/src/network_discovery/mod.rs(1 hunks)comms/dht/src/network_discovery/on_connect.rs(2 hunks)comms/dht/src/network_discovery/ready.rs(2 hunks)comms/dht/src/network_discovery/seed_strap.rs(1 hunks)comms/dht/src/network_discovery/state_machine.rs(16 hunks)
🧰 Additional context used
🧠 Learnings (1)
comms/dht/src/network_discovery/initializing.rs (2)
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
Learnt from: SWvheerden
PR: tari-project/tari#6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.
🧬 Code Graph Analysis (1)
comms/dht/src/network_discovery/error.rs (6)
comms/dht/src/network_discovery/state_machine.rs (1)
eq(144-158)comms/core/src/net_address/multiaddr_with_stats.rs (2)
eq(370-372)eq(443-464)comms/core/src/net_address/mutliaddresses_with_stats.rs (1)
eq(271-273)comms/core/src/peer_manager/peer.rs (1)
eq(378-380)comms/dht/src/envelope.rs (3)
eq(164-172)eq(319-321)eq(325-327)comms/core/src/connection_manager/peer_connection.rs (1)
eq(363-365)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: test (mainnet, stagenet)
- GitHub Check: test (testnet, esmeralda)
- GitHub Check: ci
🔇 Additional comments (33)
base_layer/p2p/src/peer_seeds.rs (2)
36-36: LGTM: Required import for seed peer flagging.The addition of
PeerFlagsimport supports the seed peer flagging implemented below.
169-169: LGTM: Correctly flags seed peers for bootstrap process.This change properly marks peers converted from
SeedPeerwith theSEEDflag, enabling the new bootstrap system to distinguish seed nodes from regular peers. This aligns perfectly with the PR objectives to implement a dedicated seed bootstrap phase.comms/dht/src/network_discovery/on_connect.rs (2)
33-33: LGTM: Required import for discovery phase tracking.The
DiscoveryPhaseimport is necessary for the enhanced event reporting implemented below.
180-182: LGTM: Proper integration with discovery phase tracking.The addition of these fields correctly integrates on-connect peer synchronization with the new discovery phase system:
DiscoveryPhase::Generalappropriately categorizes this as regular discovery (not seed bootstrap)- Setting
round_numberandtotal_roundstoNoneis correct since on-connect mode doesn't track specific rounds- The inline comments clearly explain the reasoning
This ensures consistent event reporting across different discovery modes.
comms/dht/src/lib.rs (1)
95-98: LGTM: Proper API exposure for bootstrap system integration.The expanded re-exports correctly expose the new
BootstrapMethodandDiscoveryPhasetypes from the state machine, enabling external components (such as the base node state machine service) to integrate with the enhanced bootstrap system. This maintains a clean public API while providing access to essential bootstrap-related types.comms/dht/src/network_discovery/discovering.rs (3)
35-41: LGTM! Import addition supports discovery phase tracking.The addition of
DiscoveryPhaseimport is necessary for the new discovery phase tracking functionality implemented in theinitializemethod.
86-89: Good addition of discovery phase context.The initialization of discovery phase and round information properly categorizes this as
Generaldiscovery (as opposed to seed bootstrap) and provides necessary context for tracking discovery progress. Settingtotal_roundstoNonecorrectly indicates that general discovery doesn't have a fixed number of rounds.
280-280: Improved comment clarity.The comment clarification from "Other errors" to "Other errors - no banning needed" better explains the intent of this match arm.
base_layer/core/src/base_node/state_machine_service/initializer.rs (2)
24-25: Import refactoring aligns with DHT event migration.The removal of
PeerManagerimport and addition ofDhtimport correctly reflects the architectural change from peer manager events to DHT events for bootstrap tracking.
110-116:Details
✅ Verification successful
Verify BaseNodeStateMachine constructor signature compatibility.
The change from peer manager to DHT event stream is architecturally sound and aligns with the new bootstrap tracking mechanism. However, ensure that the
BaseNodeStateMachineconstructor signature has been updated to acceptDhtEventReceiverinstead of the previous peer manager parameter.
🏁 Script executed:
#!/bin/bash # Description: Verify BaseNodeStateMachine constructor signature matches the new call # Expected: Constructor should accept DhtEventReceiver parameter ast-grep --pattern $'impl BaseNodeStateMachine$_$ { $$$ pub fn new($$$) -> $_ { $$$ } $$$ }'Length of output: 105
🏁 Script executed:
#!/bin/bash # Description: Locate BaseNodeStateMachine implementation and inspect its new() signature rg --color=never -n "impl BaseNodeStateMachine" -A20Length of output: 54
🏁 Script executed:
#!/bin/bash # Description: Search for BaseNodeStateMachine::new to locate its constructor signature rg --color=never -n "BaseNodeStateMachine::new" -C5Length of output: 6843
🏁 Script executed:
#!/bin/bash # Find BaseNodeStateMachine struct definition and inspect its new() signature rg --color=never -n "struct BaseNodeStateMachine" -C5 base_layer/core/src/base_node rg --color=never -n "pub fn new" -C5 base_layer/core/src/base_nodeLength of output: 32337
Constructor signature is correct
Verified thatBaseNodeStateMachine::newinstate_machine.rsnow takes
metadata_event_stream: broadcast::Receiver<Arc<ChainMetadataEvent>>dht_event_stream: DhtEventReceiver
in that order, and the initializer passeschain_metadata_service.get_event_stream()anddht.subscribe_dht_events()accordingly.comms/dht/src/event.rs (2)
27-27: Import addition supports new bootstrap event.The
BootstrapMethodimport is necessary for the newBootstrapMethodDeterminedevent variant.
42-46: Excellent addition of bootstrap tracking events.The new events provide granular visibility into the bootstrap process:
PrimaryBootstrapCompletesignals when the initial bootstrap phase (e.g., via seed nodes) is finishedBootstrapMethodDetermined(BootstrapMethod)communicates which bootstrap method is being usedThese events enable the base node state machine to accurately track and respond to bootstrap progress, improving the user experience with more precise status updates.
comms/dht/src/network_discovery/mod.rs (1)
41-43: Appropriate visibility change for new bootstrap functionality.Making
seed_strapandstate_machinemodules public is necessary to enable external access to the new bootstrap types and functionality, such as:
BootstrapMethodenum used in DHT eventsSeedStrapimplementation for seed peer bootstrap- Enhanced state machine types used by the base node service
This change properly exposes the new bootstrap infrastructure to consumers.
base_layer/core/src/base_node/sync/config.rs (1)
81-86: LGTM!The accessor method follows Rust conventions and provides clean access to the configuration value.
comms/dht/src/network_discovery/initializing.rs (2)
51-51: Good error handling improvement!The enhanced error logging provides clear visibility into connectivity issues that prevent discovery from proceeding.
102-128: Excellent bootstrap decision implementation!The logic correctly implements the PR objective by skipping seed bootstrap when sufficient suitable peers exist. The detailed logging provides excellent visibility for debugging and monitoring the bootstrap decision process.
comms/dht/src/network_discovery/config.rs (1)
56-103: Well-designed configuration for seed bootstrap control!The new configuration fields provide comprehensive control over the seed bootstrap process with clear documentation and sensible defaults. The use of
#[serde(default)]ensures backward compatibility.Note that
max_seed_peer_sync_count(default: 5) matches the default value ofnum_initial_sync_rounds_seed_bootstrapinBlockchainSyncConfig, maintaining the consistency mentioned in the code comments.base_layer/core/src/base_node/state_machine_service/states/events_and_states.rs (2)
181-186: Well-structured bootstrap phase trackingThe
BootstrapPhaseInfostruct provides a clean way to track bootstrap progress with current and total rounds.
261-276: Good UX decision for bootstrap status displayPrioritizing the bootstrap phase display when active provides better visibility into the node's startup progress. The fallback to existing display logic maintains backward compatibility.
base_layer/core/src/base_node/state_machine_service/states/starting_state.rs (1)
53-95: Good approach to catch up on missed eventsThe initial loop to process any existing DHT events ensures the node doesn't miss bootstrap completion events that occurred before the state machine started listening.
base_layer/core/src/base_node/state_machine_service/state_machine.rs (1)
115-115: Clean refactoring from peer_manager to DHT eventsThe replacement of
peer_managerwithdht_event_streamis well-executed, maintaining the same event-driven architecture while improving the separation of concerns.Also applies to: 129-129
comms/dht/src/network_discovery/ready.rs (2)
40-84: LGTM! Well-structured peer selection logic.The new helper function effectively centralizes peer selection strategy, choosing between closest peers (when new peers were discovered) and random peers (otherwise). This is a good optimization for discovery efficiency.
196-210: Good optimization for successful SeedStrap completion.The logic correctly identifies when the initial SeedStrap phase was very successful and transitions to less aggressive discovery mode. The multi-condition check ensures both quantity thresholds are met before making this optimization.
base_layer/core/src/base_node/state_machine_service/states/listening.rs (3)
168-210: Good defensive handling of missed DHT bootstrap events.The logic to catch up on missed events before setting up the UI state is a good approach to handle timing issues. The error handling for all
try_recvcases is comprehensive.
411-422: Good defensive check for bootstrap completion.The logic to mark bootstrap complete when the final SeedStrap round reports success is a good fallback mechanism in case the
PrimaryBootstrapCompleteevent is missed. This helps ensure robust bootstrap tracking.
254-265:Details
❓ Verification inconclusive
Potential race condition in bootstrap phase management.
The comments indicate awareness of a potential race condition where
bootstrap_phasemight become None prematurely. While defensive checks are in place, the complexity of managing this state across multiple event handlers could lead to inconsistencies.Consider consolidating the bootstrap state management logic to reduce the risk of race conditions.
Run the following script to check for other places where
bootstrap_phaseis modified:Also applies to: 316-340
🏁 Script executed:
#!/bin/bash # Description: Find all locations where bootstrap_phase is modified to ensure consistent state management # Search for bootstrap_phase assignments rg -A 3 "bootstrap_phase\s*=" # Search for set_primary_bootstrap_complete calls rg -A 2 "set_primary_bootstrap_complete"Length of output: 7409
Potential inconsistency in
bootstrap_phasehandling across multiple modulesI found that
bootstrap_phaseis being set or cleared in several places, increasing the risk of a race condition:• state_machine/core/src/base_node/state_machine_service/state_machine.rs
– Clearsinfo.bootstrap_phase = Noneand immediately re-sets state.
• states/starting_state.rs
– Callsshared.set_primary_bootstrap_complete(true)in multiple match arms.
• states/listening.rs
– Initializes, clears, and updatesbootstrap_phasein several code paths and also callsshared.set_primary_bootstrap_complete(true).Each location manages part of the “bootstrap complete” handshake, but none enforce atomicity between clearing the phase and flipping
is_primary_bootstrap_complete. I recommend consolidating all bootstrap-phase transitions into a single helper (e.g.Shared::complete_bootstrap()) that both clears the phase and sets the complete flag together, to eliminate timing gaps.Please review these three areas and ensure the state transitions cannot interleave incorrectly.
comms/dht/src/network_discovery/seed_strap.rs (1)
54-125: Well-structured seed bootstrap implementation.The
next_eventmethod and overall error handling approach is excellent. The code properly handles both success and failure cases, always returning aDiscoveryCompleteevent with appropriate statistics. The extensive logging will be valuable for debugging bootstrap issues.comms/dht/src/network_discovery/state_machine.rs (6)
59-74: LGTM! Well-structured enum for tracking bootstrap methods.The
BootstrapMethodenum clearly represents the different bootstrap scenarios with appropriate Display implementation.
143-159: Consider the implications of partial equality comparison.The current
PartialEqimplementation only compares variant types for complex variants, ignoring their data. This meansBeginDiscovery(params1)==BeginDiscovery(params2)regardless of the parameters, which could lead to unexpected behavior if equality checks are used elsewhere in the codebase.Consider either:
- Deriving
PartialEqif all nested types support it, or- Document this behavior clearly if it's intentional
198-229: Excellent bootstrap lifecycle management!The async methods for tracking bootstrap method, start time, and completion provide clear lifecycle management with appropriate logging at each stage.
231-259: Great improvement to event publishing with detailed logging!The enhanced logging with receiver counts and specific event names will significantly help with debugging event flow issues.
324-406: Well-designed state transitions with robust error handling!The state machine properly handles:
- Bootstrap skipping when sufficient peers exist
- Error recovery during SeedStrap to prevent UI deadlock
- Clear success/failure paths with appropriate logging
The defensive programming approach of completing bootstrap even on errors is excellent for preventing stuck states.
500-524: Well-structured discovery phase tracking!The
DiscoveryPhaseenum and enhancedDhtNetworkDiscoveryRoundInfoprovide clear phase identification and progress tracking, which will be valuable for monitoring the bootstrap process.
base_layer/core/src/base_node/state_machine_service/states/starting_state.rs
Show resolved
Hide resolved
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (1)
comms/dht/src/network_discovery/seed_strap.rs (1)
663-770:⚠️ Potential issueAddress the unresolved security concern about excessive peer data.
The past review comment about protecting against excessive peer data from malicious seeds has not been addressed. The
collect_peer_streamfunction still lacks a limit on the total number of peers collected from a single seed node, which could lead to memory exhaustion from malicious seeds.Please implement the previously suggested protection:
async fn collect_peer_stream<S>( &self, seed_node_id_str: &str, peer_stream: &mut S, ) -> Result<Vec<crate::proto::rpc::PeerInfo>, NetworkDiscoveryError> where S: StreamExt<Item = Result<crate::proto::rpc::GetPeersResponse, tari_comms::protocol::rpc::RpcStatus>> + Unpin, { let mut peers_from_seed = Vec::new(); let mut stream_items_processed_total = 0; let mut stream_items_with_peers = 0; + // Limit total peers from a single seed to prevent memory exhaustion + const MAX_PEERS_PER_SEED: usize = 1000; debug!( target: LOG_TARGET, "SeedStrap: Beginning to collect peer stream items from seed '{}'", seed_node_id_str ); loop { + // Check if we've collected enough peers from this seed + if peers_from_seed.len() >= MAX_PEERS_PER_SEED { + warn!( + target: LOG_TARGET, + "SeedStrap: Reached maximum peers limit ({}) for seed '{}'. Stopping collection.", + MAX_PEERS_PER_SEED, + seed_node_id_str + ); + break; + } + debug!( target: LOG_TARGET, "SeedStrap: Attempting to get next peer from stream for seed '{}'. Processed {} items so far ({} with peers).",
🧹 Nitpick comments (1)
comms/dht/src/network_discovery/state_machine.rs (1)
143-159: Consider documenting the PartialEq behaviorThe
PartialEqimplementation only compares variant types for complex data structures, which is reasonable for state machine logic but might be unexpected.Consider adding a comment to clarify the comparison behavior:
+/// PartialEq implementation that compares only variant types for complex data, +/// useful for state machine transition logic where exact data equality isn't needed impl PartialEq for StateEvent {
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
base_layer/core/src/base_node/state_machine_service/states/listening.rs(5 hunks)base_layer/core/src/base_node/state_machine_service/states/mod.rs(1 hunks)base_layer/core/tests/helpers/nodes.rs(3 hunks)base_layer/core/tests/helpers/sync.rs(1 hunks)base_layer/core/tests/tests/base_node_rpc.rs(1 hunks)base_layer/core/tests/tests/mempool.rs(1 hunks)base_layer/core/tests/tests/node_service.rs(1 hunks)base_layer/core/tests/tests/node_state_machine.rs(3 hunks)comms/dht/src/network_discovery/error.rs(1 hunks)comms/dht/src/network_discovery/ready.rs(2 hunks)comms/dht/src/network_discovery/seed_strap.rs(1 hunks)comms/dht/src/network_discovery/state_machine.rs(15 hunks)comms/dht/src/network_discovery/test.rs(3 hunks)
✅ Files skipped from review due to trivial changes (3)
- base_layer/core/tests/tests/node_service.rs
- base_layer/core/tests/tests/base_node_rpc.rs
- base_layer/core/tests/tests/mempool.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- comms/dht/src/network_discovery/error.rs
🧰 Additional context used
🧬 Code Graph Analysis (1)
comms/dht/src/network_discovery/state_machine.rs (6)
comms/dht/src/network_discovery/ready.rs (3)
new(87-92)next_event(94-104)config(272-274)comms/dht/src/network_discovery/seed_strap.rs (3)
new(55-57)next_event(59-125)config(773-775)comms/dht/src/network_discovery/initializing.rs (2)
new(38-40)next_event(42-128)comms/dht/src/network_discovery/discovering.rs (4)
new(64-71)next_event(114-151)config(311-313)params(316-325)comms/dht/src/network_discovery/on_connect.rs (3)
new(53-58)next_event(60-122)config(200-202)comms/dht/src/network_discovery/config.rs (1)
default(90-105)
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: test (mainnet, stagenet)
- GitHub Check: test (testnet, esmeralda)
- GitHub Check: test (nextnet, nextnet)
- GitHub Check: cargo check with stable
- GitHub Check: Cucumber tests / Base Layer
- GitHub Check: Cucumber tests / FFI
- GitHub Check: ci
🔇 Additional comments (33)
base_layer/core/tests/tests/node_state_machine.rs (3)
109-109: LGTM: Consistent migration to DHT event streams.The change from peer manager event stream to DHT event subscription aligns perfectly with the PR's bootstrap process refactoring. This ensures tests properly validate the new DHT-based event handling model.
252-252: LGTM: Consistent test pattern maintained.The migration to
dht.subscribe_dht_events()maintains consistency with the other test functions and supports the new bootstrap event handling architecture.
302-302: LGTM: Test infrastructure properly updated.The change ensures that the
test_event_channeltest uses the same DHT event stream pattern as the other tests, maintaining consistency across the test suite.base_layer/core/src/base_node/state_machine_service/states/mod.rs (1)
59-59: LGTM: Appropriate visibility change for module restructuring.Making the
events_and_statesmodule public is necessary to support the bootstrap process refactoring and enables external access to the enhancedListeningInfostruct and related types that track bootstrap progress.base_layer/core/tests/helpers/nodes.rs (2)
89-89: LGTM: DHT handle integration supports test infrastructure.Adding the
dhtfield toNodeInterfacesis essential for providing test access to the DHT event stream, which is now required for the new bootstrap process and BaseNodeStateMachine initialization.
408-422: LGTM: Proper DHT handle extraction and assignment.The extraction of the DHT handle from service handles and assignment to the NodeInterfaces struct follows the established pattern for other service handles and enables test access to DHT functionality.
base_layer/core/tests/helpers/sync.rs (1)
188-188: LGTM: Consistent migration to DHT event streams in test helpers.The change from peer manager to DHT event subscription maintains consistency with the bootstrap process refactoring seen across other test files and properly supports the new BaseNodeStateMachine event handling model.
comms/dht/src/network_discovery/test.rs (4)
173-173: LGTM: Proper import for new async synchronization primitives.Good addition of the
RwLockimport to support the new bootstrap tracking fields.
176-182: LGTM: Well-organized imports for bootstrap functionality.The imports are properly organized and support the new bootstrap method tracking introduced in the state machine.
209-210: LGTM: Appropriate test context updates for new bootstrap fields.The addition of
bootstrap_methodandbootstrap_started_atfields properly mirrors the production context structure, enabling accurate testing of bootstrap-related functionality.
259-259: LGTM: Clean pattern for test field initialization.Using
..Default::default()is a good practice for test setup, making the code more maintainable when new fields are added.comms/dht/src/network_discovery/ready.rs (3)
40-84: LGTM: Excellent refactoring with improved separation of concerns.The new
select_peers_for_discovery_roundhelper function encapsulates peer selection logic cleanly and makes the code more testable and maintainable. The logic correctly differentiates between selecting closest peers when the last round found new peers versus random peers otherwise.
97-114: LGTM: Clean parameter addition and improved error handling.The addition of the
current_num_roundsparameter makes the method's dependencies explicit and improves testability. The error handling structure is appropriate.
116-269: LGTM: Well-structured scenario-based discovery logic.The refactoring into three clear scenarios greatly improves code readability and maintainability:
- Not enough peers overall - must discover
- First active discovery round - forced discovery
- Subsequent rounds - process last round results
The special handling for successful SeedStrap rounds (lines 197-211) is a thoughtful addition that optimizes the discovery process by transitioning to OnConnectMode early when sufficient peers are found during seed bootstrap.
comms/dht/src/network_discovery/seed_strap.rs (3)
59-125: LGTM: Well-structured main entry point with comprehensive error handling.The
next_eventmethod provides a clean interface and handles both success and error cases appropriately. The detailed logging and round info tracking will be valuable for debugging and monitoring.
127-573: LGTM: Comprehensive seed discovery implementation with robust error handling.The
discover_peers_via_seedsmethod is well-implemented with:
- Proper seed peer selection and randomization
- Comprehensive error handling and logging
- Malicious seed detection and banning
- Early exit conditions based on configuration
- Detailed progress tracking and statistics
The peer validation logic (lines 357-420) correctly handles different types of validation errors and appropriately bans seeds that provide invalid data.
575-661: LGTM: Proper RPC connection handling with appropriate timeouts.The
fetch_peers_from_connectionmethod correctly establishes RPC connections and makes appropriate requests with reasonable peer count limits.base_layer/core/src/base_node/state_machine_service/states/listening.rs (3)
33-33: LGTM: Appropriate imports for DHT bootstrap integration.The new imports properly support the DHT event handling and discovery phase tracking functionality.
168-231: LGTM: Robust handling of missed DHT events during startup.The logic to check for missed DHT bootstrap events addresses potential timing issues where events might be published before the listening state starts. The handling of
BootstrapMethodDeterminedandPrimaryBootstrapCompleteevents ensures consistent state regardless of event timing.
233-452: LGTM: Well-implemented concurrent event handling with proper bootstrap integration.The refactored event loop using
tokio::select!properly handles both chain metadata and DHT events concurrently. Key strengths:
- Appropriate bootstrap completion checks before sync decisions
- Correct UI state updates based on bootstrap phase progress
- Proper handling of event stream errors and lagging
- Clean separation between bootstrap and normal sync logic
The bootstrap phase tracking (lines 400-424) correctly updates the UI state based on SeedStrap progress and marks completion when appropriate.
comms/dht/src/network_discovery/state_machine.rs (13)
59-74: LGTM: Well-designed bootstrap method enumThe
BootstrapMethodenum clearly represents the three bootstrap approaches with descriptive names and a properDisplayimplementation for logging purposes.
79-79: LGTM: Consistent SeedStrap state integrationThe
SeedStrapstate is properly integrated into theStateenum with correct display formatting and helper methodis_seed_strap()for state checking.Also applies to: 92-92, 107-109
115-115: LGTM: Clear event naming for bootstrap decisionThe
InitialPeersSufficientevent clearly indicates when bootstrap can be skipped due to existing peers, with descriptive display text.Also applies to: 131-131
177-179: LGTM: Appropriate context fields for bootstrap trackingThe new fields
bootstrap_methodandbootstrap_started_atprovide necessary state tracking for the bootstrap process with proper Arc<RwLock<>> wrapping for concurrent access.
197-229: LGTM: Well-structured bootstrap management methodsThe bootstrap management methods properly handle state updates and event publishing:
set_bootstrap_methodupdates state and publishes notificationmark_bootstrap_startedrecords timingcomplete_bootstrapcalculates duration and publishes completionThe async/await usage and logging are appropriate.
231-259: LGTM: Comprehensive event publishing with detailed loggingThe enhanced
publish_eventmethod provides excellent debugging capabilities by logging the specific event type, receiver count, and success/failure status. This will be valuable for troubleshooting bootstrap issues.
298-299: LGTM: Proper initialization of bootstrap tracking fieldsThe new context fields are correctly initialized with appropriate default values (None bootstrap method, no start time).
329-339: LGTM: Proper SeedStrap state transition with bootstrap trackingThe transition from
InitializingtoSeedStrapcorrectly:
- Marks bootstrap as started
- Sets bootstrap method
- Creates the SeedStrap state
The alternative transition for
InitialPeersSufficientappropriately skips to Ready state while completing bootstrap tracking.
340-359: LGTM: Robust SeedStrap completion handlingThe SeedStrap completion logic properly:
- Publishes peer discovery events when new peers are found
- Handles both success and failure cases
- Marks bootstrap complete on success
- Transitions to Ready state appropriately
The failure handling with waiting period is a good defensive approach.
379-395: Excellent error handling to prevent UI deadlockThe error handling logic that marks bootstrap complete even during SeedStrap failures is a crucial defensive measure to prevent the UI from getting stuck waiting for bootstrap completion.
This demonstrates good understanding of system-wide interactions and failure modes.
425-451: Robust bootstrap timeout implementationThe bootstrap timeout logic is well-implemented:
- Uses
tokio::select!to race between normal operation and timeout- Clones context to avoid borrow checker issues
- Forces completion on timeout to prevent indefinite waiting
- Uses configured timeout duration
The approach of forcing bootstrap completion on timeout ensures the system doesn't hang indefinitely while waiting for seed nodes.
453-460: LGTM: Correct bootstrap completion detectionThe bootstrap completion detection correctly identifies when bootstrap finishes:
DiscoveryCompletefrom SeedStrap stateInitialPeersSufficientfor skipped bootstrapThis ensures the timeout logic is disabled once bootstrap is done.
498-516: LGTM: Clear discovery phase trackingThe
DiscoveryPhaseenum and enhancedDhtNetworkDiscoveryRoundInfoprovide good visibility into the current discovery phase with appropriate fields for round tracking.The
Defaultimplementation defaulting toGeneralphase is sensible for ongoing discovery after bootstrap.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
base_layer/core/tests/tests/block_validation.rs (1)
98-108: Consider consolidating network variable declarations.The network configuration logic is sound and ensures consistent test environment. However, there's a redundant variable declaration that could be improved.
-let network = Network::Esmeralda; -if std::env::var("TARI_NETWORK").is_err() { - std::env::set_var("TARI_NETWORK", network.as_key_str()); -} -if Network::get_current_or_user_setting_or_default() != network { - let _ = Network::set_current(network); -} -let current_network = Network::get_current_or_user_setting_or_default(); -if current_network != network { - panic!("could not set network"); -} +const EXPECTED_NETWORK: Network = Network::Esmeralda; +if std::env::var("TARI_NETWORK").is_err() { + std::env::set_var("TARI_NETWORK", EXPECTED_NETWORK.as_key_str()); +} +if Network::get_current_or_user_setting_or_default() != EXPECTED_NETWORK { + let _ = Network::set_current(EXPECTED_NETWORK); +} +let current_network = Network::get_current_or_user_setting_or_default(); +if current_network != EXPECTED_NETWORK { + panic!("Failed to set network to {:?}, current network is {:?}", EXPECTED_NETWORK, current_network); +}And remove the duplicate declaration at line 114:
-let network = Network::Esmeralda; +let network = EXPECTED_NETWORK;This eliminates the duplicate variable declaration and provides a more descriptive panic message.
base_layer/core/src/proof_of_work/monero_rx/helpers.rs (1)
143-143: Appropriate log level reduction for expected error condition.This change correctly reduces log noise by changing from
warn!totrace!for Monero extra field deserialization failures. As noted in the code comments, these failures are expected behavior per Monero consensus rules and don't represent actual errors.However, consider updating the similar logging in
parse_extra_field_truncate_on_error(line 118) for consistency, as it handles the same type of error but still useswarn!level.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
base_layer/core/src/base_node/state_machine_service/states/listening.rs(4 hunks)base_layer/core/src/mempool/mempool_storage.rs(1 hunks)base_layer/core/src/proof_of_work/monero_rx/helpers.rs(1 hunks)base_layer/core/tests/tests/block_validation.rs(2 hunks)base_layer/core/tests/tests/node_state_machine.rs(7 hunks)comms/dht/src/network_discovery/test.rs(5 hunks)comms/dht/src/rpc/mock.rs(0 hunks)
💤 Files with no reviewable changes (1)
- comms/dht/src/rpc/mock.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- base_layer/core/tests/tests/node_state_machine.rs
- base_layer/core/src/base_node/state_machine_service/states/listening.rs
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: test (mainnet, stagenet)
- GitHub Check: Cucumber tests / Base Layer
- GitHub Check: test (testnet, esmeralda)
- GitHub Check: test (nextnet, nextnet)
- GitHub Check: Cucumber tests / FFI
- GitHub Check: cargo check with stable
- GitHub Check: ci
🔇 Additional comments (7)
base_layer/core/src/mempool/mempool_storage.rs (1)
130-130: LGTM - Aligns with bootstrap improvements but consider debug logging alternative.This change effectively reduces log noise during bootstrap/sync operations where maturity errors are expected for time-locked transactions. The selective suppression makes sense in the context of the improved bootstrap process.
However, completely removing logging for maturity errors might hinder debugging in unexpected scenarios. Consider adding a debug-level log statement to retain diagnostic capability while reducing noise:
- Err(ValidationError::MaturityError) => Ok(TxStorageResponse::NotStoredTimeLocked), + Err(ValidationError::MaturityError) => { + debug!(target: LOG_TARGET, "Transaction not stored due to maturity error (time-locked)"); + Ok(TxStorageResponse::NotStoredTimeLocked) + },Note: The AI summary mentions that both
UnknownInputsandMaturityErrorcases had warning logs removed, but only theMaturityErrorcase shows changes in the annotated code.Likely an incorrect or invalid review comment.
base_layer/core/tests/tests/block_validation.rs (2)
28-28: LGTM - Appropriate dependency for test serialization.The
serial_test::serialimport is correctly added to support the#[serial]attribute for preventing test interference.
96-96: LGTM - Serial execution prevents environment variable conflicts.The
#[serial]attribute ensures this test runs sequentially, preventing race conditions when modifying global environment variables and network configuration.comms/dht/src/network_discovery/test.rs (4)
109-119: LGTM! Import additions support the new bootstrap functionality.The new imports are properly organized and directly support the bootstrap tracking features being added to the test context.
145-146: LGTM! Bootstrap tracking fields properly initialized.The new fields
bootstrap_methodandbootstrap_started_atare correctly initialized with appropriate async-safe wrappers (Arc<RwLock<>>) for shared state tracking in the network discovery context.
195-195: Good improvement using struct update syntax.Using
..Default::default()makes the code more maintainable and ensures all fields are properly initialized.
203-203:Details
✅ Verification successful
Verify the state transition change aligns with the new bootstrap behavior.
The test function was renamed from
it_transitions_to_on_connecttoit_transitions_to_idleand now expectsStateEvent::Idleinstead ofStateEvent::OnConnectMode. This suggests a significant change in the state machine behavior.Please verify that this state transition change is intentional and correct by running the following script to check for related state machine modifications:
Also applies to: 218-218
🏁 Script executed:
#!/bin/bash # Description: Verify state machine changes related to OnConnectMode and Idle states # Expected: Find evidence of state machine updates that justify this test change echo "=== Searching for OnConnectMode references ===" rg -A 3 -B 3 "OnConnectMode" echo "=== Searching for Idle state handling ===" rg -A 3 -B 3 "StateEvent::Idle" echo "=== Checking state machine files for bootstrap-related changes ===" fd "state_machine" --type f | head -5 | xargs rg -l "bootstrap|Bootstrap"Length of output: 11138
Verified: Idle transition aligns with updated bootstrap logic
Theit_transitions_to_idletest correctly reflects the changes incomms/dht/src/network_discovery/ready.rs, which now returnsStateEvent::Idlein the tested scenarios. TheOnConnectModebranch still exists for its original conditions, so renaming the test and expectingStateEvent::Idleis intentional and accurate.
There was a problem hiding this comment.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
base_layer/core/src/base_node/state_machine_service/states/listening.rs (1)
141-452: 🛠️ Refactor suggestionConsider breaking down the next_event method for better maintainability.
This method has grown to over 300 lines, making it difficult to maintain and test. Consider extracting logical sections into separate methods.
Suggested refactoring approach:
impl Listening { pub async fn next_event<B: BlockchainBackend + 'static>( &mut self, shared: &mut BaseNodeStateMachine<B>, network_silence: bool, ) -> StateEvent { if network_silence { return self.handle_network_silence(shared); } self.setup_initial_bootstrap_state(shared).await; self.run_event_loop(shared).await } async fn setup_initial_bootstrap_state<B: BlockchainBackend + 'static>( &mut self, shared: &mut BaseNodeStateMachine<B>, ) { // Lines 156-231 logic here } async fn run_event_loop<B: BlockchainBackend + 'static>( &mut self, shared: &mut BaseNodeStateMachine<B>, ) -> StateEvent { // Lines 232-451 logic here } async fn handle_chain_metadata_event<B: BlockchainBackend + 'static>( &mut self, shared: &mut BaseNodeStateMachine<B>, metadata_event: &ChainMetadataEvent, // other parameters ) -> Option<StateEvent> { // Chain metadata handling logic } async fn handle_dht_event<B: BlockchainBackend + 'static>( &mut self, shared: &mut BaseNodeStateMachine<B>, dht_event: &DhtEvent, ) { // DHT event handling logic } }
🧹 Nitpick comments (1)
base_layer/core/src/base_node/state_machine_service/states/listening.rs (1)
168-210: Consider optimizing the missed DHT events processing loop.While the logic for handling missed DHT events is sound, the current implementation could be inefficient if many events need to be processed. Consider adding a limit to prevent excessive processing during startup.
// Try to receive any recent DHT events that might have been published before we started listening let mut events_processed = 0; +const MAX_MISSED_EVENTS_TO_PROCESS: usize = 100; loop { + if events_processed >= MAX_MISSED_EVENTS_TO_PROCESS { + warn!(target: LOG_TARGET, "[BN SM LISTENING] Reached maximum missed events limit ({}), stopping processing", MAX_MISSED_EVENTS_TO_PROCESS); + break; + } match dht_events_check.try_recv() {
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
base_layer/core/src/base_node/state_machine_service/states/listening.rs(5 hunks)base_layer/core/tests/tests/node_state_machine.rs(7 hunks)comms/dht/src/rpc/mod.rs(0 hunks)
💤 Files with no reviewable changes (1)
- comms/dht/src/rpc/mod.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- base_layer/core/tests/tests/node_state_machine.rs
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: test (testnet, esmeralda)
- GitHub Check: test (nextnet, nextnet)
- GitHub Check: test (mainnet, stagenet)
- GitHub Check: ci
- GitHub Check: cargo check with stable
🔇 Additional comments (4)
base_layer/core/src/base_node/state_machine_service/states/listening.rs (4)
33-33: LGTM: Import additions are appropriate.The new imports for
DhtEventandDiscoveryPhasealign with the DHT bootstrap functionality being added.
401-424: Good fallback mechanism for bootstrap completion.The logic to mark bootstrap as complete when SeedStrap reports completion (lines 414-421) provides a robust fallback in case the
PrimaryBootstrapCompleteevent is missed. This is a good defensive programming practice.
317-320: Address the TODO comment about premature bootstrap_phase clearing.The comment indicates a potential issue with clearing the bootstrap_phase too early. The current conditional check
if shared.is_primary_bootstrap_completeaddresses this concern well.
254-265:Details
✅ Verification successful
Verify bootstrap completion logic consistency.
The logic that skips sync decisions while bootstrapping looks correct, but ensure this doesn't create edge cases where the node gets stuck in a bootstrap state.
🏁 Script executed:
#!/bin/bash # Description: Check for potential bootstrap state management issues in related files # Search for other places where is_primary_bootstrap_complete is used rg -A 5 -B 5 "is_primary_bootstrap_complete" # Check for bootstrap completion logic in DHT components rg -A 5 -B 5 "PrimaryBootstrapComplete" # Look for bootstrap phase state transitions rg -A 5 -B 5 "bootstrap_phase"Length of output: 48456
Bootstrap state management logic verified.
I reviewed all references tois_primary_bootstrap_complete,bootstrap_phase, and the handling ofDhtEvent::PrimaryBootstrapComplete(and related peer‐sync events). The code unambiguously sets the flag on receipt of the primary bootstrap event (and on final SeedStrap rounds) and immediately clears the UI bootstrap phase. There are no pathways where the node could remain indefinitely stuck in bootstrapping.
Description
Implement a better bootstrap process so we don't overload seed nodes
Motivation and Context
Implement the following bootstrap process:
Basically the seed nodes should never be connected to for very long, because otherwise their connection pool dries up.
How Has This Been Tested?
"It works on my machine"
What process can a PR reviewer use to test or verify this change?
Clear the whole ~/.tari/mainnet folder, and start the node up from scratch. Look for the following debug lines in network.log:
[comms::dht::network_discovery::seed_strap] [Thread:55429085] DEBUG Attempting to discover peers via seed nodes. // comms/dht/src/network_discovery/seed_strap.rs:61And then:
[comms::dht::network_discovery::seed_strap] [Thread:55429095] INFO Added 992 peers via seed nodes. Transitioning to Ready state. // comms/dht/src/network_discovery/seed_strap.rs:71Breaking Changes
Summary by CodeRabbit
New Features
Improvements
Bug Fixes