Description
The current Quorum selection follows a Static Regional or Pure Random strategy. In this model, the system selects nodes either from a pre-defined Zone (e.g., us-east-1a) or picks them randomly across the cluster to satisfy the $W+R > N$ consistency requirement.
The Problem:
Static/Random selection is "blind" to the Runtime Heterogeneity of nodes. Even within the same designated region, certain nodes may experience:
- Resource Contention: Heavy background jobs or localized "hotkeys" causing CPU spikes.
- I/O Wait: Disk saturation on specific storage nodes.
- Network Jitter: Micro-bursts affecting specific rack-level switches.
Relying solely on regional proximity or randomness leads to Tail Latency (P99) inflation, as the Quorum is only as fast as its slowest (most loaded) member.
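The effect can be illustrated with a small calculation. This is a sketch under two simplifying assumptions that are not stated in the issue: each node is independently "slow" with the same probability, and a request waits for every member of the selected Quorum. The function name `p_quorum_hits_slow_node` is hypothetical.

```python
def p_quorum_hits_slow_node(p_slow: float, quorum_size: int) -> float:
    """Probability that at least one of `quorum_size` nodes is slow.

    Assumes nodes are slow independently with probability `p_slow`.
    """
    return 1.0 - (1.0 - p_slow) ** quorum_size


if __name__ == "__main__":
    # With a 1% chance of any single node being slow, larger Quorums are
    # increasingly likely to include at least one slow member.
    for k in (1, 2, 3, 5):
        print(k, round(p_quorum_hits_slow_node(0.01, k), 4))
```

Under these assumptions, the chance of a request being slowed grows with the Quorum size, which is why lowering the per-node probability of picking a loaded member matters most at the tail.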
Proposed Solution: Load-Aware "Smart" Selection
We need to evolve the selection logic to a Hybrid Strategy:
- Regional Constraint (Tier 1): Maintain data sovereignty and low latency by restricting candidates to nodes within the preferred Zone.
- Load Filtering (Tier 2): From the candidate pool in Tier 1, prioritize nodes based on a Dynamic Health Score.
Proposed Algorithm Logic:
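- Weighted Probabilistic Selection: Instead of a uniform 1/N chance, select node $i$ with probability $P(node_i) = \frac{Capacity_i - CurrentLoad_i}{\sum_j (Capacity_j - CurrentLoad_j)}$, i.e. in proportion to its share of the candidate pool's available capacity.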
- Penalty Box: Temporarily "mute" nodes whose load exceeds a critical threshold (e.g., 85% CPU) for a short window (Cool-off period).
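The two rules above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `Node` and `PenaltyBox` types, the 5-second cool-off, the fallback to uniform weights when every node is saturated, and the decision to ignore the penalty box when too few nodes remain are all assumptions made for the example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    capacity: float      # hypothetical unit, e.g. requests/s the node can serve
    current_load: float  # same unit as capacity
    cpu: float = 0.0     # CPU utilisation in [0.0, 1.0]


@dataclass
class PenaltyBox:
    threshold: float = 0.85   # the 85% CPU example from the proposal
    cool_off_s: float = 5.0   # assumed cool-off window
    _muted_until: dict = field(default_factory=dict)

    def observe(self, node: Node, now: float) -> None:
        # Mute (or re-mute) a node that is currently above the threshold.
        if node.cpu > self.threshold:
            self._muted_until[node.name] = now + self.cool_off_s

    def is_muted(self, node: Node, now: float) -> bool:
        return now < self._muted_until.get(node.name, 0.0)


def selection_weights(nodes):
    """Return {name: P(node)}, proportional to available capacity."""
    avail = {n.name: max(n.capacity - n.current_load, 0.0) for n in nodes}
    total = sum(avail.values())
    if total == 0:
        # Assumption: if every node is saturated, fall back to uniform 1/N.
        return {name: 1.0 / len(avail) for name in avail}
    return {name: a / total for name, a in avail.items()}


def pick_quorum(nodes, k, box, now, rng=random):
    """Pick k distinct nodes, weighted by available capacity."""
    for n in nodes:
        box.observe(n, now)
    pool = [n for n in nodes if not box.is_muted(n, now)]
    if len(pool) < k:
        # Assumption: availability beats the penalty box when too few remain.
        pool = list(nodes)
    chosen = []
    while len(chosen) < k and pool:
        weights = selection_weights(pool)
        pick = rng.choices(pool, weights=[weights[n.name] for n in pool], k=1)[0]
        chosen.append(pick)
        pool.remove(pick)  # sample without replacement, then re-normalise
    return chosen


if __name__ == "__main__":
    nodes = [
        Node("a", capacity=100, current_load=20, cpu=0.30),
        Node("b", capacity=100, current_load=60, cpu=0.50),
        Node("c", capacity=100, current_load=90, cpu=0.95),  # gets muted
    ]
    print(selection_weights(nodes))
    print([n.name for n in pick_quorum(nodes, 2, PenaltyBox(), now=100.0)])
```

Weights are recomputed after each pick so that the remaining candidates always sum to 1. A production version would also need to decide how `now` is sourced (for example a monotonic clock) and whether the threshold should apply to CPU alone or to a combined health score.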
Technical Implementation
- Metadata Enhancement: The Node Manager's heartbeat should carry not just an "ALIVE" status but also a LoadFactor (0.0 to 1.0).
- Selection Interface: Refactor QuorumRouter.select() to accept a LoadProvider interface.
- Fall-through Logic: If all nodes in a high-priority Zone are heavily loaded, the strategy should decide whether to:
  - Option A: Stick to the Zone but accept higher latency.
  - Option B: "Overflow" to a lower-load node in a neighboring Zone (if latency allows).
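One possible shape for these three pieces is sketched below. Only `QuorumRouter.select()`, `LoadProvider`, and `LoadFactor` come from the proposal; the `Heartbeat` fields, the `HeartbeatLoadProvider` class, the `overload` threshold, and the `allow_overflow` switch are hypothetical. For brevity the sketch ranks nodes deterministically by load rather than using weighted sampling, and it does not model the "if latency allows" check for Option B.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Heartbeat:
    node_id: str
    zone: str
    alive: bool
    load_factor: float  # 0.0 (idle) to 1.0 (saturated)


class LoadProvider(Protocol):
    def load_factor(self, node_id: str) -> float: ...


class HeartbeatLoadProvider:
    """Serves the most recent LoadFactor reported by each node."""

    def __init__(self):
        self._latest = {}

    def record(self, hb: Heartbeat) -> None:
        self._latest[hb.node_id] = hb

    def load_factor(self, node_id: str) -> float:
        hb = self._latest.get(node_id)
        # Assumption: unknown or dead nodes are treated as fully loaded.
        return hb.load_factor if hb and hb.alive else 1.0


class QuorumRouter:
    def __init__(self, loads: LoadProvider, overload=0.85, allow_overflow=True):
        self.loads = loads
        self.overload = overload
        self.allow_overflow = allow_overflow

    def select(self, zones, k):
        """zones: lists of node ids, ordered by preference (preferred first)."""
        load = self.loads.load_factor
        preferred = sorted(zones[0], key=load)
        healthy = [n for n in preferred if load(n) < self.overload]
        if len(healthy) >= k:
            return healthy[:k]
        if self.allow_overflow:
            # Option B: top up from lower-load nodes in neighbouring Zones.
            neighbours = sorted((n for z in zones[1:] for n in z), key=load)
            spill = [n for n in neighbours if load(n) < self.overload]
            chosen = (healthy + spill)[:k]
            if len(chosen) == k:
                return chosen
        # Option A: stay in the Zone and take its least-loaded nodes.
        return preferred[:k]


if __name__ == "__main__":
    lp = HeartbeatLoadProvider()
    for node_id, zone, lf in [("a1", "A", 0.90), ("a2", "A", 0.95),
                              ("a3", "A", 0.40), ("b1", "B", 0.20),
                              ("b2", "B", 0.30)]:
        lp.record(Heartbeat(node_id, zone, True, lf))
    zones = [["a1", "a2", "a3"], ["b1", "b2"]]
    print(QuorumRouter(lp).select(zones, 2))
    print(QuorumRouter(lp, allow_overflow=False).select(zones, 2))
```

Keeping the choice between Option A and Option B behind a single flag makes it a per-deployment policy decision. Workloads with data-sovereignty constraints would presumably keep overflow disabled.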
Acceptance Criteria