enhance: Implement Load-Aware Quorum Selection Strategy #114

@tinswzy

Description

The current Quorum selection follows either a Static Regional or a Pure Random strategy: the system selects nodes from a pre-defined Zone (e.g., us-east-1a) or picks them at random across the cluster, in both cases satisfying the $W + R > N$ consistency requirement.
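
For reference, the two baseline strategies and the overlap condition can be sketched as follows. This is only an illustration: the function names, the `nodes` mapping (node id to zone), and the tie-breaking by sorted id are assumptions, not the project's actual code.

```python
import random


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when every write quorum (size w) intersects every read quorum (size r)."""
    return w + r > n


def select_static_zone(nodes: dict[str, str], zone: str, k: int) -> list[str]:
    """Static regional strategy: take k nodes from one pre-defined zone."""
    in_zone = sorted(node for node, z in nodes.items() if z == zone)
    if len(in_zone) < k:
        raise ValueError(f"zone {zone!r} has fewer than {k} nodes")
    return in_zone[:k]


def select_random(nodes: dict[str, str], k: int, rng: random.Random) -> list[str]:
    """Pure random strategy: every node has the same chance, regardless of load."""
    return rng.sample(sorted(nodes), k)


# Hypothetical cluster: node id -> zone.
cluster = {"n1": "us-east-1a", "n2": "us-east-1a", "n3": "us-east-1b"}
zonal = select_static_zone(cluster, "us-east-1a", 2)
anywhere = select_random(cluster, 2, random.Random(0))
```

Neither function looks at runtime load, which is the gap this issue targets.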

The Problem:
Static/Random selection is "blind" to the Runtime Heterogeneity of nodes. Even within the same designated region, certain nodes may experience:

  • Resource Contention: Heavy background jobs or localized "hotkeys" causing CPU spikes.
  • I/O Wait: Disk saturation on specific storage nodes.
  • Network Jitter: Micro-bursts affecting specific rack-level switches.

Relying solely on regional proximity or randomness leads to Tail Latency (P99) inflation, as the Quorum is only as fast as its slowest (most loaded) member.
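
A small sketch of why one loaded member dominates: if an operation waits for every selected node, its latency is the maximum over the selection. The latency values and the `quorum_latency_ms` helper are made up for illustration.

```python
def quorum_latency_ms(latencies_ms: dict[str, float], selected: list[str]) -> float:
    """Latency of an operation that waits for every selected node to respond."""
    return max(latencies_ms[node] for node in selected)


# Hypothetical per-node response times; node "c" is under heavy load.
latencies_ms = {"a": 2.0, "b": 3.0, "c": 45.0}

fast = quorum_latency_ms(latencies_ms, ["a", "b"])  # bounded by "b"
slow = quorum_latency_ms(latencies_ms, ["a", "c"])  # bounded by the loaded node "c"
```

A load-blind selector picks the second quorum as readily as the first, so the loaded node's latency shows up directly in the tail.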

Proposed Solution: Load-Aware "Smart" Selection
We need to evolve the selection logic to a Hybrid Strategy:

  1. Regional Constraint (Tier 1): Maintain data sovereignty and low latency by filtering nodes within the preferred Zone.
  2. Load Filtering (Tier 2): From the candidate pool in Tier 1, prioritize nodes based on a Dynamic Health Score.
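
The two tiers could be composed roughly like this. It is a minimal sketch under assumptions: the issue does not define the Dynamic Health Score, so `health_score` here is simply the unused fraction `1 - load_factor`, and `Node` and `select_hybrid` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: str
    zone: str
    load_factor: float  # assumed scale: 0.0 (idle) to 1.0 (saturated)


def health_score(node: Node) -> float:
    # Assumption: health is the unused fraction of the node's capacity.
    return 1.0 - node.load_factor


def select_hybrid(nodes: list[Node], preferred_zone: str, k: int) -> list[Node]:
    # Tier 1: keep only nodes in the preferred zone.
    candidates = [n for n in nodes if n.zone == preferred_zone]
    if len(candidates) < k:
        raise ValueError(f"zone {preferred_zone!r} has fewer than {k} nodes")
    # Tier 2: healthiest first; node_id breaks ties deterministically.
    ranked = sorted(candidates, key=lambda n: (-health_score(n), n.node_id))
    return ranked[:k]


nodes = [
    Node("n1", "us-east-1a", 0.90),
    Node("n2", "us-east-1a", 0.20),
    Node("n3", "us-east-1a", 0.40),
    Node("n4", "us-east-1b", 0.05),
]
picked = [n.node_id for n in select_hybrid(nodes, "us-east-1a", 2)]
```

A strict top-k ranking like this sends all traffic to the currently healthiest nodes, which is one reason to prefer the weighted probabilistic variant described in this issue.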

Proposed Algorithm Logic:

  • Weighted Probabilistic Selection: Instead of a uniform 1/N chance, use $P(node_i) = \frac{Capacity_i - CurrentLoad_i}{\sum_j (Capacity_j - CurrentLoad_j)}$, so each node's probability is proportional to its available capacity.
  • Penalty Box: Temporarily "mute" nodes whose load exceeds a critical threshold (e.g., 85% CPU) for a short window (Cool-off period).
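
Both ideas can be sketched together as below. This is an assumption-laden illustration, not a proposed final API: `PenaltyBox`, `selection_weights`, and `weighted_select` are hypothetical names, the 30-second cool-off is an arbitrary placeholder, and load is treated as a single number per node.

```python
import random


class PenaltyBox:
    """Mutes a node for a cool-off window once its load exceeds a threshold."""

    def __init__(self, threshold: float = 0.85, cooloff_s: float = 30.0):
        self.threshold = threshold
        self.cooloff_s = cooloff_s  # assumed value; the issue only says "short window"
        self._muted_until: dict[str, float] = {}

    def observe(self, node_id: str, load: float, now: float) -> None:
        if load > self.threshold:
            self._muted_until[node_id] = now + self.cooloff_s

    def is_muted(self, node_id: str, now: float) -> bool:
        return now < self._muted_until.get(node_id, float("-inf"))


def selection_weights(capacity: dict[str, float], load: dict[str, float]) -> dict[str, float]:
    """P(node_i) = (capacity_i - load_i) / sum_j (capacity_j - load_j)."""
    headroom = {n: max(capacity[n] - load[n], 0.0) for n in capacity}
    total = sum(headroom.values())
    if total == 0.0:
        # No headroom anywhere: fall back to uniform rather than divide by zero.
        return {n: 1.0 / len(headroom) for n in headroom}
    return {n: h / total for n, h in headroom.items()}


def weighted_select(capacity, load, k, box, now, rng):
    """Pick k distinct, non-muted nodes, favouring those with more headroom."""
    eligible = [n for n in capacity if not box.is_muted(n, now)]
    if len(eligible) < k:
        raise ValueError("not enough non-muted nodes")
    pool = selection_weights({n: capacity[n] for n in eligible},
                             {n: load[n] for n in eligible})
    chosen = []
    for _ in range(k):
        names = list(pool)
        weights = [pool[n] for n in names]
        # random.choices rejects all-zero weights, so use uniform in that case.
        pick = rng.choices(names, weights=weights if sum(weights) > 0 else None, k=1)[0]
        chosen.append(pick)
        del pool[pick]  # sample without replacement
    return chosen
```

For example, three nodes with capacity 1.0 and loads 0.5, 0.75, and 0.75 get selection probabilities 0.5, 0.25, and 0.25. Drawing without replacement means the later picks no longer follow the formula exactly; whether that matters is a design question for the real implementation.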

Technical Implementation

  • Metadata Enhancement: The Node Manager should heartbeat not just "ALIVE" status, but also a LoadFactor (0.0 - 1.0).
  • Selection Interface: Refactor QuorumRouter.select() to accept a LoadProvider interface.
  • Fall-through Logic: If all nodes in a high-priority Zone are heavily loaded, the strategy should decide whether to:
    • Option A: Stick to the Zone but accept higher latency.
    • Option B: "Overflow" to a lower-load node in a neighboring Zone (if latency allows).
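
One possible shape for these pieces, sketched in Python for illustration. Only `QuorumRouter.select()`, `LoadProvider`, and `LoadFactor` come from this issue; the `Heartbeat` fields, `HeartbeatLoadProvider`, the `allow_overflow` switch, and the neighbor map are assumptions. The "if latency allows" check for Option B is not modeled here.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Heartbeat:
    node_id: str
    alive: bool
    load_factor: float  # 0.0 - 1.0, reported alongside liveness


class LoadProvider(Protocol):
    def load_factor(self, node_id: str) -> float: ...


class HeartbeatLoadProvider:
    """Serves the most recent LoadFactor reported by each node."""

    def __init__(self) -> None:
        self._latest: dict[str, Heartbeat] = {}

    def record(self, hb: Heartbeat) -> None:
        self._latest[hb.node_id] = hb

    def load_factor(self, node_id: str) -> float:
        hb = self._latest.get(node_id)
        # Assumption: unknown or dead nodes count as fully loaded.
        return 1.0 if hb is None or not hb.alive else hb.load_factor


class QuorumRouter:
    def __init__(self, zones, neighbors, max_load_threshold=0.85, allow_overflow=False):
        self.zones = zones              # zone -> list of node ids
        self.neighbors = neighbors      # zone -> neighboring zones, nearest first
        self.max_load_threshold = max_load_threshold
        self.allow_overflow = allow_overflow

    def select(self, zone: str, k: int, loads: LoadProvider) -> list[str]:
        def by_load(ids):
            return sorted(ids, key=lambda n: (loads.load_factor(n), n))

        def ok(n):
            return loads.load_factor(n) <= self.max_load_threshold

        local = by_load(self.zones[zone])
        healthy = [n for n in local if ok(n)]
        if len(healthy) >= k or not self.allow_overflow:
            chosen = local[:k]  # Option A: stay in the zone, least loaded first
        else:
            # Option B: top up with healthy nodes from neighboring zones.
            remote = by_load(n for z in self.neighbors.get(zone, [])
                             for n in self.zones[z] if ok(n))
            chosen = healthy + remote[: k - len(healthy)]
            # Still short: fall back to loaded local nodes.
            chosen += [n for n in local if n not in chosen][: k - len(chosen)]
        if len(chosen) < k:
            raise ValueError("not enough nodes to form a quorum")
        return chosen
```

With zone `z1` holding nodes at load 0.2 and 0.95 and a neighbor zone holding one node at 0.3, a router with `allow_overflow=False` keeps both `z1` nodes, while `allow_overflow=True` swaps the loaded node for the neighbor.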

Acceptance Criteria

  • The Quorum selector can prioritize low-load nodes within the same region.
  • Introduce a max_load_threshold configuration to prevent pushing "unhealthy but alive" nodes over the edge.
  • Metrics added to track "Quorum Selection Skew" (how often we deviate from random due to load).
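
A rough sketch of how the threshold and the skew metric might look. The issue does not define "Quorum Selection Skew" precisely; the version below assumes it is the total variation distance between observed selection shares and a uniform distribution (0.0 means indistinguishable from random, values near 1.0 mean highly concentrated). `SelectorConfig`, `eligible_nodes`, and `SelectionSkew` are hypothetical names, and the 0.85 default is borrowed from the 85% example in this issue.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectorConfig:
    max_load_threshold: float = 0.85  # assumed default

    def __post_init__(self) -> None:
        if not 0.0 < self.max_load_threshold <= 1.0:
            raise ValueError("max_load_threshold must be in (0.0, 1.0]")


def eligible_nodes(load_factors: dict[str, float], config: SelectorConfig) -> list[str]:
    """Drop nodes that are alive but already past the configured load threshold."""
    return sorted(n for n, lf in load_factors.items()
                  if lf <= config.max_load_threshold)


class SelectionSkew:
    """Tracks how far observed selections drift from a uniform (random) pick."""

    def __init__(self, node_ids: list[str]) -> None:
        self._counts = Counter({n: 0 for n in node_ids})

    def record(self, selected: list[str]) -> None:
        self._counts.update(selected)

    def value(self) -> float:
        total = sum(self._counts.values())
        if total == 0:
            return 0.0
        uniform = 1.0 / len(self._counts)
        # Total variation distance between observed shares and uniform.
        return 0.5 * sum(abs(c / total - uniform) for c in self._counts.values())


skew = SelectionSkew(["a", "b"])
skew.record(["a"])
skew.record(["a"])
one_sided = skew.value()  # every selection went to "a"
```

Exporting `value()` as a gauge per zone would show when load-aware selection is concentrating traffic on a few nodes.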

Labels: enhancement (New feature or request)
