vtgate/buffer: reduce hot-path latency with lock-free shard lookup and atomic state #19801
Description
Summary
The VTGate query buffer is checked on every PRIMARY query. In the normal case (no failover in progress), the check should be near-zero cost. Currently the hot path takes two mutex acquisitions per query — one on the global buffer map and one on the per-shard buffer. Under parallel load this creates measurable contention.
Current Hot Path (no buffering active)
```
WaitForFailoverEnd()
  → buf.getOrCreateBuffer(keyspace, shard)  // buf.mu.Lock() — global write lock
    → sb.mu.RLock()                         // per-shard read lock
      → shouldBufferLocked() returns false
    → sb.mu.RUnlock()
```
Every query across all shards contends on buf.mu even though the shard buffer already exists and the operation is read-only.
Proposed Changes
1. Atomic state for fast-path idle check
Replace the sb.mu.RLock() + shouldBufferLocked() + sb.mu.RUnlock() with an atomic state field. The idle check becomes a single atomic.LoadInt32() with no lock.
```go
type shardBuffer struct {
	state atomic.Int32 // read atomically on hot path; written under mu
	// mu still protects queue, timers, etc.
}
```
```go
func (sb *shardBuffer) waitForFailoverEnd(...) {
	if bufferState(sb.state.Load()) == stateIdle {
		return nil, nil // no lock needed
	}
	// slow path: take the lock
}
```
State transitions (start buffering, drain) write the atomic under mu — they're already on the slow path.
2. Replace global mutex with sync.Map for shard lookup
The shard map is read on every query (hot) and written only when a shard is first seen (cold — happens once per shard at startup). This is the exact access pattern sync.Map is optimized for.
```go
type Buffer struct {
	buffers sync.Map // string → *shardBuffer, lock-free reads
}
```
WaitForFailoverEnd and HandleKeyspaceEvent use Load() — no lock. getOrCreateBuffer uses LoadOrStore — lock-free when the shard already exists.
3. String state → integer state
The current bufferState is a string type. Integer comparison is cheaper than string comparison on the hot path, and an integer is also what the atomic state field in change 1 requires.
Expected Impact
We implemented these optimizations in a similar buffer system and benchmarked:
| Benchmark | Before | After | Change |
|---|---|---|---|
| Idle shard check (serial) | ~22 ns | ~14 ns | -36% |
| Idle shard check (parallel, 10 goroutines) | ~200 ns | ~1.7 ns | -99% |
| Multi-shard parallel (8 shards) | ~150 ns | ~2.3 ns | -98% |
The parallel improvements are dramatic because the lock contention is completely eliminated — every goroutine reads independently.
Affected Code
- go/vt/vtgate/buffer/buffer.go — Buffer.buffers map + mu mutex
- go/vt/vtgate/buffer/shard_buffer.go — shardBuffer.state field + shouldBufferLocked()
- go/vt/vtgate/tabletgateway.go — caller of WaitForFailoverEnd