vtgate/buffer: reduce hot-path latency with lock-free shard lookup and atomic state #19801
Description
Summary
The VTGate query buffer is checked on every PRIMARY query. In the normal case (no failover in progress), the check should be near-zero cost. Currently the hot path takes two mutex acquisitions per query — one on the global buffer map and one on the per-shard buffer. Under parallel load this creates measurable contention.
Current Hot Path (no buffering active)
```
WaitForFailoverEnd()
  → buf.getOrCreateBuffer(keyspace, shard)  // buf.mu.Lock() — global write lock
    → sb.mu.RLock()                         // per-shard read lock
      → shouldBufferLocked() returns false
    → sb.mu.RUnlock()
```
Every query across all shards contends on buf.mu even though the shard buffer already exists and the operation is read-only.
Proposed Changes
1. Atomic state for fast-path idle check
Replace the sb.mu.RLock() + shouldBufferLocked() + sb.mu.RUnlock() with an atomic state field. The idle check becomes a single atomic.LoadInt32() with no lock.
```go
type shardBuffer struct {
	state atomic.Int32 // read atomically on hot path; written under mu
	// mu still protects queue, timers, etc.
}
```
```go
func (sb *shardBuffer) waitForFailoverEnd(...) {
	if bufferState(sb.state.Load()) == stateIdle {
		return nil, nil // no lock needed
	}
	// slow path: take the lock
}
```
State transitions (start buffering, drain) write the atomic under mu — they're already on the slow path.
2. Replace global mutex with sync.Map for shard lookup
The shard map is read on every query (hot) and written only when a shard is first seen (cold — happens once per shard at startup). This is the exact access pattern sync.Map is optimized for.
```go
type Buffer struct {
	buffers sync.Map // string → *shardBuffer, lock-free reads
}
```
WaitForFailoverEnd and HandleKeyspaceEvent use Load() — no lock. getOrCreateBuffer uses LoadOrStore — lock-free when the shard already exists.
3. String state → integer state
The current bufferState is a string type. Integer comparison is cheaper than string comparison on the hot path, and an integer is also what the atomic state field in change 1 requires.
Expected Impact
We implemented these optimizations in a similar buffer system and benchmarked:
| Benchmark | Before | After | Change |
|---|---|---|---|
| Idle shard check (serial) | ~22 ns | ~14 ns | -36% |
| Idle shard check (parallel, 10 goroutines) | ~200 ns | ~1.7 ns | -99% |
| Multi-shard parallel (8 shards) | ~150 ns | ~2.3 ns | -98% |
The parallel improvements are dramatic because the lock contention is completely eliminated — every goroutine reads independently.
Affected Code
- go/vt/vtgate/buffer/buffer.go — Buffer.buffers map + mu mutex
- go/vt/vtgate/buffer/shard_buffer.go — shardBuffer.state field + shouldBufferLocked()
- go/vt/vtgate/tabletgateway.go — caller of WaitForFailoverEnd