Problem Description
I'm running Vespa config servers (3 replicas) in Kubernetes using StatefulSets on AWS EKS.
Current Setup
- Vespa config servers: 3 replicas in StatefulSet
- Kubernetes: AWS EKS
- Pod Disruption Budget: maxUnavailable: 1
- No readiness probes (as documented - config servers won't report healthy until ZooKeeper quorum is reached)
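For concreteness, the setup above corresponds roughly to manifests like the following. This is a sketch, not my exact config; the names, labels, and image tag are placeholders:

```yaml
# Sketch of the current setup (names and image tag are placeholders).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vespa-configserver
spec:
  serviceName: vespa-configserver
  replicas: 3
  selector:
    matchLabels:
      app: vespa-configserver
  template:
    metadata:
      labels:
        app: vespa-configserver
    spec:
      containers:
        - name: configserver
          image: vespaengine/vespa:8.517.15
          # Intentionally no readinessProbe, per the Vespa docs.
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vespa-configserver-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: vespa-configserver
```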
The Issue
During StatefulSet rolling updates:
- Pod-2 gets restarted (container starts quickly)
- Kubernetes thinks pod-2 is "ready" since there's no readiness probe
- Kubernetes immediately proceeds to restart pod-1
- But pod-2 hasn't actually joined the ZooKeeper cluster yet (takes several minutes)
- Now only pod-0 is functional → quorum is lost → cluster becomes unavailable
Root Cause
The fundamental conflict:
- Vespa: Config servers can't report ready until enough pods are running to establish ZooKeeper quorum (2 out of 3)
- Kubernetes: StatefulSet rolling updates need readiness probes to know when to proceed to next pod
- Result: Without readiness probes, Kubernetes proceeds too quickly and breaks quorum
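A blunt stopgap for the "proceeds too quickly" part is spec.minReadySeconds (stable for StatefulSets since Kubernetes 1.25, so available on EKS 1.32), which makes the rollout wait a fixed time after each pod becomes Ready before moving on. It's a guess at ZooKeeper rejoin time, not a real health signal:

```yaml
# StatefulSet spec fragment; 300s is an assumed rejoin time, not a guarantee.
spec:
  minReadySeconds: 300  # delay between pods during rolling update
```

This papers over the race rather than fixing it, but it at least gives each pod a window to rejoin the ensemble before the next one is taken down.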
The Service Routing Problem
If we do add readiness probes, a restarting config server pod gets stuck in a deadlock instead:
- Pod restarts but hasn't rejoined ZooKeeper cluster yet
- Readiness probe fails → Service removes pod from endpoints
- Other config servers can't communicate with restarting pod (no Service routing)
- Pod can't rejoin ZooKeeper without inter-pod communication
- Pod never becomes ready → permanent deadlock
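One mitigation for the routing half of the deadlock (not the rollout-ordering half): a headless Service with publishNotReadyAddresses: true, so DNS records are published even for pods that are not Ready, and config servers can keep talking to a restarting peer. A sketch, assuming the pod labels and the standard config server port:

```yaml
# Headless Service sketch; selector labels are assumed to match the StatefulSet.
apiVersion: v1
kind: Service
metadata:
  name: vespa-configserver
spec:
  clusterIP: None                  # headless: DNS resolves directly to pod IPs
  publishNotReadyAddresses: true   # keep DNS entries for not-ready pods
  selector:
    app: vespa-configserver
  ports:
    - name: configserver
      port: 19071
```

With this in place, a failing readiness probe no longer cuts off inter-pod communication, so "probe fails → pod unreachable → can never rejoin" is broken.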
Questions
- Is there a recommended pattern for handling this in Kubernetes?
- Could we use a custom readiness probe that checks ZooKeeper cluster membership rather than /state/v1/health?
- Any plans to support a "ZooKeeper-aware" readiness endpoint that works during rolling updates?
- What's the official recommendation for production Kubernetes deployments with this constraint?
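On the second question, something along these lines might work as an interim hack: probe ZooKeeper membership directly instead of /state/v1/health. Assumptions here: ZooKeeper listens on 2181 inside the pod, nc is available in the image, and the srvr four-letter command is whitelisted (4lw.commands.whitelist), none of which I've verified against Vespa's embedded ZooKeeper config:

```yaml
# Container spec fragment; ports and whitelist status are assumptions.
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # Ready only once this node reports a ZooKeeper role (leader/follower),
      # i.e. it has actually joined the ensemble.
      - 'echo srvr | nc -w 2 localhost 2181 | grep -q "Mode: \(leader\|follower\)"'
  initialDelaySeconds: 30
  periodSeconds: 10
```

Note that a probe like this only helps if inter-pod routing does not depend on readiness (e.g. a headless Service with publishNotReadyAddresses: true); otherwise it recreates the deadlock described above.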
Environment
- Vespa version: 8.517.15
- Kubernetes: AWS EKS 1.32
- Config server replicas: 3
- StatefulSet configuration: Standard with PDB
This seems like a common operational challenge for anyone running Vespa config servers in Kubernetes. Any guidance would be greatly appreciated!