Configserver rolling updates in Kubernetes statefulset #34250

@ttrumm

Description

Problem Description

I'm running Vespa config servers (3 replicas) in Kubernetes using StatefulSets on AWS EKS.

Current Setup

  • Vespa config servers: 3 replicas in StatefulSet
  • Kubernetes: AWS EKS
  • Pod Disruption Budget: maxUnavailable: 1
  • No readiness probes (as documented - config servers won't report healthy until ZooKeeper quorum is reached)

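For reference, a minimal sketch of the setup described above (resource names, labels, and the image are illustrative placeholders, not taken from the actual deployment):

```yaml
# Illustrative sketch of the reported setup; names are hypothetical.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vespa-configserver-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: vespa-configserver
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vespa-configserver
spec:
  serviceName: vespa-configserver
  replicas: 3
  selector:
    matchLabels:
      app: vespa-configserver
  template:
    metadata:
      labels:
        app: vespa-configserver
    spec:
      containers:
        - name: configserver
          image: vespaengine/vespa  # placeholder image reference
          # No readinessProbe, per the documented constraint
```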
The Issue

During StatefulSet rolling updates:

  1. Pod-2 gets restarted (container starts quickly)
  2. Kubernetes thinks pod-2 is "ready" since there's no readiness probe
  3. Kubernetes immediately proceeds to restart pod-1
  4. But pod-2 hasn't actually joined the ZooKeeper cluster yet (takes several minutes)
  5. Now only pod-0 is functional → quorum is lost → cluster becomes unavailable

Root Cause

The fundamental conflict:

  • Vespa: Config servers can't report ready until enough pods are running to establish ZooKeeper quorum (2 out of 3)
  • Kubernetes: StatefulSet rolling updates need readiness probes to know when to proceed to next pod
  • Result: Without readiness probes, Kubernetes proceeds too quickly and breaks quorum
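One way to sidestep this conflict (a sketch, not an official recommendation) is to opt out of automatic rollout pacing entirely with the OnDelete update strategy, so each pod is replaced only when an operator deletes it and can wait for ZooKeeper quorum to recover between pods:

```yaml
# Sketch: disable automatic rolling updates; pods pick up the new
# spec only when deleted (e.g. kubectl delete pod vespa-configserver-2),
# letting an operator verify quorum before touching the next pod.
spec:
  updateStrategy:
    type: OnDelete
```

This trades automation for safety: the rollout becomes manual (or driven by an external controller), but Kubernetes can no longer restart a second config server before the first has rejoined the quorum.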

The Service Routing Problem

Conversely, if readiness probes are added, a restarting config server pod can get stuck in a deadlock:

  1. Pod restarts but hasn't rejoined ZooKeeper cluster yet
  2. Readiness probe fails → Service removes pod from endpoints
  3. Other config servers can't communicate with restarting pod (no Service routing)
  4. Pod can't rejoin ZooKeeper without inter-pod communication
  5. Pod never becomes ready → permanent deadlock
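A common way to break step 3 of this cycle, used by upstream ZooKeeper and etcd StatefulSet examples, is a headless Service with `publishNotReadyAddresses: true`, so per-pod DNS records exist even while a pod is unready and peers can still reach it directly. A sketch, assuming the Service name matches the StatefulSet's `serviceName` and the ZooKeeper port:

```yaml
# Sketch: headless Service that keeps DNS records for unready pods,
# so config servers can talk to each other during restarts.
# Name, selector, and port are assumptions for illustration.
apiVersion: v1
kind: Service
metadata:
  name: vespa-configserver
spec:
  clusterIP: None                  # headless: per-pod DNS records
  publishNotReadyAddresses: true   # resolve pods even when not ready
  selector:
    app: vespa-configserver
  ports:
    - name: zookeeper
      port: 2181
```

With this in place, inter-pod ZooKeeper traffic no longer depends on readiness, so a readiness probe can be added for rollout pacing without creating the deadlock described above.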

Questions

  1. Is there a recommended pattern for handling this in Kubernetes?
  2. Could we use a custom readiness probe that checks ZooKeeper cluster membership rather than /state/v1/health?
  3. Any plans to support a "ZooKeeper-aware" readiness endpoint that works during rolling updates?
  4. What's the official recommendation for production Kubernetes deployments with this constraint?
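As a concrete illustration of question 2, a custom probe could ask the local ZooKeeper instance directly rather than hitting `/state/v1/health`. A hypothetical sketch using the ZooKeeper `ruok` four-letter command (the port and the availability of `nc` in the container are assumptions, and `ruok` must be whitelisted via `4lw.commands.whitelist`):

```yaml
# Hypothetical readiness probe: ready once the local ZooKeeper
# instance answers "ruok" with "imok". Port 2181 is an assumption.
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - 'echo ruok | nc -w 2 localhost 2181 | grep -q imok'
  initialDelaySeconds: 30
  periodSeconds: 10
```

Note that on its own this still risks the routing deadlock described above; it would need to be combined with something like a headless Service that publishes not-ready addresses.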

Environment

  • Vespa version: 8.517.15
  • Kubernetes: AWS EKS 1.32
  • Config server replicas: 3
  • StatefulSet configuration: Standard with PDB

This seems like a common operational challenge for anyone running Vespa config servers in Kubernetes. Any guidance would be greatly appreciated!
