Problem Description
I'm running Vespa config servers (3 replicas) in Kubernetes using StatefulSets on AWS EKS.
Current Setup
- Vespa config servers: 3 replicas in StatefulSet
- Kubernetes: AWS EKS
- Pod Disruption Budget: maxUnavailable: 1
- No readiness probes (as documented - config servers won't report healthy until ZooKeeper quorum is reached)
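For concreteness, the setup above corresponds roughly to manifests like the following. This is a sketch, not my exact config; the names, labels, and image tag are placeholders:

```yaml
# Sketch of the current setup (names and image tag are placeholders).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vespa-configserver
spec:
  serviceName: vespa-configserver
  replicas: 3
  selector:
    matchLabels:
      app: vespa-configserver
  template:
    metadata:
      labels:
        app: vespa-configserver
    spec:
      containers:
        - name: configserver
          image: vespaengine/vespa:8.517.15
          # Intentionally no readinessProbe, per the Vespa docs.
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vespa-configserver-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: vespa-configserver
```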
The Issue
During StatefulSet rolling updates:
- Pod-2 gets restarted (container starts quickly)
- Kubernetes thinks pod-2 is "ready" since there's no readiness probe
- Kubernetes immediately proceeds to restart pod-1
- But pod-2 hasn't actually joined the ZooKeeper cluster yet (takes several minutes)
- Now only pod-0 is functional → quorum is lost → cluster becomes unavailable
Root Cause
The fundamental conflict:
- Vespa: Config servers can't report ready until enough pods are running to establish ZooKeeper quorum (2 out of 3)
- Kubernetes: StatefulSet rolling updates need readiness probes to know when to proceed to next pod
- Result: Without readiness probes, Kubernetes proceeds too quickly and breaks quorum
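A blunt stopgap for the "proceeds too quickly" part is spec.minReadySeconds (stable for StatefulSets since Kubernetes 1.25, so available on EKS 1.32), which makes the rollout wait a fixed time after each pod becomes Ready before moving on. It's a guess at ZooKeeper rejoin time, not a real health signal:

```yaml
# StatefulSet spec fragment; 300s is an assumed rejoin time, not a guarantee.
spec:
  minReadySeconds: 300  # delay between pods during rolling update
```

This papers over the race rather than fixing it, but it at least gives each pod a window to rejoin the ensemble before the next one is taken down.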
The Service Routing Problem
If we do add readiness probes, a restarting config server pod gets stuck in a deadlock instead:
- Pod restarts but hasn't rejoined ZooKeeper cluster yet
- Readiness probe fails → Service removes pod from endpoints
- Other config servers can't communicate with restarting pod (no Service routing)
- Pod can't rejoin ZooKeeper without inter-pod communication
- Pod never becomes ready → permanent deadlock
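One mitigation for the routing half of the deadlock (not the rollout-ordering half): a headless Service with publishNotReadyAddresses: true, so DNS records are published even for pods that are not Ready, and config servers can keep talking to a restarting peer. A sketch, assuming the pod labels and the standard config server port:

```yaml
# Headless Service sketch; selector labels are assumed to match the StatefulSet.
apiVersion: v1
kind: Service
metadata:
  name: vespa-configserver
spec:
  clusterIP: None                  # headless: DNS resolves directly to pod IPs
  publishNotReadyAddresses: true   # keep DNS entries for not-ready pods
  selector:
    app: vespa-configserver
  ports:
    - name: configserver
      port: 19071
```

With this in place, a failing readiness probe no longer cuts off inter-pod communication, so "probe fails → pod unreachable → can never rejoin" is broken.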
Questions
- Is there a recommended pattern for handling this in Kubernetes?
- Could we use a custom readiness probe that checks ZooKeeper cluster membership rather than /state/v1/health?
- Any plans to support a "ZooKeeper-aware" readiness endpoint that works during rolling updates?
- What's the official recommendation for production Kubernetes deployments with this constraint?
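On the second question, something along these lines might work as an interim hack: probe ZooKeeper membership directly instead of /state/v1/health. Assumptions here: ZooKeeper listens on 2181 inside the pod, nc is available in the image, and the srvr four-letter command is whitelisted (4lw.commands.whitelist), none of which I've verified against Vespa's embedded ZooKeeper config:

```yaml
# Container spec fragment; ports and whitelist status are assumptions.
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # Ready only once this node reports a ZooKeeper role (leader/follower),
      # i.e. it has actually joined the ensemble.
      - 'echo srvr | nc -w 2 localhost 2181 | grep -q "Mode: \(leader\|follower\)"'
  initialDelaySeconds: 30
  periodSeconds: 10
```

Note that a probe like this only helps if inter-pod routing does not depend on readiness (e.g. a headless Service with publishNotReadyAddresses: true); otherwise it recreates the deadlock described above.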
Environment
- Vespa version: 8.517.15
- Kubernetes: AWS EKS 1.32
- Config server replicas: 3
- StatefulSet configuration: Standard with PDB
This seems like a common operational challenge for anyone running Vespa config servers in Kubernetes. Any guidance would be greatly appreciated!