Silent wrong-service communication when IP_COOLDOWN_PERIOD < nf_conntrack_tcp_timeout_syn_sent causes stale conntrack DNAT to route SYN to recycled pod IP #3634
Description
What happened:
A silent wrong-service communication occurs when all of the following align:
- Client pod initiates a TCP dial to a Service VIP
- Backend pod dies abruptly mid-handshake (OOMKill / crash / forced eviction)
- Client app times out (e.g. 5s dial timeout) and closes the socket
- OS reuses the same source port on retry (valid — SYN_SENT → CLOSED has no TIME_WAIT)
- Pod IP is recycled within the IP_COOLDOWN_PERIOD window
- New pod starts with the same recycled IP
- Stale conntrack DNAT routes new SYN to new pod → TCP handshake succeeds silently
- Client is now talking to the wrong service with no error, no RST, no log
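The sequence above can be replayed in a toy model. This is only a sketch of the reasoning, not of the kernel or the VPC CNI: the `ToyNode` class, the IP addresses, the port, and the workload names `orders` and `billing` are all hypothetical, and the model ignores details such as SYN retransmissions refreshing the conntrack timer.

```python
from dataclasses import dataclass, field

SYN_SENT_TIMEOUT_S = 120  # kernel default cited in this issue
IP_COOLDOWN_S = 30        # VPC CNI default cited in this issue


@dataclass
class ToyNode:
    # flow tuple -> (DNAT target pod IP, expiry time in seconds)
    conntrack: dict = field(default_factory=dict)
    # pod IP -> workload currently holding that IP
    ip_owner: dict = field(default_factory=dict)

    def send_syn(self, now, flow, pick_backend):
        """Return the pod IP that a SYN for `flow` is delivered to."""
        entry = self.conntrack.get(flow)
        if entry is not None and entry[1] > now:
            return entry[0]  # live entry: the old DNAT target is reused
        backend = pick_backend()  # no live entry: fresh backend selection
        self.conntrack[flow] = (backend, now + SYN_SENT_TIMEOUT_S)
        return backend


def simulate(cooldown_s=IP_COOLDOWN_S, retry_at_s=35):
    """Return the workload that receives the client's retried SYN."""
    node = ToyNode()
    # Hypothetical flow: client IP, source port, Service VIP, Service port.
    flow = ("10.0.1.5", 40000, "172.20.0.10", 80)
    node.ip_owner = {"10.0.2.7": "orders", "10.0.2.9": "orders"}

    # t=0s: client dials the VIP; DNAT selects the orders pod at 10.0.2.7.
    node.send_syn(0, flow, lambda: "10.0.2.7")

    # t=1s: that pod is killed mid-handshake and its IP enters cooldown.
    released_at_s = 1
    del node.ip_owner["10.0.2.7"]

    # Once the cooldown has elapsed, the IP is handed to an unrelated pod.
    if retry_at_s >= released_at_s + cooldown_s:
        node.ip_owner["10.0.2.7"] = "billing"

    # Retry: the client timed out earlier, closed the socket, and the OS
    # reuses source port 40000, so the flow tuple is identical.
    target_ip = node.send_syn(retry_at_s, flow, lambda: "10.0.2.9")
    return node.ip_owner.get(target_ip, "nobody")


print(simulate())                # default timers -> billing
print(simulate(cooldown_s=150))  # cooldown outlasts the entry -> nobody
print(simulate(retry_at_s=130))  # retry after the entry expired -> orders
```

With the default timers the retried SYN reaches `billing`, the wrong workload. With a cooldown longer than the conntrack timeout it reaches an unassigned IP, which is an ordinary, recoverable timeout. Once the entry has expired, a fresh backend is selected.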
Root cause — timer mismatch between two independent subsystems:
IP_COOLDOWN_PERIOD (VPC CNI) default: 30s
net.netfilter.nf_conntrack_tcp_timeout_syn_sent (kernel) default: 120s
When a TCP handshake fails mid-way (SYN_SENT state), the Linux kernel retains the conntrack DNAT entry for nf_conntrack_tcp_timeout_syn_sent seconds (120s by default). This entry maps:
clientIP:srcPort → ServiceVIP:port → DNAT → deadPodIP:port
Critically:
- The TCP socket is closed immediately after the app timeout — no TIME_WAIT for SYN_SENT state connections, port is free instantly
- The conntrack entry stays alive for 120 seconds independently
- The OS port allocator has no awareness of conntrack — it sees the port as free and can reuse it
If IP_COOLDOWN_PERIOD=30s allows the dead pod's IP to be recycled before the 120s conntrack entry expires, the stale DNAT silently routes the retried SYN to the new pod occupying the recycled IP.
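The size of that gap can be worked out directly. The sketch below is a rough upper bound under simplifying assumptions: it treats the conntrack entry as created at the moment the IP is released and ignores timer refreshes from retransmitted SYNs. The function names are hypothetical. The `/proc` path is the standard file form of the sysctl named above and only exists when the conntrack module is loaded, so the reader falls back to the 120s default cited in this issue.

```python
from pathlib import Path

SYN_SENT_SYSCTL = Path("/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_syn_sent")


def read_syn_sent_timeout_s(path=SYN_SENT_SYSCTL, default=120):
    """Read the kernel timeout, falling back to the default cited in this issue."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return default


def exposure_window_s(ip_cooldown_s, syn_sent_timeout_s):
    """Seconds during which a recycled IP can still be hit via a stale entry."""
    return max(0, syn_sent_timeout_s - ip_cooldown_s)


def invariant_holds(ip_cooldown_s, syn_sent_timeout_s):
    """True when the IP cooldown strictly outlasts the SYN_SENT timeout."""
    return ip_cooldown_s > syn_sent_timeout_s


print(exposure_window_s(30, 120))  # 90
print(invariant_holds(30, 120))    # False
```

With the defaults, a recycled IP stays reachable through a stale entry for up to about 90 seconds.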
The required invariant that is currently violated by defaults:
IP_COOLDOWN_PERIOD > nf_conntrack_tcp_timeout_syn_sent
30s > 120s ← VIOLATED
Why preStop hooks do not prevent this:
preStop hooks only execute during graceful termination (SIGTERM). They are completely bypassed in:
- OOMKill — SIGKILL is sent directly, preStop never runs
- Node pressure eviction — kubelet may skip the graceful period entirely
- Pod crash — container exits before preStop can fire
In all these cases no FIN exchange occurs, so conntrack never naturally transitions out of SYN_SENT state — the entry simply ages out after 120 seconds.
Why this is silent and dangerous:
Normal failure (RST / timeout):
Client gets error → retry logic fires → recoverable ✅
This failure (stale conntrack + recycled IP):
TCP handshake succeeds to wrong pod
HTTP response returns 404 from wrong service
No error, no RST, no log entry anywhere
Client silently processes wrong data ❌
This is not detectable at the TCP or HTTP layer without mTLS identity verification or explicit pod identity headers.
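As an illustration of the "explicit pod identity headers" option, a client could check which service answered before trusting the response. This is a minimal sketch under assumptions: the header name `X-Service-Identity`, the `WrongBackendError` type, and the `verify_backend` helper are hypothetical, and every backend would have to set the header. Unlike mTLS, a plain header is not authenticated, so it guards against accidental misrouting only.

```python
IDENTITY_HEADER = "X-Service-Identity"  # hypothetical header set by every backend


class WrongBackendError(Exception):
    """Raised when a response comes from an unexpected service."""


def verify_backend(headers, expected_service):
    """Raise WrongBackendError unless the response names the expected service."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    actual = normalized.get(IDENTITY_HEADER.lower())
    if actual != expected_service:
        raise WrongBackendError(
            f"expected {expected_service!r}, got {actual!r}"
        )


# Usage: a response from the intended service passes the check,
# while one from a pod on a recycled IP is rejected.
verify_backend({"x-service-identity": "orders"}, "orders")
try:
    verify_backend({"X-Service-Identity": "billing"}, "orders")
except WrongBackendError as exc:
    print(f"rejected: {exc}")
```

A missing header is treated the same as a mismatch, so a backend that does not identify itself is rejected rather than trusted.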
Conditions required for this bug to trigger:
All five conditions are default behavior in a standard EKS cluster. This is not a corner case.
Environment: EKS
- Kubernetes version (use kubectl version): 1.33
- CNI Version: v1.20.1-eksbuild.1
- OS (e.g: cat /etc/os-release): Bottlerocket