Silent wrong-service communication when IP_COOLDOWN_PERIOD < nf_conntrack_tcp_timeout_syn_sent causes stale conntrack DNAT to route SYN to recycled pod IP #3634

@Mohijeet

Description

What happened:

A silent wrong-service communication occurs when all of the following align:

  1. Client pod initiates a TCP dial to a Service VIP
  2. Backend pod dies abruptly mid-handshake (OOMKill / crash / forced eviction)
  3. Client app times out (e.g. 5s dial timeout) and closes the socket
  4. OS reuses the same source port on retry (valid — SYN_SENT → CLOSED has no TIME_WAIT)
  5. Pod IP is recycled within IP_COOLDOWN_PERIOD window
  6. New pod starts with the same recycled IP
  7. Stale conntrack DNAT routes new SYN to new pod → TCP handshake succeeds silently
  8. Client is now talking to the wrong service with no error, no RST, no log
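The race in steps 1–8 can be reduced to a comparison of three timers. Below is a minimal pure-Python sketch of that timeline; the function and parameter names are illustrative, and only the timer defaults (5s dial timeout, 30s cooldown, 120s SYN_SENT timeout) come from this report. It simplifies by assuming the retry fires as soon as the app timeout closes the socket:

```python
# Illustrative model of the race between the stale conntrack entry's
# lifetime and pod IP recycling. Structure is hypothetical; the timer
# defaults are the ones described in this issue.

CONNTRACK_SYN_SENT_TIMEOUT = 120  # net.netfilter.nf_conntrack_tcp_timeout_syn_sent
IP_COOLDOWN_PERIOD = 30           # VPC CNI default

def retry_hits_stale_dnat(app_dial_timeout: float, ip_recycled_at: float) -> bool:
    """Return True if a retried SYN (reusing the same source port) can be
    DNAT'ed to a recycled pod IP by the stale conntrack entry.

    t=0                        first SYN; conntrack entry created (SYN_SENT)
    t=app_dial_timeout         app closes socket; src port instantly reusable
    t=ip_recycled_at           dead pod's IP handed to a new pod
    t=SYN_SENT timeout (120s)  stale conntrack entry finally expires
    """
    retry_at = app_dial_timeout  # earliest retry with the reused port
    return (retry_at < CONNTRACK_SYN_SENT_TIMEOUT
            and ip_recycled_at < CONNTRACK_SYN_SENT_TIMEOUT)

# With defaults: 5s dial timeout, IP recycled right after the 30s cooldown.
print(retry_hits_stale_dnat(5, IP_COOLDOWN_PERIOD))  # True  -> silent misroute
print(retry_hits_stale_dnat(5, 150))                 # False -> entry expired first
```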

Root cause — timer mismatch between two independent subsystems:

IP_COOLDOWN_PERIOD                          (VPC CNI)  default: 30s
net.netfilter.nf_conntrack_tcp_timeout_syn_sent  (kernel)   default: 120s

When a TCP handshake fails mid-way (SYN_SENT state), the Linux kernel retains the conntrack DNAT entry for nf_conntrack_tcp_timeout_syn_sent seconds (120s by default). This entry maps:

clientIP:srcPort → ServiceVIP:port → DNAT → deadPodIP:port

Critically:

  • The TCP socket is closed immediately after the app timeout — no TIME_WAIT for SYN_SENT state connections, port is free instantly
  • The conntrack entry stays alive for 120 seconds independently
  • The OS port allocator has no awareness of conntrack — it sees the port as free and can reuse it

If IP_COOLDOWN_PERIOD=30s allows the dead pod's IP to be recycled before the 120s conntrack entry expires, the stale DNAT silently routes the retried SYN to the new pod occupying the recycled IP.

The required invariant that is currently violated by defaults:

IP_COOLDOWN_PERIOD > nf_conntrack_tcp_timeout_syn_sent
     30s           >           120s                     ← VIOLATED
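The invariant is a one-line comparison, sketched here as a standalone check (the function name is illustrative; in practice the cooldown would come from the aws-node DaemonSet env and the kernel value from the `net.netfilter.nf_conntrack_tcp_timeout_syn_sent` sysctl on the node):

```python
def cooldown_invariant_holds(ip_cooldown_s: int, syn_sent_timeout_s: int) -> bool:
    """The recycled pod IP must stay quarantined at least as long as any
    stale SYN_SENT conntrack entry pointing at it can live."""
    return ip_cooldown_s > syn_sent_timeout_s

# Defaults from this report:
print(cooldown_invariant_holds(30, 120))   # False -> violated
# One possible mitigation: raise the cooldown above the kernel timeout.
print(cooldown_invariant_holds(130, 120))  # True
```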

Why preStop hooks do not prevent this:

preStop hooks only execute during graceful termination (SIGTERM). They are completely bypassed in:

  • OOMKill — SIGKILL is sent directly, preStop never runs
  • Node pressure eviction — kubelet may skip the graceful period entirely
  • Pod crash — container exits before preStop can fire

In all these cases no FIN exchange occurs, so conntrack never naturally transitions out of SYN_SENT state — the entry simply ages out after 120 seconds.


Why this is silent and dangerous:

Normal failure (RST / timeout):
  Client gets error → retry logic fires → recoverable ✅

This failure (stale conntrack + recycled IP):
  TCP handshake succeeds to wrong pod
  HTTP response returns 404 from wrong service
  No error, no RST, no log entry anywhere
  Client silently processes wrong data ❌

This is not detectable at the TCP or HTTP layer without mTLS identity verification or explicit pod identity headers.
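As an application-level guard, the client can refuse responses that lack the expected service identity. This is a hypothetical sketch: the header name `X-Pod-Service-Identity` and the assumption that every backend sets it are illustrative, not part of any standard:

```python
# Hypothetical guard: since the misroute is invisible at the TCP/HTTP layer,
# the client verifies an explicit identity header that each backend is
# assumed to set. Header name and values are illustrative.

EXPECTED_SERVICE = "payments"

def verify_backend_identity(headers: dict[str, str]) -> None:
    got = headers.get("X-Pod-Service-Identity")
    if got != EXPECTED_SERVICE:
        raise RuntimeError(
            f"wrong backend: expected {EXPECTED_SERVICE!r}, got {got!r}")

verify_backend_identity({"X-Pod-Service-Identity": "payments"})  # passes silently
try:
    verify_backend_identity({"X-Pod-Service-Identity": "inventory"})
except RuntimeError as e:
    print(e)  # wrong backend: expected 'payments', got 'inventory'
```

mTLS gives the same guarantee cryptographically; this header check is the minimal non-cryptographic version of the idea.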


Conditions required for this bug to trigger:

| Condition | Detail |
| -- | -- |
| Abrupt pod termination | OOMKill, crash, forced eviction — no FIN exchange |
| Client dial timeout < 120s | App closes socket before conntrack entry expires |
| OS source port reuse | Ephemeral pool reuses same port (especially under tcp_tw_reuse=1) |
| IP recycled within 120s | IP_COOLDOWN_PERIOD=30s default allows this |
| New pod gets same IP | K8s IP pool recycles IPs — common on busy nodes |

All five conditions are default behavior in a standard EKS cluster. This is not a corner case.

Environment: EKS

  • Kubernetes version (use kubectl version): 1.33
  • CNI Version v1.20.1-eksbuild.1
  • OS (e.g: cat /etc/os-release): Bottlerocket
