[Question] Stale neighbor/ARP cache with wrong MAC for reused pod IP causing intermittent connection timeouts #3628
Description
Summary
We are investigating an intermittent pod-to-pod connectivity issue on EKS
with Amazon VPC CNI. After significant debugging, we have narrowed the
symptom to a stale neighbor (ARP) cache entry on the source node holding
a wrong MAC address for a destination pod IP. We believe this wrong MAC
may belong to a previously deleted node's ENI, and the entry remains in
REACHABLE state indefinitely due to Linux NUD behavior.
We are opening this as a question to understand:
- Whether VPC CNI is expected to handle this via gratuitous ARP (GARP)
- Whether this is a known gap or an already-tracked issue
- What the recommended mitigation is
Environment
- Platform: AWS EKS
- Node OS: Bottlerocket
- CNI: Amazon VPC CNI (aws-node)
- Istio: Used as ingress gateway only (no sidecars, no ztunnel on the affected path)
- Node type: Standard EC2 instances with multiple ENIs (VPC CNI secondary IP mode)
Observed Symptom
Intermittent connection timeouts from a specific source node/pod to a
specific destination pod IP. The same destination pod IP works fine from
other source nodes. Deleting the neighbor cache entry on the source node
with `ip neigh del <dst-ip> dev eth0` fixes the issue immediately, and
after fresh ARP resolution a different (new) MAC is learned for the same IP.
Initially appeared as Istio ingress 503 UF / connection_timeout, but
direct curl from source node to destination pod IP:port reproduces the
failure, ruling out Istio configuration as the root cause.
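A minimal triage sketch for this symptom (addresses taken from the incident; the parsing step runs on a captured sample line, so live `ip neigh` access is not required to try it):

```shell
# Addresses from the incident; substitute your own.
DST_IP=10.131.55.225
DEV=eth0

# 1. Reproduce the timeout directly, bypassing Istio routing (commented;
#    requires the affected source node):
#      curl --connect-timeout 5 "http://${DST_IP}:8080/"
# 2. Inspect the neighbor entry for the failing destination:
#      ip -s neigh show "${DST_IP}" dev "${DEV}"

# Extracting the NUD state and cached MAC from a sample `ip neigh` line:
sample='10.131.55.225 dev eth0 lladdr 0a:47:d5:85:ae:29 ref 1 used 0/0/0 REACHABLE'
state=$(printf '%s\n' "$sample" | awk '{print $NF}')
mac=$(printf '%s\n' "$sample" | awk '{for (i = 1; i < NF; i++) if ($i == "lladdr") print $(i + 1)}')
echo "state=${state} mac=${mac}"
```

A cached MAC in REACHABLE state that differs from the MAC a fresh `arping` resolves is the signature described in this report.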
Debugging Steps Performed
1. Direct pod IP reachability confirmed as the failure point
Direct curl from source node to destination pod IP:port timed out,
bypassing any VirtualService or Istio routing logic. Same curl from a
healthy source node to the same destination pod IP worked immediately.
2. Failure is source-node-specific and destination-IP-specific
- Source node A → destination pod IP X: fails
- Source node B → destination pod IP X: works
- Source node A → replacement pod IP Y (same destination node): works
This ruled out the destination application, destination node health, and
generic cluster-wide issues.
3. tcpdump on source node showed only SYN retransmits
```
10.131.60.193 -> 10.131.55.225:8080 [S]
10.131.60.193 -> 10.131.55.225:8080 [S] (retransmit)
10.131.60.193 -> 10.131.55.225:8080 [S] (retransmit)
...
```
No SYN-ACK was ever seen. The TCP handshake never completed.
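To confirm which link-layer address the retransmitted SYNs are actually addressed to, `tcpdump -e` prints Ethernet headers. The capture command is commented since it needs a live node; the parsing below runs on a sample line using the incident's destination MAC, with the source MAC invented for illustration:

```shell
# On the affected source node (commented; requires root):
#   tcpdump -eni eth0 "host 10.131.55.225 and tcp[tcpflags] & tcp-syn != 0"
# With -e, each line starts with "<src-mac> > <dst-mac>,", so the
# destination MAC of the outgoing SYN is visible directly.

# Sample -e line (source MAC is a made-up placeholder):
line='02:1a:bc:de:f0:11 > 0a:47:d5:85:ae:29, ethertype IPv4 (0x0800), length 74: 10.131.60.193.41234 > 10.131.55.225.8080: Flags [S]'
dst_mac=$(printf '%s\n' "$line" | awk '{gsub(",", "", $3); print $3}')
echo "SYN sent to MAC ${dst_mac}"
```

If this MAC matches the stale neighbor entry rather than the MAC a fresh ARP resolves, the dead-MAC theory is confirmed for that flow.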
4. conntrack showed SYN_SENT [UNREPLIED]
```
src=10.131.60.193 dst=10.131.55.225 dport=8080 SYN_SENT [UNREPLIED]
```
SYN left the source but no reply was tracked by conntrack.
5. Simultaneous behavior: existing TCP flow works, new SYN fails
In one capture, a pre-existing established TCP connection between the
same source and destination IPs was still exchanging data while new SYN
packets to the same destination IP kept retransmitting and never
completed. This suggested the issue affects new connection establishment
and path resolution, not total connectivity.
6. `ip neigh` showed wrong MAC as REACHABLE
```
$ ip -s neigh show 10.131.49.193
10.131.49.193 dev eth0 lladdr 0a:47:d5:85:ae:29 ref 1 used ... REACHABLE
```
After `ip neigh del 10.131.49.193 dev eth0` and arping:
```
$ arping -c 5 -I eth0 10.131.49.193
Unicast reply from 10.131.49.193 [0a:04:bb:8d:7b:61]
```
A completely different MAC was resolved, and connectivity was immediately
restored. This pattern repeated across multiple destination pod IPs and
multiple incidents.
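During a live incident, the NUD transitions for the suspect entry can also be watched directly. The events below are a fabricated sample showing what a stuck entry looks like; a healthy entry would typically age to STALE and be re-probed:

```shell
# Live observation (commented; run on the affected source node):
#   ip monitor neigh | grep --line-buffered 10.131.49.193
# A stuck entry only ever reports REACHABLE. Fabricated sample events:
events='10.131.49.193 dev eth0 lladdr 0a:47:d5:85:ae:29 REACHABLE
10.131.49.193 dev eth0 lladdr 0a:47:d5:85:ae:29 REACHABLE'
stale_seen=$(printf '%s\n' "$events" | grep -c 'STALE\|PROBE' || true)
echo "STALE/PROBE events observed: ${stale_seen}"
```

Zero STALE/PROBE events over a long observation window, while traffic to the destination fails, is consistent with the confirmation-refresh behavior hypothesized below.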
7. Asymmetric reachability observed
- Source node → destination pod IP: fails
- Destination pod → source node: works
This strongly suggests the issue is on the outbound path from the source
node, not a bidirectional break.
8. conntrack table exhaustion ruled out
```
nf_conntrack_count = 1343
nf_conntrack_max = 131072
```
9. iptables ruled out
Compared iptables rules on faulty vs. healthy source nodes. No DROP/REJECT
rule specific to the affected pod IPs or ports was found.
Current Hypothesis (Not Confirmed)
We suspect the following sequence may be occurring:
1. A destination pod IP (secondary ENI IP) was previously assigned to an
ENI on a now-deleted node. The source node learned the correct MAC for
that IP at the time.
2. The node was deleted. The same pod IP was subsequently reassigned,
either to a new pod on the same destination node via a different ENI,
or to a pod on a different node entirely.
3. VPC CNI (or the AWS fabric) updates L3 routing for the IP correctly, but
the source node's neighbor cache is not invalidated. The old MAC, now
belonging to a deleted node's ENI, remains cached.
4. A separate TCP flow exists in which the destination pod has a connection
to the source node (for an unrelated purpose). The destination pod
sends packets to the source node via correct VPC L3 routing, and the source
node receives them.
5. Linux NUD treats these inbound packets as upper-layer confirmation
(`neigh_confirm()`) for the destination IP's neighbor entry, resetting
the REACHABLE timer, even though the outbound path uses a dead MAC.
6. The neighbor entry never goes STALE, so it never enters PROBE, the dead
MAC is never re-ARPed, and the stale entry persists indefinitely.
7. New TCP connections from source to destination use the dead MAC, are
silently dropped by the AWS fabric (no ICMP, no RST), and the SYNs
retransmit indefinitely.
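The aging interaction in steps 5 and 6 can be illustrated with a toy shell model. This is not the kernel's NUD implementation, just the timing argument: as long as confirmations arrive more often than the REACHABLE lifetime, the entry never ages out; without them, it would go STALE and be re-probed. The intervals are made up:

```shell
reachable_time=30    # stand-in for base_reachable_time_ms (in "seconds")
confirm_interval=10  # inbound packets from the destination arrive this often

# Run 1: confirmations keep resetting the entry's age.
state=REACHABLE; age=0; t=0
while [ "$t" -lt 120 ]; do
  t=$((t + 1)); age=$((age + 1))
  # neigh_confirm(): inbound traffic refreshes the entry.
  if [ $((t % confirm_interval)) -eq 0 ]; then age=0; fi
  if [ "$age" -ge "$reachable_time" ]; then state=STALE; fi
done
echo "with confirmations:    ${state}"

# Run 2: no confirmations; the entry ages normally.
state2=REACHABLE; age2=0; t2=0
while [ "$t2" -lt 120 ]; do
  t2=$((t2 + 1)); age2=$((age2 + 1))
  if [ "$age2" -ge "$reachable_time" ]; then state2=STALE; fi
done
echo "without confirmations: ${state2}"
```

The first run stays REACHABLE indefinitely even though the cached MAC is wrong, which matches the observed symptom.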
The core question is: In VPC CNI's IP reuse lifecycle (pod deletion →
IP released → IP reassigned to new pod/node), is a gratuitous ARP (GARP)
expected to be sent by VPC CNI or the AWS fabric to invalidate stale
neighbor entries on remote nodes?
Questions
1. Does VPC CNI send a gratuitous ARP when a secondary ENI IP is
assigned to a new pod? If yes, is there a scenario where this GARP
could be missed or fail to reach a particular source node?
2. Is `net.ipv4.conf.eth0.arp_notify=1` expected to be set on EKS nodes
with VPC CNI? This sysctl causes the kernel to send a GARP when an
IP address is added to an interface, which would naturally invalidate
stale entries on peer nodes.
3. Is there a known race condition or gap between IP reuse in VPC CNI
and neighbor cache invalidation on remote nodes, particularly when the
reused IP belongs to a completely different node than before?
4. Is the Linux NUD upper-layer confirmation behavior (where inbound
TCP packets from a destination keep its neighbor entry REACHABLE on the
source even when the outbound MAC is wrong) a known interaction with
VPC CNI's IP reuse model? Is this tracked anywhere?
5. Are there recommended sysctl settings (`base_reachable_time_ms`,
`gc_stale_time`, `arp_notify`) for EKS/VPC CNI nodes to reduce the
window in which stale neighbor entries can cause this class of failure?
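For the sysctl question, a sketch of the settings involved is below. The values are illustrative guesses, not AWS guidance; the per-interface keys assume the primary interface is eth0, and shortening these timers increases background ARP traffic, so they should be validated against node scale first:

```shell
# Illustrative values only -- not an AWS/EKS recommendation.
# Shorter REACHABLE lifetime: a stale entry is re-probed sooner
# (default is 30000 ms, randomized per entry).
sysctl -w net.ipv4.neigh.eth0.base_reachable_time_ms=15000
# Garbage-collect STALE entries sooner (default 60 s).
sysctl -w net.ipv4.neigh.eth0.gc_stale_time=30
# Send a gratuitous ARP when an address is added to eth0.
sysctl -w net.ipv4.conf.eth0.arp_notify=1
```

Note that shorter timers only narrow the window; if inbound traffic keeps confirming the entry as hypothesized above, the REACHABLE timer is still reset and the first setting alone may not help.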
Workaround
Manually running the following on the source node restores connectivity
immediately:
```
ip neigh del <destination-pod-ip> dev eth0
```
After deletion, fresh ARP resolution retrieves the correct MAC and new
TCP connections succeed.
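A sketch of automating the workaround follows. The addresses are placeholders, the actual flush commands are left commented since they require root on the affected node, and the guard only acts when the entry actually holds a cached (possibly stale) MAC:

```shell
DST_IP=10.131.55.225   # placeholder from the incident
DEV=eth0

# Only flush when the `ip neigh show` line holds an lladdr; an
# INCOMPLETE/FAILED entry has nothing cached worth deleting.
needs_flush() {
  case "$1" in
    *lladdr*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Sample line; on a live node use: ip neigh show "$DST_IP" dev "$DEV"
line='10.131.55.225 dev eth0 lladdr 0a:47:d5:85:ae:29 REACHABLE'
if needs_flush "$line"; then
  echo "would flush ${DST_IP} on ${DEV}"
  # ip neigh del "$DST_IP" dev "$DEV"    # requires root; then re-resolve:
  # arping -c 3 -I "$DEV" "$DST_IP"
fi
```

This is a mitigation, not a fix: it clears one stale entry on one node and does nothing to prevent recurrence when another IP is reused.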
References / Related
- Linux NUD state machine and upper-layer confirmation: `neigh_confirm()` in `net/core/neighbour.c`
- `net.ipv4.conf.all.arp_notify` kernel documentation
- Other CNIs (Antrea, Calico, Flannel) explicitly send GARP on pod startup to handle this class of stale ARP problem
- Amazon VPC CNI repo: https://github.com/aws/amazon-vpc-cni-k8s
Additional Context
Happy to provide additional node diagnostics, kernel versions, VPC CNI
version, or packet captures if helpful for investigation.