[Question] Stale neighbor/ARP cache with wrong MAC for reused pod IP causing intermittent connection timeouts #3628

@slice-mohijeet

Description

Summary

We are investigating an intermittent pod-to-pod connectivity issue on EKS
with Amazon VPC CNI. After significant debugging, we have narrowed the
symptom to a stale neighbor (ARP) cache entry on the source node holding
a wrong MAC address for a destination pod IP. We believe this wrong MAC
may belong to a previously deleted node's ENI, and the entry remains in
REACHABLE state indefinitely due to Linux NUD behavior.

We are opening this as a question to understand:

  1. Whether VPC CNI is expected to handle this via gratuitous ARP (GARP)
  2. Whether this is a known gap or an already-tracked issue
  3. What the recommended mitigation is

Environment

  • Platform: AWS EKS
  • Node OS: Bottlerocket
  • CNI: Amazon VPC CNI (aws-node)
  • Istio: Used as ingress gateway only (no sidecars, no ztunnel on
    affected path)
  • Node type: Standard EC2 instances with multiple ENIs (VPC CNI
    secondary IP mode)

Observed Symptom

Intermittent connection timeouts from a specific source node/pod to a
specific destination pod IP. The same destination pod IP works fine from
other source nodes. Deleting the neighbor cache entry on the source node
with ip neigh del <dst-ip> dev eth0 fixes the issue immediately, and
after fresh ARP resolution a different (new) MAC is learned for the same IP.

This initially surfaced as Istio ingress 503 UF (upstream connection
failure) / connection_timeout errors, but a direct curl from the source
node to the destination pod IP:port reproduces the failure, ruling out
Istio configuration as the root cause.


Debugging Steps Performed

1. Direct pod IP reachability confirmed as the failure point

Direct curl from source node to destination pod IP:port timed out,
bypassing any VirtualService or Istio routing logic. Same curl from a
healthy source node to the same destination pod IP worked immediately.

2. Failure is source-node-specific and destination-IP-specific

  • Source node A → destination pod IP X: fails
  • Source node B → destination pod IP X: works
  • Source node A → replacement pod IP Y (same destination node): works

This ruled out the destination application, destination node health, and
generic cluster-wide issues.

3. tcpdump on source node showed only SYN retransmits

10.131.60.193 -> 10.131.55.225:8080 [S]
10.131.60.193 -> 10.131.55.225:8080 [S] (retransmit)
10.131.60.193 -> 10.131.55.225:8080 [S] (retransmit)
...

No SYN-ACK was ever seen. The TCP handshake never completed.

4. conntrack showed SYN_SENT [UNREPLIED]

src=10.131.60.193 dst=10.131.55.225 dport=8080 SYN_SENT [UNREPLIED]

SYN left the source but no reply was tracked by conntrack.

5. Simultaneous behavior: existing TCP flow works, new SYN fails

In one capture, a pre-existing established TCP connection between the
same source and destination IPs was still exchanging data while new SYN
packets to the same destination IP kept retransmitting and never
completed. This suggests the failure is specific to new connection
establishment (i.e., neighbor/path resolution), not a total loss of
connectivity.

6. ip neigh showed wrong MAC as REACHABLE

$ ip -s neigh show 10.131.49.193
10.131.49.193 dev eth0 lladdr 0a:47:d5:85:ae:29 ref 1 used ... REACHABLE

After ip neigh del 10.131.49.193 dev eth0 and arping:

$ arping -c 5 -I eth0 10.131.49.193
Unicast reply from 10.131.49.193 [0a:04:bb:8d:7b:61]

A completely different MAC was resolved, and connectivity was immediately
restored. This pattern repeated across multiple destination pod IPs and
multiple incidents.
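To speed up triage across nodes, the cached-vs-freshly-resolved MAC comparison above can be scripted. Below is a minimal sketch of ours (not part of VPC CNI) that parses `ip -s neigh show` output, assuming the typical iproute2 line format; the fresh MACs would come from `arping` and are supplied as a dict:

```python
import re

# Matches resolved entries like:
#   10.131.49.193 dev eth0 lladdr 0a:47:d5:85:ae:29 ref 1 used 546/12/3 probes 1 REACHABLE
# Format assumed from iproute2's `ip -s neigh show`; adjust if your version differs.
NEIGH_RE = re.compile(
    r"^(?P<ip>\S+) dev (?P<dev>\S+) lladdr (?P<mac>[0-9a-f:]{17}).*?(?P<state>[A-Z]+)$"
)

def parse_neigh(output: str):
    """Return (ip, dev, mac, state) tuples for resolved neighbor entries.

    Unresolved entries (FAILED/INCOMPLETE, no lladdr) are skipped.
    """
    entries = []
    for line in output.splitlines():
        m = NEIGH_RE.match(line.strip())
        if m:
            entries.append((m["ip"], m["dev"], m["mac"], m["state"]))
    return entries

def suspects(entries, fresh_macs):
    """Flag entries whose cached MAC disagrees with a freshly resolved one.

    fresh_macs maps ip -> MAC obtained out-of-band (e.g. via arping after
    deleting the cache entry on a scratch interface).
    """
    return [e for e in entries
            if e[0] in fresh_macs and fresh_macs[e[0]] != e[2]]
```

Any entry returned by `suspects` is a candidate for `ip neigh del <ip> dev <dev>`.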

7. Asymmetric reachability observed

  • Source node → destination pod IP: fails
  • Destination pod → source node: works

This strongly suggests the issue is on the outbound path from the source
node, not a bidirectional break.

8. conntrack table exhaustion ruled out

nf_conntrack_count = 1343
nf_conntrack_max = 131072

9. iptables ruled out

Compared iptables rules on faulty vs healthy source nodes. No DROP/REJECT
rule specific to affected pod IPs or ports found.


Current Hypothesis (Not Confirmed)

We suspect the following sequence may be occurring:

  1. A destination pod IP (secondary ENI IP) was previously assigned to an
    ENI on a now-deleted node. The source node learned the correct MAC for
    that IP at the time.

  2. The node was deleted. The same pod IP was subsequently reassigned —
    either to a new pod on the same destination node via a different ENI,
    or to a pod on a different node entirely.

  3. VPC CNI (or AWS fabric) updates L3 routing for the IP correctly, but
    the source node's neighbor cache is not invalidated. The old MAC —
    now belonging to a deleted node's ENI — remains cached.

  4. A separate TCP flow exists where the destination pod has a connection
    to the source node (for an unrelated purpose). The destination pod
    sends packets to the source node via correct VPC L3 routing. The source
    node receives these packets.

  5. Linux NUD treats these inbound packets as upper-layer confirmation
    (neigh_confirm()) for the destination IP's neighbor entry, resetting
    the REACHABLE timer — even though the outbound path uses a dead MAC.

  6. The neighbor entry never transitions to STALE, so it never reaches
    DELAY/PROBE, the dead MAC is never re-ARPed, and the stale entry
    persists indefinitely.

  7. New TCP connections from source to destination use the dead MAC, are
    silently dropped by the AWS fabric (no ICMP, no RST), and the SYN
    retransmits forever.
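The feedback loop in steps 4–6 can be illustrated with a toy model of the NUD timer. This is a deliberate simplification of the kernel's behavior in net/core/neighbour.c, not kernel code; the names (`NeighborEntry`, `confirm`, `REACHABLE_TIME`) are ours:

```python
import dataclasses

REACHABLE_TIME = 30.0  # stand-in for base_reachable_time_ms / 1000

@dataclasses.dataclass
class NeighborEntry:
    mac: str              # cached link-layer address (possibly dead)
    confirmed_at: float   # time of last confirmation

    def state(self, now: float) -> str:
        # Simplified NUD: the entry stays REACHABLE while confirmations
        # are newer than the reachable timeout; otherwise it would go
        # STALE and eventually be re-probed (re-ARPed).
        return "REACHABLE" if now - self.confirmed_at < REACHABLE_TIME else "STALE"

def confirm(entry: NeighborEntry, now: float) -> None:
    # Models neigh_confirm(): inbound traffic attributed to this neighbor
    # refreshes the timer -- even if the cached MAC is dead for the
    # *outbound* direction.
    entry.confirmed_at = now

entry = NeighborEntry(mac="0a:47:d5:85:ae:29", confirmed_at=0.0)  # dead MAC
now = 0.0
while now < 300.0:            # 10x the reachable window
    now += 10.0
    confirm(entry, now)       # inbound packets from an unrelated long-lived flow
    assert entry.state(now) == "REACHABLE"  # never STALE, never re-ARPed
```

As long as the unrelated inbound flow keeps arriving faster than the reachable timeout, the entry never leaves REACHABLE, which matches what we observed with `ip -s neigh`.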

The core question is: In VPC CNI's IP reuse lifecycle (pod deletion →
IP released → IP reassigned to new pod/node), is a gratuitous ARP (GARP)
expected to be sent by VPC CNI or the AWS fabric to invalidate stale
neighbor entries on remote nodes?


Questions

  1. Does VPC CNI send a gratuitous ARP when a secondary ENI IP is
    assigned to a new pod? If yes, is there a scenario where this GARP
    could be missed or not reach a particular source node?

  2. Is net.ipv4.conf.eth0.arp_notify=1 expected to be set on EKS nodes
    with VPC CNI? This sysctl causes the kernel to send a GARP when an
    IP address is added to an interface, which would naturally invalidate
    stale entries on peer nodes.

  3. Is there a known race condition or gap between IP reuse in VPC CNI
    and neighbor cache invalidation on remote nodes, particularly when the
    reused IP belongs to a completely different node than before?

  4. Is the Linux NUD upper-layer confirmation behavior (where inbound
    TCP packets from a destination keep its neighbor entry REACHABLE on the
    source even when the outbound MAC is wrong) a known interaction with
    VPC CNI's IP reuse model? Is this tracked anywhere?

  5. Are there recommended sysctl settings (base_reachable_time_ms,
    gc_stale_time, arp_notify) for EKS/VPC CNI nodes to reduce the
    window in which stale neighbor entries can cause this class of failure?
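For concreteness, the sysctls referenced in question 5 would be tuned along these lines. The values below are illustrative placeholders, not settings we have validated or that the maintainers recommend:

```
# /etc/sysctl.d/ fragment -- illustrative values only, not validated
net.ipv4.conf.eth0.arp_notify = 1                    # send GARP when an IP is added to eth0
net.ipv4.neigh.eth0.base_reachable_time_ms = 30000   # shrink the REACHABLE window
net.ipv4.neigh.eth0.gc_stale_time = 60               # reap stale entries sooner
```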


Workaround

Manually running the following on the source node restores connectivity
immediately:

ip neigh del <destination-pod-ip> dev eth0

After deletion, fresh ARP resolution retrieves the correct MAC and new
TCP connections succeed.


References / Related

  • Linux NUD state machine and upper-layer confirmation:
    neigh_confirm() in net/core/neighbour.c
  • net.ipv4.conf.all.arp_notify kernel documentation
  • Other CNIs (Antrea, Calico, Flannel) explicitly send GARP on pod
    startup to handle this class of stale ARP problem
  • Amazon VPC CNI repo: https://github.com/aws/amazon-vpc-cni-k8s

Additional Context

Happy to provide additional node diagnostics, kernel versions, VPC CNI
version, or packet captures if helpful for investigation.
