[Question] Stale neighbor/ARP cache with wrong MAC for reused pod IP causing intermittent connection timeouts #3628
Description
Summary
We are investigating an intermittent pod-to-pod connectivity issue on EKS
with Amazon VPC CNI. After significant debugging, we have narrowed the
symptom to a stale neighbor (ARP) cache entry on the source node holding
a wrong MAC address for a destination pod IP. We believe this wrong MAC
may belong to a previously deleted node's ENI, and the entry remains in
REACHABLE state indefinitely due to Linux NUD behavior.
We are opening this as a question to understand:
- Whether VPC CNI is expected to handle this via gratuitous ARP (GARP)
- Whether this is a known gap or an already-tracked issue
- What the recommended mitigation is
Environment
- Platform: AWS EKS
- Node OS: Bottlerocket
- CNI: Amazon VPC CNI (aws-node)
- Istio: Used as ingress gateway only (no sidecars, no ztunnel on the affected path)
- Node type: Standard EC2 instances with multiple ENIs (VPC CNI secondary IP mode)
Observed Symptom
Intermittent connection timeouts from a specific source node/pod to a
specific destination pod IP. The same destination pod IP works fine from
other source nodes. Deleting the neighbor cache entry on the source node
with `ip neigh del <dst-ip> dev eth0` fixes the issue immediately, and
after fresh ARP resolution a different (new) MAC is learned for the same IP.
Initially appeared as Istio ingress 503 UF / connection_timeout, but
direct curl from source node to destination pod IP:port reproduces the
failure, ruling out Istio configuration as the root cause.
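A minimal triage sketch for this symptom (addresses taken from the incident; the parsing step runs on a captured sample line, so live `ip neigh` access is not required to try it):

```shell
# Addresses from the incident; substitute your own.
DST_IP=10.131.55.225
DEV=eth0

# 1. Reproduce the timeout directly, bypassing Istio routing (commented;
#    requires the affected source node):
#      curl --connect-timeout 5 "http://${DST_IP}:8080/"
# 2. Inspect the neighbor entry for the failing destination:
#      ip -s neigh show "${DST_IP}" dev "${DEV}"

# Extracting the NUD state and cached MAC from a sample `ip neigh` line:
sample='10.131.55.225 dev eth0 lladdr 0a:47:d5:85:ae:29 ref 1 used 0/0/0 REACHABLE'
state=$(printf '%s\n' "$sample" | awk '{print $NF}')
mac=$(printf '%s\n' "$sample" | awk '{for (i = 1; i < NF; i++) if ($i == "lladdr") print $(i + 1)}')
echo "state=${state} mac=${mac}"
```

A cached MAC in REACHABLE state that differs from the MAC a fresh `arping` resolves is the signature described in this report.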
Debugging Steps Performed
1. Direct pod IP reachability confirmed as the failure point
Direct curl from source node to destination pod IP:port timed out,
bypassing any VirtualService or Istio routing logic. Same curl from a
healthy source node to the same destination pod IP worked immediately.
2. Failure is source-node-specific and destination-IP-specific
- Source node A → destination pod IP X: fails
- Source node B → destination pod IP X: works
- Source node A → replacement pod IP Y (same destination node): works
This ruled out the destination application, destination node health, and
generic cluster-wide issues.
3. tcpdump on source node showed only SYN retransmits
```
10.131.60.193 -> 10.131.55.225:8080 [S]
10.131.60.193 -> 10.131.55.225:8080 [S] (retransmit)
10.131.60.193 -> 10.131.55.225:8080 [S] (retransmit)
...
```
No SYN-ACK was ever seen. The TCP handshake never completed.
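To confirm which link-layer address the retransmitted SYNs are actually addressed to, `tcpdump -e` prints Ethernet headers. The capture command is commented since it needs a live node; the parsing below runs on a sample line using the incident's destination MAC, with the source MAC invented for illustration:

```shell
# On the affected source node (commented; requires root):
#   tcpdump -eni eth0 "host 10.131.55.225 and tcp[tcpflags] & tcp-syn != 0"
# With -e, each line starts with "<src-mac> > <dst-mac>,", so the
# destination MAC of the outgoing SYN is visible directly.

# Sample -e line (source MAC is a made-up placeholder):
line='02:1a:bc:de:f0:11 > 0a:47:d5:85:ae:29, ethertype IPv4 (0x0800), length 74: 10.131.60.193.41234 > 10.131.55.225.8080: Flags [S]'
dst_mac=$(printf '%s\n' "$line" | awk '{gsub(",", "", $3); print $3}')
echo "SYN sent to MAC ${dst_mac}"
```

If this MAC matches the stale neighbor entry rather than the MAC a fresh ARP resolves, the dead-MAC theory is confirmed for that flow.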
4. conntrack showed SYN_SENT [UNREPLIED]
```
src=10.131.60.193 dst=10.131.55.225 dport=8080 SYN_SENT [UNREPLIED]
```
SYN left the source but no reply was tracked by conntrack.
5. Simultaneous behavior: existing TCP flow works, new SYN fails
In one capture, a pre-existing established TCP connection between the
same source and destination IPs was still exchanging data while new SYN
packets to the same destination IP kept retransmitting and never
completed. This suggested the issue affects new connection establishment
and path resolution, not total connectivity.
6. `ip neigh` showed wrong MAC as REACHABLE
```
$ ip -s neigh show 10.131.49.193
10.131.49.193 dev eth0 lladdr 0a:47:d5:85:ae:29 ref 1 used ... REACHABLE
```
After `ip neigh del 10.131.49.193 dev eth0` and arping:
```
$ arping -c 5 -I eth0 10.131.49.193
Unicast reply from 10.131.49.193 [0a:04:bb:8d:7b:61]
```
A completely different MAC was resolved, and connectivity was immediately
restored. This pattern repeated across multiple destination pod IPs and
multiple incidents.
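During a live incident, the NUD transitions for the suspect entry can also be watched directly. The events below are a fabricated sample showing what a stuck entry looks like; a healthy entry would typically age to STALE and be re-probed:

```shell
# Live observation (commented; run on the affected source node):
#   ip monitor neigh | grep --line-buffered 10.131.49.193
# A stuck entry only ever reports REACHABLE. Fabricated sample events:
events='10.131.49.193 dev eth0 lladdr 0a:47:d5:85:ae:29 REACHABLE
10.131.49.193 dev eth0 lladdr 0a:47:d5:85:ae:29 REACHABLE'
stale_seen=$(printf '%s\n' "$events" | grep -c 'STALE\|PROBE' || true)
echo "STALE/PROBE events observed: ${stale_seen}"
```

Zero STALE/PROBE events over a long observation window, while traffic to the destination fails, is consistent with the confirmation-refresh behavior hypothesized below.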
7. Asymmetric reachability observed
- Source node → destination pod IP: fails
- Destination pod → source node: works
This strongly suggests the issue is on the outbound path from the source
node, not a bidirectional break.
8. conntrack table exhaustion ruled out
```
nf_conntrack_count = 1343
nf_conntrack_max = 131072
```
9. iptables ruled out
Compared iptables rules on faulty vs. healthy source nodes. No DROP/REJECT
rule specific to the affected pod IPs or ports was found.
Current Hypothesis (Not Confirmed)
We suspect the following sequence may be occurring:
1. A destination pod IP (secondary ENI IP) was previously assigned to an
ENI on a now-deleted node. The source node learned the correct MAC for
that IP at the time.
2. The node was deleted. The same pod IP was subsequently reassigned,
either to a new pod on the same destination node via a different ENI,
or to a pod on a different node entirely.
3. VPC CNI (or the AWS fabric) updates L3 routing for the IP correctly, but
the source node's neighbor cache is not invalidated. The old MAC, now
belonging to a deleted node's ENI, remains cached.
4. A separate TCP flow exists in which the destination pod has a connection
to the source node (for an unrelated purpose). The destination pod
sends packets to the source node via correct VPC L3 routing, and the source
node receives them.
5. Linux NUD treats these inbound packets as upper-layer confirmation
(`neigh_confirm()`) for the destination IP's neighbor entry, resetting
the REACHABLE timer, even though the outbound path uses a dead MAC.
6. The neighbor entry never goes STALE, so it never enters PROBE, the dead
MAC is never re-ARPed, and the stale entry persists indefinitely.
7. New TCP connections from source to destination use the dead MAC, are
silently dropped by the AWS fabric (no ICMP, no RST), and the SYNs
retransmit indefinitely.
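The aging interaction in steps 5 and 6 can be illustrated with a toy shell model. This is not the kernel's NUD implementation, just the timing argument: as long as confirmations arrive more often than the REACHABLE lifetime, the entry never ages out; without them, it would go STALE and be re-probed. The intervals are made up:

```shell
reachable_time=30    # stand-in for base_reachable_time_ms (in "seconds")
confirm_interval=10  # inbound packets from the destination arrive this often

# Run 1: confirmations keep resetting the entry's age.
state=REACHABLE; age=0; t=0
while [ "$t" -lt 120 ]; do
  t=$((t + 1)); age=$((age + 1))
  # neigh_confirm(): inbound traffic refreshes the entry.
  if [ $((t % confirm_interval)) -eq 0 ]; then age=0; fi
  if [ "$age" -ge "$reachable_time" ]; then state=STALE; fi
done
echo "with confirmations:    ${state}"

# Run 2: no confirmations; the entry ages normally.
state2=REACHABLE; age2=0; t2=0
while [ "$t2" -lt 120 ]; do
  t2=$((t2 + 1)); age2=$((age2 + 1))
  if [ "$age2" -ge "$reachable_time" ]; then state2=STALE; fi
done
echo "without confirmations: ${state2}"
```

The first run stays REACHABLE indefinitely even though the cached MAC is wrong, which matches the observed symptom.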
The core question is: In VPC CNI's IP reuse lifecycle (pod deletion →
IP released → IP reassigned to new pod/node), is a gratuitous ARP (GARP)
expected to be sent by VPC CNI or the AWS fabric to invalidate stale
neighbor entries on remote nodes?
Questions
1. Does VPC CNI send a gratuitous ARP when a secondary ENI IP is
assigned to a new pod? If yes, is there a scenario where this GARP
could be missed or fail to reach a particular source node?
2. Is `net.ipv4.conf.eth0.arp_notify=1` expected to be set on EKS nodes
with VPC CNI? This sysctl causes the kernel to send a GARP when an
IP address is added to an interface, which would naturally invalidate
stale entries on peer nodes.
3. Is there a known race condition or gap between IP reuse in VPC CNI
and neighbor cache invalidation on remote nodes, particularly when the
reused IP belongs to a completely different node than before?
4. Is the Linux NUD upper-layer confirmation behavior (where inbound
TCP packets from a destination keep its neighbor entry REACHABLE on the
source even when the outbound MAC is wrong) a known interaction with
VPC CNI's IP reuse model? Is this tracked anywhere?
5. Are there recommended sysctl settings (`base_reachable_time_ms`,
`gc_stale_time`, `arp_notify`) for EKS/VPC CNI nodes to reduce the
window in which stale neighbor entries can cause this class of failure?
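For the sysctl question, a sketch of the settings involved is below. The values are illustrative guesses, not AWS guidance; the per-interface keys assume the primary interface is eth0, and shortening these timers increases background ARP traffic, so they should be validated against node scale first:

```shell
# Illustrative values only -- not an AWS/EKS recommendation.
# Shorter REACHABLE lifetime: a stale entry is re-probed sooner
# (default is 30000 ms, randomized per entry).
sysctl -w net.ipv4.neigh.eth0.base_reachable_time_ms=15000
# Garbage-collect STALE entries sooner (default 60 s).
sysctl -w net.ipv4.neigh.eth0.gc_stale_time=30
# Send a gratuitous ARP when an address is added to eth0.
sysctl -w net.ipv4.conf.eth0.arp_notify=1
```

Note that shorter timers only narrow the window; if inbound traffic keeps confirming the entry as hypothesized above, the REACHABLE timer is still reset and the first setting alone may not help.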
Workaround
Manually running the following on the source node restores connectivity
immediately:
```
ip neigh del <destination-pod-ip> dev eth0
```
After deletion, fresh ARP resolution retrieves the correct MAC and new
TCP connections succeed.
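A sketch of automating the workaround follows. The addresses are placeholders, the actual flush commands are left commented since they require root on the affected node, and the guard only acts when the entry actually holds a cached (possibly stale) MAC:

```shell
DST_IP=10.131.55.225   # placeholder from the incident
DEV=eth0

# Only flush when the `ip neigh show` line holds an lladdr; an
# INCOMPLETE/FAILED entry has nothing cached worth deleting.
needs_flush() {
  case "$1" in
    *lladdr*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Sample line; on a live node use: ip neigh show "$DST_IP" dev "$DEV"
line='10.131.55.225 dev eth0 lladdr 0a:47:d5:85:ae:29 REACHABLE'
if needs_flush "$line"; then
  echo "would flush ${DST_IP} on ${DEV}"
  # ip neigh del "$DST_IP" dev "$DEV"    # requires root; then re-resolve:
  # arping -c 3 -I "$DEV" "$DST_IP"
fi
```

This is a mitigation, not a fix: it clears one stale entry on one node and does nothing to prevent recurrence when another IP is reused.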
References / Related
- Linux NUD state machine and upper-layer confirmation: `neigh_confirm()` in `net/core/neighbour.c`
- `net.ipv4.conf.all.arp_notify` kernel documentation
- Other CNIs (Antrea, Calico, Flannel) explicitly send GARP on pod startup to handle this class of stale ARP problem
- Amazon VPC CNI repo: https://github.com/aws/amazon-vpc-cni-k8s
Additional Context
Happy to provide additional node diagnostics, kernel versions, VPC CNI
version, or packet captures if helpful for investigation.