This repository was archived by the owner on Mar 5, 2024. It is now read-only.

kiam-agent race with dependent application pods on Node startup #395

@jpugliesi

Description

kiam version: v3.6-rc1 (primarily to support IMDS v2 from #381)

We frequently autoscale node groups in our cluster. These autoscaled nodes are dedicated to running applications which depend upon kiam for AWS credentials.

We're seeing a race between the kiam-agent pod and application pods when a new node is launched. The application pod appears to request AWS credentials before the kiam-agent pod is fully ready, which causes the application to raise a NoCredentialsError, e.g.

botocore.exceptions.NoCredentialsError: Unable to locate credentials
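One application-side mitigation (independent of kiam itself) is to retry the initial credential lookup with backoff instead of failing on the first attempt. A minimal sketch — `CredentialsUnavailable` stands in for `botocore.exceptions.NoCredentialsError`, and `fetch` is any callable you supply that raises while the agent isn't yet serving the node's metadata endpoint:

```python
import time


class CredentialsUnavailable(Exception):
    """Stand-in for botocore.exceptions.NoCredentialsError."""


def wait_for_credentials(fetch, attempts=5, base_delay=0.5):
    """Retry `fetch` with exponential backoff until it returns credentials.

    `fetch` is a caller-supplied callable that raises CredentialsUnavailable
    while the kiam-agent is not yet ready; the last failure is re-raised.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except CredentialsUnavailable:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

This is still a workaround for the underlying race, but it is less fragile than a fixed startup sleep because it resumes as soon as the agent is reachable.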

We've reviewed the following similar issues, among others: #203, #358

We've made the following changes to mitigate issues on startup, but still haven't fully solved the problem:

  1. Add an initContainer to the kiam-agent pods to check that DNS can resolve the kiam-server service (per kiam-agent random errors on node startup #358)
  2. Set priorityClassName: system-cluster-critical on the kiam-agent DaemonSet template to prioritize scheduling the kiam-agent (per [Feature] Set priorityClassName on kiam-server and kiam-agent Pods #343)
  3. Explicitly allocate resources to the kiam-agent

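Combined, those three mitigations look roughly like the following DaemonSet fragment. This is an illustrative sketch, not taken from any particular chart: the service name kiam-server, the busybox image tag, and the resource figures are all assumptions to be adapted:

```yaml
# Illustrative kiam-agent DaemonSet fragment combining the three mitigations
spec:
  template:
    spec:
      priorityClassName: system-cluster-critical   # mitigation 2
      initContainers:
        - name: wait-for-kiam-server-dns           # mitigation 1
          image: busybox:1.36                      # image/tag is an assumption
          command:
            - sh
            - -c
            - until nslookup kiam-server; do sleep 1; done
      containers:
        - name: kiam-agent
          resources:                               # mitigation 3 (example figures)
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              memory: 64Mi
```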
Additional guidance for mitigating issues on node startup seems to include:

  1. Make your application pods sleep for some amount of time on startup to allow the kiam-agent to become ready (this feels like a pretty disgusting hack of a workaround) (per kiam agent init time and pod waiting for ready state #203)
  2. Somehow taint new nodes with something akin to kiam-not-ready, and then somehow remove that taint when kiam-agent is ready. This also feels fairly invasive because it would require amending other important daemonsets to tolerate this additional taint (per kiam agent init time and pod waiting for ready state #203).
    edit: I see the uswitch team created https://github.com/uswitch/nidhogg to address this - this seems worth documenting here in the kiam project
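The taint approach implies that every DaemonSet which must still schedule onto not-yet-ready nodes needs a matching toleration. A sketch, assuming a hypothetical taint key of kiam-not-ready (the actual key nidhogg applies is derived from the watched DaemonSet, so check its documentation):

```yaml
# Toleration to add to DaemonSets that must run before kiam-agent is ready
tolerations:
  - key: kiam-not-ready        # hypothetical taint key
    operator: Exists
    effect: NoSchedule
```

This is exactly the invasiveness concern above: the toleration has to be threaded through every affected workload's spec.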

edit: this issue of pod readiness priority on node startup is tracked in a KEP

Am I missing something here? Are there additional configuration options or strategies to harden the kiam-agent deployment so that it is quickly and reliably available to all workloads on newly created cluster nodes? Appreciate any and all advice/help!
