kiam-agent race with dependent application pods on Node startup #395
Description
kiam version: v3.6-rc1 (primarily to support IMDS v2 from #381)
We frequently autoscale node groups in our cluster. These autoscaled nodes are dedicated to running applications which depend upon kiam for AWS credentials.
We're seeing a race between the kiam-agent pod and application pods when a new node is launched. It appears the application pod makes an AWS credentials request before the kiam-agent pod is fully ready, which causes the application to throw AWS NoCredentialsError errors, i.e.
```
botocore.exceptions.NoCredentialsError: Unable to locate credentials
```
We've reviewed the following similar issues, among others: #203, #358
We've made the following changes to mitigate issues on startup, but still haven't fully solved the problem:
- Add an `initContainer` to the `kiam-agent` pods to check that DNS can resolve the `kiam-server` service (per #358, "kiam-agent random errors on node startup")
- Set `priorityClassName: system-cluster-critical` on the `kiam-agent` daemonset template to prioritize scheduling the `kiam-agent` (per #343, "[Feature] Set priorityClassName on kiam-server and kiam-agent Pods")
- Explicitly allocate `resources` to the `kiam-agent`
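Combined, these mitigations look roughly like the daemonset fragment below. This is a sketch, not our exact manifest — the image tags and resource values are illustrative:

```yaml
# Sketch of a hardened kiam-agent daemonset (illustrative values).
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kiam-agent
spec:
  selector:
    matchLabels:
      app: kiam-agent
  template:
    metadata:
      labels:
        app: kiam-agent
    spec:
      priorityClassName: system-cluster-critical  # schedule the agent early
      initContainers:
        - name: wait-for-kiam-server-dns
          image: busybox:1.32  # hypothetical image/tag
          # Block until DNS can resolve the kiam-server service (per #358).
          command:
            - sh
            - -c
            - until nslookup kiam-server; do echo waiting; sleep 2; done
      containers:
        - name: kiam-agent
          image: quay.io/uswitch/kiam:v3.6-rc1
          resources:  # explicit requests/limits (illustrative values)
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              memory: 128Mi
```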
It seems the additional guidance on how to mitigate issues on node startup includes:
- Make your application pods `sleep` for some amount of time on startup to allow the `kiam-agent` to become ready (this feels like a pretty disgusting hack of a workaround) (per #203, "kiam agent init time and pod waiting for ready state")
- Somehow taint new nodes with something akin to `kiam-not-ready`, and then somehow remove that taint when `kiam-agent` is ready. This also feels fairly invasive because it would require amending other important daemonsets to tolerate this additional taint (per #203).
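For the taint-based approach, the moving parts would look something like the sketch below. The taint key `kiam-not-ready` and the mechanism that clears it are hypothetical:

```yaml
# Hypothetical sketch of the kiam-not-ready taint approach.
# 1. New nodes register with the taint, e.g. via a kubelet flag:
#      --register-with-taints=kiam-not-ready=true:NoSchedule
#    (or manually: kubectl taint nodes <node> kiam-not-ready=true:NoSchedule)
# 2. Some controller (or a custom job) removes the taint once the node's
#    kiam-agent pod reports Ready.
# 3. Every daemonset that must run before kiam-agent is ready needs a
#    matching toleration, which is what makes this approach invasive:
tolerations:
  - key: kiam-not-ready
    operator: Exists
    effect: NoSchedule
```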
edit: I see the uswitch team created https://github.com/uswitch/nidhogg to address this - this seems worth documenting here in the kiam project
edit: this issue of pod readiness priority on node startup is tracked in a KEP
Am I missing something here? Are there additional configurations or strategies to harden the `kiam-agent` deployment such that it is quickly and reliably available to all workloads on newly created cluster nodes? Appreciate any and all advice/help!