This repository was archived by the owner on Mar 5, 2024. It is now read-only.

kiam-agent race with dependent application pods on Node startup #395

@jpugliesi

Description

kiam version: v3.6-rc1 (primarily to support IMDS v2 from #381)

We frequently autoscale node groups in our cluster. These autoscaled nodes are dedicated to running applications which depend upon kiam for AWS credentials.

We're seeing a race between the kiam-agent pod and application pods when a new node is launched. The application pod appears to request AWS credentials before the kiam-agent pod is fully ready, which causes the application to raise a NoCredentialsError, e.g.

botocore.exceptions.NoCredentialsError: Unable to locate credentials
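One application-side mitigation (independent of kiam itself) is to retry the initial credential lookup with backoff instead of failing on the first attempt. A minimal sketch — `CredentialsUnavailable` stands in for `botocore.exceptions.NoCredentialsError`, and `fetch` is any callable you supply that raises while the agent isn't yet serving the node's metadata endpoint:

```python
import time


class CredentialsUnavailable(Exception):
    """Stand-in for botocore.exceptions.NoCredentialsError."""


def wait_for_credentials(fetch, attempts=5, base_delay=0.5):
    """Retry `fetch` with exponential backoff until it returns credentials.

    `fetch` is a caller-supplied callable that raises CredentialsUnavailable
    while the kiam-agent is not yet ready; the last failure is re-raised.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except CredentialsUnavailable:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

This is still a workaround for the underlying race, but it is less fragile than a fixed startup sleep because it resumes as soon as the agent is reachable.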

We've reviewed the following similar issues, among others: #203, #358

We've made the following changes to mitigate issues on startup, but still haven't fully solved the problem:

  1. Add an initContainer to the kiam-agent pods to check that DNS can resolve the kiam-server service (per kiam-agent random errors on node startup #358)
  2. Set priorityClassName: system-cluster-critical on the kiam-agent DaemonSet template to prioritize scheduling the kiam-agent (per [Feature] Set priorityClassName on kiam-server and kiam-agent Pods #343)
  3. Explicitly allocate resources to the kiam-agent

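Combined, those three mitigations look roughly like the following DaemonSet fragment. This is an illustrative sketch, not taken from any particular chart: the service name kiam-server, the busybox image tag, and the resource figures are all assumptions to be adapted:

```yaml
# Illustrative kiam-agent DaemonSet fragment combining the three mitigations
spec:
  template:
    spec:
      priorityClassName: system-cluster-critical   # mitigation 2
      initContainers:
        - name: wait-for-kiam-server-dns           # mitigation 1
          image: busybox:1.36                      # image/tag is an assumption
          command:
            - sh
            - -c
            - until nslookup kiam-server; do sleep 1; done
      containers:
        - name: kiam-agent
          resources:                               # mitigation 3 (example figures)
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              memory: 64Mi
```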
Additional guidance for mitigating issues on node startup seems to include:

  1. Make your application pods sleep for some amount of time on startup to allow the kiam-agent to become ready (this feels like a pretty disgusting hack of a workaround) (per kiam agent init time and pod waiting for ready state #203)
  2. Somehow taint new nodes with something akin to kiam-not-ready, and then somehow remove that taint when kiam-agent is ready. This also feels fairly invasive because it would require amending other important daemonsets to tolerate this additional taint (per kiam agent init time and pod waiting for ready state #203).
    edit: I see the uswitch team created https://github.com/uswitch/nidhogg to address this - this seems worth documenting here in the kiam project
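The taint approach implies that every DaemonSet which must still schedule onto not-yet-ready nodes needs a matching toleration. A sketch, assuming a hypothetical taint key of kiam-not-ready (the actual key nidhogg applies is derived from the watched DaemonSet, so check its documentation):

```yaml
# Toleration to add to DaemonSets that must run before kiam-agent is ready
tolerations:
  - key: kiam-not-ready        # hypothetical taint key
    operator: Exists
    effect: NoSchedule
```

This is exactly the invasiveness concern above: the toleration has to be threaded through every affected workload's spec.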

edit: this issue of pod readiness priority on node startup is tracked in a KEP

Am I missing something here? Are there additional configuration options or strategies to harden the kiam-agent deployment so that it is quickly and reliably available to all workloads on newly created cluster nodes? Appreciate any and all advice/help!
