This repository was archived by the owner on Mar 5, 2024. It is now read-only.

Intermittent connectivity failures #351

@JethroMV

Hi, we love kiam, it's great!
We're deploying it on our AWS KOPS K8s cluster using the following settings:

agent:
  gatewayTimeoutCreation: 20s
  timeout: 20s
  log.level: debug
  host:
    interface: weave
    iptables: true
server:
  gatewayTimeoutCreation: 20s
  timeout: 20s
  log.level: debug
  assumeRoleArn: arn:aws:iam::123456789:role/my-kiam-server
  nodeSelector:
    kubernetes.io/role: master
  tolerations:
  - key: "node-role.kubernetes.io/master"
    effect: NoSchedule
  sslCertHostPath: /etc/ssl/certs

Everything worked well for several weeks, but we've started to notice that the kiam-agent on individual nodes occasionally fails to connect to the kiam-server. Restarting the kiam-agent pod has never gotten it unstuck, and neither has restarting the kiam-server.
Once a node has turned bad we've had to drain and delete it. Kops then gives us a new node, which works most of the time. But sometimes, like today, all new nodes fail too.

Any help debugging this would be amazing!

We are using the latest version from Helm, i.e. we install with: helm install kiam uswitch/kiam --namespace kiam --values kiam-values.yaml
Kops Version: 1.14.0
Kubernetes Version: 1.16
Container Runtime Version: docker://18.6.3
Kubelet Version: v1.12.10

Example server logs (sometimes these look healthier):
kiam-server.log
The server will often reach a state (healthy, I assume) where it repeats:
kiam-server-p2wmk kiam-server {"level":"info","msg":"found role","pod.iam.role":"arn:aws:iam::630521468456:role/jura-stereotypes-read","pod.ip":"100.121.0.4","time":"2020-01-09T13:54:54Z"}

Example agent logs case 1: the most common case is an error on starting up
kiam-agent1.log

Example agent logs case 2: in the second most common case the logs get further
kiam-agent2.log

The main error seems to be:
{"level":"fatal","msg":"error creating server gateway: error dialing grpc server: context deadline exceeded","time":"2020-01-09T13:28:29Z"}
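
For anyone comparing logs across nodes, here is a minimal sketch (Python, not part of kiam) that counts the error and fatal messages in a log file. It assumes each line holds at most one JSON object with "level" and "msg" fields, optionally preceded by a pod/container prefix as in the server line above. The function names parse_line and summarize are made up for illustration.

```python
import json
from collections import Counter


def parse_line(line):
    """Return the JSON object embedded in a log line, or None if there is none."""
    start = line.find("{")  # skip any "pod-name container-name" prefix
    if start == -1:
        return None
    try:
        entry = json.loads(line[start:])
    except json.JSONDecodeError:
        return None
    return entry if isinstance(entry, dict) else None


def summarize(lines, levels=("error", "fatal")):
    """Count log messages whose level is one of `levels`."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line.strip())
        if entry and entry.get("level") in levels:
            counts[entry.get("msg", "")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (file name is whatever log you saved): python summarize.py kiam-agent1.log
    with open(sys.argv[1]) as f:
        for msg, n in summarize(f).most_common():
            print(n, msg)
```

Running this over the agent logs from a good node and a bad node should show whether the "context deadline exceeded" failure is the only fatal message or whether other errors precede it.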

It works some of the time, but the failures seem to be getting more frequent. We haven't changed the pod that is annotated to use a role.
Thanks for any help debugging this!
