Intermittent connectivity failures #351

Hi, we love kiam, it's great!
We're deploying it on our kops-managed Kubernetes cluster on AWS using the following settings:

```yaml
agent:
  gatewayTimeoutCreation: 20s
  timeout: 20s
  log.level: debug
  host:
    interface: weave
    iptables: true
server:
  gatewayTimeoutCreation: 20s
  timeout: 20s
  log.level: debug
  assumeRoleArn: arn:aws:iam::123456789:role/my-kiam-server
  nodeSelector:
    kubernetes.io/role: master
  tolerations:
  - key: "node-role.kubernetes.io/master"
    effect: NoSchedule
  sslCertHostPath: /etc/ssl/certs
```

Everything worked well for several weeks, but we've started to notice that individual nodes occasionally fail to connect. Restarting the kiam-agent pod has never gotten a node unstuck, and neither has restarting the kiam-server.
Once a node has turned bad, we've had to drain and delete it; kops then gives us a new one, which works most of the time. But sometimes, like today, all new nodes fail too.

Any help debugging this would be amazing!

We are using the latest version from Helm, i.e. we install using `helm install kiam uswitch/kiam --namespace kiam --values kiam-values.yaml`.

- Kops version: 1.14.0
- Kubernetes version: 1.16
- Container runtime version: docker://18.6.3
- Kubelet version: v1.12.10

Example server logs (sometimes these look healthier):
[kiam-server.log](https://github.com/uswitch/kiam/files/4040648/kiam-server.log)
It will often get to a (I assume healthy) state of repeating: `kiam-server-p2wmk kiam-server {"level":"info","msg":"found role","pod.iam.role":"arn:aws:iam::630521468456:role/jura-stereotypes-read","pod.ip":"100.121.0.4","time":"2020-01-09T13:54:54Z"}`
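
For anyone triaging the attached logs, here is a minimal Python sketch that tallies kiam's JSON log lines by level and message, which can make it easier to see whether a pod is stuck repeating one message. It assumes each line is either bare JSON or JSON preceded by a pod/container prefix, as in the samples in this issue; the helper names `parse_kiam_line` and `tally` are made up for illustration and are not part of kiam.

```python
import json
from collections import Counter


def parse_kiam_line(line):
    """Return the JSON payload of a log line as a dict, or None if absent."""
    start = line.find("{")  # skip any "pod container " prefix before the JSON
    if start == -1:
        return None
    try:
        entry = json.loads(line[start:])
    except json.JSONDecodeError:
        return None
    return entry if isinstance(entry, dict) else None


def tally(lines):
    """Count (level, msg) pairs across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        entry = parse_kiam_line(line)
        if entry is not None:
            counts[(entry.get("level"), entry.get("msg"))] += 1
    return counts


# Sample lines copied from this issue; a real run would read a log file instead.
sample = [
    'kiam-server-p2wmk kiam-server {"level":"info","msg":"found role",'
    '"pod.iam.role":"arn:aws:iam::630521468456:role/jura-stereotypes-read",'
    '"pod.ip":"100.121.0.4","time":"2020-01-09T13:54:54Z"}',
    '{"level":"fatal","msg":"error creating server gateway: error dialing '
    'grpc server: context deadline exceeded","time":"2020-01-09T13:28:29Z"}',
]

for (level, msg), n in tally(sample).most_common():
    print(n, level, msg)
```

To run it over one of the attached files, pass an open file object to `tally` in place of `sample`.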

Example agent logs, case 1 (the most common): an error at startup
[kiam-agent1.log](https://github.com/uswitch/kiam/files/4040656/kiam-agent1.log)

Example agent logs, case 2 (the second most common): the logs get further
[kiam-agent2.log](https://github.com/uswitch/kiam/files/4040661/kiam-agent2.log)

The main error seems to be:
`{"level":"fatal","msg":"error creating server gateway: error dialing grpc server: context deadline exceeded","time":"2020-01-09T13:28:29Z"}`
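
Since the error is a dial timeout, one thing that may be worth checking from a pod on an affected node is whether a plain TCP connection to the server can be opened at all. Below is a minimal Python sketch of such a check. The host `kiam-server` and port `443` in the commented usage line are assumptions about how the server Service is exposed, not values confirmed in this issue, and `can_connect` is a hypothetical helper name. A successful TCP connect does not prove that TLS or gRPC works; it only separates "network path is blocked" from "something fails later in the handshake".

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within timeout."""
    try:
        # create_connection resolves the host and connects, raising OSError
        # (including timeouts and DNS failures) if it cannot.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage (assumed address, adjust to the actual Service name and port):
# print(can_connect("kiam-server", 443, timeout=20.0))
```

Running it on both a healthy and a bad node would show whether the failure is specific to the node's network path.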

Sometimes it's been working, but it seems to be getting worse. We haven't changed our pod that is annotated to use a role.
Thanks for any help debugging this!
