Intermittent connectivity failures #351
Description
Hi, we love kiam, it's great!
We're deploying it on our AWS KOPS K8s cluster using the following settings:
agent:
  gatewayTimeoutCreation: 20s
  timeout: 20s
  log.level: debug
  host:
    interface: weave
    iptables: true
server:
  gatewayTimeoutCreation: 20s
  timeout: 20s
  log.level: debug
  assumeRoleArn: arn:aws:iam::123456789:role/my-kiam-server
  nodeSelector:
    kubernetes.io/role: master
  tolerations:
    - key: "node-role.kubernetes.io/master"
      effect: NoSchedule
  sslCertHostPath: /etc/ssl/certs
Everything was working well for several weeks, but we've started to notice that individual nodes will occasionally fail to connect. Restarting the kiam-agent pod has never gotten a node unstuck, and neither has restarting the kiam-server.
Once a node has turned bad, we've had to drain and delete it. Kops then gives us a new node, which works most of the time. But sometimes, like today, all new nodes fail too.
Any help debugging this would be amazing!
We are using the latest version from Helm, i.e. we install with: helm install kiam uswitch/kiam --namespace kiam --values kiam-values.yaml
Kops Version: 1.14.0
Kubernetes Version: 1.16
Container Runtime Version: docker://18.6.3
Kubelet Version: v1.12.10
Example server logs. Sometimes these look healthier.
kiam-server.log
It will often get to a (I assume healthy) state of repeating:
kiam-server-p2wmk kiam-server {"level":"info","msg":"found role","pod.iam.role":"arn:aws:iam::630521468456:role/jura-stereotypes-read","pod.ip":"100.121.0.4","time":"2020-01-09T13:54:54Z"}
Example agent logs, case 1: the most common case is an error on startup.
kiam-agent1.log
Example agent logs, case 2: in the second most common case, the logs get further.
kiam-agent2.log
The main error seems to be:
{"level":"fatal","msg":"error creating server gateway: error dialing grpc server: context deadline exceeded","time":"2020-01-09T13:28:29Z"}
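Since "context deadline exceeded" only says the dial timed out, one thing that might help narrow it down is checking plain TCP reachability from an affected node to the kiam-server endpoint. The following is a minimal sketch, not anything from kiam itself: the function name can_connect is hypothetical, and the host and port to pass in are assumptions that depend on how the server is exposed in a given cluster.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 20.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    This only checks network reachability; it does not perform the TLS
    handshake or any gRPC call that the kiam agent would make.
    """
    try:
        # create_connection resolves the name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example call (host and port are placeholders, not taken from our setup):
#   can_connect("<kiam-server-host>", <port>, timeout=20.0)
```

If this returns False on a bad node but True on a healthy one, the problem is probably at the network layer rather than in TLS or gRPC. If it returns True on both, the timeout is more likely happening during the handshake or on the server side.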
Sometimes it's been working, but it seems to be getting worse. We haven't changed our pod that is annotated to use a role.
Thanks for any help debugging this!