Experiencing 502's

We use Fabio as a load balancer in a Nomad setup with Consul and some backend services (Node.js `http` servers). With millions of requests coming in each month, we see about 1750 requests getting a 502 response. Percentage-wise this is roughly 0.01%, so a very low number. Nevertheless, we'd like to solve this issue.

Reading through the issues, we found a couple of similar-looking ones, for example:
https://github.com/fabiolb/fabio/issues/721
https://github.com/fabiolb/fabio/issues/716
___

We typically find these log lines in Fabio at the time of 502 errors:

```
http: proxy error: read tcp IP:33374->IP:24218: read: connection reset by peer
http: proxy error: EOF
```

These are indicators of a TCP RST packet, meaning the backend had already closed the connection. We figured that it has something to do with mismatched keep-alive timeouts on the two sides. See these two articles for a better explanation: https://shuheikagawa.com/blog/2019/04/25/keep-alive-timeout/ and https://docs.apigee.com/api-platform/troubleshoot/runtime/502-bad-gateway. Even though the load balancers are different, they describe the same type of problem.

We tried a setup where we set `proxy.keepalivetimeout` to 20s for Fabio and `server.keepAliveTimeout` to 30s for our backend service. The idea is that the load balancer should always be the one to close the connection, _not_ the backend service. However, we found that all TCP connections still ended up in a `TIME-WAIT` state on the backend service, indicating that the backend service initiated the closing of the socket. No matter what configuration we tried for Fabio, it didn't work: the backend service was always the one initiating closure.

Upon further investigation, it seems that the Node.js `http` server does something extra when `keepAliveTimeout` is hit: it also destroys the socket (https://github.com/nodejs/node/blob/45b5ca810a16074e639157825c1aa2e90d60e9f6/lib/_http_server.js#L587). This behaviour is not found in Fabio when you set `proxy.keepalivetimeout`. Fabio just keeps the socket open, and eventually the backend service kills the connection because it hits its own `keepAliveTimeout`.

Additional testing did give us some good results, though. We found that we have to set `IdleConnTimeout`* on the `http.Transport` so that Fabio closes the connection on its side after the timeout. When configured this way (plus the timeouts mentioned above), we noticed that the TCP connections on the backend services no longer ended up in a `TIME-WAIT` state but in a `CLOSE-WAIT` state, indicating that Fabio initiated closure. However, `IdleConnTimeout` is not currently configurable in Fabio.

I'm looking for additional insights and feedback. Are we even on the right path? I have prepared a PR that makes `IdleConnTimeout` configurable through `proxy.idleconntimeout`.

Thank you.

*See https://go.dev/src/net/http/transport.go L994 (`closeConnIfStillIdle`)

Experiencing 502's #862
