Abstract
In #426, the `register` option was introduced to let Fabio register itself in Consul under user-specified names, so that certain services become reachable through Fabio via Consul-provided A records.
Said functionality (code here) registers each service with:
It seems to me that the author and reviewers of the feature intended the following: whenever a Fabio instance is not reported as healthy, all services pointing to it are marked as unhealthy, and the instance is given a certain amount of time to recover before Consul deregisters said services. However, this is not what actually happens.
Detailed problem description
At this point, it's important to understand that each service ID registered by Fabio is computed from the Fabio instance's hostname and its configured IP:port endpoint. This will be relevant shortly.
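To make the role of the hostname concrete, here is a minimal sketch. The helper name `serviceID`, the inclusion of the service name, and the exact ID layout are assumptions made for illustration only; Fabio's real format may differ. The only point is that the hostname is part of the ID.

```go
package main

import "fmt"

// serviceID is a hypothetical helper, not Fabio's actual code. It builds an
// ID from the registered name, the instance hostname, and the configured
// IP:port endpoint. The layout is an assumption.
func serviceID(name, hostname, addr string) string {
	return fmt.Sprintf("%s-%s-%s", name, hostname, addr)
}

func main() {
	// Two containers on the same Docker host: same endpoint, different
	// hostnames (placeholder values).
	a := serviceID("fabio", "3f2a9c1d7b44", "10.0.0.5:9998")
	b := serviceID("fabio", "a81e0c55d2f0", "10.0.0.5:9998")
	fmt.Println(a)
	fmt.Println(b)
	fmt.Println("same ID:", a == b)
}
```

With these inputs the two IDs differ, so Consul would treat them as two distinct service entries even though they describe the same endpoint.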
The problem is two-fold:
- Consul may pass the HTTP check for service entries owned by a now defunct Fabio instance;
- As a consequence, since those checks pass, the Consul reaper will not deregister service entries owned by defunct Fabio instances, so Consul piles up duplicated service entries and their related checks.
Here's one example of how this can happen. If one runs Fabio as a Docker container with its ports mapped to the host network, and said container crash-loops, one ends up with duplicated service entries that differ only in the service ID, given that the `hostname` used to build the service ID is unique per container instance.
All duplicated services belonging to Fabio instances that shared the same Docker host point to the same HTTP check endpoint, namely Fabio's HTTP endpoint on the Docker host. As long as any Fabio container answers on that endpoint, the checks of all duplicated entries pass, so the entries are never removed and pile up as more crashes happen.
This will only resolve itself if no more instances of the Fabio container run on said host, so that the HTTP check fails and does not recover before the Consul reaper kicks in (Fabio defaults to 90m).
Potential solution
So how can we have:
- Fabio understand the difference between a service it owns and a similar service that was owned by a now defunct instance of Fabio?
- Consul reap duplicated, dead service entries and their related checks?
Consul supports multiple checks per service, and even different types of checks. One type of check that seems fully deterministic for this purpose is the TTL check.
One potential solution is for Fabio to add a TTL check to each service, with a check ID computed the same way as the related service ID, so that only the owning Fabio instance knows about the TTL check and periodically resets its clock before the TTL expires. Whenever a Fabio instance fails to reset the clock of a TTL check, that check is marked as critical. Assuming Fabio is configured with `registry.consul.checksrequired = all`, if the HTTP check passes but the TTL check fails, Fabio will not serve that service. It will, however, serve the service that this instance owns and whose TTL check it refreshes periodically.
Last but not least, since the defunct services will be marked as unhealthy, the Consul reaper will finally be able to do its job. One can tune Fabio's `registry.consul.checkDeregisterCriticalServiceAfter` configuration parameter to control how long after a service is marked as unhealthy it is deleted together with its related checks.
If we can agree on this solution, I am ready to open a PR that addresses it.