max_retries circuit breaker vs. dynamic cluster selection

There are two cases where the cluster is dynamically determined at load-balancing time:
1. Aggregate clusters.
2. Dynamic cluster selection via the new extension point added in #16944.

In these cases, it's not clear how the max_retries circuit breaker can work: max_retries is configured per cluster, but when the cluster is dynamically determined, it is not known until after the retry code runs.  In fact, the chosen cluster may be different for each retry attempt.
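To make the problem concrete, here is a toy model (not Envoy's actual code; `Cluster`, `attempt_sequence`, and the round-robin picker are all hypothetical names made up for illustration). It only shows that the applicable max_retries limit is unknown until the cluster is picked, and that it can change from one attempt to the next:

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Callable, List


@dataclass
class Cluster:
    name: str
    max_retries: int  # per-cluster circuit breaker limit (illustrative values)


def attempt_sequence(pick_cluster: Callable[[], Cluster], attempts: int) -> List[str]:
    """Record which max_retries limit would apply to each attempt.

    The cluster is only known once pick_cluster() has run, so any retry
    check performed before that call has no limit to consult.
    """
    seen = []
    for _ in range(attempts):
        cluster = pick_cluster()  # dynamic selection happens here, per attempt
        seen.append(f"{cluster.name}:max_retries={cluster.max_retries}")
    return seen


# Hypothetical setup: selection alternates between B and C.
b = Cluster("B", max_retries=3)
c = Cluster("C", max_retries=10)
rotation = cycle([b, c])
limits = attempt_sequence(lambda: next(rotation), attempts=3)
print(limits)
```

In this sketch the original request and its two retries would each be subject to a different limit, which is the ambiguity described above.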

For aggregate clusters, one possible solution is to use the max_retries value from the aggregate cluster itself instead of the one from the chosen underlying cluster.  That would be a departure from what we agreed on in #13134: for aggregate clusters, the aggregate cluster controls the LB policy, but all other functionality, including circuit breakers, is controlled by the underlying clusters that the aggregate cluster points to.  In other words, if aggregate cluster A points to underlying clusters B and C, then the circuit breakers configured for B or C are used depending on whether the aggregate cluster chooses to send a request to B or C; the circuit breaker limits configured for cluster A itself are basically ignored.  However, we could make the max_retries circuit breaker the one specific exception to this, since it simply does not make sense for it to come from the underlying cluster.
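A rough sketch of what that exception would mean, assuming a simplified `Breakers` record (hypothetical; the fields other than max_retries are included only to show the contrast, and none of this mirrors Envoy's internal types):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Breakers:
    max_connections: int
    max_requests: int
    max_retries: int


def effective_breakers(aggregate: Breakers, chosen: Breakers) -> Breakers:
    """Combine limits under the proposed exception.

    Everything comes from the chosen underlying cluster, except
    max_retries, which comes from the aggregate cluster.
    """
    return Breakers(
        max_connections=chosen.max_connections,
        max_requests=chosen.max_requests,
        max_retries=aggregate.max_retries,
    )


# Aggregate cluster A points at underlying cluster B (values are made up).
cluster_a = Breakers(max_connections=1, max_requests=1, max_retries=5)
cluster_b = Breakers(max_connections=100, max_requests=200, max_retries=3)
result = effective_breakers(cluster_a, cluster_b)
print(result)
```

Here A's max_connections and max_requests are still ignored, as agreed in #13134; only max_retries is taken from A.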

Unfortunately, that approach would not work for the dynamic cluster selection case, because in that case there is no aggregate cluster that is configured to begin with.

Another option would be to add a per-route setting that overrides the cluster's max_retries circuit breaker setting, for use in cases where the cluster is dynamically determined.  However, this would also allow overriding the setting when the cluster is *not* dynamically determined, so circuit breakers would no longer be configured in just one place for each cluster, which seems sub-optimal.  On the other hand, retries are actually implemented on Envoy's downstream side, so configuring the retry circuit breaker on the upstream side seems a little counter-intuitive as well.
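The resolution rule for such an override could be as simple as the following sketch (`resolve_max_retries` and `route_override` are hypothetical names; no such route field exists today as far as this issue is concerned):

```python
from typing import Optional


def resolve_max_retries(cluster_max_retries: int,
                        route_override: Optional[int] = None) -> int:
    """Return the limit to enforce: the route override if set, else the cluster's."""
    if route_override is not None:
        return route_override
    return cluster_max_retries


# No override: the cluster's own setting applies, as today.
default_limit = resolve_max_retries(cluster_max_retries=3)

# Override set: it wins even if the route targets a single static cluster,
# which is the "configured in two places" downside mentioned above.
overridden_limit = resolve_max_retries(cluster_max_retries=3, route_override=8)
print(default_limit, overridden_limit)
```

Note that the sketch still needs a cluster value to fall back on, so in the dynamic case the override would effectively have to be set.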

Does anyone have suggestions for a better way to handle this?

CC @htuch @yxue @mattklein123 @ejona86 @dfawley @dapengzhang0 @donnadionne @andresguedez

max_retries circuit breaker vs. dynamic cluster selection #17412