max_retries circuit breaker vs. dynamic cluster selection #17412

@markdroth

There are two cases where the cluster is dynamically determined at load-balancing time:

  1. Aggregate clusters.
  2. Dynamic cluster selection via the new cluster_specifier_plugin extension point added in #16944 (api: add cluster_specifier_plugin to RouteAction).

In these cases, it's not clear how the max_retries circuit breaker can work, because max_retries is configured on a per-cluster basis, but when the cluster is being dynamically determined, the cluster is not known until after the retry code runs. And, in fact, the chosen cluster may be different for each retry attempt.
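To make the ordering problem concrete, here is a minimal sketch in Python. It is not Envoy's implementation; `RetryBudget` and `should_retry` are hypothetical names, and the model simply assumes that max_retries caps the number of concurrently active retries per cluster and that the retry decision runs before a cluster is picked.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RetryBudget:
    """Hypothetical model of a per-cluster cap on concurrently active retries."""
    max_retries: int
    active: int = 0

    def try_acquire(self) -> bool:
        # Refuse the retry once the cluster's cap is reached.
        if self.active >= self.max_retries:
            return False
        self.active += 1
        return True

    def release(self) -> None:
        # Called when a retry attempt finishes.
        self.active -= 1


def should_retry(budgets: Dict[str, RetryBudget], cluster: Optional[str]) -> bool:
    """Retry decision that runs BEFORE load balancing has picked a cluster."""
    if cluster is None:
        # Dynamically selected cluster: there is no budget to consult yet.
        raise LookupError("cluster is not known until load-balancing time")
    return budgets[cluster].try_acquire()
```

With a statically configured cluster the lookup works; with a dynamically determined one, the retry code has no cluster name to pass in, which is exactly the gap described above.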

For aggregate clusters, one possible solution would be to use the max_retries value from the aggregate cluster itself instead of the one from the chosen underlying cluster. This would be a change from what we agreed in #13134: for aggregate clusters, the aggregate cluster should control the LB policy, but all other functionality -- including circuit breakers -- should be controlled by the underlying clusters that the aggregate cluster points to. In other words, if aggregate cluster A points to underlying clusters B and C, then the circuit breakers configured for B or C are used depending on which of them the aggregate cluster chooses for a given request; the circuit breaker limits configured for cluster A itself are basically ignored. However, we could make the max_retries circuit breaker one specific exception to this rule, since it simply does not make sense for it to come from the underlying cluster.
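A rough sketch of that exception, assuming the thresholds are modeled as plain dictionaries: `effective_thresholds` is a hypothetical helper, and the numeric limits are made up for illustration. Every limit comes from the chosen underlying cluster except max_retries, which comes from the aggregate cluster.

```python
from typing import Dict

Thresholds = Dict[str, int]


def effective_thresholds(aggregate: Thresholds, chosen: Thresholds) -> Thresholds:
    """Combine limits for one request routed through an aggregate cluster."""
    # Start from the underlying cluster's limits (the behavior agreed in #13134).
    limits = dict(chosen)
    # Proposed exception: max_retries is taken from the aggregate cluster.
    limits["max_retries"] = aggregate["max_retries"]
    return limits


# Illustrative values only; A is the aggregate cluster, B an underlying cluster.
cluster_a = {"max_connections": 1024, "max_requests": 1024, "max_retries": 5}
cluster_b = {"max_connections": 100, "max_requests": 200, "max_retries": 3}
```

Here `effective_thresholds(cluster_a, cluster_b)` keeps B's connection and request limits but applies A's max_retries, so the retry limit is known before the aggregate cluster picks B or C.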

Unfortunately, that approach would not work for the dynamic cluster selection case, because in that case there is no aggregate cluster that is configured to begin with.

Another possible option here would be to add a per-route setting to override the max_retries circuit breaker setting from the cluster, which could be used in cases where the cluster is dynamically determined. However, this would also allow overriding the setting when the cluster is not dynamically determined, so circuit breakers would no longer be configured in just one place for each cluster, which seems a little sub-optimal. On the other hand, retries are actually implemented on Envoy's downstream side, so configuring the retry circuit breaker on the upstream side seems a little counter-intuitive as well.
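The precedence this option implies could look roughly like the sketch below. The per-route override does not exist today, so `route_override` and `effective_max_retries` are hypothetical; the fallback of 3 mirrors the documented default for max_retries when a cluster does not set it.

```python
from typing import Optional

# Fallback when neither the route nor the cluster sets a value (assumed default).
DEFAULT_MAX_RETRIES = 3


def effective_max_retries(route_override: Optional[int],
                          cluster_max_retries: Optional[int]) -> int:
    """Pick the retry limit, preferring the hypothetical per-route override."""
    if route_override is not None:
        # Usable even when the cluster is only known at load-balancing time.
        return route_override
    if cluster_max_retries is not None:
        # Statically known cluster: keep today's per-cluster behavior.
        return cluster_max_retries
    return DEFAULT_MAX_RETRIES
```

Note that the override wins even when the cluster is statically known, which is the "configured in more than one place" downside mentioned above.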

Anyone have any other suggestions on a better way to handle this?

CC @htuch @yxue @mattklein123 @ejona86 @dfawley @dapengzhang0 @donnadionne @AndresGuedez

Labels: area/cluster, area/retry, question