Feature Request: Have an ERS cooldown period a la RecoveryPeriodBlockSeconds of classical orchestrator #19775

### Feature Description

Under some scenarios (e.g. a client app overloading a shard's primary), vtorc can detect a `DeadPrimary` when the host is not actually dead, just overloaded. This triggers an ERS, after which the newly elected primary also becomes overloaded, which leads to another `DeadPrimary`, which leads to another ERS, and so on. In other words, we get a cascading failure that can remove capacity from the shard.

Classical orchestrator had `RecoveryPeriodBlockSeconds` to avoid this. We propose having the same for `vtorc`/ERS, with suitable changes:
* Since `vtorc` instances do not share state, they need an external way to figure out whether an ERS has taken place within the cooldown period. We propose leveraging the reparenting journal for that.
* Given the above, we might as well put the logic outside of vtorc so it works not only for it but also for any externally triggered ERS.
* `EmergencyReparentShardRequest` would include an extra field that specifies a `CoolDown`.
* The ERS code would:
  * Fetch the last ERS reparenting journal entry of each tablet in the shard.
  * Find the most recent one.
  * Compare that with `now - CoolDown` and decide whether the ERS can proceed or not.

### Use Case(s)

Avoid cascading failure scenarios like the one described above.