[BUG] Scale down failure leaves leaked temporary cluster block (id=20), index permanently write-blocked #21188

@HUSTERGS

Description

Describe the bug

When performing a scale-down operation via prepareScaleSearchOnly(index, true), the operation adds a temporary index cluster block (id=20, 'preparing to scale down'). If the subsequent shard sync phase fails (for example, due to a transport/network failure, a node-side rejection, or an exception on the data node), the temporary block is never cleaned up, leaving the index write-blocked but not search-only. This makes the index unusable for normal writes, and the block persists until the entire cluster is restarted or a manual state-repair routine is applied. Index settings updates (including index.blocks.search_only and index.blocks.read_only) cannot clear the temporary block, since it has a different UUID than the setting-based blocks.

Related component

Search

To Reproduce

  1. Create an index with segment replication and remote store enabled, plus at least one search replica.
  2. Ensure cluster is green.
  3. Set up a test (e.g., using MockTransportService) to intercept and force the TransportScaleIndexAction.NAME transport request on data node(s) to throw an exception.
  4. Trigger scale down: client().admin().indices().prepareScaleSearchOnly(index, true).get()
  5. Observe that the operation fails as expected (shard sync phase). Examine the cluster state:
    • clusterState.blocks().hasIndexBlockWithId(index, 20) returns true: The block lingers.
    • The index is not marked as search_only in settings (search_only setting is false).
    • Normal writes are now blocked (ClusterBlockException with block id 20).
  6. Try updating index.blocks.search_only or index.blocks.read_only to false; the block remains.
  7. Only a full cluster restart removes the temporary block from state.
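The failure sequence above can be sketched as a minimal, dependency-free model. This is illustrative only, not OpenSearch code: the class, the `shardSync` method, and the plain `Set` standing in for cluster-state blocks are all simplified stand-ins.

```java
import java.util.HashSet;
import java.util.Set;

// Minimal model of the failure path (NOT OpenSearch code): the scale-down
// flow adds a temporary block (id=20), the shard sync step throws, and
// nothing on the failure path removes the block again.
public class LeakedBlockDemo {
    static final int SCALE_DOWN_BLOCK_ID = 20; // 'preparing to scale down'

    final Set<Integer> indexBlocks = new HashSet<>();
    boolean searchOnly = false;

    void scaleDownWithoutRollback() {
        indexBlocks.add(SCALE_DOWN_BLOCK_ID);        // step 1: temporary block
        try {
            shardSync();                             // step 2: fails
            searchOnly = true;                       // never reached
            indexBlocks.remove(SCALE_DOWN_BLOCK_ID); // never reached
        } catch (RuntimeException e) {
            // bug: no cleanup on failure, so the block leaks
        }
    }

    void shardSync() {
        throw new RuntimeException("simulated transport failure");
    }

    public static void main(String[] args) {
        LeakedBlockDemo cluster = new LeakedBlockDemo();
        cluster.scaleDownWithoutRollback();
        // Mirrors the observed cluster state after step 5 above.
        System.out.println("block 20 leaked: " + cluster.indexBlocks.contains(SCALE_DOWN_BLOCK_ID));
        System.out.println("search_only: " + cluster.searchOnly);
    }
}
```

Running the model ends in the same stuck state the reproduction observes: the block is present, yet the index was never switched to search-only.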

Expected behavior

If scale down fails (e.g., due to shard sync failure), the temporary block should be removed and the index should return to normal operation. Write operations should be allowed, and normal state/setting updates should work. There should never be a persistent write-block with no way to recover except restart.

Additional Details

OpenSearch version: 2.x through latest main.

Plugins:
None required to reproduce.

Impact:

  • If scale down fails, the index becomes permanently write-blocked because a temporary block is leaked.
  • Even index.blocks.read_only and index.blocks.search_only cannot clear it, since the UUID is different.
  • Only a cluster restart (or dirty state update tool) will clear the block.
  • This can be reproduced deterministically in IT using MockTransportService's addRequestHandlingBehavior to throw on TransportScaleIndexAction.NAME.

Potential fix:

The scale-down flow should guarantee that the temporary block is removed whenever the shard sync fails, e.g., by rolling back in the onFailure callback of the listener passed from proceedWithScaleDown.
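A sketch of that rollback pattern, under stated assumptions: `ActionListener` here is a stand-in interface, and `proceedWithScaleDown`/`shardSync` are simplified placeholders for the real methods in TransportScaleIndexAction, not their actual signatures. The point is only that the caller's listener is wrapped so onFailure always removes the temporary block before propagating the error.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative rollback sketch (NOT the actual OpenSearch API):
// pair the block-adding step with guaranteed cleanup on the failure path.
public class ScaleDownRollbackSketch {
    static final int SCALE_DOWN_BLOCK_ID = 20;

    interface ActionListener<T> {
        void onResponse(T result);
        void onFailure(Exception e);
    }

    final Set<Integer> indexBlocks = new HashSet<>();
    boolean searchOnly = false;

    void proceedWithScaleDown(ActionListener<Void> listener) {
        indexBlocks.add(SCALE_DOWN_BLOCK_ID); // temporary block for the sync phase
        ActionListener<Void> withRollback = new ActionListener<Void>() {
            @Override public void onResponse(Void r) {
                searchOnly = true;                       // finalize: flip the setting
                indexBlocks.remove(SCALE_DOWN_BLOCK_ID); // temp block no longer needed
                listener.onResponse(r);
            }
            @Override public void onFailure(Exception e) {
                indexBlocks.remove(SCALE_DOWN_BLOCK_ID); // rollback: drop the temp block
                listener.onFailure(e);
            }
        };
        shardSync(withRollback);
    }

    void shardSync(ActionListener<Void> listener) {
        // Simulate the shard sync phase failing, as in the reproduction.
        listener.onFailure(new RuntimeException("simulated shard sync failure"));
    }
}
```

With the wrapper in place, a failed sync leaves no block behind and the index stays writable, while a successful sync still finalizes the search-only transition.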

See the test example in this report; it relates to code in TransportScaleIndexAction, AddBlockClusterStateUpdateTask, and ScaleIndexShardSyncManager.

    Labels

Search (search query, autocomplete, etc.), bug (something isn't working), untriaged