nodes can't stop allocations if they've changed node pools

Cross-posted from https://github.com/hashicorp/nomad/issues/27961#issuecomment-4631069472

@optiz0r wrote:

> We moved all nodes from default node pool into separate node pools a few months ago, along with changing namespace default node_pool, and the default node pool became empty. Some jobs (inheriting the namespace default node pool) were never properly restarted after this change, so the running allocations were still associated with the default node pool while still running on nodes which were moved into a different node pool.
> 
> Doing an upgrade from 1.10.8(+ent) to 2.0.2(+ent) today got nomad into a state where those jobs failed to reschedule due to "no eligible nodes" (in default node pool). Stopping and starting the jobs reassociated them with the namespace-default node pool which did contain resources. But the allocations stuck in pending state with the new node logging errors being unable to fetch the previous allocation ID.

```
Jun 05 11:40:16 worker2 nomad[209858]:     2026-06-05T11:40:16.691+0100 [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: Unknown allocation \"4dbffbe7-fca9-b12c-c09e-3f82eb5c22b8\"" rpc=Alloc.GetAlloc server=10.x.x.x:4647
Jun 05 11:40:16 worker2 nomad[209858]:     2026-06-05T11:40:16.691+0100 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: Unknown allocation \"4dbffbe7-fca9-b12c-c09e-3f82eb5c22b8\"" rpc=Alloc.GetAlloc server=10.x.x.x:4647
```

> The node on which the jobs had been running was logging the same errors in this thread at high rates:

```
Jun 05 12:02:38 worker1 nomad[909]:     2026-06-05T12:02:38.575+0100 [ERROR] client: error querying updated allocations: error="rpc error: Permission denied"
Jun 05 12:02:38 worker1 nomad[909]:     2026-06-05T12:02:38.583+0100 [ERROR] client.rpc: error performing RPC to server: error="rpc error: Permission denied" rpc=Alloc.GetAllocs server=10.x.x.x:4647
```

> It was not possible to drain the node, and even after rebooting the node it was still logging permission errors. To clear the fault, I stopped nomad agent, nuked the datadir (containing client-id and state.db) and rejoined it to the cluster as a new node. After some time, nomad correctly recovered all jobs, and permission errors are no longer being logged. 

This looks like two different issues:
* When the client comes back online in a new node pool, it makes `Alloc.GetAllocs` calls to get updates on the allocations it's supposed to be running. Those allocations belong to an older version of the job, so they're still associated with the old node pool. The node asks for the alloc, the server says "that's not in your node pool" and denies the request. This is a fairly significant bug.
* The `Alloc.GetAlloc` (singular) RPC call is from the alloc watcher, which is how a node migrates workloads from another node. This call _should_ fail, which makes best-effort migration between node pools impossible. (This was a known tradeoff.) But it shouldn't be looping there: it should fail once and then stop watching.

nodes can't stop allocations if they've changed node pools #28093

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

nodes can't stop allocations if they've changed node pools #28093

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions