Cross-posted from #27961 (comment)
@optiz0r wrote:
We moved all nodes from the default node pool into separate node pools a few months ago and changed the namespace default node_pool at the same time, which left the default node pool empty. Some jobs (inheriting the namespace default node pool) were never properly restarted after this change, so their running allocations were still associated with the default node pool even though they were running on nodes that had been moved into a different node pool.
Upgrading from 1.10.8(+ent) to 2.0.2(+ent) today got Nomad into a state where those jobs failed to reschedule due to "no eligible nodes" (in the default node pool). Stopping and starting the jobs reassociated them with the namespace-default node pool, which did contain resources. But the allocations got stuck in the pending state, with the new node logging errors about being unable to fetch the previous allocation ID.
Jun 05 11:40:16 worker2 nomad[209858]: 2026-06-05T11:40:16.691+0100 [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: Unknown allocation \"4dbffbe7-fca9-b12c-c09e-3f82eb5c22b8\"" rpc=Alloc.GetAlloc server=10.x.x.x:4647
Jun 05 11:40:16 worker2 nomad[209858]: 2026-06-05T11:40:16.691+0100 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: Unknown allocation \"4dbffbe7-fca9-b12c-c09e-3f82eb5c22b8\"" rpc=Alloc.GetAlloc server=10.x.x.x:4647
The node on which the jobs had been running was logging the same errors reported in this thread, at a high rate:
Jun 05 12:02:38 worker1 nomad[909]: 2026-06-05T12:02:38.583+0100 [ERROR] client.rpc: error performing RPC to server: error="rpc error: Permission denied" rpc=Alloc.GetAllocs server=10.x.x.x:4647
It was not possible to drain the node, and even after rebooting the node it was still logging permission errors. To clear the fault, I stopped the Nomad agent, nuked the datadir (containing client-id and state.db) and rejoined the node to the cluster as a new node. After some time, Nomad correctly recovered all jobs, and permission errors are no longer being logged.
This looks like two different issues:
- When the client comes back online with a new node pool, it makes Alloc.GetAllocs calls to get updates on allocations it's supposed to have, but now those allocations are assigned to the wrong pool because they belong to an older version of the job. So the node is asking for the alloc, the server says "that's not in your node pool" and disallows it. This is a fairly significant bug.
- The Alloc.GetAlloc (singular) RPC call is from the alloc watcher, which is how the node migrates workloads from one node to another. This should fail, making it impossible to do the best-effort migration between node pools. (This was a known tradeoff.) But it shouldn't be looping there... it should fail once and then stop watching.
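
To make the second point concrete, here is a minimal sketch of the "fail once and then stop watching" behavior. It is not Nomad's implementation (Nomad is written in Go); it is written in Python for brevity, and every name in it (`watch_previous_alloc`, `get_alloc`, the error classes, `max_transient_retries`) is hypothetical. The assumption is that an "Unknown allocation" or "Permission denied" response is terminal for the best-effort migration, while only connection-level failures are worth retrying.

```python
from dataclasses import dataclass


class UnknownAllocError(Exception):
    """Terminal (assumed): the server does not know the previous allocation."""


class PermissionDeniedError(Exception):
    """Terminal (assumed): the server refuses, e.g. on a node pool mismatch."""


class TransientRPCError(Exception):
    """Retryable (assumed): e.g. a dropped connection to the server."""


# Errors after which the watcher should give up immediately.
TERMINAL_ERRORS = (UnknownAllocError, PermissionDeniedError)


@dataclass
class WatchResult:
    migrated: bool   # True if the previous allocation was fetched
    attempts: int    # number of RPC calls that were made
    reason: str = ""  # why the watcher stopped without migrating


def watch_previous_alloc(get_alloc, prev_alloc_id, max_transient_retries=3):
    """Best-effort fetch of the previous allocation.

    `get_alloc` is a hypothetical stand-in for the Alloc.GetAlloc RPC.
    A terminal error ends the watch after a single call; a transient
    error is retried at most `max_transient_retries` times.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            get_alloc(prev_alloc_id)
            return WatchResult(migrated=True, attempts=attempts)
        except TERMINAL_ERRORS as err:
            # Fail once and stop watching: skip migration and let the
            # new allocation start without the previous one's data.
            return WatchResult(False, attempts, f"terminal: {err}")
        except TransientRPCError as err:
            if attempts > max_transient_retries:
                return WatchResult(False, attempts, f"gave up: {err}")
```

With this shape, a server that always answers "Unknown allocation" is called exactly once, and the new allocation can proceed without the migrated data instead of staying in pending.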