Skip to content

Skip orphaned consumer assignments failing healthz#7930

Draft
wallyqs wants to merge 1 commit intomainfrom
wq/healthz-consumer-not-found
Draft

Skip orphaned consumer assignments failing healthz#7930
wallyqs wants to merge 1 commit intomainfrom
wq/healthz-consumer-not-found

Conversation

@wallyqs
Copy link
Copy Markdown
Member

@wallyqs wallyqs commented Mar 10, 2026

These do not go away automatically and can prevent the system from booting when relying on healthz passing to start. Instead rely on consumer list or another API reporting the missing consumers which can be deleted manually.

Signed-off-by: Waldemar Quevedo wally@nats.io

@wallyqs
Copy link
Copy Markdown
Member Author

wallyqs commented Mar 10, 2026

@codex review

@wallyqs wallyqs force-pushed the wq/healthz-consumer-not-found branch from 056fb0f to 2b9bd99 Compare March 10, 2026 09:51
// We'll start erroring once we're sure this consumer is actually broken.
return nil
}
if node == nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right approach, I have the following worries:

  • The test creates a R1 consumer that's assigned on the meta layer but can't be created due to filesystem/permission issues. Isn't this exactly what the health check needs to report? You've asked the system to run a consumer, but it isn't running, so the consumer not found error is legitimate.
  • node == nil is only true for non-replicated consumers, as replicated consumers do have node populated.
  • This seems to patch over what could be a genuine bug, I'd rather have us fix the bug itself than ignore it in the health check.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex seems to think the same relating to health reporting green whereas the R1 consumer is missing: #7930 (comment).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the bug is that those failed consumer assignments do not clear up... there is indeed a condition but healthz I do not think the right place to troubleshoot these ones probably (or at least not helping I think), we need to find a way to let the system start with consumers that failed to be created.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the bug is that those failed consumer assignments do not clear up

That is not a bug though. In this test js.AddConsumer call properly got quorum, all servers within the cluster know that that consumer should exist, so it should exist. If it doesn't exist, it should be reported in healthz like it does now.

That's why I'm saying we must not ignore it, we need to find the original reason for the consumer not found. So, we need a reproducible example first and then see what the fix should be.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 056fb0fcd5

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@wallyqs wallyqs force-pushed the wq/healthz-consumer-not-found branch 2 times, most recently from da4cb38 to c70fa88 Compare March 10, 2026 10:20
These do not go away automatically and can prevent the system
from booting when relying on healthz passing to start.
Instead rely on `consumer list` or another API reporting the
missing consumers which can be deleted manually.

Signed-off-by: Waldemar Quevedo <wally@nats.io>
@wallyqs wallyqs force-pushed the wq/healthz-consumer-not-found branch from c70fa88 to d77c5df Compare March 10, 2026 10:23
@wallyqs
Copy link
Copy Markdown
Member Author

wallyqs commented Mar 10, 2026

@codex review

wallyqs pushed a commit to wallyqs/nats-server that referenced this pull request Mar 10, 2026
Add the isConsumerHealthy ca.err check from PR nats-io#7930 and extensive tests
covering all conditions where ca.err gets set, verifying the healthz
endpoint correctly treats orphaned consumer assignments as healthy.

Tests cover:
- All error types set during processClusterCreateConsumer (stream not
  found, store failures, config validation, consumer limits, etc.)
- ClusterNotAssigned error from processConsumerAssignmentResults
- Inherited errors via ca.err = oca.err propagation
- R1 and R3 consumer replica configurations
- Full healthz endpoint integration with injected orphaned assignments
- Behavior when consumer is actually running but has inherited ca.err

https://claude.ai/code/session_015vdv3kuW1ghKLMjPyNeUJW
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d77c5dfb9f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +724 to +730
if caErr != nil {
// The consumer assignment has a recorded error from a failed creation attempt.
// This is an orphaned assignment that will not recover on its own.
// Don't report it as a health error since there's nothing running to be unhealthy.
// These consumers can be detected via consumer list (shows the error)
// and cleaned up via consumer delete.
return nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Restrict healthz bypass to true orphaned assignment errors

This branch now suppresses health failures for any missing consumer with a non-nil ca.err, but ca.err is populated for all consumer-create failures (including transient ones such as out-of-space) and is carried forward in assignment state; it is not reliably cleared after successful recovery paths. That means a consumer assignment that ever recorded an error can later lose its consumer and still pass healthz, causing readiness checks to report healthy while the consumer is actually absent. Please gate this bypass to the specific non-recoverable orphan condition(s) instead of all ca.err != nil cases.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants