Skip orphaned consumer assignments failing healthz#7930
Conversation
|
@codex review |
056fb0f to
2b9bd99
Compare
server/jetstream_cluster.go
Outdated
| // We'll start erroring once we're sure this consumer is actually broken. | ||
| return nil | ||
| } | ||
| if node == nil { |
There was a problem hiding this comment.
I don't think this is the right approach, I have the following worries:
- The test creates a R1 consumer that's assigned on the meta layer but can't be created due to filesystem/permission issues. Isn't this exactly what the health check needs to report? You've asked the system to run a consumer, but it isn't running, so the
consumer not founderror is legitimate. node == nilis only true for non-replicated consumers, as replicated consumers do havenodepopulated.- This seems to patch over what could be a genuine bug, I'd rather have us fix the bug itself than ignore it in the health check.
There was a problem hiding this comment.
Codex seems to think the same relating to health reporting green whereas the R1 consumer is missing: #7930 (comment).
There was a problem hiding this comment.
the bug is that those failed consumer assignments do not clear up... there is indeed a condition but healthz I do not think the right place to troubleshoot these ones probably (or at least not helping I think), we need to find a way to let the system start with consumers that failed to be created.
There was a problem hiding this comment.
the bug is that those failed consumer assignments do not clear up
That is not a bug though. In this test js.AddConsumer call properly got quorum, all servers within the cluster know that that consumer should exist, so it should exist. If it doesn't exist, it should be reported in healthz like it does now.
That's why I'm saying we must not ignore it, we need to find the original reason for the consumer not found. So, we need a reproducible example first and then see what the fix should be.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 056fb0fcd5
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
da4cb38 to
c70fa88
Compare
These do not go away automatically and can prevent the system from booting when relying on healthz passing to start. Instead rely on `consumer list` or another API reporting the missing consumers which can be deleted manually. Signed-off-by: Waldemar Quevedo <wally@nats.io>
c70fa88 to
d77c5df
Compare
|
@codex review |
Add the isConsumerHealthy ca.err check from PR nats-io#7930 and extensive tests covering all conditions where ca.err gets set, verifying the healthz endpoint correctly treats orphaned consumer assignments as healthy. Tests cover: - All error types set during processClusterCreateConsumer (stream not found, store failures, config validation, consumer limits, etc.) - ClusterNotAssigned error from processConsumerAssignmentResults - Inherited errors via ca.err = oca.err propagation - R1 and R3 consumer replica configurations - Full healthz endpoint integration with injected orphaned assignments - Behavior when consumer is actually running but has inherited ca.err https://claude.ai/code/session_015vdv3kuW1ghKLMjPyNeUJW
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d77c5dfb9f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if caErr != nil { | ||
| // The consumer assignment has a recorded error from a failed creation attempt. | ||
| // This is an orphaned assignment that will not recover on its own. | ||
| // Don't report it as a health error since there's nothing running to be unhealthy. | ||
| // These consumers can be detected via consumer list (shows the error) | ||
| // and cleaned up via consumer delete. | ||
| return nil |
There was a problem hiding this comment.
Restrict healthz bypass to true orphaned assignment errors
This branch now suppresses health failures for any missing consumer with a non-nil ca.err, but ca.err is populated for all consumer-create failures (including transient ones such as out-of-space) and is carried forward in assignment state; it is not reliably cleared after successful recovery paths. That means a consumer assignment that ever recorded an error can later lose its consumer and still pass healthz, causing readiness checks to report healthy while the consumer is actually absent. Please gate this bypass to the specific non-recoverable orphan condition(s) instead of all ca.err != nil cases.
Useful? React with 👍 / 👎.
These do not go away automatically and can prevent the system from booting when relying on healthz passing to start. Instead rely on
consumer listor another API reporting themissingconsumers which can be deleted manually.Signed-off-by: Waldemar Quevedo wally@nats.io