Modify WaitUntilWorking init probe to provide more useful error messages#2440
Merged
Merged
Conversation
…messages Previously, server administrators may have encountered errors like ``` Web engine check failed: context deadline exceeded ``` when they had misconfigured their `Server.ExternalWebUrl` param. This error message is terrible because a) it doesn't really tell you what broke and b) it doesn't provide _any_ hints about how to fix it. With this refactor, such a misconfiguration will produce an error more similar to: ``` The server was unable to unable to ping its health test endpoint at the configured Server.ExternalWebUrl during startup: url https://s2gdk4nj4:666/api/v1.0/health didn't respond with the expected status code 200 within the timeout of 10s (Server.StartupTimeout): Get "https://s2gdk4nj4:666/api/v1.0/health": dial tcp: lookup s2gdk4nj4 on 192.168.65.7:53: no such host ``` While far more verbose, this tells the admin how they might adjust timeout values for the startup sequence, what param sets the probed endpoint, and most importantly -- the actual cause of the failure (no route to host). I wound up refactoring more of the `WaitUntilWorking` function than I initially planned to because it felt very ad-hoc and the logic felt a bit too sprawling for me.
This vein of work tries to better expose the reason `WaitUntilWorking` might be failing by attaching the last encountered error to the error message we produce instead of leaving it detached in a log statement. Previously I only attached errors that weren't context errors because I hadn't seen a context error that wasn't a generic timeout. However, while playing around with some of the tests, I discovered some context errors were more useful -- such as one that indicated the TCP connection was successful, but that the context timed out awaiting response headers. To capture these without overwriting more useful errors by the final context timeout that inevitably happens when the server doesn't pass this probe, I now grab the context error IFF it is the first error. In my testing, this captures the header timeout correctly.
I really dislike tests that rely on the particular wording of an error message to pass, but this is one of those cases where it really doesn't feel correct to spin off a whole bunch of typed errors only for easier test validation. To somewhat ameliorate the situation, I modified the existing tests so that they're less dependent on an exact error message and instead check for key words that I think are more likely to exist in any error message regardless of the exact wording. For example, if WaitUntilWorking expects an HTTP 200 but gets a 500, I check that the error contains "expected 200" and "500".
patrickbrophy
suggested changes
Jun 25, 2025
patrickbrophy
left a comment
Contributor
There was a problem hiding this comment.
I think there is some room for change with regards to the errors, but otherwise the refactor looks good.
Requested by the reviewer, this uses a ternary result instead of reyling on a sentinel error to signal "success", "retry" and "fail". I'd initially written this with a custom enum for the probe, but decided it was basically identical to the ternary utility I wrote awhile ago so I decided to use that instead with the caveat that 'Tern_Unknown' is being used here to mean "retry". I believe this should be clear with the provided comments.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This vein of work was to improve the error messages around this
WaitUntilWorkingstartup probe. The problem was that the probe fires every ~50ms until some timeout is reached, but we didn't do anything to keep track of any intermittent errors that occurred up until the very last failure. Instead, these intermediary errors were tucked into log messages that were hard to identify as coming from the function. Unfortunately, the last error was usually an unhelpful "context timeout exceeded" that did nothing to explain why the probe had been failing up until the timeout.With these changes, we keep track of the most recent error that may be generated by each firing of the probe so we can present that error if the overall function fails.