Modify `WaitUntilWorking` init probe to provide more useful error messages by jhiemstrawisc · Pull Request #2440 · PelicanPlatform/pelican

jhiemstrawisc · 2025-06-25T15:44:38Z

This vein of work was to improve the error messages around this WaitUntilWorking startup probe. The problem was that the probe fires every ~50ms until some timeout is reached, but we didn't do anything to keep track of any intermittent errors that occurred up until the very last failure. Instead, these intermediary errors were tucked into log messages that were hard to identify as coming from the function. Unfortunately, the last error was usually an unhelpful "context timeout exceeded" that did nothing to explain why the probe had been failing up until the timeout.

With these changes, we keep track of the most recent error that may be generated by each firing of the probe so we can present that error if the overall function fails.

…messages Previously, server administrators may have encountered errors like ``` Web engine check failed: context deadline exceeded ``` when they had misconfigured their `Server.ExternalWebUrl` param. This error message is terrible because a) it doesn't really tell you what broke and b) it doesn't provide _any_ hints about how to fix it. With this refactor, such a misconfiguration will produce an error more similar to: ``` The server was unable to unable to ping its health test endpoint at the configured Server.ExternalWebUrl during startup: url https://s2gdk4nj4:666/api/v1.0/health didn't respond with the expected status code 200 within the timeout of 10s (Server.StartupTimeout): Get "https://s2gdk4nj4:666/api/v1.0/health": dial tcp: lookup s2gdk4nj4 on 192.168.65.7:53: no such host ``` While far more verbose, this tells the admin how they might adjust timeout values for the startup sequence, what param sets the probed endpoint, and most importantly -- the actual cause of the failure (no route to host). I wound up refactoring more of the `WaitUntilWorking` function than I initially planned to because it felt very ad-hoc and the logic felt a bit too sprawling for me.

This vein of work tries to better expose the reason `WaitUntilWorking` might be failing by attaching the last encountered error to the error message we produce instead of leaving it detached in a log statement. Previously I only attached errors that weren't context errors because I hadn't seen a context error that wasn't a generic timeout. However, while playing around with some of the tests, I discovered some context errors were more useful -- such as one that indicated the TCP connection was successful, but that the context timed out awaiting response headers. To capture these without overwriting more useful errors by the final context timeout that inevitably happens when the server doesn't pass this probe, I now grab the context error IFF it is the first error. In my testing, this captures the header timeout correctly.

I really dislike tests that rely on the particular wording of an error message to pass, but this is one of those cases where it really doesn't feel correct to spin off a whole bunch of typed errors only for easier test validation. To somewhat ameliorate the situation, I modified the existing tests so that they're less dependent on an exact error message and instead check for key words that I think are more likely to exist in any error message regardless of the exact wording. For example, if WaitUntilWorking expects an HTTP 200 but gets a 500, I check that the error contains "expected 200" and "500".

patrickbrophy

I think there is some room for change with regards to the errors, but otherwise the refactor looks good.

Requested by the reviewer, this uses a ternary result instead of reyling on a sentinel error to signal "success", "retry" and "fail". I'd initially written this with a custom enum for the probe, but decided it was basically identical to the ternary utility I wrote awhile ago so I decided to use that instead with the caveat that 'Tern_Unknown' is being used here to mean "retry". I believe this should be clear with the provided comments.

patrickbrophy

LGTM

jhiemstrawisc added 3 commits June 17, 2025 19:28

jhiemstrawisc added this to the v7.18 milestone Jun 25, 2025

jhiemstrawisc requested a review from patrickbrophy June 25, 2025 15:44

jhiemstrawisc assigned patrickbrophy Jun 25, 2025

jhiemstrawisc linked an issue Jun 25, 2025 that may be closed by this pull request

Improve health check error messages #1807

Closed

jhiemstrawisc modified the milestones: v7.18, April Docs Focus Jun 25, 2025

patrickbrophy suggested changes Jun 25, 2025

View reviewed changes

Comment thread server_utils/server_utils.go Outdated

Comment thread server_utils/server_utils.go Outdated

jhiemstrawisc requested a review from patrickbrophy July 3, 2025 18:11

patrickbrophy approved these changes Jul 3, 2025

View reviewed changes

patrickbrophy merged commit d9ea31d into PelicanPlatform:main Jul 3, 2025
25 of 26 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Modify `WaitUntilWorking` init probe to provide more useful error messages#2440

Modify `WaitUntilWorking` init probe to provide more useful error messages#2440
patrickbrophy merged 4 commits into
PelicanPlatform:mainfrom
jhiemstrawisc:issue-1807

jhiemstrawisc commented Jun 25, 2025 •

edited

Loading

Uh oh!

patrickbrophy left a comment

Uh oh!

Uh oh!

Uh oh!

patrickbrophy left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jhiemstrawisc commented Jun 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

patrickbrophy left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

patrickbrophy left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jhiemstrawisc commented Jun 25, 2025 •

edited

Loading