Skip to content

Modify WaitUntilWorking init probe to provide more useful error messages#2440

Merged
patrickbrophy merged 4 commits into
PelicanPlatform:mainfrom
jhiemstrawisc:issue-1807
Jul 3, 2025
Merged

Modify WaitUntilWorking init probe to provide more useful error messages#2440
patrickbrophy merged 4 commits into
PelicanPlatform:mainfrom
jhiemstrawisc:issue-1807

Conversation

@jhiemstrawisc

@jhiemstrawisc jhiemstrawisc commented Jun 25, 2025

Copy link
Copy Markdown
Member

This vein of work was to improve the error messages around this WaitUntilWorking startup probe. The problem was that the probe fires every ~50ms until some timeout is reached, but we didn't do anything to keep track of any intermittent errors that occurred up until the very last failure. Instead, these intermediary errors were tucked into log messages that were hard to identify as coming from the function. Unfortunately, the last error was usually an unhelpful "context timeout exceeded" that did nothing to explain why the probe had been failing up until the timeout.

With these changes, we keep track of the most recent error that may be generated by each firing of the probe so we can present that error if the overall function fails.

…messages

Previously, server administrators may have encountered errors like
```
Web engine check failed: context deadline exceeded
```
when they had misconfigured their `Server.ExternalWebUrl` param.
This error message is terrible because a) it doesn't really tell you
what broke and b) it doesn't provide _any_ hints about how to fix it.

With this refactor, such a misconfiguration will produce an error more
similar to:
```
The server was unable to unable to ping its health test endpoint at the
configured Server.ExternalWebUrl during startup: url
https://s2gdk4nj4:666/api/v1.0/health didn't respond with the expected
status code 200 within the timeout of 10s (Server.StartupTimeout): Get
"https://s2gdk4nj4:666/api/v1.0/health": dial tcp: lookup s2gdk4nj4 on
192.168.65.7:53: no such host
```

While far more verbose, this tells the admin how they might adjust timeout
values for the startup sequence, what param sets the probed endpoint,
and most importantly -- the actual cause of the failure (no route to host).

I wound up refactoring more of the `WaitUntilWorking` function than I
initially planned to because it felt very ad-hoc and the logic felt a
bit too sprawling for me.
This vein of work tries to better expose the reason `WaitUntilWorking`
might be failing by attaching the last encountered error to the error
message we produce instead of leaving it detached in a log statement.

Previously I only attached errors that weren't context errors because
I hadn't seen a context error that wasn't a generic timeout. However,
while playing around with some of the tests, I discovered some context
errors were more useful -- such as one that indicated the TCP connection
was successful, but that the context timed out awaiting response headers.

To capture these without overwriting more useful errors by the final
context timeout that inevitably happens when the server doesn't pass
this probe, I now grab the context error IFF it is the first error.

In my testing, this captures the header timeout correctly.
I really dislike tests that rely on the particular wording of an error
message to pass, but this is one of those cases where it really doesn't
feel correct to spin off a whole bunch of typed errors only for easier
test validation.

To somewhat ameliorate the situation, I modified the existing tests
so that they're less dependent on an exact error message and instead
check for key words that I think are more likely to exist in any error
message regardless of the exact wording.

For example, if WaitUntilWorking expects an HTTP 200 but gets a 500,
I check that the error contains "expected 200" and "500".
@jhiemstrawisc jhiemstrawisc added this to the v7.18 milestone Jun 25, 2025
@jhiemstrawisc jhiemstrawisc added internal Internal code improvements, not user-facing cache Issue relating to the cache component origin Issue relating to the origin component director Issue relating to the director component registry Issue relating to the registry component error messages Improvements to error messages labels Jun 25, 2025
@jhiemstrawisc jhiemstrawisc linked an issue Jun 25, 2025 that may be closed by this pull request
@jhiemstrawisc jhiemstrawisc modified the milestones: v7.18, April Docs Focus Jun 25, 2025

@patrickbrophy patrickbrophy left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is some room for change with regards to the errors, but otherwise the refactor looks good.

Comment thread server_utils/server_utils.go Outdated
Comment thread server_utils/server_utils.go Outdated
Requested by the reviewer, this uses a ternary result instead of reyling
on a sentinel error to signal "success", "retry" and "fail".

I'd initially written this with a custom enum for the probe, but decided
it was basically identical to the ternary utility I wrote awhile ago so
I decided to use that instead with the caveat that 'Tern_Unknown' is
being used here to mean "retry". I believe this should be clear with the
provided comments.

@patrickbrophy patrickbrophy left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@patrickbrophy patrickbrophy merged commit d9ea31d into PelicanPlatform:main Jul 3, 2025
25 of 26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cache Issue relating to the cache component director Issue relating to the director component error messages Improvements to error messages internal Internal code improvements, not user-facing origin Issue relating to the origin component registry Issue relating to the registry component

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve health check error messages

2 participants