Add health status to server ads and use to report high concurrency in origins/caches by jhiemstrawisc · Pull Request #2338 · PelicanPlatform/pelican

jhiemstrawisc · 2025-05-21T21:48:33Z

This PR does two primary things:

It adds information from an xrootd server's health status map to its outgoing advertisement so that the Director may eventually use this information in server filtering.
It spins off a routine that lets caches/origins scrape their own prometheus endpoint for information reported from the throttle plugin. When the result of that scrape indicates the server has more active IO operations than the configured limit, the server puts itself into a "degraded" health state.

This does not implement any Director-side logic that uses that information.
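
As a rough illustration of the second point, the check can be thought of as: decode one sample from a Prometheus instant-query response, then compare it to the configured concurrency limit. This is a minimal sketch, not the code in this PR. The type and function names (`promInstantResponse`, `activeIOFromScrape`, `healthFromActiveIO`) and the status strings are hypothetical, and the throttle-plugin metric that produces the sample is deliberately left unspecified.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strconv"
)

// promInstantResponse mirrors only the parts of a Prometheus HTTP API
// instant-query response that this sketch reads.
type promInstantResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			// Each sample is [unix timestamp, "value encoded as a string"].
			Value [2]interface{} `json:"value"`
		} `json:"result"`
	} `json:"data"`
}

// HealthStatus is a hypothetical stand-in for the server's health states.
type HealthStatus string

const (
	StatusOK       HealthStatus = "ok"
	StatusDegraded HealthStatus = "degraded"
)

// activeIOFromScrape extracts the first sample's value from a query response.
func activeIOFromScrape(body []byte) (float64, error) {
	var resp promInstantResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return 0, fmt.Errorf("decoding query response: %w", err)
	}
	if resp.Status != "success" {
		return 0, fmt.Errorf("query returned status %q", resp.Status)
	}
	if len(resp.Data.Result) == 0 {
		return 0, errors.New("query returned no samples")
	}
	raw, ok := resp.Data.Result[0].Value[1].(string)
	if !ok {
		return 0, errors.New("sample value is not a string")
	}
	return strconv.ParseFloat(raw, 64)
}

// healthFromActiveIO reports "degraded" only when the number of active IO
// operations exceeds the configured limit. A non-positive limit is treated
// here as "no limit configured" (an assumption of this sketch).
func healthFromActiveIO(activeIO float64, limit int) HealthStatus {
	if limit > 0 && activeIO > float64(limit) {
		return StatusDegraded
	}
	return StatusOK
}
```

A periodic routine would call `activeIOFromScrape` on each scrape result and feed the value, together with the configured limit, to `healthFromActiveIO`.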

jhiemstrawisc · 2025-05-21T21:51:03Z

@patrickbrophy I'm hoping you can take a look at the prometheus bits in this. In particular, I wasn't totally sure which metric I actually needed. I know the config I compare the scraped value against is fed to the throttle plugin, which helped me narrow things down to those metrics. I couldn't find documentation to understand the differences between the options I had there, however.

Since whether or not we need a token can be determined through our param package by inspecting the value of 'Monitoring.PromQLAuthorization', we can move conditional token generation completely into the function itself.
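
The shape of that change can be sketched roughly as follows, assuming a hypothetical `monitoringConfig` struct standing in for the param package and a hypothetical `mintToken` callback standing in for token creation. The `/api/v1/query` path is the stock Prometheus HTTP API path; the path Pelican actually serves may differ. This is an illustration only, not the signature of `QueryMyPrometheus`.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"path"
)

// monitoringConfig is a hypothetical stand-in for reading
// 'Monitoring.PromQLAuthorization' from configuration.
type monitoringConfig struct {
	PromQLAuthorization bool
}

// newPromQueryRequest builds an instant-query request. The caller no longer
// passes a "with token" flag: the function consults the configuration itself
// and only mints a token when authorization is required.
func newPromQueryRequest(cfg monitoringConfig, baseURL, query string,
	mintToken func() (string, error)) (*http.Request, error) {

	u, err := url.Parse(baseURL)
	if err != nil {
		return nil, fmt.Errorf("parsing base URL: %w", err)
	}
	u.Path = path.Join("/", u.Path, "api/v1/query")
	q := u.Query()
	q.Set("query", query)
	u.RawQuery = q.Encode()

	req, err := http.NewRequest(http.MethodGet, u.String(), nil)
	if err != nil {
		return nil, fmt.Errorf("building request: %w", err)
	}

	if cfg.PromQLAuthorization {
		tok, err := mintToken()
		if err != nil {
			return nil, fmt.Errorf("minting token: %w", err)
		}
		req.Header.Set("Authorization", "Bearer "+tok)
	}
	return req, nil
}
```

Keeping the decision inside the function means callers cannot disagree with the configuration about whether a token is needed.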

It occurred to me that this was another value we should move into the function body, for a few reasons. It's not really valid to pass values like "director/Director", "Cache", or "Origin" through the function signature. The JWT RFC states that token subjects MUST either be unique within their issuer scope or globally unique, so lumping all "Origin"s and "Director"s into a single string is a violation. Moreover, the function name implies we already know both the sender and receiver of the token -- "me". We now simply grab the server's external Web URL inside the function and use that both to satisfy subject uniqueness and to standardize the values that are used.

h2zh

LGTM

jhiemstrawisc added 3 commits May 21, 2025 20:30

Fix duplicate/redundant advertise bug

731b59a

Move queryPrometheus to more common base

0d7749f

Use Prometheus's active IO metrics to set health status for Director …

c922b4b

…consumption

jhiemstrawisc requested review from h2zh and patrickbrophy May 21, 2025 21:48

jhiemstrawisc assigned h2zh May 21, 2025

jhiemstrawisc added the enhancement, cache, origin, director, and monitoring labels May 21, 2025

jhiemstrawisc linked an issue May 21, 2025 that may be closed by this pull request

Give origins/servers ability to indicate they're experiencing high concurrency to Director #2337

Closed

Lint

cf8b31e

h2zh requested changes May 22, 2025

Comment thread server_utils/prom_query.go Outdated

Comment thread director/director.go

Comment thread server_utils/server_utils.go

Comment thread server_utils/server_utils.go

Comment thread server_utils/server_utils.go

Comment thread server_utils/server_utils.go

patrickbrophy reviewed May 23, 2025

Comment thread director/director_api.go Outdated

jhiemstrawisc added 3 commits May 27, 2025 15:54

Remove 'withToken' from 'QueryMyPrometheus' signature

9674da8

Since whether or not we need a token can be determined through our param package by inspecting the value of 'Monitoring.PromQLAuthorization', we can move conditional token generation completely into the function itself.

Plumb server ad status back into UI representations of server ads

09c1e3e

jhiemstrawisc requested a review from h2zh May 27, 2025 20:53

h2zh approved these changes May 28, 2025

h2zh merged commit d6b80cb into PelicanPlatform:main May 28, 2025
26 checks passed

h2zh added this to the v7.17 milestone Jun 2, 2025

This was referenced Jun 30, 2025

Origin.Concurrency health check triggers Prometheus query errors when Web UI is disabled #2458

Closed

Add a Monitoring.EnablePrometheus configuration parameter to disable the embedded Prometheus server and related checks #2465

Merged

jhiemstrawisc mentioned this pull request Jul 14, 2025

Use server load metric to throttle redirect choices in Director #2491

Closed

h2zh merged 7 commits into PelicanPlatform:main from jhiemstrawisc:issue-2337
