
Add health status to server ads and use to report high concurrency in origins/caches #2338

Merged
h2zh merged 7 commits into PelicanPlatform:main from jhiemstrawisc:issue-2337 on May 28, 2025


@jhiemstrawisc

Member

This PR does two primary things:

  1. It adds information from an xrootd server's health status map to its outgoing advertisement so that the Director may eventually use this information in server filtering.
  2. It spins off a routine that lets caches/origins scrape their own prometheus endpoint for information reported from the throttle plugin. When the result of that scrape indicates the server has more active IO operations than the configured limit, the server puts itself into a "degraded" health state.

This does not implement any special logic Director side to use that information.
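
The self-check in item 2 can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `HealthStatus`, `activeIOFromQuery`, and `evaluateConcurrency` are hypothetical names, the input is assumed to be a standard Prometheus instant-query JSON response, and treating a non-positive limit as "no limit" is an assumption made for the sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// HealthStatus is a hypothetical type for this sketch; the PR's real
// health status map may use different names and values.
type HealthStatus string

const (
	StatusOK       HealthStatus = "ok"
	StatusDegraded HealthStatus = "degraded"
)

// promResponse mirrors the subset of a Prometheus instant-query response
// that this sketch needs.
type promResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			// Each sample is a [timestamp, "value"] pair.
			Value [2]interface{} `json:"value"`
		} `json:"result"`
	} `json:"data"`
}

// activeIOFromQuery extracts the first sample value from a query response.
func activeIOFromQuery(body []byte) (float64, error) {
	var resp promResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return 0, fmt.Errorf("decoding query response: %w", err)
	}
	if resp.Status != "success" {
		return 0, fmt.Errorf("query status %q", resp.Status)
	}
	if len(resp.Data.Result) == 0 {
		return 0, fmt.Errorf("query returned no samples")
	}
	raw, ok := resp.Data.Result[0].Value[1].(string)
	if !ok {
		return 0, fmt.Errorf("unexpected sample value type %T", resp.Data.Result[0].Value[1])
	}
	return strconv.ParseFloat(raw, 64)
}

// evaluateConcurrency reports "degraded" only when active IO operations
// exceed the configured limit. A limit <= 0 is treated as "no limit",
// which is an assumption of this sketch.
func evaluateConcurrency(active float64, limit int) HealthStatus {
	if limit > 0 && active > float64(limit) {
		return StatusDegraded
	}
	return StatusOK
}

func main() {
	body := []byte(`{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1748000000,"250"]}]}}`)
	active, err := activeIOFromQuery(body)
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	fmt.Println(evaluateConcurrency(active, 200))
}
```

The comparison is strict, matching the description's "more active IO operations than the configured limit"; a server sitting exactly at the limit stays healthy in this sketch.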

@jhiemstrawisc added the enhancement, cache, origin, director, and monitoring labels on May 21, 2025
@jhiemstrawisc

Member Author

@patrickbrophy I'm hoping you can take a look at the prometheus bits in this. In particular, I wasn't totally sure which metric I actually needed. I know the config I compare the scraped value against is fed to the throttle plugin, which helped me narrow things down to those metrics. I couldn't find documentation to understand the differences between the options I had there, however.

Comment thread server_utils/prom_query.go Outdated
Comment thread director/director.go
Comment thread server_utils/server_utils.go
Comment thread server_utils/server_utils.go
Comment thread server_utils/server_utils.go
Comment thread server_utils/server_utils.go
Comment thread director/director_api.go Outdated
Since whether or not we need a token can be determined through our
param package by inspecting the value of 'Monitoring.PromQLAuthorization',
we can move conditional token generation completely into the function
itself.
It occurred to me that this was another value we should move into the
function body for a few reasons...

It's not really valid to pass values like "director/Director",
"Cache", or "Origin" through the function signature. The JWT RFC states
that token subjects MUST either be unique within their issuer scope or
globally unique, so lumping all "Origin"s and "Director"s into a single
string is a violation.

Moreover, the function name implies we already know both the sender and
receiver of the token -- "me".

We now simply grab the server's external Web URL inside the function
and use that to satisfy both subject uniqueness and standardize on
the values that are used.
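
A rough sketch of the shape these two commit messages describe: the function itself decides whether a token is needed and derives the subject from the server's own external Web URL, so neither value appears in the signature. Everything here is hypothetical and for illustration only: `selfQueryConfig` stands in for values the real code reads from its param package, `tokenMinter` stands in for whatever signs the token, and the `/api/v1/query` path is an assumed query endpoint.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// selfQueryConfig is a hypothetical stand-in for configuration values.
type selfQueryConfig struct {
	PromQLAuthorization bool   // mirrors 'Monitoring.PromQLAuthorization'
	ExternalWebURL      string // the server's own external Web URL
}

// tokenMinter is a hypothetical hook that signs a token for a subject.
type tokenMinter func(subject string) (string, error)

// newSelfQueryRequest builds a PromQL query request against the server's
// own endpoint. Callers pass neither a token nor a subject.
func newSelfQueryRequest(cfg selfQueryConfig, promQL string, mint tokenMinter) (*http.Request, error) {
	q := url.Values{}
	q.Set("query", promQL)
	// Assumed path; the real endpoint may differ.
	target := strings.TrimRight(cfg.ExternalWebURL, "/") + "/api/v1/query?" + q.Encode()

	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return nil, fmt.Errorf("building request: %w", err)
	}

	// Only mint a token when the endpoint requires authorization.
	if cfg.PromQLAuthorization {
		// The server's own URL serves as a unique subject.
		tok, err := mint(cfg.ExternalWebURL)
		if err != nil {
			return nil, fmt.Errorf("minting token: %w", err)
		}
		req.Header.Set("Authorization", "Bearer "+tok)
	}
	return req, nil
}

func main() {
	cfg := selfQueryConfig{PromQLAuthorization: true, ExternalWebURL: "https://origin.example.org:8444"}
	mint := func(subject string) (string, error) { return "token-for-" + subject, nil }
	req, err := newSelfQueryRequest(cfg, "up", mint)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.URL.String())
	fmt.Println(req.Header.Get("Authorization"))
}
```

Keeping the subject out of the signature means callers cannot pass a shared string such as "Origin", which is the uniqueness problem the commit message raises.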
@jhiemstrawisc jhiemstrawisc requested a review from h2zh May 27, 2025 20:53

@h2zh h2zh left a comment

Contributor


LGTM



Development

Successfully merging this pull request may close these issues.

Give origins/servers ability to indicate they're experiencing high concurrency to Director

3 participants