Graceful shutdown an Origin or Cache by h2zh · Pull Request #2318 · PelicanPlatform/pelican

h2zh · 2025-05-16T01:04:40Z

Trigger a graceful shutdown when a Pelican Origin or Cache receives a SIGTERM signal (kill -TERM <pid>).

Upon receiving SIGTERM, the Origin or Cache process waits one minute for in-flight transfers to complete. If any transfers remain after the one-minute grace period, the server shuts down immediately.
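The wait described above could look roughly like the following Go sketch. It is an illustration only, not the code in this PR: `waitForDrain`, the `transfers` wait group, and the hard-coded `grace` value are hypothetical names, and the demo sends SIGTERM to its own process (Unix-only) so that it terminates on its own.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// waitForDrain reports whether the drained channel was closed (all in-flight
// transfers finished) before the grace period ran out.
func waitForDrain(drained <-chan struct{}, grace time.Duration) bool {
	timer := time.NewTimer(grace)
	defer timer.Stop()
	select {
	case <-drained:
		return true
	case <-timer.C:
		return false
	}
}

func main() {
	// Hypothetical value; the PR describes a one-minute grace period.
	grace := time.Minute

	// ctx is cancelled when the process receives SIGTERM.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	// Hypothetical stand-in for in-flight transfers: a single transfer
	// that finishes 200ms from now.
	var transfers sync.WaitGroup
	transfers.Add(1)
	go func() {
		defer transfers.Done()
		time.Sleep(200 * time.Millisecond)
	}()

	// For demonstration only: send SIGTERM to ourselves (Unix-only).
	_ = syscall.Kill(os.Getpid(), syscall.SIGTERM)

	<-ctx.Done()
	fmt.Println("SIGTERM received; waiting up to", grace, "for transfers")

	// Close drained once every tracked transfer has finished.
	drained := make(chan struct{})
	go func() {
		transfers.Wait()
		close(drained)
	}()

	if waitForDrain(drained, grace) {
		fmt.Println("all transfers finished; shutting down")
	} else {
		fmt.Println("grace period expired; shutting down anyway")
	}
}
```

Either branch ends in shutdown; the grace period only bounds how long the process waits before doing so.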

Before the wait, the Origin or Cache sends a shutdown advertisement with the new Status field (="shutting down") to the Director, instructing it to stop routing any new transfer requests to this server. The Director then applies an indefinite downtime window to this server, so that offline servers still listed in its cached serverAds will be excluded from new requests. Whenever the server is back online, it will send a regular server ad to flush out the "shutting down" downtime.
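On the Director side, the bookkeeping described above might be sketched as follows. Every name here (`ServerAd`, `Director`, `HandleAd`, `CanRoute`) is hypothetical and chosen for illustration; Pelican's real types and downtime filters are more involved. The sketch only shows the two transitions: a "shutting down" ad puts the server into an open-ended downtime, and the next regular ad clears it.

```go
package main

import (
	"fmt"
	"sync"
)

// statusShuttingDown mirrors the Status value quoted in the PR description.
const statusShuttingDown = "shutting down"

// ServerAd is a hypothetical, stripped-down server advertisement.
type ServerAd struct {
	Name   string
	Status string // statusShuttingDown for a shutdown ad, empty for a regular ad
}

// Director is a hypothetical stand-in that tracks only shutdown downtimes.
type Director struct {
	mu       sync.Mutex
	downtime map[string]bool // servers excluded from new requests
}

func NewDirector() *Director {
	return &Director{downtime: make(map[string]bool)}
}

// HandleAd applies an open-ended downtime for a shutdown ad and lifts it
// again when a regular ad arrives from the same server.
func (d *Director) HandleAd(ad ServerAd) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if ad.Status == statusShuttingDown {
		d.downtime[ad.Name] = true // no expiry is recorded: indefinite
		return
	}
	delete(d.downtime, ad.Name) // a regular ad flushes the downtime
}

// CanRoute reports whether new requests may be sent to the named server.
func (d *Director) CanRoute(name string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	return !d.downtime[name]
}

func main() {
	d := NewDirector()
	d.HandleAd(ServerAd{Name: "cache-1"})
	fmt.Println(d.CanRoute("cache-1")) // true

	d.HandleAd(ServerAd{Name: "cache-1", Status: statusShuttingDown})
	fmt.Println(d.CanRoute("cache-1")) // false

	d.HandleAd(ServerAd{Name: "cache-1"}) // server is back online
	fmt.Println(d.CanRoute("cache-1")) // true
}
```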

cc @bbockelm Here is a formal up-to-date system design diagram. (Content in this PR is marked in red)

h2zh · 2025-05-19T16:44:03Z

Hey @jhiemstrawisc, could you review this before or after you work on the status field, whenever you think the time is right? You mentioned that you plan to integrate shuttingdown and degraded into the existing healthStatus in the serverAds, so I think I might need to change this PR a lot to incorporate that.

brianaydemir · 2025-05-29T13:30:07Z

If any transfers remain after the one-minute grace period, the server shuts down immediately.

@jhiemstrawisc @h2zh Is this grace period configurable? Do we have an idea how long "immediately" actually takes?

I'm thinking ahead to origins and caches running on a Kubernetes cluster, and what it might take to configure them to shut down gracefully.

brianaydemir · 2025-05-29T13:47:16Z

The Director then applies a short-term downtime window (=Director_AdvertisementTTL+Shutdown_Grace_period=15+1=16m by default) to this server, so that offline servers still listed in its cached serverAds are excluded from new requests.

So, imagine the case where an admin is performing a routine upgrade of a cache or origin. We gracefully shut down the service, and very soon after start it back up. If all goes well, this only takes a few minutes.

In this scenario, what's the mechanism for removing the downtime? (Do we need the downtime in the first place?)

h2zh · 2025-05-30T19:12:55Z

Hey @brianaydemir, thanks for your suggestions!

I updated the code to make the shutdown timeout configurable. "Immediately" here is meant literally (definitely less than 1s).
I also changed the logic in the code. The Director applies an indefinite downtime window to the server when it receives the shutdown signal. The Origin/Cache server loses the StatusShuttingDown in its in-memory healthStatus once its process is killed. Whenever the server is back online, it sends a regular server ad to flush out the old "shutting down" downtime.

I also updated the write-up and diagram on the top to reflect the latest workflow in the code.

…gnal
- Create and propagate a special advertisement for shutdown, asking the Director to stop redirecting new transfer requests to this server
- ShuttingDown status should have the highest priority in the health status. Once the `OriginCache_XRootD` component is set to `StatusShuttingDown`, the Director has to pick it up from the `Status` field in the server ad and put the server into downtime
- Wait for Xrootd_ShutdownTimeout for the in-flight transfers to finish
- If the in-flight transfers are not finished by the timeout, still shut down the server
- Bugfix: In daemon.LaunchDaemons, write expiry back into the slice. Otherwise the original for-loop modifies the copy and doesn't affect the original, so the new expiry time is not saved

- And lift the downtime when the server is back online
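The bugfix mentioned in the commit message above (writing the expiry back into the slice) comes from a common Go pitfall: the value variable in a `for ... range` loop is a copy of each element. The sketch below reproduces the pattern with a hypothetical `daemon` struct and an `int64` expiry; it is not the actual code from `daemon.LaunchDaemons`.

```go
package main

import "fmt"

// daemon is a hypothetical stand-in for an entry in the daemon slice.
type daemon struct {
	name   string
	expiry int64 // Unix seconds, kept as a plain integer for simplicity
}

// extendByCopy shows the bug: d is a copy of each element, so the
// assignment never reaches the slice.
func extendByCopy(daemons []daemon, newExpiry int64) {
	for _, d := range daemons {
		if d.expiry < newExpiry {
			d.expiry = newExpiry // modifies the copy only
		}
	}
}

// extendWithWriteBack shows the fix: the updated copy is written back
// into the slice at its index.
func extendWithWriteBack(daemons []daemon, newExpiry int64) {
	for i, d := range daemons {
		if d.expiry < newExpiry {
			d.expiry = newExpiry
			daemons[i] = d // save the new expiry in the original slice
		}
	}
}

func main() {
	ds := []daemon{{name: "xrootd", expiry: 100}}

	extendByCopy(ds, 200)
	fmt.Println(ds[0].expiry) // prints 100: the update was lost

	extendWithWriteBack(ds, 200)
	fmt.Println(ds[0].expiry) // prints 200: the update was saved
}
```

Indexing directly (`daemons[i].expiry = newExpiry`) or using a slice of pointers would work as well; the write-back form simply matches the wording of the commit message.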

jhiemstrawisc

A few comments/questions/suggestions. Otherwise I tested this locally and observed both the graceful shutdown and that restarting the Origin cleans out the old filter. Nice job!

jhiemstrawisc · 2025-06-12T17:04:39Z

One other note @h2zh -- I love that you often provide system diagrams with your PRs. You set a good example for the rest of us. Do you also store these in our internal documentation in the Pelican drive? If not, can you find a reasonable spot for them and put them there so we have some organized internal references to them?

- Address PR review
- Change webUI: disable (grey out) the "Restart Server" button for 1 min after the user clicks it, to allow a graceful shutdown and then a restart

jhiemstrawisc

Getting super close! I'm going to pre-approve so you can merge once the last grammar change is in. (I did not review the react changes)

Nice work with this one, I hope it's really helpful for our "0.1%" goal!

h2zh · 2025-06-17T19:03:33Z

One other note @h2zh -- I love that you often provide system diagrams with your PRs. You set a good example for the rest of us. Do you also store these in our internal documentation in the Pelican drive? If not, can you find a reasonable spot for them and put them there so we have some organized internal references to them?

@jhiemstrawisc Thanks! I really try to make my work neat, structured, and easy for everyone to understand, especially since I recently learned about the idea of "translational computing".

You bring up a good point about storing these diagrams in our internal documentation. I've been relying on GitHub PRs for feature write-ups because it offers some advantages:

Single Source of Truth: I only have to maintain one version, keeping the PR write-up up-to-date and linking all relevant tickets and other PRs directly within it.
Contextual Timeline: It's easier for anyone viewing the PR to grasp the entire timeline and how the feature evolved alongside the code changes.

However, I agree that having a central reference point in our Google Drive would be beneficial. What if we create a "symlink" (a file containing a link) in the "Design Docs" folder? I would name the file after the feature, and its content would simply be the direct link to the GitHub PR. This way, we get the best of both: the PR remains the living document, and our drive serves as an organized index.

What do you think of that approach?

h2zh added labels May 16, 2025: critical (High priority for next release), cache (Issue relating to the cache component), origin (Issue relating to the origin component), director (Issue relating to the director component)

h2zh linked an issue May 16, 2025 that may be closed by this pull request

Graceful shutdown a Cache #2287

Closed

7 tasks

h2zh mentioned this pull request May 12, 2025

Graceful shutdown a Cache #2287

Closed

7 tasks

h2zh requested a review from jhiemstrawisc May 19, 2025 16:44

h2zh added 2 commits May 29, 2025 23:01

A new dedicated downtime filter type for shutdown scenario

b7031aa

A new config param to control the length of shutdown period

8323c05

h2zh force-pushed the graceful-shutdown-cache branch 2 times, most recently from cead8a4 to e117439 May 30, 2025 18:43

h2zh added 2 commits May 30, 2025 19:28

Put the shutting down server into downtime on the Director

c986ae2

- And lift the downtime when the server is back online

h2zh force-pushed the graceful-shutdown-cache branch from e117439 to c986ae2 May 30, 2025 19:29

h2zh added this to the v7.17 milestone Jun 2, 2025

h2zh assigned jhiemstrawisc Jun 10, 2025

jhiemstrawisc requested changes Jun 12, 2025

Comment thread director/director.go Outdated

Comment thread docs/parameters.yaml Outdated

Comment thread docs/parameters.yaml Outdated

Comment thread launchers/launcher.go

Comment thread launchers/launcher.go

Comment thread docs/parameters.yaml

Add shutdown downtime period for server restart

f08a78e

- Address PR review
- Change webUI: disable (grey out) the "Restart Server" button for 1 min after the user clicks it, to allow a graceful shutdown and then a restart

h2zh requested a review from jhiemstrawisc June 13, 2025 19:00

jhiemstrawisc approved these changes Jun 13, 2025

Comment thread launchers/launcher.go Outdated

Comment thread docs/parameters.yaml Outdated

Minor wording improvement

f4468b7

h2zh merged commit 2c39d54 into PelicanPlatform:main Jun 13, 2025
26 checks passed

h2zh merged 6 commits into PelicanPlatform:main from h2zh:graceful-shutdown-cache

3 participants
