
Graceful shutdown an Origin or Cache #2318

Merged
h2zh merged 6 commits into PelicanPlatform:main from h2zh:graceful-shutdown-cache
Jun 13, 2025

Conversation

@h2zh

@h2zh h2zh commented May 16, 2025

Contributor

Trigger a graceful shutdown when a Pelican Origin or Cache receives a SIGTERM signal (`kill -TERM <pid>`).

Upon receiving SIGTERM, the Origin or Cache process waits one minute for in-flight transfers to complete. If any transfers remain after the one-minute grace period, the server shuts down immediately.
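
The wait-then-stop behavior described above can be sketched roughly as follows. This is only an illustration, not the code from this PR: `waitForTransfers` and `simulate` are hypothetical names, and the in-flight transfers are modeled with a plain `sync.WaitGroup`. A real server would call something like this after `signal.Notify` from `os/signal` delivers SIGTERM.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// waitForTransfers blocks until every transfer tracked by wg has finished or
// the grace period elapses, whichever comes first. It reports whether all
// transfers completed in time. (Hypothetical helper, not Pelican's API.)
func waitForTransfers(wg *sync.WaitGroup, grace time.Duration) bool {
	done := make(chan struct{})
	go func() {
		// This goroutine outlives a timeout, which is acceptable here
		// because the process is about to exit anyway.
		wg.Wait()
		close(done)
	}()

	timer := time.NewTimer(grace)
	defer timer.Stop()

	select {
	case <-done:
		return true // all in-flight transfers finished
	case <-timer.C:
		return false // grace period expired; caller shuts down anyway
	}
}

// simulate starts one fake transfer lasting transferMillis and waits for it
// with a grace period of graceMillis.
func simulate(transferMillis, graceMillis int) bool {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		time.Sleep(time.Duration(transferMillis) * time.Millisecond)
	}()
	return waitForTransfers(&wg, time.Duration(graceMillis)*time.Millisecond)
}

func main() {
	fmt.Println("short transfer finished in time:", simulate(10, 500))
	fmt.Println("long transfer finished in time:", simulate(500, 10))
}
```

In either outcome the caller proceeds to shut down; the return value only tells it whether transfers were cut off.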

Before the wait, the Origin or Cache sends a shutdown advertisement with the new Status field (="shutting down") to the Director, instructing it to stop routing any new transfer requests to this server. The Director then applies an indefinite downtime window to this server, so that offline servers still listed in its cached serverAds will be excluded from new requests. Whenever the server is back online, it will send a regular server ad to flush out the "shutting down" downtime.
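
A minimal sketch of the Director-side bookkeeping described above, under stated assumptions: the `serverAd` and `director` types, the `handleAd` and `canRoute` methods, and the `statusShuttingDown` constant are all hypothetical stand-ins, not the Director's actual data structures. It shows only the idea that a shutdown ad adds an open-ended exclusion and any regular ad removes it.

```go
package main

import (
	"fmt"
	"sync"
)

// statusShuttingDown mirrors the "shutting down" Status value mentioned in
// the PR description; the constant name itself is an assumption.
const statusShuttingDown = "shutting down"

// serverAd is a simplified stand-in for a server advertisement.
type serverAd struct {
	ServerName string
	Status     string
}

// director tracks which servers are excluded from new transfer requests.
type director struct {
	mu       sync.Mutex
	downtime map[string]bool // server name -> indefinite shutdown downtime
}

func newDirector() *director {
	return &director{downtime: make(map[string]bool)}
}

// handleAd applies an indefinite downtime when a shutdown ad arrives and
// clears it when a regular ad arrives from the same server.
func (d *director) handleAd(ad serverAd) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if ad.Status == statusShuttingDown {
		d.downtime[ad.ServerName] = true
		return
	}
	delete(d.downtime, ad.ServerName)
}

// canRoute reports whether new requests may be sent to the named server.
func (d *director) canRoute(name string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	return !d.downtime[name]
}

func main() {
	d := newDirector()
	d.handleAd(serverAd{ServerName: "cache-1", Status: statusShuttingDown})
	fmt.Println("routable after shutdown ad:", d.canRoute("cache-1"))
	d.handleAd(serverAd{ServerName: "cache-1"}) // regular ad after restart
	fmt.Println("routable after regular ad:", d.canRoute("cache-1"))
}
```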

[Figure: graceful-shutdown system design diagram (drawio)]

cc @bbockelm Here is a formal up-to-date system design diagram. (Content in this PR is marked in red)

@h2zh h2zh added the critical (High priority for next release), cache (Issue relating to the cache component), origin (Issue relating to the origin component), and director (Issue relating to the director component) labels May 16, 2025
@h2zh h2zh linked an issue May 16, 2025 that may be closed by this pull request
7 tasks
@h2zh h2zh mentioned this pull request May 12, 2025
7 tasks
@h2zh

h2zh commented May 19, 2025

Contributor Author

Hey @jhiemstrawisc, could you review this before or after you work on the status field (whenever you think is a good time)? You mentioned you plan to integrate `shuttingdown` and `degraded` into the existing healthStatus in the serverAds, so I might need to change this PR a lot to incorporate that.

@h2zh h2zh requested a review from jhiemstrawisc May 19, 2025 16:44
@brianaydemir

brianaydemir commented May 29, 2025

Contributor

> If any transfers remain after the one-minute grace period, the server shuts down immediately.

@jhiemstrawisc @h2zh Is this grace period configurable? Do we have an idea how long "immediately" actually takes?

I'm thinking ahead to origins and caches running on a Kubernetes cluster, and what it might take to configure them to shut down gracefully.

@brianaydemir

brianaydemir commented May 29, 2025

Contributor

> The Director then applies a short-term downtime window (=Director_AdvertisementTTL+Shutdown_Grace_period=15+1=16m by default) to this server, so that offline servers still listed in its cached serverAds are excluded from new requests.

So, imagine the case where an admin is performing a routine upgrade of a cache or origin. We gracefully shut down the service, and very soon after start it back up. If all goes well, this only takes a few minutes.

In this scenario, what's the mechanism for removing the downtime? (Do we need the downtime in the first place?)

@h2zh h2zh force-pushed the graceful-shutdown-cache branch 2 times, most recently from cead8a4 to e117439 Compare May 30, 2025 18:43
@h2zh

h2zh commented May 30, 2025

Contributor Author

Hey @brianaydemir, thanks for your suggestions!

  1. I updated the code to make the shutdown timeout configurable. "Immediately" here has its literal meaning (definitely less than 1s).
  2. I modified the logic in the code. The Director applies an indefinite downtime window to the server when it receives the shutdown advertisement. The Origin/Cache server loses the StatusShuttingDown value in its in-memory healthStatus once its process is killed. Whenever the server is back online, it will send a regular server ad to flush out the old "shutting down" downtime.
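
To illustrate why the shutting-down state disappears on restart, here is a rough sketch of an in-memory health status in which the shutting-down state outranks every other component state. The `status` type, its constants, and `overall` are hypothetical and do not reproduce Pelican's actual health-status package.

```go
package main

import "fmt"

// status is a hypothetical component health level; higher values win when
// computing the overall status.
type status int

const (
	statusOK status = iota
	statusWarning
	statusCritical
	statusShuttingDown // assumed to take precedence over everything else
)

func (s status) String() string {
	switch s {
	case statusOK:
		return "ok"
	case statusWarning:
		return "warning"
	case statusCritical:
		return "critical"
	case statusShuttingDown:
		return "shutting down"
	default:
		return "unknown"
	}
}

// overall returns the highest-priority status among all components. An empty
// map, as in a freshly started process, yields statusOK.
func overall(components map[string]status) status {
	worst := statusOK
	for _, s := range components {
		if s > worst {
			worst = s
		}
	}
	return worst
}

func main() {
	// State held in memory by a process that has received SIGTERM.
	running := map[string]status{
		"web-ui": statusCritical,
		"xrootd": statusShuttingDown,
	}
	fmt.Println("before restart:", overall(running))

	// After a restart the map starts empty, so the next ad is a regular one.
	restarted := map[string]status{}
	fmt.Println("after restart:", overall(restarted))
}
```

Because the state lives only in process memory, nothing needs to be cleaned up on the server side; the first regular ad after restart is what clears the Director's downtime.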

I also updated the write-up and diagram on the top to reflect the latest workflow in the code.

h2zh added 2 commits May 30, 2025 19:28
…gnal

- Create and propagate a special advertisement for shutdown, asking the Director to stop redirecting new transfer requests to this server
- The ShuttingDown status should have the highest priority in the health status. Once the `OriginCache_XRootD` component is set to `StatusShuttingDown`, the Director has to pick it up from the `Status` field in the server ad and put the server into downtime
- Wait for Xrootd_ShutdownTimeout for the in-flight transfers to finish
- If the in-flight transfers are not finished by the timeout, shut down the server anyway
- Bugfix: In daemon.LaunchDaemons, write the expiry back into the slice. Otherwise the original for-loop modifies a copy of each element, and the new expiry time is never saved in the original slice
- Lift the downtime when the server is back online
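
The slice bugfix mentioned in the commit notes is a common Go pitfall: the value variable in a `for ... range` loop is a copy of the element. The sketch below reproduces the pattern with hypothetical names (`daemon`, `extendByCopy`, `extendInPlace`, `demo`); it is not the code from `daemon.LaunchDaemons`.

```go
package main

import (
	"fmt"
	"time"
)

// daemon is a hypothetical record with an expiry time.
type daemon struct {
	name   string
	expiry time.Time
}

// extendByCopy shows the buggy pattern: dm is a copy of each element, so the
// new expiry is discarded at the end of every iteration.
func extendByCopy(ds []daemon, by time.Duration) {
	for _, dm := range ds {
		dm.expiry = dm.expiry.Add(by) // modifies the copy only
	}
}

// extendInPlace writes the new expiry back through the slice index, so the
// caller sees the change.
func extendInPlace(ds []daemon, by time.Duration) {
	for i := range ds {
		ds[i].expiry = ds[i].expiry.Add(by)
	}
}

// demo reports whether each variant actually changed the stored expiry.
func demo() (copyChanged, inPlaceChanged bool) {
	start := time.Date(2025, time.June, 1, 0, 0, 0, 0, time.UTC)
	a := []daemon{{name: "xrootd", expiry: start}}
	b := []daemon{{name: "xrootd", expiry: start}}

	extendByCopy(a, time.Minute)
	extendInPlace(b, time.Minute)

	return !a[0].expiry.Equal(start), !b[0].expiry.Equal(start)
}

func main() {
	copyChanged, inPlaceChanged := demo()
	fmt.Println("range-copy loop changed the slice:", copyChanged)
	fmt.Println("index-based loop changed the slice:", inPlaceChanged)
}
```

An alternative with the same effect is to store pointers in the slice, at the cost of an extra allocation per element.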
@h2zh h2zh force-pushed the graceful-shutdown-cache branch from e117439 to c986ae2 Compare May 30, 2025 19:29
@h2zh h2zh added this to the v7.17 milestone Jun 2, 2025

@jhiemstrawisc jhiemstrawisc left a comment

Member


A few comments/questions/suggestions. Otherwise I tested this locally and observed both the graceful shutdown and that restarting the Origin cleans out the old filter. Nice job!

Comment thread director/director.go Outdated
Comment thread docs/parameters.yaml Outdated
Comment thread docs/parameters.yaml Outdated
Comment thread launchers/launcher.go
Comment thread launchers/launcher.go
Comment thread docs/parameters.yaml
@jhiemstrawisc

Member

One other note @h2zh -- I love that you often provide system diagrams with your PRs. You set a good example for the rest of us. Do you also store these in our internal documentation in the Pelican drive? If not, can you find a reasonable spot for them and put them there so we have some organized internal references to them?

- Address PR review
- Change web UI: disable (grey out) the "Restart Server" button for 1 min after the user clicks it, to allow a graceful shutdown before the restart
@h2zh h2zh requested a review from jhiemstrawisc June 13, 2025 19:00

@jhiemstrawisc jhiemstrawisc left a comment

Member


Getting super close! I'm going to pre-approve so you can merge once the last grammar change is in. (I did not review the React changes.)

Nice work with this one, I hope it's really helpful for our "0.1%" goal!

Comment thread launchers/launcher.go Outdated
Comment thread docs/parameters.yaml Outdated
@h2zh h2zh merged commit 2c39d54 into PelicanPlatform:main Jun 13, 2025
26 checks passed
@h2zh

h2zh commented Jun 17, 2025

Contributor Author

> One other note @h2zh -- I love that you often provide system diagrams with your PRs. You set a good example for the rest of us. Do you also store these in our internal documentation in the Pelican drive? If not, can you find a reasonable spot for them and put them there so we have some organized internal references to them?

@jhiemstrawisc Thanks! I really try to make my work neat, structured, and easy for everyone to understand, especially since I recently learned about the idea of "translational computing".

You bring up a good point about storing these diagrams in our internal documentation. I've been relying on GitHub PRs for feature write-ups because it offers some advantages:

  • Single Source of Truth: I only have to maintain one version, keeping the PR write-up up-to-date and linking all relevant tickets and other PRs directly within it.
  • Contextual Timeline: It's easier for anyone viewing the PR to grasp the entire timeline and how the feature evolved alongside the code changes.

However, I agree that having a central reference point in our Google Drive would be beneficial. What if we create a "symlink" (a file containing a link) in the "Design Docs" folder? I would name the file after the feature, and its content would simply be the direct link to the GitHub PR. This way, we get the best of both: the PR remains the living document, and our drive serves as an organized index.

What do you think of that approach?



Development

Successfully merging this pull request may close these issues.

Graceful shutdown a Cache

3 participants