Graceful shutdown an Origin or Cache #2318
Hey @jhiemstrawisc, could you review this before/after (whenever you think it's a good time) you work on
@jhiemstrawisc @h2zh Is this grace period configurable? Do we have an idea how long "immediately" actually takes? I'm thinking ahead to origins and caches running on a Kubernetes cluster, and what it might take to configure them to shut down gracefully.
So, imagine the case where an admin is performing a routine upgrade of a cache or origin. We gracefully shut down the service, and very soon after start it back up. If all goes well, this only takes a few minutes. In this scenario, what's the mechanism for removing the downtime? (Do we need the downtime in the first place?)
cead8a4 to e117439
Hey @brianaydemir, thanks for your suggestions!
I also updated the write-up and diagram at the top to reflect the latest workflow in the code.
…gnal
- Create and propagate a special advertisement for shutdown, asking the Director to stop redirecting new transfer requests to this server
- ShuttingDown status should have the highest priority in the health status. Once the `OriginCache_XRootD` component is set to `StatusShuttingDown`, the Director has to pick it up in the `Status` field in the server ad and put the server into downtime
- Wait for Xrootd_ShutdownTimeout for the in-flight transfers to finish
- If the in-flight transfers are not finished by the timeout, still shut down the server
- Bugfix: In daemon.LaunchDaemons, write expiry back into the slice. Otherwise the original for-loop modifies the copy and doesn't affect the original, so the new expiry time is not saved
- And lift the downtime when the server is back online
e117439 to c986ae2
jhiemstrawisc
left a comment
A few comments/questions/suggestions. Otherwise I tested this locally and observed both the graceful shutdown and that restarting the Origin cleans out the old filter. Nice job!
One other note @h2zh -- I love that you often provide system diagrams with your PRs. You set a good example for the rest of us. Do you also store these in our internal documentation in the Pelican drive? If not, can you find a reasonable spot for them and put them there so we have some organized internal references to them?
- Address PR review
- Change webUI: disable (grey out) the "Restart Server" button for 1 min after the user clicks it, to allow a graceful shutdown and then a restart
jhiemstrawisc
left a comment
Getting super close! I'm going to pre-approve so you can merge once the last grammar change is in. (I did not review the React changes.)
Nice work with this one, I hope it's really helpful for our "0.1%" goal!
@jhiemstrawisc Thanks! I really try to make my work neat, structured, and easy for everyone to understand, especially since I recently learned about the idea of "translational computing". You bring up a good point about storing these diagrams in our internal documentation. I've been relying on GitHub PRs for feature write-ups because it offers some advantages:
However, I agree that having a central reference point in our Google Drive would be beneficial. What if we create a "symlink" (a file containing a link) in the "Design Docs" folder? I would name the file after the feature, and its content would simply be the direct link to the GitHub PR. This way, we get the best of both: the PR remains the living document, and our drive serves as an organized index. What do you think of that approach?
Trigger the graceful shutdown once a Pelican Origin or Cache receives a SIGTERM signal (`kill -TERM <pid>`).

Upon receiving `SIGTERM`, the Origin or Cache process waits one minute for in-flight transfers to complete. If any transfers remain after the one-minute grace period, the server shuts down immediately.

Before the wait, the Origin or Cache sends a shutdown advertisement with the new `Status` field (="shutting down") to the Director, instructing it to stop routing any new transfer requests to this server. The Director then applies an indefinite downtime window to this server, so that offline servers still listed in its cached `serverAds` will be excluded from new requests. Whenever the server is back online, it will send a regular server ad to flush out the "shutting down" downtime.

cc @bbockelm Here is a formal up-to-date system design diagram. (Content in this PR is marked in red)