Multiple Director Bug Fixes #2400
Merged
jhiemstrawisc merged 8 commits on Jul 16, 2025
bbockelm
reviewed
Jun 19, 2025
bbockelm
requested changes
Jun 19, 2025
bbockelm
left a comment
Collaborator
A few small items and a request for a unit test (since it looks like there's one big broken piece that predated this PR).
h2zh
requested changes
Jul 7, 2025
h2zh
left a comment
Contributor
When two Directors are enabled in a local dev environment, the Director Self Test on both Directors is flaky (see the logs and screenshot below).
DEBUG[2025-07-07T22:11:47Z] Starting a director test cycle for Cache server 88386cd92f0a at https://88386cd92f0a:8442
DEBUG[2025-07-07T22:11:47Z] Director file transfer test cycle succeeded at 2025-07-07T22:11:47Z for Cache server with URL at https://88386cd92f0a:8442
DEBUG[2025-07-07T22:11:47Z] Signing token with key id: at4Rza8YjVaviIxNlSFsDBux3pF2SsbTyxPsV9A9Pak
DEBUG[2025-07-07T22:11:47Z] Director is sending Cache server test result to https://88386cd92f0a:8449/api/v1.0/cache/directorTest
DEBUG[2025-07-07T22:11:48Z] Starting a director test cycle for Origin server 88386cd92f0a-Origin at https://88386cd92f0a:8443
DEBUG[2025-07-07T22:11:48Z] Signing token with key id: at4Rza8YjVaviIxNlSFsDBux3pF2SsbTyxPsV9A9Pak
WARNING[2025-07-07T22:11:48Z] Director file transfer test cycle failed for Origin server: https://88386cd92f0a:8443 Test file transfer failed during upload: Error response 423 from test file upload: 423 Unknown
DEBUG[2025-07-07T22:11:48Z] Signing token with key id: at4Rza8YjVaviIxNlSFsDBux3pF2SsbTyxPsV9A9Pak
DEBUG[2025-07-07T22:11:48Z] Director is sending Origin server test result to https://88386cd92f0a:8447/api/v1.0/origin/directorTest
WARNING[2025-07-07T22:11:48Z] Failed to report director test result to Origin server at https://88386cd92f0a:8447: error response 403 from reporting director test: {"status":"error","msg":"Failed to verify the token: Cannot verify token: Cannot verify token with federation issuer: Failed to get federation's public JWKS: cached object is not a Set (was \u003cnil\u003e)\n"}
h2zh
approved these changes
Jul 16, 2025
h2zh
left a comment
Contributor
Neat and clever PR that fixes multiple bugs! It’d be nice to have a concise writeup with a diagram. A possible next step would be adding Prometheus metrics.
h2zh
requested changes
Jul 16, 2025
1feb31e to
a132837
Compare
Contributor
Author
Looks like Mac
h2zh
approved these changes
Jul 16, 2025
Member
Using super powers to merge, since review of this PR was handed off from BrianB to Howard with a dangling request for changes.

This PR addresses issue #2369.
The following key changes were made:
- Implemented `nil` checks in advertisement handling, specifically within `updateInternalDirectorCache` and `IsDirectorAdFromSelf`. This resolves the `SIGSEGV` and `nil pointer dereference` panics that occurred when processing `nil` director ads, a core problem noted in the issue.
- All service and director advertisements are now wrapped in a consistent `forwardAd` struct before being propagated to other directors. This ensures uniform processing across the federation.
- Directors now track the originator of incoming service ads and will not forward them back to the sender. This crucial change prevents advertisement loops, which were causing instability and timeouts.
- Fixed a bug in `updateInternalDirectorCache` where updating a director's information would inadvertently discard existing data. The cache update now correctly modifies the entry in-place, preserving data integrity and ensuring each director maintains an accurate view of its peers.
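
For readers who want a concrete picture of the changes listed above, here is a minimal Go sketch of the three ideas: nil guards, skipping the originator when forwarding, and updating a cache entry in place. It is not the code from this PR. Only the name `forwardAd` comes from the description; its fields, and the `Director`, `peerEntry`, `ServiceAd`, `DirectorAd`, `isAdFromSelf`, `updatePeer`, and `forwardTargets` names, are hypothetical and chosen for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// ServiceAd and DirectorAd are hypothetical stand-ins for the real ad types.
type ServiceAd struct{ Name string }

type DirectorAd struct {
	Name string
	URL  string
}

// forwardAd wraps either kind of ad so peers can process both uniformly.
// The field layout is an assumption, not taken from the PR.
type forwardAd struct {
	Origin     string // director the ad was received from
	ServiceAd  *ServiceAd
	DirectorAd *DirectorAd
}

// peerEntry holds what this director knows about one peer (hypothetical).
type peerEntry struct {
	Ad       DirectorAd
	Services []string // data that must survive a refresh of Ad
}

type Director struct {
	Self  string
	Peers map[string]*peerEntry
}

// isAdFromSelf treats a nil ad as "not from self" instead of dereferencing it.
func (d *Director) isAdFromSelf(ad *DirectorAd) bool {
	return ad != nil && ad.Name == d.Self
}

// updatePeer ignores nil and self ads, and refreshes an existing entry in
// place so that fields other than Ad are preserved.
func (d *Director) updatePeer(ad *DirectorAd) bool {
	if ad == nil || d.isAdFromSelf(ad) {
		return false
	}
	if entry, ok := d.Peers[ad.Name]; ok {
		entry.Ad = *ad // modify in place; Services is left untouched
		return true
	}
	d.Peers[ad.Name] = &peerEntry{Ad: *ad}
	return true
}

// forwardTargets lists the peers that should receive fwd, excluding the
// originator so the ad is never sent back to its sender.
func (d *Director) forwardTargets(fwd *forwardAd) []string {
	if fwd == nil || (fwd.ServiceAd == nil && fwd.DirectorAd == nil) {
		return nil
	}
	targets := []string{}
	for name := range d.Peers {
		if name != fwd.Origin {
			targets = append(targets, name)
		}
	}
	sort.Strings(targets) // deterministic order for the example
	return targets
}

func main() {
	d := &Director{Self: "director-a", Peers: map[string]*peerEntry{}}
	d.updatePeer(nil) // ignored rather than panicking
	d.updatePeer(&DirectorAd{Name: "director-b", URL: "https://b.example"})
	d.Peers["director-b"].Services = []string{"origin-1"}
	d.updatePeer(&DirectorAd{Name: "director-b", URL: "https://b2.example"})
	d.updatePeer(&DirectorAd{Name: "director-c", URL: "https://c.example"})

	fwd := &forwardAd{Origin: "director-b", ServiceAd: &ServiceAd{Name: "origin-1"}}
	fmt.Println(d.forwardTargets(fwd))          // [director-c]
	fmt.Println(d.Peers["director-b"].Services) // [origin-1]
}
```

In this sketch the originator is recorded on the wrapper itself, so the forwarding step needs no extra state to avoid echoing an ad back to its sender. The actual PR may track the originator differently.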