Skip to content

Multiple Director Bug Fixes#2400

Merged
jhiemstrawisc merged 8 commits into
PelicanPlatform:mainfrom
patrickbrophy:multi-director-fix
Jul 16, 2025
Merged

Multiple Director Bug Fixes#2400
jhiemstrawisc merged 8 commits into
PelicanPlatform:mainfrom
patrickbrophy:multi-director-fix

Conversation

@patrickbrophy

Copy link
Copy Markdown
Contributor

This PR addresses issue #2369.

The following key changes were made:

  1. Implemented nil checks in advertisement handling, specifically within updateInternalDirectorCache and IsDirectorAdFromSelf. This resolves the SIGSEGV and nil pointer dereference panics that occurred when processing nil director ads, a core problem noted in the issue.

  2. All service and director advertisements are now wrapped in a consistent forwardAd struct before being propagated to other directors. This ensures uniform processing across the federation.

  3. Directors now track the originator of incoming service ads and will not forward them back to the sender. This crucial change prevents advertisement loops, which were causing instability and timeouts.

  4. Fixed a bug in updateInternalDirectorCache where updating a director's information would inadvertently discard existing data. The cache update now correctly modifies the entry in-place, preserving data integrity and ensuring each director maintains an accurate view of its peers.

@patrickbrophy patrickbrophy added bug Something isn't working director Issue relating to the director component labels Jun 12, 2025
@patrickbrophy patrickbrophy linked an issue Jun 12, 2025 that may be closed by this pull request
Comment thread director/director_advertise.go Outdated

@bbockelm bbockelm left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small items and a request for a unit test (since it looks like there's one big broken piece that predated this PR).

Comment thread director/director_advertise.go Outdated
Comment thread director/director_advertise.go
@patrickbrophy patrickbrophy requested a review from bbockelm June 23, 2025 16:13
@patrickbrophy patrickbrophy requested a review from h2zh July 7, 2025 14:17

@h2zh h2zh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When two Directors are enabled in local dev environment, the Director Self Test on both Directors are flaky (see the below logs and screenshot).

DEBUG[2025-07-07T22:11:47Z] Starting a director test cycle for Cache server 88386cd92f0a at https://88386cd92f0a:8442 
DEBUG[2025-07-07T22:11:47Z] Director file transfer test cycle succeeded at 2025-07-07T22:11:47Z for Cache server with URL at https://88386cd92f0a:8442 
DEBUG[2025-07-07T22:11:47Z] Signing token with key id: at4Rza8YjVaviIxNlSFsDBux3pF2SsbTyxPsV9A9Pak 
DEBUG[2025-07-07T22:11:47Z] Director is sending Cache server test result to https://88386cd92f0a:8449/api/v1.0/cache/directorTest 
DEBUG[2025-07-07T22:11:48Z] Starting a director test cycle for Origin server 88386cd92f0a-Origin at https://88386cd92f0a:8443 
DEBUG[2025-07-07T22:11:48Z] Signing token with key id: at4Rza8YjVaviIxNlSFsDBux3pF2SsbTyxPsV9A9Pak 
WARNING[2025-07-07T22:11:48Z] Director file transfer test cycle failed for  Origin  server:  https://88386cd92f0a:8443   Test file transfer failed during upload: Error response 423 from test file upload: 423 Unknown 
DEBUG[2025-07-07T22:11:48Z] Signing token with key id: at4Rza8YjVaviIxNlSFsDBux3pF2SsbTyxPsV9A9Pak 
DEBUG[2025-07-07T22:11:48Z] Director is sending Origin server test result to https://88386cd92f0a:8447/api/v1.0/origin/directorTest 
WARNING[2025-07-07T22:11:48Z] Failed to report director test result to Origin server at https://88386cd92f0a:8447: error response 403 from reporting director test: {"status":"error","msg":"Failed to verify the token: Cannot verify token: Cannot verify token with federation issuer:  Failed to get federation's public JWKS: cached object is not a Set (was \u003cnil\u003e)\n"} 

image

@patrickbrophy patrickbrophy requested a review from h2zh July 10, 2025 14:22
@patrickbrophy patrickbrophy added this to the v7.18 milestone Jul 15, 2025

@h2zh h2zh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat and clever PR that fixes multiple bugs! It’d be nice to have a concise writeup with diagram. Also the next step could be adding prometheus metrics.

Comment thread director/director_advertise.go Outdated
Comment thread director/director_advertise.go Outdated
@patrickbrophy patrickbrophy requested a review from h2zh July 16, 2025 15:12
@patrickbrophy

Copy link
Copy Markdown
Contributor Author

Looks like Mac TestStatMemory is continuing to be flaky

@jhiemstrawisc

Copy link
Copy Markdown
Member

Using super powers to merge, since review of this PR was handed off from BrianB to Howard with a dangling request for changes.

@jhiemstrawisc jhiemstrawisc merged commit 10ce726 into PelicanPlatform:main Jul 16, 2025
26 of 28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working director Issue relating to the director component

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Multi-Director Federation Testing Notes

4 participants