Bugfix for downtime handling in Director #2323
- In Director, clear expired downtimes set by Registry even when there are no active downtimes set by Registry
CannonLock left a comment:
LGTM.
As a comment, I wish some of this were generalized into a function. I imagine (without looking) that the code for the topo and server downtime updates looks similar to this.
```go
// Remove existing filteredServers entries that were fetched from the Registry first
for key, val := range filteredServers {
	if val == tempFiltered {
		delete(filteredServers, key)
	}
}
// Build a new map to replace the in-memory federationDowntimes map
newFederationDowntimes := make(map[string][]server_structs.Downtime)
currentTime := time.Now().UTC().UnixMilli()
for _, downtime := range runningServersDowntimes {
	// Save all active and future downtimes to the new map
	newFederationDowntimes[downtime.ServerName] = append(newFederationDowntimes[downtime.ServerName], downtime)
	// Check existing downtime filter
	originalFilterType, hasOriginalFilter := filteredServers[downtime.ServerName]
	// If this server is already put in downtime, we don't need to do anything to the filteredServers map
	if hasOriginalFilter && originalFilterType != tempAllowed {
		continue
	}
	// Otherwise, if it is an active downtime, we need to put it into the filteredServers map
	if currentTime >= downtime.StartTime && (currentTime <= downtime.EndTime || downtime.EndTime == -1) {
		filteredServers[downtime.ServerName] = tempFiltered
	}
}
```
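Following the suggestion to generalize this logic, here is a minimal sketch of what a shared helper might look like. It assumes the topo and server downtime updates follow the same clear-then-rebuild pattern (the reviewer notes this is unchecked). The names `refreshFilters`, `permFiltered`, `topoFiltered`, and the trimmed `Downtime` struct are hypothetical stand-ins, not Pelican's actual API. The behavior mirrors the snippet: entries owned by one source are cleared unconditionally, so an empty downtime list still removes stale entries, which is the case this PR fixes.

```go
package main

import (
	"fmt"
	"time"
)

// Downtime is a hypothetical, trimmed stand-in for server_structs.Downtime.
type Downtime struct {
	ServerName string
	StartTime  int64 // Unix milliseconds
	EndTime    int64 // Unix milliseconds; -1 means no scheduled end
}

// filterType is a hypothetical stand-in for the Director's filter enum.
type filterType int

const (
	permFiltered filterType = iota // hypothetical: set by an admin
	tempFiltered                   // set from Registry downtimes
	topoFiltered                   // hypothetical: set from topology downtimes
	tempAllowed
)

// isActive reports whether d covers the instant now (Unix ms), using the
// same condition as the snippet.
func isActive(d Downtime, now int64) bool {
	return now >= d.StartTime && (now <= d.EndTime || d.EndTime == -1)
}

// refreshFilters drops every entry previously written by source, then adds
// source for each server with an active downtime. Entries set by other
// sources are kept, but a tempAllowed server can still be filtered,
// mirroring the snippet's skip condition.
func refreshFilters(filtered map[string]filterType, source filterType, downtimes []Downtime, now int64) {
	for name, ft := range filtered {
		if ft == source {
			delete(filtered, name) // deleting during range is safe in Go
		}
	}
	for _, d := range downtimes {
		if ft, ok := filtered[d.ServerName]; ok && ft != tempAllowed {
			continue // already filtered by another source
		}
		if isActive(d, now) {
			filtered[d.ServerName] = source
		}
	}
}

func main() {
	now := time.Now().UTC().UnixMilli()
	filtered := map[string]filterType{
		"origin-a": tempFiltered, // left over from an expired Registry downtime
		"origin-b": permFiltered, // hypothetical admin filter
	}

	// No active Registry downtimes: the stale Registry entry is still cleared.
	refreshFilters(filtered, tempFiltered, nil, now)
	fmt.Println(len(filtered)) // 1 (only origin-b remains)

	// The same helper serves a second, hypothetical source.
	topo := []Downtime{{ServerName: "cache-c", StartTime: now - 1000, EndTime: -1}}
	refreshFilters(filtered, topoFiltered, topo, now)
	fmt.Println(filtered["cache-c"] == topoFiltered) // true
}
```

Passing the source as a parameter is what lets one function replace several near-identical loops; each caller only owns the entries tagged with its own filter type.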
This is the React background talking, but I would also prefer that `filteredServers` weren't a separate value but a memoized function that only recomputes when its inputs change. Keeping it as its own state leaves room for it to get out of sync with the data it is derived from.
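A rough Go translation of the memoization idea might look like this: instead of mutating a long-lived map, the filtered set is computed from its inputs and cached only while those inputs are unchanged. Because the current time is also an input, the cache additionally records the window in which the answer cannot change (until the next downtime start or end). Everything here (`filteredView`, `computeFiltered`, the boolean `admin` map standing in for other filter sources) is hypothetical and simplified; it sketches the pattern, not the Director's code.

```go
package main

import (
	"fmt"
	"math"
)

// Downtime is a hypothetical, trimmed stand-in for server_structs.Downtime.
type Downtime struct {
	ServerName string
	StartTime  int64 // Unix ms
	EndTime    int64 // Unix ms; -1 means no scheduled end
}

func isActive(d Downtime, now int64) bool {
	return now >= d.StartTime && (now <= d.EndTime || d.EndTime == -1)
}

// computeFiltered derives the filtered set from its inputs alone and also
// returns the first instant after now at which the answer could change.
func computeFiltered(admin map[string]bool, downtimes []Downtime, now int64) (map[string]bool, int64) {
	out := make(map[string]bool)
	for name, on := range admin {
		if on {
			out[name] = true
		}
	}
	next := int64(math.MaxInt64)
	for _, d := range downtimes {
		if isActive(d, now) {
			out[d.ServerName] = true
		}
		if d.StartTime > now && d.StartTime < next {
			next = d.StartTime // downtime becomes active here
		}
		if d.EndTime != -1 && d.EndTime >= now && d.EndTime+1 < next {
			next = d.EndTime + 1 // downtime stops being active here
		}
	}
	return out, next
}

// filteredView memoizes computeFiltered. The cache is reused only while the
// inputs have the same generation and now lies in [cacheFrom, cacheTo).
type filteredView struct {
	admin     map[string]bool
	downtimes []Downtime
	gen       uint64 // incremented on every input change

	cacheGen           uint64
	cacheFrom, cacheTo int64
	cache              map[string]bool
	recomputes         int // counts cache misses, for illustration
}

func (v *filteredView) SetDowntimes(ds []Downtime) {
	v.downtimes = ds
	v.gen++
}

// Filtered returns the servers to filter at now; treat the map as read-only.
func (v *filteredView) Filtered(now int64) map[string]bool {
	if v.cache != nil && v.cacheGen == v.gen && now >= v.cacheFrom && now < v.cacheTo {
		return v.cache
	}
	v.cache, v.cacheTo = computeFiltered(v.admin, v.downtimes, now)
	v.cacheGen, v.cacheFrom = v.gen, now
	v.recomputes++
	return v.cache
}

func main() {
	v := &filteredView{admin: map[string]bool{}}
	v.SetDowntimes([]Downtime{{ServerName: "origin-a", StartTime: 0, EndTime: 100}})
	fmt.Println(v.Filtered(50)["origin-a"])  // true
	fmt.Println(v.Filtered(60)["origin-a"])  // true, served from the cache
	fmt.Println(v.Filtered(101)["origin-a"]) // false: the downtime has expired
	fmt.Println(v.recomputes)                // 2
}
```

With this shape an expired downtime disappears the first time the set is read after its `EndTime`, with no separate cleanup step to forget. The trade-offs are that callers must not mutate the returned map, and this sketch is not safe for concurrent use without a lock.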
In Director, expired downtimes set by Registry should also be cleared when there are no active downtimes set by Registry.
How to test this PR locally
1. Spin up Registry, Director, and Origin (or Cache).
2. In Registry, create a downtime for the Origin that ends X minutes from now.
3. In Director, verify that this Origin is in downtime.
4. Wait X minutes for the downtime to expire.
5. Check this Origin in Director again to make sure it is no longer in downtime.
If you find this downtime logic too complicated, here is a diagram of the downtime management workflow in Pelican: #2218