Fix GPU model brokerage: asymmetric include/exclude logic and case-insensitive matching#682
Conversation
|
This should fix the inclusion (string) and the exclusion (if defined in CRIC). @tmaeno please check once again that the logic is the correct one, thank you! |
| ): | ||
| continue | ||
| if model_excl: | ||
| # exclusion: only skip if the site declares a model that matches the pattern; |
There was a problem hiding this comment.
I don't quite understand the logic here. If the model is not contained in the CRIC dict, doesn't it mean that the PQ has no GPU so it should be skipped both for inclusion and exclusion?
There was a problem hiding this comment.
The "gpu" not in architecture_map check at line 1085 above already skips sites with no GPU entirely, so by the time we reach the model check the site is guaranteed to have a GPU.
The "model" not in architecture_map["gpu"] is a different case: the site has a GPU but its model isn't declared in CRIC. Several GPU queues (e.g. CERN-GPU, UKI-SOUTHGRID-RALPP_GPU) have real hardware but no model field in their CRIC architecture entry. This behavior is intentional for these sites: for inclusion we can't confirm the right model is there so we skip (strict); for exclusion we can't confirm the excluded model is there so we allow through (non-strict). Does this logic make sense?
There was a problem hiding this comment.
If a PQ does not publish GPU models, shouldn’t it be skipped for both inclusion and exclusion? E.g. what if a task wants to avoid a specific GPU and you send it to a PQ that actually has that GPU but does not publish this information? Such PQs can be used only when the user has not specified any model-related constraints—neither inclusion nor exclusion.
There was a problem hiding this comment.
You are right, if a PQ doesn't publish model info we can't make any model-based routing decision so we can't confirm it has the right model (inclusion) nor that it doesn't have the excluded model (exclusion). I will update my PR with the original logic but with "re.IGNORECASE" to catch the a100 vs A100 use-case
Add re.IGNORECASE to GPU model regexp matching so that patterns like .*A100.* and .*a100.* are equivalent. CRIC stores model names in lowercase (e.g. a100, v100, p100), so without this fix uppercase patterns always returned 0 candidates. PQs that do not publish model info in CRIC are skipped when any model constraint (inclusion or exclusion) is specified, since the actual GPU model cannot be determined.
9bceba9 to
5c6eddf
Compare
Summary
Add
re.IGNORECASEto GPU model regexp matching so that patterns like.*A100.*and.*a100.*are equivalent. CRIC stores model names in lowercase (e.g.a100,v100,p100), so without this fix uppercase patterns always returned 0 candidates.Queues that do not publish model information in CRIC are skipped whenever a model constraint is specified (inclusion or exclusion), since the actual GPU model cannot be determined.
Test plan
--architecture '{"gpu_spec": {"vendor": "nvidia", "model": ".*a100.*"}}'→ should route to queues declaringa100in CRIC--architecture '{"gpu_spec": {"vendor": "nvidia", "model": ".*A100.*"}}'→ same result (case-insensitive){"model": {"pattern": ".*P100.*", "excl": true}}→ should exclude queues declaringp100in CRIC; queues with no model info are also skipped