Skip to content

Fix GPU model brokerage: asymmetric include/exclude logic and case-insensitive matching#682

Merged
EdwardKaravakis merged 1 commit intomasterfrom
add_gpu_model_exclusion
Mar 18, 2026
Merged

Fix GPU model brokerage: asymmetric include/exclude logic and case-insensitive matching#682
EdwardKaravakis merged 1 commit intomasterfrom
add_gpu_model_exclusion

Conversation

@EdwardKaravakis
Copy link
Copy Markdown
Member

@EdwardKaravakis EdwardKaravakis commented Mar 17, 2026

Summary

Add re.IGNORECASE to GPU model regexp matching so that patterns like .*A100.* and .*a100.* are equivalent. CRIC stores model names in lowercase (e.g. a100, v100, p100), so without this fix uppercase patterns always returned 0 candidates.

Queues that do not publish model information in CRIC are skipped whenever a model constraint is specified (inclusion or exclusion), since the actual GPU model cannot be determined.

Test plan

  • Submit with --architecture '{"gpu_spec": {"vendor": "nvidia", "model": ".*a100.*"}}' → should route to queues declaring a100 in CRIC
  • Submit with --architecture '{"gpu_spec": {"vendor": "nvidia", "model": ".*A100.*"}}' → same result (case-insensitive)
  • Submit with exclusion {"model": {"pattern": ".*P100.*", "excl": true}} → should exclude queues declaring p100 in CRIC; queues with no model info are also skipped

@EdwardKaravakis
Copy link
Copy Markdown
Member Author

This should fix the inclusion (string) and the exclusion (if defined in CRIC). @tmaeno please check once again that the logic is the correct one, thank you!

):
continue
if model_excl:
# exclusion: only skip if the site declares a model that matches the pattern;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand the logic here. If the model is not contained in the CRIC dict, doesn't it mean that the PQ has no GPU so it should be skipped both for inclusion and exclusion?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "gpu" not in architecture_map check at line 1085 above already skips sites with no GPU entirely, so by the time we reach the model check the site is guaranteed to have a GPU.

The "model" not in architecture_map["gpu"] is a different case: the site has a GPU but its model isn't declared in CRIC. Several GPU queues (e.g. CERN-GPU, UKI-SOUTHGRID-RALPP_GPU) have real hardware but no model field in their CRIC architecture entry. This behavior is intentional for these sites: for inclusion we can't confirm the right model is there so we skip (strict); for exclusion we can't confirm the excluded model is there so we allow through (non-strict). Does this logic make sense?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a PQ does not publish GPU models, shouldn’t it be skipped for both inclusion and exclusion? E.g. what if a task wants to avoid a specific GPU and you send it to a PQ that actually has that GPU but does not publish this information? Such PQs can be used only when the user has not specified any model-related constraints—neither inclusion nor exclusion.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, if a PQ doesn't publish model info we can't make any model-based routing decision so we can't confirm it has the right model (inclusion) nor that it doesn't have the excluded model (exclusion). I will update my PR with the original logic but with "re.IGNORECASE" to catch the a100 vs A100 use-case

Add re.IGNORECASE to GPU model regexp matching so that patterns like
.*A100.* and .*a100.* are equivalent. CRIC stores model names in
lowercase (e.g. a100, v100, p100), so without this fix uppercase
patterns always returned 0 candidates.

PQs that do not publish model info in CRIC are skipped when any
model constraint (inclusion or exclusion) is specified, since the
actual GPU model cannot be determined.
@EdwardKaravakis EdwardKaravakis force-pushed the add_gpu_model_exclusion branch from 9bceba9 to 5c6eddf Compare March 18, 2026 10:03
@EdwardKaravakis EdwardKaravakis merged commit 88ad18d into master Mar 18, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants