[AIROCMLIR-707] Add split_kv memory checks for attention #2343
Open
bogdan-petkovic wants to merge 20 commits into ROCm:develop from
Conversation
Pull request overview
Adds an early-validation guard for attention splitKV configurations to prevent late OOM/timeout failures by estimating extra temporary storage and rejecting oversized cases up front, with a user override via environment variable.
Changes:
- Add overflow-safe splitKV extra-memory estimation in rock.attention verification and emit a clear validation error when exceeding a limit.
- Add a HIP-based device global-memory query helper to derive a dynamic default limit per target architecture (with clamping and fallback).
- Minor: adjust the bf16 attention sweep default RMS_threshold; treat 0-byte ROCm allocations as a no-op.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| mlir/utils/performance/parameterSweeps.py | Changes default bf16 attention sweep RMS threshold. |
| mlir/lib/Dialect/Rock/IR/RockDialect.cpp | Implements splitKV extra-memory estimation + verifier limit/diagnostic. |
| mlir/lib/Dialect/Rock/IR/AmdArchDb.cpp | Adds HIP query to retrieve total GPU global memory for an arch. |
| mlir/include/mlir/Dialect/Rock/IR/AmdArchDb.h | Exposes the new device global-memory query API. |
| external/llvm-project/mlir/lib/ExecutionEngine/RocmRuntimeWrappers.cpp | Makes 0-byte mgpuMemAlloc return nullptr early. |
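The 0-byte allocation change can be illustrated with a minimal sketch. This is not the actual RocmRuntimeWrappers.cpp implementation: the real wrapper forwards nonzero sizes to hipMalloc, and the function name and placeholder body below are assumptions made for illustration only.

```cpp
#include <cstddef>

// Hedged sketch: treat a 0-byte allocation as a no-op that returns nullptr
// before any driver call is made. The nonzero path is a placeholder; the
// real wrapper would call hipMalloc(&ptr, sizeBytes) here.
extern "C" void *mgpuMemAllocSketch(size_t sizeBytes) {
  if (sizeBytes == 0)
    return nullptr; // 0-byte allocation: early no-op, no driver round-trip
  return ::operator new(sizeBytes); // placeholder for the hipMalloc path
}
```

Returning nullptr early sidesteps allocator edge cases for empty buffers and avoids a needless runtime call.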
Made-with: Cursor
Signed-off-by: bogdan-petkovic <bogdan.petkovic@htecgroup.com>
- Replace isa<> + cast<> pairs with dyn_cast<> in verifyCommonAttnGemmParameters
- Replace std::unordered_map<std::string> cache with llvm::StringMap
- Extract safeGlobalMemBytes helper to remove duplicated overflow-guard logic in lookupDeviceGlobalMemorySizeBytes

Made-with: Cursor
- Use dyn_cast<AttentionOp>(op.getOperation()) instead of dyn_cast<AttentionOp>(op) since op is a RockGemmGemmWrapperInterface value, not a raw Operation*
- Move safeGlobalMemBytes inside the #ifndef _WIN32 guard as a lambda to avoid referencing hipDeviceProp_t where HIP headers are not included

Made-with: Cursor
Motivation
Some attention configurations on gfx1201 with large split_kv values were consuming too much memory and failing late (OOM/timeout behavior).
This PR makes those cases fail early with a clear validation error, so bad configurations are rejected before expensive lowering and runtime execution.
Technical Details
Added support to read total GPU global memory from HIP for the target architecture.
In attention verification, added a split_kv extra-memory estimator for output and LSE buffers, using overflow-safe arithmetic.
The verifier now compares required extra bytes against a limit and rejects oversized configurations with a clear error message that prints both required and allowed bytes.
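The overflow-safe estimation step can be sketched as follows. This is not the code in RockDialect.cpp: the helper names, the exact buffer-layout formula, and the use of GCC/Clang checked-arithmetic builtins are all assumptions; only the general shape (checked multiplies for the partial-output and LSE buffers, with overflow reported instead of wrapping) follows the description above.

```cpp
#include <cstdint>
#include <optional>

// Hedged sketch: checked multiply that reports overflow via std::nullopt
// instead of silently wrapping (uses GCC/Clang builtins).
static std::optional<uint64_t> mulChecked(uint64_t a, uint64_t b) {
  uint64_t out;
  if (__builtin_mul_overflow(a, b, &out))
    return std::nullopt;
  return out;
}

// Assumed formula for illustration: extra storage ~ partial-output buffers
// plus per-split LSE values,
//   splitKV * numRows * headDim * elemBytes + splitKV * numRows * lseBytes.
static std::optional<uint64_t>
estimateSplitKVExtraBytes(uint64_t splitKV, uint64_t numRows, uint64_t headDim,
                          uint64_t elemBytes, uint64_t lseBytes) {
  auto rows = mulChecked(splitKV, numRows);
  if (!rows)
    return std::nullopt;
  auto outElems = mulChecked(*rows, headDim);
  if (!outElems)
    return std::nullopt;
  auto outBytes = mulChecked(*outElems, elemBytes);
  auto lseTotal = mulChecked(*rows, lseBytes);
  if (!outBytes || !lseTotal)
    return std::nullopt;
  uint64_t total;
  if (__builtin_add_overflow(*outBytes, *lseTotal, &total))
    return std::nullopt; // sum itself overflowed
  return total;
}
```

A verifier using such a helper can treat an overflowed estimate the same as an over-limit one and emit the diagnostic.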
The limit is chosen in this order:
1. The ROCMLIR_ATTENTION_SPLITKV_MAX_EXTRA_BYTES environment variable override.
2. A device-based limit computed from GPU memory (device memory divided by 8, with safety clamping).
3. A fallback default when device memory cannot be queried.
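The selection order above can be sketched as a small C++ function. The environment-variable name and the divide-by-8 heuristic come from this PR, but the clamp bounds, the fallback constant, and the function name are illustrative assumptions, not the values in AmdArchDb.cpp.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <optional>

// Hedged sketch of the three-step limit selection. deviceMemBytes is the
// result of the HIP global-memory query (empty when the query failed).
static uint64_t
pickSplitKVExtraBytesLimit(std::optional<uint64_t> deviceMemBytes) {
  // 1) Explicit user override wins.
  if (const char *env =
          std::getenv("ROCMLIR_ATTENTION_SPLITKV_MAX_EXTRA_BYTES"))
    return std::strtoull(env, nullptr, 10);
  // 2) Device-derived default: one eighth of global memory, safety-clamped
  //    (the clamp bounds here are assumed, not the PR's actual values).
  if (deviceMemBytes) {
    uint64_t limit = *deviceMemBytes / 8;
    return std::clamp(limit, uint64_t(1) << 28, uint64_t(1) << 36);
  }
  // 3) Fallback when the device query fails (constant is an assumption).
  return uint64_t(1) << 31;
}
```

Keeping the override check first means a user can always unblock a configuration the heuristic rejects.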
Test Plan
1. Build MLIRRockOps and rocmlir-gen.
2. Run a large split_kv attention case and verify early rejection with an explicit memory-limit message.
3. Re-run the same case with the ROCMLIR_ATTENTION_SPLITKV_MAX_EXTRA_BYTES override and verify generation succeeds.
4. Replay the 4 reported failing attention configurations through the same parameter sweep pipeline (gen -> driver -> runner).
Test Result
Build status: PASS.
Replay of the 4 reported configurations:
- 3 configurations are now INVALID at the rocmlir-gen stage with explicit splitKV memory-limit errors.
- 1 configuration passes end-to-end.
Submission Checklist