Refactor blockwise reduce lowering with DPP SubgroupReduceOp #2301

Open
stefankoncarevic wants to merge 8 commits into ROCm:develop from stefankoncarevic:dpp-refactor-blockwise-reduce

stefankoncarevic commented Mar 17, 2026

Resolves: https://amd-hub.atlassian.net/browse/AIROCMLIR-149

Motivation

The BlockwiseBroadcastReduceOp lowering in BlockwiseGemmToThreadwise.cpp handles the reduction of partial results across threads within a workgroup. In the blockSize > nonReductionDimSizeProduct path, all inter-thread reductions currently use an LDS-based tree reduction loop requiring log2(N) barrier-synchronized LDS round-trips. This works correctly but leaves performance on the table for cases where hardware-accelerated subgroup (wave-level) reduction is available.

This PR adds a DPP-based reduction path using gpu::SubgroupReduceOp with cluster_size for eligible configurations, while keeping the existing LDS tree reduction as the fallback for all other cases. The new path works correctly on both CDNA (waveSize=64) and RDNA (waveSize=32) architectures.
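For intuition about the cost being avoided, the LDS tree reduction can be modeled on the host roughly as below. This is only a sketch under stated assumptions: `treeReduceSum` is a hypothetical name, the vector stands in for the LDS buffer with one slot per reduction thread, and the model handles power-of-two sizes only (the real lowering is not restricted this way). It returns the number of barrier-separated steps, which is log2(N) for N slots.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of an LDS tree reduction (sum). `lds` stands in for the
// LDS buffer, one slot per reduction thread. Hypothetical helper, not code
// from BlockwiseGemmToThreadwise.cpp.
// Returns the number of steps; on the GPU each step ends with a barrier.
inline int treeReduceSum(std::vector<float> &lds) {
  std::size_t n = lds.size();
  // Assumption: this sketch only handles power-of-two sizes.
  assert(n > 0 && (n & (n - 1)) == 0);
  int steps = 0;
  for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
    // On the GPU, each "thread" t < stride does this in parallel.
    for (std::size_t t = 0; t < stride; ++t)
      lds[t] += lds[t + stride];
    ++steps; // one barrier-synchronized LDS round-trip per halving
  }
  return steps; // the reduced value ends up in lds[0]
}
```

With 16 slots the loop runs 4 times, which matches the log2(N) round-trip count mentioned above.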

Technical Details

Two reduction paths (blockSize > nonReductionDimSizeProduct)
The lowering now selects one of two paths based on DPP eligibility:

DPP path (canUseDPP = true) — new:

  • Taken when all 5 conditions hold: the number of reduction threads is a power of 2, there is more than 1 thread, partial_r > 2, the threads fit within a single wave (<= waveSize), and the block has enough threads or the non-reduction dim is trivial

  • Threadwise pre-reduction in registers → gpu::SubgroupReduceOp with cluster_size → leader thread (rtid == 0) writes result to LDS → broadcast

  • Thread layout: Contiguous — rtid = tid & (cluster-1), nrtid = tid >> log2(cluster)
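The eligibility check listed above could be sketched as a single predicate. Everything named here (`ReduceConfig`, its fields, `canUseDPP` as a free function) is hypothetical and only mirrors the wording of the description. In particular, reading "block has enough threads" as `blockSize >= reductionThreads * nonReductionSize` is an assumption, not something the PR text states.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical parameter bundle; field names are illustrative and do not
// mirror the actual variables in BlockwiseGemmToThreadwise.cpp.
struct ReduceConfig {
  int64_t reductionThreads; // threads cooperating on one reduction
  int64_t partialR;         // "partial_r" from the description
  int64_t waveSize;         // 64 on CDNA, 32 on RDNA
  int64_t blockSize;        // threads in the workgroup
  int64_t nonReductionSize; // nonReductionDimSizeProduct
};

inline bool isPowerOfTwo(int64_t v) { return v > 0 && (v & (v - 1)) == 0; }

// Sketch of the five DPP eligibility conditions.
inline bool canUseDPP(const ReduceConfig &c) {
  assert(c.waveSize > 0 && c.blockSize > 0 && c.nonReductionSize > 0);
  bool pow2Threads = isPowerOfTwo(c.reductionThreads);
  bool moreThanOne = c.reductionThreads > 1;
  bool enoughPartial = c.partialR > 2;
  bool fitsInWave = c.reductionThreads <= c.waveSize;
  // Assumption: "enough threads" means every (reduction, non-reduction)
  // pair gets its own thread; a trivial non-reduction dim always passes.
  bool enoughThreads =
      c.nonReductionSize == 1 ||
      c.blockSize >= c.reductionThreads * c.nonReductionSize;
  return pow2Threads && moreThanOne && enoughPartial && fitsInWave &&
         enoughThreads;
}
```

Under these assumptions, 16 reduction threads on a wave of 64 would qualify, while 64 reduction threads on a wave of 32 would fall back to the tree path.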

Tree path (existing, unchanged) — fallback:

  • Taken when any DPP condition is not met (non-power-of-2 threads, partial_r <= 2, threads exceed waveSize, etc.)

  • log2(N) LDS tree reduction loop with barrier per step → broadcast

  • Thread layout: Scattered — rtid = tid / nonReductionDimSizeProduct, nrtid = tid % nonReductionDimSizeProduct
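The two thread layouts can be compared with a small host-side sketch. The formulas are taken directly from the bullets above; the struct and function names are hypothetical, and the power-of-two assertion reflects the DPP eligibility condition rather than anything in the actual lowering code.

```cpp
#include <cassert>
#include <cstdint>

// rtid: index within the reduction; nrtid: index of the non-reduction slot.
struct ThreadIds {
  int64_t rtid;
  int64_t nrtid;
};

// Contiguous layout (DPP path): rtid = tid & (cluster-1),
// nrtid = tid >> log2(cluster). Requires a power-of-two cluster.
inline ThreadIds contiguousLayout(int64_t tid, int64_t cluster) {
  assert(tid >= 0 && cluster > 0 && (cluster & (cluster - 1)) == 0);
  int64_t log2Cluster = 0;
  while ((int64_t{1} << log2Cluster) < cluster)
    ++log2Cluster;
  return {tid & (cluster - 1), tid >> log2Cluster};
}

// Scattered layout (tree path): rtid = tid / nonReductionDimSizeProduct,
// nrtid = tid % nonReductionDimSizeProduct.
inline ThreadIds scatteredLayout(int64_t tid,
                                 int64_t nonReductionDimSizeProduct) {
  assert(tid >= 0 && nonReductionDimSizeProduct > 0);
  return {tid / nonReductionDimSizeProduct, tid % nonReductionDimSizeProduct};
}
```

With a cluster of 8, thread ids 8 through 15 all map to nrtid 1 and sit in adjacent lanes, which is presumably what lets a clustered subgroup reduction operate on them directly. In the scattered layout, threads sharing an nrtid are nonReductionDimSizeProduct apart.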

Test Plan

  • All existing integration tests pass (lit test suite for reduce/blockwise_reduce/)
  • New blockwise_reduce_dpp_cluster_sizes.mlir test covers cluster sizes 4, 8, 16, 32, 64 with both sum and max reductions
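As a mental model for what such a test checks, a clustered reduction can be mimicked on the host: lanes are split into contiguous groups of cluster_size, and every lane in a group receives that group's reduced value. The sketch below is a hypothetical reference model (`clusteredReduce` is not part of the PR or of MLIR), and it assumes a cluster stride of 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference model of a clustered reduction over `lanes`.
// Each contiguous group of `clusterSize` lanes is reduced with `op`, and the
// result is written back to every lane of that group.
// Hypothetical helper for illustration only.
template <typename T, typename Op>
std::vector<T> clusteredReduce(const std::vector<T> &lanes,
                               std::size_t clusterSize, Op op) {
  assert(clusterSize > 0 && lanes.size() % clusterSize == 0);
  std::vector<T> out(lanes.size());
  for (std::size_t base = 0; base < lanes.size(); base += clusterSize) {
    T acc = lanes[base];
    for (std::size_t i = 1; i < clusterSize; ++i)
      acc = op(acc, lanes[base + i]);
    for (std::size_t i = 0; i < clusterSize; ++i)
      out[base + i] = acc;
  }
  return out;
}
```

Passing an addition lambda gives the sum variant and a maximum lambda gives the max variant, so one model covers both reduction kinds named in the test plan.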

Test Result

Submission Checklist

Copilot AI left a comment

Pull request overview

This PR refactors the Rock BlockwiseBroadcastReduceOp lowering to use gpu.subgroup_reduce (with clustered reductions where applicable) and adds a Rock backend pass to lower gpu.subgroup_reduce to AMD DPP instructions, improving inter-thread reduction performance on supported architectures.

Changes:

  • Update blockwise broadcast-reduce lowering to select between shuffle+DPP, serial XOR shuffle, and LDS tree fallback paths, with shared helper functions.
  • Introduce rock-subgroup-reduce-to-dpp pass and wire it into the backend pipeline before convert-gpu-to-rocdl.
  • Extend/adjust tests and pipelines to cover the new DPP clustered reduction behavior and pass ordering.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| mlir/lib/Dialect/Rock/Transforms/BlockwiseGemmToThreadwise.cpp | Refactors reduction lowering, adds shuffle/DPP paths, helpers, and emits gpu.shuffle + gpu.subgroup_reduce. |
| mlir/lib/Dialect/Rock/Transforms/SubgroupReduceToDPP.cpp | New Rock pass to lower gpu.subgroup_reduce into AMD DPP sequences via GPU transform patterns. |
| mlir/lib/Dialect/Rock/Transforms/CMakeLists.txt | Adds the new pass source and links GPU transforms library. |
| mlir/lib/Dialect/Rock/Pipelines/Pipelines.cpp | Inserts rock-subgroup-reduce-to-dpp into the backend pipeline after lowering affine. |
| mlir/include/mlir/Dialect/Rock/Passes.td | Declares the new pass and its chip option. |
| mlir/include/mlir/Dialect/Rock/Passes.h | Adds the generated pass decl macro for the new pass. |
| mlir/test/rocmlir-driver/pipelines.mlir | Updates expected printed pipelines to include rock-subgroup-reduce-to-dpp{chip=...}. |
| mlir/test/Dialect/Rock/lowering_blockwise_broadcast_reduce.mlir | Updates lowering checks and parameterizes arch via token substitution. |
| mlir/test/Dialect/Rock/integration/reduce/blockwise_reduce/blockwise_reduce_dpp_cluster_sizes.mlir | New integration test covering multiple cluster_size cases and both sum/max reductions. |

Comment thread mlir/lib/Dialect/Rock/Transforms/SubgroupReduceToDPP.cpp Outdated
Comment thread mlir/lib/Dialect/Rock/Transforms/BlockwiseGemmToThreadwise.cpp
Copilot AI left a comment

Pull request overview

This PR introduces a DPP-accelerated reduction path for rock.blockwise_broadcast_reduce by lowering eligible inter-thread reductions to gpu.subgroup_reduce with cluster_size, and adds a Rock backend pass to lower gpu.subgroup_reduce to AMD DPP instructions in the backend pipeline.

Changes:

  • Add a DPP-capable lowering path in BlockwiseGemmToThreadwise.cpp that emits gpu.subgroup_reduce for eligible configurations, keeping the LDS tree reduction as fallback.
  • Add rock-subgroup-reduce-to-dpp backend pass and wire it into the Rock backend pipeline before convert-gpu-to-rocdl.
  • Add/update lit + integration tests to cover clustered subgroup-reduce scenarios and ensure the new pass is present in dumped pipelines.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| mlir/lib/Dialect/Rock/Transforms/BlockwiseGemmToThreadwise.cpp | Emits gpu.subgroup_reduce for eligible blockwise reductions; refactors final LDS-readback into a helper. |
| mlir/lib/Dialect/Rock/Transforms/SubgroupReduceToDPP.cpp | New Rock pass that lowers gpu.subgroup_reduce (clustered and non-clustered) to AMD DPP patterns. |
| mlir/lib/Dialect/Rock/Transforms/CMakeLists.txt | Registers new transform and links GPU transform library support. |
| mlir/lib/Dialect/Rock/Pipelines/Pipelines.cpp | Inserts rock-subgroup-reduce-to-dpp into the backend gpu.module pipeline. |
| mlir/include/mlir/Dialect/Rock/Passes.td | Declares the new rock-subgroup-reduce-to-dpp pass and its chip option. |
| mlir/include/mlir/Dialect/Rock/Passes.h | Adds the generated pass decl macro for the new pass. |
| mlir/test/rocmlir-driver/pipelines.mlir | Updates pipeline-dump checks to include the new pass for gfx90a/gfx942/gfx950. |
| mlir/test/Dialect/Rock/lowering_blockwise_broadcast_reduce.mlir | Updates lowering checks to reflect the new DPP path IR patterns. |
| mlir/test/Dialect/Rock/integration/reduce/blockwise_reduce/blockwise_reduce_dpp_cluster_sizes.mlir | New integration test covering multiple cluster_size values and sum/max reductions. |

Comment thread mlir/lib/Dialect/Rock/Transforms/BlockwiseGemmToThreadwise.cpp Outdated
Comment thread mlir/lib/Dialect/Rock/Transforms/SubgroupReduceToDPP.cpp Outdated
mirza-halilcevic added a commit that referenced this pull request Apr 25, 2026
…2c2f

eaf55a972c2f merge main into amd-staging (#2315)
aa65d53f62e3 merge main into amd-staging
1348766d1d68 [SLP]Initial support for non-power-of-2 vectorization
0ad0e899d456 [libc++] Remove full header path from assertion messages (#190060)
2f28e1db535b [libc++] Implement P1899 `ranges::stride_view` (#65200)
3e10b2fe2169 [clang] Fix incorrect register information for AVR (#193940)
61f311d93ed0 [AVR] Fix a bug in printing assembly operand with extra code (#193964)
5c73c7a3a057 [lldb] Propose MultiBreakpoint extension to GDB Remote (#192910)
4ed36386a276 [asan] API for getting multiple pointer ranges (#181446)
6443e9b8a5bc [clang] Tests for CWG1670 and CWG1878: `auto` in conversion functions (#187850)
0ec82abfd2c1 [clang] Enable part of CWG2598 test in C++20 mode (#189310)
55af9c26151b [libc][math] Refactor dsub family to header-only (#182160)
3bbfa3e5f07a [LoongArch] Combine rounded vector shifts to VSRLR/VSRAR (#192921)
80fc7f447952 merge main into amd-staging (#2312)
a693efcc40b1 [RISCV][GlobalISel] Support RISC-V specific inline asm constraints: 'I', 'J', 'K' and 'S' (#193765)
3dc4fd6dd411 [compiler-rt][sanitizer] Remove linux/scc.h (#194116)
320a5154ecd3 [LoongArch] Add tests for vector shift-right-and-round combines (#192920)
aeea3191d416 [MC] Change MCContext::getTargetOptions to return a reference. NFC (#194112)
5e09af5f30a7 [clang][bytecode] Reject inc/dec on non-numbers (#193954)
7bbfee35775e [clang][bytecode][NFC] Add record/union names in Descriptor::dump() (#194002)
90ec315dcb31 merge main into amd-staging
84b0809a84f4 [GIsel] Add constant-folding for bit-counting ops (#194010)
2428fbb613be [NFC][ThinLTO] Remove JumpTableToSwitchPass from the test (#194103)
7059fc556bfe Revert "[Clang][CodeGen] Report when an alias points to an incompatible target" (#194106)
c3df8f8c8337 [SPIRV] Add 64 bit lowering for bitreverse (#193068)
69797daef12f [revPat] revert [flang][OpenMP] Move branching verification to semantic checks (#193324)
34f75f040637 Revert "[flang][OpenMP] Move branching verification to semantic checks (#193324)"
6dd373f8aaab [sanitizer] Relax pthread_join tests for different glibc versions (#194100)
bd1c30811723 [Clang][CodeGen] Report when an alias points to an incompatible target (#192397)
e66613f38124 [CSKY] Fix build after #191460 (#194102)
f478220dd062 [LoongArch] Add support for vector FP_EXTEND from vxf32 to vxf64 (#164746)
4a7ae4b900fd Revert "Reland: [LowerTypeTests] Add debug info to jump table entries" (#194095)
87317d39f44a [compiler-rt][WebAssembly] Use an int as CMP_RESULT (#194093)
0c472c140158 [lldb] Handle partial memory region coverage in IRMemoryMap::FindSpace (#194001)
a1b9a3d6ce63 merge main into amd-staging
a041024da2f9 Revert "[compiler-rt] Improve ubsan-minimal runtime for GPU use (#193597)"
44638abac851 merge main into amd-staging
e38b8da23b0b [RISCV][P-ext] Remove dead code from LowerOperation handling of ISD::STORE. NFC (#194088)
ec9d7d18bdfe Revert "[llvm-profgen] Add support for ETM trace decoding" (#194087)
ecdcd40e233b [DirectX] Emit `dx.precise` metadata when fast math is not present (#192526)
b4c1e1a14e47 [RISCV] Expand fcanonicalize on vector types (#193842)
5064b936bec6 [clang][deps] Always initialize module cache out params (#194082)
e3bd61890e68 [llvm-profgen] Add support for ETM trace decoding (#191584)
4e4b91c6b690 merge main into amd-staging (#2308)
cf183f4509a3 Manual update of LLVM_MAIN_REVISION to 577912
327f027f108e [offload] Fix compilation (#194081)
ca934b892fdd [dsymutil] Report error when section offsets exceed DWARF32 limit (#193867)
5536a4c7122e [LFI][AArch64] Add rewrites for control flow (#192602)
d1c9b4a53975 [MLIR][XeVM] Update API usage. Some OpenCL APIs are not supported. (#193320)
d0c91de53e3b [clang][NFC] Linux/Windows Multilib Include Path Tests (#193869)
d14866f029e9 gn build: Port a4538a3ad902
2b43da5ac0fa [NewPM] Port for AArch64StackTaggingPreRA (#194021)
b49855fc5684 [AMDGPU][MC] Allow the nolds modifier (#185129)
7ea78deff2d1 Revert "workflows/issue-release-workflow: Use GitHub app for generating tokens" (#194058)
ebbaa93e005e [llvm] Implement the BPF ABI (#194031)
da2c4a9efe99 [clang] Add constant evaluation support for CK_ToUnion. (#193370)
5b570d1b3b1d [NFC][MLIR] Use `getIntrinsicSignature` to verify overloaded intrinsics (#194035)
fa2588e3110f [NFC][NSAN] Use `getIntrinsicSignature` instead of `matchIntrinsicSignature` (#194025)
bfa88a8d3c3b [libc] Implement wcscoll (#192778)
0da7c12e43bf [AMD-GPU]  Fix smfmac builtin target (#193999)
ef739b97b108 [AMDGPU] Correct gfx950 smfmac sparse index verifier (#193541)
7ec8037f3243 workflows/issue-release-workflow: Use GitHub app for generating tokens (#193825)
39e02c141d03 [Offload][AMDGPU] Use ROCr API for APU check (#193887)
833025532955 [lldb] Fix build logic in TestPtrAuthExpressions.py (#193847)
a47eca0ae99f [lldb] Rewrite make rules for TestFileBreakpointsSameCUName.py (#193871)
9805381e5f01 [libc] Move mul_overflow to math_extras.h (#194033)
6fb343619009 [dsymutil] Handle DW_OP_GNU_push_tls_address in markEverythingAsKept (#193870)
f098aa3b5b58 [lldb][Darwin] debugserver expedite new binary info, lldb use (#192754)
32eb90bfdf9f [clang][deps] Keep module cache in memory (#192347)
ad5a7609df9b [SLP]Do not cache sentinel position for SplitVectorize nodes
125f69d71ad1 [libc][math] Refactor fabs family to header-only (#182173)
719c38062a3b [NFC][lld] Avoid hex address case sensitivity in fill-trap tests (#194037)
7de8b11607bf [LoopUnroll] Make optimization remarks more precise (#190714)
ed02685dd70f Revert "AMDGCN: Skip -fgnuc-version version for amdgcn (temporary wor… (#1961)
b08ec97d37cc [LoopVectorize] Don't replace widen with replicate for ExtractValueInst (#193404)
d4ba0194f52b [mlir] Add analysis filter in dataflow solver (#192998)
ec5862a28e59 Re-apply: workflows: Use main-branch-only environment when using ISSUE_SUBSCRIBER_TOKEN (#179990) (#193801)
72ca372fa7c9 Revert "AMDGPU: Implement getInstSizeVerifyMode" (#194026)
25ec1baf2eb9 [CIR] Fix remaining (part 2) FlattenCFG rewriter contract violations (#192503)
dc41953559a6 [MLIR][NVVM] Add `nvvm.ex2` OP (#193790)
2948f9a784e8 [CIR] Add `__attribute__((annotate(...)))` support (#193329)
a3285a1a14db clang: Check -Xarch compatibility using Triple parsed architecture. (#189651)
600efe3dd9bb AMDGPU: Implement getInstSizeVerifyMode (#191461)
7b336dcfce1d [lldb] Allow forks to occur in expression evaluation (#184815)
f24bfb8967cb [SPIR-V][NewPM] Register IR-level passes with the new pass manager (#193660)
a497f90dc091 [compiler-rt] Improve ubsan-minimal runtime for GPU use (#193597)
3f3c26039f13 [DirectX] Resolve unreachable default branches in switch statements (#193592)
b785dc42c7b1 [CIR] Update uses of no-prototype GetGlobalOp (#193868)
589d337d3d44 [SLP] Update analyzeRtStrideCandidate() to correctly handle types widen than i8 (including revectorization)  (#191878)
8e4f0ce69853 [HLSL] Remove support for user-defined constructors and destructors (#193375)
a1a4da0416ad [libc][math] Refactor ddiv family to header-only (#182149)
b06f62f7fc82 [llvm] Introduce TargetInfo (#190730)
323c3da8dcb8 [flang] [flang-rt] Implement AT edit descriptor for Fortran 202X with appropriate handling and tests (#189157)
b3cc1929966e [NFC][Clang][ByteCode] Apply rule of three to Context and EvalIDScope (#193856)
61bfd7db9f55 [CIR] Tolerate identical source and destination in cir.copy (#193852)
7c043b7a571c [MLIR][XeGPU][VectorToXeGPU] Fixed lowering of transfer_read/write for rank > 2 (#193308)
0dc6e8c41e5f [AMDGPU][NFC] Refactor TryGetMCExprValue into evaluateMCExprs helper (#193859)
63e36755decf [DirectX] Denote `dx.resource.getpointer` with `IntrInaccessibleMemOnly` and `IntrReadMem` (#193593)
df359d81659a [SLP] Skip FMulAdd conversion for alt-shuffle FAdd/FSub nodes (#193960)
1ba6cc0e318b [clang][CIR] Add lowering for vcvtd_n_ and vcvts_n_ conversion intrinsics (#190961) (#193273)
cbda767c2a41 [ThinLTO] Reduce the number of renaming due to promotions in distribu… (#188074)
e8f32abba9a1 CodeGen: Fix double counting bundles in inst size verification (#191460)
1823355d06b8 [ARM] Fold SELECT (AND(X,1) == 0), C1, C2 -> XOR(C1,AND(NEG(AND(X,1)),XOR(C1,C2)) in Thumb1 (#185898)
a614cd391a40 [lldb-dap][windows] fix a race condition in runInTerminal mode (#193773)
0b449f66927f [flang] Fix abort on invalid -fdo-concurrent-to-openmp value. (#193929)
deb84db5b405 [DirectX] Apply DXIL op fnattrs to declarations (#193622)
eb17a2e19c85 [libclc] Make sure PACKAGE_VERSION is set for libclc (#193966)
1bd6f6636a39 Revert "[PreISelIntrinsicLowering] Expand binary elementwise intrinsics (#193552) (#193580) (#193990)
0f861ec33ae6 [MLIR][NVVM] Add `nvvm.cos` OP (#193792)
0642d03c76b3 [MLIR][NVVM] Remove ptx version for consistency (#193991)
cd4ac81779e0 [SLP] Add new test for widened strided loads of > i8 width (#193901)
94cdc55d8a77 [MLIR][XeGPU] Remove use-by-broadcast-only restriction for ShapeCast op in Wg-to-Sg distribution pass (#193640)
87a9cbaed1d6 [compiler-rt][TySan] Add Hexagon target support (#191603)
b0931483280d [libc][math] Refactor dmul family to header-only (#182151)
25a035d94486 [clang][bytecode] Reject float-to-int casts on non-numbers (#193968)
08992737d231 [VPlan] Use early continue in ::buildVPlansWithVPRecipes (NFC). (#193979)
4837b0a476eb [LifetimeSafety] Suppress suggestion/inference for moved loans (#193899)
785d7246bf16 [AMDGPU][Disassembler] Permit unneeded VOPD3 operands to be non-zero (#193974)
c92bf56cd7bb [lldb][AArch64][Linux] Rename "por" register to "por_el0" (#193983)
52534a1e1207 Revert "[C++20] [Modules] Don't profiling the callee of CXXFoldExpr (#190732)" (#193975)
f1f2022f4f18 [libc][docs] Add sys/uio.h implementation status (#122006) (#193980)
2168f4b3d3bf [flang][NFC] Converted five tests from old lowering to new lowering (part 48) (#193889)
c49b1773b223 [clangd] [C++20] [Modules] Introduce GC for clangd built modules (#193973)
771440f5bb2e [libc][docs] Add dlfcn.h implementation status (#122006) (#193972)
6c7d16c0bfd3 [libclc] Use 'LLVM_DEFAULT_TARGET_TRIPLE' instead of 'LLVM_RUNTIMES_TARGET' (#193969)
8e5b38383f1e [flang][OpenMP] Rename dirSpec to spec in openmp-parsers.cpp, NFC (#193967)
8baf33522df3 [BOLT][AArch64] Refuse to run JTFootprintReduction pass (#193946)
aca5d1ed27f7 [LifetimeSafety] Remerge "Add support for `new`/`delete`" (#193776)
97b7cee34583 [CIR] Introduce LocalInitOp, & lower static locals (#193576)
ee4d927dfc01 [mlir][tosa] Fix integer bilinear (quantized) tosa.resize lowering to use floordivsi (#193821)
b5d253cf31c2 [flang][NFC] Converted five tests from old lowering to new lowering (part 47) (#193886)
48e0e16886f9 [lldb-dap] extend env when testing reverse request (#193743)
629f81599d6e [X86] freeze-binary.ll - regenerate to show VPADD constant asm comments (#193953)
7b68b4b2c196 [lldb][docs] Document AArch64 Linux Permission Overlay support (#184119)
d3f4fc750db9 [AArch64][clang] Fix typos in `arm_sve.td` (NFC) (#192981)
bb3d25167abe [libc][docs][POSIX] Add sys/select.h implementation status (#122006) (#193948)
4c66205b070c [lldb][Linux] Add overlay and effective permissions to "memory region" (#184115)
a528529064db [clang][bytecode] Allow constexpr-unknown values in GetPtrBase{,Pop} (#193903)
e2170a0f18cd [VPlan] Remove unused LVer arg from tryToBuildVplanWithVPRecipes (NFC). (#193950)
e3443a1189b9 [libc] add `pthread_cond_*` public interfaces (#193656)
17861903e6bb [LoongArch] Custom legalize vector_shuffle to `xvshuf4i.d` (#164213)
e7edfd81ca1a [X86] Regenerate vector ext tests to show VPADD constant asm comments (#193942)
c0844b7b65b4 [X86] known-never-zero.ll - regenerate to show VPADD constant asm comments (#193943)
0844fdfd7255 [clang][bytecode] Start lifetime when activating pointers (#192589)
c59c19bf5921 [MachineSSAUpdater][AMDGPU] Add faster version of MachineSSAUpdater class. (#145722)
0571ce414ec0 [flang][OpenMP] Move branching verification to semantic checks (#193324)
8141a4351c5e [flang][OpenMP] Make OpenMPLoopConstruct inherit from OmpBlockConstruct (#193823)
48e65b6c4337 [AArch64][GlobalISel] Add a variant of gi_extract_high_v8bf16 (#193345)
a227dc7c0980 [DAG] visitIS_FPCLASS - fold to constant when result is fully determined by KnownFPClass (#193737)
e6db81282c7f merge main into amd-staging
f64e8d1567ad [AMDLIBM] Remove the mapping of the deleted vector call (#193760)
a48159df9ce5 [AArch64][llvm] Remove support for FEAT_MPAMv2_VID (#193191)
cedcf1876cb1 [Clang][Sema] Change `ExtnameUndeclaredIdentifiers` to MapVector. (#193924)
90020486aa89 [X86] masked div/rem tests - fix avx512 and add sse4/avx2 test coverage (#193933)
3fa0ac20705b Reland "[lldb][Linux] Read memory protection keys for memory regions (#193934)" (#193936)
cc0913eacf88 [clang][bytecode] Fix `MemberExpr`s with a static member (#193902)
9df51a964ddb [MLIR][NVVM] Add `nvvm.sin` OP (#193775)
23ea7363ff76 [AAEval] Print ModRefInfo for atomic operations (#193935)
390a29ea8339 Revert "[lldb][Linux] Read memory protection keys for memory regions" (#193934)
e2159f8e3c1a [LangRef] Allow monotonic & seq_cst accesses to inter-operate with other accesses (#189014)
b1af74c44790 [flang][OpenMP] Remove duplicate code block in MapInfoFinalization (#2306)
69724c88e3e0 [lldb][Linux] Read memory protection keys for memory regions (#182246)
faf37adb3b33 [Clang] Use const std::string & in ClangOptionDocEmitter. NFC. (#193926)
ca136926c125 Fix formatting of changes in recent redefine_extname changes. (#189938)
f780e46d6e0e [llvm][ExpandMemCmp] Avoid making copy of loop value (#193915)
8685fb10938f [mlir][math] Add constant folding for `math.fpowi` (#193761)
839a22f449b3 [Flang] Add `INLINEALWAYS` Compiler Directive (#192674)
666e2af1d575 merge main into amd-staging (#2305)
170f030c22c5 [mlir][math] Use APFloat::SemanticsToEnum in constant folding (#193914)
3ec9bbc3da9c [DSE] Merge two test files and generate checks (NFC) (#193922)
297fb9377a6a [LICM] Generate test checks (NFC) (#193921)
4852c5b159c8 [ARM][MC] Gate Thumb hvc alias on virtualization (#193532)
c55a73c44e4a [lldb] Remove full stop from AppendErrorWithFormat format strings (part 1) (#193750)
340ba1191cfc [SPIRV] Do not add aliasing decorations to OpAtomicStore/OpAtomicLoad (#193779)
bd469e8a1d47 [SDAG] Minor cleanup to TargetLowering::expandFP_ROUND. NFC (#193793)
347aa3f6fbcc [GISel] Disable opt_brcond_by_inverting_cond combine at O0 (#193417)
efffb04c2b7b [DirectX] Fix DILocalVariable (#192573)
3de31988417d [DirectX] Replace non-const count of DISubrange with -1 (#192576)
8b3eac05aa74 [DirectX] Convert DICompileUnit versioned language (#192574)
eb6eb9fa0a0c [DirectX] Convert debug values to old style (#192162)
6114cbb611c8 [DirectX] Fix debug dump of ValueEnumerator (#191251)
6c16fc8a1a1d [lldb][test] Remove full stop from expected error messages (#193748)
248192d5bef1 [RISCV] Add bf16 tests for interleave and deinterleave (#193720)
e4f1530edcc9 [flang][debug] generate llvm.fake.use for arguments at -g and O0 (#187044)
fc9f14e42422 [libc] Switch check-libc from CTest to lit (#193798)
65e766dfda81 [libc] Honour LIBC_GPU_TEST_JOBS in lit test runs (#193797)
7758ee59e7a2 [libc] Fix implicit conversion warning in mktime_test (#193504)
0d332848bf28 [SPIRV] Lower load/store atomic to OpAtomicLoad/OpAtomicStore (#185696)
b5483871391b [LV] Simplify live-out extraction for first-order recurrence phis when tail folding (#176108)
99c9a1f566df [mlir][EmitC] Add tests for arith.max/min float/signed int conversions (#190160)
b565800d99a4 [lldb] Add regression test for stale Symbol pointer crash in statusline (#193854)
e7e85a744871 [IR] Remove pointer arguments from loop.dependence.{war|raw}.mask (#188248)
347f1ac86d54 [MLIR][Vector] Add fastmath attribute to vector.contract (#192788)
a7368c3b48f8 [NFC][Clang][docs] Clarify the status of P1949R7 (unicode identifiers) (#193483)
3041708a17b3 [Tooling][clang-tools-extra] Consume CommonOptionsParser errors in tools (#193675)
aadf3959eb0d [libcxx][Github] Add generic-llvm-libc config to CI (#193822)
ab27c601b01a [Comgr] Add end-to-end LIT coverage for amd_comgr_hotswap_rewrite (#2291)
5d4b17e963b1 [Clang][SPIRV] Add getSRetAddrSpace() for SPIRV (#193875)
104ee2aed28d [NFC] [clangd] [C++20] [Modules] Add a test for testing transtive change detection (#193888)
70e2e7e63f98 merge main into amd-staging
dbaa12a89f45 [AMDGPU] Add MC tests for scalar operands for packed fp32 instructions (#193866)
eef81b7a0a63 [lldb/test] Fix TestModuleLoadedNotifys duplicate module check (#193846)
70fcb235250b [lldb/test] Fix TestCompletion on Windows after realpath change (#193878)
28d2537af2b6 [clangd] [C++20] [Modules] Introduce persistent cache for clangd built module file (#193883)
af166f419fb9 [LoongArch][NFC] Pre-commit tests for vector fpext from vxf32 to vxf64 (#164740)
1249cb6aea88 [clang-scan-deps] Fixes an assertion in clang-scan-deps (#193619)
61b0de5f14a3 [RISCV] Remove codegen for vp_fneg, vp_fma. NFC (#193214)
67e1411de836 [VPlan] Fold lhs | (headermask && rhs) -> vp.merge rhs, true, lhs, evl (#193511)
e3ab3688e1c6 [X86][COFF] Enable basic-block-address-map emission (#191347)
6e2f5e9679cd [OpenCL] Diagnose error for zero-length array (#193163)
528e673fec47 [Clang][CodeGen] Fix sret lifetime marker AS mismatch after #186275 (#193850)
969247cc47a3 [libclc] Allow testing unresolved symbols on multiple libraries (#193647)
f3192382c336 [libclc][CMake] Remove CMAKE_C_COMPILER_ID check (#186717)
a95a1c40edba [LazyValueInfo] Support vector types in ICmp condition handling (#192900)
c6b998443589 [NFC][Clang][Sema] Apply rule of three to Sema helper classes (#193835)
deb238e224b7 [gn build] Port 3081d52d8242 (#193862)
2611f151c3a4 [gn build] Port a4538a3ad902 (#193863)
9b7b83b3499b [gn build] Port d137e6601f1c (#193864)
0cd635ca4504 [gn build] Port d64dd5a2afea (#193865)
4f877e47e69b [gn build] Port 2039a51881bb (#193861)
6b4cdb036471 Revert "[gn] port 40fcd2517a110 (#193293)" (#193860)
2a74f30cc20e [CIR] Add coroutine cleanup handling and update co_return semantics (#189281)
0bdcf4ee4dd5 Revert "Reapply "[clang][modules-driver] Add support for C++ named modules and `import std`" (#193857)
0d1bf3ac8eb1 [HLSL][NFC] Refactor worklist loop in HLSLEmitter.cpp to use index-based iteration (#193638)
3174c94eaf20 Revert "[flang][cuda] Preserve fir.rebox captured by cuf.kernel in SimplifyArrayCoorOp" (#193855)
8e736e102bd5 merge main into amd-staging
9473873906b6 Reapply "[clang][modules-driver] Add support for C++ named modules and `import std`" (#193815)
689dc6c58c01 [CIR] Handle boolean expression as array indexes (#193814)
2709f4872c66 [Flang][OpenMP] Support for parallel regions in Generic kernels (II) (#2276)
dca61067836e [mlir][xegpu] Add support for `vector.transfer_read/write` on SLM buffers (#192757)
28027f8ffee1 [MachineOutliner] Do not allow debug instructions to affect liveness computations. (#192336)
6364bf68058b [lldb] Remove unused ValueObject::IsBaseClass(uint32_t &depth) (NFC) (#193849)
c7c48e51a43d merge main into amd-staging (#2299)
8b96c2104e74 [flang] Add comparison operators for c_devptr (#192687)
464392e9d3b5 [TySan] add internal interface support (#192413)
b756be64cad6 [flang][cuda] Preserve fir.rebox captured by cuf.kernel in SimplifyArrayCoorOp (#193837)
42077db9afb5 [lldb/test] Fix TestOSIndSYM for Darwin embedded platforms (#193839)
281f993aafa6 [CIR] Add nonnull on returns and pointer params (#188281)
3cd4a795d9fd clang: Avoid hardcoding some offload triple strings (#193811)
8ebc7307fa4a [llvm-rc] Add support for MIPS machine (#193830)
7ed9d965f29e [AArch64][PreISelIntrinsicLowering] Adjust tests to include -march=+sve (#193833)
81b827f2e71a [Comgr] Add agent-config files (CLAUDE.md, .cursor/rules/comgr.mdc) (#2301)
de82b4790943 [Clang] Fix sret AS for non-trivial-copy returns. (#186275)
6b31a99ee4f5 Revert "[Darwin] Remove linker version checks for objc_msgSend selector stubs (#193637)" (#193828)
3baafed3e779 [NFC][offload][OpenMP] Fix kernel replay documentation (#193832)
ecefc4a2ec2b [VPlan] Shallow-traverse vector-loop in dropPoisonGen (NFC) (#193635)
6d826cb602f8 [flang] Add parser support for Fortran 2023 conditional arguments (F2023 R1526-R1528) (#191303)
0b255fe83f17 [mlir][canonicalize] Add filter-dialects option (#193041)
ad4cd22cebf5 [libcxx] Use debug() instead of note() for substitutions (#193667)
c9014d34522b [PreISelIntrinsicLowering] Use index type for index in intrinsic expansion (#193807)
b96263ce6c68 [HLSL] Update global array convergence test (#193380)
bd8b9934f7c7 [SPIR-V] Fix half precision OpConstant for log10/exp10 lowering (#193730)
1e690a2c16fd [HLSL][DXIL][SPIRV] Added DeviceMemoryBarrier() and AllMemoryBarrier() intrinsics (#190633)
b53aeab1d237 [X86] Add test coverage for #193700 (#193819)
36d19f50db3f [MLIR][Mem2Reg] Ensure dominance of default value in regions (#193708)
a6ab955369ae [Darwin] Remove linker version checks for objc_msgSend selector stubs (#193637)
f63bd03c1db0 [CIR] Handle CK_UserDefinedConversion and related casts in emitCastLValue (#193611)
44a1d740333b [lld][WebAssembly] Always initialize fixed `__tls_base` in single threaded mode (#193563)
0bdaf63d0159 [mlir] Enhance error messages for attribute type mismatch in properties (#193758)
86230d50912c [CIR] Implement VLA cast for ComplexType (#193583)
1b6c29ad5cfb [X86] resolveTargetShuffleInputsAndMask - match repeated vector sources through bitcasts (#193810)
d2553595793f [NFC][AMDGPU] Remove `amdgpu-link-time-lds` module flag (#193806)
df1c7ebac75f [lldb] Speculative fix for crash in Function::GetCallEdges() (#193636)
3e432057bfca [CIR] Add restrict→noalias on non-builtin pointer params (#191483)
6d67286e5cf1 [lldb/test] Fix TestDataFormatterObjCNSBundle.py following 8212cab4128d (NFC) (#193816)
8c1081531323 [scudo] Adjust PROT_MTE page count for secondary allocator (#192202)
c26c714d53f0 AMDGPU: Use preferred --target=triple flag in documentation (#193817)
80a00e1de18f [AMDGPU] Implement amdgpu.dot op (#193371)
9bda9bdf142b [CIR] Allow multi-block ctor regions on GlobalOp (#193596)
6fe957a8423f [clang][lit] Don't substitute cir-opt if it's not enabled (#193665)
c71780cf5879 Triple: Add constructor from enum entries (#190632)
3ec98d6c15bc [CIR] Implement handling for destroying delete (#193607)
ca90ff511250 [CIR] Handle negative offsets in pointer constants (#193624)
56fd2c016fd4 [CIR] Upstream __builtin_astype int_to_ptr (#193519)
e3a65e9175cb [ProfCheck] Add test from #193580 to xfail list (#193799)
d707870ceb1a [libcxx] XFAIL some tests for LLVM libc
e4c44c6cc903 [libcxx][Github] Build container images in separate jobs (#193346)
f650fba569c4 [libcxx] Add Testing Configuration for LLVM libc
dc34d163d8c9 Re-apply "[AMDGPU][Scheduler] Use MIR-level rematerializer in rematerialization stage (#189491)" (#192443)
18e4f3be2d49 [flang][OpenMP] Add parallel loop to loop directive parser set (#193621)
4a74a4346c34 [flang][cuda] Flatten memref descriptors in GPU kernel argument packing (#193651)
a1d11348aba4 [PreISelIntrinsicLowering] Expand binary elementwise intrinsics (#193552) (#193580)
712b058c1222 [lit] Fix `progress-bar.py` flaky test (#193741)
254fcbeface8 [offload][OpenMP] Add basic documentation for kernel record replay (#193699)
df77a292371c [offload] Fix envar description in docs (#193642)
5506829f7fd7 Remove unused parameter; NFC (#193767)
d060b496ebf7 [flang] Route elemental CHARACTER MIN/MAX OPTIONAL cases through custom lowering (#191244)
84eb64d64f78 [llubi] Implement intrinsics for integer arithmetic/bit manipulation (#193702)
44d1832283ca Reland #2 "[STLExtras] Add a template for detecting whether a type has an equality comparison operator" (#177415)
87a8d40fdd44 [HLSL] Add codegen for accessing resource members of a struct (2nd merge attempt) (#193584)
0459273e347a [ConstantRange] Expand makeAllowedICmpRegion to use samesign to give tighter range (#174355)
9f75001f924a [lldb] Fix flaky TestRunLocker by using lldb.target instead of lldb.frame (#193788)
7779ee8c0e95 [flang] Support polymorphic types in conditional expressions (#192684)
ee3ed4a0f688 [InstCombine] Fold neg arg in hyperbolic lib functions (#193586)
6f999388adeb Reland "[Lit][NFC] Refactor shell environment functionality and in-process builtins from TestRunner.py into new modules" (#193759)
dd8c77765646 [RISCV][GlobalISel] Add intial support for inline asm (#193314)
2248253e7f97 [PowerPC] fixed issue "Failure to optimize (x == 0) ? 0xFF : 0 to addic+subfe instead of cntlzw+srwi+neg" (#190606)
cf2b30aa2aca [libc] Honor per-test timeout in lit test format (#193772)
321db053f50d Revert "[AArch64][GlobalISel] Do not run the Localizer at -O0 (#177359)" (#193781)
7df553349756 [libc++] Implement `ranges::fold_left_first` and `ranges::fold_left_first_with_iter` (#180214)
023e2e6c9d99 [DAGCombiner] Fold bswap of single-byte-known-nonzero value to a shift (#193473)
f7828886920c [X86] Regenerate bit integer tests to show VPADD constant asm comments (#193763)
602cc92c19c1 [Hexagon] Add SafeStack runtime libcall to HexagonSystemLibrary (#191673)
d6ebdf4a1989 [NFC][Clang][Analyses] Fix AccessPath to have deleted copy assignment (#193639)
37cd9addde72 [TySan] Fix size type mismatch in instrumentMemInst for 32-bit targets (#191601)
2d84862fb3f4 [CodeGenPrepare] Drop nuw on gep unmerging if the new index is negative (#193488)
183168aa5035 [OpenMP][OMPIRBuilder] Convert cmpxchg memory order to C ABI constants (#193536)
a54364a7dc4a [DAGCombine] Relax restriction on (bswap shl(x,c)) combine (#193679)
ef82b673fb56 [RISCV] Pass Subtarget to CC_RISCVAssign2XLen. NFC (#193609)
f99880ef8cc6 [libc][NFC] Fix typo in GPU test warning message (#193762)
4300a3967bd2 [Comgr] Fix hotswap asm parser SourceMgr crash on bad input (#2295)
91b0fbc6bc1e [X86] Use getTargetVShiftByConstNode helper to reduce code duplication. NFCI. (#193736)
f19f3cde7fa5 [NFC][AMDGPU] Make code consistent in MCResourceInfo::gatherResourceInfo (#193735)
a829194012f2 [mlir][vector] Generalize castAwayContractionLeadingOneDim (#187312)
bced9f751e89 [MergeICmps] Check for libfunc emittability (#193764)
838fcbb9aa8a merge main into amd-staging
2f5fe2cf312c [analyzer] Fix typo in ExprEngine.h (#193535)
19d97727aafb [LLVM][CodeGen] Ensure SystemZTDCPass::convertFCmp only accepts scalar floating point types. (#193738)
e9ef76d7e27e [LifetimeSafety] Simplify `AccessPath` root `PointerUnion` (#193520)
47523f7f079b [clang][docs] open details of C++{17,14,11} implementation by default (#193141)
72b061857e0b [clang][docs] fix typo; NFC (#193648)
c32d2d1f2951 [flang] Add the MLIR pass pipelines for dumping (#183144)
a88516baa735 [lldb/test] Update remaining `filecheck` call sites to use `filecheck_log` (NFC) (#193654)
f6c4280ea906 [libc][docs] Add poll.h POSIX header documentation (#122006) (#193734)
8212cab4128d [lldb/test] Relax NSBundle formatter test for Darwin embedded platforms (#193659)
0ff393f5dac1 [NFC][Target] Fixed rule-of-three for RegisterTargetPassConfigCallback class (#193470)
ff125ae1bae7 [ConstantFolding] Constant fold nextafter and nexttoward (#168794)
2a5126748fb5 [AMDGPU] Fix s_cselect scc clobber issue (#193498)
49dad1672430 [flang] Ignore -fno-realloc-lhs for polymorphic allocatable LHS with warning (#192697)
41d05aef9c63 Revert "[Lit][NFC] Refactor shell environment functionality and in-process builtins from TestRunner.py into new modules (Reopened)" (#193740)
9b986d49d6ca [Lit][NFC] Refactor shell environment functionality and in-process builtins from TestRunner.py into new modules (Reopened) (#177358)
1ee288a2b51d [SPIR-V] Combine storage class bit with atomic memory semantics (#193696)
5a9d0cf7190f [PowerPC] Add mnemonics to paddis (#179979)
1881b1cf436d [Bazel] Fixes cd26e99 (#193729)
ba0e4af50fd4 [CodeGen][NFC] Do not iterate in DCE unless needed (#193355)
1bcdf716ae23 [AMDGPU] Add a sched group mask for LDSDMA instructions (#190872)
ea7a1782fbea [clang-tidy][cmake] clangTransformer cmake fix
4997815e883e merge main into amd-staging (#2297)
1b325745f134 [X86][GISel] lower GOT-relative G_GLOBAL_VALUEs (#181983)
a9731960adc6 [AMDGPU] misched: avoid subregister dependencies (#140255)
d28eeaa99735 [LangRef] Make volatile loads non-willreturn (#192992)
cd26e990ebed [mlir][memref][NVGPU] Move NVGPU ops to IndexedAccessOpInterface (#190430)
25ad2ee86da1 [mlir][IntegerRangeAnalysis] Don't unsoundly update constant lattice (#193546)
01291a8ea59a [LoongArch] Custom legalize vector_shuffle to `xvextrins` (#164375)
43aa40ddc67c [flang][OpenMP] Remove OmpEndLoopDirective from PFT (#193602)
263e4f22fd14 [NFC][LLVM] Simplify IIT encoding for scalable vectors (#191737)
b9a2e843d9b2 [NFC][SPIR-V] Add urem, srem, and snegate tests for integer arithmetic (#193170)
148f5509e6f2 [lldb-dap] Make Breakpoint ids unique. (#193526)
eb29a502b6e8 [Clang] Fix constexpr __builtin_(add|sub|mul)_overflow bugs (#192568)
3f3b50054165 [LV][NFC] Remove more unnecessary passes from RUN lines (#193686)
4645dc7bac06 [NFC][AArch64] Regenerate ldst-opt.ll checks to use update_llc_test_checks (#193712)
e407fc3f3bca [AArch64][GlobalISel] Do not run the Localizer at -O0 (#177359)
d64dd5a2afea [LV] Factor out VF-independent code from cost model (NFC). (#192426)
cd050a0fe32d [Mips] Support mips1 and singlethread ATOMIC_FENCE (#190129)
12411c1e6ac5 [clang][bytecode] 0 bitwidth IntAP values also use one word (#193224)
882527f9f5ae [NFC][SPIR-V] Remove dead non-intrinsic path in selectAtomicCmpXchg (#193692)
4f0af7adaac9 [ISel][AArch64] Add CodeGen support for partial sub reductions. (#186809)
d19e954b83cb [LLVM] Make -use-constant-fp-for-fixed-length-splat the default. (#193264)
1f332ae4f1b3 Fix -Wformat diagnostic after #190965 (#193704)
b7e2f7838974 [Comgr] Fix Windows build: use LLVM_ATTRIBUTE_WEAK for hotswap stubs
7e9561ffd11d [SystemZ] Enable LoopVectorizer interleaving for vectorized loops. (#184306)
e3ebeeca9531 [Coverage] Skip coverage mapping for consteval member functions (#190870)
fc843236738b AMDGPU: Set transient stack alignment to 4 (#193517)
848be2d800db Add missed variable change from a refactor (#193684)
de281fe22055 [SystemZ] Implement getCFInstrCost(). (#191017)
51325ba110fb [clang-tidy] Fix FP in bugprone-exception-escape for bodyless non-throwing functions (#192658)
57c10b0b8032 Add C++20 diagnostic to macro-braces-recovery.cpp. (#192654)
d0a38203bf2f [NFC] Remove assert from AArch64TargetLowering::LowerCTTZ. (#193474)
124fd5997e5f [DA] Remove monotonicity-related code and tests (#193697)
7425ab9d9577 [AArch64] Fix `shufflevector` miscompilation on `aarch64_be` (#193076)
378cd9a307c2 [libc++] Avoid using ranges::upper_bound in <format> (#186781)
de9830fe40bf [InstCombine] Treat sdiv as udiv in foldICmpDivConstant when both operands are non-negative (#188731)
5502053d5786 [MemoryDependenceAnalysis] Disambiguate visited state in non-local pointer dep tracking (#193220)
7239415b6f29 [X86] Add crash test coverage for #193475 (#193690)
2aadaae9a0ba [LLD][MinGW] Introduce --native-def argument (#193598)
1bf0787a1638 [lldb] Remove trailing newlines from AppendErrorWithFormat calls (part 3) (#193527)
9baca0126178 [mlir][tensor] Consolidate tensor fold patterns and rename related file (#192820)
6a06c8bdcbda [BOLT][AArch64] Refuse to run ThreeWayBranch pass (#193252)
5673b0215c06 [RISCV][MC] Remove tautological CHECK-UNKNOWN disassembly checks (NFC) (#193682)
5a45fbb35ea3 [llvm][RISCV] Split LMUL=8 fixed vector fcmp for zvfhmin and zvfbfmin (#193424)
8e2c42b3bcb5 [LV][NFC] Stop running DCE pass in tests (#193521)
4209849cd3db [lldb-dap] Add valueLocationReference for member function pointers (#186837)
52e5d65561c5 [AArch64][GlobalISel] Add fpext bf16 legalization. (#193342)
019cf510ba39 [RISCV][NFC] Rename isZipEven/isZipOdd to isPairEven/isPairOdd (#193674)
3f6aa4dd8052 [CodeGen][NFC] Remove InsertPt since it's always the same as MI (#193668)
582db3c2371d Revert "[clang][modules-driver] Add support for C++ named modules and `import std`" (#193677)
9152f212208b [compiler-rt] Set CMAKE_INSTALL_MESSAGE to NEVER for custom libcxx (#193666)
dd13552783ce Reland: [LowerTypeTests] Add debug info to jump table entries (#193670)
c1ff819a184e [mlir][LLVMIR] Extend FP array-splat constant lowering (#192378)
739c45916d4c [clang][modules-driver] Add support for C++ named modules and `import std` (#193312)
e3b3706ceeea Revert "[compiler-rt][asan] Add asan checks for __builtin_assume_dereferencable" (#193655)
b25ccddf84d0 [libc] Readd instructions on building kernel headers from sources
85c13cea0ba6 [Flang][Semantics] Allow EVENT_TYPE, LOCK_TYPE and NOTIFY TYPE to be deallocate (#192940)
26cc17f4bcde [libc] Drop elf.h include from dl_phdr_info header
cc91dbbd274d [clang][modules-driver] Reject module definitions in non-module inputs (#193629)
968e34e09ab5 AMDGPU/GlobalISel: RegBankLegalize AGPR support and gfx908 MFMA rules (#192603)
642d0167de9c [WebAssembly] Fix wide bitmask fallback in performBitcastCombine (#190915)
46bb3789ef6c [ScheduleDAG] Avoid duplicate worklist entries in ComputeDepth/ComputeHeight. NFC (#192023)
a3f1035b48a0 Revert "[LowerTypeTests] Add debug info to jump table entries" (#193663)
66beeecd4694 [NFC][GlobalISel] Use move capture for SmallVector in LegalityPredicates lambdas (#193464)
2a7313a2ff42 [LV] Relax OutOfLoopUses check in `getMinMaxRecurrence()` (#189906)
60cd34d22175 [SPIR-V] Fix OpTypeImage capability requirements for Vulkan (#192626)
2fe8966d029b [SPIR-V][docs] Document supported extensions missing from SPIRVUsage.rst (#193449)
abb4ff508e8f [LowerTypeTests] Add debug info to jump table entries (#192736)
fafcafd6eb0f merge main into amd-staging (#2292)
c74951c6c307 Revert "Reapply "[JTS][Passes] Enable JTS By Default" (#193409)" (#193649)
2855525c4a1f [ELF] Handle INCLUDE like a call stack (#193427)
96bc719fbad5 [flang] Add Flang Community Call notes for 4/22/2026 (#193575)
50916c4319a1 [CIR][RISCV] Support zbb builitin codegen (#188932)
80f540b6e312 merge main into amd-staging
09602502a9fd [PSDB][Linux] add render group access to aomp smoke test container
b2ae992193e3 [RISCV][CodeGen] Add initial CodeGen support of vpair{e,o} (#192918)
06a7d41eb12b [flang] Disable copy-out to INTENT(IN) args (#192382)
1a772bc616cf [X86] Improve FREEZE node elimination for SETCC operations (#192362)
a1a40cb725f4 [lldb/test] Fix shared library symlinks for remote testing (#189177)
0d0595b50f84 [SPIR-V] Encode Atomic metadata as UserSemantic string decoration (#193019)
46e09c516ffc [ExpandMemCmp] Pre-collect memcmp calls to improve compile time (#193415)
793bdd859789 [libc][CndVar] reimplmement conditional variable with FIFO ordering (#192748)
ccc608f11937 [DirectX] Implement lowering of Texture Load and Texture .operator[] (#193343)
417f5bc95296 [NFC][sanitizer_common] Fix getpw_getgr.cpp test for large groups (#193625)
cdbb6704a4ce [SLP][NFC] Precommit test for strided store revectorization (#191569)
90209202d336 [CIR][NFC] Delete unnecessary errorNYI call in emitDelegateCallArg (#193608)
819aabfe1ba1 [lldb] Update filecheck_log to use direct input (NFC-ish) (#193618)
2ca5abe3f593 [SPIR-V] Handle ASM with multiple outputs (#187128)
2c7b820bc8fd Ensure that the Synthetic children of a ValueObject are managed by their parents ClusterManager (#192561)
9f5e0ac8a8f2 [libc] Add some more segment type macros
802de7ebd18e [offload] Allow replay repetitions and report basic timing (#193388)
e68d91afdff3 [NFC][SPIRV] Introduce function to handle 64 bits overflow (#193088)
107701b3e619 Revert "Reapply "[SimplifyCFG] Reuse function comdat for switch lookup table"" (#193582)
8a12b26feb0c Revert "[libc] Replace check-libc with lit-based test execution" (#193610)
19b40f71fd70 [SPIR-V] Add SPV_AMD_weak_linkage extension (#193307)
4b44e2039c78 [fuzzer] Set target_cflags instead of target_flags in lit config (#191510)
083cab66fa7c [SLP] Precommit tests for strided store reordering (#193565)
e4e8bcb54da5 [RISCV] Expand vp.and, vp.or, vp.xor (#193542)
7431a4f984be [SLP]Fix dominance for multi-use copyable scalars in scheduled bundle
24be43f5c5f1 [VPlan] Pick correct insert point after creating canonical IV. (#193587)
1bec68a29602 [RISCV] Remove codegen for vp_abs (#193533)
dec3b1fea9b9 [lldb] Fix empty backtraces for scripted threads with no artificial frames (#193387)
ed2f5f42ee26 AMDGPU: Skip last corrections in afn f64 reciprocal (#183696)
59596d789eac [AArch64][GlobalISel] Add hadd-combine globalisel test coverage. NFC (#193591)
b9b472c8740a [NFC] Add check lines to concepts-out-of-line-def.cpp to fix failure (#193579)
698dce153ab5 [flang] Fix inline transfer for unsigned integer types (#193570)
fed79d4c91b5 [RISCV] Expand vp.inttoptr, vp.ptrtoint (#193530)
97015ad916c4 [HLSL] Disallow `volatile` keyword (#193322)
0062071c2759 [CIR] Fix a dangling reference to a replaced global (#193561)
6ef1b80feff5 [BOLT] Fix null pointer dereference in DWP processing with split DWARF (#191474)
5d9a1c172bae [lldb] Eliminate linear scan in SetSectionLoadAddress (#193560)
1c6ab1136504 Add expand-fp-math.ll to profcheck-xfail.txt (#193577)
1fe66f66d239 [llvm-objdump][offload] Fix offload bundle decompressing (#192729)
ff87dca5c5b4 merge main into amd-staging
99b369246cb0 Revert "[offload] Fix synchronization when record replay is enabled (#193291)"
335f9f95bc0b [HLSL] Reuse temporaries of aggregate types in list initialization (#191605)
1aad4a25a917 [PreISelIntrinsicLowering] Expand all unary elementwise intrinsics (#193552)
e2f66182fa60 [clang][Modules] Avoid checking for duplicating module definitions when a module does not have a valid definition location (#193534)
fb024337497f [Clang][AST] Introduce `ExplicitInstantiationDecl` to preserve source info and fix diagnostic locations (#191658)
cfa133a90772 merge main into amd-staging
50d7c990d9b3 [flang][OpenMP] Support user-defined declare reduction with derived types (#190288)
ebf14ed8b8e0 [CIR] Fix lowering of strings in constant array attributes (#193553)
0dbf7373b51e [LangRef] inline asm: the instructions are treated opaquely (#157080)
55762f305866 IR: Allow !fpmath metadata on homogeneous float structs (#193537)
be529fc55f2f [SLP]Fix scheduling of copyable bundle with commutative op used outside parent PHI
80efad535e81 [CIR] Support guard COMDAT for weak linkage in LoweringPrepare (#193274)
37be0841b30b Reland: [MemProf] Dump inline call stacks as optimization remarks (#193545)
da7ee36ad521 Revert "[clang] fix matching constrained out-of-line definitions of class specialization member function templates" (#193558)
8f2935c2ebe3 Loosen check for clang version string in test to work when setting CLANG_VENDOR. (#192961)
38874e1897e3 [GlobalISel] Change SSUBO to do (LHS < RHS) XOR (RESULT < 0) (#191744)
9e649076b9d6 [libc] Replace check-libc with lit-based test execution (#184163)
6e4fb52144fe [VPlan] Use MaxRuntimeStep in materializeVectorTC to simplify middle br. (#193067)
4f1be838a9f1 [compiler-rt] [Darwin] Enable arm64e tests on macOS (#193391)
440872232bbe [NFC][MachineBlockHashInfo] Add static asserts to guard agains hash_16_bytes changes (#192862)
91fe498ccc31 Revert "[SelectionDAG] Salvage debuginfo when combining load and z|s ext instrs. (#188544)" (#193554)
5d0143187437 [lldb] Scope symbol lookups to specific modules in ObjC/SystemRuntime plugins (#193379)
c3c8e40b6cb1 [Runtimes] Allow HandleLibc.cmake to be called multiple times (#193540)
d9bbb902fe8f [LegalizeTypes][DAG] Use SHL(X,1) instead of ADD(X,X) for variable vector indices for extraction/insertion legalization (#188277)
8f1b0f632756 [lldb] Decorate tests that use threading (#193117)
eb427a4cbbed [libc][NFC] Fix minor RPC warnings (#192997)
18bd7e409217 [Bazel] Fixes e52df04 (#193548)
2f5ccd4aaa3e [MLIR][XeGPU] Do not use ocloc lib if LLVM_BUILD_LLVM_DYLIB is ON (#193259)
b0166e7a2094 [libc] Fix .params file generation for integration tests (#193544)
bd09b03b1b09 [NFC][ADT] Make a few functions constexpr (#193302)
7a633290d6b1 Revert "[Support][JSON] Use `std::unordered_map` for object storage" (#193549)
7136a4b39b05 [ELF] Factor linker-script dispatch loops into helpers. NFC (#193547)
d2673ad6b0eb [RISCV] Expand vp.fshl, vp.fshr (#193225)
e52df047f762 [Support][JSON] Use `std::unordered_map` for object storage (#171230)
c3bd0c12943c [lldb] add terminfo name (#191740)
fb3ab402c1dc [lldb/test] Fix BacktraceRecording path for Darwin embedded devices (NFC) (#193436)
f1c4db6aef43 [SelectionDAG] Change SSUBO to do (LHS < RHS) XOR (RESULT < 0) (#191747)
836c77bc0282 [libc][docs][NFC] Rename Maintainers.rst to Maintainers.md (#191882)
94b9accfab9e [RISCV] Remove codegen for vp.fcmp (#193529)
5ea5b9eb8eec [profcheck] Fix assert in getInitializer call on global with no initializer (#193514)
8c82aa0a5070 [lldb] Log clang module loads (#193389)
cdc0a9073733 Revert "[ASan][Windows] Fix memmove/memcpy interception on x64" (#193524)
bb092120f1f0 [Hexagon] Non-pie default on hexagon-unknown-elf (#193184)
2072474a24c4 [OpenMP][OMPIRBuilder] Support complex types in atomic update/capture (#191490)
fb6f1bde00c4 [ItaniumDemangle] Strip __alloc_token_ to transparently demangle allocation functions (#191048)
42ef1321e6cb [MLIR][BUILD]: Fix for 36331abd8cbb630fc174e182f1580e7cdefd2616 (#193523)
9a63d044471d [lldb] Fix inappropriate uses of LLDB_INVALID_IMAGE_TOKEN (#193365)
fc910693c344 [AMDGPU] comgr: add HotSwap B0-to-A0 policy and public API (3/3) (#2203)
3950da0bc764 [RISCV] Add isKnown method to VSETVLIInfo. NFC (#193406)
58f3d7810211 [RISCV][P-ext] Custom legalize vector (setne X, allzeros) and (setgt X, allones) (#193360)
8fc58340444b [GIsel] Use changeElementType for cond types in LegalizerHelper (#193049)
36331abd8cbb [mlir] targeted verification for transform "inlining" (#192956)
6f115abef8c9 [lldb] Remove unused ExpressionPathOptions: NoFragileObjcIvar, NoSyntheticArrayRange (NFC) (#193336)
9d704b490d5a [flang][OpenMP] Remove unused member, NFC (#193512)
93419bf99ede [AArch64][ISel] Use TripleOpVT in LowerVECTOR_INTERLEAVE (#193506)
b13867d5146b [compiler-rt] Initial support for compiler-rt builtins on SPIRV64 target (#192897)
20005a09f950 [flang] Update LIT test for big-endian platform (NFC) (#193309)
8144c14a742f [lldb] Fix assert frame recognizer for non-macOS Apple platforms (#193435)
d154ccd8d4a1 [AArch64][ISel] Add lowering for fixed-width deinterleave3 (#192972)
54fcd8620313 [libc] Add struct sockaddr_storage (on linux) (#192978)
e1ab08a5bd26 [RISCV] Functional llvm.vector.reduce.mul on scalable types (#193094)
59bf8960f28d [lldb] Remove trailing newlines from AppendErrorWithFormat calls (part 2) (#193168)
8f51fe418509 arm: fix float to integer conversion with `+mve` (#193319)
b77a894afa2e [clang][bytecode] print array root state in Pointer::print() (#193494)
1364f522a773 [LoopInterchange] Fix out-of-bounds accesses in tests (NFC) (#193272)
4dbb7ee833d5 [DAG] visitFREEZE - revisit frozen node after merging with unfrozen uses (#188206)
ba767d0bbbde [MachineCopyPropagation][NFC] Refactor EliminateSpillageCopies (#192609)
85dc81166ac3 [X86] Add TODO for nsw+nuw handling to (add (add X, Y), X) --> add(add(X, X), Y) (#193503)
475639a10ecd Add SPIRV to excluded profcheck targets (#193509)
20a7d26297c2 [SPIRV] Fix legalization of zero-sized intrinsic globals (#192730)
c8b526f76b63 [bolt] AArch64: Fix TLSDESC to LE relaxation by mold (#190370)
eda6c60a051f [offload] Get kernel argument sizes from Level Zero (#192487)
f5e80c985804 [Flang] Add SIMD Compiler Directive (#192969)
a623bd913507 [lldb] Add full stop to "memory tag" help (#193505)
f66b3baed254 [libc][math] small typo (#193349)
94278b30b957 [flang] Move ResolveAccParts and ResolveOmpParts into better location… (#193497)
385e7eaafedd [clang] isConvertingBoolWithCmp0 - fix MSVC "not all control paths return a value" warning. (#193495)
124f14cbf7d3 [Passes] Remove Os/Oz from pass option listings (#193491)
c6e90081a7bc [InstCombine] Remove support for volatile in phi of load transform (#193154)
2e0112ff197d Enable disable LSPs in extension (#191957)
e7fd6fe12faa [NFC][SPIR-V] Consolidate OpVariable insertion point logic into getOpVariableMBBIt (#193433)
7f703cabf728 [MLIR][AsmParser] Fix non-deterministic SSA value completion order under LLVM_REVERSE_ITERATION (#192150)
39865a002e6b Revert "[lldb][test] Add support for building Wasm test inferiors" (#193493)
1b7de19f86ee [mlir][vector] Prevent masked transfer read/write identity folding (#192966)
90fa375b8c18 [X86][NFC] Reorganize fadd, fsub, fmul and fdiv selection tests (#193012)
b3319caafa4c [LLVM]Codegen][X86] Add vector ConstantInt/FP support to CollectConstantBits. (#193249)
3bb7c2c6f799 [MLIR][BUILD] Fix for c1cff89b (#193489)
4d18c1061dad [VPlan] Prefer checking opcode over underlying value (NFC) (#193463)
e43f3232f21f [VPlan] Permit licm-sinking recipes with no users (#189957)
58127f3ebffd [OpenMP] Fix OpenMP device subdir installation w/ multilibs (#193378)
62ae7e4786d9 [ASan][Windows] Fix memmove/memcpy interception on x64 (#192060)
64c9a758394c Revert "[LifetimeSafety] Add support for `new`/`delete`" (#193482)
2039a51881bb [libc++][ranges] P2164R9: Implements `views::enumerate` (#73617)
61e5c13fba2b [X86] Add baseline tests for #144231 (#193484)
d137e6601f1c [libc++] Remove apple_availability.h (#192851)
cd60aed5f8d7 [flang][OpenMP] Move directive deprecation check to semantic checks (#192796)
9673f1f8ce8b [SPIR-V] Handle [N x i8] byte addressing in SPIRVEmitIntrinsics (#192994)
8e1095ff01c1 [AArch64] Only prefer partial reductions if cost is lower. (#191369)
a166f0b2c7c4 [AMDGPU] performSraCombine - SRA(X,BW-1) - don't freeze HI operand for single (repeated) shift (#193468)
567583cbfee5 [Clang][SystemZ] Fix unwanted unsequenced volatile accesses in codegen tests (#190212)
b965d52bbbcc [LLVM][GlobalISel] Remove unnecessary comment (#193333)
5659f86af5ab [clang] Implement -fstrict-bool (#160790)
2a3639cd085f [DAG] computeKnownFPClass - add ISD::EXTRACT_VECTOR_ELT handling (#190307)
6fcc4d701da8 Revert "[Clang] Diagnose UB and emit error when identifier has both internal and external linkage" (#193462)
17be5a7dee8d [debugserver][NFCI] Factor out logic handling breakpoint packets (#192912)
5fc5c1120230 [clang-tidy][readability-identifier-length] Add a line count threshold (without std::transform_reduce) (#193276)
6d097d240dd6 [clang] Suppress glibc C11 extension warning in `c-index-test` (#193335)
b2f3532e9fb1 Revert "[Bazel] Fixes 8e56a89" (#193459)
c1cff89bdcea [mlir][GPU] Refactor GPUOps lowering (#188905)
b313bb714528 [Clang][AArch64] Lower NEON fcvtz{u/s} intrinsics into fpto{u/s}i.sat (#191365)
a5f7f4962751 [mlir][linalg] Fix crash when folding tensor.cast into unpack using static packed shape for inner tiles (#188000)
34a8d497d29a [Bazel] Fixes 8e56a89 (#193450)
1d9775f68440 [LV] Change VPLane::getAsRuntimeExpr to use constant 64-bit indices (#193206)
f5f107e01778 [LLVM][SelectionDAG] Reduce chances of a split VSETCC being rewidened. (#191438)
8f6598133b37 [clangd][test] Fix test failures when LLVM_WINDOWS_PREFER_FORWARD_SLASH is ON (#193160)
808af6fd987e [LLVM][ConstantFolding] Use correct type when flushing denormals. (#193254)
d368c5728fcd [DA] Remove unnecesasry SCEV negation operation (NFCI) (#193447)
582958c4337f Revert "[clang][ssaf][NFC] Rework how the Force linker anchors are defined and used" (#193451)
8fea7910b0b4 [CIR] Fix __builtin_clz/__builtin_ctz poison_zero to respect target (#192865)
c37764cc00f2 [IRBuilder][NFC] Add `CreateFAbs` helper (#193421)
8e56a89c8f28 [clang][ssaf][NFC] Rework how the Force linker anchors are defined and used (#189409)
3e7c207ebe35 [Flang][OpenMP] Fix DEFAULT(NONE) check for Cray pointers in nested OpenMP directives (#190764)
7a154470f51f [Flang] bug: preprocessor increases backslash to double backslash (#191512)
86b9775612f8 [Passes] Remove Os and Oz optimization pipelines (#191363)
34a917a53e51 [mlir][spirv] Add SPV_EXT_FP8 type support to SPIR-V TOSA ops (#193199)
d24ebe3f00b3 [Support] Add std::string overload for llvm::sys::path::native (#193228)
d3ee88b18e22 [lldb] Fix pexpect detection with LLDB_ENFORCE_STRICT_TEST_REQUIREMENTS (#193444)
2711d8a50a13 [X86] Recognise vectors with zeros in all upper elements to improve VMOVS*Z folding (#193263)
0da0163d3ace [SelectionDAG] Preserve poison in IS_FPCLASS folds (#193246)
4aecd0454574 [libcxx][test] Skip cas_non_power_of_2.pass.cpp in Picolibc build (#191415)
7af4283bc96c [NFC][SPIR-V] Use getScalarOrVectorComponent{Count,Type} instead of raw operand access (#193410)
9ec6788421ac [lldb] Add HTTPS tests for SymbolLocatorSymStore (#192274)
35480b22737f [SPIRV] Migrate NSDI emission from a machine pass to DebugHandlerBase (#191212)
22bb938f873f [CIR][AArch64] Lower NEON vminv intrinsics (#192901)
83b4c5cd71c8 [NVPTX] Add intrinsics for narrow-fp to bf16 conversions (#191376)
cd145989585e [AMDGPU][NFC] Eliminates the redundant code in the AMDGPUTargetMachine.cpp (#193169)
5ef29d1d8bb5 [ADT] Add predicate based match support to StringSwitch (#188046)
f4cc934dcbca [LV] NFCI: Create VPExpressions in transformToPartialReductions. (#182863)
9435160a040b [MLIR][NVVM] Update SM version requirements of Ops (#192257)
a6d14db61db4 [clang][bytecode] Fix DefaultInitExpr base pointer in IndirectFieldDecls (#193149)
1edcd7473fb1 [clangd] [Modules] Refactor cache to support duplicated module name (#193413)
a61de4b50c04 [mlir][spirv][nfc] Clean up FP8 and BF16 SPIR-V type tests (#193196)
bfd6ea0241ec [ELF] Improve allocateHeaders tests (#193419)
61f9516af963 merge main into amd-staging (#2283)
cfa67454e9d4 [clangd] Add go-to-definition support for fields in offsetof expressions (#192953)
efa0f22883b5 [RISCV][MC] Emit ISA mapping symbols on .option arch/rvc/norvc/pop (#193123)
fde2e27a017c [clang][modules] Fix false positive -Wweak-vtables in named modules (#193136)
bb762095af80 Reapply "[JTS][Passes] Enable JTS By Default" (#193409)
0f4fb3b2426e [NFC] [MC] Fixed rule-of-five for MCPseudoProbeDecoder class (#193181)
554edb289bb2 [ELF,test] Cover empty INCLUDE inside MEMORY { ... } (#193411)
ee06802dc4a8 [JTS] Correctly handle all zero profile values in VP metadata (#193402)
dee5769870ce [lldb] Fix potential TestAlwaysRunThreadNames flakiness (#193405)
bf7ced3bc901 [lldb] Remove ENABLE_THREADS from Makefiles that don't need it (#193363)
d0bb0c837876 [ELF,test] Convert INCLUDE tests to split-file pattern (#193403)
b3a5d146fc20 [lldb] Doxygenify comments in AppleObjCRuntimeV2 (NFC) (#193401)
eb4296f98ad7 [llvm-mc][AsmMatcherEmitter] Fix the minimum ConversionTable entry size (#191977)
a843c699cc89 Revert "[JTS][Passes] Enable JTS By Default" (#193399)
52914600bc77 [revPat] update revert_patches
c643aa496ba7 merge main into amd-staging
f4bf7297963b [lldb] Add exe_ctx to examples commands (#193347)
e02c089a857e Revert "[compiler-rt][asan] Add asan checks for __builtin_assume_dereferencable (#190871)"
3aeb3c191d36 merge main into amd-staging
a680361bcdaa [clang-tidy] Suggest materializing temporary ranges in readability-use-anyofallof (#185791)
b5f7bc55573c [Bazel] Fixes 3081d52 (#193376)
2531a6730ddd [clang][DebugInfo] Set linkage name for dynamic initializer/destructor debug info (#189794)
653030b4c627 [PDB] Refactor cache strategy for function symbol lookups (#188927)
523c26f9c2c7 [clang-tidy][NFC] Add a unittest for checking list.rst (#193134)
3081d52d8242 [MC][debug_frame] Fix a bug in MCDwarfFrameEmitter::emit() so that per-function CIE can be generated when CIEs are different (#192727)
506c3f17b450 [clang-tidy] Fix false positive in readability-convert-member-functions-to-static for const overloads (#191712)
f8ab74283f74 [runtimes] Enable Fortran only with explicit CMAKE_Fortran_COMPILER (#193332)
20c8f4ca8e50 merge main into amd-staging
820654dca4f8 [UnsafeBufferUsage][SSAF] Change -Wunsafe-buffer-usage API for SSAF-based analysis (#191934)
03bfba583532 [AMDGPU] comgr: add HotSwap MC/LLVM infrastructure (2/3) (#2202)
f2e4fcd134d1 [NFC][LowerTypeTests] Add AArch64 and X86 jump table tests with debug info (#192735) (#193358)
0903c314622a [Extractor] Use function return for the one and only output (#191824)
1cbd27f1ddb0 [ConstantFolding] Increase folding limit for vector loads to 128 bytes (#192775)
4d83691e290b [lldb] Fix a couple of return type / return value mismatches (#191464)
d87ac8716018 [offload] Fix synchronization when record replay is enabled (#193291)
404609d013f6 [runtimes] Enable Fortran only with explicit CMAKE_Fortran_COMPILER (#193332)
5892e34a9613 [BoundsSafety][NFC] Move LateParsedAttribute outside Parser class; move LateParsedAttrList to DeclSpec.h (#192145)
b48d8a54e29f Support loader arguments in GPU hermetic tests (#193341)
54c1b3050cb9 [libcxx][Github] Bump container version (#193351)
fc0f32196d27 [libc][NFC] Remove trailing whitespace from LLVMLibCTestRules.cmake (#193350)
7014ce846164 [bazel][lldb] Fix missing dep in ScriptedProcess (#193348)
2b63e5e75dba [LFI][libunwind] Avoid writing to reserved registers on the `aarch64_lfi` target (#192739)
daade8e56f82 [CI] Fix cross-project-tests dependencies (#193323)
d623ee801ce4 [SSAF] Fix -Wunused-variable (#193344)
d7d2c0ca6afd [bazel][clang] Fix build for #191932 (#193337)
e07f4b2d54f3 [lldb/DWARF] Support 5-component Swift version in DW_AT_producer (#193305)
d76111a9650d [libcxx][Github] Bump Github Runner to 2.334.0 (#193339)
9a6b93d5388d [X86] Regenerate vector shifts tests to reduce diffs in #188206 (#193325)
a1d0a0246161 [mlir][func] Avoid to create duplicate symbol during conversion (#192342)
2ed87bad8581 [clang-format] Add c++23 and 26 to the configuration (#193327)
e0b4a7063f78 [compiler-rt][profile] Use runtimes-libc-headers in the GPU runtimes build (#192814)
92958a0631c4 AMDGPU/GlobalISel: RegBankLegalize rules for DS barrier arrive atomics (#192767)
5ee4c51c1a9c [SSAF][Analyses] Add an AST visitor for the contribution model (#191933)
368ee151c774 [bazel][lldb] Add target for new plugin (#193316)
d4650168f0ea [lldb] Directly access object variable in GetObjectPointerValueObject (NFC) (#193120)
f29d0b4329d9 [CIR] Cache isSafeToConvert results to avoid redundant record layout … (#193122)
bffb208404fd [libunwind] Add SME detection for ZA test on OpenBSD / FreeBSD (#193148)
20ce456138fa [LegalizeTypes][RISCV] Don't widen expandload or compresstore with VP_LOAD/VP_STORE. (#193294)
6688763f44a3 [libc] Improve lit test discovery and execution (#192993)
27cc83383d19 [cross-project-tests] Add llvm-modextract as a dependency (#193296)
0bbd61a33960 [Github] Bump Github Runner Version to 2.334.0 (#193318)
99e4f6a05f59 [lldb] Add synthetic variable support to Get*VariableList. (#181501)
e7b103798d0d [CIR] add pairwise-addition-and-widen support (#191845)
b1175088531d [LifetimeSafety] Add support for `new`/`delete` (#192504)
7318bc734a5b [Clang][AMDGPU] Use unsigned for D# (#193310)
21dcb13f6d67 [flang][acc] Update fir.convert rematerialization handling (#193301)
346480e0abb5 [AArch64] Add more scalar testing for hadd patterns. NFC (#193313)
24464f6c2c36 [RISCV][GlobalISel] Lower G_MEMCPY_INLINE (#192671)
f9437779f447 [Attributor] Use trivial no side effects check in isAssumedSideEffectFree (#193303)
760bc124c189 Reapply "[JTS][Passes] Enable JTS By Default" (#193300)
f1baa17f7920 [libc] Add wcsxfrm (#191692)
a4538a3ad902 [NFC][SSAF] Extract common code in Analyses to a shared file (#191932)
11515959b571 [BOLT] Fix stream position before appendPadding in writeEHFrameHeader (#193126)
7bf48ec95c7c [AArch64] Remove xtn.ll test. NFC (#193306)
8bebc5847663 merge main into amd-staging (#2278)
ae9cb64645fd [VPlan] Clean up VPWidenPHIRecipe constructor (NFC). (#193297)
f1b65b96aa50 [PowerPC] Fix ADJCALLSTACKUP and ADJCALLSTACKDOWN def (#184696)
50241dcd08c8 [AMDGPU] Reland "Fixed verifier crash because of multiple live range components." (#190719) (#193286)
6a9ed459ce22 [JTS] Add a temporary workaround for multiple zero GUIDs (#193292)
b0fe500e7842 [lldb] Make global lookup in DIL configurable by caller (#192592)
bf8cf4b7b31b [compiler-rt][asan] Add asan checks for __builtin_assume_dereferencable (#190871)
1e2175ec4df7 [Clang] Diagnose UB and emit error when identifier has both internal and external linkage (#192116)
facb9ab34ffb [LV] Remove IV use restrictions for epilogue vectorization. (#190552)
0a8ed875075b [clang][deps] Simplify scanner VFS (#190843)
6997cc8c0f84 [docs] Add missing command line options to llvm-profgen.rst (#192890)
8cc12bcf730e [clang][bytecode][HLSL] Complete the HLSL aggregate splat and elementwise cast implementations, and enable the new constant interpreter on all HLSL tests with static asserts (#189126)
dd5632f51d3f [gn] port 40fcd2517a110 (#193293)
d4e2850a8412 Update [Github] Update GHA Dependencies (#176676)
99457c368586 [CIR] Upstream VectorType __builtin_astype (#192859)
9c2e67721adf [X86][GlobalISel] Ignore non-vregs in regbank mapping (#182880)
5299e00a73f7 [RISCV][TableGen] Use ArrayRef instead of vector&. NFC (#193267)
0a59b51a783a [RISCV] Add a getTargetStreamer helper to RISCVAsmPrinter. NFC (#193250)
bde6226911f6 [Github] Set persist-credentials in libclang-python-tests.yml (#193282)
4cdd9883943d [Clang] Allow VDBPSADBW intrinsics in constexpr (#188887)
bddd3d32bc64 [lld/mac] For catalyst outputs, tolerate implicitly linking against arm64e mac tbd files (#193065)
d794e04651f3 [Clang][PowerPC] Add DMF crypto builtins for extended mnemonics (#185961)
06b85c8bb2ed [SSAF][UnsafeBufferUsage] Make UnsafeBufferUsageExtractor a registered ASTConsumer (#191931)
849de61619cc [APINotes][unsafe-buffer-usage] Add [[clang::unsafe_buffer_usage]] support in APINotes (#189775)
0d45876e43a8 [ROCDL] Add dot intrinsics to rocdl (#193129)
40fcd2517a11 [lldb][test] Add support for building Wasm test inferiors (#192872)
8d21e4e692bc AMDGPU/GlobalISel: RegBankLegalize rules for BVH intersect ray (#192583)
febd3de07dba [clang] Get the directory identity from `ModuleCache` instead of `FileManager` (#193070)
c7eea85b8046 Revert "[llvm-cov] Fix error propagation in CoverageMapping::load() (… (#193266)
981a9e5acb49 AMDGPU/GlobalISel: RegBankLegalize rules for amdgcn_ballot (#193105)
5f33bbeb8375 [clang] Exclude trailing colons from param command names (#192598)
4f2d572349c4 [clang][NFC] Prevent scope pollution from repeat type specifiers (#193144)
10f8205c6f7e [SLP]Fix stale deps for operands of non-scheduled expanded-binop parents
6c5b4a73fb25 [lldb] Move GetTypeSystemFromCU to DILEval.cpp (NFC) (#193245)
655f38fad8bf [gn build] Port b799d7e8f8bc (#193262)
e86ed67d9f24 [LV] Improve code around all_of, any_of (NFC) (#193150)
57409d7558f1 [gn build] Port acc3f73113ab (#193261)
002b2dc6b2a0 [gn build] Port 9b8635f3247d (#193260)
adf12074a580 [gn build] Port 4acbf997891c (#193258)
93dce0bf4332 [VPlan] Strip null-check in partial-red casts (NFC) (#193162)
980ddce138a3 [CIR] Implement variably modified type parameter handling (#193072)
8226604fbbc9 [libc][math] Implement a code-size optimized version of powf. (#190984)
53a33fa6ec9a Add missing comment (#193050)
9576adbb96c7 [RISCV] Expand vp.frem (#193218)
9a1f716941b2 AMDGPU: ds.atomic.barrier.arrive.rtn.b64 is a source of divergence (#192765)
b8386954ac11 [flang][OpenACC] Limit hoisting out of compute operations. (#193099)
b799d7e8f8bc [libc++] Implement `std::constant_wrapper` (#191695)
e39e73c91ceb [IR] Avoid redundant TrackingMDRef reassignments and DebugLoc copies (#193018)
edd6797bd0f4 [LLVM][BUILD] Fix for #177158 (#193238)
e535fbfb9851 [clang] Add typed variants for C23 stdbit.h builtins (#192718)
3ee56ef44322 [runtimes] Ensure INSTALLed directory exists (#193243)
66cd11f73ef1 [bazel][mlir] Port a1dfc8d64e1faa752f020a8212782362b179416d (#193241)
e7103645c2e9 [compiler-rt] Implement __clear_cache for Hexagon (#188411)
406dc4e34008 [Hexagon] Handle FK_Data_8 fixups in ELF object writer (#192149)
2b76ec744d78 llvm/test: Fix incremental bots after revert of #190719 (#193234)
347dc1321ed5 Reapply "[SimplifyCFG] Reuse function comdat for switch lookup table" (#193229)
7a7d5936532a [LoopVectorize] Add minsize attribute to test (NFC) (#193223)
95a960114e92 [lldb] Skip memory region probing in FindSpace when process can't JIT (#193124)
665f5c0ee89d [lldb][docs] Add FOSDEM talks to the links page (#193015)
e23c053d25fa Fixed issue of use after move (#193175)
583f2949a841 [SLP] Normalize copyable operand order to group loads for better vectorization
e4afaa1fcfcc [RISCV] Further improved exact VLEN lowering for mul reductions (#192688)
d841a9383d5d [SPIR-V] Deduce argument types before doing GEP (#193046)
6a6f3b07e607 [SLP][NFC]Add a test with non-reordable operands of non-commutative copyables, NFC
216bccbee137 [SPIR-V] Handle constant expression uses of PushConstant globals (#193005)
1a0269771e0f [LIT] Add -nostdinc so system headers aren't searched with implicit module maps (#192125)
074965c22152 merge main into amd-staging
a2011b113acd [LV][RISCV][NFC] Update strided-accesses.ll to UTC version 6 (#193211)
e5925fb3a7fe [NFC][llvm-objdump] Use CHECK-NEXT in MachO tests (#192696)
dc73cabfa38c [X86][AVX10.2] Skip FP2I/I2FP customizations for bf16 (#193137)
8abcce069978 [LoopVectorize] Generate test checks (NFC) (#193216)
941e8ef04ebb [mlir][arith] Add support for `arith.flush_denormals` emulation (#192660)
1566b6344a2b [X86][clang-cl] Make AVX10.2 map to the same target-cpu as AVX10.1 (#193147)
9c2d9448238d [DAG] Reassociate (add (add X, Y), X) --> add(add(X, X), Y) (#162242)
1697b964ffcf [runtimes] Protect use of undefined CMAKE_Fortran_COMPILER (#193210)
d629a221707e [Polly] Disable PCH reuse for unit tests (#193209)
300285ed5f4f [CIR][NFCI] Remove 'isConstant' from getCIRLinkageForX (#193100)
f6f39c6fc172 [LV] Add test for interaction between interleaved and strided load. nfc (#192990)
a976a72c12c5 [AMDGPU] Multi dword spilling for unaligned tuples (#183701)
b7cfcfe03deb [llvm-cov] Fix error propagation in CoverageMapping::load() (#193197)
037a48aa4b9a [InstCombine] fold fabs(uitofp(i16 a) - uitofp(i16 b)) < 1.0 to a == b (#191378)
7134ce5d7215 Revert "[clang-tidy][NFC] add numeric include for transform_reduce" (#193200)
744279b9f173 [mlir][arith] Add `arith.flush_denormals` operation (#192641)
95c583697192 [AMDGPU] Add legalizer rule support for AMDGPU's regbank fminimumnum and fmaximumnum (#192719)
a1dfc8d64e1f [mlir] Add option to run CSE between greedy rewriter iterations (#193081)
ed34ee3a728d [mlir] Assert region is within config scope in RegionPatternRewriteDriver (#193177)
797fc5dde02d [AMDGPU] Prefer mul24 over mad24 on SDWA targets (#193033)
78cb9fbbb08f [DAG] Add Srl combine for extracting last element of BUILD_VECTOR (#181412)
3de6b5c685b3 [mlir][spirv] Fix Float8EXT type conversion legality (#192466)
044e21f04311 [SystemZ] Fix wrong mask for float vec_insert (#192967)
cf1f7c533425 [Attributor] Regenerate test checks (NFC) (#193192)
8e132f78bfb0 [runtimes][CMake] Move Fortran support code from flang-rt (#171610)
af5fb3870a00 [Attributor] Clarify volatile null pointer behavior (NFCI) (#193190)
47918c2c0a88 [CIR] Make array decay and get_element op perserve address spaces (#192361)
c2139f13606f Revert "[SLP] Normalize copyable operand order to group loads for better vectorization"
b3647eb0830f Revert "[clang-tidy][readability-identifier-length] Add a line count threshold" (#193182)
3600cd824d5a [AMDGPU] Unmark wave reduce intrinsics for constant folding (#193142)
853d7c9b2347 AMDGPU/GlobalISel: RegbankLegalize rules for merge-like opcodes (#193026)
fc7c25738635 [libc++] Fix any.cpp not compiling with the minimum header version >= 7 (#193183)
45db5e46b2ef [RISCV][NFC] Remove unused RISCVExtBit (#193153)
d1f4b79ec888 [LICM] Remove unnecessary check during store hoisting (#187529)
b460f296d6dd [RISCV] Remove codegen for vp_sqrt (#191837)
337ad44a3e58 [llvm] Errorize DebuginfodFetcher for inspection at call-sites (#191191)
9584e9c9b269 [LLVM][CodeGen][SVE] Implement custom lowering for insert_vector_elt_nxv1i1. (#192494)
a47551f22099 [lldb][windows] fix script interpreter file parsing (#193006)
a99dd8344ea9 [LV][NFC] Remove unnecessary extra passes from some tests (#193155)
b78a0a02c181 [AArch64][SelectionDAG] Improve codegen for and(sext(Op), splat(1)) (#192405)
bf24d742d942 [RISCV][NFC] Use IfDefEmitter in RISCVTargetDefEmitter (#193151)
5cc7956a2542 [RISCV] Remove codegen for vp_fadd, vp_fmul (#191842)
06e70f60c9cc [flang][debug] Handle USE statements inside modules (#186184)
49f159faa6b0 [clangd] [C++20] [Modules] Read module mappings from commands (#193158)
3db991b5c287 [clang-tidy][NFC] add numeric include for transform_reduce (#193165)
dac2cb9a5a73 [LLVM][BUILD] Fix for #192887 (#193167)
357d61fe48dd [MIR] Always print symbolic INLINEASM operands (#192991)
b550a5e09420 [libc] Fix riscv32 build after #192927 (#193152)
ab4283959fd1 [LICM] Remove unnecessary call arg in test (NFC) (#193159)
e7a2cf1243ba [mlir][SPIR-V] Lower boolean vector reductions (#192267)
b00e3a098681 [libc++] Fix numeric_limits::digits and digits10 for _BitInt(N) (#193002)
615678b37d73 [Coroutines] Add verifier checks for llvm.coro.begin and llvm.coro.id (#192887)
174783d157cf merge main into amd-staging (#2270)
68a27a07be36 AMDGPU/GlobalISel: RegbankLegalize rules for G_BITCAST (#193025)
5dab433d5bdc [lldb] Remove trailing newlines from AppendErrorWithFormat calls (#192965)
766607ca643e [lldb] Add EXPORT to lldb-tblgen (#192610)
59a9aa30e21e [LV] Add flag to always force a scalable VF when feasible. (#182467)
a35e21861f91 [lldb] Fix ambiguous call to create_directories in TraceIntelPTBundleSaver (#191967) (#192025)
547c3ad159fd AMDGPU/GlobalISel: RegbankLegalize rules for undef and constants (#193024)
3c88abe3206b [clang-tidy][readability-identifier-length] Add a line count threshold (#185319)
163d0b1b697c [ConstantMerge] don't merge constants with COMDAT…

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.


Comment thread mlir/lib/Dialect/Rock/Transforms/BlockwiseGemmToThreadwise.cpp Outdated
Comment thread mlir/lib/Dialect/Rock/Transforms/BlockwiseGemmToThreadwise.cpp
@stefankoncarevic stefankoncarevic force-pushed the dpp-refactor-blockwise-reduce branch 2 times, most recently from f1020ea to 289f2b3 Compare May 4, 2026 13:30
stefankoncarevic and others added 8 commits May 5, 2026 08:24
…per extraction

Restructure the blockwise reduce rewrite pattern in BlockwiseGemmToThreadwise.cpp
to improve clarity, maintainability, and enable DPP-based reductions via
gpu.SubgroupReduceOp.

Shuffle decision logic:
- Introduce has2DThreadLayout guard (mTidPerWave > 0 && nTidPerWave > 0) to
  clearly separate GEMM-style 2D thread layouts from general cases
- Path 1 (Shuffle+DPP): activates when blockSize > nrDimProduct and the
  per-thread subtile is [1,1] with rDim == 1, using gpu.shuffle to transpose
  data from WMMA/MFMA strided layout into contiguous DPP-compatible layout
- Path 2 (Serial XOR): activates when blockSize <= nrDimProduct, performing
  log2(rDim) XOR butterfly reduction steps within a wave at stride nTidPerWave
- Initial LDS store is deferred: only performed when neither shuffle path applies,
  avoiding unnecessary LDS traffic for shuffle-eligible configurations
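
As a rough illustration of the serial XOR butterfly described for Path 2, the following host-side C++ sketch simulates the lane exchange on a plain array. The function name `xorButterflySum` and the layout `lane = r * stride + n` are assumptions made for this example, not code from the PR; on the GPU each step would be a XOR-mode shuffle rather than an array lookup.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side simulation of an XOR butterfly sum across the lanes of one wave.
// Assumed layout (hypothetical, for illustration): lane = r * stride + n,
// with r in [0, rDim) and n in [0, stride); rDim and stride are powers of two.
std::vector<int> xorButterflySum(std::vector<int> lanes, std::size_t rDim,
                                 std::size_t stride) {
  assert(lanes.size() == rDim * stride);
  // log2(rDim) steps: the partner distance starts at `stride` and doubles.
  for (std::size_t dist = stride; dist < rDim * stride; dist *= 2) {
    std::vector<int> next(lanes.size());
    for (std::size_t lane = 0; lane < lanes.size(); ++lane) {
      // Each lane combines its value with the lane whose index differs in
      // exactly one bit of r.
      next[lane] = lanes[lane] + lanes[lane ^ dist];
    }
    lanes = next;
  }
  // Every lane now holds the sum over all r for its own n.
  return lanes;
}
```

With `rDim = 4` and `stride = 2`, the loop runs two steps (distances 2 and 4), and all lanes sharing the same `n` end up with the same total, so no LDS round-trip is needed in this model.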

Parallel reduction with DPP:
- Use gpu.SubgroupReduceOp with cluster_size for DPP-eligible reductions
  (power-of-2 active threads, cluster_size <= waveSize)
- Only the reduction group leader (rtid == 0) writes the result back to LDS,
  followed by a barrier and broadcast read
- Use bitwise AND/SHRU for thread ID decomposition (rtid, nrtid) on the DPP
  path and for power-of-2 non-reduction dimensions; fall back to DIV/REM
  for non-power-of-2 cases
- Force scalar accumulation (vectorLen = 1) during threadwise pre-reduction
  on the DPP path to ensure correct element-wise reduction before SubgroupReduceOp
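
The thread ID decomposition and the leader-only store can be sketched on the host as follows. This is a minimal model, assuming contiguous clusters and a sum reduction; `ThreadCoords`, `decompose`, and `clusteredReduceToLds` are hypothetical names, and the inner loop only stands in for what gpu.SubgroupReduceOp with cluster_size would compute per cluster.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct ThreadCoords {
  uint32_t rtid;  // position inside the reduction cluster
  uint32_t nrtid; // index of the cluster (non-reduction coordinate)
};

// Bitwise decomposition, valid only when clusterSize is a power of two.
// It matches tid % clusterSize and tid / clusterSize.
ThreadCoords decompose(uint32_t tid, uint32_t clusterSize) {
  assert(clusterSize != 0 && (clusterSize & (clusterSize - 1)) == 0);
  uint32_t shift = 0;
  while ((1u << shift) < clusterSize)
    ++shift; // shift == log2(clusterSize)
  return {tid & (clusterSize - 1), tid >> shift};
}

// Simulates a clustered sum followed by the leader-only write.
// `perThread` holds each thread's pre-reduced partial value; the returned
// vector has one slot per cluster and stands in for the LDS buffer.
std::vector<int> clusteredReduceToLds(const std::vector<int> &perThread,
                                      uint32_t clusterSize) {
  assert(perThread.size() % clusterSize == 0);
  std::vector<int> lds(perThread.size() / clusterSize, 0);
  for (uint32_t base = 0; base < perThread.size(); base += clusterSize) {
    int clusterSum = 0;
    for (uint32_t i = 0; i < clusterSize; ++i)
      clusterSum += perThread[base + i];
    // Only the thread with rtid == 0 stores the result for its cluster.
    ThreadCoords leader = decompose(base, clusterSize);
    assert(leader.rtid == 0);
    lds[leader.nrtid] = clusterSum;
  }
  return lds;
}
```

In this model, the broadcast step would simply be every thread reading `lds[nrtid]` after a barrier.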

Helper extraction:
- getPerWaveThreadCounts: promote to static member function; extracts m_tid and
  n_tid counts from the tid slice view Merge transform
- shuffleRearrangeForDPP: encapsulates the gpu.shuffle-based transposition from
  strided WMMA/MFMA layout to contiguous DPP layout
  (sourceLane = (lane % clusterSize) * stride + lane / clusterSize)
- readReducedResultsFromLDS: consolidates the repeated pattern of barrier +
  ThreadwiseReadInto from LDS into output registers (and optional extra output)
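
To make the sourceLane formula quoted for shuffleRearrangeForDPP concrete, here is a small host-side sketch. It assumes the wave holds exactly `clusterSize * stride` lanes, which is what makes the mapping a permutation; the `rearrange` helper is hypothetical and only models the effect of the shuffle on an array.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Lane that `lane` reads from, using the formula from the commit message.
uint32_t sourceLane(uint32_t lane, uint32_t clusterSize, uint32_t stride) {
  return (lane % clusterSize) * stride + lane / clusterSize;
}

// Models the shuffle: lane i of the result receives the value held by
// sourceLane(i). Assumes strided.size() == clusterSize * stride.
std::vector<int> rearrange(const std::vector<int> &strided,
                           uint32_t clusterSize, uint32_t stride) {
  assert(strided.size() == clusterSize * stride);
  std::vector<int> contiguous(strided.size());
  for (uint32_t lane = 0; lane < strided.size(); ++lane)
    contiguous[lane] = strided[sourceLane(lane, clusterSize, stride)];
  return contiguous;
}
```

For `clusterSize = 4` and `stride = 2`, the values that sat at lanes 0, 2, 4, 6 land in lanes 0 to 3, and those at lanes 1, 3, 5, 7 land in lanes 4 to 7, so each reduction group becomes contiguous.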

Tree reduction path:
- Retained as fallback for non-DPP-eligible configurations
  (non-power-of-2 thread counts or cluster_size > waveSize)
- Scope ceilPowerOf2 computation and treeMaxActiveThreads naming to this path
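
For comparison, the tree fallback can be modeled on the host as below. This is a sketch of a generic tree reduction with a ceilPowerOf2 starting offset and a bounds guard for non-power-of-2 counts; it is an assumption about the general shape, not a copy of the lowering. `treeReduceSum` and its `steps` out-parameter are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Smallest power of two that is >= n (n >= 1).
std::size_t ceilPowerOf2(std::size_t n) {
  std::size_t p = 1;
  while (p < n)
    p *= 2;
  return p;
}

// Tree sum over a buffer standing in for LDS. Each outer iteration
// corresponds to one barrier-separated round on the GPU.
int treeReduceSum(std::vector<int> lds, std::size_t *steps = nullptr) {
  assert(!lds.empty());
  std::size_t count = 0;
  for (std::size_t offset = ceilPowerOf2(lds.size()) / 2; offset > 0;
       offset /= 2) {
    for (std::size_t i = 0; i < offset; ++i) {
      // The guard skips partners that fall past the end when the element
      // count is not a power of two.
      if (i + offset < lds.size())
        lds[i] += lds[i + offset];
    }
    ++count;
  }
  if (steps)
    *steps = count;
  return lds[0];
}
```

Six elements take three rounds here (offsets 4, 2, 1), which is the log2 cost per reduction that the DPP path is meant to avoid for eligible configurations.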

New test: blockwise_reduce_dpp_cluster_sizes.mlir
- Integration test covering DPP reduction with cluster sizes 2, 4, 8, 16, 32, 64
- Validates both sum (rand=none, all ones) and max (rand=fixed) reductions
- All test configurations use blockSize <= waveSize to ensure single-wave
  execution on both RDNA (waveSize=32) and CDNA (waveSize=64)
- cluster_size=64 falls back to tree reduction on RDNA since 64 > waveSize=32
…ion kernels

Remove the shuffle+DPP transpose path and serial XOR butterfly reduction
from BlockwiseBroadcastReduceOp lowering. These paths used gpu.shuffle
to rearrange data between WMMA/MFMA strided layout and contiguous DPP
layout, adding complexity without consistent performance benefit.

The DPP reduction path now uses gpu::SubgroupReduceOp directly with
cluster_size, which handles cross-lane communication within a wavefront
without requiring explicit data rearrangement through shuffle or LDS.

Key changes:
- Remove shuffleRearrangeForDPP() and all shuffle optimization logic
  (canUseShuffleOptimization, canUseSerialShuffle, XOR butterfly)
- Restrict DPP activation to partial_r > 2, as configurations with
  partial_r = 2 do not benefit from DPP due to insufficient work to
  amortize the instruction overhead; these fall back to LDS-Tree
- Remove forced scalar vectorization for DPP threadwise reduction
- Simplify LDS store to be unconditional (no longer skipped by shuffle)
…rch DB for wave size

- Change canUseDPP condition from >= to == for blockSize vs
  clusterSize * nonReductionDimSizeProduct to prevent potential
  out-of-bounds LDS writes by extra threads when blockSize exceeds
  the exact thread count needed for the DPP layout.
- Replace hard-coded chipset major version heuristic in
  SubgroupReduceToDPP with rock::lookupArchInfo(chip).waveSize
  for more robust subgroup size derivation.
- Update lowering_blockwise_broadcast_reduce test to use dimensions
  where blockSize == clusterSize * nrDimProd (8 == 2 * 4).

With the escape for a trivial non-reduction dimension (nrDimProd == 1),
the DPP path could be taken when blockSize > maxActiveReductionThreads.
The extra threads would then have nrtid >= 1, outside the valid [0, 1)
range, and would compute out-of-bounds LDS coordinates. Tuning data
across f16/f32/int8 attention configs shows nrDimProd is always >= 16,
so this escape was never triggered, and removing it does not change
behavior for any current configuration.
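
The eligibility rule and the out-of-bounds concern can be checked with a small host-side sketch. The names `ReduceConfig`, `canUseDPPSketch`, and `countOutOfRangeThreads` are hypothetical; the conditions follow the PR description (power-of-2 cluster, more than one thread, partial_r > 2, cluster fits in a wave, blockSize == clusterSize * nrDimProd). `partialR` is carried as an opaque input because its exact definition is not spelled out here.

```cpp
#include <cassert>
#include <cstdint>

struct ReduceConfig {
  uint32_t blockSize;   // threads in the workgroup
  uint32_t clusterSize; // threads cooperating on one reduction
  uint32_t nrDimProd;   // product of the non-reduction dimension sizes
  uint32_t partialR;    // opaque input; only the "> 2" rule is modeled
  uint32_t waveSize;    // 32 on RDNA, 64 on CDNA
};

bool isPowerOf2(uint32_t v) { return v != 0 && (v & (v - 1)) == 0; }

// Sketch of the eligibility check with the exact-match (==) condition.
bool canUseDPPSketch(const ReduceConfig &c) {
  return isPowerOf2(c.clusterSize) && c.clusterSize > 1 && c.partialR > 2 &&
         c.clusterSize <= c.waveSize &&
         c.blockSize == c.clusterSize * c.nrDimProd;
}

// Counts threads whose nrtid = tid / clusterSize falls outside
// [0, nrDimProd), i.e. threads that would address LDS out of bounds.
uint32_t countOutOfRangeThreads(uint32_t blockSize, uint32_t clusterSize,
                                uint32_t nrDimProd) {
  assert(isPowerOf2(clusterSize));
  uint32_t count = 0;
  for (uint32_t tid = 0; tid < blockSize; ++tid)
    if (tid / clusterSize >= nrDimProd)
      ++count;
  return count;
}
```

With the updated test dimensions (8 == 2 * 4), no thread is out of range. Doubling blockSize to 16 leaves eight threads with nrtid >= 4, which is the situation a `>=` comparison would have allowed.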
@stefankoncarevic stefankoncarevic force-pushed the dpp-refactor-blockwise-reduce branch from 289f2b3 to d848c74 Compare May 5, 2026 12:27

2 participants