brgemm matmul: conditional use of destination type for bf16/f16 intermediate accumulation #5341
Draft
ankalinin wants to merge 4 commits into
added 3 commits
June 16, 2026 14:14
Add bf16/f16 conversion support in jit_brgemm_amx_uker store/beta paths. Enable ONEDNN_C_BUF_DST_DT environment variable usage in brgemm configuration.
Add cpu_accumulator_2d_batched_t template for bf16/f16/f32/s32 types. Extend brgemm_types.hpp with batched accumulator configuration fields.
Add c_buf_dst_dt configuration to bypass f32 C buffer and use dst dtype. Support both single-threaded (nthr_k<=1) and parallel K-reduction (nthr_k>1) with bf16/f16 batched accumulators.
make test disable os_win disable compiler_gnu9 disable build_vendor_amd disable benchdnn_all
Add helpers to compute effective C buffer footprint based on c_buf_dst_dt mode. Update cost model to account for dst dtype accumulation and K-division.
462a01b to 3da35d0
Summary
This PR introduces support for using bf16/f16 data types for the C buffer (accumulation buffer) in AMX-based matrix multiplication operations, controlled by the ONEDNN_C_BUF_DST_DT environment variable. This optimization enables direct writes to destination memory, eliminating or reducing scratch buffer overhead depending on the K-blocking strategy.
Motivation
Traditional AMX matmul implementations use f32 scratch buffers for intermediate accumulation results. Using a narrow data type (bf16/f16) that matches the destination type instead enables different optimizations depending on the execution scenario:
1. Single-threaded K, single K-chunk (nthr_k == 1, K fits in one chunk):
Eliminates accumulation buffers entirely - direct write to destination
Zero L2 residency cost for partial results
Zero RMW bandwidth overhead
Preserves wide-N parallelism on small shapes
Best case scenario
2. Single-threaded K, multiple K-chunks (nthr_k == 1, K split across chunks):
No separate accumulation buffer - uses destination D as accumulator
RMW to destination per K-chunk (read D, accumulate, write D back)
Saves buffer allocation vs f32 scratch
Per-chunk RMW cost amortizes over larger k_blk, favoring fewer K-chunks
3. K-parallel execution (nthr_k > 1):
Requires separate partial buffers per K-thread for parallel accumulation
Reduces partial-buffer memory traffic by ~50% relative to f32 (2 bytes vs 4 bytes per element)
Smaller L2 footprint for resident partial results
Final reduction combines partial results into destination
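
The three scenarios above can be sketched as a small classification helper plus a footprint estimate. This is an illustrative sketch only: `c_buf_mode_t`, `classify_c_buf_mode`, and `extra_c_buf_bytes` are hypothetical names that do not come from the PR, and the assumption that the K-parallel case allocates one full M x N partial buffer per K-thread is a simplification that may not match the actual implementation.

```cpp
#include <cstddef>

// Hypothetical names for illustration; not taken from the PR.
enum class c_buf_mode_t {
    direct_write,    // scenario 1: nthr_k == 1, K fits in one chunk
    dst_rmw,         // scenario 2: nthr_k == 1, K split across chunks
    partial_buffers  // scenario 3: nthr_k > 1
};

// Pick the scenario from the K-thread count and the number of K-chunks.
inline c_buf_mode_t classify_c_buf_mode(int nthr_k, int k_chunks) {
    if (nthr_k > 1) return c_buf_mode_t::partial_buffers;
    return k_chunks <= 1 ? c_buf_mode_t::direct_write : c_buf_mode_t::dst_rmw;
}

// Bytes of accumulation storage needed in addition to the destination.
// acc_elem_size is 2 for bf16/f16 and 4 for f32.
// Assumption: one M x N partial buffer per K-thread in the K-parallel case.
inline std::size_t extra_c_buf_bytes(c_buf_mode_t mode, std::size_t M,
        std::size_t N, int nthr_k, std::size_t acc_elem_size) {
    if (mode != c_buf_mode_t::partial_buffers) return 0;
    return static_cast<std::size_t>(nthr_k) * M * N * acc_elem_size;
}
```

Under these assumptions, a 64 x 64 output with `nthr_k = 4` needs 32768 bytes of bf16 partials versus 65536 bytes of f32 partials, which is where the 2-byte vs 4-byte saving in scenario 3 comes from.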
Changes
Commit 1: x64: brgemm: add support for narrow dt_c (bf16/f16) in AMX kernels
Extends AMX BRGEMM kernels to support bf16/f16 accumulation data types
Adds kernel logic to handle read-modify-write operations with narrow accumulation types
Enables direct destination writes when no intermediate buffer is needed
Foundation for bypass mode optimization
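
As a scalar reference for what a read-modify-write store with a narrow accumulation type does, consider the sketch below. It is not the JIT kernel code: `store_c_bf16` and the conversion helpers are hypothetical, bf16 values are kept as raw 16-bit patterns, and NaN/Inf handling is omitted. The `accumulate` flag plays the role of the beta path: when set, the existing bf16 value is read, widened to f32, added to the new f32 result, and rounded back.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Widen a raw bf16 bit pattern to f32 (bf16 is the upper half of an f32).
inline float bf16_to_f32(std::uint16_t v) {
    std::uint32_t u = static_cast<std::uint32_t>(v) << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}

// Narrow f32 to bf16 with round-to-nearest-even.
// Simplification: NaN and Inf are not treated specially.
inline std::uint16_t f32_to_bf16(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    u += 0x7FFFu + ((u >> 16) & 1u);
    return static_cast<std::uint16_t>(u >> 16);
}

// Hypothetical scalar model of the store step for one K-chunk.
// acc holds the f32 results of the chunk; dst holds bf16 bit patterns
// and must have at least acc.size() elements.
inline void store_c_bf16(std::vector<std::uint16_t> &dst,
        const std::vector<float> &acc, bool accumulate) {
    for (std::size_t i = 0; i < acc.size(); ++i) {
        float v = acc[i];
        if (accumulate) v += bf16_to_f32(dst[i]); // read D, accumulate
        dst[i] = f32_to_bf16(v);                  // write D back
    }
}
```

In this model the first K-chunk would call `store_c_bf16(dst, acc, false)` and each later chunk would pass `true`. Every call rounds to bf16 once, so fewer K-chunks means fewer rounding steps as well as fewer passes over the destination.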
Commit 2: x64: cpu_reducer: add bf16/f16 batched accumulator support
Implements reduction operations for bf16/f16 data types in parallel K-split scenarios
Enables efficient accumulation across multiple K-thread partial results
Supports batched reduction for improved performance on multi-threaded workloads
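
A minimal scalar sketch of reducing per-thread bf16 partials into the destination is shown below. It does not reproduce `cpu_accumulator_2d_batched_t`; `reduce_bf16_partials` and the helpers are hypothetical. The sketch assumes the partial buffers are stored back to back, sums them in f32, and rounds to bf16 once at the end, which may differ from how the PR orders its conversions.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Widen a raw bf16 bit pattern to f32.
inline float widen_bf16(std::uint16_t v) {
    std::uint32_t u = static_cast<std::uint32_t>(v) << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}

// Narrow f32 to bf16 with round-to-nearest-even (NaN/Inf not handled).
inline std::uint16_t narrow_bf16(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    u += 0x7FFFu + ((u >> 16) & 1u);
    return static_cast<std::uint16_t>(u >> 16);
}

// Hypothetical reduction over nthr_k partial buffers of n elements each.
// Layout assumption: partials[t * n + i] is element i of K-thread t.
inline std::vector<std::uint16_t> reduce_bf16_partials(
        const std::vector<std::uint16_t> &partials, std::size_t nthr_k,
        std::size_t n) {
    std::vector<std::uint16_t> dst(n);
    for (std::size_t i = 0; i < n; ++i) {
        float sum = 0.0f; // accumulate in f32, round once at the end
        for (std::size_t t = 0; t < nthr_k; ++t)
            sum += widen_bf16(partials[t * n + i]);
        dst[i] = narrow_bf16(sum);
    }
    return dst;
}
```

Summing in f32 keeps the reduction itself from adding a rounding step per K-thread; the partials have already been rounded to bf16 once when each thread stored them.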
Commit 3: x64: matmul: enable bf16/f16 accumulation controlled by ONEDNN_C_BUF_DST_DT
Enables the narrow C buffer feature in the matmul primitive
Adds runtime checks and configuration for bf16/f16 accumulation mode
Integrates kernel and reducer support into the matmul execution path
Implements bypass mode detection logic
Commit 4: x64: matmul: add AMX blocking heuristics for ONEDNN_C_BUF_DST_DT
Updates the blocking heuristics to account for accumulation in a narrow data type (bf16/f16)
Notes / Open Questions
The blocking heuristics updated in commit 4 are tentative and warrant careful review