
brgemm matmul: conditional use of destination type for bf16/f16 intermediate accumulation#5341

Draft
ankalinin wants to merge 4 commits into main from akalinin/brgemm_matmul_low_dt_C_pr


@ankalinin ankalinin commented Jun 16, 2026


Summary

This PR introduces support for using bf16/f16 data types for the C buffer (accumulation buffer) in AMX-based matrix multiplication operations, controlled by the ONEDNN_C_BUF_DST_DT environment variable. This optimization enables direct writes to destination memory, eliminating or reducing scratch buffer overhead depending on the K-blocking strategy.

Motivation

Traditional AMX matmul implementations use f32 scratch buffers for intermediate accumulation results. By using narrow data types (bf16/f16) matching the destination type, we unlock significant optimizations across different execution scenarios:

1. Single-threaded K, single K-chunk (nthr_k == 1, K fits in one chunk):
Eliminates accumulation buffers entirely - direct write to destination
Zero L2 residency cost for partial results
Zero read-modify-write (RMW) bandwidth overhead
Preserves wide-N parallelism on small shapes
Best case scenario

2. Single-threaded K, multiple K-chunks (nthr_k == 1, K split across chunks):
No separate accumulation buffer - uses destination D as accumulator
RMW to destination per K-chunk (read D, accumulate, write D back)
Saves buffer allocation vs f32 scratch
Per-chunk RMW cost amortizes over larger k_blk, favoring fewer K-chunks

3. K-parallel execution (nthr_k > 1):
Requires separate partial buffers per K-thread for parallel accumulation
Roughly halves partial-buffer memory traffic relative to f32 (2 bytes vs 4 bytes per element)
Smaller L2 footprint for resident partial results
Final reduction combines partial results into destination
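
The three scenarios above can be summarized as a small decision function. This is a minimal sketch, not the PR's code: the enum, the function name, and the `k_chunk_size` parameter are hypothetical and only mirror the `nthr_k` and K-chunk conditions listed in the Motivation section.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration; these are not oneDNN identifiers.
enum class c_buf_mode_t {
    direct_write,        // scenario 1: no accumulation buffer at all
    dst_as_accumulator,  // scenario 2: RMW on destination per K-chunk
    per_thread_partials  // scenario 3: one narrow partial buffer per K-thread
};

// Classifies a problem into one of the three scenarios, assuming the
// narrow C buffer mode is already enabled.
inline c_buf_mode_t classify_c_buf_mode(
        int nthr_k, std::int64_t K, std::int64_t k_chunk_size) {
    assert(nthr_k >= 1 && K > 0 && k_chunk_size > 0);
    if (nthr_k > 1) return c_buf_mode_t::per_thread_partials;
    // Number of K-chunks, rounded up.
    const std::int64_t k_chunks = (K + k_chunk_size - 1) / k_chunk_size;
    return k_chunks == 1 ? c_buf_mode_t::direct_write
                         : c_buf_mode_t::dst_as_accumulator;
}
```

For example, `classify_c_buf_mode(1, 512, 512)` falls into the direct-write case, while the same K with `k_chunk_size = 128` uses the destination as the accumulator.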

Changes

Commit 1: x64: brgemm: add support for narrow dt_c (bf16/f16) in AMX kernels
Extends AMX BRGEMM kernels to support bf16/f16 accumulation data types
Adds kernel logic to handle read-modify-write operations with narrow accumulation types
Enables direct destination writes when no intermediate buffer is needed
Foundation for bypass mode optimization

Commit 2: x64: cpu_reducer: add bf16/f16 batched accumulator support
Implements reduction operations for bf16/f16 data types in parallel K-split scenarios
Enables efficient accumulation across multiple K-thread partial results
Supports batched reduction for improved performance on multi-threaded workloads
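
To illustrate what a reduction over bf16 partial buffers involves, here is a scalar sketch. It is not the `cpu_accumulator_2d_batched_t` implementation from this PR: the function names, the contiguous buffer layout, summing in f32 before a single final rounding, and round-to-nearest-even are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// bf16 held as raw bits: the upper 16 bits of an IEEE-754 binary32 value.
using bf16_bits_t = std::uint16_t;

inline float bf16_to_f32(bf16_bits_t v) {
    const std::uint32_t bits = static_cast<std::uint32_t>(v) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Converts with round-to-nearest-even (an assumption for this sketch).
inline bf16_bits_t f32_to_bf16(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    // Keep NaNs as quiet NaNs instead of rounding them.
    if ((bits & 0x7fffffffu) > 0x7f800000u)
        return static_cast<bf16_bits_t>((bits >> 16) | 0x0040u);
    const std::uint32_t rounding_bias = 0x7fffu + ((bits >> 16) & 1u);
    return static_cast<bf16_bits_t>((bits + rounding_bias) >> 16);
}

// Sums nthr_k partial buffers of n elements each (assumed to be stored
// back to back) and writes the rounded result to dst.
inline void reduce_partials_bf16(const bf16_bits_t *partials, int nthr_k,
        std::size_t n, bf16_bits_t *dst) {
    assert(partials != nullptr && dst != nullptr && nthr_k >= 1);
    for (std::size_t i = 0; i < n; ++i) {
        float acc = 0.f; // accumulate in f32, round once at the end
        for (int t = 0; t < nthr_k; ++t)
            acc += bf16_to_f32(partials[static_cast<std::size_t>(t) * n + i]);
        dst[i] = f32_to_bf16(acc);
    }
}
```

Each partial has already been rounded to bf16 once, so the result can differ slightly from an all-f32 accumulation; summing in f32 here only avoids adding further rounding steps during the reduction itself.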

Commit 3: x64: matmul: enable bf16/f16 accumulation controlled by ONEDNN_C_BUF_DST_DT
Enables the narrow C buffer feature in the matmul primitive
Adds runtime checks and configuration for bf16/f16 accumulation mode
Integrates kernel and reducer support into the matmul execution path
Implements bypass mode detection logic
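
A rough sketch of how an environment-variable gate of this kind could look is shown below. Only the variable name `ONEDNN_C_BUF_DST_DT` comes from this PR; the accepted values (any non-empty string other than "0" enables the mode), the `data_type_t` enum, and the helper names are assumptions for illustration and do not reflect the actual oneDNN code.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical data type enum for this example only.
enum class data_type_t { f32, bf16, f16, s8 };

// Assumed convention: unset, empty, or "0" means disabled.
inline bool parse_flag(const char *value) {
    return value != nullptr && value[0] != '\0'
            && std::strcmp(value, "0") != 0;
}

// Reads the environment variable named in this PR.
inline bool c_buf_dst_dt_requested() {
    return parse_flag(std::getenv("ONEDNN_C_BUF_DST_DT"));
}

// Picks the C buffer data type: the destination type only when the mode
// is requested, the kernel is AMX-based, and the destination is bf16/f16;
// otherwise the f32 default.
inline data_type_t choose_c_buf_dt(
        bool requested, bool is_amx, data_type_t dst_dt) {
    const bool narrow_dst
            = dst_dt == data_type_t::bf16 || dst_dt == data_type_t::f16;
    return (requested && is_amx && narrow_dst) ? dst_dt : data_type_t::f32;
}
```

Separating the parsing from the `std::getenv` call keeps the decision logic testable without touching the process environment.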

Commit 4: x64: matmul: add AMX blocking heuristics for ONEDNN_C_BUF_DST_DT
Updates the blocking heuristics to account for narrow data type (bf16/f16) accumulation
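
The kind of footprint estimate a blocking heuristic might use can be sketched as follows. The struct, the function name, and the buffer-count rules are hypothetical: in particular, counting one buffer per K-thread in the f32 baseline is an assumption, and the actual helpers in commit 4 may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for this example.
struct c_buf_footprint_t {
    std::int64_t n_buffers;      // separate accumulation buffers needed
    std::int64_t bytes_per_elem; // 2 for bf16/f16, 4 for f32
    std::int64_t total_bytes;    // n_buffers * m * n * bytes_per_elem
};

// Estimates the accumulation buffer footprint for an m x n block of C.
// Assumptions: with the narrow mode and nthr_k == 1 the destination itself
// is the accumulator (no extra buffer); otherwise one buffer per K-thread.
inline c_buf_footprint_t effective_c_buf_footprint(
        std::int64_t m, std::int64_t n, int nthr_k, bool use_dst_dt) {
    assert(m > 0 && n > 0 && nthr_k >= 1);
    const std::int64_t bytes_per_elem = use_dst_dt ? 2 : 4;
    const std::int64_t n_buffers
            = (use_dst_dt && nthr_k == 1) ? 0 : nthr_k;
    return {n_buffers, bytes_per_elem, n_buffers * m * n * bytes_per_elem};
}
```

Under these assumptions, a 64 x 64 block with `nthr_k = 4` needs 65536 bytes of f32 partials but 32768 bytes in the narrow mode, which is the 2-byte versus 4-byte halving described in the Motivation section.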

Notes / Open Questions
The blocking heuristics updated in commit 4 are tentative and warrant careful review

akalinin added 3 commits June 16, 2026 14:14
Add bf16/f16 conversion support in jit_brgemm_amx_uker store/beta paths.
Enable ONEDNN_C_BUF_DST_DT environment variable usage in brgemm configuration.
Add cpu_accumulator_2d_batched_t template for bf16/f16/f32/s32 types.
Extend brgemm_types.hpp with batched accumulator configuration fields.
Add c_buf_dst_dt configuration to bypass f32 C buffer and use dst dtype.
Support both single-threaded (nthr_k<=1) and parallel K-reduction (nthr_k>1)
with bf16/f16 batched accumulators.
@github-actions github-actions Bot added the platform:cpu-x64 Intel64/AMD64 processors. Codeowner: @oneapi-src/onednn-cpu-x64 label Jun 16, 2026
@ankalinin


make test
set test_scope=NIGHTLY

disable os_win

disable compiler_gnu9
disable compiler_icx-previous
disable compiler_icx-oss
disable compiler_vs2022

disable build_vendor_amd
disable build_vendor_nvidia
disable build_graph
disable build_cpu_runtime_tbb
disable build_cpu_runtime_sycl
disable build_gpu_runtime_sycl
disable build_gpu_runtime_ocl
disable build_mode_no_cpu
disable test_device_gpu
disable test_experimental_bnorm_sop

disable benchdnn_all
enable benchdnn_matmul
enable benchdnn_lstm
enable benchdnn_rnn

Add helpers to compute effective C buffer footprint based on c_buf_dst_dt mode.
Update cost model to account for dst dtype accumulation and K-division.
@ankalinin ankalinin force-pushed the akalinin/brgemm_matmul_low_dt_C_pr branch from 462a01b to 3da35d0 Compare June 17, 2026 01:14