brgemm matmul: conditional use of destination type for bf16/f16 intermediate accumulation by ankalinin · Pull Request #5341 · uxlfoundation/oneDNN

ankalinin · 2026-06-16T21:23:16Z

Summary

This PR introduces support for using bf16/f16 data types for the C buffer (accumulation buffer) in AMX-based matrix multiplication operations, controlled by the ONEDNN_C_BUF_DST_DT environment variable. This optimization enables direct writes to destination memory, eliminating or reducing scratch buffer overhead depending on the K-blocking strategy.

Motivation

Traditional AMX matmul implementations use f32 scratch buffers for intermediate accumulation results. By using narrow data types (bf16/f16) matching the destination type, we unlock significant optimizations across different execution scenarios:

1. Single-threaded K, single K-chunk (nthr_k == 1, K fits in one chunk):
Eliminates accumulation buffers entirely - direct write to destination
Zero L2 residency cost for partial results
Zero RMW bandwidth overhead
Preserves wide-N parallelism on small shapes
Best case scenario

2. Single-threaded K, multiple K-chunks (nthr_k == 1, K split across chunks):
No separate accumulation buffer - uses destination D as accumulator
RMW to destination per K-chunk (read D, accumulate, write D back)
Saves buffer allocation vs f32 scratch
Per-chunk RMW cost amortizes over larger k_blk, favoring fewer K-chunks

3. K-parallel execution (nthr_k > 1):
Requires separate partial buffers per K-thread for parallel accumulation
Reduces partial-result memory traffic by ~50% relative to f32 partials (2 bytes vs 4 bytes per element)
Smaller L2 footprint for resident partial results
Final reduction combines partial results into destination
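The three scenarios above can be read as a small decision function. The following is only a sketch of that classification, not code from this PR: the enum, the function names, and the k_chunks parameter are hypothetical, and the assumption that scenario 3 keeps exactly one partial buffer per K-thread is taken from the wording of the description rather than from the implementation.

```cpp
#include <cstddef>

// Hypothetical names for illustration; none of them come from oneDNN.
enum class c_buf_mode_t {
    direct_to_dst,   // scenario 1: no accumulation buffer at all
    rmw_on_dst,      // scenario 2: destination doubles as the accumulator
    partial_buffers  // scenario 3: separate partial buffers per K-thread
};

// Classify a configuration by the two quantities the description uses:
// the number of threads splitting K (nthr_k) and the number of K-chunks
// each thread walks through (k_chunks, an assumed parameter).
inline c_buf_mode_t classify(int nthr_k, int k_chunks) {
    if (nthr_k > 1) return c_buf_mode_t::partial_buffers;
    return k_chunks <= 1 ? c_buf_mode_t::direct_to_dst
                         : c_buf_mode_t::rmw_on_dst;
}

// Number of extra C-sized buffers implied by each mode, assuming one
// partial buffer per K-thread in the K-parallel case.
inline std::size_t extra_buffers(c_buf_mode_t mode, int nthr_k) {
    return mode == c_buf_mode_t::partial_buffers
            ? static_cast<std::size_t>(nthr_k)
            : 0;
}
```

Under these assumptions, only the K-parallel case allocates anything beyond the destination itself; the two single-threaded-K cases differ in how often the destination is touched, not in how much memory is reserved.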

Changes

Commit 1: x64: brgemm: add support for narrow dt_c (bf16/f16) in AMX kernels
Extends AMX BRGEMM kernels to support bf16/f16 accumulation data types
Adds kernel logic to handle read-modify-write operations with narrow accumulation types
Enables direct destination writes when no intermediate buffer is needed
Foundation for bypass mode optimization
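As a scalar illustration of the read-modify-write path with a narrow accumulation type, the sketch below updates a bf16 destination with the f32 result of one K-chunk. It is not the JIT kernel from this commit: bf16_t, the conversion helpers, and rmw_accumulate are hypothetical names, bf16 is stored as raw 16-bit values, and round-to-nearest-even is assumed for the down-conversion.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

using bf16_t = std::uint16_t; // raw bf16 bits (upper half of an f32)

// Widen bf16 to f32 by placing the 16 bits in the upper half of the word.
inline float bf16_to_f32(bf16_t v) {
    std::uint32_t bits = static_cast<std::uint32_t>(v) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Narrow f32 to bf16 with round-to-nearest-even; NaNs are kept quiet.
inline bf16_t f32_to_bf16(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    if ((bits & 0x7fffffffu) > 0x7f800000u)
        return static_cast<bf16_t>((bits >> 16) | 0x0040u);
    bits += 0x7fffu + ((bits >> 16) & 1u);
    return static_cast<bf16_t>(bits >> 16);
}

// One K-chunk update: read D, add the chunk's f32 result, write D back.
// Every call rounds the running sum to bf16 precision once.
inline void rmw_accumulate(bf16_t *dst, const float *chunk_acc, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = f32_to_bf16(bf16_to_f32(dst[i]) + chunk_acc[i]);
}
```

In this sketch the destination is read and written once per K-chunk, so the number of chunks determines both the RMW traffic and the number of times the running sum is rounded to the narrow type.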

Commit 2: x64: cpu_reducer: add bf16/f16 batched accumulator support
Implements reduction operations for bf16/f16 data types in parallel K-split scenarios
Enables efficient accumulation across multiple K-thread partial results
Supports batched reduction for improved performance on multi-threaded workloads
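The reduction across K-thread partials can be pictured with the scalar sketch below. It is an assumption-laden illustration rather than the cpu_accumulator_2d_batched_t implementation: the names are hypothetical, bf16 is stored as raw 16-bit values, and the choice to sum in f32 and round once at the end is an assumption about how such a reducer could preserve accuracy, not something the description states.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using bf16_t = std::uint16_t; // raw bf16 bits (upper half of an f32)

inline float bf16_to_f32(bf16_t v) {
    std::uint32_t bits = static_cast<std::uint32_t>(v) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Round-to-nearest-even narrowing; NaNs are kept quiet.
inline bf16_t f32_to_bf16(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    if ((bits & 0x7fffffffu) > 0x7f800000u)
        return static_cast<bf16_t>((bits >> 16) | 0x0040u);
    bits += 0x7fffu + ((bits >> 16) & 1u);
    return static_cast<bf16_t>(bits >> 16);
}

// Combine one bf16 partial tile per K-thread into the destination.
// Each element is summed in f32 and rounded to bf16 once (assumed policy).
inline void reduce_partials(const std::vector<std::vector<bf16_t>> &partials,
        bf16_t *dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (const auto &p : partials)
            acc += bf16_to_f32(p[i]);
        dst[i] = f32_to_bf16(acc);
    }
}
```

Each partial here holds 2 bytes per element, so the reduction reads half as many bytes per K-thread as it would with f32 partials.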

Commit 3: x64: matmul: enable bf16/f16 accumulation controlled by ONEDNN_C_BUF_DST_DT
Enables the narrow C buffer feature in the matmul primitive
Adds runtime checks and configuration for bf16/f16 accumulation mode
Integrates kernel and reducer support into the matmul execution path
Implements bypass mode detection logic

Commit 4: x64: matmul: add AMX blocking heuristics for ONEDNN_C_BUF_DST_DT
Updates the blocking heuristics to account for accumulation in the narrow (bf16/f16) data type
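The commit message for this change mentions helpers that compute an effective C buffer footprint depending on the c_buf_dst_dt mode. A rough sketch of what such a helper might compute is given below. Everything in it is hypothetical: the struct, its fields, and the assumption that the f32 baseline always keeps one f32 tile per K-thread are illustrative choices, not the heuristics under review.

```cpp
#include <cstddef>

// Hypothetical blocking description; field names are illustrative only.
struct blocking_t {
    std::size_t m_blk; // rows of the C tile handled at once
    std::size_t n_blk; // columns of the C tile handled at once
    int nthr_k;        // threads splitting the K dimension
};

// Bytes of accumulation storage for one C tile.
// Assumed baseline: one f32 tile per K-thread.
// Assumed narrow mode: no buffer when nthr_k <= 1 (destination is used),
// otherwise one tile per K-thread in the destination data type.
inline std::size_t c_buf_footprint(const blocking_t &b, bool c_buf_dst_dt,
        std::size_t dst_elem_size) {
    const std::size_t tile = b.m_blk * b.n_blk;
    const std::size_t nthr = b.nthr_k > 1 ? static_cast<std::size_t>(b.nthr_k) : 1;
    if (!c_buf_dst_dt) return tile * nthr * sizeof(float);
    if (b.nthr_k <= 1) return 0;
    return tile * nthr * dst_elem_size;
}
```

A cost model built on such a helper could, for example, allow larger tiles when the footprint shrinks, but whether and how the actual heuristics do so is left to the review of this commit.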

Notes / Open Questions
The blocking heuristics updated in commit 4 are tentative and warrant careful review

ankalinin · 2026-06-16T22:01:56Z

make test
set test_scope=NIGHTLY

disable os_win

disable compiler_gnu9
disable compiler_icx-previous
disable compiler_icx-oss
disable compiler_vs2022

disable build_vendor_amd
disable build_vendor_nvidia
disable build_graph
disable build_cpu_runtime_tbb
disable build_cpu_runtime_sycl
disable build_gpu_runtime_sycl
disable build_gpu_runtime_ocl
disable build_mode_no_cpu
disable test_device_gpu
disable test_experimental_bnorm_sop

disable benchdnn_all
enable benchdnn_matmul
enable benchdnn_lstm
enable benchdnn_rnn

akalinin added 3 commits June 16, 2026 14:14

x64: brgemm: add support for narrow dt_c (bf16/f16) in AMX kernels

8be8c1c

Add bf16/f16 conversion support in jit_brgemm_amx_uker store/beta paths. Enable ONEDNN_C_BUF_DST_DT environment variable usage in brgemm configuration.

x64: cpu_reducer: add bf16/f16 batched accumulator support

91dc44d

Add cpu_accumulator_2d_batched_t template for bf16/f16/f32/s32 types. Extend brgemm_types.hpp with batched accumulator configuration fields.

x64: matmul: enable ONEDNN_C_BUF_DST_DT for bf16/f16 accumulation

fc8676f

Add c_buf_dst_dt configuration to bypass f32 C buffer and use dst dtype. Support both single-threaded (nthr_k<=1) and parallel K-reduction (nthr_k>1) with bf16/f16 batched accumulators.

github-actions bot added the platform:cpu-x64 label (Intel64/AMD64 processors; codeowner: @oneapi-src/onednn-cpu-x64) on Jun 16, 2026

x64: matmul: add AMX blocking heuristics for ONEDNN_C_BUF_DST_DT

3da35d0

Add helpers to compute effective C buffer footprint based on c_buf_dst_dt mode. Update cost model to account for dst dtype accumulation and K-division.

ankalinin force-pushed the akalinin/brgemm_matmul_low_dt_C_pr branch from 462a01b to 3da35d0 on June 17, 2026 01:14

ankalinin wants to merge 4 commits into main from akalinin/brgemm_matmul_low_dt_C_pr
