Summary
On AArch64 hosts where the L1 data-cache size is not exposed (VMs, containers, build chroots, and some Neoverse parts whose /sys/devices/system/cpu/cpu*/cache/ lists no L1d size), platform::get_per_core_cache_size(1) returns 0. That 0 flows into two consumers and triggers two distinct, deterministic failures:
- Reorder (silent data corruption): jit_uni_reorder drops a block of data for padded blocked-weight f32 reorders. 12 reorder_simple_test_t_f32_f32 / PaddedWeights_CPU cases fail; real weight reorders are silently wrong.
- Concat (SIGBUS): test_graph_unit_dnnl_concat_cpu crashes (test_concat_execute_subgraph_int8) via a heap buffer overflow.
Both pass on x86 and on AArch64 hosts that do expose L1d size.
Affected
v3.12.1 and current main. Introduced by commit f90453a ("cpu: platform: aarch64: query and calculate cache_per_core"), which replaced the fixed guess() with a real query; that query returns 0 when the per-level size is unavailable. Confirmed by git-bisect (good v3.9.1, bad v3.12.1) on a Neoverse-N1 host with no L1d size in /sys.
Mechanism
platform.cpp, AArch64 branch:
return aarch64::cpu().getDataCacheSize(cache_level)
/ aarch64::cpu().getCoresSharingDataCache(cache_level); // 0 (or div-by-zero) if unavailable
- Reorder (jit_uni_reorder_utils.cpp, prb_block_for_cache):
const size_t L1_cache_sz = 3 * get_per_core_cache_size(1) / 4; // = 0
const bool requires_inner_blocking = inner_block_sz > L1_cache_sz; // > 0 → always true
This forces a cache-blocking transform for every non-empty inner block. The resulting plan makes the kernel skip an inner block (e.g. oihw→OIhw16i16o 17x23x2x3: logical element (o=0, i=16, …) stays 0).
- Concat (simple_concat.cpp):
if (nelems_to_copy[a] * sizeof(data_t) > L1_size /* == 0 */) { /* hand-vectorised copy */ }
else { std::memcpy(...); }
With L1_size equal to 0 the size test is true for every non-empty copy, so even tiny copies take the hand-vectorised path, which writes past the destination buffer (valgrind: invalid write of size 8 at simple_concat.cpp:148, 4 bytes past a 1320-byte buffer). The overflow clobbers an adjacent std::function; a later parallel_nd call jumps through it and raises SIGBUS.
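Both threshold checks can be exercised in isolation to see how they degenerate. The following is a minimal sketch, not the oneDNN source: requires_inner_blocking and takes_vectorised_path are hypothetical free functions that mirror the comparisons quoted above, with the cache size passed in as a parameter. The 32 KiB figure is an example value only.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the predicate in prb_block_for_cache.
// per_core_l1 plays the role of get_per_core_cache_size(1).
constexpr bool requires_inner_blocking(
        std::size_t inner_block_sz, std::size_t per_core_l1) {
    return inner_block_sz > 3 * per_core_l1 / 4;
}

// Hypothetical stand-in for the size test in simple_concat.cpp.
constexpr bool takes_vectorised_path(
        std::size_t nelems, std::size_t elem_sz, std::size_t L1_size) {
    return nelems * elem_sz > L1_size;
}

// With an example 32 KiB L1d the budget is 24 KiB, so small work is left alone.
static_assert(!requires_inner_blocking(1024, 32 * 1024), "fits in 3/4 of L1");
static_assert(!takes_vectorised_path(16, 4, 32 * 1024), "small copy uses memcpy");

// With a reported size of 0 every non-empty input crosses the threshold.
static_assert(requires_inner_blocking(1, 0), "threshold collapsed to 0");
static_assert(takes_vectorised_path(1, 1, 0), "threshold collapsed to 0");
```

The checks are compile-time static_asserts, so the file only needs to compile for the claim to hold. In both cases the heuristic itself is unchanged; only the threshold it compares against has collapsed to 0.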
Reproducers
ONEDNN_MAX_CPU_ISA=ASIMD ./tests/gtests/test_reorder --gtest_filter='*PaddedWeights*f32_f32*'
ONEDNN_MAX_CPU_ISA=ASIMD ./tests/gtests/graph/unit/test_graph_unit \
--gtest_filter='test_concat*' --engine=cpu
Both failures are deterministic (they also fail at OMP_NUM_THREADS=1) and reproduce wherever the L1d size is absent in /sys.
Suggested fix
Return the architectural guess() when the queried per-core size is 0, and guard the divide-by-zero:
const auto cache_sz = aarch64::cpu().getDataCacheSize(cache_level);
const auto sharing = aarch64::cpu().getCoresSharingDataCache(cache_level);
const auto sz = (cache_sz && sharing) ? cache_sz / sharing : 0;
return sz ? sz : guess(level);
// else branch: return guess(level); // was: return 0
Verified: with this change, the full test_reorder and the int8 concat test pass, and the full suite is 203/203 green on a no-L1d-size host. Behaviour is unchanged where the cache size is exposed.
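The fallback logic can also be checked outside the library. This is a sketch under stated assumptions rather than the actual patch: per_core_cache_size is a hypothetical free function that takes the two queried values as plain parameters, and the sizes in this guess() are illustrative placeholders that may not match the table in platform.cpp.

```cpp
#include <cassert>

// Illustrative fallback table; the values are assumptions for this sketch.
inline unsigned guess(int level) {
    switch (level) {
        case 1: return 32U * 1024;
        case 2: return 512U * 1024;
        case 3: return 1024U * 1024;
        default: return 0U;
    }
}

// queried_size and cores_sharing stand in for the results of
// getDataCacheSize() and getCoresSharingDataCache().
inline unsigned per_core_cache_size(
        unsigned queried_size, unsigned cores_sharing, int level) {
    // Divide only when both operands are non-zero.
    const unsigned sz = (queried_size && cores_sharing)
            ? queried_size / cores_sharing
            : 0;
    // A zero result means "unknown", so fall back to the fixed guess.
    return sz ? sz : guess(level);
}
```

For example, per_core_cache_size(0, 0, 1) yields the level-1 guess without dividing, while per_core_cache_size(65536, 1, 1) returns the queried 65536 unchanged.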
More to look at
Both downstream bugs are reachable independently of cache==0 and might need to be fixed too:
- prb_block_for_cache: treat an unknown/0 cache size as "do not force inner blocking", and fix the cache-blocking plan so it never drops data for padded blocked weights. The faulty plan is otherwise reachable on any host when inner_block_sz > ¾·L1 (large inner blocks).
- simple_concat: bound the hand-vectorised copy so it cannot write past the destination regardless of the L1-size heuristic.
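One way to bound a wide copy is to issue a wide store only when that many bytes remain, then finish with a byte-wise tail. The sketch below illustrates the idea in isolation. bounded_wide_copy is a hypothetical helper, not code from simple_concat.cpp, and it makes no claim about the root cause of the overrun at line 148.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copies nbytes from src to dst using 8-byte chunks for
// the bulk and single bytes for the remainder. It never writes to
// dst[nbytes] or beyond.
inline void bounded_wide_copy(void *dst, const void *src, std::size_t nbytes) {
    auto *d = static_cast<unsigned char *>(dst);
    const auto *s = static_cast<const unsigned char *>(src);
    std::size_t i = 0;
    // Bulk: issue an 8-byte store only while 8 whole bytes remain.
    for (; i + sizeof(std::uint64_t) <= nbytes; i += sizeof(std::uint64_t)) {
        std::uint64_t w;
        std::memcpy(&w, s + i, sizeof(w)); // alignment-safe load
        std::memcpy(d + i, &w, sizeof(w)); // alignment-safe store
    }
    // Tail: fewer than 8 bytes are left, copy them one at a time.
    for (; i < nbytes; ++i)
        d[i] = s[i];
}
```

Copying 13 bytes, for instance, performs one 8-byte chunk and five single-byte stores, so a guard byte placed directly after the destination stays untouched.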