AArch64: get_per_core_cache_size()==0 corrupts reorder output and crashes int8 concat #5313

@happyaron

Summary

On AArch64 hosts where the L1 data-cache size is not exposed (VMs, containers, build chroots, and some Neoverse parts whose /sys/devices/system/cpu/cpu*/cache/ lists no L1d size), platform::get_per_core_cache_size(1) returns 0. That 0 flows into two consumers and triggers two distinct, deterministic failures:

  • Reorder (silent data corruption): jit_uni_reorder drops a block of data for padded blocked-weight f32 reorders. 12 reorder_simple_test_t_f32_f32 / PaddedWeights_CPU cases fail; real weight reorders are silently wrong.
  • Concat (SIGBUS): test_graph_unit_dnnl_concat_cpu crashes (test_concat_execute_subgraph_int8) via a heap buffer overflow.

Both pass on x86 and on AArch64 hosts that do expose L1d size.

Affected

v3.12.1 and current main. Introduced by commit f90453a ("cpu: platform: aarch64: query and calculate cache_per_core"), which replaced a fixed guess() with a real query but returns 0 when the per-level size is unavailable. Confirmed by git-bisect (good v3.9.1, bad v3.12.1) on a Neoverse-N1 host with no L1d size in /sys.

Mechanism

platform.cpp, AArch64 branch:

return aarch64::cpu().getDataCacheSize(cache_level)
        / aarch64::cpu().getCoresSharingDataCache(cache_level);   // 0 (or div-by-zero) if unavailable

  1. Reorder: prb_block_for_cache in jit_uni_reorder_utils.cpp:

    const size_t L1_cache_sz = 3 * get_per_core_cache_size(1) / 4;   // = 0
    const bool requires_inner_blocking = inner_block_sz > L1_cache_sz; // inner_block_sz > 0, so always true

    With a zero budget, every non-empty inner block forces a cache-blocking transform, and the resulting plan's kernel skips an inner block (e.g. oihw→OIhw16i16o 17x23x2x3: logical element (o=0, i=16, …) stays 0).

  2. Concat: simple_concat.cpp:

    if (nelems_to_copy[a] * sizeof(data_t) > L1_size /* == 0 */) { /* hand-vectorised copy */ }
    else { std::memcpy(...); }

    The size test is true for every non-empty copy, so even tiny copies take the hand-vectorised path, which writes past the destination buffer (valgrind: invalid write of size 8 at simple_concat.cpp:148, 4 bytes past a 1320-byte buffer). The overflow clobbers an adjacent std::function; a later parallel_nd call jumps through it and the process dies with SIGBUS.
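
To make the arithmetic concrete, here is a minimal standalone sketch of the two size tests quoted above. The function names and the 32 KiB figure used in the note below are illustrative assumptions, not oneDNN code; only the comparisons themselves are taken from the snippets.

```cpp
#include <cstddef>

// Mirrors the reorder heuristic quoted above: the budget is three quarters
// of the per-core L1 size. `per_core_l1` stands in for the value returned by
// get_per_core_cache_size(1). The function name is hypothetical.
inline bool requires_inner_blocking(std::size_t inner_block_sz,
                                    std::size_t per_core_l1) {
    const std::size_t l1_budget = 3 * per_core_l1 / 4; // 0 when per_core_l1 == 0
    return inner_block_sz > l1_budget;
}

// Mirrors the concat size test quoted above. The function name is hypothetical.
inline bool takes_vectorised_copy(std::size_t nelems, std::size_t elem_size,
                                  std::size_t l1_size) {
    return nelems * elem_size > l1_size; // any non-empty copy when l1_size == 0
}
```

With an assumed 32 KiB L1, a 64-byte inner block does not require blocking and a 16-byte copy goes to memcpy; with a reported size of 0, both predicates flip to true for any non-zero size.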

Reproducers

ONEDNN_MAX_CPU_ISA=ASIMD ./tests/gtests/test_reorder --gtest_filter='*PaddedWeights*f32_f32*'
ONEDNN_MAX_CPU_ISA=ASIMD ./tests/gtests/graph/unit/test_graph_unit \
    --gtest_filter='test_concat*' --engine=cpu

Both failures are deterministic (they occur even with OMP_NUM_THREADS=1) and reproduce wherever the L1d size is absent in /sys.
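
To tell whether a given host is affected, one option is to look for a level-1 data cache entry with a size attribute under the sysfs directory mentioned in the summary. The sketch below assumes the standard Linux layout (cache/index*/level, type, size under each cpu directory); the helper names are hypothetical, and this is not how the library itself performs the query.

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Reads the first line of a small sysfs attribute file.
// Returns an empty string if the file is missing or unreadable.
inline std::string read_attr(const fs::path &p) {
    std::ifstream in(p);
    std::string s;
    if (in) std::getline(in, s);
    return s;
}

// Returns true if some cache/index* directory under `cpu_dir`
// (e.g. /sys/devices/system/cpu/cpu0) describes a level-1 data cache
// and carries a non-empty `size` attribute.
inline bool l1d_size_exposed(const fs::path &cpu_dir) {
    std::error_code ec;
    const fs::path cache_dir = cpu_dir / "cache";
    for (fs::directory_iterator it(cache_dir, ec), end; !ec && it != end;
            it.increment(ec)) {
        const fs::path idx = it->path();
        if (idx.filename().string().rfind("index", 0) != 0) continue;
        if (read_attr(idx / "level") != "1") continue;
        if (read_attr(idx / "type") != "Data") continue;
        if (!read_attr(idx / "size").empty()) return true;
    }
    return false; // no cache directory, no L1d entry, or no size attribute
}
```

Calling l1d_size_exposed("/sys/devices/system/cpu/cpu0") and getting false would suggest the host is a candidate for reproducing both failures.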

Suggested fix

Return the architectural guess() when the queried per-core size is 0, and guard the divide-by-zero:

const auto cache_sz = aarch64::cpu().getDataCacheSize(cache_level);
const auto sharing  = aarch64::cpu().getCoresSharingDataCache(cache_level);
const auto sz = (cache_sz && sharing) ? cache_sz / sharing : 0;
return sz ? sz : guess(level);
// else branch: return guess(level);  // was: return 0

Verified: with this change, the full test_reorder suite and the int8 concat test pass, and the full test suite is 203/203 green on a host with no L1d size. Behaviour is unchanged where the cache size is exposed.
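
The same fallback logic can be exercised in isolation. In this sketch the queried size and sharing count are passed in as plain parameters instead of coming from aarch64::cpu(), and the values in guess_cache_size are placeholder assumptions, not necessarily the table that guess() uses in platform.cpp.

```cpp
// Placeholder per-level defaults. These numbers are assumptions for
// illustration only and may differ from the real guess() table.
inline unsigned guess_cache_size(int level) {
    switch (level) {
        case 1: return 32u * 1024;
        case 2: return 512u * 1024;
        case 3: return 1024u * 1024;
        default: return 0u;
    }
}

// Per-core cache size with the proposed fallback: never divide by zero,
// and never return 0 for a level that has a default.
inline unsigned per_core_cache_size(int level, unsigned queried_size,
                                    unsigned cores_sharing) {
    const unsigned sz = (queried_size && cores_sharing)
            ? queried_size / cores_sharing
            : 0u;
    return sz ? sz : guess_cache_size(level);
}
```

A host that reports a 1 MiB L2 shared by 4 cores still gets 256 KiB from the query, while a host that reports 0 for either value falls back to the default for that level.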

More to look at

Both downstream bugs are reachable independently of cache==0 and might need to be fixed too:

  • prb_block_for_cache: treat an unknown/0 cache size as "do not force inner blocking", and fix the cache-blocking plan so it never drops data for padded blocked weights — it is otherwise reachable on any host when inner_block_sz > ¾·L1 (large inner blocks).
  • simple_concat: bound the hand-vectorised copy so it cannot write past the destination regardless of the L1-size heuristic.
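
As a rough illustration of these two hardening ideas, the sketch below shows a guarded blocking predicate and a word-wise copy that never stores past the requested length. Both functions are hypothetical stand-ins written for this issue; they are not the code in jit_uni_reorder_utils.cpp or simple_concat.cpp, and the actual overflow may have a different cause than a missing tail loop.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// An unknown (0) cache size never forces inner blocking; otherwise the
// three-quarters-of-L1 budget from the reorder snippet applies.
inline bool requires_inner_blocking_guarded(std::size_t inner_block_sz,
                                            std::size_t per_core_l1) {
    if (per_core_l1 == 0) return false;
    return inner_block_sz > 3 * per_core_l1 / 4;
}

// Copies `nbytes` from src to dst in 8-byte words, then finishes with a
// byte-wise tail, so no store touches memory at or beyond dst + nbytes.
inline void bounded_copy(void *dst, const void *src, std::size_t nbytes) {
    auto *d = static_cast<unsigned char *>(dst);
    const auto *s = static_cast<const unsigned char *>(src);
    std::size_t i = 0;
    for (; i + sizeof(std::uint64_t) <= nbytes; i += sizeof(std::uint64_t)) {
        std::uint64_t w;
        std::memcpy(&w, s + i, sizeof(w)); // alignment-safe load
        std::memcpy(d + i, &w, sizeof(w)); // alignment-safe store
    }
    for (; i < nbytes; ++i) d[i] = s[i]; // remaining 0..7 bytes
}
```

The point of the second function is that the wide loop's bound accounts for the full width of each store, which keeps the copy safe regardless of which branch the L1-size heuristic selects.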

Labels

  • platform:cpu-aarch64 (Codeowner: @oneapi-src/onednn-cpu-aarch64)
  • sighting (Suspicious library behavior. Should be promoted to a bug when confirmed)
