AArch64: get_per_core_cache_size()==0 corrupts reorder output and crashes int8 concat


## Summary
On AArch64 hosts where the L1 **data**-cache size is not exposed (VMs, containers, build chroots, and some Neoverse parts whose `/sys/devices/system/cpu/cpu*/cache/` lists no L1d size), `platform::get_per_core_cache_size(1)` returns **0**. That 0 flows into two consumers and triggers two distinct, deterministic failures:

- **Reorder (silent data corruption):** `jit_uni_reorder` drops a block of data for padded blocked-weight f32 reorders. 12 `reorder_simple_test_t_f32_f32 / PaddedWeights_CPU` cases fail; real weight reorders are silently wrong.
- **Concat (SIGBUS):** `test_graph_unit_dnnl_concat_cpu` crashes (`test_concat_execute_subgraph_int8`) via a heap buffer overflow.

Both pass on x86 and on AArch64 hosts that do expose L1d size.

## Affected
v3.12.1 and current `main`. Introduced by commit **f90453adbf** ("cpu: platform: aarch64: query and calculate cache_per_core"), which replaced a fixed `guess()` with a real query but returns 0 when the per-level size is unavailable. Confirmed by git-bisect (good v3.9.1, bad v3.12.1) on a Neoverse-N1 host with no L1d size in `/sys`.

## Mechanism
`platform.cpp`, AArch64 branch:
```cpp
return aarch64::cpu().getDataCacheSize(cache_level)
        / aarch64::cpu().getCoresSharingDataCache(cache_level);   // 0 (or div-by-zero) if unavailable
```

1. Reorder — `jit_uni_reorder_utils.cpp` `prb_block_for_cache`:
   ```cpp
   const size_t L1_cache_sz = 3 * get_per_core_cache_size(1) / 4;   // = 0
   const bool requires_inner_blocking = inner_block_sz > L1_cache_sz; // > 0 → always true
   ```
   With the threshold at 0, every non-empty inner block forces the cache-blocking transform, which builds a plan whose kernel skips an inner block (e.g. `oihw→OIhw16i16o` 17x23x2x3: logical element (o=0, i=16, …) stays 0).

2. Concat — `simple_concat.cpp`:
   ```cpp
   if (nelems_to_copy[a] * sizeof(data_t) > L1_size /* == 0 */) { /* hand-vectorised copy */ }
   else { std::memcpy(...); }
   ```
   With `L1_size == 0` the size test is true for every non-empty copy, so even tiny copies take the hand-vectorised path, which writes past the destination buffer (valgrind: invalid write of size 8 at `simple_concat.cpp:148`, 4 bytes past a 1320-byte buffer). The overflow clobbers an adjacent `std::function`; a later `parallel_nd` call jumps through it → SIGBUS.
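
The two comparisons can be modeled in isolation to see why a zero L1 size flips both decisions. This is a simplified stand-in, not the oneDNN code: `l1_threshold`, `requires_inner_blocking` and `takes_vectorised_path` are hypothetical helper names, and the 64 KiB value used in the note below is only an example.

```cpp
#include <cstddef>

// Hypothetical stand-in for the threshold computed in prb_block_for_cache:
// three quarters of the reported per-core L1 size.
constexpr std::size_t l1_threshold(std::size_t per_core_l1) {
    return 3 * per_core_l1 / 4;
}

// Mirrors `inner_block_sz > L1_cache_sz` from the reorder snippet.
constexpr bool requires_inner_blocking(
        std::size_t inner_block_sz, std::size_t per_core_l1) {
    return inner_block_sz > l1_threshold(per_core_l1);
}

// Mirrors `nelems_to_copy[a] * sizeof(data_t) > L1_size` from the concat snippet.
template <typename data_t>
constexpr bool takes_vectorised_path(std::size_t nelems, std::size_t l1_size) {
    return nelems * sizeof(data_t) > l1_size;
}
```

With a reported size of 0, `requires_inner_blocking(1, 0)` and `takes_vectorised_path<signed char>(1, 0)` are both true; with 64 KiB the same one-byte inputs make both false.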

## Reproducers
```sh
ONEDNN_MAX_CPU_ISA=ASIMD ./tests/gtests/test_reorder --gtest_filter='*PaddedWeights*f32_f32*'
ONEDNN_MAX_CPU_ISA=ASIMD ./tests/gtests/graph/unit/test_graph_unit \
    --gtest_filter='test_concat*' --engine=cpu
```
Both failures are deterministic (they reproduce with `OMP_NUM_THREADS=1`) and occur on any host where the L1d `size` entry is absent in `/sys`.
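
To check whether a given host falls into the affected group, the sysfs entries can be inspected directly. The following is a minimal sketch under the assumption that the standard Linux layout `cache/indexN/{level,type,size}` is in use; `parse_cache_size`, `read_word` and `l1d_size_from_sysfs` are hypothetical helper names, and the library's own query may consult other sources.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Parses a sysfs cache size such as "64K" into bytes.
// Returns 0 when the text has no leading digits (i.e. the size is not exposed).
inline std::size_t parse_cache_size(const std::string &text) {
    std::size_t i = 0, value = 0;
    while (i < text.size() && text[i] >= '0' && text[i] <= '9')
        value = value * 10 + static_cast<std::size_t>(text[i++] - '0');
    if (i == 0) return 0;
    if (i < text.size() && text[i] == 'K') value *= 1024;
    else if (i < text.size() && text[i] == 'M') value *= 1024 * 1024;
    return value;
}

// Reads the first whitespace-delimited word of a file; empty if unreadable.
inline std::string read_word(const std::string &path) {
    std::ifstream file(path);
    std::string word;
    file >> word;
    return word;
}

// Returns the L1 data-cache size in bytes for one CPU, or 0 if sysfs has no
// level-1 "Data" entry or that entry has no usable size.
inline std::size_t l1d_size_from_sysfs(int cpu = 0) {
    const std::string base = "/sys/devices/system/cpu/cpu"
            + std::to_string(cpu) + "/cache/index";
    for (int idx = 0; idx < 8; ++idx) { // 8 is an arbitrary upper bound
        const std::string dir = base + std::to_string(idx) + "/";
        if (read_word(dir + "level") == "1" && read_word(dir + "type") == "Data")
            return parse_cache_size(read_word(dir + "size"));
    }
    return 0;
}
```

A return value of 0 from `l1d_size_from_sysfs()` suggests the host is one where the reproducers above are expected to fail.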

## Suggested fix
Return the architectural `guess()` when the queried per-core size is 0, and guard the divide-by-zero:
```cpp
const auto cache_sz = aarch64::cpu().getDataCacheSize(cache_level);
const auto sharing  = aarch64::cpu().getCoresSharingDataCache(cache_level);
const auto sz = (cache_sz && sharing) ? cache_sz / sharing : 0;
return sz ? sz : guess(level);
// The existing else branch should likewise return guess(level) instead of 0.
```
Verified: with this change the full `test_reorder` and the int8 concat test pass, and the full suite is 203/203 green on a no-L1d-size host. Behavior is unchanged on hosts that expose the cache size.
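
The fallback logic can be exercised on its own, outside the library. In this sketch `guess_cache_size` and `per_core_cache_size` are hypothetical names, and the fallback sizes are placeholder values assumed for illustration; the real `guess()` in `platform.cpp` may return different numbers.

```cpp
#include <cstddef>

// Placeholder fallback sizes (assumed for illustration only).
constexpr std::size_t guess_cache_size(int level) {
    return level == 1 ? 32u * 1024
                      : level == 2 ? 512u * 1024 : 1024u * 1024;
}

// Returns the queried per-core size when it is usable, otherwise the guess.
// A zero divisor or a zero quotient both count as "unavailable".
constexpr std::size_t per_core_cache_size(
        std::size_t cache_sz, std::size_t sharing, int level) {
    return (cache_sz != 0 && sharing != 0 && cache_sz / sharing != 0)
            ? cache_sz / sharing
            : guess_cache_size(level);
}
```

For example, `per_core_cache_size(0, 0, 1)` yields the level-1 placeholder instead of dividing by zero, while `per_core_cache_size(1048576, 4, 2)` still returns the queried 262144 bytes.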

## More to look at
Both downstream bugs are reachable independently of cache==0 and might need to be fixed too:
- `prb_block_for_cache`: treat an unknown/0 cache size as "do not force inner blocking", and fix the cache-blocking plan so it never drops data for padded blocked weights. The faulty plan is otherwise reachable on any host when `inner_block_sz > ¾·L1` (large inner blocks).
- `simple_concat`: bound the hand-vectorised copy so it cannot write past the destination regardless of the L1-size heuristic.
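
Both hardening ideas can be sketched in a few lines. This is an illustration of the general technique, not a patch for the actual kernels: `requires_inner_blocking_guarded` and `copy_bounded` are hypothetical names, and the chunked copy stands in for whatever the hand-vectorised path really does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// An unknown (0) cache size never forces inner blocking.
constexpr bool requires_inner_blocking_guarded(
        std::size_t inner_block_sz, std::size_t per_core_l1) {
    return per_core_l1 != 0 && inner_block_sz > 3 * per_core_l1 / 4;
}

// Copies exactly nbytes: whole 8-byte chunks first, then a byte-wise tail,
// so no store ever extends past dst + nbytes.
inline void copy_bounded(void *dst, const void *src, std::size_t nbytes) {
    assert(dst != nullptr && src != nullptr);
    auto *d = static_cast<unsigned char *>(dst);
    const auto *s = static_cast<const unsigned char *>(src);
    std::size_t i = 0;
    for (; i + sizeof(std::uint64_t) <= nbytes; i += sizeof(std::uint64_t)) {
        std::uint64_t chunk;
        std::memcpy(&chunk, s + i, sizeof(chunk)); // alignment-safe load
        std::memcpy(d + i, &chunk, sizeof(chunk)); // alignment-safe store
    }
    for (; i < nbytes; ++i)
        d[i] = s[i]; // remaining tail bytes
}
```

The loop condition `i + 8 <= nbytes` is what keeps the wide stores inside the buffer; the tail loop handles sizes that are not a multiple of 8.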
