cpu: rv64: support RVV f16 k3s1 and k3s2 nhwc dwconv #5345
Regarding the development plan for RV64, I think file naming should be unified under the prefix |
const dim_t padded_w = iw + l_pad + r_pad;
const dim_t stride_h = cd->strides[0];

std::vector<float16_t> packed_weights;
Recommend to use scratchpad instead of heap allocation in oneDNN.
Thanks for your recommendation. We have updated the temporary buffers in jit_uni_dwconv to use oneDNN scratchpad instead of heap allocations.
VDISPATCH_CONV(dilate_h == 0 && dilate_w == 0, VERBOSE_BAD_DIM,
        "dilates", 0);
Do we need a check on whether padding matches the kernel's zero-fill assumption? If padding is asymmetric (e.g., t_pad != l_pad), the packing code would produce incorrect results silently.
Thanks for the reminder. We re-examined the implementation and performed additional verification. Based on our current understanding, additional padding checks do not appear to be necessary.
The implementation does not assume t_pad == l_pad or otherwise require symmetric padding. In the packing path, a buffer of size (ih + t_pad + b_pad) x (iw + l_pad + r_pad) is allocated and zero-initialized, and each source element (h, w) is copied to position (h + t_pad, w + l_pad), so the copied region starts at (t_pad, l_pad). The JIT kernel then operates on this zero-padded buffer.
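For readers following the thread, here is a minimal sketch of the packing step described above. It is an illustration only, not the code in this PR: the helper name `pad_nhwc` is hypothetical, f16 values are carried as raw 16-bit patterns (0x0000 is +0.0 in f16), and `std::vector` is used purely to keep the example self-contained, whereas the PR uses the oneDNN scratchpad.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): copy one NHWC image of shape
// ih x iw x c into a zero-initialized buffer of shape
// (ih + t_pad + b_pad) x (iw + l_pad + r_pad) x c.
// The four pads are independent, so asymmetric padding needs no special case.
std::vector<std::uint16_t> pad_nhwc(const std::vector<std::uint16_t> &src,
        std::size_t ih, std::size_t iw, std::size_t c, std::size_t t_pad,
        std::size_t b_pad, std::size_t l_pad, std::size_t r_pad) {
    assert(src.size() == ih * iw * c);
    const std::size_t padded_h = ih + t_pad + b_pad;
    const std::size_t padded_w = iw + l_pad + r_pad;

    // Zero fill provides the padding values on all four sides.
    std::vector<std::uint16_t> dst(padded_h * padded_w * c, 0);

    for (std::size_t h = 0; h < ih; ++h)
        for (std::size_t w = 0; w < iw; ++w)
            for (std::size_t ch = 0; ch < c; ++ch)
                // Source element (h, w) lands at (h + t_pad, w + l_pad).
                dst[((h + t_pad) * padded_w + (w + l_pad)) * c + ch]
                        = src[(h * iw + w) * c + ch];
    return dst;
}
```

Because b_pad and r_pad only affect the buffer size, and t_pad and l_pad only affect the copy offset, nothing in this scheme relies on the pads being equal.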
We also validated this behavior with benchdnn correctness tests. All of the following cases selected jit_dw:uni and passed successfully:
| Scenario | Descriptor | Result |
|---|---|---|
| No padding | g8mb1ic8ih8oc8oh6kh3ph0iw8ow6kw3pw0 | Passed |
| Symmetric padding | g8mb1ic8ih8oc8oh8kh3ph1iw8ow8kw3pw1 | Passed |
| Top-only padding | g8mb1ic8ih8oc8oh7kh3ph1iw8ow6kw3pw0 | Passed |
| Bottom-only padding | g8mb1ic8ih8oc8oh7kh3ph0iw8ow6kw3pw0 | Passed |
| Left-only padding | g8mb1ic8ih8oc8oh6kh3ph0iw8ow7kw3pw1 | Passed |
| Right-only padding | g8mb1ic8ih8oc8oh6kh3ph0iw8ow7kw3pw0 | Passed |
| Small asymmetric padding, with bottom/right padding inferred by benchdnn | g8mb1ic8ih8oc8oh9kh3ph1iw8ow9kw3pw1 | Passed |
| Stride-2 asymmetric padding | g8mb1ic8ih8oc8oh5kh3sh2ph1iw8ow5kw3sw2pw1 | Passed |
| Large top/left padding | g8mb1ic8ih64oc8oh92kh3ph30iw64ow102kw3pw40 | Passed |
| Large bottom/right padding, inferred from the output shape | g8mb1ic8ih64oc8oh92kh3ph0iw64ow102kw3pw0 | Passed |
| Large asymmetric padding on both sides | g8mb1ic8ih64oc8oh132kh3ph30iw64ow132kw3pw30 | Passed |
| Stride-2 with large asymmetric padding | g8mb1ic8ih64oc8oh42kh3sh2ph10iw64ow44kw3sw2pw12 | Passed |
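The descriptors above only state the leading pads (ph, pw); the bottom/right pads follow from the input and output sizes. A small sketch of that arithmetic, assuming no dilation and the usual output-size relation out = (in + lead_pad + trail_pad - kernel) / stride + 1 solved for trail_pad. The function name `trailing_pad` is hypothetical and is not a benchdnn or oneDNN API:

```cpp
#include <cassert>

// Hypothetical helper: trailing (bottom or right) padding implied by a
// convolution shape along one spatial axis, assuming dilation is 0.
long trailing_pad(long in, long out, long kernel, long stride, long lead_pad) {
    assert(stride > 0 && out > 0);
    return (out - 1) * stride + kernel - in - lead_pad;
}
```

For example, the stride-2 case ih8oh5kh3sh2ph1 gives (5 - 1) * 2 + 3 - 8 - 1 = 2, so the top pad is 1 and the bottom pad is 2, which is why that row counts as asymmetric.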
Thanks for the comment!
Thanks for the suggestion. We have already consolidated the current k3 implementation into jit_uni_dwconv.hpp/cpp following the gnorm style.
Description
This PR adds an RV64 RVV implementation for f16 depthwise convolution.
The new primitive covers forward 2D depthwise convolution with:
IC == G

The implementation adds JIT RVV kernels for the 3x3 f16 depthwise cases, packs NHWC input and GOIHW weights into the layout expected by the kernels, and registers the primitive before the reference f16 convolution implementation. The stride-1 and stride-2 kernels share the common epilogue path for bias, f32-to-f16 narrowing, and output stores; stride-specific input loading and FMA scheduling are generated separately.
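As a point of comparison for the kernels described above, a scalar reference for the same computation might look like the sketch below. It is not the JIT code from this PR: the name `dwconv3x3_ref` is hypothetical, `float` stands in for f16 storage, the input is assumed to be a single already zero-padded NHWC image, and the weights are assumed to be laid out as c x 3 x 3 (GOIHW with one input and one output channel per group).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference sketch (hypothetical, not the PR's kernel).
// src_padded: padded_h x padded_w x c, NHWC, zero-padded beforehand
// wei:        c x 3 x 3
// bias:       c values, or empty for no bias
std::vector<float> dwconv3x3_ref(const std::vector<float> &src_padded,
        const std::vector<float> &wei, const std::vector<float> &bias,
        std::size_t padded_h, std::size_t padded_w, std::size_t c,
        std::size_t stride) {
    assert(stride == 1 || stride == 2);
    assert(padded_h >= 3 && padded_w >= 3);
    assert(src_padded.size() == padded_h * padded_w * c);
    assert(wei.size() == c * 9);
    assert(bias.empty() || bias.size() == c);

    const std::size_t oh = (padded_h - 3) / stride + 1;
    const std::size_t ow = (padded_w - 3) / stride + 1;
    std::vector<float> dst(oh * ow * c);

    for (std::size_t y = 0; y < oh; ++y)
        for (std::size_t x = 0; x < ow; ++x)
            for (std::size_t ch = 0; ch < c; ++ch) {
                float acc = 0.0f; // accumulate in f32
                for (std::size_t kh = 0; kh < 3; ++kh)
                    for (std::size_t kw = 0; kw < 3; ++kw)
                        acc += src_padded[((y * stride + kh) * padded_w
                                                  + (x * stride + kw)) * c + ch]
                                * wei[ch * 9 + kh * 3 + kw];
                // Epilogue: bias add, then store. A real f16 kernel would
                // narrow the f32 accumulator to f16 at this point.
                if (!bias.empty()) acc += bias[ch];
                dst[(y * ow + x) * c + ch] = acc;
            }
    return dst;
}
```

Only the input indexing depends on the stride, while the bias add and store are identical for both, which mirrors the split between stride-specific loading and the shared epilogue described above.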
Performance
Benchmarks were run on Spacemit X100 (K3).
Performance was measured with MobileNet depthwise f16 benchdnn cases. The baseline is the latest upstream repository version before this change.
The MobileNet depthwise cases are now covered by the RVV primitive and no longer fall back to ref:any.
Checklist
General
make test and make test_benchdnn_*) pass locally for each commit?
Performance improvements