Skip to content

cpu: rv64: support RVV f16 k3s1 and k3s2 nhwc dwconv#5345

Open
xinghai-zh wants to merge 6 commits into
uxlfoundation:mainfrom
spacemit-com:upstream-spacemit-ops-parallel
Open

cpu: rv64: support RVV f16 k3s1 and k3s2 nhwc dwconv#5345
xinghai-zh wants to merge 6 commits into
uxlfoundation:mainfrom
spacemit-com:upstream-spacemit-ops-parallel

Conversation

@xinghai-zh

Copy link
Copy Markdown

Description

This PR adds an RV64 RVV implementation for f16 depthwise convolution.

The new primitive covers forward 2D depthwise convolution with:

  • f16 source, weights, and destination
  • optional f16 or f32 bias
  • grouped depthwise layout where IC == G
  • 3x3 kernel, stride 1 or stride 2, dilation 1
  • NHWC source and destination tensors
  • GOIHW weights

The implementation adds JIT RVV kernels for the 3x3 f16 depthwise cases, packs NHWC input and GOIHW weights into the layout expected by the kernels, and registers the primitive before the reference f16 convolution implementation. The stride-1 and stride-2 kernels share the common epilogue path for bias, f32-to-f16 narrowing, and output stores; stride-specific input loading and FMA scheduling are generated separately.

Performance

Benchmarks were run on Spacemit X100 (K3).

Performance was measured with MobileNet depthwise f16 benchdnn cases. The baseline is the latest upstream repository version before this change.

Case Impl Ref avg time New avg time Speedup
mobilenet:conv2_1/dw dw_k3s1s2:rvv 21.8489 2.16469 10.09x
mobilenet:conv2_2/dw dw_k3s1s2:rvv 11.5791 2.97556 3.89x
mobilenet:conv3_1/dw dw_k3s1s2:rvv 24.4496 1.62424 15.05x
mobilenet:conv3_2/dw dw_k3s1s2:rvv 8.19451 1.26879 6.46x
mobilenet:conv4_1/dw dw_k3s1s2:rvv 10.9909 0.869389 12.64x
mobilenet:conv4_2/dw dw_k3s1s2:rvv 2.98457 0.685749 4.35x
mobilenet:conv5_1/dw dw_k3s1s2:rvv 5.46905 0.671989 8.14x
mobilenet:conv5_6/dw dw_k3s1s2:rvv 1.53196 0.515312 2.97x
mobilenet:conv6/dw dw_k3s1s2:rvv 2.78519 0.711036 3.92x

The MobileNet depthwise cases are now covered by the RVV primitive and no longer fall back to ref:any.

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

Performance improvements

  • Have you submitted performance data that demonstrates performance improvements?

@zhangfeiv0

Copy link
Copy Markdown
Contributor

Regarding the development plan for RV64, I think file naming should be unified under the prefix jit_uni*. Including rvv in the names is confusing, since the code actually also references the Zvfh extension and may introduce more extensions in the future. In addition, the CPU_INSTANCE_RV64GCV macro should evolve toward CPU_INSTANCE_RV64; you can check #5343 for details.

Comment thread src/cpu/rv64/rvv_dwconv.cpp Outdated
const dim_t padded_w = iw + l_pad + r_pad;
const dim_t stride_h = cd->strides[0];

std::vector<float16_t> packed_weights;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend to use scratchpad instead of heap allocation in oneDNN.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your recommendation. We have updated the temporary buffers in jit_uni_dwconv to use oneDNN scratchpad instead of heap allocations.

Comment on lines +74 to +76
VDISPATCH_CONV(dilate_h == 0 && dilate_w == 0, VERBOSE_BAD_DIM,
"dilates", 0);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a check on whether padding matches the kernel's zero-fill assumption? If padding is asymmetric (e.g., t_pad != l_pad), the packing code would produce incorrect results silently.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reminder. We re-examined the implementation and performed additional verification. Based on our current understanding, additional checks related to padding does not appear to be necessary.

The implementation does not assume t_pad == l_pad or otherwise require symmetric padding. In the packing path, a buffer of size (ih + t_pad + b_pad) x (iw + l_pad + r_pad) is allocated and zero-initialized, and the source tensor is copied into the region starting at (h + t_pad, w + l_pad). The JIT kernel then operates on this zero-padded buffer.

We also validated this behavior with benchdnn correctness tests. All of the following cases selected jit_dw:uni and passed successfully:

Scenario Descriptor Result
No padding g8mb1ic8ih8oc8oh6kh3ph0iw8ow6kw3pw0 Passed
Symmetric padding g8mb1ic8ih8oc8oh8kh3ph1iw8ow8kw3pw1 Passed
Top-only padding g8mb1ic8ih8oc8oh7kh3ph1iw8ow6kw3pw0 Passed
Bottom-only padding g8mb1ic8ih8oc8oh7kh3ph0iw8ow6kw3pw0 Passed
Left-only padding g8mb1ic8ih8oc8oh6kh3ph0iw8ow7kw3pw1 Passed
Right-only padding g8mb1ic8ih8oc8oh6kh3ph0iw8ow7kw3pw0 Passed
Small asymmetric padding, with bottom/right padding inferred by benchdnn g8mb1ic8ih8oc8oh9kh3ph1iw8ow9kw3pw1 Passed
Stride-2 asymmetric padding g8mb1ic8ih8oc8oh5kh3sh2ph1iw8ow5kw3sw2pw1 Passed
Large top/left padding g8mb1ic8ih64oc8oh92kh3ph30iw64ow102kw3pw40 Passed
Large bottom/right padding, inferred from the output shape g8mb1ic8ih64oc8oh92kh3ph0iw64ow102kw3pw0 Passed
Large asymmetric padding on both sides g8mb1ic8ih64oc8oh132kh3ph30iw64ow132kw3pw30 Passed
Stride-2 with large asymmetric padding g8mb1ic8ih64oc8oh42kh3sh2ph10iw64ow44kw3sw2pw12 Passed

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment!

@xinghai-zh xinghai-zh closed this Jun 17, 2026
@xinghai-zh xinghai-zh reopened this Jun 17, 2026
@xinghai-zh

Copy link
Copy Markdown
Author

Regarding the development plan for RV64, I think file naming should be unified under the prefix jit_uni*. Including rvv in the names is confusing, since the code actually also references the Zvfh extension and may introduce more extensions in the future. In addition, the CPU_INSTANCE_RV64GCV macro should evolve toward CPU_INSTANCE_RV64; you can check #5343 for details.

Thanks for the suggestion. We have already consolidated the current k3 implementation into jit_uni_dwconv.hpp/cpp following the gnorm style.
Looking ahead, if we add k5/k7 depthwise kernels later, would it still be preferable to keep all kernel implementations in jit_uni_dwconv.cpp? Or would it also be acceptable to split them into separate files such as jit_uni_dwconv_k{3,5,7}_kernel.cpp, as long as the naming remains under the jit_uni_dwconv* prefix?
cc @zhangfeiv0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants