Let's finally beat System.Drawing on the JPEG Load->Resize->Save scenario!
As discussed in #1064, it's finally possible thanks to the Intel SIMD intrinsics in .NET Core 3.1. Opening an issue so we can track this work, and hopefully get some help & feedback from the community.
/cc @Sergio0694 @saucecontrol
Current pipeline
Summary of steps currently done by ConvertColorsInto:
[D]: Data representation
(T): Bulk transformation between data representations
(case a) Y+Cb+Cr planes --> Single Rgba32 buffer
| [D] | 3 planes of W*H sized float JPEG color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr) |
| (T) | Color convert and pack into a single Vector4 buffer |
| [D] | Floating point RGBA data as Memory<Vector4> |
| (T) | Convert the Vector4 buffer to an Rgba32 buffer. In the Rgba32 case, the input buffer can be handled as a homogeneous float buffer in which every individual float value is converted to a byte. The conversion is implemented in BulkConvertNormalizedFloatToByteClampOverflows, utilizing AVX2 conversion and narrowing operations through Vector<T> |
| [D] | The result image as an Rgba32 buffer |
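To make the float-to-byte narrowing step concrete, here is a minimal sketch using Vector<T>. It is not the actual BulkConvertNormalizedFloatToByteClampOverflows code: the class and method names are hypothetical, and it assumes the input floats are already scaled to 0-255.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

public static class FloatToByteSketch
{
    // Hypothetical helper (assumption, not the ImageSharp implementation):
    // rounds floats in the 0-255 range to bytes, clamping overflows.
    public static void ConvertToByteClamped(ReadOnlySpan<float> source, Span<byte> destination)
    {
        if (destination.Length < source.Length)
        {
            throw new ArgumentException("Destination is too short.", nameof(destination));
        }

        // Process whole byte vectors first; the remainder is handled by the scalar loop.
        int vectorized = source.Length - (source.Length % Vector<byte>.Count);

        ReadOnlySpan<Vector<float>> src =
            MemoryMarshal.Cast<float, Vector<float>>(source.Slice(0, vectorized));
        Span<Vector<byte>> dst =
            MemoryMarshal.Cast<byte, Vector<byte>>(destination.Slice(0, vectorized));

        for (int i = 0; i < dst.Length; i++)
        {
            // Four float vectors narrow down to a single byte vector.
            Vector<uint> a = ToUInt32(src[(i * 4) + 0]);
            Vector<uint> b = ToUInt32(src[(i * 4) + 1]);
            Vector<uint> c = ToUInt32(src[(i * 4) + 2]);
            Vector<uint> d = ToUInt32(src[(i * 4) + 3]);

            Vector<ushort> lo = Vector.Narrow(a, b);
            Vector<ushort> hi = Vector.Narrow(c, d);
            dst[i] = Vector.Narrow(lo, hi);
        }

        for (int i = vectorized; i < source.Length; i++)
        {
            destination[i] = (byte)Math.Clamp(source[i] + 0.5f, 0f, 255f);
        }
    }

    private static Vector<uint> ToUInt32(Vector<float> v)
    {
        // Round half up, then clamp, because Vector.Narrow truncates instead of saturating.
        v = Vector.Max(v + new Vector<float>(0.5f), Vector<float>.Zero);
        v = Vector.Min(v, new Vector<float>(255f));
        return Vector.ConvertToUInt32(v);
    }
}
```

Clamping before narrowing matters here, since color conversion can produce values slightly outside 0-255.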
(case b) Y+Cb+Cr planes --> Single Rgb24 buffer
| [D] | 3 planes of W*H sized float JPEG color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr) |
| (T) | Color convert and pack into a single Vector4 buffer |
| [D] | Floating point RGBA data as Memory<Vector4> |
| (T) | Convert the Vector4 buffer to an Rgba32 buffer using BulkConvertNormalizedFloatToByteClampOverflows, utilizing AVX2 conversion and narrowing operations through Vector<T> |
| [D] | Rgba32 buffer |
| (T) | PixelOperations<Rgb24>.FromRgba32() (sub-optimal, extra transformation!) |
| [D] | The result image as an Rgb24 buffer |
Optimized pipeline
(default Rgb24 case) Y+Cb+Cr planes --> Single Rgb24 buffer
| D1 | 3 planes of W*H sized float JPEG color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr) |
| (T) | Color convert the 3 planes and write them back to the originating buffers |
| D2 | 3 planes of Buffer2D<float>, R+G+B |
| (T) | Narrow the float buffers to byte buffers using SimdUtils.BulkConvertNormalizedFloatToByteClampOverflows |
| D3 | 3 planes of Buffer2D<byte>, R+G+B |
| (T) | Pack the separate image planes (color channels) into a single Rgb24 buffer |
| D4 | The result image as an Rgb24 buffer |
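As an illustration of the D1->D2 step, the sketch below converts the three planes in place, so the Y, Cb and Cr buffers end up holding R, G and B. The class and method names are hypothetical, the loop is scalar for clarity, and it assumes the standard JFIF full-range YCbCr equations; the real conversion may differ in details.

```csharp
using System;

public static class PlanarColorConvertSketch
{
    // Hypothetical helper (assumption): converts Y, Cb, Cr planes holding floats
    // in the 0-255 range to R, G, B, writing the results back into the same buffers.
    public static void YCbCrToRgbInPlace(Span<float> yToR, Span<float> cbToG, Span<float> crToB)
    {
        if (cbToG.Length != yToR.Length || crToB.Length != yToR.Length)
        {
            throw new ArgumentException("All planes must have the same length.");
        }

        for (int i = 0; i < yToR.Length; i++)
        {
            float y = yToR[i];
            float cb = cbToG[i] - 128f; // chroma is centered on 128
            float cr = crToB[i] - 128f;

            // JFIF full-range equations. Results can fall slightly outside 0-255,
            // so the later narrowing step has to clamp.
            yToR[i] = y + (1.402f * cr);
            cbToG[i] = y - (0.344136f * cb) - (0.714136f * cr);
            crToB[i] = y + (1.772f * cb);
        }
    }
}
```

Writing back into the originating buffers avoids allocating a second set of float planes, which is the point of keeping the data planar until the final packing step.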
(TPixel case) Y+Cb+Cr planes --> Single TPixel buffer
| D1 | 3 planes of W*H sized float JPEG color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr) |
| (T) | All the steps from the default Rgb24 case |
| D4 | Memory<Rgb24> |
| (T) | Convert the Rgb24 buffer to a TPixel buffer using PixelOperations<T> |
| D5 | The result image as a TPixel buffer |
The magic is mostly in the D3->D4 transition: we can now do the pixel packing with shuffle and permute intrinsics when those are available. The other fun thing is that if we decode to Image<Rgb24> (case b), we can omit an unnecessary step, namely the extra PixelOperations<Rgb24>.FromRgba32() conversion.
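To give a feel for intrinsics-based packing, here is a minimal sketch that interleaves three byte planes into 4-byte RGBA pixels. It deliberately uses a simpler technique than the one proposed above: SSE2 unpack instructions instead of shuffle/permute, and RGBA output instead of Rgb24, because 4-byte pixels interleave cleanly while 3-byte pixels need extra shuffling. The class and method names are hypothetical.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class PlanePackingSketch
{
    // Hypothetical helper (assumption): packs R, G, B byte planes into
    // interleaved RGBA bytes with alpha fixed at 255.
    public static void PackRgbPlanesToRgba(
        ReadOnlySpan<byte> r, ReadOnlySpan<byte> g, ReadOnlySpan<byte> b, Span<byte> rgba)
    {
        int count = r.Length;
        if (g.Length != count || b.Length != count || rgba.Length < count * 4)
        {
            throw new ArgumentException("Plane or destination length mismatch.");
        }

        int i = 0;
        if (Sse2.IsSupported)
        {
            Vector128<byte> alpha = Vector128.Create((byte)255);
            for (; i + 16 <= count; i += 16)
            {
                Vector128<byte> vr = MemoryMarshal.Cast<byte, Vector128<byte>>(r.Slice(i, 16))[0];
                Vector128<byte> vg = MemoryMarshal.Cast<byte, Vector128<byte>>(g.Slice(i, 16))[0];
                Vector128<byte> vb = MemoryMarshal.Cast<byte, Vector128<byte>>(b.Slice(i, 16))[0];

                // Byte interleave: r0 g0 r1 g1 ... and b0 a0 b1 a1 ...
                Vector128<ushort> rgLo = Sse2.UnpackLow(vr, vg).AsUInt16();
                Vector128<ushort> rgHi = Sse2.UnpackHigh(vr, vg).AsUInt16();
                Vector128<ushort> baLo = Sse2.UnpackLow(vb, alpha).AsUInt16();
                Vector128<ushort> baHi = Sse2.UnpackHigh(vb, alpha).AsUInt16();

                // 16-bit interleave: r0 g0 b0 a0 r1 g1 b1 a1 ...
                Span<Vector128<byte>> dst =
                    MemoryMarshal.Cast<byte, Vector128<byte>>(rgba.Slice(i * 4, 64));
                dst[0] = Sse2.UnpackLow(rgLo, baLo).AsByte();  // pixels 0-3
                dst[1] = Sse2.UnpackHigh(rgLo, baLo).AsByte(); // pixels 4-7
                dst[2] = Sse2.UnpackLow(rgHi, baHi).AsByte();  // pixels 8-11
                dst[3] = Sse2.UnpackHigh(rgHi, baHi).AsByte(); // pixels 12-15
            }
        }

        // Scalar path: the tail, or everything when SSE2 is unavailable.
        for (; i < count; i++)
        {
            rgba[(i * 4) + 0] = r[i];
            rgba[(i * 4) + 1] = g[i];
            rgba[(i * 4) + 2] = b[i];
            rgba[(i * 4) + 3] = 255;
        }
    }
}
```

The IsSupported check plus scalar fallback mirrors the "when those are available" condition: the same method stays correct on hardware or runtimes without the intrinsics.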
API proposal for packing
The best thing is that we can handle this big task incrementally:
1. Extend PixelOperations<T> with new packing operations
2. Update JpegImagePostProcessor as described in the Optimized pipeline paragraph
The packing API is pretty straightforward:
public class PixelOperations<TPixel>
{
    // ...

    // Packs three separate 8-bit color planes into interleaved TPixel values.
    public void PackFromRgbPlanes(
        Configuration configuration,
        ReadOnlySpan<byte> redChannel,
        ReadOnlySpan<byte> greenChannel,
        ReadOnlySpan<byte> blueChannel,
        Span<TPixel> destination);
}
We can define a default implementation in the base PixelOperations<TPixel> class, and specialize it for Rgba32 and Rgb24. An optional hardcore task is to use T4 to generate a SIMD implementation for all the RGB(A)-like formats.
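The default-plus-specialization idea could look roughly like the sketch below. Everything in it is a stand-in: ISketchPixel, SketchRgb24 and the SketchPixelOperations classes are hypothetical types invented for illustration, and the Configuration parameter is left out to keep the example self-contained.

```csharp
using System;

// Hypothetical stand-in for the real pixel interface (assumption).
public interface ISketchPixel
{
    void FromRgb(byte r, byte g, byte b);
}

// Hypothetical 3-byte pixel type (assumption).
public struct SketchRgb24 : ISketchPixel
{
    public byte R;
    public byte G;
    public byte B;

    public void FromRgb(byte r, byte g, byte b)
    {
        R = r;
        G = g;
        B = b;
    }
}

public class SketchPixelOperations<TPixel>
    where TPixel : struct, ISketchPixel
{
    // Default implementation: works for any pixel type, one pixel at a time.
    public virtual void PackFromRgbPlanes(
        ReadOnlySpan<byte> redChannel,
        ReadOnlySpan<byte> greenChannel,
        ReadOnlySpan<byte> blueChannel,
        Span<TPixel> destination)
    {
        for (int i = 0; i < redChannel.Length; i++)
        {
            // The span indexer returns a reference, so the pixel is updated in place.
            destination[i].FromRgb(redChannel[i], greenChannel[i], blueChannel[i]);
        }
    }
}

public sealed class SketchRgb24Operations : SketchPixelOperations<SketchRgb24>
{
    // Specialized override: writes the fields directly.
    // This is where a SIMD path would be plugged in.
    public override void PackFromRgbPlanes(
        ReadOnlySpan<byte> redChannel,
        ReadOnlySpan<byte> greenChannel,
        ReadOnlySpan<byte> blueChannel,
        Span<SketchRgb24> destination)
    {
        for (int i = 0; i < redChannel.Length; i++)
        {
            destination[i] = new SketchRgb24
            {
                R = redChannel[i],
                G = greenChannel[i],
                B = blueChannel[i],
            };
        }
    }
}
```

With this shape, callers always go through the base class method, and only the pixel formats worth optimizing need an override.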
Note
It is possible to optimize the conversion even further by doing D1->D3 in a single step, but I consider it a very hard task, both implementation-wise and architecture-wise, and prefer incremental evolution instead.