Speed up JPEG Decoder color conversion

Let's finally beat System.Drawing on the JPEG Load->Resize->Save scenario!

As discussed in #1064, it's finally possible thanks to the Intel SIMD intrinsics in .NET Core 3.1. Opening an issue so we can track this work, and hopefully get some help & feedback from the community.

/cc @Sergio0694 @saucecontrol

# Current pipeline
Summary of the steps currently done by [`ConvertColorsInto`](https://github.com/SixLabors/ImageSharp/blob/master/src/ImageSharp/Formats/Jpeg/Components/Decoder/JpegImagePostProcessor.cs#L145):

**[D]:** Data representation
*(T):* Bulk transformation between data representations

### (case a) Y+Cb+Cr planes --> Single `Rgba32` buffer
|         |  |
| ------- | :--- |
| **[D]** | **3 Planes of W\*H sized `float` jpeg color channels normalized to 0-255 (3 x `Buffer2D<float>`, Y+Cb+Cr)** |
|  *(T)*  | *[Color convert](https://github.com/SixLabors/ImageSharp/blob/master/src/ImageSharp/Formats/Jpeg/Components/Decoder/ColorConverters/JpegColorConverter.FromYCbCrSimdAvx2.cs#L77) and [pack](https://github.com/SixLabors/ImageSharp/blob/master/src/ImageSharp/Formats/Jpeg/Components/Decoder/ColorConverters/JpegColorConverter.FromYCbCrSimdAvx2.cs#L105) into a single `Vector4` buffer* |
| **[D]** | **Floating point RGBA data as `Memory<Vector4>`** |
|  *(T)*  | *Convert the `Vector4` buffer to an `Rgba32` buffer. In the `Rgba32` case, the input buffer can be handled as a homogeneous `float` buffer in which every individual `float` value is converted to a `byte`. The conversion is implemented in [`BulkConvertNormalizedFloatToByteClampOverflows`](https://github.com/SixLabors/ImageSharp/blob/master/src/ImageSharp/Common/Helpers/SimdUtils.ExtendedIntrinsics.cs#L137), utilizing AVX2 conversion and narrowing operations through `Vector<T>`* |
| **[D]** | **The result image as an `Rgba32` buffer** |

### (case b) Y+Cb+Cr planes --> Single `Rgb24` buffer
|         |  |
| ------- | :--- |
| **[D]** | **3 Planes of W\*H sized `float` jpeg color channels normalized to 0-255 (3 x `Buffer2D<float>`, Y+Cb+Cr)** |
|  *(T)*  | *[Color convert](https://github.com/SixLabors/ImageSharp/blob/master/src/ImageSharp/Formats/Jpeg/Components/Decoder/ColorConverters/JpegColorConverter.FromYCbCrSimdAvx2.cs#L77) and [pack](https://github.com/SixLabors/ImageSharp/blob/master/src/ImageSharp/Formats/Jpeg/Components/Decoder/ColorConverters/JpegColorConverter.FromYCbCrSimdAvx2.cs#L105) into a single `Vector4` buffer* |
| **[D]** | **Floating point RGBA data as `Memory<Vector4>`** |
|  *(T)*  | *Convert the `Vector4` buffer to an `Rgba32` buffer with [`BulkConvertNormalizedFloatToByteClampOverflows`](https://github.com/SixLabors/ImageSharp/blob/master/src/ImageSharp/Common/Helpers/SimdUtils.ExtendedIntrinsics.cs#L137), utilizing AVX2 conversion and narrowing operations through `Vector<T>`* |
| **[D]** | **[Temporary `Rgba32` buffer](https://github.com/SixLabors/ImageSharp/blob/master/src/ImageSharp/PixelFormats/Utils/Vector4Converters.RgbaCompatible.cs#L106)** |
|  *(T)*  | *[`PixelOperations<Rgb24>.FromRgba32()`](https://github.com/SixLabors/ImageSharp/blob/master/src/ImageSharp/PixelFormats/PixelOperations%7BTPixel%7D.Generated.cs#L451) (sub-optimal, extra transformation!)* |
| **[D]** | **The result image as an `Rgb24` buffer** |

# Optimized pipeline

### (default `Rgb24` case) Y+Cb+Cr planes --> Single `Rgb24` buffer
|         | |
| ------- | :--- |
| **D1** | **3 Planes of W\*H sized `float` jpeg color channels normalized to 0-255 (3 x `Buffer2D<float>`, Y+Cb+Cr)** |
|  *(T)*  | *Color convert the 3 planes and write the results back to the originating buffers* |
| **D2** | **3 Planes of `Buffer2D<float>`, R+G+B** |
|  *(T)*  | *Narrow the `float` buffers to `byte` buffers using `SimdUtils.BulkConvertNormalizedFloatToByteClampOverflows`* |
| **D3** | **3 Planes of `Buffer2D<byte>`, R+G+B** |
|  *(T)*  | *PACK the separate image planes (color channels) into a single `Rgb24` buffer* |
| **D4** | **The result image as an `Rgb24` buffer** |
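
The D1->D2->D3 steps above can be sketched in scalar form. This is only an illustration of the data flow, not ImageSharp code: the class and method names are hypothetical, the coefficients are the standard JFIF YCbCr constants, and it assumes the planes stay in the 0-255 range (the real `SimdUtils` helper may expect a different scale).

```csharp
using System;

// Hypothetical scalar reference for the planar pipeline; not part of ImageSharp.
public static class PlanarPipelineSketch
{
    // D1 -> D2: convert Y+Cb+Cr to R+G+B and write the results back
    // into the same three buffers (Y becomes R, Cb becomes G, Cr becomes B).
    public static void ConvertYCbCrToRgbInPlace(
        Span<float> yToR, Span<float> cbToG, Span<float> crToB)
    {
        for (int i = 0; i < yToR.Length; i++)
        {
            float y = yToR[i];
            float cb = cbToG[i] - 128f;
            float cr = crToB[i] - 128f;

            yToR[i] = y + (1.402f * cr);
            cbToG[i] = y - (0.344136f * cb) - (0.714136f * cr);
            crToB[i] = y + (1.772f * cb);
        }
    }

    // D2 -> D3: round each float, clamp it to 0..255 and narrow it to a byte.
    public static void NarrowToBytes(ReadOnlySpan<float> source, Span<byte> destination)
    {
        for (int i = 0; i < source.Length; i++)
        {
            float rounded = MathF.Round(source[i]);
            destination[i] = (byte)Math.Clamp(rounded, 0f, 255f);
        }
    }
}
```

A SIMD version would process whole vectors per iteration, but the per-element arithmetic stays the same.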

### (`TPixel` case) Y+Cb+Cr planes --> Single `TPixel` buffer
|         | |
| ------- | :--- |
| **D1** | **3 Planes of W\*H sized `float` jpeg color channels normalized to 0-255 (3 x `Buffer2D<float>`, Y+Cb+Cr)** |
|  *(T)*  | *All the steps from the default `Rgb24` case* |
| **D4** | **`Memory<Rgb24>`** |
|  *(T)*  | *Convert the `Rgb24` buffer to `TPixel` buffer using `PixelOperations<T>`* |
| **D5** | **The result image as a `TPixel` buffer** |

The magic is mostly in the D3->D4 transition, because we can now do the pixel packing with [shuffle](https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-intrinsics-for-shuffle-operations-1) and [permute](https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-intrinsics-for-permute-operations) intrinsics when those are available. The other fun thing is that if we decode to `Image<Rgb24>` (case b), we can omit the extra `Rgba32` -> `Rgb24` transformation.
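
To give a feel for intrinsic-based packing, here is a minimal sketch that interleaves three byte planes into 4-byte RGBA pixels. Note the substitutions: it uses the simpler SSE2 unpack instructions rather than the shuffle/permute intrinsics mentioned above, and it targets an RGBA byte layout rather than `Rgb24`, because the 3-byte case needs more involved shuffling. The class and method names are hypothetical, and the input length is assumed to be a multiple of 16.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper for illustration; not part of ImageSharp.
public static class PlanarPackSimdSketch
{
    // Packs R, G and B planes into interleaved R,G,B,A bytes with A = 255.
    public static void PackRgbaSse2(
        ReadOnlySpan<byte> r, ReadOnlySpan<byte> g, ReadOnlySpan<byte> b, Span<byte> rgba)
    {
        if (!Sse2.IsSupported)
        {
            throw new PlatformNotSupportedException("SSE2 is required for this sketch.");
        }

        if (r.Length % 16 != 0 || g.Length != r.Length || b.Length != r.Length
            || rgba.Length != r.Length * 4)
        {
            throw new ArgumentException("Plane lengths must match and be a multiple of 16.");
        }

        // Reinterpret the spans as 16-byte vectors (16 pixels per plane vector).
        ReadOnlySpan<Vector128<byte>> rv = MemoryMarshal.Cast<byte, Vector128<byte>>(r);
        ReadOnlySpan<Vector128<byte>> gv = MemoryMarshal.Cast<byte, Vector128<byte>>(g);
        ReadOnlySpan<Vector128<byte>> bv = MemoryMarshal.Cast<byte, Vector128<byte>>(b);
        Span<Vector128<byte>> dst = MemoryMarshal.Cast<byte, Vector128<byte>>(rgba);
        Vector128<byte> alpha = Vector128.Create((byte)255);

        for (int i = 0; i < rv.Length; i++)
        {
            // Byte interleave: r0 g0 r1 g1 ... and b0 a0 b1 a1 ...
            Vector128<ushort> rgLo = Sse2.UnpackLow(rv[i], gv[i]).AsUInt16();
            Vector128<ushort> rgHi = Sse2.UnpackHigh(rv[i], gv[i]).AsUInt16();
            Vector128<ushort> baLo = Sse2.UnpackLow(bv[i], alpha).AsUInt16();
            Vector128<ushort> baHi = Sse2.UnpackHigh(bv[i], alpha).AsUInt16();

            // 16-bit interleave: r0 g0 b0 a0 r1 g1 b1 a1 ... (4 pixels per vector)
            dst[(4 * i) + 0] = Sse2.UnpackLow(rgLo, baLo).AsByte();
            dst[(4 * i) + 1] = Sse2.UnpackHigh(rgLo, baLo).AsByte();
            dst[(4 * i) + 2] = Sse2.UnpackLow(rgHi, baHi).AsByte();
            dst[(4 * i) + 3] = Sse2.UnpackHigh(rgHi, baHi).AsByte();
        }
    }
}
```

A production version would also need a scalar path for the remainder and for hardware without SSE2.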

# API proposal for packing
The best thing is that we can handle this big task incrementally:
- [x] First, extend `PixelOperations<T>` by new packing operations
- [ ] Then, adapt `JpegImagePostProcessor` to the changes described in the *Optimized pipeline* section

The packing API is pretty straightforward:
```csharp
public class PixelOperations<TPixel>
{
    // ...
    
    public void PackFromRgbPlanes(
        Configuration configuration,
        ReadOnlySpan<byte> redChannel,
        ReadOnlySpan<byte> greenChannel,
        ReadOnlySpan<byte> blueChannel,
        Span<TPixel> destination);
}
```
We can define a default implementation in the base `PixelOperations<TPixel>` class and specialize it for `Rgba32` and `Rgb24`. An optional hardcore task is to use T4 to generate a SIMD implementation for all the RGB(A)-like formats.
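
A default (non-SIMD) implementation could look roughly like the sketch below. To keep it self-contained it uses a stand-in `Rgb24Sketch` struct instead of the real `TPixel` types and drops the `Configuration` parameter; both the struct and the static class are hypothetical.

```csharp
using System;
using System.Runtime.InteropServices;

// Stand-in for a 3-byte RGB pixel type; hypothetical, for illustration only.
[StructLayout(LayoutKind.Sequential)]
public struct Rgb24Sketch
{
    public byte R;
    public byte G;
    public byte B;
}

public static class DefaultPackSketch
{
    // Scalar fallback: reads one byte from each plane and writes one pixel.
    public static void PackFromRgbPlanes(
        ReadOnlySpan<byte> redChannel,
        ReadOnlySpan<byte> greenChannel,
        ReadOnlySpan<byte> blueChannel,
        Span<Rgb24Sketch> destination)
    {
        int count = redChannel.Length;
        if (greenChannel.Length != count || blueChannel.Length != count
            || destination.Length < count)
        {
            throw new ArgumentException("Channel lengths must match and fit the destination.");
        }

        for (int i = 0; i < count; i++)
        {
            destination[i] = new Rgb24Sketch
            {
                R = redChannel[i],
                G = greenChannel[i],
                B = blueChannel[i]
            };
        }
    }
}
```

Specialized overrides would replace only the loop body with vectorized code, so a scalar version like this can double as a reference for testing them.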

#### Note
It is possible to optimize the conversion even further by doing D1->D3 in a single step, but I consider that a very hard task both implementation-wise and architecture-wise, and prefer incremental evolution instead.


