Windows NT comes with multiple built-in compression methods which are provided by the RtlCompressBuffer and RtlDecompressBuffer functions.
| Identifier | Description |
|---|---|
COMPRESSION_FORMAT_LZNT1 |
LZNT1 compression (LZ77). Sometimes also referred to as New Technologies File System (NTFS) compression. |
COMPRESSION_FORMAT_XPRESS |
LZXPRESS compression (LZ77 + DIRECT2). |
COMPRESSION_FORMAT_XPRESS_HUFF |
LZXPRESS with Huffman compression. |
Other compression methods used by Windows in various data formats are:
-
LZX compression
-
LZX DELTA compression
The LZNT1 compression method is based on LZ77 compression.
The LZNT1 compressed data consists of multiple chunks. A chunk header value of 0 indicates the end of the compressed data. The following pseudo code indicates how to decompress LZNT1 compressed data.
while "compressed data" available:
read "chunk header"
if "chunk data size" is 0:
break
if the "is compressed flag" is not set:
chunk data is uncompressed
else:
read a "flag byte"
for "flag bit" in "flag byte" starting with the LSB:
if "flag bit" is not set:
read a literal (uncompressed) byte
else:
read 16-bit compression tuple
use the compression tuple to read previously
decompressed data
An LZNT1 chunk consists of:
-
a chunk header
-
chunk data
The chunk data (or compression block) consists of tagged compression groups.
The LZNT1 chunk header is 2 bytes of size and consists of:
| Offset | Size | Value | Description |
|---|---|---|---|
0.0 |
12 bits |
Chunk data size |
|
1.4 |
3 bits |
Signature value |
|
1.7 |
1 bit |
Is compressed flag |
A tagged group consist of 8 values (not bytes) preceded by a tag byte:
tag A B C D E F G H
The LSB of the tag byte represents the first value in the group, the MSB the last
-
a tag bit of 0 indicates an uncompressed byte;
-
a tag bit of 1 indicates compressed data using a little-endian 16-bit (2-byte) compression tuple (meaning combination of two values).
The compression tuple contains an offset (back reference) and a size value.
Where the size is the actual size minus 3. Use the following calculation to correct the size value in the tuple.
size = size + 3
And the offset a positive representation of a back reference minus 1. Use the following calculation to correct the offset value in the tuple.
offset = -1 * ( offset + 1 )
The compression tuple uses a dynamic amount of bits to store the offset and size values.
The calculation of the amount of bits used for the offset and size values is as following:
-
at the uncompressed data block offset 0, the size is stored in the least significant 12 bits of size and the offset 4 bits
-
the larger the uncompressed data block offset, the larger the amount of bits are used for the offset value and the smaller the amount of bits for the size .
The following calculation is used to determine the amount of bits to store the offset and size values.
compression_tuple_size_offset_shift = 12;
compression_tuple_size_mask = 0xfff;
for( iterator = uncompressed_data_block_offset - 1;
iterator >= 0x10;
iterator >>= 1 )
{
/* bit shift for the offset value */
compression_tuple_size_offset_shift--;
/* bit mask for size value */
compression_tuple_size_mask >>= 1;
}
The tuple is uncompressed by copying the byte at the offset in the uncompressed data to the end of the uncompressed data. This is repeated for the size value of the tuple.
|
Note
|
The offset value itself does not change, the offset remains fixed relative to the end of the uncompressed data. However this means that for every increment of the size value the offset refers to another byte in the uncompressed data. Consider the following example. |
Consider the following tagged compression group:
0x02 0x20 0xfc 0x0f
The tag byte consists of:
0x02 => 00000010b
This means that the 2nd and 3rd values contain a 16-bit compression tuple.
0x0ffc
Because this compression tuple is near the start of the uncompressed data the offset shift is 12 and the size mask is 0x0fff.
offset: 0x0ffc >> 12 => -1 * ( 0 + 1 ) => -1 size: 0x0ffc & 0x0fff => 4092 + 3 => 4095
The algorithm starts with an uncompressed value of 0x20 which represents the space character (ASCII). This value is added to the uncompressed data. Next the algorithm reads the compression tuple and determines the offset and size values. The offset refers to the previous space value in the uncompressed data and add this to uncompressed data. And so on. Note that the offset remains referring to the last value in the uncompressed data. In the end we end up with a block of 4096 spaces.
Now consider the following uncompressed data:
#include <ntfs.h>\n #include <stdio.h>\n
Note that the \n is the string representation of the newline character (ASCII: 0x0a)
This is logically compressed to:
#include <ntfs.h>\n(-18,10)stdio(-17,4)
In the example above the tuples are represented by (offset,size).
The first part of this is data stored with tag bytes looks like:
00000000b '#' 'i' 'n' 'c' 'l' 'u' 'd' 'e' 00000000b ' ' '<' 'n' 't' 'f' 's' '.' 'h' 00000100b '>' '\n' 0x07 0x88 's' 't' 'd' 'i' 'o' 00000001b 0x01 0x80
And in hexadecimal representation:
00000000 00 23 69 6e 63 6c 75 64 65 00 20 3c 6e 74 66 73 |.#include. <ntfs| 00000010 2e 68 04 3e 0a 07 88 3c 73 74 64 69 01 01 80 |.h.>...stdio... |
For the first tuple the offset shift is 11 and the size mask is 0x07ff. The tuple consists of:
offset: 0x8807 >> 11 => -1 * ( 17 + 1 ) => -18 size: 0x8807 & 0x07ff => 7 + 3 => 10
This tuples refer to:
(-18,10) => #include <
LZX compression is used in various data formats used on Windows, including:
-
Cabinet (CAB) file
-
Microsoft Compiled HTML Help (CHM) file
-
Windows Imaging (WIM) file
-
Windows Overlay Filter (WOF) compressed data in the New Technologies File System (NTFS)
LZX compressed data consist of:
-
one or more LZX blocks stored as a bit-stream
The bit-stream is stored in 16-bit little-endian integers, where bit values are stored front-to-back. So that the most-significant bit (MSB) of the first 16-bit integer is the first bit in the stream.
The LZX block header is variable of size and consists of:
| Offset | Size | Value | Description |
|---|---|---|---|
0 |
3 bits |
Block type |
|
0.3 |
1 bit |
Has default block size |
|
If has default block size is not set |
|||
0.4 |
16 bits |
Block size |
|
If has block size is set and compression window (unit) size > 32 KiB |
|||
2.4 |
8 bits |
Unknown (Extended block size?) |
|
If has block type is LZX_BLOCKTYPE_ALIGNED |
|||
… |
8 x 3 bits |
Array of aligned offset code sizes |
|
If has block type is LZX_BLOCKTYPE_ALIGNED or LZX_BLOCKTYPE_VERBATIM |
|||
… |
… |
256 literal symbol code sizes |
|
… |
… |
240 match header code sizes |
|
… |
… |
249 length code sizes |
|
If has block type is LZX_BLOCKTYPE_UNCOMPRESSED |
|||
… |
… |
0 |
16-bit alignment padding, which is 16-bit if already aligned |
… |
4 |
Recent offset (R0) value |
|
… |
4 |
Recent offset (R1) value |
|
… |
4 |
Recent offset (R2) value |
|
|
Note
|
Both [LZXFMT] and [MS-PATCH] indicate the block size is 24-bits of size
where [LZXFMT] indicates it is only present when the block is compressed.
|
| Value | Identifier | Description |
|---|---|---|
0 |
Undefined |
|
1 |
LZX_BLOCKTYPE_VERBATIM |
Verbatim block |
2 |
LZX_BLOCKTYPE_ALIGNED |
Aligned block |
3 |
LZX_BLOCKTYPE_UNCOMPRESSED |
Uncompressed block |
4 - 7 |
Undefined |
| Offset | Size | Value | Description |
|---|---|---|---|
0 |
20 x 4 bits |
Array of bit sizes of the pre-tree Huffman (prefix) codes |
|
… |
… |
Encoded code sizes |
Where pre-tree Huffman (prefix) codes:
| Pre-tree code | Description |
|---|---|
0 - 16 |
Directly encoded code size |
17 |
Run-length encoded code size of 0 |
18 |
Run-length encoded code size of 0 |
19 |
Run-length encoded code size |
The LZXPRESS compression method uses a combination of the LZ77 and DIRECT2 algorithms.
TODO complete section
| Offset | Size | Value | Description |
|---|---|---|---|
0 |
4 |
Compression indicator, stored in little-endian |
The MSB of the compression indicator represents the first value (not bytes) in the chunk, the LSB the last.
-
a compression indicator bit of 0 indicates an uncompressed byte;
-
a compression indicator bit of 1 indicates compressed data using a compression tuple (meaning combination of two values).
The LZXPRESS compression tuple is variable of size and consists of:
| Offset | Size | Value | Description |
|---|---|---|---|
0.0 |
3 bits |
Compression size (length) |
|
0.3 |
12 bits |
Compression offset (distance) |
If the compression size is 7 a first level extended compression size should be added to the previous compression size.
| Offset | Size | Value | Description |
|---|---|---|---|
2.0 |
4 bits |
First level extended compression size (length) |
|
2.4 |
4 bits |
See note. |
|
Note
|
The first level extended compression size is stored in a byte that is shared with other compression tuples, where the first compression tuple uses the lower 4 bits and the second compression tuple the upper 4 bits. |
If the compression size is 22 a second level extended compression size should be added to the previous compression size.
| Offset | Size | Value | Description |
|---|---|---|---|
3.0 |
8 bits |
Second level extended compression size (length) |
If the compression size is 277 a third level extended compression size should be added to the previous compression size.
| Offset | Size | Value | Description |
|---|---|---|---|
4.0 |
32 bits |
Third level extended compression size (length), stored in little-endian |
|
Note
|
The size is stored as size - 3 |
The LZXPRESS Huffman compressed data consists of multiple chunks. Each chunk consists of:
-
a Huffman (prefix) code table
-
encoded bit-stream
The LZXPRESS Huffman (prefix) code table is 256 bytes of size and consists of:
| Offset | Size | Value | Description |
|---|---|---|---|
0 |
512 x 4 bits |
Array of bit sizes of the Huffman (prefix) codes |
The 4 least significant bits (LSB) of byte 0 contain the bit size for Huffman (prefix) code 0, the 4 most significant bits (MSB) the bit size for Huffman (prefix) code 1, etc.
Where Huffman (prefix) codes:
-
0 - 255 represent literals, which mapping to byte values;
-
256 - 511 represent matches (or references), which map to compression tuples.
Code sizes greater than 0 are used to build a binary tree that map the bits in the encoded bit-stream to the Huffman (prefix) codes. The tree is build starting with the smallest code sizes and code values (symbols).
TODO: complete section
A compression tuple symbol
| Offset | Size | Value | Description |
|---|---|---|---|
0.0 |
4 bits |
Compression size (length) |
|
0.4 |
4 bits |
Compression offset (distance) |
|
1.0 |
1 bit |
Compression tuple flag |
TODO: describe compression size and extended compression size need to be read as byte values TODO: describe extended compression size
TODO: decompress up to 65536 bytes of data
The uncompressed chunk size is 65536 (0x10000) or the remaining uncompressed data size for the last chunk.