libfwnt/documentation/Compression methods.asciidoc at main · libyal/libfwnt

Windows NT compression methods

Windows NT comes with multiple built-in compression methods which are provided by the RtlCompressBuffer and RtlDecompressBuffer functions.

Identifier	Description
COMPRESSION_FORMAT_LZNT1	LZNT1 compression (LZ77). Sometimes also referred to as New Technologies File System (NTFS) compression.
COMPRESSION_FORMAT_XPRESS	LZXPRESS compression (LZ77 + DIRECT2).
COMPRESSION_FORMAT_XPRESS_HUFF	LZXPRESS with Huffman compression.

Other compression methods used by Windows in various data formats are:

LZX compression
LZX DELTA compression

LZNT1

The LZNT1 compression method is based on LZ77 compression.

The LZNT1 compressed data consists of multiple chunks. A chunk header value of 0 indicates the end of the compressed data. The following pseudo code indicates how to decompress LZNT1 compressed data.

while "compressed data" available:
    read "chunk header"

    if "chunk data size" is 0:
        break

    if the "is compressed flag" is not set:
        chunk data is uncompressed

    else:
        read a "flag byte"

        for "flag bit" in "flag byte" starting with the LSB:
            if "flag bit" is not set:
                read a literal (uncompressed) byte

            else:
                read 16-bit compression tuple

                use the compression tuple to read previously
                decompressed data

LZNT1 chunk

An LZNT1 chunk consists of:

a chunk header
chunk data

The chunk data (or compression block) consists of tagged compression groups.

LZNT1 chunk header

The LZNT1 chunk header is 2 bytes of size and consists of:

Offset	Size	Description
0.0	12 bits	Chunk data size The value is stored as chunk data size - 1
1.4	3 bits	Signature value
1.7	1 bit	Is compressed flag 0 ⇒ uncompressed 1 ⇒ compressed

LZNT1 chunk data

The LZNT1 chunk data consists of:

tagged groups

LZNT1 tagged compression group

A tagged group consist of 8 values (not bytes) preceded by a tag byte:

tag A B C D E F G H

The LSB of the tag byte represents the first value in the group, the MSB the last

a tag bit of 0 indicates an uncompressed byte;
a tag bit of 1 indicates compressed data using a little-endian 16-bit (2-byte) compression tuple (meaning combination of two values).

LZNT1 compression tuple

The compression tuple contains an offset (back reference) and a size value.

Where the size is the actual size minus 3. Use the following calculation to correct the size value in the tuple.

size = size + 3

And the offset a positive representation of a back reference minus 1. Use the following calculation to correct the offset value in the tuple.

offset = -1 * ( offset + 1 )

The compression tuple uses a dynamic amount of bits to store the offset and size values.

The calculation of the amount of bits used for the offset and size values is as following:

at the uncompressed data block offset 0, the size is stored in the least significant 12 bits of size and the offset 4 bits
the larger the uncompressed data block offset, the larger the amount of bits are used for the offset value and the smaller the amount of bits for the size .

The following calculation is used to determine the amount of bits to store the offset and size values.

compression_tuple_size_offset_shift = 12;
compression_tuple_size_mask         = 0xfff;

for( iterator = uncompressed_data_block_offset - 1;
     iterator >= 0x10;
     iterator >>= 1 )
{
	/* bit shift for the offset value */
	compression_tuple_size_offset_shift--;

	/* bit mask for size value */
	compression_tuple_size_mask >>= 1;
}

The tuple is uncompressed by copying the byte at the offset in the uncompressed data to the end of the uncompressed data. This is repeated for the size value of the tuple.

Note	The offset value itself does not change, the offset remains fixed relative to the end of the uncompressed data. However this means that for every increment of the size value the offset refers to another byte in the uncompressed data. Consider the following example.

Example

Consider the following tagged compression group:

0x02 0x20 0xfc 0x0f

The tag byte consists of:

0x02 => 00000010b

This means that the 2nd and 3rd values contain a 16-bit compression tuple.

0x0ffc

Because this compression tuple is near the start of the uncompressed data the offset shift is 12 and the size mask is 0x0fff.

offset:	0x0ffc >> 12    => -1 * ( 0 + 1 ) => -1
size:	0x0ffc & 0x0fff => 4092 + 3       => 4095

The algorithm starts with an uncompressed value of 0x20 which represents the space character (ASCII). This value is added to the uncompressed data. Next the algorithm reads the compression tuple and determines the offset and size values. The offset refers to the previous space value in the uncompressed data and add this to uncompressed data. And so on. Note that the offset remains referring to the last value in the uncompressed data. In the end we end up with a block of 4096 spaces.

Now consider the following uncompressed data:

#include <ntfs.h>\n
#include <stdio.h>\n

Note that the \n is the string representation of the newline character (ASCII: 0x0a)

This is logically compressed to:

#include <ntfs.h>\n(-18,10)stdio(-17,4)

In the example above the tuples are represented by (offset,size).

The first part of this is data stored with tag bytes looks like:

00000000b '#' 'i' 'n' 'c' 'l' 'u' 'd' 'e'
00000000b ' ' '<' 'n' 't' 'f' 's' '.' 'h'
00000100b '>' '\n' 0x07 0x88 's' 't' 'd' 'i' 'o'
00000001b 0x01 0x80

And in hexadecimal representation:

00000000  00 23 69 6e 63 6c 75 64  65 00 20 3c 6e 74 66 73  |.#include. <ntfs|
00000010  2e 68 04 3e 0a 07 88 3c  73 74 64 69 01 01 80     |.h.>...stdio... |

For the first tuple the offset shift is 11 and the size mask is 0x07ff. The tuple consists of:

offset:	0x8807 >> 11    => -1 * ( 17 + 1 ) => -18
size:	0x8807 & 0x07ff =>  7 + 3          => 10

This tuples refer to:

(-18,10) => #include <

LZX

LZX compression is used in various data formats used on Windows, including:

Cabinet (CAB) file
Microsoft Compiled HTML Help (CHM) file
Windows Imaging (WIM) file
Windows Overlay Filter (WOF) compressed data in the New Technologies File System (NTFS)

LZX compressed data consist of:

one or more LZX blocks stored as a bit-stream

Bit-stream

The bit-stream is stored in 16-bit little-endian integers, where bit values are stored front-to-back. So that the most-significant bit (MSB) of the first 16-bit integer is the first bit in the stream.

LZX block

A LZX block consists of:

block header
compressed or uncompressed block data

LZX block header

The LZX block header is variable of size and consists of:

Offset	Size	Value	Description
0	3 bits		Block type See section: Block type
0.3	1 bit		Has default block size If set the block size is 32 KiB
If has default block size is not set
0.4	16 bits		Block size
If has block size is set and compression window (unit) size > 32 KiB
2.4	8 bits		Unknown (Extended block size?)
If has block type is LZX_BLOCKTYPE_ALIGNED
…	8 x 3 bits		Array of aligned offset code sizes
If has block type is LZX_BLOCKTYPE_ALIGNED or LZX_BLOCKTYPE_VERBATIM
…	…		256 literal symbol code sizes See section: Huffman encoded code sizes
…	…		240 match header code sizes See section: Huffman encoded code sizes
…	…		249 length code sizes See section: Huffman encoded code sizes
If has block type is LZX_BLOCKTYPE_UNCOMPRESSED
…	…	0	16-bit alignment padding, which is 16-bit if already aligned This padding ensures the following data is byte aligned
…	4		Recent offset (R0) value
…	4		Recent offset (R1) value
…	4		Recent offset (R2) value

Note	Both `[LZXFMT]` and `[MS-PATCH]` indicate the block size is 24-bits of size where `[LZXFMT]` indicates it is only present when the block is compressed.

Block type

Value	Identifier	Description
0		Undefined
1	LZX_BLOCKTYPE_VERBATIM	Verbatim block
2	LZX_BLOCKTYPE_ALIGNED	Aligned block
3	LZX_BLOCKTYPE_UNCOMPRESSED	Uncompressed block
4 - 7		Undefined

Huffman encoded code sizes

Offset	Size	Value	Description
0	20 x 4 bits		Array of bit sizes of the pre-tree Huffman (prefix) codes
…	…		Encoded code sizes Encoded with the pre-tree Huffman codes

Where pre-tree Huffman (prefix) codes:

Pre-tree code	Description
0 - 16	Directly encoded code size Where `code size = ( previous code size - pre-tree symbol + 17 ) mod 17`
17	Run-length encoded code size of 0 Where the pre-tree symbol is followed by a 4-bit times to repeat value, which `number of code sizes = 4 + times to repeat value`
18	Run-length encoded code size of 0 Where the pre-tree symbol is followed by a 5-bit times to repeat value, which `number of code sizes = 20 + times to repeat value`
19	Run-length encoded code size Where the pre-tree symbol is followed by a 1-bit times to repeat value, which `number of code sizes = 4 + times to repeat value` and a 4-bit second pre-tree symbol value, which `code size value = ( previous code size - second pre-tree symbol + 17 ) mod 17`

CALL (0xe8) processing

LZX converts Intel 80x86 CALL instructions, which begin with oppcode 0xe8, to improve the compression ratio on 32-bit Intel 80x86 code.

TODO complete section

LZXPRESS

The LZXPRESS compression method uses a combination of the LZ77 and DIRECT2 algorithms.

TODO complete section

LZXPRESS chunk

A LZXPRESS chunk consists of:

compression indicator (or bitmask)
compressed data

LZXPRESS compression indictor

Offset	Size	Value	Description
0	4		Compression indicator, stored in little-endian

The MSB of the compression indicator represents the first value (not bytes) in the chunk, the LSB the last.

a compression indicator bit of 0 indicates an uncompressed byte;
a compression indicator bit of 1 indicates compressed data using a compression tuple (meaning combination of two values).

LZXPRESS compression tuple

The LZXPRESS compression tuple is variable of size and consists of:

Offset	Size	Value	Description
0.0	3 bits		Compression size (length)
0.3	12 bits		Compression offset (distance) Stored as offset - 1

If the compression size is 7 a first level extended compression size should be added to the previous compression size.

Offset	Size	Value	Description
2.0	4 bits		First level extended compression size (length)
2.4	4 bits		See note.

Note	The first level extended compression size is stored in a byte that is shared with other compression tuples, where the first compression tuple uses the lower 4 bits and the second compression tuple the upper 4 bits.

If the compression size is 22 a second level extended compression size should be added to the previous compression size.

Offset	Size	Value	Description
3.0	8 bits		Second level extended compression size (length)

If the compression size is 277 a third level extended compression size should be added to the previous compression size.

Offset	Size	Value	Description
4.0	32 bits		Third level extended compression size (length), stored in little-endian

Note	The size is stored as size - 3

LZXPRESS Huffman

The LZXPRESS Huffman compressed data consists of multiple chunks. Each chunk consists of:

a Huffman (prefix) code table
encoded bit-stream

The LZXPRESS Huffman (prefix) code table is 256 bytes of size and consists of:

Offset	Size	Value	Description
0	512 x 4 bits		Array of bit sizes of the Huffman (prefix) codes

The 4 least significant bits (LSB) of byte 0 contain the bit size for Huffman (prefix) code 0, the 4 most significant bits (MSB) the bit size for Huffman (prefix) code 1, etc.

Where Huffman (prefix) codes:

0 - 255 represent literals, which mapping to byte values;
256 - 511 represent matches (or references), which map to compression tuples.

Code sizes greater than 0 are used to build a binary tree that map the bits in the encoded bit-stream to the Huffman (prefix) codes. The tree is build starting with the smallest code sizes and code values (symbols).

TODO: complete section

A compression tuple symbol

Offset	Size	Description
0.0	4 bits	Compression size (length)
0.4	4 bits	Compression offset (distance)
1.0	1 bit	Compression tuple flag

TODO: describe compression size and extended compression size need to be read as byte values TODO: describe extended compression size

TODO: decompress up to 65536 bytes of data

The uncompressed chunk size is 65536 (0x10000) or the remaining uncompressed data size for the last chunk.

Encoded bit-stream

The encoded bit-stream is stored in 16-bit little-endian integers, where bit values are stored front-to-back. So that the most-significant bit (MSB) of the first 16-bit integer is the first bit in the stream.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Windows NT compression methods

LZNT1

LZNT1 chunk

LZNT1 chunk header

LZNT1 chunk data

LZNT1 tagged compression group

LZNT1 compression tuple

Example

LZX

Bit-stream

LZX block

LZX block header

Block type

Huffman encoded code sizes

CALL (0xe8) processing

LZXPRESS

LZXPRESS chunk

LZXPRESS compression indictor

LZXPRESS compression tuple

LZXPRESS Huffman

Encoded bit-stream

External Links

FilesExpand file tree

Compression methods.asciidoc

Latest commit

History

Compression methods.asciidoc

File metadata and controls

Windows NT compression methods

LZNT1

LZNT1 chunk

LZNT1 chunk header

LZNT1 chunk data

LZNT1 tagged compression group

LZNT1 compression tuple

Example

LZX

Bit-stream

LZX block

LZX block header

Block type

Huffman encoded code sizes

CALL (0xe8) processing

LZXPRESS

LZXPRESS chunk

LZXPRESS compression indictor

LZXPRESS compression tuple

LZXPRESS Huffman

Encoded bit-stream

External Links