Skip to content

Latest commit

 

History

History
481 lines (349 loc) · 14.6 KB

File metadata and controls

481 lines (349 loc) · 14.6 KB

Windows NT compression methods

Windows NT comes with multiple built-in compression methods which are provided by the RtlCompressBuffer and RtlDecompressBuffer functions.

Identifier Description

COMPRESSION_FORMAT_LZNT1

LZNT1 compression (LZ77). Sometimes also referred to as New Technologies File System (NTFS) compression.

COMPRESSION_FORMAT_XPRESS

LZXPRESS compression (LZ77 + DIRECT2).

COMPRESSION_FORMAT_XPRESS_HUFF

LZXPRESS with Huffman compression.

Other compression methods used by Windows in various data formats are:

  • LZX compression

  • LZX DELTA compression

LZNT1

The LZNT1 compression method is based on LZ77 compression.

The LZNT1 compressed data consists of multiple chunks. A chunk header value of 0 indicates the end of the compressed data. The following pseudo code indicates how to decompress LZNT1 compressed data.

while "compressed data" available:
    read "chunk header"

    if "chunk data size" is 0:
        break

    if the "is compressed flag" is not set:
        chunk data is uncompressed

    else:
        read a "flag byte"

        for "flag bit" in "flag byte" starting with the LSB:
            if "flag bit" is not set:
                read a literal (uncompressed) byte

            else:
                read 16-bit compression tuple

                use the compression tuple to read previously
                decompressed data

LZNT1 chunk

An LZNT1 chunk consists of:

  • a chunk header

  • chunk data

The chunk data (or compression block) consists of tagged compression groups.

LZNT1 chunk header

The LZNT1 chunk header is 2 bytes of size and consists of:

Offset Size Value Description

0.0

12 bits

Chunk data size
The value is stored as chunk data size - 1

1.4

3 bits

Signature value

1.7

1 bit

Is compressed flag
0 ⇒ uncompressed
1 ⇒ compressed

LZNT1 chunk data

The LZNT1 chunk data consists of:

  • tagged groups

LZNT1 tagged compression group

A tagged group consist of 8 values (not bytes) preceded by a tag byte:

tag A B C D E F G H

The LSB of the tag byte represents the first value in the group, the MSB the last

  • a tag bit of 0 indicates an uncompressed byte;

  • a tag bit of 1 indicates compressed data using a little-endian 16-bit (2-byte) compression tuple (meaning combination of two values).

LZNT1 compression tuple

The compression tuple contains an offset (back reference) and a size value.

Where the size is the actual size minus 3. Use the following calculation to correct the size value in the tuple.

size = size + 3

And the offset a positive representation of a back reference minus 1. Use the following calculation to correct the offset value in the tuple.

offset = -1 * ( offset + 1 )

The compression tuple uses a dynamic amount of bits to store the offset and size values.

The calculation of the amount of bits used for the offset and size values is as following:

  • at the uncompressed data block offset 0, the size is stored in the least significant 12 bits of size and the offset 4 bits

  • the larger the uncompressed data block offset, the larger the amount of bits are used for the offset value and the smaller the amount of bits for the size .

The following calculation is used to determine the amount of bits to store the offset and size values.

compression_tuple_size_offset_shift = 12;
compression_tuple_size_mask         = 0xfff;

for( iterator = uncompressed_data_block_offset - 1;
     iterator >= 0x10;
     iterator >>= 1 )
{
	/* bit shift for the offset value */
	compression_tuple_size_offset_shift--;

	/* bit mask for size value */
	compression_tuple_size_mask >>= 1;
}

The tuple is uncompressed by copying the byte at the offset in the uncompressed data to the end of the uncompressed data. This is repeated for the size value of the tuple.

Note
The offset value itself does not change, the offset remains fixed relative to the end of the uncompressed data. However this means that for every increment of the size value the offset refers to another byte in the uncompressed data. Consider the following example.

Example

Consider the following tagged compression group:

0x02 0x20 0xfc 0x0f

The tag byte consists of:

0x02 => 00000010b

This means that the 2nd and 3rd values contain a 16-bit compression tuple.

0x0ffc

Because this compression tuple is near the start of the uncompressed data the offset shift is 12 and the size mask is 0x0fff.

offset:	0x0ffc >> 12    => -1 * ( 0 + 1 ) => -1
size:	0x0ffc & 0x0fff => 4092 + 3       => 4095

The algorithm starts with an uncompressed value of 0x20 which represents the space character (ASCII). This value is added to the uncompressed data. Next the algorithm reads the compression tuple and determines the offset and size values. The offset refers to the previous space value in the uncompressed data and add this to uncompressed data. And so on. Note that the offset remains referring to the last value in the uncompressed data. In the end we end up with a block of 4096 spaces.

Now consider the following uncompressed data:

#include <ntfs.h>\n
#include <stdio.h>\n

Note that the \n is the string representation of the newline character (ASCII: 0x0a)

This is logically compressed to:

#include <ntfs.h>\n(-18,10)stdio(-17,4)

In the example above the tuples are represented by (offset,size).

The first part of this is data stored with tag bytes looks like:

00000000b '#' 'i' 'n' 'c' 'l' 'u' 'd' 'e'
00000000b ' ' '<' 'n' 't' 'f' 's' '.' 'h'
00000100b '>' '\n' 0x07 0x88 's' 't' 'd' 'i' 'o'
00000001b 0x01 0x80

And in hexadecimal representation:

00000000  00 23 69 6e 63 6c 75 64  65 00 20 3c 6e 74 66 73  |.#include. <ntfs|
00000010  2e 68 04 3e 0a 07 88 3c  73 74 64 69 01 01 80     |.h.>...stdio... |

For the first tuple the offset shift is 11 and the size mask is 0x07ff. The tuple consists of:

offset:	0x8807 >> 11    => -1 * ( 17 + 1 ) => -18
size:	0x8807 & 0x07ff =>  7 + 3          => 10

This tuples refer to:

(-18,10) => #include <

LZX

LZX compression is used in various data formats used on Windows, including:

  • Cabinet (CAB) file

  • Microsoft Compiled HTML Help (CHM) file

  • Windows Imaging (WIM) file

  • Windows Overlay Filter (WOF) compressed data in the New Technologies File System (NTFS)

LZX compressed data consist of:

  • one or more LZX blocks stored as a bit-stream

Bit-stream

The bit-stream is stored in 16-bit little-endian integers, where bit values are stored front-to-back. So that the most-significant bit (MSB) of the first 16-bit integer is the first bit in the stream.

LZX block

A LZX block consists of:

  • block header

  • compressed or uncompressed block data

LZX block header

The LZX block header is variable of size and consists of:

Offset Size Value Description

0

3 bits

Block type
See section: Block type

0.3

1 bit

Has default block size
If set the block size is 32 KiB

If has default block size is not set

0.4

16 bits

Block size

If has block size is set and compression window (unit) size > 32 KiB

2.4

8 bits

Unknown (Extended block size?)

If has block type is LZX_BLOCKTYPE_ALIGNED

…​

8 x 3 bits

Array of aligned offset code sizes

If has block type is LZX_BLOCKTYPE_ALIGNED or LZX_BLOCKTYPE_VERBATIM

…​

…​

256 literal symbol code sizes
See section: Huffman encoded code sizes

…​

…​

240 match header code sizes
See section: Huffman encoded code sizes

…​

…​

249 length code sizes
See section: Huffman encoded code sizes

If has block type is LZX_BLOCKTYPE_UNCOMPRESSED

…​

…​

0

16-bit alignment padding, which is 16-bit if already aligned
This padding ensures the following data is byte aligned

…​

4

Recent offset (R0) value

…​

4

Recent offset (R1) value

…​

4

Recent offset (R2) value

Note
Both [LZXFMT] and [MS-PATCH] indicate the block size is 24-bits of size where [LZXFMT] indicates it is only present when the block is compressed.

Block type

Value Identifier Description

0

Undefined

1

LZX_BLOCKTYPE_VERBATIM

Verbatim block

2

LZX_BLOCKTYPE_ALIGNED

Aligned block

3

LZX_BLOCKTYPE_UNCOMPRESSED

Uncompressed block

4 - 7

Undefined

Huffman encoded code sizes

Offset Size Value Description

0

20 x 4 bits

Array of bit sizes of the pre-tree Huffman (prefix) codes

…​

…​

Encoded code sizes
Encoded with the pre-tree Huffman codes

Where pre-tree Huffman (prefix) codes:

Pre-tree code Description

0 - 16

Directly encoded code size
Where code size = ( previous code size - pre-tree symbol + 17 ) mod 17

17

Run-length encoded code size of 0
Where the pre-tree symbol is followed by a 4-bit times to repeat value, which number of code sizes = 4 + times to repeat value

18

Run-length encoded code size of 0
Where the pre-tree symbol is followed by a 5-bit times to repeat value, which number of code sizes = 20 + times to repeat value

19

Run-length encoded code size
Where the pre-tree symbol is followed by a 1-bit times to repeat value, which number of code sizes = 4 + times to repeat value
and a 4-bit second pre-tree symbol value, which code size value = ( previous code size - second pre-tree symbol + 17 ) mod 17

CALL (0xe8) processing

LZX converts Intel 80x86 CALL instructions, which begin with oppcode 0xe8, to improve the compression ratio on 32-bit Intel 80x86 code.

TODO complete section

LZXPRESS

The LZXPRESS compression method uses a combination of the LZ77 and DIRECT2 algorithms.

TODO complete section

LZXPRESS chunk

A LZXPRESS chunk consists of:

  • compression indicator (or bitmask)

  • compressed data

LZXPRESS compression indictor

Offset Size Value Description

0

4

Compression indicator, stored in little-endian

The MSB of the compression indicator represents the first value (not bytes) in the chunk, the LSB the last.

  • a compression indicator bit of 0 indicates an uncompressed byte;

  • a compression indicator bit of 1 indicates compressed data using a compression tuple (meaning combination of two values).

LZXPRESS compression tuple

The LZXPRESS compression tuple is variable of size and consists of:

Offset Size Value Description

0.0

3 bits

Compression size (length)

0.3

12 bits

Compression offset (distance)
Stored as offset - 1

If the compression size is 7 a first level extended compression size should be added to the previous compression size.

Offset Size Value Description

2.0

4 bits

First level extended compression size (length)

2.4

4 bits

See note.

Note
The first level extended compression size is stored in a byte that is shared with other compression tuples, where the first compression tuple uses the lower 4 bits and the second compression tuple the upper 4 bits.

If the compression size is 22 a second level extended compression size should be added to the previous compression size.

Offset Size Value Description

3.0

8 bits

Second level extended compression size (length)

If the compression size is 277 a third level extended compression size should be added to the previous compression size.

Offset Size Value Description

4.0

32 bits

Third level extended compression size (length), stored in little-endian

Note
The size is stored as size - 3

LZXPRESS Huffman

The LZXPRESS Huffman compressed data consists of multiple chunks. Each chunk consists of:

  • a Huffman (prefix) code table

  • encoded bit-stream

The LZXPRESS Huffman (prefix) code table is 256 bytes of size and consists of:

Offset Size Value Description

0

512 x 4 bits

Array of bit sizes of the Huffman (prefix) codes

The 4 least significant bits (LSB) of byte 0 contain the bit size for Huffman (prefix) code 0, the 4 most significant bits (MSB) the bit size for Huffman (prefix) code 1, etc.

Where Huffman (prefix) codes:

  • 0 - 255 represent literals, which mapping to byte values;

  • 256 - 511 represent matches (or references), which map to compression tuples.

Code sizes greater than 0 are used to build a binary tree that map the bits in the encoded bit-stream to the Huffman (prefix) codes. The tree is build starting with the smallest code sizes and code values (symbols).

TODO: complete section

A compression tuple symbol

Offset Size Value Description

0.0

4 bits

Compression size (length)

0.4

4 bits

Compression offset (distance)

1.0

1 bit

Compression tuple flag

TODO: describe compression size and extended compression size need to be read as byte values TODO: describe extended compression size

TODO: decompress up to 65536 bytes of data

The uncompressed chunk size is 65536 (0x10000) or the remaining uncompressed data size for the last chunk.

Encoded bit-stream

The encoded bit-stream is stored in 16-bit little-endian integers, where bit values are stored front-to-back. So that the most-significant bit (MSB) of the first 16-bit integer is the first bit in the stream.