Add docs on QuantState in functional #1652

@mseeger

Description

Feature request

This is just a request for a few lines of documentation.

I'd like to use BnB for 4-bit quantization, related to KV caching. Just that, nothing fancy on top like NN ops with quantization inside. I found that the fields of QuantState are not documented, and quantize_4bit seems to do some non-obvious things.

Say I call quantize_4bit on x of shape (a, b, c, blocksize), and let num_ch = a * b * c. I get back qx, state such that:

  • state.absmax.shape = (num_ch,)
  • qx.shape = (num_ch * blocksize // 2,), qx.dtype = torch.uint8
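To make these shape relations concrete, here is a small arithmetic sketch using the concrete dimensions from the reproduction below (the packing of two 4-bit codes per uint8 byte is my reading of the observed output size, not something confirmed by the docs):

```python
# Shape arithmetic for quantize_4bit on x of shape (a, b, c, blocksize).
# Assumption: two 4-bit codes are packed into each uint8 byte.
a, b, c, blocksize = 4, 3, 38, 256
num_ch = a * b * c                  # number of quantization blocks
absmax_shape = (num_ch,)            # one absmax scale per block
qx_numel = num_ch * blocksize // 2  # packed 4-bit values stored as uint8

print(num_ch)    # 456
print(qx_numel)  # 58368
```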

For my application, I need to be able to quantize and dequantize slices of the full tensor and write them back.

My best guess was that the memory layouts of qx and state.absmax would allow me to do qx.view(a, b, c, blocksize // 2) and state.absmax.view(a, b, c) and then work with that. But this does not seem to work:

import torch
from bitsandbytes.functional import quantize_4bit

x = torch.randn(4, 3, 38, 256, device="cuda")  # x.shape = (4, 3, 38, 256)
qx, state = quantize_4bit(x, blocksize=256)

# Quantize a slice along the third dimension separately
start, end = 10, 15
partx = x[:, :, start:end, :]
qx_part, state_part = quantize_4bit(partx, blocksize=256)

# Expectation 1: absmax of the slice matches the corresponding absmax slice
full = state.absmax.view(4, 3, 38)
part = state_part.absmax.view(4, 3, 5)
torch.testing.assert_close(full[:, :, start:end], part)

# Expectation 2: packed 4-bit data of the slice matches the corresponding slice
full = qx.view(4, 3, 38, -1)
part = qx_part.view(4, 3, 5, -1)
torch.testing.assert_close(full[:, :, start:end], part)

Both assertions fail.
Could you tell me what is happening, or better, document the code in functional? As it stands, the code just calls torch.ops.bitsandbytes.quantize_4bit.default, which I cannot even find in the repo, and in any case I am not knowledgeable about CUDA.

Motivation

PyTorch has no native 4-bit quantization, and its native 8-bit quantization is poorly documented.

There is value in quantization as such: I need it to compress KV caches, and also to compress activation checkpoints (for which CPU support would be nice!).

You seem to cater mainly to high-level users who want to run their NN training or inference without bothering with the details. But since you do all the low-level work anyway, why not document it, so folks like myself can use it?

Your contribution

I'd be a happy user of BnB for 4-bit (low level) quantization if this was documented!

Labels: Documentation (Improvements or additions to documentation)