Feature request
This is just a request for a few lines of documentation.
I'd like to use BnB for 4-bit quantization, related to KV caching. Just that, nothing fancy on top like NN ops with quantization inside. I found that the fields of QuantState are not documented, and some non-obvious things seem to happen in quantize_4bit.
Say I call quantize_4bit with x of shape (a, b, c, blocksize). Let num_ch = a * b * c. I get back qx, state such that:
state.absmax.shape = (num_ch,)
qx.shape = (num_ch * blocksize // 2,), qx.dtype = torch.uint8
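For concreteness, here is a minimal check of those shapes (a sketch; I assume a CUDA device, float16 input, and the default quant_type):

import torch
from bitsandbytes.functional import quantize_4bit

a, b, c, blocksize = 4, 3, 38, 256
num_ch = a * b * c
x = torch.randn(a, b, c, blocksize, device="cuda", dtype=torch.float16)
qx, state = quantize_4bit(x, blocksize=blocksize)

assert state.absmax.shape == (num_ch,)            # one scale per block
assert qx.dtype == torch.uint8                    # two 4-bit codes per byte
assert qx.shape == (num_ch * blocksize // 2,)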
For my application, I need to be able to quantize and dequantize slices of the full tensor and write them back.
My best guess was that the memory layout of qx and state.absmax allows me to do qx.view(a, b, c, blocksize // 2) and state.absmax.view(a, b, c) and then work with those views. But this does not seem to work:
import torch
from bitsandbytes.functional import quantize_4bit

# x.shape = (4, 3, 38, 256), i.e. one block of 256 per innermost row
x = torch.randn(4, 3, 38, 256, device="cuda", dtype=torch.float16)
qx, state = quantize_4bit(x, blocksize=256)

# Quantize a slice along the third dimension on its own.
start, end = 10, 15
partx = x[:, :, start:end, :]
qx_part, state_part = quantize_4bit(partx, blocksize=256)

# Expectation 1: the per-block scales of the slice match those of the full tensor.
full = state.absmax.view(4, 3, 38)
part = state_part.absmax.view(4, 3, 5)
torch.testing.assert_close(full[:, :, start:end], part)

# Expectation 2: the packed 4-bit payload matches as well.
full = qx.view(4, 3, 38, -1)
part = qx_part.view(4, 3, 5, -1)
torch.testing.assert_close(full[:, :, start:end], part)
Both asserts fail.
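For reference, the whole-tensor round trip behaves as I expect; it is only the sliced case above that I cannot make work. A sketch, using dequantize_4bit from bitsandbytes.functional:

import torch
from bitsandbytes.functional import quantize_4bit, dequantize_4bit

x = torch.randn(4, 3, 38, 256, device="cuda", dtype=torch.float16)
qx, state = quantize_4bit(x, blocksize=256)
# blocksize, shape, and dtype are carried inside `state`.
x_hat = dequantize_4bit(qx, state)
assert x_hat.shape == x.shape  # values are lossy, but shape/dtype are restored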
Could you tell me what is happening, or better, document the code in functional? As it stands, the code just calls some torch.ops.bitsandbytes.quantize_4bit.default, which I cannot even find in the repo, and in any case I am not CUDA-knowledgeable.
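For what it is worth, these are the QuantState fields I would most like to see documented (my own reading of the source, possibly incomplete or wrong):

state.absmax      # per-block scales, shape (num_blocks,)
state.shape       # shape of the original tensor
state.dtype       # dtype of the original tensor, restored on dequantize
state.blocksize   # quantization block size
state.quant_type  # "fp4" or "nf4"
state.code        # the 16-entry 4-bit codebook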
Motivation
There is no native 4-bit quantization in PyTorch, and even the native 8-bit quantization is poorly documented.
There is value in quantization as such: I need it to compress KV caches, and also to compress activation checkpoints (for which CPU support would be nice!).
You seem to cater mainly to high-level users who want to run their NN training or inference without bothering with the internals. But since you do all the low-level work anyway, why not document it, so folks like me can use it?
Your contribution
I'd be a happy user of BnB for low-level 4-bit quantization if this were documented!