System Info
Linux
Reproduction
I am trying to implement BitsAndBytes in vLLM (https://github.com/vllm-project/vllm). My eager-mode implementation works correctly and was merged.
However, I found that the weights returned by dequantize_4bit() under CUDA graph mode differ from those in eager mode, which makes the model produce nonsense output.
Does anybody have insights into this issue?
I tried to distill it into a simple script, but that turned out to be hard, since capturing the CUDA graph is non-trivial. The repro is consistent, though, and I would be more than happy to work with community members and share the data I have collected.
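For reference, here is a sketch of the kind of standalone script I have been attempting. It assumes the public bitsandbytes.functional.quantize_4bit/dequantize_4bit API and PyTorch's torch.cuda.graph capture; the tensor shape and quant_type are placeholders, not what vLLM actually uses:

```python
# Hypothetical minimal repro sketch: compare dequantize_4bit() output
# in eager mode vs. replayed from a captured CUDA graph.
import torch

def repro():
    # Lazy import so the script only needs bitsandbytes on a CUDA machine.
    import bitsandbytes.functional as bnb_F

    # Placeholder weight; the real failure is inside vLLM's model weights.
    w = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    qweight, quant_state = bnb_F.quantize_4bit(w, quant_type="nf4")

    # Eager-mode reference.
    eager_out = bnb_F.dequantize_4bit(qweight, quant_state)

    # Warm up on a side stream before capture, as CUDA graphs require.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            bnb_F.dequantize_4bit(qweight, quant_state)
    torch.cuda.current_stream().wait_stream(s)

    # Capture the dequantization in a CUDA graph, then replay it.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        graph_out = bnb_F.dequantize_4bit(qweight, quant_state)
    g.replay()
    torch.cuda.synchronize()

    # If graph mode matched eager mode, this difference would be zero;
    # in my vLLM runs the two paths diverge.
    print("max abs diff:", (eager_out - graph_out).abs().max().item())
```

Calling repro() on a CUDA machine prints the maximum absolute difference between the eager output and the graph-replayed output. This standalone version has not reproduced the exact vLLM failure for me yet, but it shows the shape of the comparison I am doing.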
Expected behavior
CUDA graph mode is expected to output the same dequantized tensors as eager mode.