System Info
Linux
Reproduction
I am trying to implement BitsAndBytes in vLLM (https://github.com/vllm-project/vllm). My eager-mode implementation works correctly and was merged.
However, I found that the weights returned by dequantize_4bit() under CUDA graph mode differ from those in eager mode, which makes the model produce nonsense output.
Does anybody have insights into this issue?
I tried to distill it into a simple script, but that turned out to be hard, since capturing the CUDA graph is non-trivial. The repro is consistent, though, and I would be more than happy to work with community members and share the data I have collected.
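For reference, here is a sketch of the kind of standalone script I have been attempting. It assumes the public bitsandbytes.functional.quantize_4bit/dequantize_4bit API and PyTorch's torch.cuda.graph capture; the tensor shape and quant_type are placeholders, not what vLLM actually uses:

```python
# Hypothetical minimal repro sketch: compare dequantize_4bit() output
# in eager mode vs. replayed from a captured CUDA graph.
import torch

def repro():
    # Lazy import so the script only needs bitsandbytes on a CUDA machine.
    import bitsandbytes.functional as bnb_F

    # Placeholder weight; the real failure is inside vLLM's model weights.
    w = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    qweight, quant_state = bnb_F.quantize_4bit(w, quant_type="nf4")

    # Eager-mode reference.
    eager_out = bnb_F.dequantize_4bit(qweight, quant_state)

    # Warm up on a side stream before capture, as CUDA graphs require.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            bnb_F.dequantize_4bit(qweight, quant_state)
    torch.cuda.current_stream().wait_stream(s)

    # Capture the dequantization in a CUDA graph, then replay it.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        graph_out = bnb_F.dequantize_4bit(qweight, quant_state)
    g.replay()
    torch.cuda.synchronize()

    # If graph mode matched eager mode, this difference would be zero;
    # in my vLLM runs the two paths diverge.
    print("max abs diff:", (eager_out - graph_out).abs().max().item())
```

Calling repro() on a CUDA machine prints the maximum absolute difference between the eager output and the graph-replayed output. This standalone version has not reproduced the exact vLLM failure for me yet, but it shows the shape of the comparison I am doing.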
Expected behavior
CUDA graph mode is expected to output the same dequantized tensors as eager mode.