Enable certain CUDA kernels to accept specified cuda stream #1330
Dear @jeejeelee, Really cool, we weren't aware vLLM uses cudagraph. Just looked over this with Tim and overall, especially given the performance benefits this may have, this is a very strong contribution, thanks! I checked out your branch and tried running the tests, but do get the below segfault, which doesn't happen on Please also be sure to install the pre-commit hooks 🤗 |
|
@Titus-von-Koeller , Thank you for the feedback, I've corrected the error mentioned above. I'm verifying whether all the unit tests are passing. |
|
On my machine with a 3090 GPU, my test results are as follows: All tests in |
3617b6e to 49ffcdc
|
@Titus-von-Koeller please review again, thanks~ |
|
@danielhanchen I believe you're directly calling some of these C-API functions in Unsloth, so I want to make sure you've got a heads up here since this changes their signatures. |
|
@jeejeelee Thank you for the contribution! The only nit I have is the one that I noted about using A few test failures in test_kbit_backprop and test_gemv_4bit are OK and not related to this PR. I see similar results on my 4090. The generation tests passed for me. Looks nice! |
Super thanks for the heads up!! Yep we use the C API directly! |
|
I'll be off until Monday, @matthewdouglas will be taking the lead. Thanks both! |
…ytes-foundation#1330)
* Done
* fix format
* fix format
* fix format
* fix format
* Address format error and fix default arg bug
* Refine stream argument passing mechanism
* Fix bug
* Delete unused code
FIX #1308
By passing a specified stream to certain kernel functions, cudagraph can correctly capture these kernels, enabling the downstream repo vLLM to run inference in cudagraph mode, resulting in significant speed improvements for BNB models.

ping @matthewdouglas @Titus-von-Koeller @TimDettmers
cc @chenqianfzh
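
For anyone calling the C API directly, here is a rough sketch of what an extra stream argument means on the caller side. It is only an illustration under stated assumptions: `fake_launch` is a hypothetical stand-in built with `ctypes` callbacks, not a real bitsandbytes symbol, and the handle value `0x1234` is made up. In real code the handle would more likely come from something like `torch.cuda.current_stream().cuda_stream`.

```python
import ctypes

# Hypothetical C signature after the change (NOT a real bitsandbytes symbol):
#   void fake_launch(float *data, int n, cudaStream_t stream);
# cudaStream_t is an opaque pointer, so it is modelled here as c_void_p.
LAUNCH_SIG = ctypes.CFUNCTYPE(
    None, ctypes.POINTER(ctypes.c_float), ctypes.c_int, ctypes.c_void_p
)

seen_streams = []  # records which stream handle each call received


def _fake_launch_impl(data, n, stream):
    # Stand-in body: remember the stream handle and double the buffer on the CPU.
    seen_streams.append(stream)
    for i in range(n):
        data[i] *= 2.0


# Wrap the Python function so it is invoked through a C-style function pointer.
fake_launch = LAUNCH_SIG(_fake_launch_impl)


def call_with_stream(buf, stream_handle):
    """Call the stand-in function, passing the raw stream handle explicitly."""
    fake_launch(buf, len(buf), ctypes.c_void_p(stream_handle))


buf = (ctypes.c_float * 4)(1.0, 2.0, 3.0, 4.0)
call_with_stream(buf, 0x1234)  # 0x1234 is an invented handle for illustration
```

The point of the sketch is only that direct callers have to supply the stream as an additional trailing argument. A caller still using the old argument list would no longer match the C signature.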