Communicate blocksize constraints to kernels that take blocksize as a runtime argument

### Feature request

Looking through the code base, I noticed that kernels such as `kgemm_4bit_inference_naive` perform integer division by `block_size` on the GPU, where `block_size` is a runtime argument rather than a template argument. The Python front end requires blocksize to be a power of 2, but that constraint isn't communicated to the kernel, so the compiler cannot reduce the division to a bit shift. General integer division performs poorly on GPUs. I propose rewriting these kernels so that the integer divisions can be replaced with bit shifts.
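To make the idea concrete, here is a minimal host-side sketch of the arithmetic, not the actual kernel code. The helper names (`is_power_of_two`, `log2_pow2`, `block_index_div`, `block_index_shift`, `offset_in_block`) are hypothetical and do not exist in the code base. It assumes indices are non-negative and that the shift amount would be computed once on the host and passed to the kernel in place of, or alongside, the block size.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when x is a nonzero power of two.
constexpr bool is_power_of_two(uint32_t x) {
    return x != 0 && (x & (x - 1)) == 0;
}

// Hypothetical helper: log2 of a power of two. Intended to run once on the
// host so the kernel receives the shift amount instead of dividing.
inline uint32_t log2_pow2(uint32_t x) {
    assert(is_power_of_two(x));
    uint32_t shift = 0;
    while ((x >> shift) != 1u) {
        ++shift;
    }
    return shift;
}

// What a kernel does today (per the issue): a general integer division,
// because the divisor is only known at run time.
inline uint32_t block_index_div(uint32_t idx, uint32_t block_size) {
    return idx / block_size;
}

// Proposed form: the same result via a right shift, valid only because
// block_size is guaranteed to be a power of two.
inline uint32_t block_index_shift(uint32_t idx, uint32_t shift) {
    return idx >> shift;
}

// The matching remainder (idx % block_size) becomes a bit mask.
inline uint32_t offset_in_block(uint32_t idx, uint32_t block_size) {
    return idx & (block_size - 1u);
}
```

For unsigned (or known non-negative) indices the shifted and divided results are identical, so the change would not alter kernel output. An alternative with the same effect would be to make the block size a template parameter, at the cost of one kernel instantiation per supported size.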

### Motivation

Integer division is slow in general, but division by a power of two reduces to a bit shift. The kernels can't take advantage of this because the power-of-two constraint is enforced only on the Python front end.

### Your contribution

I'd be happy to submit a PR to resolve this if there isn't some deeper reason why things are written this way.

Communicate blocksize constraints to kernels that take blocksize as a runtime argument #1317

Feature request

Motivation

Your contribution

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Communicate blocksize constraints to kernels that take blocksize as a runtime argument #1317

Description

Feature request

Motivation

Your contribution

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions