[ROCm] Optimize kgemm_4bit_inference_naive for ROCm, use it for batch sizes other than 1 #255
| Job | Run time |
|---|---|
| | 3m 51s |
| | 4m 6s |
| | 5m 26s |
| | 3m 50s |
| | 4m 5s |
| | 4m 32s |
| | 2m 41s |
| | 3m 59s |
| | 3m 26s |
| | 3m 36s |
| | 21s |
| | 21s |
| | 13s |
| | 17s |
| | 17s |
| | 14s |
| | 20s |
| | 13s |
| | 20m 45s |
| | 27m 48s |
| | 14m 53s |
| | 23m 12s |
| | 19m 27s |
| | 22m 43s |
| | 27m 17s |
| | 15m 28s |
| | 4m 15s |
| | 3m 41s |
| | 7m 32s |
| | 9m 10s |
| | 5m 35s |
| | 6m 24s |
| | 6m 11s |
| | 1m 46s |
| | 7m 16s |
| | 9m 24s |
| Total | 4h 34m 35s |