Architecture and design for the NVIDIA CUDA compute backend.
The CUDA backend provides GPU-accelerated inference on NVIDIA GPUs using CUDA 13. Unlike most .NET CUDA libraries, daisi-llogos uses raw P/Invoke to the CUDA Driver API — no managed wrappers, no CUDA Runtime API, no cuDNN. This gives full control over memory management, kernel loading, and stream orchestration.
Key design choices:
- CUDA Driver API (not Runtime API): explicit context management, direct kernel loading
- NVRTC JIT compilation with PTX disk cache: architecture-specific codegen, ~0.6s cached startup
- cuBLAS integration: SGEMV for F32 matmul (ships with CUDA Toolkit, no extra dependency)
- `__dp4a` integer dot products: Q4_0 uses hardware int8 multiply-accumulate with fused RmsNorm+Q8_1 quantization
- Architecture-adaptive dispatch: Blackwell (SM 12.x) uses the float path for 4-bit quants, pre-Blackwell uses dp4a
- Partial vocab logit computation: lm_head computes VocabSize/32 tokens for greedy decode (+10% speedup)
- Per-quant row count tuning: Q8_0=2, Q4_K=3, Q6_K=10, Q4_0=2, Q4_1=8, Q5_K=1 (optimal per format)
- Aligned block repacking: Q8_0 34→36, Q4_0 18→20 bytes for 4-byte aligned uint32 reads
- Self-contained dispatch: each quant type computes its own grid/threads/smem via `AdaptiveLaunch()`, no shared variables
- Fused kernels: RmsNormResidual+Q8_1, AddRmsNorm+Q8_1, SwiGLU, GPU-side ArgMax, SplitUnequalQKV, RepeatTile
- Q8_1 activation cache: generation-based invalidation (only write ops bump the counter)
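The aligned repacking and `__dp4a` items in the list can be illustrated with a host-side C++ reference sketch. It assumes the 36-byte Q8_0 layout is the 2-byte FP16 scale, 2 bytes of padding, then the 32 quants; the list only states the sizes, so the padding position is a guess. The names `repack_q8_0`, `load_quad`, and `dp4a_ref` are hypothetical, and `dp4a_ref` emulates on the CPU what the `__dp4a` intrinsic computes on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Standard GGUF Q8_0 block: 2-byte FP16 scale followed by 32 int8 quants.
constexpr std::size_t kQ8SrcBlock = 34;
// Assumed repacked layout: scale (2) + padding (2) + 32 quants.
constexpr std::size_t kQ8DstBlock = 36;
constexpr std::size_t kQuantsPerBlock = 32;

// Repack tightly packed blocks so every block's quants start at byte offset 4.
std::vector<std::uint8_t> repack_q8_0(const std::vector<std::uint8_t>& src) {
    assert(src.size() % kQ8SrcBlock == 0);
    const std::size_t blocks = src.size() / kQ8SrcBlock;
    std::vector<std::uint8_t> dst(blocks * kQ8DstBlock, 0);
    for (std::size_t b = 0; b < blocks; ++b) {
        const std::uint8_t* in = src.data() + b * kQ8SrcBlock;
        std::uint8_t* out = dst.data() + b * kQ8DstBlock;
        std::memcpy(out, in, 2);                        // FP16 scale bits, unchanged
        std::memcpy(out + 4, in + 2, kQuantsPerBlock);  // quants moved to offset 4
    }
    return dst;
}

// Read the i-th group of four int8 quants from a repacked block as one 32-bit word.
std::uint32_t load_quad(const std::uint8_t* block, std::size_t i) {
    assert(i < kQuantsPerBlock / 4);
    std::uint32_t word;
    std::memcpy(&word, block + 4 + 4 * i, sizeof(word));
    return word;
}

// CPU emulation of a signed dp4a: four int8 x int8 products added to acc.
std::int32_t dp4a_ref(std::uint32_t a, std::uint32_t b, std::int32_t acc) {
    for (int i = 0; i < 4; ++i) {
        const auto ai = static_cast<std::int8_t>(static_cast<std::uint8_t>(a >> (8 * i)));
        const auto bi = static_cast<std::int8_t>(static_cast<std::uint8_t>(b >> (8 * i)));
        acc += static_cast<std::int32_t>(ai) * static_cast<std::int32_t>(bi);
    }
    return acc;
}
```

Because both the 36-byte stride and the quant offset are multiples of 4, every `load_quad` address is 4-byte aligned whenever the buffer base is, which is what lets a kernel fetch four quants with a single `uint32` load and feed them straight into `__dp4a`.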
```mermaid
classDiagram
    class CudaApi {
        <<static>>
        +cuInit(uint flags) CUresult
        +cuDeviceGet(out int device, int ordinal) CUresult
        +cuCtxCreate(out CUcontext ctx, uint flags, int device) CUresult
        +cuCtxDestroy(CUcontext ctx) CUresult
        +cuModuleLoadData(out CUmodule module, byte[] image) CUresult
        +cuModuleGetFunction(out CUfunction func, CUmodule module, string name) CUresult
        +cuMemAlloc(out CUdeviceptr ptr, ulong bytesize) CUresult
        +cuMemFree(CUdeviceptr ptr) CUresult
        +cuMemcpyHtoD(CUdeviceptr dst, IntPtr src, ulong byteCount) CUresult
        +cuMemcpyDtoH(IntPtr dst, CUdeviceptr src, ulong byteCount) CUresult
        +cuLaunchKernel(...) CUresult
        +cuStreamCreate(out CUstream stream, uint flags) CUresult
        +cuStreamSynchronize(CUstream stream) CUresult
    }
    class CudaContext {
        -CUcontext _handle
        +CudaContext(int deviceOrdinal)
        +MakeCurrent()
        +Dispose()
    }
    class CudaModule {
        -CUmodule _handle
        +CudaModule(byte[] cubinData)
        +GetFunction(string name) CudaFunction
        +Dispose()
    }
    class CudaDeviceMemory {
        -CUdeviceptr _handle
        -ulong _byteSize
        +CudaDeviceMemory(ulong byteSize)
        +CopyFromHost(ReadOnlySpan~byte~ data)
        +CopyToHost(Span~byte~ destination)
        +Dispose()
    }
    class CudaStream {
        -CUstream _handle
        +Launch(CudaFunction func, dim3 grid, dim3 block, void** args)
        +Synchronize()
        +Dispose()
    }
    CudaContext --> CudaModule
    CudaContext --> CudaDeviceMemory
    CudaContext --> CudaStream
    CudaModule --> CudaFunction
```
Every CUDA resource is wrapped in a SafeHandle-derived class that guarantees cleanup:
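```csharp
using Microsoft.Win32.SafeHandles;

// Conceptual pattern; the actual implementation will follow this structure
class CudaDeviceMemoryHandle : SafeHandleZeroOrMinusOneIsInvalid
{
    // ownsHandle: true tells the runtime to call ReleaseHandle when the handle is released.
    public CudaDeviceMemoryHandle() : base(ownsHandle: true) { }

    protected override bool ReleaseHandle()
    {
        // Assumes cuMemFree accepts the raw handle value as a CUdeviceptr.
        return CudaApi.cuMemFree(handle) == CUresult.CUDA_SUCCESS;
    }
}
```

This ensures GPU memory is freed even if exceptions occur or the GC collects the object.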
```mermaid
flowchart TD
    subgraph Host["Host Memory (CPU)"]
        GGUF["GGUF tensor data\n(quantized bytes)"]
        LOGITS_H["Logits output buffer"]
    end
    subgraph Device["Device Memory (GPU)"]
        WEIGHTS["Weight tensors\n(quantized, read-only)"]
        KV["KV cache\n(FP16, read-write)"]
        SCRATCH["Scratch buffers\n(FP32, reused)"]
        LOGITS_D["Logits buffer"]
    end
    GGUF -->|"cuMemcpyHtoD\n(one-time at model load)"| WEIGHTS
    LOGITS_D -->|"cuMemcpyDtoH\n(once per forward pass)"| LOGITS_H
```
| Data | Direction | When | Frequency |
|---|---|---|---|
| Model weights | Host → Device | Model load | Once |
| KV cache | Device only | Inference | Never transferred |
| Scratch buffers | Device only | Inference | Never transferred |
| Input token IDs | Host → Device | Each generate call | Once per call |
| Logits | Device → Host | Each forward pass | Once per decode step |
Key principle: Minimize host-device transfers. Weights are uploaded once. All intermediate computation stays on device. Only the final logits vector (vocab_size floats) is copied back per step.
```mermaid
flowchart LR
    CU[".cu source files"]
    NVCC["nvcc compiler\n--cubin -arch=sm_120"]
    CUBIN[".cubin binary"]
    EMBED["Embedded resource\nin Daisi.Llogos.Cuda.dll"]
    LOAD["cuModuleLoadData()\nat runtime"]
    FUNC["cuModuleGetFunction()\nper kernel name"]
    CU --> NVCC --> CUBIN --> EMBED --> LOAD --> FUNC
```
| Approach | Startup time | Runtime overhead | Deployment |
|---|---|---|---|
| PTX (JIT) | Slow (compile on first run) | None after compile | Single binary, any GPU arch |
| cubin (AOT) | Instant (no compilation) | None | Must ship per target arch |
| Fat binary | Instant | None | Larger file, multiple archs |
daisi-llogos ships pre-compiled cubin for target architectures (sm_120 for Blackwell, sm_89 for Ada Lovelace, sm_86 for Ampere). A fat binary approach may be used to bundle multiple architectures.
| sm_arch | GPU Family | Examples |
|---|---|---|
| sm_86 | Ampere | RTX 3060-3090 |
| sm_89 | Ada Lovelace | RTX 4060-4090, L40 |
| sm_100 | Blackwell (data center) | B200 |
| sm_120 | Blackwell (consumer) | RTX 5060-5090 |
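Choosing which shipped cubin to load for a given device could look roughly like the following sketch. It relies on the CUDA binary compatibility rule that a cubin built for compute capability X.y runs on devices of capability X.z with z ≥ y. `ShippedArch`, `kShipped`, and `select_cubin` are hypothetical names, and the list of shipped architectures is taken from the paragraph above, not from the actual build.

```cpp
#include <cassert>
#include <string>

// Architectures the text says are shipped as pre-compiled cubin.
struct ShippedArch {
    int major;
    int minor;
    const char* name;
};

constexpr ShippedArch kShipped[] = {
    {12, 0, "sm_120"},
    {8, 9, "sm_89"},
    {8, 6, "sm_86"},
};

// Pick the best shipped cubin for a device's compute capability.
// A cubin is usable when the major version matches and its minor version
// does not exceed the device's. Returns "" when nothing fits.
std::string select_cubin(int major, int minor) {
    assert(major >= 0 && minor >= 0);
    const ShippedArch* best = nullptr;
    for (const ShippedArch& arch : kShipped) {
        if (arch.major != major || arch.minor > minor) continue;
        if (best == nullptr || arch.minor > best->minor) best = &arch;
    }
    return best ? best->name : "";
}
```

An empty result marks a device that no shipped cubin can serve (for example compute capability 8.0 or 9.0), which is the case where a PTX fallback or an extra build target would be needed.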
The most critical optimization: combining dequantization and matrix multiplication into a single kernel pass.
```mermaid
flowchart TD
    subgraph Naive["Naive: Two Separate Kernels"]
        direction TB
        N_DEQ["Kernel 1: Dequantize\nRead Q8_0 → Write FP32"]
        N_GMEM1["Global Memory\n(FP32 weights, full size)"]
        N_MM["Kernel 2: MatMul\nRead FP32 weights × input"]
        N_DEQ --> N_GMEM1 --> N_MM
    end
    subgraph Fused["Fused: Single Kernel"]
        direction TB
        F_KERN["Fused Kernel:\nRead Q8_0 → Dequant in registers → MatMul accumulate"]
    end
    style Naive fill:#fee,stroke:#c00
    style Fused fill:#efe,stroke:#0a0
```
Naive approach: Dequantize all weights to FP32 in global memory (roughly 4× the Q8_0 size), then read them again for the matmul. That is two full passes over the weight data, the second over a buffer about four times larger.
Fused approach: Each thread block loads a tile of quantized weights, dequantizes into registers or shared memory, and immediately uses them for the matmul dot product. Weight data is read exactly once from global memory.
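The difference between the two approaches can be sketched on the host with a CPU reference for a single weight row. This is not the kernel itself: `BlockQ8`, `dequantize_row`, and `fused_dot` are hypothetical names, and the block stores its scale as a `float` for simplicity, whereas GGUF Q8_0 stores it as FP16.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified Q8_0-style block: one scale shared by 32 int8 quants.
// Assumption: the scale is a float here; GGUF stores it as FP16.
struct BlockQ8 {
    float scale;
    std::int8_t quants[32];
};

// Naive path: materialize the whole FP32 row first (the extra memory pass).
std::vector<float> dequantize_row(const std::vector<BlockQ8>& row) {
    std::vector<float> out;
    out.reserve(row.size() * 32);
    for (const BlockQ8& b : row)
        for (std::int8_t q : b.quants)
            out.push_back(b.scale * static_cast<float>(q));
    return out;
}

// Fused path: dequantize in local variables and accumulate immediately,
// so no intermediate FP32 buffer is ever written.
float fused_dot(const std::vector<BlockQ8>& row, const std::vector<float>& x) {
    assert(x.size() == row.size() * 32);
    float acc = 0.0f;
    for (std::size_t b = 0; b < row.size(); ++b) {
        float block_sum = 0.0f;
        for (std::size_t i = 0; i < 32; ++i)
            block_sum += static_cast<float>(row[b].quants[i]) * x[b * 32 + i];
        acc += row[b].scale * block_sum;  // apply the scale once per block
    }
    return acc;
}
```

Both paths compute the same dot product; the fused one simply never stores the dequantized row, which is what removes the FP32 write and re-read from the GPU version.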
```mermaid
flowchart TD
    subgraph ThreadBlock["Thread Block (e.g., 256 threads)"]
        LOAD_W["Load quantized weight tile\nfrom global memory to shared memory"]
        LOAD_X["Load input tile\nfrom global memory to shared memory"]
        DEQ["Dequantize weights\nin registers"]
        DOT["Dot product accumulate\n(registers)"]
        STORE["Write output tile\nto global memory"]
    end
    GMEM_W["Global Memory\n(quantized weights)"]
    GMEM_X["Global Memory\n(input activations, FP32)"]
    GMEM_O["Global Memory\n(output activations, FP32)"]
    GMEM_W --> LOAD_W --> DEQ --> DOT --> STORE --> GMEM_O
    GMEM_X --> LOAD_X --> DOT
```
For a matmul of [M × K] × [K × N] → [M × N]:
| Parameter | Value | Rationale |
|---|---|---|
| Block size | 256 threads | Good occupancy on most architectures |
| Grid X | `ceil(N / tile_N)` | One block column per output tile column |
| Grid Y | `ceil(M / tile_M)` | One block row per output tile row |
| Shared memory | `tile_K × (tile_M + tile_N) × sizeof(float)` | Tiles for both operands |
| Tile size | 128×128 or 64×64 | Tuned per architecture |
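The formulas in the table can be turned into a small host-side helper. This is a sketch only: `LaunchConfig`, `ceil_div`, and `make_launch_config` are hypothetical names, and `tile_k` is left as a caller-supplied parameter because the table does not fix its value.

```cpp
#include <cassert>
#include <cstddef>

struct LaunchConfig {
    unsigned grid_x;           // blocks along the N (output column) dimension
    unsigned grid_y;           // blocks along the M (output row) dimension
    unsigned block_threads;    // threads per block
    std::size_t shared_bytes;  // dynamic shared memory per block
};

// Ceiling division for positive integers.
constexpr unsigned ceil_div(unsigned a, unsigned b) { return (a + b - 1) / b; }

// Launch parameters for [M x K] x [K x N] -> [M x N].
LaunchConfig make_launch_config(unsigned m, unsigned n,
                                unsigned tile_m, unsigned tile_n, unsigned tile_k) {
    assert(m > 0 && n > 0 && tile_m > 0 && tile_n > 0 && tile_k > 0);
    LaunchConfig cfg{};
    cfg.grid_x = ceil_div(n, tile_n);
    cfg.grid_y = ceil_div(m, tile_m);
    cfg.block_threads = 256;
    cfg.shared_bytes =
        static_cast<std::size_t>(tile_k) * (tile_m + tile_n) * sizeof(float);
    return cfg;
}
```

For a single-token decode step with M = 1, N = 4096, 64×64 tiles, and an assumed `tile_k` of 16, this gives a 64×1 grid and 16 × 128 × 4 = 8192 bytes of shared memory per block.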
Multiple CUDA streams enable overlapping computation with memory transfers:
```mermaid
sequenceDiagram
    participant S1 as Stream 1 (Compute)
    participant S2 as Stream 2 (Transfer)
    participant Host
    Note over S1,Host: Layer N
    S1->>S1: Attention matmul
    S1->>S1: FFN matmul
    Note over S1,Host: Layer N+1
    S1->>S1: Attention matmul
    S2->>Host: Copy logits (if last layer)
    S1->>S1: FFN matmul
    Note over S1,Host: Synchronize
    S1->>S1: cuStreamSynchronize
```
In practice, the main benefit of multi-stream for inference is overlapping the final logits D2H transfer with the last layer's computation. The weight data is already on device, so there's no upload to overlap during inference.
```mermaid
classDiagram
    class CudaBackend {
        +string Name = "CUDA"
        -CudaContext _context
        -CudaModule _module
        -CudaStream _computeStream
        -CudaStream _transferStream
        +CreateTensor(...) CudaTensor
        +LoadTensor(...) CudaTensor
        +MatMul(output, a, b)
        +RmsNorm(output, input, weight, eps)
        +Softmax(output, input)
        +SiLU(output, input)
        +RoPE(q, k, posOffset, theta)
        +ElementMul(output, a, b)
        +ElementAdd(output, a, b)
        +Dispose()
    }
    class CudaTensor {
        +string Name
        +GgmlType Type
        +ReadOnlySpan~long~ Dimensions
        +long ElementCount
        -CudaDeviceMemory _memory
        +CopyFrom(ReadOnlySpan~byte~ data)
        +CopyTo(Span~float~ destination)
        +CUdeviceptr DevicePointer
        +Dispose()
    }
    CudaBackend --> CudaTensor : creates
    CudaBackend --> CudaContext
    CudaBackend --> CudaModule
    CudaBackend --> CudaStream
```
| Kernel name | Operation | Input types | Notes |
|---|---|---|---|
| `dequant_matmul_q8_0` | Fused dequant + matmul | Q8_0 × FP32 | Primary inference kernel |
| `dequant_matmul_q4_0` | Fused dequant + matmul | Q4_0 × FP32 | For 4-bit models |
| `dequant_matmul_q4_k` | Fused dequant + matmul | Q4_K × FP32 | For K-quant models |
| `rms_norm` | RMSNorm | FP32 | Block-level reduction |
| `softmax` | Softmax | FP32 | Numerically stable (max subtraction) |
| `silu` | SiLU activation | FP32 | Element-wise |
| `rope` | RoPE encoding | FP32 | Paired dimension rotation |
| `element_mul` | Element-wise multiply | FP32 | For SwiGLU gate |
| `element_add` | Element-wise add | FP32 | For residual connections |
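For checking kernel output, a CPU reference for two of the listed operations might look like this. It is a sketch under stated assumptions: `softmax_ref` and `rms_norm_ref` are hypothetical names, and the RMSNorm formula used is the common `x / sqrt(mean(x²) + eps) * weight` form, which the table does not spell out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax: subtract the maximum before exponentiating
// so that large logits cannot overflow exp().
std::vector<float> softmax_ref(const std::vector<float>& x) {
    assert(!x.empty());
    const float max_val = *std::max_element(x.begin(), x.end());
    std::vector<float> out(x.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = std::exp(x[i] - max_val);
        sum += out[i];
    }
    for (float& v : out) v /= sum;
    return out;
}

// RMSNorm: divide by the root mean square of the input, then apply the weight.
std::vector<float> rms_norm_ref(const std::vector<float>& x,
                                const std::vector<float>& weight, float eps) {
    assert(!x.empty() && x.size() == weight.size());
    float sum_sq = 0.0f;
    for (float v : x) sum_sq += v * v;
    const float inv_rms =
        1.0f / std::sqrt(sum_sq / static_cast<float>(x.size()) + eps);
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = x[i] * inv_rms * weight[i];
    return out;
}
```

The sum over the row in each function is the step a GPU kernel would perform as a block-level reduction; comparing kernel results against references like these within a small tolerance is one way to validate the device code.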