WebGPU Compute Backend

Browser-native GPU inference via WebGPU compute shaders. Definitions | Architecture | CUDA Backend | Vulkan Backend

Overview

The WebGPU backend (Daisi.Llogos.WebGpu) runs GGUF model inference entirely in the browser using WebGPU compute shaders. Written in TypeScript, it is the browser counterpart to the .NET CUDA and Vulkan backends. It loads GGUF models via HTTP, runs inference on the GPU, and integrates with the DAISI network via gRPC-web.

Published as @daisinet/llogos-webgpu on npm.

Key differences from the .NET backends:

Language: TypeScript (not C#). Runs in any WebGPU-capable browser or Node.js via Dawn bindings.
Shader language: WGSL (WebGPU Shading Language) — similar in capability to GLSL/SPIR-V but browser-native.
No shared memory model: Each compute dispatch is its own pass with explicit barrier semantics.
Model loading: HTTP fetch with Cache API persistence (no file system access in browser).
Architectures: Supports Llama, Qwen 2/2.5, and Qwen 3.5 (DeltaNet hybrid).

Architecture

Browser Tab / Node.js
┌──────────────────────────────────────────────┐
│  LlogosEngine                                 │
│  ┌──────────────────────────────────────────┐ │
│  │ GGUF Parser → Tensor Loader → Model      │ │
│  │ BPE Tokenizer (from GGUF metadata)       │ │
│  │ Sampler (temperature, top-k, top-p)      │ │
│  └──────────────────────────────────────────┘ │
│                      │                        │
│  ┌──────────────────────────────────────────┐ │
│  │ ComputeEngine                             │ │
│  │ ┌─────────────┐ ┌──────────────────────┐ │ │
│  │ │ BufferPool   │ │ ShaderCache          │ │ │
│  │ │ (VRAM track) │ │ (pipeline cache)     │ │ │
│  │ └─────────────┘ └──────────────────────┘ │ │
│  │ 20+ WGSL compute shaders                 │ │
│  └──────────────────────────────────────────┘ │
│                      │                        │
│              WebGPU API                       │
│              (navigator.gpu)                  │
└──────────────────────────────────────────────┘
                       │
                   GPU Hardware
            (NVIDIA / AMD / Intel / Apple)

Files

Directory	File	Purpose
`src/`	`engine.ts`	Main `LlogosEngine` class — init GPU, load model, generate tokens
`src/`	`index.ts`	Package exports
`src/gguf/`	`gguf-parser.ts`	GGUF v2/v3 binary parser (header, metadata, tensor info)
`src/gguf/`	`quantization.ts`	GgmlType enum, block sizes, type sizes
`src/gpu/`	`compute.ts`	`ComputeEngine` — dispatches all GPU operations
`src/gpu/`	`device.ts`	WebGPU adapter/device initialization, capability detection
`src/gpu/`	`buffer-pool.ts`	GPU buffer allocation, reuse, VRAM tracking
`src/gpu/`	`shader-cache.ts`	Compiled pipeline cache
`src/model/`	`llama-model.ts`	Llama/Qwen2 forward pass (standard attention + GQA)
`src/model/`	`qwen35-model.ts`	Qwen 3.5 hybrid forward pass (DeltaNet + gated attention)
`src/model/`	`kv-cache.ts`	GPU-resident KV cache per layer
`src/model/`	`sampler.ts`	Temperature, top-k, top-p, repetition penalty sampling
`src/tokenizer/`	`bpe-tokenizer.ts`	BPE tokenizer from GGUF metadata
`src/tokenizer/`	`chat-template.ts`	Jinja2-style chat template interpreter
`src/storage/`	`download-manager.ts`	HTTP streaming download with Cache API persistence

Compute Shaders (WGSL)

Shader	Operation	Workgroup
`matmul.wgsl`	F32 matrix-vector multiply	256 threads, 2D dispatch for large M
`matmul_q4.wgsl`	Fused Q4_0 dequant + matmul	256 threads
`matmul_q8.wgsl`	Native Q8_0 matmul (unpack2x16float)	256 threads
`attention.wgsl`	Multi-head attention with GQA	64 threads per head
`rmsnorm.wgsl`	RMS normalization	256-thread reduction
`rope.wgsl`	Rotary position embeddings	1 thread per pair
`embedding.wgsl`	Token embedding lookup	256 threads
`silu_mul.wgsl`	Fused SiLU gate × up (SwiGLU)	256 threads
`add.wgsl`	Element-wise add (residual connections)	256 threads
`add_bias.wgsl`	In-place bias addition	256 threads
`copy_rmsnorm.wgsl`	Fused copy + RMSNorm	256 threads
`softmax.wgsl`	Numerically stable softmax	256 threads
DeltaNet shaders
`conv1d_silu.wgsl`	Fused causal conv1d + SiLU with persistent state	256 threads
`l2_norm_groups.wgsl`	Per-group L2 normalization	256-thread reduction per group
`compute_decay_beta.wgsl`	Softplus + exp decay, sigmoid beta	64 threads
`deltanet_step.wgsl`	Full state update: mat-vec, rank-1 update, output, per-head RMSNorm	128 threads per group
`silu_gate.wgsl`	Element-wise SiLU gating	256 threads
`silu_inplace.wgsl`	In-place SiLU activation	256 threads

Supported Models

Architecture	Models	Features
Llama	TinyLlama 1.1B, Llama 3.2 1B	GQA, tied weights, 128K vocab
Qwen 2/2.5	Qwen 2.5 0.5B	Attention biases, ChatML
Qwen 3.5	Qwen 3.5 0.8B	DeltaNet hybrid (SSM + gated attention), 248K vocab

Quantization Support

Format	GPU Native	CPU Dequant
F32	Matmul	—
Q8_0	Matmul (native i8 dot)	Dequant
Q4_0	Matmul (fused dequant)	Dequant
Q4_1	—	Dequant
Q5_0	—	Dequant
Q5_1	—	Dequant
F16	—	Dequant
Q4_K	—	Dequant (partial)
Q6_K	—	Dequant

Benchmarks

Measured via Dawn WebGPU (Node.js), NVIDIA RTX 5090 (Blackwell), 32 decode tokens.

Model	Prefill	Decode	VRAM
TinyLlama 1.1B Q8_0	45 tok/s	—	1570 MB
Llama 3.2 1B Q8_0	61 tok/s	54 tok/s	2787 MB
Qwen 2.5 0.5B Q8_0	42 tok/s	37 tok/s	1592 MB
Qwen 3.5 0.8B Q8_0	17 tok/s	17 tok/s	1592 MB

Qwen 3.5 DeltaNet runs entirely on GPU — zero CPU readbacks during inference.

Chat Template Support

The engine automatically detects and applies chat templates:

Llama 3 — detected via <|start_header_id|> token. Uses <|begin_of_text|> + role headers.
ChatML (Qwen 2/2.5/3.5) — detected via <|im_start|> token. Standard <|im_start|>role\ncontent<|im_end|> format.
Llama 2 — detected via [INST] in template. Simple instruction wrapping.
Jinja2 — generic parser for tokenizer.chat_template from GGUF metadata. Supports {% for %}, {% if %}, {{ var }}, filters.

Integration with DAISI Network

The WebGPU engine integrates with the DAISI Manager as a Browser Host:

Loads models from HuggingFace via HTTP, caches in browser Cache API
Connects to the ORC (Orchestrator) directly via gRPC-web
Processes inference requests from the network
Streams tokens back in batches (10 per response) for efficient throughput
Auto-reconnects with exponential backoff

See the Browser Host Learn Page for user-facing documentation.

Testing

cd src/Daisi.Llogos.WebGpu

# Run all tests (72 tests including GPU inference via Dawn)
npm test

# Run benchmarks
npx vitest run test/benchmark.test.ts

# Build the package
npm run build

Tests use the webgpu npm package (Dawn bindings) for GPU compute in Node.js, enabling automated CI testing without a browser.

Quick Start

import { LlogosEngine } from '@daisinet/llogos-webgpu';

const engine = new LlogosEngine();
await engine.initGpu();
await engine.loadModel('https://huggingface.co/.../model.gguf');

for await (const token of engine.generate('Hello, world')) {
  process.stdout.write(token);
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WebGPU Compute Backend

Overview

Architecture

Files

Compute Shaders (WGSL)

Supported Models

Quantization Support

Benchmarks

Chat Template Support

Integration with DAISI Network

Testing

Quick Start

FilesExpand file tree

webgpu-backend.md

Latest commit

History

webgpu-backend.md

File metadata and controls

WebGPU Compute Backend

Overview

Architecture

Files

Compute Shaders (WGSL)

Supported Models

Quantization Support

Benchmarks

Chat Template Support

Integration with DAISI Network

Testing

Quick Start