Pure-Rust WebGPU Inference Engine for GGUF Models

No C++ toolchain. No Python. No llama.cpp. Just Rust and your GPU.

Airframe is the GPU inference core powering Shimmy. It runs full transformer inference directly on the GPU via WGSL compute shaders and works on NVIDIA, AMD, Intel, and Apple Silicon.

⚡ NEW in v0.2.1: TurboShimmy INT4 KV Cache — ~7× less KV VRAM with one env var. Run Llama-3.2-3B on 4 GB GPUs.

[dependencies]
airframe = "0.1"

Patent Notice: The Fused Semantic Execution (FSE) subsystem (crates/libfse) is covered by a pending US patent. The WebGPU inference runtime (attention, GGUF loader, quantization) is unencumbered MIT. See the License section for full terms.

Why Airframe?

Most Rust LLM inference libraries are thin wrappers around llama.cpp — they require a C++ toolchain, link against native libraries, and make cross-compilation painful. Airframe is different:

| | Airframe | llama.cpp bindings |
|---|---|---|
| Build toolchain | `cargo build` | C++ compiler required |
| GPU backend | WebGPU (wgpu), any GPU | CUDA / Metal / Vulkan |
| Cross-compilation | Native Rust | Complex |
| Determinism | Guaranteed | Platform-dependent |
| Dependency count | Minimal | Large C++ dep tree |
| `cargo publish` friendly | ✅ | ❌ |

Quick Start

use airframe::runtime::gpu::{GpuRuntime, SamplingParams};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let runtime = GpuRuntime::load("path/to/model.gguf").await?;
    let output = runtime
        .generate("The capital of France is", SamplingParams::default(), None)
        .await?;
    println!("{}", output);
    Ok(())
}

Or run the included example with any GGUF model:

LIBSHIMMY_MODEL_PATH=/path/to/model.gguf cargo run --example simple_flight -- "Hello, world!"

Supported Architectures

| Architecture | Models |
|---|---|
| Llama | Llama 3.2, Llama 3, Llama 2 |
| Mistral | Mistral 7B, Mixtral (dense layers) |
| Phi | Phi-3, Phi-2 |
| Qwen2 | Qwen2 7B |
| Falcon | Falcon 7B |
| GPT-NeoX | StableLM |
| Gemma | Gemma 2B |

Supported Quantization

F32 · F16 · Q4_0 · Q4_K_M · Q8_0

All quantization types are implemented in both GPU shader and CPU reference paths, with parity validation — the same model produces bit-identical output on CPU and GPU.
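
As a rough illustration of what a CPU reference path for one quantization type can look like, the sketch below dequantizes a single Q4_0 block using the block layout from the GGML/GGUF format (32 weights per block, one scale, 16 bytes of packed nibbles). The function name and the choice to pass the scale as an already-decoded `f32` are assumptions made for illustration; this is not Airframe's actual code.

```rust
/// Number of weights in one Q4_0 block (GGML convention).
const QK4_0: usize = 32;

/// Dequantize one Q4_0 block (hypothetical helper, not Airframe's API).
/// `d` is the block scale; the file stores it as f16, assumed decoded to f32 here.
/// `qs` holds 32 4-bit values: low nibbles are weights 0..16,
/// high nibbles are weights 16..32.
fn dequantize_q4_0_block(d: f32, qs: &[u8; QK4_0 / 2]) -> [f32; QK4_0] {
    let mut out = [0.0f32; QK4_0];
    for (j, &byte) in qs.iter().enumerate() {
        // Each nibble is an unsigned value in 0..=15, re-centred to -8..=7.
        let lo = (byte & 0x0F) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        out[j] = lo as f32 * d;
        out[j + QK4_0 / 2] = hi as f32 * d;
    }
    out
}
```

A parity check in this style would run the same block through the shader and compare the two outputs element by element.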

Architecture

Airframe is built around three principles:

1. Bindless WebGPU Pipeline

The GPU backend uses a bindless resource model — all weight tensors are uploaded once to GPU memory and addressed by index in the shader, eliminating per-layer bind group churn. This gives near-linear throughput scaling with context length.
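
One way to picture "uploaded once and addressed by index" is a single large buffer plus a small table of offsets. The sketch below models that idea on the CPU only. `TensorArena`, `TensorSlot`, and the single-buffer layout are assumptions for illustration; the README does not say whether Airframe uses one buffer with offsets, binding arrays, or something else.

```rust
/// Location of one tensor inside the shared buffer, in f32 elements.
#[derive(Clone, Copy, Debug, PartialEq)]
struct TensorSlot {
    offset: u32,
    len: u32,
}

/// CPU-side stand-in for one large storage buffer plus an index table
/// (hypothetical types, not Airframe's).
#[derive(Default)]
struct TensorArena {
    data: Vec<f32>,         // would be a single GPU storage buffer
    slots: Vec<TensorSlot>, // would be a small table the shader reads
}

impl TensorArena {
    /// Append a tensor once and return the index later used to address it.
    fn upload(&mut self, tensor: &[f32]) -> u32 {
        let slot = TensorSlot {
            offset: self.data.len() as u32,
            len: tensor.len() as u32,
        };
        self.data.extend_from_slice(tensor);
        self.slots.push(slot);
        (self.slots.len() - 1) as u32
    }

    /// Resolve an index to its data, as a shader would via `slots[index]`.
    fn tensor(&self, index: u32) -> &[f32] {
        let s = self.slots[index as usize];
        &self.data[s.offset as usize..(s.offset + s.len) as usize]
    }
}
```

With this shape, switching layers means passing a different index rather than rebinding resources, which is the churn the paragraph above refers to.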

2. Fused Semantic Execution (FSE)

The policy enforcement layer (crates/libfse) compiles multiple independent semantic rules into a single fused DFA evaluated during token generation. Rule evaluation cost is O(1) in rule count for shared selectors — a property that is not an optimization but an architectural inversion.

Input stream → Compiled DFA → Fused opcode table → Fail-closed decision
                                (single pass)

See fused_semantic_execution_full_markdown_reconstruction.md for the full technical specification and patent drawings.
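
The toy below illustrates only the two ideas named in the pipeline above: rules that share a selector are fused into one table entry, and the final decision fails closed. It is not the libfse implementation. The opcode bits, the `Rule` type, and both function names are hypothetical, and a plain substring search stands in for the compiled DFA.

```rust
use std::collections::BTreeMap;

// Hypothetical opcode bits; the real opcode set is not described here.
const OP_AUDIT: u8 = 0b01;
const OP_DENY: u8 = 0b10;
const KNOWN_OPS: u8 = OP_AUDIT | OP_DENY;

struct Rule {
    selector: &'static str,
    opcode: u8,
}

/// Fuse step: OR together the opcodes of all rules that share a selector,
/// so each selector maps to exactly one table entry.
fn fuse(rules: &[Rule]) -> BTreeMap<&'static str, u8> {
    let mut table = BTreeMap::new();
    for rule in rules {
        *table.entry(rule.selector).or_insert(0u8) |= rule.opcode;
    }
    table
}

/// Returns true only if the input is allowed.
fn allowed(table: &BTreeMap<&'static str, u8>, input: &str) -> bool {
    let mut fused = 0u8;
    for (selector, ops) in table {
        // Stand-in for the DFA: one check per distinct selector, not per rule.
        if input.contains(*selector) {
            fused |= *ops;
        }
    }
    // Fail closed: block on an explicit deny bit or on any unknown bit.
    (fused & OP_DENY) == 0 && (fused & !KNOWN_OPS) == 0
}
```

Adding a second or third rule on an existing selector changes the fused opcode but not the number of checks performed at evaluation time.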

3. Deterministic Sampling

Given the same model file, seed, and sampling parameters, Airframe produces identical output on every run — across restarts, machines, and GPU vendors. This makes it suitable for reproducible evaluation pipelines.
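
The sampling half of that guarantee can be sketched with a seeded PRNG and a fixed iteration order. The example below uses SplitMix64 and temperature sampling on the CPU. The types and function names are assumptions for illustration, and the sketch says nothing about how Airframe keeps the logits themselves identical across GPU vendors.

```rust
/// SplitMix64: a small, well-known PRNG whose output depends only on the seed.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f32 in [0, 1), built from the top 24 bits.
    fn next_f32(&mut self) -> f32 {
        (self.next_u64() >> 40) as f32 / (1u32 << 24) as f32
    }
}

/// Sample a token index from non-empty logits at the given temperature.
/// (Hypothetical helper; not Airframe's sampler.)
fn sample(logits: &[f32], temperature: f32, rng: &mut SplitMix64) -> usize {
    // Subtract the maximum before exponentiating for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let weights: Vec<f32> = logits
        .iter()
        .map(|l| ((l - max) / temperature).exp())
        .collect();
    let total: f32 = weights.iter().sum();
    // Walk the cumulative distribution in a fixed order.
    let mut threshold = rng.next_f32() * total;
    for (i, w) in weights.iter().enumerate() {
        if threshold < *w {
            return i;
        }
        threshold -= *w;
    }
    logits.len() - 1 // guard against rounding at the upper edge
}
```

Two samplers created with the same seed and fed the same logits return the same token sequence, which is the property a reproducible evaluation run relies on.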

Design Diagrams

┌─────────────────────────────────────────────┐
│                airframe crate               │
│                                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │   core/  │  │ family/  │  │  ops/    │  │
│  │ GGUF load│  │  Llama   │  │ attn/FFN │  │
│  │ tensors  │  │  forward │  │ RoPE/RMS │  │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  │
│       └─────────────┼─────────────┘         │
│                     ▼                        │
│  ┌──────────────────────────────────────┐    │
│  │           runtime/                   │    │
│  │   engine · KV cache · sampler        │    │
│  └─────────────────┬────────────────────┘    │
│                    ▼                         │
│  ┌──────────────────────────────────────┐    │
│  │       backend/bindless/ (WebGPU)     │    │
│  │   14 WGSL compute shaders            │    │
│  │   dequant · matmul · RoPE · attn     │    │
│  └──────────────────────────────────────┘    │
│                                              │
│  ┌──────────────────────────────────────┐    │
│  │   crates/libfse  (FSE policy engine) │    │
│  │   Patent Pending — see LICENSE note  │    │
│  └──────────────────────────────────────┘    │
└─────────────────────────────────────────────┘
         ▲ used by
┌────────┴──────────────────┐
│  Shimmy GPU Server binary  │
│  shimmy_server_gpu         │
│  HTTP · job queue · eval   │
└───────────────────────────┘

Full architecture reference: docs/architecture-map.md

⚡ TurboShimmy INT4 KV Cache

TurboShimmy is Airframe's on-GPU INT4 KV-cache compression system, shipping in v0.2.1. It compresses the KV cache from 32-bit floats to 4-bit integers with one F32 scale per head vector, entirely in WGSL compute shaders with no CPU roundtrips. The result is ~7× less KV VRAM with no measured retrieval degradation at ctx≤2048 (see Quality validation below).

One env var. ~7× less KV VRAM. Same output quality. Pure Rust, pure GPU.

# Enable TurboShimmy
SHIMMY_KV_QUANT=int4 LIBSHIMMY_MODEL_PATH=/path/to/model.gguf \
  cargo run --bin shimmy_server_gpu --release

# Or with a smaller prefill chunk (helps avoid Windows TDR resets on long prompts)
SHIMMY_KV_QUANT=int4 SHIMMY_PREFILL_CHUNK=8 LIBSHIMMY_MODEL_PATH=/path/to/model.gguf \
  cargo run --bin shimmy_server_gpu --release

Why it matters — TurboShimmy changes what fits on consumer GPUs:

| GPU VRAM | Without TurboShimmy | With TurboShimmy |
|---|---|---|
| 3 GB | Llama-3.2-1B only | Llama-3.2-3B fits ✅ |
| 4 GB | Llama-3.2-3B, ctx=2048 (tight) | Llama-3.2-3B at ctx=8192 ✅ |
| 6 GB | 3B models, short context | 7B models with reasonable context ✅ |

VRAM savings (ctx=2048):

| Model | F32 KV | INT4 KV | Savings |
|---|---|---|---|
| TinyLlama 1.1B (Q4_0) | 88 MB | ~13 MB | ~7× less |
| Llama-3.2-1B (Q4_K_M) | ~128 MB | ~18 MB | ~7× less |
| Llama-3.2-3B (Q4_K_M) | ~512 MB | ~72 MB | ~7× less |
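
The ~7× figure follows from simple arithmetic on the cache shape. The sketch below computes both sizes for one model. The function names are hypothetical, and the TinyLlama shape used in the note (22 layers, 4 KV heads, head dimension 64) is taken from the published model configuration rather than from this README, so treat it as an assumption.

```rust
/// Bytes for an F32 KV cache: K and V (factor 2), per layer, per KV head,
/// per position, with 4 bytes per element.
fn kv_bytes_f32(layers: u64, kv_heads: u64, head_dim: u64, ctx: u64) -> u64 {
    2 * layers * kv_heads * ctx * head_dim * 4
}

/// Bytes for an INT4 KV cache: each head vector takes head_dim / 2 bytes of
/// packed nibbles plus one 4-byte F32 scale.
fn kv_bytes_int4(layers: u64, kv_heads: u64, head_dim: u64, ctx: u64) -> u64 {
    2 * layers * kv_heads * ctx * (head_dim / 2 + 4)
}
```

For the assumed TinyLlama shape at ctx=2048, `kv_bytes_f32(22, 4, 64, 2048)` is 92,274,688 bytes (88 MiB) and `kv_bytes_int4(22, 4, 64, 2048)` is 12,976,128 bytes (about 12.4 MiB). The per-vector ratio is 256 / 36, roughly 7.1; the scale factor is what keeps it below the 8× that 32-bit to 4-bit would otherwise give.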

How it works: Each K/V head vector is independently quantized to 4-bit integers with a per-vector F32 scale factor (max_abs / 7.0), packed into U32s (8 nibbles each) by sh_kv_pack_int4.wgsl. Dequantization via sh_kv_unpack_int4.wgsl happens on-the-fly before each attention computation. The helical context-shift operates directly on the packed INT4 representation — no decompression needed. Zero CPU roundtrips throughout.
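
A CPU-side sketch of the quantization scheme described above may help make the layout concrete. It follows the stated scale (`max_abs / 7.0`) and the 8-nibbles-per-`u32` packing, but the function names, the two's-complement nibble encoding, and the nibble order within each word are assumptions; the actual WGSL shaders may differ.

```rust
/// Quantize one head vector to signed 4-bit values with scale = max_abs / 7.0,
/// packing 8 nibbles into each u32. The length must be a multiple of 8.
fn pack_int4(v: &[f32]) -> (f32, Vec<u32>) {
    assert!(v.len() % 8 == 0, "vector length must be a multiple of 8");
    let max_abs = v.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = max_abs / 7.0;
    // An all-zero vector quantizes to all zeros.
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    let packed = v
        .chunks(8)
        .map(|chunk| {
            chunk.iter().enumerate().fold(0u32, |word, (i, x)| {
                let q = (x * inv).round().clamp(-7.0, 7.0) as i32;
                // Assumed encoding: two's-complement nibble, element i in bits 4i..4i+4.
                word | (((q & 0xF) as u32) << (4 * i))
            })
        })
        .collect();
    (scale, packed)
}

/// Inverse of `pack_int4`: expand each u32 back into 8 scaled f32 values.
fn unpack_int4(scale: f32, packed: &[u32]) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 8);
    for &word in packed {
        for i in 0..8 {
            let nibble = ((word >> (4 * i)) & 0xF) as i32;
            // Sign-extend the 4-bit value back to -8..=7.
            let q = if nibble >= 8 { nibble - 16 } else { nibble };
            out.push(q as f32 * scale);
        }
    }
    out
}
```

Because the scale is chosen per vector, the largest element always maps to ±7 and the round-trip error for any element is at most half a quantization step.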

Quality validation: Needle-in-a-haystack benchmarks on Llama-3.2-3B show zero retrieval degradation vs F32 at ctx≤2048 across all tested insertion depths (15%, 50%, 85%). See docs/turboshimmy.md and the Shimmy wiki TurboShimmy page for full benchmark data and setup guide.

Server environment variables:

| Variable | Default | Description |
|---|---|---|
| `LIBSHIMMY_MODEL_PATH` | (required) | Path to `.gguf` model file |
| `SHIMMY_PORT` | `8080` | HTTP listener port |
| `SHIMMY_MAX_CTX` | `2048` | Maximum context window (tokens) |
| `SHIMMY_PREFILL_CHUNK` | `64` | Prefill batch size; reduce to `8` if you see TDR crashes on Windows |
| `SHIMMY_KV_QUANT` | `f32` | KV cache mode: `f32` or `int4` (TurboShimmy) |
| `SHIMMY_VRAM_LIMIT_MB` | `10500` | VRAM budget warning threshold (MB); tune for your GPU |
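
For tooling that launches or wraps the server, the variables and defaults in the table can be mirrored in a small config reader. The sketch below is one way to do that with the standard library. `ServerConfig`, `parse_or`, `config_from`, and `config_from_env` are hypothetical names, and rejecting invalid values with an error is an assumption; the README does not describe how the real server handles them.

```rust
use std::env;
use std::str::FromStr;

/// Hypothetical config struct mirroring the documented variables.
#[derive(Debug, PartialEq)]
struct ServerConfig {
    model_path: String,
    port: u16,
    max_ctx: usize,
    prefill_chunk: usize,
    kv_quant: String,
    vram_limit_mb: u64,
}

/// Parse an optional raw value, falling back to the documented default.
fn parse_or<T: FromStr>(key: &str, raw: Option<String>, default: T) -> Result<T, String> {
    match raw {
        Some(s) => s
            .parse::<T>()
            .map_err(|_| format!("invalid value for {key}: {s}")),
        None => Ok(default),
    }
}

/// Build the config from any key lookup, so it can be exercised without
/// touching the real process environment.
fn config_from(get: impl Fn(&str) -> Option<String>) -> Result<ServerConfig, String> {
    let kv_quant = get("SHIMMY_KV_QUANT").unwrap_or_else(|| "f32".to_string());
    if kv_quant != "f32" && kv_quant != "int4" {
        return Err(format!("SHIMMY_KV_QUANT must be f32 or int4, got {kv_quant}"));
    }
    Ok(ServerConfig {
        model_path: get("LIBSHIMMY_MODEL_PATH")
            .ok_or_else(|| "LIBSHIMMY_MODEL_PATH is required".to_string())?,
        port: parse_or("SHIMMY_PORT", get("SHIMMY_PORT"), 8080)?,
        max_ctx: parse_or("SHIMMY_MAX_CTX", get("SHIMMY_MAX_CTX"), 2048)?,
        prefill_chunk: parse_or("SHIMMY_PREFILL_CHUNK", get("SHIMMY_PREFILL_CHUNK"), 64)?,
        kv_quant,
        vram_limit_mb: parse_or("SHIMMY_VRAM_LIMIT_MB", get("SHIMMY_VRAM_LIMIT_MB"), 10500)?,
    })
}

/// Read the same settings from the process environment.
fn config_from_env() -> Result<ServerConfig, String> {
    config_from(|key| env::var(key).ok())
}
```

Taking the lookup as a closure keeps the defaults testable: a test can pass a fixed map, while production code calls `config_from_env()`.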

Benchmarks

Airframe has been validated on standard LLM evaluation benchmarks. Results are tracked in artifacts/.

The FSE policy layer benchmarks 27% faster than a raw aho-corasick iterator on 7 KB payloads (see crates/libfse/AUDIT_INFO.md for methodology).

To run performance baselines:

cargo bench
# or with a model:
LIBSHIMMY_MODEL_PATH=/path/to/model.gguf cargo run --bin shimmy_server_gpu --release

Development

git clone https://github.com/Michael-A-Kuykendall/airframe
cd airframe
cargo build
cargo test
cargo run --example simple_flight  # requires LIBSHIMMY_MODEL_PATH

See CONTRIBUTING.md for guidelines. See CHANGELOG.md for release history.

Ecosystem

| Project | Description |
|---|---|
| Shimmy | OpenAI-compatible inference server, powered by Airframe |
| libfse | Fused Semantic Execution policy engine; ships as part of this repo |
| shimmytok | GGUF-native tokenizer used by both Airframe and Shimmy |
| shimmyjinja | Pure-Rust Jinja2 engine for HuggingFace `chat_template` strings; live in v0.1.1, powers the prompt rendering pipeline |

License

MIT — see LICENSE.

Inference runtime (attention kernels, GGUF loader, quantization, WebGPU backend): unencumbered MIT.

FSE subsystem (crates/libfse): MIT for non-commercial use. The Fail-Closed Policy Fusion and Execution Kernel methods are covered by a pending US patent. Commercial embedding requires a separate license — contact michaelallenkuykendall@gmail.com.
