Airframe is the GPU inference core powering Shimmy. It runs full transformer inference directly on the GPU via WGSL compute shaders — works on NVIDIA, AMD, Intel, and Apple Silicon.
⚡ NEW in v0.2.1: TurboShimmy INT4 KV Cache — ~7× less KV VRAM with one env var. Run Llama-3.2-3B on 4 GB GPUs.
[dependencies]
airframe = "0.1"

Patent Notice: The Fused Semantic Execution (FSE) subsystem (crates/libfse) is covered by a pending US patent. The WebGPU inference runtime (attention, GGUF loader, quantization) is unencumbered MIT. See the license section for full terms.
Most Rust LLM inference libraries are thin wrappers around llama.cpp — they require a C++ toolchain, link against native libraries, and make cross-compilation painful. Airframe is different:
| | Airframe | llama.cpp bindings |
|---|---|---|
| Build toolchain | cargo build | C++ compiler required |
| GPU backend | WebGPU (wgpu), any GPU | CUDA / Metal / Vulkan |
| Cross-compilation | Native Rust | Complex |
| Determinism | Guaranteed | Platform-dependent |
| Dependency count | Minimal | Large C++ dep tree |
| cargo publish friendly | ✅ | ❌ |
use airframe::runtime::gpu::{GpuRuntime, SamplingParams};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let runtime = GpuRuntime::load("path/to/model.gguf").await?;
    let output = runtime
        .generate("The capital of France is", SamplingParams::default(), None)
        .await?;
    println!("{}", output);
    Ok(())
}

Or run the included example with any GGUF model:

LIBSHIMMY_MODEL_PATH=/path/to/model.gguf cargo run --example simple_flight -- "Hello, world!"

| Architecture | Models |
|---|---|
| Llama | Llama 3.2, Llama 3, Llama 2 |
| Mistral | Mistral 7B, Mixtral (dense layers) |
| Phi | Phi-3, Phi-2 |
| Qwen2 | Qwen2 7B |
| Falcon | Falcon 7B |
| GPT-NeoX | StableLM |
| Gemma | Gemma 2B |
F32 · F16 · Q4_0 · Q4_K_M · Q8_0
All quantization types are implemented in both GPU shader and CPU reference paths, with parity validation — the same model produces bit-identical output on CPU and GPU.
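To make the idea of a CPU reference path concrete, here is a minimal sketch of Q4_0 dequantization following the GGML block layout (32 weights per block, one scale, 16 bytes of packed nibbles). The `Q4Block` type and `dequantize_q4_0` function are hypothetical names, not Airframe's API, and the scale is held as `f32` for simplicity where GGUF stores it as `f16`.

```rust
/// One Q4_0 block: 32 weights sharing a single scale.
/// Hypothetical type for illustration; GGUF stores the scale as f16.
pub struct Q4Block {
    pub scale: f32,
    pub quants: [u8; 16], // two 4-bit values per byte
}

/// Dequantize one block into 32 f32 weights.
/// Low nibbles hold elements 0..16, high nibbles hold elements 16..32;
/// each nibble is stored with an offset of 8.
pub fn dequantize_q4_0(block: &Q4Block) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (j, &byte) in block.quants.iter().enumerate() {
        let lo = (byte & 0x0F) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        out[j] = lo as f32 * block.scale;
        out[j + 16] = hi as f32 * block.scale;
    }
    out
}
```

A reference function of this kind is what a parity check would compare against: run the same block through the GPU shader and the CPU path, then compare the two outputs element by element.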
Airframe is built around three principles:
The GPU backend uses a bindless resource model — all weight tensors are uploaded once to GPU memory and addressed by index in the shader, eliminating per-layer bind group churn. This gives near-linear throughput scaling with context length.
The policy enforcement layer (crates/libfse) compiles multiple independent semantic rules into a single fused DFA evaluated during token generation. Rule evaluation cost is O(1) in rule count for shared selectors — a property that is not an optimization but an architectural inversion.
Input stream → Compiled DFA → Fused opcode table → Fail-closed decision
(single pass)
See fused_semantic_execution_full_markdown_reconstruction.md for the full technical specification and patent drawings.
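The single-pass idea can be illustrated with a generic multi-pattern automaton. The sketch below is not libfse and does not reproduce its method: it is a textbook Aho-Corasick construction in which every state carries a bitmask of the rules matched on arrival, so the per-byte cost is one table lookup regardless of how many rules were compiled in. All names (`FusedMatcher`, `decide`, `Decision`) are hypothetical, and treating rules as plain byte patterns is an assumption made for brevity.

```rust
use std::collections::VecDeque;

const NO_EDGE: u32 = u32::MAX;

#[derive(Debug, PartialEq)]
pub enum Decision {
    Allow,
    Deny,
}

/// Multi-pattern automaton: each state carries a bitmask of matched rules.
pub struct FusedMatcher {
    next: Vec<[u32; 256]>, // dense transition table, one row per state
    hits: Vec<u64>,        // rule bitmask per state
}

impl FusedMatcher {
    /// Compile up to 64 non-empty byte patterns into one automaton.
    pub fn compile(patterns: &[&[u8]]) -> Option<Self> {
        if patterns.len() > 64 || patterns.iter().any(|p| p.is_empty()) {
            return None;
        }
        let mut next = vec![[NO_EDGE; 256]];
        let mut hits = vec![0u64];
        // 1. Insert every pattern into a shared trie.
        for (rule, pat) in patterns.iter().enumerate() {
            let mut s = 0usize;
            for &b in pat.iter() {
                if next[s][b as usize] == NO_EDGE {
                    let id = next.len() as u32;
                    next.push([NO_EDGE; 256]);
                    hits.push(0);
                    next[s][b as usize] = id;
                }
                s = next[s][b as usize] as usize;
            }
            hits[s] |= 1u64 << rule;
        }
        // 2. Breadth-first pass: fill missing edges via failure links so
        //    every (state, byte) pair has exactly one successor.
        let mut fail = vec![0usize; next.len()];
        let mut queue = VecDeque::new();
        for b in 0..256 {
            let t = next[0][b];
            if t == NO_EDGE {
                next[0][b] = 0;
            } else {
                queue.push_back(t as usize);
            }
        }
        while let Some(s) = queue.pop_front() {
            hits[s] |= hits[fail[s]];
            for b in 0..256 {
                let f = next[fail[s]][b];
                let t = next[s][b];
                if t == NO_EDGE {
                    next[s][b] = f;
                } else {
                    fail[t as usize] = f as usize;
                    queue.push_back(t as usize);
                }
            }
        }
        Some(Self { next, hits })
    }

    /// Single pass over the input; returns the union of matched rule bits.
    pub fn scan(&self, input: &[u8]) -> u64 {
        let mut state = 0usize;
        let mut matched = 0u64;
        for &b in input {
            state = self.next[state][b as usize] as usize;
            matched |= self.hits[state];
        }
        matched
    }
}

/// Fail-closed: a missing matcher or any matched deny rule yields Deny.
pub fn decide(matcher: Option<&FusedMatcher>, deny_mask: u64, input: &[u8]) -> Decision {
    match matcher {
        Some(m) if m.scan(input) & deny_mask == 0 => Decision::Allow,
        _ => Decision::Deny,
    }
}
```

The fail-closed part is the `_ => Decision::Deny` arm: anything other than a successfully compiled matcher reporting no deny-rule hits is rejected.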
Given the same model file, seed, and sampling parameters, Airframe produces identical output on every run — across restarts, machines, and GPU vendors. This makes it suitable for reproducible evaluation pipelines.
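One ingredient of run-to-run reproducibility is a sampler whose random stream is fully determined by the seed. The sketch below shows the general idea with a SplitMix64 generator and inverse-CDF sampling. It is an illustration only: Airframe's actual sampler and PRNG are not described here, and `SplitMix64`, `sample_token`, and `sample_sequence` are hypothetical names.

```rust
/// SplitMix64: a small, fully specified PRNG, so a given seed
/// always produces the same stream of integers.
pub struct SplitMix64(pub u64);

impl SplitMix64 {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f64 in [0, 1) built from the top 53 bits.
    pub fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Draw a token index from a probability distribution by inverse CDF.
pub fn sample_token(probs: &[f64], rng: &mut SplitMix64) -> usize {
    let u = rng.next_f64();
    let mut cumulative = 0.0;
    for (i, p) in probs.iter().enumerate() {
        cumulative += p;
        if u < cumulative {
            return i;
        }
    }
    // Guard against rounding error leaving `u` past the final bucket.
    probs.len().saturating_sub(1)
}

/// Sample `n` tokens from a fixed distribution with the given seed.
pub fn sample_sequence(probs: &[f64], seed: u64, n: usize) -> Vec<usize> {
    let mut rng = SplitMix64(seed);
    (0..n).map(|_| sample_token(probs, &mut rng)).collect()
}
```

Calling `sample_sequence` twice with the same distribution and seed returns the same token indices, which is the property a reproducible evaluation pipeline relies on.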
┌─────────────────────────────────────────────┐
│ airframe crate │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ core/ │ │ family/ │ │ ops/ │ │
│ │ GGUF load│ │ Llama │ │ attn/FFN │ │
│ │ tensors │ │ forward │ │ RoPE/RMS │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ └─────────────┼─────────────┘ │
│ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ runtime/ │ │
│ │ engine · KV cache · sampler │ │
│ └─────────────────┬────────────────────┘ │
│ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ backend/bindless/ (WebGPU) │ │
│ │ 14 WGSL compute shaders │ │
│ │ dequant · matmul · RoPE · attn │ │
│ └──────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ crates/libfse (FSE policy engine) │ │
│ │ Patent Pending — see LICENSE note │ │
│ └──────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
▲ used by
┌────────┴──────────────────┐
│ Shimmy GPU Server binary │
│ shimmy_server_gpu │
│ HTTP · job queue · eval │
└───────────────────────────┘
Full architecture reference: docs/architecture-map.md
TurboShimmy is Airframe's on-GPU INT4 KV-cache compression system, shipping in v0.2.1. It compresses the KV cache from 32-bit floats to 4-bit integers, quantized per head vector, entirely in WGSL compute shaders with no CPU roundtrips. This delivers ~7× less KV VRAM with no measured retrieval degradation at ctx≤2048 in needle-in-a-haystack tests.
One env var. ~7× less KV VRAM. Same output quality. Pure Rust, pure GPU.
# Enable TurboShimmy
SHIMMY_KV_QUANT=int4 LIBSHIMMY_MODEL_PATH=/path/to/model.gguf \
cargo run --bin shimmy_server_gpu --release

# Or with a smaller prefill chunk (prevents Windows TDR resets on long prompts)
SHIMMY_KV_QUANT=int4 SHIMMY_PREFILL_CHUNK=8 LIBSHIMMY_MODEL_PATH=/path/to/model.gguf \
cargo run --bin shimmy_server_gpu --release

Why it matters: TurboShimmy changes what fits on consumer GPUs.
| GPU VRAM | Without TurboShimmy | With TurboShimmy |
|---|---|---|
| 3 GB | Llama-3.2-1B only | Llama-3.2-3B fits ✅ |
| 4 GB | Llama-3.2-3B, ctx=2048 (tight) | Llama-3.2-3B at ctx=8192 ✅ |
| 6 GB | 3B models, short context | 7B models with reasonable context ✅ |
VRAM savings (ctx=2048):
| Model | F32 KV | INT4 KV | Savings |
|---|---|---|---|
| TinyLlama 1.1B (Q4_0) | 88 MB | ~13 MB | ~7× less |
| Llama-3.2-1B (Q4_K_M) | ~128 MB | ~18 MB | ~7× less |
| Llama-3.2-3B (Q4_K_M) | ~512 MB | ~72 MB | ~7× less |
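The ~7× figure follows from simple arithmetic, sketched below under two assumptions: the cache holds one K and one V vector per layer, KV head, and position, and the INT4 form stores half a byte per element plus one F32 scale per head vector. `KvShape` and its fields are hypothetical names for illustration.

```rust
/// Shape of a model's KV cache. Field names are illustrative.
pub struct KvShape {
    pub layers: u64,
    pub kv_heads: u64,
    pub head_dim: u64,
    pub ctx: u64,
}

impl KvShape {
    /// Number of K and V head vectors held at full context.
    fn vectors(&self) -> u64 {
        2 * self.layers * self.kv_heads * self.ctx
    }

    /// F32 cache: 4 bytes per element.
    pub fn f32_bytes(&self) -> u64 {
        self.vectors() * self.head_dim * 4
    }

    /// INT4 cache: half a byte per element plus one 4-byte scale per vector.
    pub fn int4_bytes(&self) -> u64 {
        self.vectors() * (self.head_dim / 2 + 4)
    }
}
```

Assuming TinyLlama 1.1B's shape of 22 layers, 4 KV heads, and a head dimension of 64, ctx=2048 comes out at 88 MiB for F32 and about 12.4 MiB for INT4, in line with the first table row. The ratio is 256 / 36 ≈ 7.1 rather than 8 because of the per-vector scale.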
How it works: Each K/V head vector is independently quantized to 4-bit integers with a per-vector F32 scale factor (max_abs / 7.0), packed into U32s (8 nibbles each) by sh_kv_pack_int4.wgsl. Dequantization via sh_kv_unpack_int4.wgsl happens on-the-fly before each attention computation. The helical context-shift operates directly on the packed INT4 representation — no decompression needed. Zero CPU roundtrips throughout.
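A CPU-side sketch of the quantize/pack and unpack steps described above may help. It uses the stated scale (max_abs / 7.0) and eight nibbles per U32, but the nibble order (lowest nibble first) and the two's-complement encoding are assumptions, and `pack_int4` / `unpack_int4` are hypothetical functions rather than ports of the WGSL shaders.

```rust
/// Quantize one head vector to signed 4-bit values in [-7, 7] and pack
/// eight nibbles per u32. Returns the per-vector scale and the packed words.
/// Assumes two's-complement nibbles, lowest nibble first.
pub fn pack_int4(v: &[f32]) -> (f32, Vec<u32>) {
    let max_abs = v.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = max_abs / 7.0;
    // An all-zero vector has scale 0; map everything to 0 in that case.
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    let mut words = vec![0u32; (v.len() + 7) / 8];
    for (i, &x) in v.iter().enumerate() {
        let q = (x * inv).round().clamp(-7.0, 7.0) as i32;
        let nibble = (q as u32) & 0xF;
        words[i / 8] |= nibble << ((i % 8) * 4);
    }
    (scale, words)
}

/// Reverse of `pack_int4`: sign-extend each nibble and multiply by the scale.
pub fn unpack_int4(scale: f32, words: &[u32], len: usize) -> Vec<f32> {
    (0..len)
        .map(|i| {
            let nibble = (words[i / 8] >> ((i % 8) * 4)) & 0xF;
            // Move the nibble to the top of an i32, then shift back
            // arithmetically to sign-extend it.
            let q = ((nibble << 28) as i32) >> 28;
            q as f32 * scale
        })
        .collect()
}
```

With round-to-nearest, each reconstructed element differs from the original by at most half the scale, which is the per-vector error bound this scheme trades for the smaller cache.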
Quality validation: Needle-in-a-haystack benchmarks on Llama-3.2-3B show zero retrieval degradation vs F32 at ctx≤2048 across all tested insertion depths (15%, 50%, 85%). See docs/turboshimmy.md and the Shimmy wiki TurboShimmy page for full benchmark data and setup guide.
Server environment variables:
| Variable | Default | Description |
|---|---|---|
| LIBSHIMMY_MODEL_PATH | (required) | Path to .gguf model file |
| SHIMMY_PORT | 8080 | HTTP listener port |
| SHIMMY_MAX_CTX | 2048 | Maximum context window (tokens) |
| SHIMMY_PREFILL_CHUNK | 64 | Prefill batch size; reduce to 8 if you see TDR crashes on Windows |
| SHIMMY_KV_QUANT | f32 | KV cache mode: f32 or int4 (TurboShimmy) |
| SHIMMY_VRAM_LIMIT_MB | 10500 | VRAM budget warning threshold (MB); tune for your GPU |
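For tooling that launches or wraps the server, the variables above can be modeled as a small config struct. The following is a sketch of how such a wrapper might read them with the documented defaults; `ServerConfig` and `load_config` are hypothetical, and the server's own parsing and validation may differ.

```rust
/// Server settings from the table above, with the documented defaults.
/// The struct and its field names are hypothetical.
#[derive(Debug, PartialEq)]
pub struct ServerConfig {
    pub model_path: String,
    pub port: u16,
    pub max_ctx: usize,
    pub prefill_chunk: usize,
    pub kv_quant: String,
    pub vram_limit_mb: u64,
}

/// Build a config from any key lookup, e.g. `|k| std::env::var(k).ok()`.
pub fn load_config(get: impl Fn(&str) -> Option<String>) -> Result<ServerConfig, String> {
    // Parse a numeric variable, falling back to the default when it is unset.
    fn num<T: std::str::FromStr>(raw: Option<String>, name: &str, default: T) -> Result<T, String> {
        match raw {
            None => Ok(default),
            Some(s) => s.trim().parse().map_err(|_| format!("{name}: invalid value {s:?}")),
        }
    }
    let model_path = get("LIBSHIMMY_MODEL_PATH")
        .ok_or_else(|| "LIBSHIMMY_MODEL_PATH is required".to_string())?;
    let kv_quant = get("SHIMMY_KV_QUANT").unwrap_or_else(|| "f32".to_string());
    if kv_quant != "f32" && kv_quant != "int4" {
        return Err(format!("SHIMMY_KV_QUANT: expected f32 or int4, got {kv_quant:?}"));
    }
    Ok(ServerConfig {
        model_path,
        port: num(get("SHIMMY_PORT"), "SHIMMY_PORT", 8080)?,
        max_ctx: num(get("SHIMMY_MAX_CTX"), "SHIMMY_MAX_CTX", 2048)?,
        prefill_chunk: num(get("SHIMMY_PREFILL_CHUNK"), "SHIMMY_PREFILL_CHUNK", 64)?,
        kv_quant,
        vram_limit_mb: num(get("SHIMMY_VRAM_LIMIT_MB"), "SHIMMY_VRAM_LIMIT_MB", 10500)?,
    })
}
```

Taking the lookup as a closure keeps the function testable without touching the process environment; in a real launcher it would be called as `load_config(|k| std::env::var(k).ok())`.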
Airframe has been validated on standard LLM evaluation benchmarks. Results are tracked in artifacts/.
The FSE policy layer benchmarks 27% faster than raw aho-corasick iterator on 7KB payloads (see crates/libfse/AUDIT_INFO.md for methodology).
To run performance baselines:
cargo bench
# or with a model:
LIBSHIMMY_MODEL_PATH=/path/to/model.gguf cargo run --bin shimmy_server_gpu --release

git clone https://github.com/Michael-A-Kuykendall/airframe
cd airframe
cargo build
cargo test
cargo run --example simple_flight # requires LIBSHIMMY_MODEL_PATH

See CONTRIBUTING.md for guidelines. See CHANGELOG.md for release history.
| Project | Description |
|---|---|
| Shimmy | OpenAI-compatible inference server — powered by Airframe |
| libfse | Fused Semantic Execution policy engine — ships as part of this repo |
| shimmytok | GGUF-native tokenizer used by both Airframe and Shimmy |
| shimmyjinja | Pure-Rust Jinja2 engine for HuggingFace chat_template strings — live in v0.1.1, powers the prompt rendering pipeline |
MIT — see LICENSE.
Inference runtime (attention kernels, GGUF loader, quantization, WebGPU backend): unencumbered MIT.
FSE subsystem (crates/libfse): MIT for non-commercial use. The Fail-Closed Policy Fusion and Execution Kernel methods are covered by a pending US patent. Commercial embedding requires a separate license — contact michaelallenkuykendall@gmail.com.
