Airframe

Pure-Rust WebGPU Inference Engine for GGUF Models


No C++ toolchain. No Python. No llama.cpp. Just Rust and your GPU.


Airframe is the GPU inference core powering Shimmy. It runs full transformer inference directly on the GPU via WGSL compute shaders and works on NVIDIA, AMD, Intel, and Apple Silicon GPUs.

⚡ NEW in v0.2.1: TurboShimmy INT4 KV Cache — ~7× less KV VRAM with one env var. Run Llama-3.2-3B on 4 GB GPUs.

[dependencies]
airframe = "0.2"

Patent Notice: The Fused Semantic Execution (FSE) subsystem (crates/libfse) is covered by a pending US patent. The WebGPU inference runtime (attention, GGUF loader, quantization) is unencumbered MIT. See the License section for full terms.


Why Airframe?

Most Rust LLM inference libraries are thin wrappers around llama.cpp — they require a C++ toolchain, link against native libraries, and make cross-compilation painful. Airframe is different:

| | Airframe | llama.cpp bindings |
|---|---|---|
| Build toolchain | cargo build | C++ compiler required |
| GPU backend | WebGPU (wgpu), any GPU | CUDA / Metal / Vulkan |
| Cross-compilation | Native Rust | Complex |
| Determinism | Guaranteed | Platform-dependent |
| Dependency count | Minimal | Large C++ dep tree |
| cargo publish friendly | | |

Quick Start

use airframe::runtime::gpu::{GpuRuntime, SamplingParams};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let runtime = GpuRuntime::load("path/to/model.gguf").await?;
    let output = runtime
        .generate("The capital of France is", SamplingParams::default(), None)
        .await?;
    println!("{}", output);
    Ok(())
}

Or run the included example with any GGUF model:

LIBSHIMMY_MODEL_PATH=/path/to/model.gguf cargo run --example simple_flight -- "Hello, world!"

Supported Architectures

| Architecture | Models |
|---|---|
| Llama | Llama 3.2, Llama 3, Llama 2 |
| Mistral | Mistral 7B, Mixtral (dense layers) |
| Phi | Phi-3, Phi-2 |
| Qwen2 | Qwen2 7B |
| Falcon | Falcon 7B |
| GPT-NeoX | StableLM |
| Gemma | Gemma 2B |

Supported Quantization

F32 · F16 · Q4_0 · Q4_K_M · Q8_0

All quantization types are implemented in both GPU shader and CPU reference paths, with parity validation — the same model produces bit-identical output on CPU and GPU.
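
As a minimal sketch of what a CPU reference path for one quantization type can look like, the function below dequantizes a single Q4_0 block using the ggml block layout (32 weights per block: one scale plus 16 bytes of packed nibbles). The function name, and the choice to pass the scale as an already-converted f32, are assumptions made for illustration; this is not Airframe's actual code.

```rust
/// Number of weights in one Q4_0 block (ggml layout).
const QK4_0: usize = 32;

/// Dequantize one Q4_0 block.
///
/// `d` is the per-block scale (stored as f16 on disk; assumed here to be
/// already converted to f32). `qs` holds 32 four-bit values: the low nibble
/// of byte `i` is weight `i`, the high nibble is weight `i + 16`.
fn dequantize_q4_0_block(d: f32, qs: &[u8; QK4_0 / 2]) -> [f32; QK4_0] {
    let mut out = [0.0f32; QK4_0];
    for (i, &byte) in qs.iter().enumerate() {
        // Nibbles are unsigned 0..=15; subtracting 8 recenters them to -8..=7.
        let lo = (byte & 0x0F) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        out[i] = lo as f32 * d;
        out[i + QK4_0 / 2] = hi as f32 * d;
    }
    out
}
```

A parity check would run the same block through the GPU shader and compare the two outputs element by element.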


Architecture

Airframe is built around three principles:

1. Bindless WebGPU Pipeline

The GPU backend uses a bindless resource model — all weight tensors are uploaded once to GPU memory and addressed by index in the shader, eliminating per-layer bind group churn. This gives near-linear throughput scaling with context length.
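
The idea of addressing tensors by index can be illustrated on the CPU side with a single shared buffer and a small slot table. This is only a conceptual sketch: `WeightArena` and `TensorSlot` are hypothetical names, and the actual wgpu buffer and bind group setup is omitted entirely.

```rust
/// Location of one tensor inside a single shared weight buffer.
#[derive(Clone, Copy, Debug, PartialEq)]
struct TensorSlot {
    offset: usize, // start, in f32 elements
    len: usize,    // length, in f32 elements
}

/// Hypothetical weight arena: all tensors live in one buffer that would be
/// uploaded once; a small table maps a tensor index to its slot.
struct WeightArena {
    data: Vec<f32>,
    slots: Vec<TensorSlot>,
}

impl WeightArena {
    fn new() -> Self {
        Self { data: Vec::new(), slots: Vec::new() }
    }

    /// Append a tensor and return the index a shader would use to find it.
    fn push(&mut self, tensor: &[f32]) -> usize {
        let slot = TensorSlot { offset: self.data.len(), len: tensor.len() };
        self.data.extend_from_slice(tensor);
        self.slots.push(slot);
        self.slots.len() - 1
    }

    /// Look a tensor up by index, the way a kernel would read the slot table
    /// instead of relying on a per-layer bind group.
    fn get(&self, index: usize) -> &[f32] {
        let s = self.slots[index];
        &self.data[s.offset..s.offset + s.len]
    }
}
```

With this layout, switching layers means passing a different index rather than rebinding resources.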

2. Fused Semantic Execution (FSE)

The policy enforcement layer (crates/libfse) compiles multiple independent semantic rules into a single fused DFA that is evaluated during token generation. For rules that share a selector, evaluation cost is O(1) in the number of rules: adding another rule on the same selector adds no per-token work. This follows from the architecture itself rather than from an optimization applied on top of it.

Input stream → Compiled DFA → Fused opcode table → Fail-closed decision
                                (single pass)

See fused_semantic_execution_full_markdown_reconstruction.md for the full technical specification and patent drawings.
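
The single-pass idea can be illustrated with a generic, textbook Aho-Corasick automaton in which each state stores one pre-merged action mask. This is a sketch of the general technique only, not libfse's implementation or the patented method; the type name `FusedMatcher`, the `LOG`/`BLOCK` bit flags, and the rule format are all hypothetical.

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical action bits; a caller would treat BLOCK as a deny decision.
const LOG: u8 = 0b01;
const BLOCK: u8 = 0b10;

/// One automaton for all patterns, with one pre-merged action mask per state.
struct FusedMatcher {
    next: Vec<HashMap<u8, usize>>, // trie transitions per state
    fail: Vec<usize>,              // Aho-Corasick failure links
    fused: Vec<u8>,                // OR of the actions of every rule ending here
}

impl FusedMatcher {
    /// Compile all rules up front. Rules that share a pattern collapse into
    /// one mask, so match-time work does not grow with the number of rules.
    fn compile(rules: &[(&str, u8)]) -> Self {
        let mut m = FusedMatcher { next: vec![HashMap::new()], fail: vec![0], fused: vec![0] };
        for &(pattern, action) in rules {
            let mut s = 0;
            for &b in pattern.as_bytes() {
                let existing = m.next[s].get(&b).copied();
                s = match existing {
                    Some(t) => t,
                    None => {
                        let t = m.next.len();
                        m.next.push(HashMap::new());
                        m.fail.push(0);
                        m.fused.push(0);
                        m.next[s].insert(b, t);
                        t
                    }
                };
            }
            m.fused[s] |= action;
        }
        // Breadth-first pass: set failure links and inherit masks from suffixes.
        let mut queue: VecDeque<usize> = m.next[0].values().copied().collect();
        while let Some(s) = queue.pop_front() {
            let edges: Vec<(u8, usize)> = m.next[s].iter().map(|(&b, &t)| (b, t)).collect();
            for (b, t) in edges {
                let mut f = m.fail[s];
                while f != 0 && !m.next[f].contains_key(&b) {
                    f = m.fail[f];
                }
                m.fail[t] = m.next[f].get(&b).copied().unwrap_or(0);
                let inherited = m.fused[m.fail[t]];
                m.fused[t] |= inherited;
                queue.push_back(t);
            }
        }
        m
    }

    /// Single pass over the input; returns the OR of all actions that fired.
    fn evaluate(&self, input: &str) -> u8 {
        let (mut s, mut mask) = (0usize, 0u8);
        for &b in input.as_bytes() {
            while s != 0 && !self.next[s].contains_key(&b) {
                s = self.fail[s];
            }
            s = self.next[s].get(&b).copied().unwrap_or(0);
            mask |= self.fused[s];
        }
        mask
    }
}
```

Registering both `("secret", BLOCK)` and `("secret", LOG)` produces one state with the mask `BLOCK | LOG`, so the per-byte work in `evaluate` is the same whether one rule or many share that pattern.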

3. Deterministic Sampling

Given the same model file, seed, and sampling parameters, Airframe produces identical output on every run — across restarts, machines, and GPU vendors. This makes it suitable for reproducible evaluation pipelines.
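
The sampling half of this guarantee can be sketched with a fully specified PRNG and a cumulative-distribution draw. The example uses SplitMix64 as a stand-in; which generator Airframe actually uses is not stated here, and `sample_token` is a hypothetical helper. The sketch covers only the sampling step and says nothing about how the logits themselves are kept reproducible across GPUs.

```rust
/// SplitMix64: a small, fully specified PRNG, so the same seed yields the
/// same sequence on every platform.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f64 in [0, 1) built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Draw a token index from `probs` (assumed non-negative, summing to ~1).
fn sample_token(probs: &[f32], rng: &mut SplitMix64) -> usize {
    let r = rng.next_f64();
    let mut cumulative = 0.0f64;
    for (i, &p) in probs.iter().enumerate() {
        cumulative += p as f64;
        if r < cumulative {
            return i;
        }
    }
    probs.len() - 1 // guard against rounding when the sum is slightly below 1
}
```

Two generators created with the same seed and fed the same probabilities return the same token sequence, which is the property a reproducible evaluation run depends on.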


Design Diagrams

┌─────────────────────────────────────────────┐
│                airframe crate               │
│                                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │   core/  │  │ family/  │  │  ops/    │   │
│  │ GGUF load│  │  Llama   │  │ attn/FFN │   │
│  │ tensors  │  │  forward │  │ RoPE/RMS │   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│       └─────────────┼─────────────┘         │
│                     ▼                       │
│  ┌──────────────────────────────────────┐   │
│  │           runtime/                   │   │
│  │   engine · KV cache · sampler        │   │
│  └─────────────────┬────────────────────┘   │
│                    ▼                        │
│  ┌──────────────────────────────────────┐   │
│  │       backend/bindless/ (WebGPU)     │   │
│  │   14 WGSL compute shaders            │   │
│  │   dequant · matmul · RoPE · attn     │   │
│  └──────────────────────────────────────┘   │
│                                             │
│  ┌──────────────────────────────────────┐   │
│  │   crates/libfse  (FSE policy engine) │   │
│  │   Patent Pending, see LICENSE note   │   │
│  └──────────────────────────────────────┘   │
└─────────────────────────────────────────────┘
         ▲ used by
┌────────┴───────────────────┐
│  Shimmy GPU Server binary  │
│  shimmy_server_gpu         │
│  HTTP · job queue · eval   │
└────────────────────────────┘

Full architecture reference: docs/architecture-map.md


⚡ TurboShimmy INT4 KV Cache

TurboShimmy is Airframe's on-GPU INT4 KV-cache compression system, shipping in v0.2.1. It compresses the KV cache from 32-bit floats to 4-bit integers with one scale factor per head vector, entirely in WGSL compute shaders with no CPU roundtrips. The result is ~7× less KV VRAM with no measurable retrieval degradation at ctx≤2048 (see Quality validation below).

One env var. ~7× less KV VRAM. Same output quality. Pure Rust, pure GPU.

# Enable TurboShimmy
SHIMMY_KV_QUANT=int4 LIBSHIMMY_MODEL_PATH=/path/to/model.gguf \
  cargo run --bin shimmy_server_gpu --release

# Or with a smaller prefill chunk (prevents Windows TDR resets on long prompts)
SHIMMY_KV_QUANT=int4 SHIMMY_PREFILL_CHUNK=8 LIBSHIMMY_MODEL_PATH=/path/to/model.gguf \
  cargo run --bin shimmy_server_gpu --release
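
Windows Timeout Detection and Recovery (TDR) resets the GPU driver when a single submission runs too long, which is why a smaller prefill chunk helps on long prompts. A minimal sketch of the chunking idea follows; `prefill_batches` is a hypothetical helper, and how Airframe actually schedules prefill work is an assumption here.

```rust
/// Split prompt tokens into prefill batches of at most `chunk` tokens.
/// Each batch would correspond to one bounded unit of GPU work.
fn prefill_batches(tokens: &[u32], chunk: usize) -> Vec<&[u32]> {
    assert!(chunk > 0, "chunk size must be positive");
    tokens.chunks(chunk).collect()
}
```

With a chunk size of 8, a 20-token prompt becomes three batches of 8, 8, and 4 tokens.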

Why it matters — TurboShimmy changes what fits on consumer GPUs:

| GPU VRAM | Without TurboShimmy | With TurboShimmy |
|---|---|---|
| 3 GB | Llama-3.2-1B only | Llama-3.2-3B fits ✅ |
| 4 GB | Llama-3.2-3B, ctx=2048 (tight) | Llama-3.2-3B at ctx=8192 ✅ |
| 6 GB | 3B models, short context | 7B models with reasonable context ✅ |

VRAM savings (ctx=2048):

| Model | F32 KV | INT4 KV | Savings |
|---|---|---|---|
| TinyLlama 1.1B (Q4_0) | 88 MB | ~13 MB | ~7× less |
| Llama-3.2-1B (Q4_K_M) | ~128 MB | ~18 MB | ~7× less |
| Llama-3.2-3B (Q4_K_M) | ~512 MB | ~72 MB | ~7× less |
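
The numbers in this table can be approximated from model shape alone. The sketch below assumes the KV cache stores one K and one V vector per layer, per KV head, per position, and that the INT4 layout is the one described under "How it works" (eight values per u32 plus one F32 scale per head vector). The function names are hypothetical, and the TinyLlama shape used in the note after the code (22 layers, 4 KV heads, head dimension 64) comes from that model's published configuration rather than from this document.

```rust
/// Bytes used by an F32 KV cache: K and V, per layer, per KV head, per position.
fn kv_bytes_f32(layers: usize, kv_heads: usize, head_dim: usize, ctx: usize) -> usize {
    2 * layers * kv_heads * ctx * head_dim * 4
}

/// Bytes used by a packed INT4 KV cache: eight values per u32 word plus one
/// f32 scale per head vector.
fn kv_bytes_int4(layers: usize, kv_heads: usize, head_dim: usize, ctx: usize) -> usize {
    let packed = (head_dim + 7) / 8 * 4; // u32 words, in bytes
    let scale = 4; // one f32 per vector
    2 * layers * kv_heads * ctx * (packed + scale)
}
```

For the assumed TinyLlama shape at ctx=2048, `kv_bytes_f32` gives 92,274,688 bytes (88 MiB) and `kv_bytes_int4` gives 12,976,128 bytes (about 12.4 MiB). The ratio is 256 / 36 ≈ 7.1 for a head dimension of 64, which is where the ~7× figure comes from: the per-vector scale keeps it slightly below the ideal 8×.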

How it works: Each K/V head vector is independently quantized to 4-bit integers with a per-vector F32 scale factor (max_abs / 7.0), packed into U32s (8 nibbles each) by sh_kv_pack_int4.wgsl. Dequantization via sh_kv_unpack_int4.wgsl happens on-the-fly before each attention computation. The helical context-shift operates directly on the packed INT4 representation — no decompression needed. Zero CPU roundtrips throughout.
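
The packing scheme described above can be mirrored on the CPU as a reference sketch. The shaders themselves are not reproduced here: the function names are hypothetical, and the two's-complement nibble encoding and nibble order within each word are assumptions of this example.

```rust
/// Quantize one head vector to signed 4-bit values in [-7, 7], packed eight
/// per u32. Returns the packed words and the scale (max_abs / 7.0).
/// Assumes `v.len()` is a multiple of 8.
fn pack_int4(v: &[f32]) -> (Vec<u32>, f32) {
    let max_abs = v.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = max_abs / 7.0;
    // An all-zero vector has scale 0; map everything to 0 in that case.
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    let words = v
        .chunks(8)
        .map(|chunk| {
            let mut word = 0u32;
            for (i, &x) in chunk.iter().enumerate() {
                let q = (x * inv).round().clamp(-7.0, 7.0) as i32;
                word |= ((q as u32) & 0xF) << (4 * i);
            }
            word
        })
        .collect();
    (words, scale)
}

/// Inverse of `pack_int4`: sign-extend each nibble and multiply by the scale.
fn unpack_int4(words: &[u32], scale: f32) -> Vec<f32> {
    let mut out = Vec::with_capacity(words.len() * 8);
    for &word in words {
        for i in 0..8 {
            let nibble = (word >> (4 * i)) & 0xF;
            // Shift the nibble to the top of an i32, then arithmetic-shift back.
            let q = ((nibble << 28) as i32) >> 28;
            out.push(q as f32 * scale);
        }
    }
    out
}
```

Each reconstructed value differs from the original by at most half a quantization step (`scale / 2`), and the largest-magnitude element of the vector is reproduced up to floating-point rounding.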

Quality validation: Needle-in-a-haystack benchmarks on Llama-3.2-3B show zero retrieval degradation vs F32 at ctx≤2048 across all tested insertion depths (15%, 50%, 85%). See docs/turboshimmy.md and the Shimmy wiki TurboShimmy page for full benchmark data and setup guide.

Server environment variables:

| Variable | Default | Description |
|---|---|---|
| LIBSHIMMY_MODEL_PATH | (required) | Path to .gguf model file |
| SHIMMY_PORT | 8080 | HTTP listener port |
| SHIMMY_MAX_CTX | 2048 | Maximum context window (tokens) |
| SHIMMY_PREFILL_CHUNK | 64 | Prefill batch size; reduce to 8 if you see TDR crashes on Windows |
| SHIMMY_KV_QUANT | f32 | KV cache mode: f32 or int4 (TurboShimmy) |
| SHIMMY_VRAM_LIMIT_MB | 10500 | VRAM budget warning threshold (MB); tune for your GPU |
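
For anyone wrapping the server in their own launcher, the variables and defaults above can be modeled as follows. This is a sketch, not the server's own parsing code: `ServerConfig` and `load_config` are hypothetical names, and falling back to the default on an unparseable value is a choice made for this example.

```rust
/// Server settings with the defaults from the table above.
#[derive(Debug, PartialEq)]
struct ServerConfig {
    model_path: String,
    port: u16,
    max_ctx: usize,
    prefill_chunk: usize,
    kv_quant: String,
    vram_limit_mb: u64,
}

/// Build a config from any key lookup (pass `|k| std::env::var(k).ok()` to
/// read the real environment). Unparseable numbers fall back to the default.
fn load_config(get: impl Fn(&str) -> Option<String>) -> Result<ServerConfig, String> {
    let model_path = get("LIBSHIMMY_MODEL_PATH")
        .ok_or_else(|| "LIBSHIMMY_MODEL_PATH is required".to_string())?;
    let kv_quant = get("SHIMMY_KV_QUANT").unwrap_or_else(|| "f32".to_string());
    if kv_quant != "f32" && kv_quant != "int4" {
        return Err(format!("SHIMMY_KV_QUANT must be f32 or int4, got {kv_quant}"));
    }
    Ok(ServerConfig {
        model_path,
        port: get("SHIMMY_PORT").and_then(|v| v.parse().ok()).unwrap_or(8080),
        max_ctx: get("SHIMMY_MAX_CTX").and_then(|v| v.parse().ok()).unwrap_or(2048),
        prefill_chunk: get("SHIMMY_PREFILL_CHUNK").and_then(|v| v.parse().ok()).unwrap_or(64),
        kv_quant,
        vram_limit_mb: get("SHIMMY_VRAM_LIMIT_MB").and_then(|v| v.parse().ok()).unwrap_or(10500),
    })
}
```

Taking the lookup as a closure keeps the function testable without touching the process environment.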

Benchmarks

Airframe has been validated on standard LLM evaluation benchmarks. Results are tracked in artifacts/.

The FSE policy layer benchmarks 27% faster than a raw aho-corasick iterator on 7 KB payloads (see crates/libfse/AUDIT_INFO.md for methodology).

To run performance baselines:

cargo bench
# or with a model:
LIBSHIMMY_MODEL_PATH=/path/to/model.gguf cargo run --bin shimmy_server_gpu --release

Development

git clone https://github.com/Michael-A-Kuykendall/airframe
cd airframe
cargo build
cargo test
cargo run --example simple_flight  # requires LIBSHIMMY_MODEL_PATH

See CONTRIBUTING.md for guidelines. See CHANGELOG.md for release history.


Ecosystem

| Project | Description |
|---|---|
| Shimmy | OpenAI-compatible inference server, powered by Airframe |
| libfse | Fused Semantic Execution policy engine, ships as part of this repo |
| shimmytok | GGUF-native tokenizer used by both Airframe and Shimmy |
| shimmyjinja | Pure-Rust Jinja2 engine for HuggingFace chat_template strings; live in v0.1.1, powers the prompt rendering pipeline |

License

MIT — see LICENSE.

Inference runtime (attention kernels, GGUF loader, quantization, WebGPU backend): unencumbered MIT.

FSE subsystem (crates/libfse): MIT for non-commercial use. The Fail-Closed Policy Fusion and Execution Kernel methods are covered by a pending US patent. Commercial embedding requires a separate license — contact michaelallenkuykendall@gmail.com.
