kernel-compass

Automated GPU kernel optimization loop for transformer inference, seeded from the cache-barrier paper on MLA reconstruction GEMMs.

Clone with the pinned cache-barrier submodule:

git clone --recursive git@github.com:zhan4808/kernel-compass.git
# or after a plain clone:  git submodule update --init --recursive

Pipeline

target script
     │
     ▼ Stage 1
ncu_runner.py  ──→  NCU CSV  ──→  List[KernelProfile]
                                         │
                                         ▼ Stage 2
                                   bottleneck.py  ──→  KernelProfile.bottleneck
                                         │
                                         ▼ Stage 3
                                   select_candidate()
                                         │
                                         ▼ Stage 4
                                   propose()  ──→  optimization description
                                         │
                                         ▼ Stage 5
                                   validate()  ──→  accept / revert
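Stages 2 and 3 of the diagram above can be sketched as follows. This is a minimal illustration, not the code in profiling/bottleneck.py or optimizer/loop.py: the KernelProfile fields, the 60% threshold, the "most time first" selection rule, and the kernel names are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KernelProfile:
    # Hypothetical fields; the real dataclass lives in profiling/metrics.py.
    name: str
    duration_us: float
    sm_pct: float      # compute throughput, % of peak
    dram_pct: float    # DRAM throughput, % of peak
    bottleneck: Optional[str] = None


def classify(profiles: List[KernelProfile], threshold: float = 60.0) -> List[KernelProfile]:
    """Stage 2 (sketch): label each kernel from its two throughput percentages."""
    for p in profiles:
        if p.dram_pct >= threshold and p.dram_pct >= p.sm_pct:
            p.bottleneck = "memory"
        elif p.sm_pct >= threshold:
            p.bottleneck = "compute"
        else:
            # Neither unit is near peak, so treat the kernel as latency-bound.
            p.bottleneck = "latency"
    return profiles


def select_candidate(profiles: List[KernelProfile]) -> KernelProfile:
    """Stage 3 (sketch): pick the kernel that accounts for the most time."""
    return max(profiles, key=lambda p: p.duration_us)


# Made-up kernels and numbers, for illustration only.
profiles = classify([
    KernelProfile("bmm_absorb", duration_us=42.0, sm_pct=18.0, dram_pct=81.0),
    KernelProfile("softmax", duration_us=6.5, sm_pct=12.0, dram_pct=35.0),
    KernelProfile("gemm_proj", duration_us=20.0, sm_pct=74.0, dram_pct=40.0),
])
candidate = select_candidate(profiles)
print(candidate.name, candidate.bottleneck)
```

Selecting by duration alone is the simplest possible policy; a real loop would likely also weigh how much headroom the bottleneck class leaves.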

Structure

profiling/
  metrics.py       KernelProfile and BMMProfile dataclasses
  ncu_runner.py    Stage 1 — run ncu, parse CSV, emit KernelProfiles
  bottleneck.py    Stage 2 — classify + markdown report

kernels/
  mla_reconstruction.py  DeepSeek-V2/V3 MLA reconstruction BMM profiler
  baselines.py           FP16 cuBLAS and INT4 Triton W4A16 BMM wrappers

optimizer/
  loop.py          Stages 3–5 — candidate selection, proposal, validation

data/              NCU CSVs and benchmark results (gitignored)
paper/             LaTeX draft + GPU data checklist (see paper/README.md)
cache-barrier/     Git submodule — reference experiments/paper artifact (optional checkout)
DIRECTION.md       Roadmap and design notes

Requirements

NVIDIA GPU (H100 or A100 recommended for MLA experiments)
PyTorch ≥ 2.1 with CUDA
Triton ≥ 3.0
Nsight Compute (ncu) for Stage 1

pip install torch triton
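Before running Stage 1, it can help to confirm that the prerequisites are visible from the current environment. The helper below is a hypothetical sketch and not part of the repository; it only checks that the packages and the ncu binary can be found, not that their versions meet the minimums listed above.

```python
import importlib.util
import shutil


def check_environment() -> dict:
    """Report which prerequisites are visible, without importing heavy packages."""
    return {
        "torch": importlib.util.find_spec("torch") is not None,
        "triton": importlib.util.find_spec("triton") is not None,
        # Nsight Compute's CLI must be on PATH for Stage 1.
        "ncu": shutil.which("ncu") is not None,
    }


status = check_environment()
for name, found in status.items():
    print(f"{name:8s} {'found' if found else 'MISSING'}")
```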

Quick start

Profile MLA reconstruction BMMs:

python -m kernels.mla_reconstruction --model deepseek-v3

Run the L2 barrier sweep (INT4 vs FP16 across the L2 cache boundary):

python -m kernels.baselines --output data/l2_sweep.json

Run NCU and classify kernels:

python -m profiling.ncu_runner \
    --script kernels/mla_reconstruction.py \
    --args "--model deepseek-v3 --ncu-mode" \
    --output data/mla_v3.csv \
    --label "mla_v3_decode_bs1"

Parse an existing NCU CSV:

python -m profiling.ncu_runner --parse-only data/mla_v3.csv --label "mla_v3"
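As a rough picture of what parsing involves, the sketch below pivots a long-format CSV (one row per kernel and metric) into one record per kernel. It assumes the column names "ID", "Kernel Name", "Metric Name", and "Metric Value", and that any ==PROF== preamble lines have already been stripped; the layout can vary across Nsight Compute versions and flags. The function name pivot_metrics is hypothetical, and the sample values are invented.

```python
import csv
import io

# Invented sample in the assumed long format: one row per (kernel, metric).
SAMPLE = '''"ID","Kernel Name","Metric Name","Metric Unit","Metric Value"
"0","kernel_a","gpu__time_duration.sum","nsecond","41,984"
"0","kernel_a","dram__throughput.avg.pct_of_peak_sustained_elapsed","%","81.2"
"1","kernel_b","gpu__time_duration.sum","nsecond","6,528"
"1","kernel_b","dram__throughput.avg.pct_of_peak_sustained_elapsed","%","35.0"
'''


def pivot_metrics(text: str) -> list:
    """Group metric rows by launch ID into one dict per kernel launch."""
    kernels = {}
    for row in csv.DictReader(io.StringIO(text)):
        entry = kernels.setdefault(
            row["ID"], {"name": row["Kernel Name"], "metrics": {}}
        )
        # Values may carry thousands separators, e.g. "41,984".
        value = float(row["Metric Value"].replace(",", ""))
        entry["metrics"][row["Metric Name"]] = value
    return list(kernels.values())


kernels = pivot_metrics(SAMPLE)
for k in kernels:
    print(k["name"], k["metrics"]["gpu__time_duration.sum"])
```

Grouping by launch ID rather than kernel name keeps repeated launches of the same kernel separate.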

Full optimization loop (interactive):

python -m optimizer.loop \
    --script kernels/mla_reconstruction.py \
    --args "--model deepseek-v3 --ncu-mode" \
    --iters 3
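Each iteration ends with the Stage 5 accept / revert decision. One plausible rule, shown here as a sketch under stated assumptions rather than as the logic in optimizer/loop.py, compares median kernel times before and after a change and accepts only if the speedup clears a small margin. The signature, the 2% margin, and the timings are all illustrative.

```python
from statistics import median
from typing import List, Tuple


def validate(baseline_us: List[float], candidate_us: List[float],
             min_speedup: float = 1.02) -> Tuple[str, float]:
    """Accept a change only if the median speedup clears the noise margin."""
    speedup = median(baseline_us) / median(candidate_us)
    decision = "accept" if speedup >= min_speedup else "revert"
    return decision, speedup


# Invented timings in microseconds.
baseline = [42.1, 41.8, 42.5]
decision_fast, speedup_fast = validate(baseline, [36.0, 36.4, 35.9])
decision_flat, speedup_flat = validate(baseline, [42.0, 41.9, 42.3])
print(decision_fast, round(speedup_fast, 3))
print(decision_flat, round(speedup_flat, 3))
```

Medians are less sensitive than means to the occasional slow run, and the margin keeps changes that sit within measurement noise from being accepted. A real validator would also check numerical correctness of the kernel output before comparing times.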

From Python

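from profiling.ncu_runner import run_ncu, load_profiles
from profiling.bottleneck import classify, report

# Stage 1: run ncu on the target script and write the CSV.
csv_path = run_ncu("kernels/mla_reconstruction.py",
                   args="--ncu-mode", output="data/out.csv")
# Stage 2: load the KernelProfiles and classify their bottlenecks.
profiles = classify(load_profiles(csv_path))
# Print the markdown report.
print(report(profiles, label="mla_decode_bs1"))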
