Multi-vendor GPU implementations of the perceptual image quality metrics Imazen runs in production, plus a unified CLI.
Built on CubeCL via the zenforks-cubecl
publication of tracel-ai/cubecl
(0.10.x — carries pinned-upload + PTX-cache-widening + Metal
Atomic<f32> capability patches for our use case). A single
#[cube]-annotated Rust kernel source dispatches across CUDA (NVIDIA),
WGPU (Vulkan / Metal / DX12 / WebGPU), HIP (AMD ROCm), and a
build-time CPU fallback.
The six GPU metric crates plus the two CPU reference crates that the orchestrator's CPU ladder dispatches to:
| Crate | Metric | Range / shape | Parity reference |
|---|---|---|---|
| butteraugli-gpu | Butteraugli | distance, max-norm (default) + libjxl 3-norm | butteraugli 0.9.4 |
| ssim2-gpu | SSIMULACRA2 | 0–100, higher better | ssimulacra2 0.5 |
| dssim-gpu | DSSIM | distance, 0 = identical | dssim-core 3.4 |
| iwssim-gpu | IW-SSIM (Wang & Li 2011) | [0, 1], 1.0 = identical | iwssim (in-tree CPU port) |
| zensim-gpu | zensim feature extractor | 228-feature vector + scalar score 0–100 | zensim 0.3.0 |
| cvvdp-gpu | ColorVideoVDP (still-image, GPU) | JOD ~3–10, higher better | pycvvdp 0.5.4 |
| iwssim | IW-SSIM (CPU reference + SIMD) | [0, 1], 1.0 = identical | self (pure-Rust port) |
| cvvdp | ColorVideoVDP (still-image, CPU) | JOD ~3–10 + per-pixel diffmap | pycvvdp 0.5.4 |
The CPU side of each metric is supplied by an external reference crate
(fast-ssim2 0.8.1,
dssim-core 3.4,
butteraugli 0.9.4,
zensim 0.3.0) or an in-tree crate
(cvvdp, iwssim). All six metrics
expose a CPU backend (the IW-SSIM CPU port landed in 2026-05; see the
Modes × metrics support matrix).
| Crate | Role |
|---|---|
| zenmetrics-api | Umbrella: one `MetricKind` enum + one `Metric` type dispatching to all six per-crate opaque scorers |
| zenmetrics-orchestrator | Capability-aware backend chooser + persistent benchmark cache + OOM fallback ladder + warm worker pool |
| zen-metrics-cli | `zen-metrics` CLI front-end (score / batch / compare / sweep) |
| zenmetrics-corpus | Shared test-image corpus (test infra) |
| iwssim-filter-codegen | Build-time generator for the IW-SSIM separable blur filters |
| cvvdp-conformance | pycvvdp conformance fixtures + parity harness for the cvvdp crates |
The workspace also contains the vast.ai / Hetzner / RunPod / Salad
sweep-fleet crates (zen-cloud-*, zencloud-hetzner,
zenfleet-orchestrator, zen-job-*, zen-ledger, zen-jobdash,
zen-sweep-worker, zenstats) that drive the backfill pipeline; they
are infrastructure, not part of the metric API.
For any caller that scores more than one (ref, dist) pair —
sweeps, picker training, RD curves, batch comparison, anything with
multiple tasks — use zenmetrics-orchestrator.
It adds three things every previous in-tree caller had to hand-roll:
- Backend selection. A persistent per-machine benchmark cache picks the fastest backend that fits available VRAM for each task. It knows which `(metric, size)` combinations OOM on this machine and avoids them on subsequent runs.
- OOM-safe fallback ladder. `GpuFull → GpuStrip → (Cvvdp: GpuStripPair) → Cpu`. Each downgrade is recorded in the cache so the same machine never tries the failing combination twice.
- Cached-reference auto-detect. xxhash3 hashes the ref bytes per task and promotes consecutive same-ref tasks to the `set_reference` + `compute_with_cached_reference` API for the 1.5–3× speedup that sweeps benefit from.
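The fallback ladder above can be sketched as a small state machine. This is a minimal illustration, not the orchestrator's code: the `Backend` enum here is a local stand-in that mirrors the four documented rungs, and `next_rung` / `score_with_fallback` are hypothetical names. The `Err(())` return stands in for an out-of-memory failure.

```rust
/// Local stand-in for the documented ladder rungs (not the real crate type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    GpuFull,
    GpuStrip,
    GpuStripPair,
    Cpu,
}

/// Next rung after a failure, or `None` once the ladder is exhausted.
/// `is_cvvdp` enables the cvvdp-only `GpuStripPair` rung.
fn next_rung(current: Backend, is_cvvdp: bool) -> Option<Backend> {
    match current {
        Backend::GpuFull => Some(Backend::GpuStrip),
        Backend::GpuStrip if is_cvvdp => Some(Backend::GpuStripPair),
        Backend::GpuStrip | Backend::GpuStripPair => Some(Backend::Cpu),
        Backend::Cpu => None,
    }
}

/// Walk the ladder from `start` until `try_score` succeeds on some rung.
/// `try_score` returns `Err(())` to signal an out-of-memory failure.
fn score_with_fallback<F>(
    start: Backend,
    is_cvvdp: bool,
    mut try_score: F,
) -> Option<(Backend, f64)>
where
    F: FnMut(Backend) -> Result<f64, ()>,
{
    let mut rung = Some(start);
    while let Some(backend) = rung {
        if let Ok(score) = try_score(backend) {
            return Some((backend, score));
        }
        rung = next_rung(backend, is_cvvdp);
    }
    None
}

fn main() {
    // Pretend GpuFull and GpuStrip both run out of memory for a cvvdp task.
    let result = score_with_fallback(Backend::GpuFull, true, |b| match b {
        Backend::GpuFull | Backend::GpuStrip => Err(()),
        _ => Ok(9.5),
    });
    println!("{:?}", result);
}
```

A real implementation would also record each downgrade in the benchmark cache, as the list describes; that bookkeeping is omitted here.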
Quick decision table:
| Caller shape | Use |
|---|---|
| One (ref, dist) per process, no fallback needed | zenmetrics-api directly |
| Batch / sweep / picker training / RD curve | zenmetrics-orchestrator |
| Streaming workload | zenmetrics-orchestrator |
| OOM-tolerant scoring | zenmetrics-orchestrator |
| One-ref / many-dist workloads | zenmetrics-orchestrator |
See crates/zenmetrics-orchestrator/README.md
for quickstart, the streaming + batch APIs, OOM handling details,
cached-ref semantics, CPU backend selection, capability cache lifecycle,
and the full configuration surface. Migration code samples in
crates/zenmetrics-orchestrator/docs/MIGRATION_FROM_API.md.
The zen-metrics CLI routes scoring through the orchestrator by
default (since Phase 7.7.1, 2026-05-27). The legacy direct-dispatch
path remains available via zen-metrics --use-legacy-scheduler … (or
ZENMETRICS_USE_LEGACY_SCHEDULER=1) — useful when an archived parquet
sidecar needs bit-identical regeneration, or when comparing the two
paths for parity. The orchestrator path itself was validated as
bit-identical to legacy across all 54 cells (6 metrics × 3 sizes × 3
qs) on RTX 5070 + 7950X — see
benchmarks/orchestrator_parity_2026-05-27_phase771_run3.csv
for the per-cell data. The --use-orchestrator flag and
ZENMETRICS_USE_ORCHESTRATOR env var are accepted for
backwards-compat with pre-Phase-7.7.1 scripts / Docker images but
emit a deprecation warning.
The new sweep image
Dockerfile.sweep.v27 bakes the orchestrator
features in and ships
scripts/sweep/onstart_orchestrator.sh
as an entrypoint that drives the per-cell scoring through the
orchestrator's worker pool.
One per-metric carve-out remains: butteraugli stays on the legacy
direct-dispatch path because ButteraugliOpaque::new_with_memory_mode
resolves Auto to strip-mode (butter is strip-preferred), which drops
to single-resolution scoring and diverges from the legacy CLI's
always-multires output by ~14-30 %. The orchestrator transparently
falls back to legacy for butter; sweeps emit the same column shape
in both paths.
Spearman rank correlation coefficient against published still-image MOS datasets, sign-normalized so higher = better. These figures are illustrative, sourced externally (the published Cloudinary SSIMULACRA2 benchmark table for the reference metrics) — they are not regenerated by any harness in this repo, so treat them as an order-of-magnitude sanity check on metric discrimination, not a committed measurement. The metric each crate computes is bit-comparable to the cited reference, so the reference's published SRCC transfers.
| Metric | TID2013 | KADID-10k | CID22 |
|---|---|---|---|
| dssim-gpu (= DSSIM) | 0.871 | 0.856 | 0.872 |
| ssim2-gpu (= SSIMULACRA2) | 0.819 | 0.785 | 0.885 |
| butteraugli-gpu (3-norm) | 0.664 | 0.543 | 0.794 |
| iwssim-gpu (= IW-SSIM) | (not benchmarked here) | | |
| zensim-gpu (= zensim) | (Imazen-internal benchmark) | | |
| cvvdp-gpu (= ColorVideoVDP) | (pending; reference is pycvvdp 0.5.4) | | |
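For readers who want to compute the same statistic on their own data, here is a minimal Spearman sketch. It is not part of this repo (no harness here regenerates the table). It assumes average ranks for ties and assumes that "sign-normalized" means negating the coefficient for distance-style metrics where lower is better; the function names are hypothetical.

```rust
/// 1-based average ranks; tied values share the mean of their positions.
fn ranks(v: &[f64]) -> Vec<f64> {
    let mut idx: Vec<usize> = (0..v.len()).collect();
    idx.sort_by(|&a, &b| v[a].total_cmp(&v[b]));
    let mut r = vec![0.0; v.len()];
    let mut i = 0;
    while i < idx.len() {
        // Extend j over the run of values equal to v[idx[i]].
        let mut j = i;
        while j + 1 < idx.len() && v[idx[j + 1]] == v[idx[i]] {
            j += 1;
        }
        let avg = (i + j) as f64 / 2.0 + 1.0;
        for k in i..=j {
            r[idx[k]] = avg;
        }
        i = j + 1;
    }
    r
}

/// Pearson correlation of two equal-length slices.
fn pearson(x: &[f64], y: &[f64]) -> f64 {
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y) {
        sxy += (a - mx) * (b - my);
        sxx += (a - mx).powi(2);
        syy += (b - my).powi(2);
    }
    sxy / (sxx * syy).sqrt()
}

/// Spearman rank correlation, sign-normalized so higher = better.
/// Assumption: distance metrics (lower is better) are negated.
fn srcc(scores: &[f64], mos: &[f64], higher_is_better: bool) -> f64 {
    assert_eq!(scores.len(), mos.len());
    let rho = pearson(&ranks(scores), &ranks(mos));
    if higher_is_better { rho } else { -rho }
}

fn main() {
    // Illustrative values only: a distance metric that orders images
    // exactly opposite to their mean opinion scores.
    let distance = [0.1, 0.5, 0.9, 2.0];
    let mos = [4.8, 4.0, 3.1, 1.2];
    println!("SRCC = {:.3}", srcc(&distance, &mos, false));
}
```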
Every GPU metric crate exposes a MemoryMode enum + a
new_with_memory_mode constructor so callers choose how the GPU
working set is laid out. The umbrella (zenmetrics-api)
re-exports a single user-facing enum and converts to each crate's own
MemoryMode at the call boundary:
```rust
// zenmetrics_api::MemoryMode: the portable subset every metric accepts.
pub enum MemoryMode {
    /// Per-crate `resolve_auto` picks the variant that fits the cap. Default.
    Auto,
    /// Whole-image working set on device.
    Full,
    /// Vertical strips of `h_body` body rows + the crate's halo per
    /// side. `h_body == None` lets the resolver pick the largest body
    /// that fits the cap.
    Strip { h_body: Option<u32> },
    /// Reserved: every per-crate `From` maps `Tile` to `Auto` today.
    Tile { h: u32, w: u32 },
}
```

cvvdp-gpu additionally exposes two cvvdp-specific variants on its
typed enum (`cvvdp_gpu::MemoryMode`) that the umbrella's portable
subset does not carry, because they change the one-shot/cached-ref
shape or the JOD value:

- `StripPair { h_body }` is Mode B: ref and dist both walk in strips together (no full-ref cache). Best for one-shot CLI callers; the orchestrator surfaces it as `Backend::GpuStripPair`.
- `CappedPyramid { levels }` is a JOD-shifting safety net that truncates pyramid depth to shrink the deepest-band blur halo. It is not bit-identical to Full and is opt-in only; `Auto` never picks it.
The full per-metric breakdown — which modes each crate exposes on CPU and GPU, and the exact constructor to invoke each — is in the Modes × metrics support matrix and API surface sections below.
`MemoryMode::Auto` resolves per crate by:
- Reading the VRAM cap: `ZENMETRICS_VRAM_CAP_BYTES` (decimal usize) when set, else a live free-VRAM probe (cubecl / `nvidia-smi`), else an 8 GB default.
- Estimating the whole-image working-set bytes via the per-crate `estimate_gpu_memory_bytes` helper (zensim-gpu additionally reserves `CUBECL_OVERHEAD_BYTES` ≈ 193 MiB for the runtime pool).
- Picking Full when it fits and the crate is not strip-preferred; else picking Strip with an auto-sized `h_body` that fits the cap.
- Returning `Error::TooBigForFull { needed, cap }` when neither fits.
Only butteraugli-gpu is strip-preferred — its resolve_auto tries
Strip first and picks it even when Full would fit, because the strip
walker is the faster path on that crate
(crates/butteraugli-gpu/src/memory_mode.rs).
dssim-gpu, ssim2-gpu, iwssim-gpu, zensim-gpu, and cvvdp-gpu are
Full-preferred — Auto only drops to Strip when Full exceeds the cap.
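The resolution steps and the strip-preferred rule can be sketched as follows. This is an illustration under stated assumptions, not any crate's `resolve_auto`: the live free-VRAM probe is omitted, the 8 GB default is assumed to mean 8,000,000,000 bytes, the `strip_bytes` estimator is a hypothetical caller-supplied closure, and the halving search for `h_body` is a simplification of "largest body that fits".

```rust
use std::env;

#[derive(Debug, PartialEq)]
enum Resolved {
    Full,
    Strip { h_body: u32 },
}

#[derive(Debug, PartialEq)]
struct TooBigForFull {
    needed: u64,
    cap: u64,
}

/// Assumed reading of the documented "8 GB default" (could be 8 GiB).
const DEFAULT_CAP_BYTES: u64 = 8_000_000_000;

/// Step 1 without the live probe: env override, else the default.
fn vram_cap_bytes() -> u64 {
    env::var("ZENMETRICS_VRAM_CAP_BYTES")
        .ok()
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_CAP_BYTES)
}

/// Steps 2 to 4. `full_bytes` is the whole-image estimate and
/// `strip_bytes(h_body)` a hypothetical per-strip estimator.
fn resolve_auto(
    cap: u64,
    height: u32,
    full_bytes: u64,
    strip_bytes: impl Fn(u32) -> u64,
    strip_preferred: bool,
) -> Result<Resolved, TooBigForFull> {
    // Find a strip body that fits by halving from the image height.
    let mut best_strip = None;
    let mut h = height.max(1);
    loop {
        if strip_bytes(h) <= cap {
            best_strip = Some(h);
            break;
        }
        if h == 1 {
            break;
        }
        h /= 2;
    }
    let full_fits = full_bytes <= cap;
    match (full_fits, best_strip, strip_preferred) {
        // Strip-preferred crates take Strip even when Full would fit.
        (_, Some(h_body), true) => Ok(Resolved::Strip { h_body }),
        (true, _, _) => Ok(Resolved::Full),
        (false, Some(h_body), false) => Ok(Resolved::Strip { h_body }),
        (false, None, _) => Err(TooBigForFull { needed: full_bytes, cap }),
    }
}

fn main() {
    // Made-up estimator: 1 MB of working set per body row.
    let strip = |h: u32| 1_000_000u64 * u64::from(h);
    let mode = resolve_auto(vram_cap_bytes(), 4096, 6_000_000_000, strip, false);
    println!("{:?}", mode);
}
```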
When the zenmetrics-orchestrator
drives scoring it does not rely on per-crate Auto alone — it runs
a cost-model-aware backend chooser over its persistent benchmark cache.
A ChooserConfig::vram_safety_margin (default 0.15) is held back, and
the chooser picks the fastest backend that fits. For a single cold call
(ExecContext::OneShot, task #146) it additionally consults the
measured one-shot CPU/GPU crossover
(benchmarks/cpu_gpu_crossover_2026-05-29.tsv)
and routes small images to CPU rather than paying the GPU
context-init floor; the warm pool / sweep path stays Batch and ranks
on warm steady-state cost. See
API surface.
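The "fastest backend that fits, with a safety margin held back" rule can be sketched like this. The `Candidate` struct, the `choose` function, and the sample numbers are all hypothetical; only the 0.15 default margin comes from the text above.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Backend {
    GpuFull,
    GpuStrip,
    Cpu,
}

/// One cached benchmark entry (hypothetical shape).
#[derive(Debug, Clone, Copy)]
struct Candidate {
    backend: Backend,
    ns_per_px: f64,  // warm steady-state cost from the cache
    vram_bytes: u64, // estimated working set; 0 for CPU
}

/// Pick the fastest candidate whose working set fits in the free VRAM
/// that remains after holding back `safety_margin` (a fraction in 0..1).
fn choose(cands: &[Candidate], free_vram: u64, safety_margin: f64) -> Option<Candidate> {
    let budget = (free_vram as f64 * (1.0 - safety_margin)) as u64;
    cands
        .iter()
        .copied()
        .filter(|c| c.vram_bytes <= budget)
        .min_by(|a, b| a.ns_per_px.total_cmp(&b.ns_per_px))
}

fn main() {
    // Illustrative numbers, not measurements.
    let cands = [
        Candidate { backend: Backend::GpuFull, ns_per_px: 2.0, vram_bytes: 11_000_000_000 },
        Candidate { backend: Backend::GpuStrip, ns_per_px: 5.0, vram_bytes: 2_000_000_000 },
        Candidate { backend: Backend::Cpu, ns_per_px: 200.0, vram_bytes: 0 },
    ];
    let free = 12 * 1024 * 1024 * 1024u64; // 12 GiB card, all free
    let pick = choose(&cands, free, 0.15).map(|c| c.backend);
    println!("{:?}", pick);
}
```

With the 0.15 margin the 11 GB full-image candidate no longer fits a 12 GiB card, so the strip candidate is chosen even though it is slower per pixel.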
The historical Metric::new(backend, w, h, params) constructor is
preserved and delegates through new_with_memory_mode(.., MemoryMode::Auto).
Existing call sites compile and behave the same unless
ZENMETRICS_VRAM_CAP_BYTES is set tight enough to force a mode change.
Which execution modes each metric exposes, on CPU and on GPU, verified against each crate's public API. Legend: ✓ supported · ✗ not supported in this release · n/a not applicable to that metric.
- Full: whole-image working set.
- Strip: vertical strip walker, cold `(ref, dist)` per call.
- warm_ref: reference cached once, `score` / `compute` per distorted image against whole-image ref state.
- warm_ref_strip: reference cached, distorted image walked in strips. (iwssim's GPU variant uniquely walks the ref in strips too: `CachedRefStripPolicy::BothStripped`.)
- StripPair: cvvdp-only Mode B: ref + dist walk in strips together, no full-ref cache (one-shot CLI path; orchestrator `Backend::GpuStripPair`).
- CappedPyramid: cvvdp-only, JOD-shifting depth cap (opt-in safety net; not bit-identical to Full, never picked by `Auto`).
| Crate | Full | Strip | warm_ref | warm_ref_strip | StripPair | CappedPyramid |
|---|---|---|---|---|---|---|
| cvvdp-gpu | ✓ | ✓ ¹ | ✓ | ✓ | ✓ | ✓ |
| ssim2-gpu | ✓ | ✓ | ✓ | ✓ | n/a | n/a |
| butteraugli-gpu | ✓ | ✓ ² | ✓ | ✓ | n/a | n/a |
| dssim-gpu | ✓ | ✓ | ✓ | ✓ | n/a | n/a |
| iwssim-gpu | ✓ | ✓ | ✓ | ✓ ³ | n/a | n/a |
| zensim-gpu | ✓ | ✓ | ✓ | ✓ | n/a | n/a |
¹ cvvdp-gpu's Strip (Mode E) is the cached-ref path: `warm_reference_srgb` + a per-strip dist walker; the one-shot strip is `StripPair`. Verified in `crates/cvvdp-gpu/src/memory_mode.rs` (`MemoryMode::{Full, Strip, StripPair, CappedPyramid}`) + `pipeline.rs` (`Cvvdp::new`, `new_strip`, `new_strip_pair`, `new_capped_pyramid`).

² butteraugli-gpu is the one strip-preferred crate: `Auto` picks Strip even when Full fits.

³ iwssim-gpu's `warm_ref_strip` can keep the ref full or walk it in strips (`CachedRefStripPolicy`); the other crates keep the ref full and strip only the dist. Verified in `crates/zenmetrics-api/src/memory_mode.rs`.
| Metric (CPU) | Full | Strip | warm_ref | warm_ref_strip |
|---|---|---|---|---|
| cvvdp (in-tree) | ✓ | ✓ | ✓ | ✓ |
| ssim2 (fast-ssim2 0.8.1) | ✓ | ✓ | ✓ | ✓ |
| butter (butteraugli 0.9.4) | ✓ | ✓ | ✓ | ✓ |
| iwssim (in-tree) | ✓ | ✓ | ✓ | ✓ |
| zensim (zensim 0.3.0) | ✓ | ✓ | ✓ | ✓ |
| dssim (dssim-core 3.4) | ✓ | ✗ | ✓ | ✗ |
dssim CPU has no strip walker: dssim-core 3.4 exposes no strip
API, so dssim CPU is Full + warm_ref only (verified in
crates/zenmetrics-orchestrator/src/cpu_adapter.rs, where
`compute_strip` / `compute_warm_ref_strip` return an error for dssim).
On the GPU, dssim-gpu does support Strip.
There are three layers. Pick by how many pairs you score:

- Umbrella (`zenmetrics-api`): one cold pair, no fallback. One enum, one constructor, one score:

  ```rust
  use zenmetrics_api::{Backend, Metric, MetricKind, MetricParams, MemoryMode};

  // Auto memory mode (the default Metric::new path).
  let mut m = Metric::new(
      MetricKind::Cvvdp, Backend::Cuda, 1024, 1024,
      MetricParams::default_for(MetricKind::Cvvdp),
  )?;
  let score = m.compute_srgb_u8(&ref_rgb, &dist_rgb)?;

  // Force a specific memory mode at construction:
  let mut m = Metric::new_with_memory_mode(
      MetricKind::Ssim2, Backend::Cuda, 4096, 4096,
      MetricParams::default_for(MetricKind::Ssim2),
      MemoryMode::Strip { h_body: None }, // None → resolver auto-sizes the body
  )?;

  // Cache one reference, score many distorted images against it:
  m.set_reference_srgb_u8(&ref_rgb)?;
  let s1 = m.compute_with_cached_reference_srgb_u8(&dist1)?;
  let s2 = m.compute_with_cached_reference_srgb_u8(&dist2)?;
  # Ok::<(), zenmetrics_api::Error>(())
  ```

  The umbrella's `MemoryMode` carries the portable `{ Auto, Full, Strip, Tile }` subset; it converts to each crate's own enum at the boundary. cvvdp's `StripPair` / `CappedPyramid` are not in the umbrella subset; reach for the typed crate (below) to use them.

- Typed per-crate opaque: a mode the umbrella doesn't expose. Each crate ships `<Metric>Opaque::new` / `new_with_memory_mode` plus `set_reference_srgb_u8` + `compute_with_cached_reference_srgb_u8` (cvvdp-gpu names these `warm_reference_srgb` + `compute_with_warm_ref_srgb`). cvvdp's extra modes:

  ```rust
  use cvvdp_gpu::{CvvdpOpaque, CvvdpParams, MemoryMode, Backend};

  // Mode B one-shot strip-pair (lowest one-shot VRAM):
  let mut s = CvvdpOpaque::new_with_memory_mode(
      Backend::Cuda, 4096, 4096, CvvdpParams::default(),
      MemoryMode::StripPair { h_body: Some(256) },
  )?;

  // JOD-shifting capped pyramid (opt-in; NOT bit-identical to Full):
  let mut s = CvvdpOpaque::new_with_memory_mode(
      Backend::Cuda, 4096, 4096, CvvdpParams::default(),
      MemoryMode::CappedPyramid { levels: 5 },
  )?;
  # Ok::<(), cvvdp_gpu::Error>(())
  ```

  The typed `cvvdp_gpu::Cvvdp` pipeline also offers the matching constructors directly: `Cvvdp::new`, `new_strip`, `new_strip_pair`, `new_capped_pyramid`.

- CPU strip: the in-tree `cvvdp` / `iwssim` crates. The CPU reference crates take an explicit `h_body` on the strip calls:

  ```rust
  use cvvdp::{Cvvdp, CvvdpParams};

  // Strip-shape allocation up front (peak heap bounded to the strip):
  let mut c = Cvvdp::new_strip(4096, 4096, CvvdpParams::default(), 512)?;
  let jod = c.score_strip(&ref_rgb, &dist_rgb, 512)?;

  // Or cache the reference, then strip-walk each distorted image:
  c.warm_reference(&ref_rgb)?;
  let jod = c.score_with_warm_ref_strip(&dist_rgb, 512)?;
  # Ok::<(), cvvdp::Error>(())
  ```

  `h_body` must be a positive power of two; pass `512` when unsure (the per-crate default). `iwssim` exposes `iwssim::STRIP_BODY_DEFAULT` for the same purpose.
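The power-of-two constraint on `h_body` can be checked before constructing a strip scorer. The helpers below are hypothetical, not part of the `cvvdp` or `iwssim` crates, and the strip count assumes the walker covers the image in `ceil(height / h_body)` bodies.

```rust
/// Accept only positive powers of two, per the documented constraint.
/// `u32::is_power_of_two` already returns false for 0.
fn validate_h_body(h_body: u32) -> Result<u32, String> {
    if h_body.is_power_of_two() {
        Ok(h_body)
    } else {
        Err(format!("h_body must be a positive power of two, got {h_body}"))
    }
}

/// Number of strip bodies needed to cover `height` rows (assumed ceil division).
fn strip_count(height: u32, h_body: u32) -> u32 {
    (height + h_body - 1) / h_body
}

fn main() {
    match validate_h_body(512) {
        Ok(h) => println!("4096 rows at h_body={h}: {} strips", strip_count(4096, h)),
        Err(e) => eprintln!("{e}"),
    }
}
```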
For batches / sweeps, let zenmetrics-orchestrator
choose. It owns a persistent benchmark cache and a pure decision
function over it:
```rust
use zenmetrics_orchestrator::{Orchestrator, OrchestratorConfig, ExecContext, TaskShape};
use zenmetrics_api::MetricKind;

let mut orch = Orchestrator::new(OrchestratorConfig::default())?;
orch.warm()?; // bench-on-demand; cache-hit if fresh
let task = TaskShape { metric: MetricKind::Cvvdp, width: 4096, height: 4096 };

// Batch / warm-pool ranking (ranks on warm steady-state ns/px):
let choice = orch.choose_backend_for_task(&task)?; // ExecContext::Batch

// Single cold call: apply the measured CPU/GPU one-shot crossover.
let choice = orch.choose_backend_for_task_with_context(&task, ExecContext::OneShot)?;
println!("{:?} @ {:.2} ns/px", choice.backend, choice.predicted_ns_per_px);
# Ok::<(), Box<dyn std::error::Error>>(())
```

The chooser's `Backend` enum is the resolved mode: `GpuFull`,
`GpuStrip`, `GpuStripPair` (cvvdp only), or `Cpu`. The `ExecContext`
controls how the cold-start floor is weighed:

- `Batch` (default) ranks on the cache's warm `ns_per_px`, which is correct when a persistent warm worker amortizes the GPU context-init floor. GPU wins at every measured size for every metric.
- `OneShot` consults the measured one-shot crossover: at/below the per-metric crossover size (cvvdp / ssim2 / butter / zensim through 16 MP, dssim through 4 MP, iwssim through 1 MP) it routes to CPU when CPU is a feasible candidate, because a single cold GPU call would pay the ~181 ms context-init floor that makes CPU faster at that size.
For the full streaming + batch scoring APIs (submit / poll /
run_all / upload_reference), the OOM fallback ladder, and cached-ref
auto-detect, see
crates/zenmetrics-orchestrator/README.md.
GPU scoring cost splits into three components. Modelling a workload as

`total ≈ process_start + Σ_refs(per_ref) + Σ_dists(per_dist)`

is accurate because each piece is paid in a different scope and each was measured separately:

- `process_start` is paid once per process: the CUDA context init (`Backend::client()`, a flat ~181 ms floor that is independent of metric and image size) plus the first-kernel PTX/JIT load for each metric the first time it runs. On the CPU backend this term is ≈ 0 (no device handshake; it starts computing immediately).
- `per_ref` is paid once per distinct reference image you cache via `set_reference_srgb_u8` (cvvdp: `warm_reference_srgb`): the metric's reference-side precompute. Every new reference re-pays this cost; budget one `per_ref` per distinct reference. Measured to 16.777 MP only (40 MP is unmeasured, so don't extrapolate). cvvdp / ssim2 / dssim / zensim are roughly flat across references (median `setref1 ≈ setref2 ≈ …`). Two exceptions: butteraugli eagerly allocates its full reference working set on the first `set_reference` of a freshly-warmed instance (~250 ms/MP, ≈4 s at 16 MP, then flat for later refs), so budget a one-time first-ref cost per instance on top of the flat steady state; iwssim's reuse-path references cost ~1.8× its first reference at 16 MP (~120–160 ms vs ~68 ms, and run-to-run noisy), so its per-reference cost rises and you should size for the larger value at 16 MP.
- `per_dist` is paid once per scored distorted image against a warm cached reference: `score_with_warm_ref(dist)`, the steady-state per-call wall.
The consequence is a ~181 ms one-time GPU floor (plus per-metric JIT). For a
single small image on a freshly-launched process the CPU wins — it has
no floor to amortize. As the image grows or the batch lengthens, the GPU's
throughput outruns the CPU even after paying the floor: for batch / server
use (warm context, reference cached) the GPU is faster at every measured
size (warm per-call is 10–100× below the CPU wall). The per-process floor
is paid once and shared across every metric and every pair scored in that
process — which is exactly why
zenmetrics-orchestrator keeps one
long-lived warm worker. The full warmth-scope analysis (which transitions
re-pay which component) is in
docs/GPU_INPROCESS_WARMTH_2026-05-29.md;
the clean per-reference re-measure (task #151,
benchmarks/setref_clean_all_2026-05-29.tsv)
settled the per-metric first-ref behaviour: cvvdp/ssim2/dssim/zensim are flat
across references; the prior iwssim "3×" was an n=1 transient (its real reuse
cost is ~1.8× its first ref, not 3×); and butteraugli carries a genuine
first-set_reference allocation cost (~250 ms/MP on the first call) that the
warmed-instance median in that TSV smooths over — see the raw first-call
samples and
docs/GPU_INPROCESS_WARMTH_2026-05-29.md.
All numbers below are measured medians; no value is interpolated or extrapolated. Sizes are 512² (0.262 MP), 1024² (1.049 MP), 2048² / "2K" (4.194 MP), and 4096² / "16 MP" (16.777 MP).
API: Backend::client() then the first compute_* on each metric.
Source: benchmarks/gpu_coldstart_2026-05-29.tsv
(client_init_ms / first_compute_ms / cold_total_ms, warm-disk,
7-process medians). Host: RTX 5070 (12 GiB), cuda backend, no
-C target-cpu=native.
cold_total = client_init + metric_new + first_compute. client_init
(the CUDA context) is the shared ~181 ms floor; the rest is per-metric and,
at large sizes, allocation-dominated.
| Metric | client_init (ms) | first-kernel JIT first_compute 512² (ms) | cold_total 512² (ms) | cold_total 16 MP (ms) |
|---|---|---|---|---|
| butteraugli-gpu | 166.8 | 286.7 | 498.7 | 4923.9 |
| cvvdp-gpu | 172.5 | 272.4 | 504.5 | 4282.7 |
| ssim2-gpu | 187.1 | 129.4 | 396.2 | 6740.5 |
| dssim-gpu | 185.0 | 136.5 | 376.1 | 3949.4 |
| iwssim-gpu | 182.5 | 265.1 | 491.4 | 2512.5 |
| zensim-gpu | 182.2 | 385.0 | 570.3 | 914.2 |
The client_init column is flat across metrics and sizes (measured range
166.8–191.2 ms over all 24 warm rows) — this is the once-per-process floor.
First-ever JIT on an empty PTX disk cache inflates first_compute further
(butter 1024² 303 → 1288 ms, +~1050 ms one-shot; zensim 1024² 382 → 506 ms,
+~175 ms — rows 26–27); the figures above are the warm-disk case (process
N>1 after the box has run any GPU job).
API (umbrella): Metric::set_reference_srgb_u8(ref). Per-crate this is
set_reference_srgb_u8 (butter / ssim2 / dssim / iwssim / zensim) or
warm_reference_srgb (cvvdp). Source (all six metrics, clean re-measure):
benchmarks/setref_clean_all_2026-05-29.tsv
(task #151 — setref1 = first set_reference on a fully warm instance,
setref2/setref3/setref4 = distinct different-pixel new references
(the reuse path), each followed by block_on(client.sync()) inside the
timed region, n=8 samples/phase, median + min + max reported, distinct
pixels every rep). Host: RTX 5070, cuda, no -C target-cpu=native. Each
setref1 phase shows a single rep-1 transient (a one-time first-set_reference
allocation spike — iwssim 248 ms, butter up to 4166 ms @16 MP) that the
n=8 median/min reject; that transient is exactly what an n=1 sample would
have mistaken for the phase cost.
| Metric | setref1 512² (ms) | setref2 512² (ms) | setref1 16 MP (ms) | setref2 16 MP (ms) |
|---|---|---|---|---|
| cvvdp-gpu | 1.65 | 1.59 | 16.98 | 17.17 |
| ssim2-gpu | 2.48 | 2.88 | 29.34 | 29.02 |
| dssim-gpu | 1.43 | 1.34 | 23.15 | 23.16 |
| iwssim-gpu | 2.14 | 2.04 | 68.13 | 120.04 |
| zensim-gpu | 0.62 | 0.50 | 14.59 | 14.77 |
| butteraugli-gpu | 0.77 | 0.74 | 23.33 | 23.65 |
For five of six metrics (cvvdp / ssim2 / dssim / zensim / butter) the
per-reference cost is flat: setref1 ≈ setref2 ≈ setref3 ≈ setref4
at every size, so budget one per_ref per distinct reference regardless
of which reference it is. The earlier profile recorded a huge butter
first-ref cost (34 ms @512², 3990 ms @16 MP); the task #148 clean
re-measure isolated that to first-instance allocation + JIT (which
process_start already accounts for), not the per-reference cost.
iwssim is NOT 3× more expensive on its first reference — the opposite.
A prior table here reported iwssim @16 MP at 196.5 ms setref1 vs 67.4 ms
setref2 and asserted a "~3× first-ref warmup". That row came from task
#144's gpu_inprocess_warmth Q3, which was a single sample (n=1) on a
GPU contaminated by a concurrent zensim eval — the 196.5 ms was a
transient. The clean n=8 #151 re-measure (two independent 16 MP runs) finds
iwssim's setref1 (68.1 / 73.6 ms) is the cheapest phase; setref2–
setref4 land at 120–163 ms. iwssim alone shows a real first-ref discount
at 16 MP (subsequent distinct references cost ~1.8× the first), and is flat
at 512² / 1024² / 2K. Budget the larger ~120–160 ms for every reference
after the first.
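The n=1 versus n=8 point can be illustrated with a median over a phase that contains one first-call spike. Only the 248 ms transient is taken from the text above; the other seven samples are made up for illustration and are not measurements.

```rust
/// Median of a non-empty sample set (mean of the two middle values for even n).
fn median(samples: &[f64]) -> f64 {
    assert!(!samples.is_empty(), "need at least one sample");
    let mut s = samples.to_vec();
    s.sort_by(|a, b| a.total_cmp(b));
    let n = s.len();
    if n % 2 == 1 {
        s[n / 2]
    } else {
        (s[n / 2 - 1] + s[n / 2]) / 2.0
    }
}

fn main() {
    // rep 1 is the allocation transient; reps 2..8 are hypothetical steady state.
    let phase_ms = [248.0, 68.4, 67.9, 68.1, 68.3, 68.0, 68.2, 67.8];
    let n1 = phase_ms[0]; // what a single-sample run would have reported
    let n8 = median(&phase_ms);
    println!("n=1 sample: {n1:.1} ms, n=8 median: {n8:.2} ms");
}
```

The median ignores the single outlier, which is why the n=8 re-measure and the earlier n=1 figure disagree so sharply.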
API: score_with_warm_ref(dist). Source:
benchmarks/gpu_coldstart_2026-05-29.tsv
(warm_per_call_ms, intra-process warm repeats, every call ends in a host
readback so the wall is real GPU execution). Cross-confirmed by the
warm_ref cuda rows in
benchmarks/gpu_metrics_sweep_2026-05-28.tsv.
Host: RTX 5070, cuda, no -C target-cpu=native.
| Metric | 512² (ms) | 1024² (ms) | 2K / 4.2 MP (ms) | 16 MP (ms) |
|---|---|---|---|---|
| butteraugli-gpu | 1.54 | 3.61 | 12.93 | 50.20 |
| cvvdp-gpu | 4.23 | 6.00 | 11.80 | 41.33 |
| ssim2-gpu | 3.96 | 6.50 | 14.16 | 47.70 |
| dssim-gpu | 4.14 | 5.21 | 12.17 | 46.81 |
| iwssim-gpu | 6.53 | 9.47 | 12.78 | 39.44 |
| zensim-gpu | 1.66 | 3.27 | 9.67 | 37.80 |
So scoring a batch of N distorted images against one cached reference at
16 MP on cvvdp is ~504.5 + 16.86 + N×41.33 ms (`process_start` is taken at
its 512² value, since that floor is size-independent; `per_ref` and
`per_dist` scale with image size). The
gpu_metrics_sweep warm_ref cuda column gives the same per-call shape
measured by the independent sweep harness (e.g. cvvdp 4 MP 11.80 ms here vs
7.60 ms there, ssim2 16 MP 47.70 vs 43.98: same order, different warm-up
counts).
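The batch estimate above is the three-term cost model applied to one reference and N distorted images. A minimal sketch, with a hypothetical function name, that reproduces the arithmetic using the same cvvdp figures quoted in the sentence above:

```rust
/// total ≈ process_start + Σ per_ref + Σ per_dist, all in milliseconds.
fn batch_total_ms(process_start_ms: f64, per_ref_ms: &[f64], per_dist_ms: &[f64]) -> f64 {
    process_start_ms + per_ref_ms.iter().sum::<f64>() + per_dist_ms.iter().sum::<f64>()
}

fn main() {
    // Figures quoted in the text for cvvdp at 16 MP.
    let process_start = 504.5;
    let per_ref = [16.86]; // one cached reference
    for n in [1usize, 10, 100] {
        let per_dist = vec![41.33; n]; // N distorted images
        let total = batch_total_ms(process_start, &per_ref, &per_dist);
        println!("N = {n:>3}: ~{total:.1} ms");
    }
}
```

Passing slices rather than a single per-call figure lets the same function cover workloads with several distinct references or a metric whose per-reference cost is not flat.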
API: score(ref, dist) (umbrella zenmetrics-api, full mode — build +
one cold score per call). Source:
benchmarks/cpu_wall_all_metrics_2026-05-29.tsv
(mode=full, cold_or_warm=cold, mean_ms). Harness: zenbench 0.1.8
interleaved round-robin (paired stats, loop-overhead compensated — not
criterion). Host: AMD Ryzen 9 7950X, release, no -C target-cpu=native
(runtime archmage SIMD dispatch only).
| Metric | 512² (ms) | 1024² (ms) | 2K / 4.2 MP (ms) | 16 MP (ms) |
|---|---|---|---|---|
| cvvdp | 32.48 | 128.35 | 607.28 | 3812.26 |
| ssim2 | 16.67 | 70.05 | 297.76 | 2591.03 |
| dssim | 30.53 | 123.48 | 546.16 | 4114.34 |
| butter | 12.69 | 62.69 | 347.53 | 1690.87 |
| iwssim | 59.81 | 261.88 | 1169.06 | 6665.18 |
| zensim | 6.92 | 13.92 | 78.86 | 369.66 |
These tables give the measured wall and peak working-set of
the four common execution modes (full / strip / warm_ref /
warm_ref_strip — see the
support matrix for the full set,
including cvvdp's StripPair / CappedPyramid) at a representative large
size. Every cell is a committed-TSV value — no number is interpolated
or extrapolated; unsupported (metric, mode) cells say n/a.
The modes:
- full: `score(ref, dist)`, whole-image working set.
- strip: strip-walker, one cold `(ref, dist)` per call.
- warm_ref: reference cached once (`set_reference` / `warm_reference`), then `score_with_warm_ref(dist)` per distorted image (whole-image ref state).
- warm_ref_strip: reference cached, distorted image walked in strips per call.
GPU (cuda), 16 MP = 4096². Wall = wall_median_ms (per-call
steady-state); mem = peak_vram_human. Source:
benchmarks/gpu_metrics_sweep_2026-05-28.tsv
unless noted. Host: RTX 5070 (12 GiB), no -C target-cpu=native.
| Metric | full (ms / VRAM) | strip (ms / VRAM) | warm_ref (ms / VRAM) | warm_ref_strip (ms / VRAM) |
|---|---|---|---|---|
| cvvdp-gpu | 45.5 / 3.88 GiB | 203.0 / 2.22 GiB † | 25.9 / 3.88 GiB | 108.9 / 3.88 GiB |
| butteraugli-gpu | 62.3 / 3.91 GiB | 81.1 / 481 MiB | 32.8 / 3.91 GiB | 150.9 / 4.19 GiB |
| ssim2-gpu | 50.7 / 6.15 GiB | 205.1 / 1.19 GiB | 44.0 / 6.19 GiB | 119.7 / 4.06 GiB |
| dssim-gpu | 50.5 / 3.16 GiB | 277.8 / 897 MiB | 52.2 / 3.16 GiB | 161.8 / 2.59 GiB |
| iwssim-gpu | 45.3 / 2.16 GiB | 385.0 / 545 MiB | 42.3 / 2.16 GiB | 99.8 / 802 MiB |
| zensim-gpu | 38.1 / 1.16 GiB | 61.1 / 289 MiB ‡ | 30.9 / 1.16 GiB | 488.3 / 1.22 GiB |
† cvvdp's GPU strip mode is StripPair (Mode B, one-shot — ref+dist
walk together); the row is the strip_pair cuda row. cvvdp has no
warm_ref_strip-distinct VRAM win at 16 MP because its warm_ref
keeps full-image ref state on device; the strip win for cvvdp shows up
on the CPU path below.
‡ zensim-gpu's standalone cold-strip VRAM at 16 MP is 289 MiB (vs
1.16 GiB Full — a 4.1× reduction) per the corrected re-measure
crates/zensim-gpu/benchmarks/zensim_strip_remeasure_2026-05-28.tsv;
the wall (61.1 ms) is from the sweep TSV. The strip VRAM rows in
gpu_metrics_sweep are flagged superseded (pre-fix code built a
full-image ref pyramid). The warm_ref_strip column keeps a device
ref cache, so it stays at 1.22 GiB by design.
CPU, wall at 16 MP = 4096². Wall = warm per-call for the two
warm_ref* modes, cold per-call for full/strip. Source:
benchmarks/cpu_wall_all_metrics_2026-05-29.tsv
(zenbench, 7950X). Peak heap (heaptrack) is reported at the largest
committed heaptracked size — 16 MP (4096²) for cvvdp only; the
other five were heaptracked at 12 MP (4000×3000) and are marked
@12MP, since no 16 MP heaptrack is committed for them and memory does
not extrapolate across sizes. Heap source:
benchmarks/cpu_metrics_full_table_2026-05-28.tsv
(cvvdp rows corrected to the Path A new_strip dispatcher,
crates/cvvdp/benchmarks/cpu_path_a_recovered_2026-05-29.tsv).
| Metric | full (ms / heap) | strip (ms / heap) | warm_ref (warm ms / heap) | warm_ref_strip (warm ms / heap) |
|---|---|---|---|---|
| cvvdp | 3812 / 3.66 GB | 2605 / 1.58 GB | 1790 / 3.15 GB | 2168 / 1.55 GB |
| ssim2 | 2591 / 2.01 GB@12MP | 3032 / 0.90 GB@12MP | 1429 / 1.81 GB@12MP | 2457 / 1.21 GB@12MP |
| dssim | 4114 / 2.60 GB@12MP | n/a (dssim-core 3.4 has no strip) | 2938 / 2.60 GB@12MP | n/a (no strip) |
| butter | 1691 / 2.37 GB@12MP | 1624 / 0.80 GB@12MP | 1472 / 2.31 GB@12MP | 1606 / 1.93 GB@12MP |
| iwssim | 6665 / 1.77 GB@12MP | 9954 / 0.70 GB@12MP | 6203 / 1.77 GB@12MP | 4898 / 0.92 GB@12MP |
| zensim | 370 / 0.74 GB@12MP | 368 / 0.69 GB@12MP | 345 / 0.79 GB@12MP | 290 / 0.69 GB@12MP |
The memory win of strip vs full is the reason strip mode exists.
At 16 MP on the CPU path cvvdp drops from 3.66 GB (full) to
1.58 GB (strip), a 2.3× reduction, with bit-identical JOD and
a faster wall (Path A new_strip is −43 % wall at 16 MP). At 12 MP,
butter (2.37 → 0.80 GB), iwssim (1.77 → 0.70 GB), and ssim2 (2.01 →
0.90 GB) show CPU-heap reductions of roughly 2.2–3×. On the GPU the
standalone strip win is largest for zensim-gpu (1.16 GiB → 289 MiB,
4.1×), butteraugli-gpu (3.91 GiB → 481 MiB, 8.3×), and iwssim-gpu
(2.16 GiB → 545 MiB, 4.1×). That win comes at the cost of more launches,
so strip mode is the OOM-avoidance path, not the throughput path (except
butter, which is strip-preferred). dssim's strip win is GPU-only
(3.16 GiB → 897 MiB); the dssim CPU path has no strip walker
(dssim-core 3.4).
The crossover is the size below which a single image on a cold process is
faster on CPU than GPU. `gpu_cold_total_ms` is the one-shot GPU floor
(context-init + metric_new + first_compute). Source:
benchmarks/cpu_gpu_crossover_2026-05-29.tsv and
docs/CPU_GPU_CROSSOVER_2026-05-29.md. Hosts: CPU 7950X, GPU RTX 5070, cuda, no `-C target-cpu=native`.
| Metric | one-shot: CPU wins up to | one-shot: GPU wins from | batch (warm) winner |
|---|---|---|---|
| cvvdp | 16.8 MP (all measured) | none measured | GPU at all sizes |
| ssim2 | 16.8 MP (all measured) | none measured | GPU at all sizes |
| butter | 16.8 MP (all measured) | none measured | GPU at all sizes |
| zensim | 16.8 MP (all measured) | none measured | GPU at all sizes |
| dssim | 4.2 MP (2048²) | 16.8 MP (4096²) | GPU at all sizes |
| iwssim | 1.0 MP (1024²) | 4.2 MP (2048²) | GPU at all sizes |
Crossovers are stated as a bracket between two measured sizes, never as an interpolated or fabricated MP value. GPU-cold was measured only at 512² / 1024² / 2048² / 4096²; the 12 MP and 30 MP CPU rows in the source TSV have no GPU-cold counterpart and are not given a one-shot winner. For batch / warm use there is no crossover in range: GPU wins everywhere.
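The one-shot routing rule that follows from this table can be sketched as a lookup. The thresholds are the "CPU wins up to" sizes from the table; the enum and function names are hypothetical and do not mirror the orchestrator's types. A `false` result only means the crossover rule does not force CPU, so normal backend ranking applies.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Metric {
    Cvvdp,
    Ssim2,
    Butter,
    Zensim,
    Dssim,
    Iwssim,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum ExecContext {
    Batch,
    OneShot,
}

/// Largest measured pixel count at which CPU still wins a cold one-shot call.
fn cpu_wins_up_to_px(metric: Metric) -> u64 {
    match metric {
        Metric::Cvvdp | Metric::Ssim2 | Metric::Butter | Metric::Zensim => 4096 * 4096,
        Metric::Dssim => 2048 * 2048,
        Metric::Iwssim => 1024 * 1024,
    }
}

/// True when the one-shot crossover rule routes this task to CPU.
fn prefer_cpu_one_shot(
    metric: Metric,
    width: u32,
    height: u32,
    ctx: ExecContext,
    cpu_feasible: bool,
) -> bool {
    let px = u64::from(width) * u64::from(height);
    ctx == ExecContext::OneShot && cpu_feasible && px <= cpu_wins_up_to_px(metric)
}

fn main() {
    for (m, side) in [(Metric::Dssim, 2048), (Metric::Dssim, 4096), (Metric::Iwssim, 2048)] {
        let cpu = prefer_cpu_one_shot(m, side, side, ExecContext::OneShot, true);
        println!("{m:?} {side}x{side} one-shot -> prefer CPU: {cpu}");
    }
}
```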
One runner drives all four measurement harnesses:

```sh
# full grid (512² / 1024² / 2K / 16 MP), matches the committed TSVs
scripts/perf/reproduce_perf_profile.sh
# quick smoke (512² + 16 MP only)
scripts/perf/reproduce_perf_profile.sh --quick
```

It invokes the existing drivers, with no new measurement code:

- `process_start` + `per_dist`: `scripts/memory_audit/sweep_gpu_coldstart_2026-05-29.py` (builds each crate's `examples/coldstart_one`, e.g. `crates/cvvdp-gpu/examples/coldstart_one.rs`).
- `per_ref`: `scripts/memory_audit/sweep_gpu_inprocess_warmth_2026-05-29.py` (builds `crates/zenmetrics-api/examples/inprocess_warmth.rs`).
- CPU full wall: the `cpu-wall` zenbench binary (`cargo build --release -p cpu-profile --bin cpu-wall`).

The GPU harnesses require a CUDA-capable host; the CPU wall runs anywhere. Outputs land in a timestamped scratch dir and are diffed against the committed TSVs. See the script header for per-harness flags.
- `docs/CUBECL_PORTING_GUIDE.md`: patterns for porting more CUDA / scalar metrics to multi-vendor CubeCL.
- `docs/CUBECL_GOTCHAS.md`: 30-entry catalogue of cubecl-0.10-era traps with symptoms / fixes / examples.
- `docs/SSIMULACRA2_PORTING_PLAN.md`, `docs/SSIM2_GPU_HANDOFF.md`: the per-crate porting playbooks.
- `crates/cvvdp-gpu/docs/PORT_STATUS.md`: ColorVideoVDP per-stage port status against pycvvdp v0.5.4 (host scalar reference path + GPU composition + parity test matrix).
- `scripts/sweep/cvvdp_backfill/README.md`: operator runbook for the vast.ai pipeline that backfills cvvdp JOD scores onto the zensim training parquet store. Produces side-by-side `cvvdp_imazen_*` + `cvvdp_pycvvdp_v054` sidecars with a parity gate (`assert_parity.py`) that catches both threshold violations and silent-failure flatlines.
Dual-licensed: AGPL-3.0-only (see LICENSE-AGPL3) or
Imazen commercial (see COMMERCIAL.md). dssim-gpu's
commercial track requires Kornel's upstream DSSIM licensing —
see COMMERCIAL.md, but this crate is neither maintained nor warrantied by him.