feat: add benchmark toolkit for comparing inference configs against fp64 baseline #1938

Merged: misko merged 17 commits into main from benchmark-toolkit, Apr 22, 2026
Conversation

misko (Contributor) commented Mar 31, 2026

Unified benchmark toolkit that runs gold-standard fp64 baseline inference, then evaluates a given InferenceSettings config on the same systems — measuring accuracy
(energy/force/stress error vs baseline) and performance (QPS, GPU memory, warmup time).

Default systems:

  • Water box (60 atoms, omol)
  • FCC crystal (200 atoms, omat)
  • FCC crystal (1000 atoms, omat)

Integrates with the fairchem Hydra CLI; override any setting:

  fairchem -c configs/uma/benchmark/toolkit/benchmark.yaml
  fairchem -c ... runner.inference_settings.execution_mode=umas_fast_gpu
  fairchem -c ... runner.inference_settings.tf32=True
  fairchem -c ... runner.device=cpu
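As a rough illustration of the accuracy comparison (not the PR's actual metric code; function and key names here are hypothetical), the per-system error against the fp64 baseline could be computed like this:

  import numpy as np

  def accuracy_vs_baseline(baseline: dict, candidate: dict) -> dict:
      # Mean absolute error of candidate predictions vs the fp64 baseline,
      # per quantity, mirroring the energy/force/stress errors listed above.
      return {
          key + "_mae": float(
              np.abs(np.asarray(candidate[key]) - np.asarray(baseline[key])).mean()
          )
          for key in ("energy", "forces", "stress")
      }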

meta-cla bot added the cla signed label Mar 31, 2026
misko added 2 commits April 7, 2026 02:09
Add a training benchmark that compares fp32 baseline vs candidate
training runs across all 5 UMA tasks (oc20, omol, omat, odac, omc),
reporting loss fidelity, gradient norm error, throughput, and memory.

- Move fake dataset generation utilities into
  src/fairchem/core/components/benchmark/fake_dataset.py so they are
  importable from both CLI and tests
- Generate 5 datasets with production-like system sizes and batch
  config (max_atoms=350) into a cached tmpdir
- Cache both fake datasets and fp32 baseline results across runs to
  avoid redundant computation
- Add BenchmarkTrainCallback to capture per-step losses, grad norms,
  step times, and peak memory (a sketch follows this list)
- Add baseline caching for inference benchmark in toolkit.py
- Expose last_loss and last_grad_norm on MLIPTrainEvalUnit for the
  callback to read
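A minimal sketch of that callback pattern (hook names and the unit interface are hypothetical; the real BenchmarkTrainCallback lives in this PR):

  import time
  import torch

  class BenchmarkTrainCallback:
      def __init__(self):
          self.losses, self.grad_norms, self.step_times = [], [], []
          self._t0 = 0.0

      def on_step_start(self, unit) -> None:
          self._t0 = time.perf_counter()

      def on_step_end(self, unit) -> None:
          # The unit is assumed to expose last_loss and last_grad_norm,
          # as the commit message above describes.
          self.step_times.append(time.perf_counter() - self._t0)
          self.losses.append(float(unit.last_loss))
          self.grad_norms.append(float(unit.last_grad_norm))

      def peak_memory_gb(self) -> float:
          # Peak CUDA allocation so far (0.0 on CPU-only hosts).
          if not torch.cuda.is_available():
              return 0.0
          return torch.cuda.max_memory_allocated() / 1e9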
misko added the enhancement (New feature or request) and patch (Patch version release) labels Apr 7, 2026
misko added 4 commits April 7, 2026 02:49
Remove fragile TRAINING_CONFIG relative path that breaks in CI
when the package is installed into site-packages. Both the test
and CLI config now pass the config path explicitly.

The benchmark toolkit's training_inner.yaml references backbone, datasets,
optimizer, and tasks via Hydra defaults, but the symlinks pointing to
tests/core/units/mlip_unit/ were not committed. This caused CI to fail
with "Could not find 'optimizer/adamw'". Also removes unused slurm and
local_8gpu job configs.

Symlinks to tests/core/units/mlip_unit/ were fragile and caused CI
failures. Copy the referenced configs (backbone, datasets, optimizer,
tasks) directly into configs/uma/benchmark/toolkit/ as real files.
misko requested review from lbluque April 9, 2026 22:03
…toolkit

run_inference was passing model names like "uma-s-1p2" directly to
MLIPPredictUnit, which expects a file path. Now resolves names via
pretrained_checkpoint_path_from_name before loading.
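A sketch of the resulting resolution logic (the helper's import path is not shown in this thread, so a stub stands in for it here):

  import os

  def pretrained_checkpoint_path_from_name(name: str) -> str:
      # Stand-in for the fairchem helper named above, which maps a
      # pretrained model name such as "uma-s-1p2" to a checkpoint path.
      raise NotImplementedError

  def resolve_checkpoint(name_or_path: str) -> str:
      # Existing files are used directly; anything else is treated as a
      # pretrained model name and resolved to a path before loading.
      if os.path.exists(name_or_path):
          return name_or_path
      return pretrained_checkpoint_path_from_name(name_or_path)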
lbluque (Contributor) left a comment

Looking good so far. One general nit: can we change the name "toolkit" to something more informative throughout this PR? "Toolkit" is quite general, and this is supposed to be a set of end-to-end sanity tests only.

  scheduler:
    mode: LOCAL
    ranks_per_node: 1
    run_dir: /checkpoint/ocp/rgao/speed
Contributor:

Should we set this to some general shared location? Or a user's directory with ${oc.env:USER}?
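For reference, the ${oc.env:USER} suggestion would look roughly like this (the path itself is hypothetical; oc.env is a built-in OmegaConf resolver):

  scheduler:
    mode: LOCAL
    ranks_per_node: 1
    run_dir: /checkpoint/${oc.env:USER}/benchmark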

- job: local
- _self_

checkpoint:
Contributor:

Do we want this to be a local checkpoint instead?

misko (Contributor, Author):

Right now it can be local or a string; it gets passed through, so both are supported.

misko added 8 commits April 10, 2026 18:12
…test

ManagedAttribute enforces type checking on job_config, rejecting MagicMock.
Use OmegaConf.create() with the required structure instead, matching the
pattern used in test_runner_catches_oom.
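A minimal sketch of that pattern (keys are illustrative; the real structure is whatever job_config requires):

  from omegaconf import OmegaConf

  # A real DictConfig with the required shape passes ManagedAttribute's
  # type check, unlike a MagicMock.
  job_config = OmegaConf.create(
      {
          "run_dir": "/tmp/benchmark",  # read by runner.run() for baseline caching
          "scheduler": {"mode": "LOCAL", "ranks_per_node": 1},
      }
  )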
…onfig

Addresses review feedback to not hardcode a personal directory.

The runner.run() method accesses job_config.run_dir for baseline caching.

These tests download the full uma-s-1p2 checkpoint and need more GPU
memory than available in CI.
- training.py: Add FileHandler logging to TrainingBenchmarkRunner.run()
- training_inner.yaml: Match production params (num_experts=64, max_neighbors=30, cutoff_radius=6.0)
- K4L2.yaml: Parameterize cutoff via ${cutoff_radius} (see the sketch after this list)
- fake_dataset.py: Production-like system size distributions and per-dataset n_train
- Add 2x configs (aselmdb_conserving_all_2x, training_inner_2x, training_benchmark_2x)
  for heavier small-atom dataset sampling (~11 systems/batch)
- training_benchmark.yaml: Add candidate_overrides (null, no nvmath dependency)
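The ${cutoff_radius} parameterization mentioned above would look roughly like this (surrounding keys are illustrative):

  # training_inner.yaml (sketch): define the value once
  cutoff_radius: 6.0
  max_neighbors: 30

  # K4L2.yaml (sketch): reuse it via OmegaConf interpolation
  backbone:
    cutoff: ${cutoff_radius}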
lbluque previously approved these changes Apr 21, 2026

lbluque (Contributor) left a comment
lgtm - have you tested this already? I assume we will need to save reference values to compare changes in checkpoints and code, is that right?

The name "toolkit" is generic — "perf-check" better captures the purpose
of validating correctness and performance against a reference baseline.
misko added this pull request to the merge queue Apr 22, 2026
Merged via the queue into main with commit 2e55187 Apr 22, 2026
14 of 15 checks passed
misko deleted the benchmark-toolkit branch April 22, 2026 04:28