feat: add benchmark toolkit for comparing inference configs against fp64 baseline #1938

Merged: misko merged 17 commits into main from benchmark-toolkit, Apr 22, 2026
Conversation

misko (Contributor) commented Mar 31, 2026

Unified benchmark toolkit that runs gold-standard fp64 baseline inference, then evaluates a given InferenceSettings config on the same systems — measuring accuracy
(energy/force/stress error vs baseline) and performance (QPS, GPU memory, warmup time).

Default systems:

  • Water box (60 atoms, omol)
  • FCC crystal (200 atoms, omat)
  • FCC crystal (1000 atoms, omat)

Integrates with the fairchem Hydra CLI; override any setting:

  fairchem -c configs/uma/benchmark/toolkit/benchmark.yaml
  fairchem -c ... runner.inference_settings.execution_mode=umas_fast_gpu
  fairchem -c ... runner.inference_settings.tf32=True
  fairchem -c ... runner.device=cpu
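As a rough illustration of the accuracy comparison (not the PR's actual metric code; function and key names here are hypothetical), the per-system error against the fp64 baseline could be computed like this:

  import numpy as np

  def accuracy_vs_baseline(baseline: dict, candidate: dict) -> dict:
      # Mean absolute error of candidate predictions vs the fp64 baseline,
      # per quantity, mirroring the energy/force/stress errors listed above.
      return {
          key + "_mae": float(
              np.abs(np.asarray(candidate[key]) - np.asarray(baseline[key])).mean()
          )
          for key in ("energy", "forces", "stress")
      }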

meta-cla bot added the cla signed label Mar 31, 2026
misko added 2 commits April 7, 2026 02:09
Add a training benchmark that compares fp32 baseline vs candidate
training runs across all 5 UMA tasks (oc20, omol, omat, odac, omc),
reporting loss fidelity, gradient norm error, throughput, and memory.

- Move fake dataset generation utilities into
  src/fairchem/core/components/benchmark/fake_dataset.py so they are
  importable from both CLI and tests
- Generate 5 datasets with production-like system sizes and batch
  config (max_atoms=350) into a cached tmpdir
- Cache both fake datasets and fp32 baseline results across runs to
  avoid redundant computation
- Add BenchmarkTrainCallback to capture per-step losses, grad norms,
  step times, and peak memory (a sketch follows this list)
- Add baseline caching for inference benchmark in toolkit.py
- Expose last_loss and last_grad_norm on MLIPTrainEvalUnit for the
  callback to read
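A minimal sketch of that callback pattern (hook names and the unit interface are hypothetical; the real BenchmarkTrainCallback lives in this PR):

  import time
  import torch

  class BenchmarkTrainCallback:
      def __init__(self):
          self.losses, self.grad_norms, self.step_times = [], [], []
          self._t0 = 0.0

      def on_step_start(self, unit) -> None:
          self._t0 = time.perf_counter()

      def on_step_end(self, unit) -> None:
          # The unit is assumed to expose last_loss and last_grad_norm,
          # as the commit message above describes.
          self.step_times.append(time.perf_counter() - self._t0)
          self.losses.append(float(unit.last_loss))
          self.grad_norms.append(float(unit.last_grad_norm))

      def peak_memory_gb(self) -> float:
          # Peak CUDA allocation so far (0.0 on CPU-only hosts).
          if not torch.cuda.is_available():
              return 0.0
          return torch.cuda.max_memory_allocated() / 1e9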
misko added the enhancement (New feature or request) and patch (Patch version release) labels Apr 7, 2026
misko added 4 commits April 7, 2026 02:49
Remove fragile TRAINING_CONFIG relative path that breaks in CI
when the package is installed into site-packages. Both the test
and CLI config now pass the config path explicitly.

The benchmark toolkit's training_inner.yaml references backbone, datasets,
optimizer, and tasks via Hydra defaults, but the symlinks pointing to
tests/core/units/mlip_unit/ were not committed. This caused CI to fail
with "Could not find 'optimizer/adamw'". Also removes unused slurm and
local_8gpu job configs.

Symlinks to tests/core/units/mlip_unit/ were fragile and caused CI
failures. Copy the referenced configs (backbone, datasets, optimizer,
tasks) directly into configs/uma/benchmark/toolkit/ as real files.
misko requested review from lbluque April 9, 2026 22:03
…toolkit

run_inference was passing model names like "uma-s-1p2" directly to
MLIPPredictUnit, which expects a file path. Now resolves names via
pretrained_checkpoint_path_from_name before loading.
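A sketch of the resulting resolution logic (the helper's import path is not shown in this thread, so a stub stands in for it here):

  import os

  def pretrained_checkpoint_path_from_name(name: str) -> str:
      # Stand-in for the fairchem helper named above, which maps a
      # pretrained model name such as "uma-s-1p2" to a checkpoint path.
      raise NotImplementedError

  def resolve_checkpoint(name_or_path: str) -> str:
      # Existing files are used directly; anything else is treated as a
      # pretrained model name and resolved to a path before loading.
      if os.path.exists(name_or_path):
          return name_or_path
      return pretrained_checkpoint_path_from_name(name_or_path)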
lbluque (Contributor) left a comment

Looking good so far. One general nit: can we change the name "toolkit" to something more informative throughout this PR? "Toolkit" is quite general, and this is supposed to be a set of end-to-end sanity tests only.

  scheduler:
    mode: LOCAL
    ranks_per_node: 1
    run_dir: /checkpoint/ocp/rgao/speed
Contributor:

Should we set this to some general shared location? Or a user's directory with ${oc.env:USER}?
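For reference, the ${oc.env:USER} suggestion would look roughly like this (the path itself is hypothetical; oc.env is a built-in OmegaConf resolver):

  scheduler:
    mode: LOCAL
    ranks_per_node: 1
    run_dir: /checkpoint/${oc.env:USER}/benchmark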

- job: local
- _self_

checkpoint:
Contributor:

Do we want this to be a local checkpoint instead?

misko (Contributor, Author):

Right now it can be local or a string; it gets passed through, so both are supported.

misko added 8 commits April 10, 2026 18:12
…test

ManagedAttribute enforces type checking on job_config, rejecting MagicMock.
Use OmegaConf.create() with the required structure instead, matching the
pattern used in test_runner_catches_oom.
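A minimal sketch of that pattern (keys are illustrative; the real structure is whatever job_config requires):

  from omegaconf import OmegaConf

  # A real DictConfig with the required shape passes ManagedAttribute's
  # type check, unlike a MagicMock.
  job_config = OmegaConf.create(
      {
          "run_dir": "/tmp/benchmark",  # read by runner.run() for baseline caching
          "scheduler": {"mode": "LOCAL", "ranks_per_node": 1},
      }
  )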
…onfig

Addresses review feedback to not hardcode a personal directory.

The runner.run() method accesses job_config.run_dir for baseline caching.

These tests download the full uma-s-1p2 checkpoint and need more GPU
memory than available in CI.
- training.py: Add FileHandler logging to TrainingBenchmarkRunner.run()
- training_inner.yaml: Match production params (num_experts=64, max_neighbors=30, cutoff_radius=6.0)
- K4L2.yaml: Parameterize cutoff via ${cutoff_radius} (see the sketch after this list)
- fake_dataset.py: Production-like system size distributions and per-dataset n_train
- Add 2x configs (aselmdb_conserving_all_2x, training_inner_2x, training_benchmark_2x)
  for heavier small-atom dataset sampling (~11 systems/batch)
- training_benchmark.yaml: Add candidate_overrides (null, no nvmath dependency)
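The ${cutoff_radius} parameterization mentioned above would look roughly like this (surrounding keys are illustrative):

  # training_inner.yaml (sketch): define the value once
  cutoff_radius: 6.0
  max_neighbors: 30

  # K4L2.yaml (sketch): reuse it via OmegaConf interpolation
  backbone:
    cutoff: ${cutoff_radius}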
lbluque previously approved these changes Apr 21, 2026

lbluque (Contributor) left a comment
lgtm - have you tested this already? I assume we will need to save reference values to compare changes in checkpoints and code, is that right?

The name "toolkit" is generic — "perf-check" better captures the purpose
of validating correctness and performance against a reference baseline.
misko added this pull request to the merge queue Apr 22, 2026
Merged via the queue into main with commit 2e55187 Apr 22, 2026
14 of 15 checks passed
misko deleted the benchmark-toolkit branch April 22, 2026 04:28