feat: add benchmark toolkit for comparing inference configs against fp64 baseline #1938
Conversation
Unified toolkit that runs gold-standard fp64 baseline inference, then evaluates a given InferenceSettings config on the same systems — measuring accuracy (energy/force/stress error vs baseline) and performance (QPS, GPU memory, warmup time).

Default systems:
- water box (60 atoms, omol)
- FCC crystal (200 atoms, omat)
- FCC crystal (1000 atoms, omat)

Integrates with the fairchem Hydra CLI — override any setting:

fairchem -c configs/uma/benchmark/toolkit/benchmark.yaml
fairchem -c ... runner.inference_settings.execution_mode=umas_fast_gpu
fairchem -c ... runner.inference_settings.tf32=True
fairchem -c ... runner.device=cpu
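As a rough, hypothetical sketch of what the accuracy/performance comparison amounts to (not the toolkit's actual implementation; `predict_fn`, `systems`, and `baseline` stand in for the predict unit, the benchmark systems, and the cached fp64 results):

```python
# Hypothetical sketch only: score a candidate InferenceSettings run against a
# cached fp64 baseline. Names here are illustrative, not the toolkit's API.
import time
import torch

def compare_to_baseline(predict_fn, systems, baseline):
    energy_err, force_err, n = 0.0, 0.0, 0
    start = time.perf_counter()
    for system, ref in zip(systems, baseline):
        out = predict_fn(system)  # run with the candidate InferenceSettings
        energy_err += (out["energy"] - ref["energy"]).abs().item()
        force_err += (out["forces"] - ref["forces"]).abs().mean().item()
        n += 1
    elapsed = time.perf_counter() - start
    return {
        "energy_mae": energy_err / n,  # error vs the fp64 gold standard
        "force_mae": force_err / n,
        "qps": n / elapsed,            # throughput over the benchmark systems
        "gpu_mem_gb": torch.cuda.max_memory_allocated() / 1e9,
    }
```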
Add a training benchmark that compares fp32 baseline vs candidate training runs across all 5 UMA tasks (oc20, omol, omat, odac, omc), reporting loss fidelity, gradient norm error, throughput, and memory.

- Move fake dataset generation utilities into src/fairchem/core/components/benchmark/fake_dataset.py so they are importable from both CLI and tests
- Generate 5 datasets with production-like system sizes and batch config (max_atoms=350) into a cached tmpdir
- Cache both fake datasets and fp32 baseline results across runs to avoid redundant computation
- Add BenchmarkTrainCallback to capture per-step losses, grad norms, step times, and peak memory
- Add baseline caching for inference benchmark in toolkit.py
- Expose last_loss and last_grad_norm on MLIPTrainEvalUnit for the callback to read
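A minimal illustration of the callback idea (the hook names and recorder class are placeholders, not the actual BenchmarkTrainCallback; last_loss and last_grad_norm are the attributes this PR exposes on MLIPTrainEvalUnit):

```python
# Illustrative sketch of per-step bookkeeping, assuming start/end step hooks.
import time
import torch

class StepMetricsRecorder:
    def __init__(self):
        self.losses, self.grad_norms, self.step_times = [], [], []
        self._t0 = None

    def on_step_start(self, unit):
        self._t0 = time.perf_counter()

    def on_step_end(self, unit):
        self.step_times.append(time.perf_counter() - self._t0)
        self.losses.append(float(unit.last_loss))           # exposed by this PR
        self.grad_norms.append(float(unit.last_grad_norm))  # exposed by this PR

    def peak_memory_gb(self):
        return torch.cuda.max_memory_allocated() / 1e9
```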
Remove fragile TRAINING_CONFIG relative path that breaks in CI when the package is installed into site-packages. Both the test and CLI config now pass the config path explicitly.
The benchmark toolkit's training_inner.yaml references backbone, datasets, optimizer, and tasks via Hydra defaults, but the symlinks pointing to tests/core/units/mlip_unit/ were not committed. This caused CI to fail with "Could not find 'optimizer/adamw'". Also removes unused slurm and local_8gpu job configs.
Symlinks to tests/core/units/mlip_unit/ were fragile and caused CI failures. Copy the referenced configs (backbone, datasets, optimizer, tasks) directly into configs/uma/benchmark/toolkit/ as real files.
…toolkit run_inference was passing model names like "uma-s-1p2" directly to MLIPPredictUnit, which expects a file path. Now resolves names via pretrained_checkpoint_path_from_name before loading.
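A hedged sketch of that resolution step (the import path is assumed, and resolve_checkpoint is an illustrative helper, not code from the PR):

```python
import os

# Assumed import path; the function name comes from the commit message above.
from fairchem.core.calculate.pretrained_mlip import pretrained_checkpoint_path_from_name

def resolve_checkpoint(name_or_path: str) -> str:
    """Accept either a local checkpoint file or a model name like 'uma-s-1p2'."""
    if os.path.exists(name_or_path):
        return name_or_path
    # Resolve registered model names to a downloaded checkpoint file before
    # handing them to MLIPPredictUnit, which expects a path.
    return str(pretrained_checkpoint_path_from_name(name_or_path))
```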
lbluque
left a comment
Looking good so far. One general nit, can we change the name "toolkit" to something more informative throughout this PR? I feel toolkit is quite general and this is supposed to be a set of end-to-end sanity tests only
scheduler:
  mode: LOCAL
  ranks_per_node: 1
run_dir: /checkpoint/ocp/rgao/speed
Should we set this to some general shared location? Or a user's directory with ${oc.env:USER}?
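(For reference, ${oc.env:USER} is OmegaConf's built-in environment-variable resolver; a quick illustration, with the directory layout chosen arbitrarily:)

```python
from omegaconf import OmegaConf

# ${oc.env:USER} resolves to the value of $USER at access time.
cfg = OmegaConf.create({"run_dir": "/checkpoint/${oc.env:USER}/speed_benchmark"})
print(OmegaConf.to_container(cfg, resolve=True))
# e.g. {'run_dir': '/checkpoint/alice/speed_benchmark'}
```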
- job: local
- _self_

checkpoint:
Do we want this to be a local checkpoint instead?
Right now it can be a local path or a name string; it gets passed through, and both are supported.
…test ManagedAttribute enforces type checking on job_config, rejecting MagicMock. Use OmegaConf.create() with the required structure instead, matching the pattern used in test_runner_catches_oom.
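Roughly what the test now builds (only run_dir is known to be read by the runner; the other fields are placeholders for whatever structure the test actually requires):

```python
from omegaconf import OmegaConf

# A real config object passes ManagedAttribute's type check, unlike MagicMock.
job_config = OmegaConf.create(
    {
        "run_dir": "/tmp/benchmark_test",  # read by runner.run() for baseline caching
        "timestamp_id": "test",            # placeholder field
        "device_type": "cpu",              # placeholder field
    }
)
# runner.job_config = job_config  # assigned to the runner under test
```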
…onfig Addresses review feedback to not hardcode a personal directory.
The runner.run() method accesses job_config.run_dir for baseline caching.
These tests download the full uma-s-1p2 checkpoint and need more GPU memory than available in CI.
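One possible way to gate such tests (illustrative only; the CI environment check is an assumption, not necessarily the mechanism used here):

```python
import os
import pytest

# Skip in CI, where the uma-s-1p2 download and GPU memory needs are too large.
requires_big_gpu = pytest.mark.skipif(
    os.environ.get("CI") == "true",
    reason="downloads the full uma-s-1p2 checkpoint and exceeds CI GPU memory",
)

@requires_big_gpu
def test_inference_benchmark_end_to_end():
    ...
```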
- training.py: Add FileHandler logging to TrainingBenchmarkRunner.run() (sketched after this list)
- training_inner.yaml: Match production params (num_experts=64, max_neighbors=30, cutoff_radius=6.0)
- K4L2.yaml: Parameterize cutoff via ${cutoff_radius}
- fake_dataset.py: Production-like system size distributions and per-dataset n_train
- Add 2x configs (aselmdb_conserving_all_2x, training_inner_2x, training_benchmark_2x)
for heavier small-atom dataset sampling (~11 systems/batch)
- training_benchmark.yaml: Add candidate_overrides (null, no nvmath dependency)
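The FileHandler addition from the first item might look roughly like this (log file name and format are assumptions):

```python
import logging
import os

def attach_run_log(run_dir: str) -> logging.FileHandler:
    """Mirror benchmark logging into a file under the run directory."""
    handler = logging.FileHandler(os.path.join(run_dir, "training_benchmark.log"))
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logging.getLogger().addHandler(handler)
    return handler
```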
lbluque
left a comment
lgtm - have you tested this already? I assume we will need to save reference values to compare changes in checkpoints and code, is that right?
The name "toolkit" is generic — "perf-check" better captures the purpose of validating correctness and performance against a reference baseline.