Evaluation and benchmark docs (#1402)

lbluque · mshuaibii · web-flow · commit 00743398b6dd · 2025-08-20T16:21:07.000-07:00
* evaluation doc page

* cleanup evaluation docs

* cleanup evaluation docs

* title nits

* benchmark docs

* remove extra title

---------

Co-authored-by: Muhammed Shuaibi <45150244+mshuaibii@users.noreply.github.com>
diff --git a/docs/core/common_tasks/ase_calculator.md b/docs/core/common_tasks/ase_calculator.md
@@ -11,7 +11,7 @@ kernelspec:
   name: python3
 ---
 
-Inference using ASE and Predictor Interface
+Inference using ASE and Predictor interface
 ------------------
 
 Inference is done using [MLIPPredictUnit](https://github.com/facebookresearch/fairchem/blob/main/src/fairchem/core/units/mlip_unit/mlip_unit.py#L867). The [FairchemCalculator](https://github.com/facebookresearch/fairchem/blob/main/src/fairchem/core/calculate/ase_calculator.py#L3) (an ASE calculator) is simply a convenience wrapper around the MLIPPredictUnit.
diff --git a/docs/core/common_tasks/ase_dataset_creation.md b/docs/core/common_tasks/ase_dataset_creation.md
@@ -1,5 +1,5 @@
 
-# FAIRChem & Custom Datasets
+# FAIRChem & custom datasets
 
 ## Datasets in `fairchem`:
 `fairchem` provides training and evaluation code for tasks and models that take arbitrary
diff --git a/docs/core/common_tasks/benchmark.md b/docs/core/common_tasks/benchmark.md
@@ -0,0 +1,81 @@
+## Running model benchmarks
+
+Model benchmarks evaluate a model on downstream property predictions, where several model evaluations are needed to calculate a single property or a set of related properties. Examples include structure relaxations, elastic tensors, phonons, and adsorption energies.
+
+To benchmark UMA models on standard datasets, you can find benchmark configuration files in `configs/uma/benchmark`. Example files include:
+- `adsorbml.yaml`
+- `hea-is2re.yaml`
+- `kappa103.yaml`
+- `matbench-discovery-discovery.yaml`
+- `mdr-phonon.yaml`
+
+Note that to run these UMA benchmarks you will need to obtain the target data.
+
+1. **Run the Benchmark Script**  
+   Use the same `fairchem` runner script used for model evaluation, specifying the benchmark config:
+   ```bash
+   fairchem --config configs/uma/benchmark/benchmark.yaml
+   ```
+   Replace `benchmark.yaml` with the desired benchmark config file.
+
+2. **Output**  
+   Benchmark results are saved to a *results* directory under the *run_dir* specified in the configuration file. Additionally, benchmark metrics are logged using the specified logger. We currently only support Weights and Biases.
+
+## Benchmark Configuration File Format
+
+Evaluation configuration files are written in Hydra YAML format and specify how a model evaluation should be run. UMA evaluation configuration files, which can be used as templates to evaluate other models if needed, are located in `configs/uma/evaluate/`.
+
+### Top-Level Keys
+
+The benchmark configuration files follow the same format as model training and evaluation configuration files, with the addition of a **reducer** flag to specify how final metrics are calculated from the results of a given benchmark calculation protocol.
+
+A benchmark configuration file should define the following top-level keys:
+
+- **job**: Contains all settings related to the evaluation job itself, including model, data, and logger configuration. For additional details, see the description given on the Evaluation page.
+- **runner**: Contains settings for a `CalculateRunner` which implements a downstream property calculation or simulation.
+- **reducer**: Contains the settings for a `BenchmarkReducer` class, which defines how to aggregate the results calculated by the `CalculateRunner` and compute metrics based on given target values.
+
+#### `CalculateRunner`s:
+The benchmark details including the type of calculations and the model checkpoint are specified under the runner flag. The specific benchmark calculations are based on the chosen `CalculateRunner` (for example a `RelaxationRunner`). Several `CalculateRunner` implementations are found in the `fairchem.core.components.calculate` submodule.
+
+### Implementing new calculations in a `CalculateRunner`
+It is straightforward to write your own calculations in a `CalculateRunner`. Although the implementation is very flexible and open-ended, we suggest that you have a look at the interface set up by the `CalculateRunner` base class. At a minimum, you will need to implement the following methods:
+
+```python
+    def calculate(self, job_num: int = 0, num_jobs: int = 1) -> R:
+      """Implement your calculations here by iterating over the self.input_data attribute"""
+
+    def write_results(
+        self, results: R, results_dir: str, job_num: int = 0, num_jobs: int = 1
+    ) -> None:
+      """Write the results returned by your calculations in the method above"""
+```
+
+You will also see `save_state` and `load_state` abstract methods that you can use to checkpoint calculations. However, in most cases, if calculations are fast enough, you won't need this and you can simply implement those as empty methods.
+
+
+#### `BenchmarkReducer`s:
+A `CalculateRunner` will run calculations over a given set of structures and write out results. In order to compute benchmark metrics, a `BenchmarkReducer` is used to aggregate all these results, compute metrics, and report them. Implementations of `BenchmarkReducer` classes are found in the `fairchem.core.components.benchmark` submodule.
+
+### Implementing metrics in a `BenchmarkReducer`
+
+If you want to implement your own benchmark metric calculation you can write a `BenchmarkReducer` class. At a minimum, you will need to implement the following methods:
+
+```python
+    def join_results(self, results_dir: str, glob_pattern: str) -> R:
+        """Join your results from multiple files into a single result object."""
+
+    def save_results(self, results: R, results_dir: str) -> None:
+        """Save joined results to a single file"""
+
+    def compute_metrics(self, results: R, run_name: str) -> M:
+        """Compute metrics using the joined results and target data in your BenchmarkReducer."""
+
+    def save_metrics(self, metrics: M, results_dir: str) -> None:
+        """Save the computed metrics to a file."""
+
+    def log_metrics(self, metrics: M, run_name: str):
+        """Log metrics to the configured logger."""
+```
+
+If it makes sense for your benchmark metrics and you are happy working with dictionaries and pandas `DataFrame`s, a lot of boilerplate code is already implemented in the `JsonDFReducer`. We recommend that you start there by deriving your class from it and focusing only on implementing the `compute_metrics` method.
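A minimal sketch of the runner/reducer split described in `benchmark.md` above, written as a note alongside the diff rather than as part of it. It does not import `fairchem`: `ToyRunner` and `ToyReducer` are hypothetical stand-ins that only mirror the method names listed in the docs (`calculate`, `write_results`, `join_results`, `compute_metrics`, and so on). The toy data, the file naming scheme, the "doubling" calculation, and the MAE metric are all assumptions made for illustration, and the real base classes may require additional arguments.

```python
import json
import tempfile
from pathlib import Path


class ToyRunner:
    """Hypothetical stand-in for a CalculateRunner subclass (no fairchem import)."""

    def __init__(self, input_data):
        self.input_data = input_data

    def calculate(self, job_num=0, num_jobs=1):
        # Each job handles every num_jobs-th item, starting at index job_num.
        chunk = self.input_data[job_num::num_jobs]
        # Placeholder "calculation": a real runner would call a model here.
        return [{"id": item["id"], "pred": 2.0 * item["x"]} for item in chunk]

    def write_results(self, results, results_dir, job_num=0, num_jobs=1):
        # One file per job so that a reducer can join them later.
        path = Path(results_dir) / f"results_{job_num}-{num_jobs}.json"
        path.write_text(json.dumps(results))

    def save_state(self, *args, **kwargs):
        return True  # fast calculations: nothing to checkpoint

    def load_state(self, *args, **kwargs):
        return None


class ToyReducer:
    """Hypothetical stand-in for a BenchmarkReducer subclass."""

    def __init__(self, targets):
        self.targets = targets  # maps structure id -> target value

    def join_results(self, results_dir, glob_pattern):
        joined = []
        for path in sorted(Path(results_dir).glob(glob_pattern)):
            joined.extend(json.loads(path.read_text()))
        return joined

    def save_results(self, results, results_dir):
        (Path(results_dir) / "joined.json").write_text(json.dumps(results))

    def compute_metrics(self, results, run_name):
        errors = [abs(r["pred"] - self.targets[r["id"]]) for r in results]
        return {"run_name": run_name, "mae": sum(errors) / len(errors)}

    def save_metrics(self, metrics, results_dir):
        (Path(results_dir) / "metrics.json").write_text(json.dumps(metrics))

    def log_metrics(self, metrics, run_name):
        print(f"[{run_name}] {metrics}")


def run_toy_benchmark(num_jobs=2):
    """Run all jobs sequentially, then reduce their results into metrics."""
    data = [{"id": i, "x": float(i)} for i in range(5)]
    targets = {i: 2.0 * i + 0.5 for i in range(5)}
    runner, reducer = ToyRunner(data), ToyReducer(targets)
    with tempfile.TemporaryDirectory() as results_dir:
        for job_num in range(num_jobs):
            results = runner.calculate(job_num, num_jobs)
            runner.write_results(results, results_dir, job_num, num_jobs)
        joined = reducer.join_results(results_dir, "results_*.json")
        metrics = reducer.compute_metrics(joined, "toy-run")
        reducer.save_results(joined, results_dir)
        reducer.save_metrics(metrics, results_dir)
    return joined, metrics
```

Calling `run_toy_benchmark()` runs two jobs one after the other and then reduces their output. The point of the split is that runner jobs can write results independently, while the reducer only needs the results directory and a glob pattern to find them.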
diff --git a/docs/core/common_tasks/evaluation.md b/docs/core/common_tasks/evaluation.md
@@ -1,3 +1,80 @@
-# Evaluation
+# Evaluating pretrained models
 
-This repo provides a number of methods used to benchmark and evaluate the UMA models that will be helpful for apples-to-apples comparisons with the paper results. More details to be provided here soon. 
+`fairchemV2` provides a number of methods used to benchmark and evaluate the UMA models that will be helpful for apples-to-apples comparisons with the paper results. More details to be provided here soon. 
+
+## Running Model Evaluations
+
+To evaluate a UMA model using a pre-existing configuration file, follow these steps. Example configuration files used to evaluate UMA models are stored in `configs/uma/evaluate`.
+
+1. **Run the Evaluation Script**  
+   To run an evaluation simply run:
+   ```bash
+   fairchem --config evaluation_config.yaml
+   ```
+   Replace `evaluation_config.yaml` with the desired config file, for example `configs/uma/evaluate/uma_conserving.yaml`.
+
+2. **Output**  
+   Results will be logged according to the specified logger. We currently only support Weights and Biases.
+
+## Evaluation Configuration File Format
+
+Evaluation configuration files are written in Hydra YAML format and specify how a model evaluation should be run. UMA evaluation configuration files, which can be used as templates to evaluate other models if needed, are located in `configs/uma/evaluate/`.
+
+### Top-Level Keys
+
+Similar to training configuration files, the only allowed top-level keys are the `job` and `runner` keys, as well as interpolation keys that are resolved at runtime.
+
+- **job**: Contains all settings related to the evaluation job itself, including model, data, and logger configuration.
+- **runner**: Contains settings for the evaluation runner, such as which script to use and runtime options.
+
+Important configuration options are nested under these keys as follows:
+
+#### Under `job`:
+Specifications of how to run the actual job. The configuration options are the same here as those in a training job. Some notable flags are detailed below:
+- `device_type`: The device to run model inference on (i.e., CUDA or CPU).
+- `scheduler`: The compute scheduler specifications
+- `logger`: Configuration for logging results.
+  - `type`: Logger type (e.g., `wandb`).
+  - `project`: Logging project name.
+  - `entity`: (Optional) Logger entity/user.
+- `run_dir`: Directory where results and logs will be saved.
+
+#### Under `runner`:
+The actual benchmark details such as model checkpoint and the dataset are specified under the runner flag. An evaluation run should use the `EvalRunner` class which relies on an `MLIPEvalUnit` to run inference using a pretrained model.
+
+- `dataloader`: Dataloader specification for the evaluation dataset.
+- `eval_unit`: The specification of the `MLIPEvalUnit` to be used.
+  - `tasks`: The prediction task configuration. In almost all cases you can think of, these should be loaded from a model checkpoint using the `fairchem.core.units.mlip_unit.utils.load_tasks` function.
+  - `model`: Defines how to load a pretrained model. We recommend using the `fairchem.core.units.mlip_unit.mlip_unit.load_inference_model` function to do so.
+
+
+### Using the `defaults` key to define config groups
+
+The `defaults` key is a Hydra feature that allows you to compose configuration files from modular config groups. Each entry under `defaults` refers to a config group (such as `model`, `data`, or other reusable components) that is merged into the final configuration at runtime. This makes it easy to swap out models, datasets, or other settings without duplicating configuration code.
+
+For example in the UMA evaluation configs we have set up the following config groups and defaults:
+```yaml
+defaults:
+  - _self_
+  - model: omc_conserving
+  - data: my_eval_data
+```
+This will include the configuration from `configs/uma/evaluate/model/omc_conserving.yaml` and `configs/uma/evaluate/data/my_eval_data.yaml` into the main config. The `_self_` entry ensures the current file's contents are included.
+
+You can create new config groups or override existing ones by changing the entries under `defaults`.
+
+```yaml
+defaults:
+  - cluster: Configuration settings for a particular compute cluster
+  - dataset: Configuration settings for the evaluation dataset
+  - checkpoint: Configuration settings of the pretrained model checkpoint 
+  - _self_
+```
+
+Using config groups allows you to easily override defaults from the CLI. For example,
+
+```bash
+fairchem --config evaluation_config.yaml cluster=cluster_config checkpoint=checkpoint_config
+```
+
+Here, `cluster_config` and `checkpoint_config` are cluster and checkpoint configuration files written to directories under `cluster` and `checkpoint` respectively. See the files in `configs/uma/evaluate` for a full example.
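A note on the `defaults` examples in `evaluation.md` above: one places `_self_` first and the other places it last, and in Hydra the position matters because later entries in the defaults list override earlier ones. The sketch below approximates that ordering with plain dictionaries. It is not Hydra's implementation: `deep_merge` and `compose` are hypothetical helpers, the `/tmp/eval` path and the config values are made up, and group files are assumed to merge at the root of the config (as if they targeted the global package), which real config groups only do when declared that way.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Return a new dict in which values from override win over base, recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def compose(defaults, primary, groups):
    """Merge configs in defaults-list order; later entries override earlier ones."""
    config = {}
    for entry in defaults:
        # "_self_" stands for the contents of the primary config file itself.
        source = primary if entry == "_self_" else groups[entry]
        config = deep_merge(config, source)
    return config


# Hypothetical primary config and one hypothetical "cluster" group file.
primary = {"job": {"device_type": "CPU", "run_dir": "/tmp/eval"}}
groups = {"cluster": {"job": {"device_type": "CUDA"}}}

# _self_ first: the group is merged last, so its device_type wins.
self_first = compose(["_self_", "cluster"], primary, groups)

# _self_ last: the primary file is merged last, so its device_type wins.
self_last = compose(["cluster", "_self_"], primary, groups)
```

In this sketch `self_first` ends up with the group's `device_type`, while `self_last` keeps the value from the primary file. Keys that only one source defines, such as `run_dir`, survive in both cases.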
diff --git a/docs/core/common_tasks/training.md b/docs/core/common_tasks/training.md
@@ -1,3 +1,3 @@
-# Training
+# Training models from scratch
 
 This repo is used to train large state-of-the-art graph neural networks from scratch on datasets like OC20, OMol25, or OMat24, among others. We now provide a simple CLI to handle this using your own custom datasets, but we suggest fine-tuning one of the existing checkpoints first before trying a from-scratch training. 
diff --git a/docs/core/common_tasks/workflows.md b/docs/core/common_tasks/workflows.md
@@ -11,8 +11,8 @@ kernelspec:
   name: python3
 ---
 
-Workflows
-------------------
+Calculation workflows with FAIRChem models
+------------------------------------------
 
 This repo is integrated with workflow tools like [QuAcc](https://github.com/Quantum-Accelerators/quacc) to make complex molecular simulation workflows easy. You can use any MLP recipe (relaxations, single-points, elastic calculations, etc) and simply specify the `fairchem` model type. Below is an example that uses the default elastic_tensor_flow flow.
 
