| 1 | +## Running model benchmarks |
| 2 | + |
| 3 | +Model benchmarks evaluate a model on downstream property predictions, each of which requires several model evaluations to calculate a single property or a set of related properties. Examples include structure relaxations, elastic tensors, phonons, and adsorption energies. |
| 4 | + |
| 5 | +To benchmark UMA models on standard datasets, you can find benchmark configuration files in `configs/uma/benchmark`. Example files include: |
| 6 | +- `adsorbml.yaml` |
| 7 | +- `hea-is2re.yaml` |
| 8 | +- `kappa103.yaml` |
| 9 | +- `matbench-discovery-discovery.yaml` |
| 10 | +- `mdr-phonon.yaml` |
| 11 | + |
| 12 | +Note that to run these UMA benchmarks you will need to obtain the target data. |
| 13 | + |
| 14 | +1. **Run the Benchmark Script** |
| 15 | + Use the same runner script, specifying the benchmark config: |
| 16 | + ```bash |
| 17 | + fairchem --config configs/uma/benchmark/benchmark.yaml |
| 18 | + ``` |
| 19 | + Replace `benchmark.yaml` with the desired benchmark config file. |
| 20 | + |
| 21 | +2. **Output** |
| 22 | + Benchmark results are saved to a *results* directory under the *run_dir* specified in the configuration file. Additionally, benchmark metrics are logged using the specified logger. We currently only support Weights and Biases. |
| 23 | + |
| 24 | +## Benchmark Configuration File Format |
| 25 | + |
| 26 | +Benchmark configuration files are written in Hydra YAML format and specify how a model benchmark should be run. UMA benchmark configuration files, which can be used as templates to benchmark other models if needed, are located in `configs/uma/benchmark/`. |
| 27 | + |
| 28 | +### Top-Level Keys |
| 29 | + |
| 30 | +The benchmark configuration files follow the same format as model training and evaluation configuration files, with the addition of a **reducer** key that specifies how final metrics are calculated from the results of a given benchmark calculation protocol. |
| 31 | + |
| 32 | +A benchmark configuration file should define the following top-level keys: |
| 33 | + |
| 34 | +- **job**: Contains all settings related to the evaluation job itself, including model, data, and logger configuration. For additional details, see the description on the Evaluation page. |
| 35 | +- **runner**: Contains settings for a `CalculateRunner`, which implements a downstream property calculation or simulation. |
| 36 | +- **reducer**: Contains the settings for a `BenchmarkReducer` class, which defines how to aggregate the results calculated by the `CalculateRunner` and compute metrics based on given target values. |
| 37 | + |
| 38 | +#### `CalculateRunner`s: |
| 39 | +The benchmark details, including the type of calculations and the model checkpoint, are specified under the **runner** key. The specific benchmark calculations are determined by the chosen `CalculateRunner` (for example, a `RelaxationRunner`). Several `CalculateRunner` implementations are found in the `fairchem.core.components.calculate` submodule. |
| 40 | + |
| 41 | +### Implementing new calculations in a `CalculateRunner` |
| 42 | +It is straightforward to write your own calculations in a `CalculateRunner`. Although the implementation is flexible and open-ended, we suggest that you look at the interface set up by the `CalculateRunner` base class. At a minimum, you will need to implement the following methods: |
| 43 | + |
| 44 | +```python |
| 45 | +    def calculate(self, job_num: int = 0, num_jobs: int = 1) -> R: |
| 46 | +        """Implement your calculations here by iterating over the self.input_data attribute""" |
| 47 | + |
| 48 | +    def write_results( |
| 49 | +        self, results: R, results_dir: str, job_num: int = 0, num_jobs: int = 1 |
| 50 | +    ) -> None: |
| 51 | +        """Write the results returned by your calculations in the method above""" |
| 52 | +``` |
| 53 | + |
| 54 | +You will also see `save_state` and `load_state` abstract methods that you can use to checkpoint calculations. In most cases, if the calculations are fast enough, you won't need checkpointing and can simply implement these as empty methods. |
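As a rough illustration of the two required methods, the sketch below uses a hypothetical `ToyRunner` class that deliberately does not derive from the real `CalculateRunner` base class, so it runs without fairchem installed. The strided split of `self.input_data` across jobs, the output file name, the placeholder "calculation", and the `*args, **kwargs` signatures of the empty `save_state` and `load_state` methods are all assumptions made for this example; check the base class for the actual interface.

```python
import json
import os
import tempfile


class ToyRunner:
    """Hypothetical stand-in mirroring the methods above (not the fairchem base class)."""

    def __init__(self, input_data):
        self.input_data = input_data

    def calculate(self, job_num: int = 0, num_jobs: int = 1) -> list[dict]:
        # Assumption: each job handles every num_jobs-th item, starting at job_num.
        chunk = self.input_data[job_num::num_jobs]
        # Placeholder "calculation": sum the values attached to each item.
        return [{"id": item["id"], "total": sum(item["values"])} for item in chunk]

    def write_results(
        self, results: list[dict], results_dir: str, job_num: int = 0, num_jobs: int = 1
    ) -> None:
        # Assumed file naming: one JSON file per job so that files never collide.
        os.makedirs(results_dir, exist_ok=True)
        path = os.path.join(results_dir, f"results_{job_num}-{num_jobs}.json")
        with open(path, "w") as f:
            json.dump(results, f)

    def save_state(self, *args, **kwargs):
        # Empty on purpose: the toy calculation is fast, so nothing is checkpointed.
        pass

    def load_state(self, *args, **kwargs):
        # Empty on purpose: there is no saved state to restore.
        pass


# Usage: split five toy inputs across two jobs and write one file per job.
data = [{"id": i, "values": [i, i + 1]} for i in range(5)]
runner = ToyRunner(data)
out_dir = tempfile.mkdtemp()
for job in range(2):
    runner.write_results(runner.calculate(job, 2), out_dir, job, 2)
```

Writing one file per job keeps parallel jobs independent, at the cost of needing a later step to join the files.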
| 55 | + |
| 56 | + |
| 57 | +#### `BenchmarkReducer`s: |
| 58 | +A `CalculateRunner` will run calculations over a given set of structures and write out results. To compute benchmark metrics, a `BenchmarkReducer` aggregates these results, computes metrics, and reports them. Implementations of `BenchmarkReducer` classes are found in the `fairchem.core.components.benchmark` submodule. |
| 59 | + |
| 60 | +### Implementing metrics in a `BenchmarkReducer` |
| 61 | + |
| 62 | +If you want to implement your own benchmark metric calculation you can write a `BenchmarkReducer` class. At a minimum, you will need to implement the following methods: |
| 63 | + |
| 64 | +```python |
| 65 | +    def join_results(self, results_dir: str, glob_pattern: str) -> R: |
| 66 | +        """Join your results from multiple files into a single result object.""" |
| 67 | + |
| 68 | +    def save_results(self, results: R, results_dir: str) -> None: |
| 69 | +        """Save joined results to a single file""" |
| 70 | + |
| 71 | +    def compute_metrics(self, results: R, run_name: str) -> M: |
| 72 | +        """Compute metrics using the joined results and target data in your BenchmarkReducer.""" |
| 73 | + |
| 74 | +    def save_metrics(self, metrics: M, results_dir: str) -> None: |
| 75 | +        """Save the computed metrics to a file.""" |
| 76 | + |
| 77 | +    def log_metrics(self, metrics: M, run_name: str): |
| 78 | +        """Log metrics to the configured logger.""" |
| 79 | +``` |
| 80 | + |
| 81 | +If it makes sense for your benchmark metrics and you are happy working with dictionaries and pandas `DataFrame`s, a lot of boilerplate code is already implemented in the `JsonDFReducer`. We recommend that you start there by deriving your class from it and focusing only on implementing the `compute_metrics` method. |
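To make the flow through the five reducer methods concrete, here is a minimal sketch built only on the Python standard library. It does not use `JsonDFReducer`, pandas, or any fairchem class: `ToyReducer`, its `targets` argument, the file names, the `energy` field, and the mean-absolute-error metric are all hypothetical choices for illustration, and `log_metrics` simply prints instead of calling a real logger.

```python
import glob
import json
import os
import tempfile


class ToyReducer:
    """Hypothetical stand-in mirroring the methods above (not the fairchem base class)."""

    def __init__(self, targets: dict[str, float]):
        # Assumed target format: reference value keyed by structure id.
        self.targets = targets

    def join_results(self, results_dir: str, glob_pattern: str) -> list[dict]:
        # Concatenate the per-job JSON lists into one list, in sorted file order.
        joined = []
        for path in sorted(glob.glob(os.path.join(results_dir, glob_pattern))):
            with open(path) as f:
                joined.extend(json.load(f))
        return joined

    def save_results(self, results: list[dict], results_dir: str) -> None:
        with open(os.path.join(results_dir, "joined_results.json"), "w") as f:
            json.dump(results, f)

    def compute_metrics(self, results: list[dict], run_name: str) -> dict:
        # Mean absolute error between predicted and target energies.
        errors = [abs(r["energy"] - self.targets[r["id"]]) for r in results]
        return {"run_name": run_name, "energy_mae": sum(errors) / len(errors)}

    def save_metrics(self, metrics: dict, results_dir: str) -> None:
        with open(os.path.join(results_dir, "metrics.json"), "w") as f:
            json.dump(metrics, f)

    def log_metrics(self, metrics: dict, run_name: str) -> None:
        # Stand-in for a real logger call.
        print(f"[{run_name}] {metrics}")


# Usage: fake two per-job result files, then run the reducer steps in order.
work_dir = tempfile.mkdtemp()
chunks = [[{"id": "a", "energy": 1.0}], [{"id": "b", "energy": 2.5}]]
for i, chunk in enumerate(chunks):
    with open(os.path.join(work_dir, f"results_{i}.json"), "w") as f:
        json.dump(chunk, f)

reducer = ToyReducer(targets={"a": 1.5, "b": 2.0})
results = reducer.join_results(work_dir, "results_*.json")
reducer.save_results(results, work_dir)
metrics = reducer.compute_metrics(results, run_name="demo")
reducer.save_metrics(metrics, work_dir)
reducer.log_metrics(metrics, run_name="demo")
```

In this sketch, only `compute_metrics` carries benchmark-specific logic; the other four methods are file and logging plumbing.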