
Commit f29c137

add training docs (#1506)
1 parent bf07889 commit f29c137

3 files changed

Lines changed: 131 additions & 31 deletions


docs/core/common_tasks/training.md

Lines changed: 130 additions & 1 deletion
@@ -1,3 +1,132 @@
# Training models from scratch

This repo is used to train large state-of-the-art graph neural networks from scratch on datasets like OC20, OMol25, or OMat24, among others. We now provide a simple CLI to handle this using your own custom datasets, but we suggest fine-tuning one of the existing checkpoints before attempting a from-scratch training.

## Fairchem training framework overview

The Fairchem training framework is currently a simple SPMD (Single Program Multiple Data) training framework. It is made up of several components:

1. A user cli (`fairchem`) and launcher - can run jobs locally using [torch distributed elastic](https://docs.pytorch.org/docs/stable/distributed.elastic.html) or on [Slurm](https://slurm.schedmd.com/documentation.html). More environments may be supported in the future.

2. Configuration - we strictly use [Hydra yamls](https://hydra.cc/docs/intro/) for configuration.

3. A Runner interface - the core program code that is replicated to run on all ranks. An optional Reducer is also available for evaluation jobs. Runners are distinct user functions that run on a single rank (ie: GPU). They describe separate high-level tasks such as Train, Eval, Predict, Relaxations, MD, etc. Anyone can write a new runner if its functionality is sufficiently different from the ones that already exist (see the sketch after this list).

4. Trainer - we use [TorchTNT](https://docs.pytorch.org/tnt/stable/) as a light-weight training loop. This allows us to cleanly separate the dataloading from the training loop. TNT is PyTorch's replacement for PyTorch Lightning - which has become severely bloated and difficult to use over the years - so we opted for the simpler option. Units are concepts in TorchTNT that provide a basic interface for training, evaluation and prediction. These replace trainers in fairchemv1. You should write a new Unit when the model paradigm is significantly different, ie: training a multitask MLIP is one Unit, while training a diffusion model should be another Unit.
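
To make the Runner idea concrete, here is a minimal sketch; the class and method names are illustrative assumptions and do not reproduce the actual fairchem `Runner` interface:

```
# Illustrative sketch only - not the real fairchem Runner interface
class EvalRunner:
    def __init__(self, checkpoint_path: str):
        self.checkpoint_path = checkpoint_path

    def run(self) -> None:
        # executed once on every rank; the launcher has already initialized
        # torch.distributed, so rank-local work (load a data shard, run eval,
        # hand results to an optional Reducer) happens here
        ...
```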
## Fairchemv2

Fairchem uses a single [cli](https://github.com/facebookresearch/fairchem/blob/main/src/fairchem/core/_cli.py) for running jobs. It accepts a single argument: the location of the Hydra yaml. This is intentional, to make sure all configuration is fully captured and to avoid bloating the command line interface. Because of the flexibility of Hydra yamls, users can still provide additional parameters and overrides using the [hydra override syntax](https://hydra.cc/docs/advanced/override_grammar/basic/).

The cli can launch jobs locally using [torch distributed elastic](https://docs.pytorch.org/docs/stable/distributed.elastic.html) OR on [Slurm](https://slurm.schedmd.com/documentation.html).
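
For example (the config path here is a placeholder, and the override uses the standard Hydra syntax):

```
# run a job fully described by a single Hydra yaml
fairchem -c path/to/my_config.yaml

# the same job with an inline override
fairchem -c path/to/my_config.yaml job.scheduler.ranks_per_node=2
```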

### Fairchemv2 config structure

A fairchem config is composed of only 2 valid top-level keys: "job" (Job Config) and "runner" (Runner Config). Additionally, you can add key/values that are used by the OmegaConf interpolation syntax to replace fields. Other than these, no other top-level keys are permitted.

JobConfig represents configuration parameters that describe the overall job (mostly infra parameters) such as the number of nodes, log locations, loggers, etc. This is a structured config and must strictly adhere to the JobConfig class.

Runner Config describes the user code. This part of the config is recursively instantiated at the start of a job using the Hydra instantiation framework.
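
Putting that together, a config skeleton looks like the following; the runner target and its fields are placeholders, not real fairchem classes:

```
my_lr: 8e-4 # free top-level key, used only for interpolation below

job:
  run_name: example_run

runner:
  _target_: my_project.runners.MyRunner # hypothetical class
  learning_rate: ${my_lr}
```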

### Example configurations for a local run:

```
job:
  device_type: CUDA
  scheduler:
    mode: LOCAL
    ranks_per_node: 4
  run_name: local_training_run
```

Example configurations for a slurm run:

```
job:
  device_type: CUDA
  scheduler:
    mode: SLURM
    ranks_per_node: 8
    num_nodes: 4
    slurm:
      account: ${cluster.account}
      qos: ${cluster.qos}
      mem_gb: ${cluster.mem_gb}
      cpus_per_task: ${cluster.cpus_per_task}
  run_dir: /path/to/output
  run_name: slurm_run_example
```

### Config Object Instantiation

To keep our configs explicit (configs should be thought of as an extension of code), we prefer to use the hydra instantiation framework throughout; the config is always fully described by a corresponding python class and should never be a standalone dictionary.

```
# this is bad
# because we have no idea where to find the code
# that uses runner or where variables x and y are actually used
runner:
  x: 5
  y: 6

# this is good
# now we know which class runner corresponds to and that x, y are
# just initializer variables of runner. If we need to check the definition
# or understand the code, we can simply go to runner.py
runner:
  _target_: fairchem.core.components.runner.Runner
  x: 5
  y: 6
```
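
At job start, Hydra's `instantiate` turns such a node into a real object. A minimal, self-contained sketch of the mechanism (`collections.Counter` stands in for a runner class purely so the example runs anywhere):

```
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    """
    runner:
      _target_: collections.Counter  # any importable class works
      x: 5
      y: 6
    """
)
runner = instantiate(cfg.runner)  # equivalent to collections.Counter(x=5, y=6)
```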

### Runtime instantiation with partial functions

While we want to use static instantiation as much as possible, there are many cases where certain objects require runtime inputs to be created. For example, if we want to create a pytorch optimizer, we can give it all of its arguments except the model parameters (because they are only known at runtime).

```
optimizer:
  _target_: torch.optim.AdamW
  params: ?? # this is only known at runtime
  lr: 8e-4
  weight_decay: 1e-3
```

In this case we can use a partial function: instead of creating an optimizer object, we create a python partial function that can then be used to instantiate the optimizer in code later.

```
optimizer_fn:
  _target_: torch.optim.AdamW
  _partial_: true
  lr: 8e-4
  weight_decay: 1e-3
```

```
# later in the runner
optimizer = optimizer_fn(model.parameters())
```
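
End to end, the mechanism looks like this; a toy `torch.nn.Linear` stands in for a real model:

```
import torch
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    """
    optimizer_fn:
      _target_: torch.optim.AdamW
      _partial_: true
      lr: 8e-4
      weight_decay: 1e-3
    """
)
optimizer_fn = instantiate(cfg.optimizer_fn)  # a functools.partial, not an optimizer yet
model = torch.nn.Linear(8, 1)                 # toy stand-in for a real model
optimizer = optimizer_fn(model.parameters())  # params supplied at runtime
```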

## Training UMA

The UMA model is completely defined [here](https://github.com/facebookresearch/fairchem/tree/main/src/fairchem/core/models/uma). It was also called "escn_md" during internal development, since it was based on the eSEN architecture.

Training, eval and inference are all defined in the [mlip unit](https://github.com/facebookresearch/fairchem/blob/main/src/fairchem/core/units/mlip_unit/mlip_unit.py).

To train a model, we need to initialize a [TrainRunner](https://github.com/facebookresearch/fairchem/blob/main/src/fairchem/core/components/train/train_runner.py) with a [MLIPTrainEvalUnit](https://github.com/facebookresearch/fairchem/blob/main/src/fairchem/core/units/mlip_unit/mlip_unit.py).
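
In config form that pairing looks roughly like the sketch below; the `_target_` paths follow the files linked above, but the constructor argument names (e.g. `train_eval_unit`) are assumptions, so consult the released configs for the real structure:

```
runner:
  _target_: fairchem.core.components.train.train_runner.TrainRunner
  train_eval_unit: # argument name assumed for illustration
    _target_: fairchem.core.units.mlip_unit.mlip_unit.MLIPTrainEvalUnit
    # model, optimizer and task settings go here
```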

Due to the complexity of training a multi-architecture, multi-dataset, multi-task model like UMA, we leverage the [config groups](https://hydra.cc/docs/tutorials/basic/your_first_app/config_groups/) syntax in Hydra to organize UMA training into the [following sections](https://github.com/facebookresearch/fairchem/tree/main/configs/uma/training_release):

* backbone - selects the specific backbone architecture, ie: uma-sm, uma-md, uma-large, etc.
* cluster - quickly switch settings between different slurm clusters or a local env
* dataset - select the dataset to train on
* element_refs - select the element references
* tasks - select the task set, ie: for direct or conservative training
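
Each of these sections is a Hydra config group, so a top-level training yaml picks one option from each via a defaults list; a rough sketch (the option names echo the commands below and may not match the released configs exactly):

```
defaults:
  - backbone: uma_sm
  - cluster: h100_local
  - dataset: uma_debug
  - element_refs: uma
  - tasks: direct
```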

We can switch between different combinations of configs easily this way. For example:

Getting training started locally, using local settings and the debug dataset:

```
fairchem -c configs/uma/training_release/uma_sm_direct_pretrain.yaml cluster=h100_local dataset=uma_debug
```

Training UMA conservative with 16 nodes on slurm:

```
fairchem -c configs/uma/training_release/uma_sm_conserve_finetune.yaml cluster=h100 job.scheduler.num_nodes=16 run_name="uma_conserve_train"
```

src/fairchem/core/models/esen/nn/rank2.py

Whitespace-only changes.

src/fairchem/lammps/README.md

Lines changed: 1 addition & 30 deletions
@@ -4,33 +4,4 @@ This directory provides an interface to use FAIR Chemistry models in conjunction

The source under this sub-repository (src/fairchem/lammps) is licensed under the GPL-2.0 License, the same as the LAMMPS software package. Please refer to the LICENSE file in this same directory. ***It is NOT the same as the license for the rest of this repository, which is licensed under the MIT license.***

This lammps integration uses the lammps "[fix external](https://docs.lammps.org/fix_external.html)" command to run the external MLIP (UMA) and compute the forces and potential energy of the system. This hands control of the parallelism to UMA instead of integrating directly with LAMMPS neighborlists, domain decomposition, and forward + backward pass communication algorithms, as well as converting to/from per-atom forces/pair-wise forces.

## Usage notes that differ from regular lammps workflows:
* We currently only support `metal` [units](https://docs.lammps.org/units.html), ie: energy in `ev` and forces in `ev/A`
* Users can write lammps scripts in the usual way (see lammps_in_example.file)
* Users should *NOT* define other types of forces such as "pair_style" or "bond_style" in their scripts. These forces would get added together with the UMA forces and most likely produce false results
* UMA uses atomic numbers, so we try to guess the atomic number from the provided atomic masses; make sure you provide the right masses for your atom types - this way you don't need to redefine atomic element mappings with lammps

## Install and run
Users can install lammps however they like, but the simplest is to install via conda (https://docs.lammps.org/Install_conda.html). Next, install fairchem into the same conda env, and then you can run as follows.

Activate the conda env with lammps and install fairchem into it:
```
conda activate lammps-env
pip install fairchem/packages/fairchem-core[extras]
pip install fairchem/packages/lammps
```

Assuming you have a classic lammps .in script, do the following to run it:

1. Remove all other forces that you normally use from your lammps script (ie: pair_style etc.)
2. Make sure the units are in `metal`
3. Make sure there is only 1 run command at the bottom of the script

Finally, run with `luma` (shorthand for the python lammps_uma.py script):
```
luma lmp_in="lammps_in_example.file" task_name="omc"
```

Refer to the [docs](https://fair-chem.github.io/core/common_tasks/lammps.html) for more details.
