Commit 2da6fdd

OMC25 dataset release docs update (#1396)
* omc25 docs update with arxiv placeholder
* update arxiv identifier
1 parent 2d875e1 commit 2da6fdd

4 files changed

Lines changed: 60 additions & 15 deletions

docs/core/common_tasks/ase_dataset_creation.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ tasks, data, and metrics, please read the documentations and respective papers:
1313
- [OC20NEB](catalysts/datasets/oc20neb)
1414
- [OMat24](inorganic_materials/datasets/omat24)
1515
- [OMol25](https://ai.meta.com/blog/meta-fair-science-new-open-source-releases/)
16-
16+
- [OMC25](molecules/datasets/omc25)
1717

1818
There are multiple ways to train and evaluate FAIRChem models on data other than OC20 and OC22. Writing an LMDB is the most performant option. However, ASE-based dataset formats are also included as a convenience for people with existing data who simply want to try fairchem tools without needing to learn about LMDBs.
1919

docs/core/uma.md

Lines changed: 4 additions & 4 deletions
@@ -34,10 +34,10 @@ UMA is trained on 5 different DFT datasets with different levels of theory. An U
3434

3535
| Task | Dataset | DFT Level of Theory | Relevant applications | Usage Notes |
3636
| ------- | ------- | ----- | ------ | ----- |
37-
| omol | [Omol25](https://arxiv.org/abs/2505.08762) | wB97M-V/def2-TZVPD as implemented in ORCA6, including non-local dispersion. All solvation should be explicit. | Biology, organic chemistry, protein folding, small-molecule pharmaceuticals, organic liquid properties, homogeneous catalysis | total charge and spin multiplicity. If you don't know what these are, you should be very careful if modeling charged or open-shell systems. This can be used to study radical chemistry or understand the impact of magnetic states on the structure of a molecule. All training data is aperiodic, so any periodic systems should be treated with some caution. Probably won't work well for inorganic materials. |
38-
| omc | Omc25 | PBE+D3 as implemented in VASP. | Pharmaceutical packaging, bio-inspired materials, organic electronics, organic LEDs | UMA has not seen varying charge or spin multiplicity for the OMC task, and expects total_charge=0 and spin multiplicity=0 as model inputs. |
39-
| omat | [Omat24](https://arxiv.org/abs/2410.12771) | PBE/PBE+U as implemented in VASP using Materials Project suggested settings, except with VASP 54 pseudopotentials. No dispersion. | Inorganic materials discovery, solar photovoltaics, advanced alloys, superconductors, electronic materials, optical materials | UMA has not seen varying charge or spin multiplicity for the OMat task, and expects total_charge=0 and spin multiplicity=0 as model inputs. Spin polarization effects are included, but you can't select the magnetic state. Further, OMat24 did not fully sample possible spin states in the training data. |
37+
| omol | [OMol25](https://arxiv.org/abs/2505.08762) | wB97M-V/def2-TZVPD as implemented in ORCA6, including non-local dispersion. All solvation should be explicit. | Biology, organic chemistry, protein folding, small-molecule pharmaceuticals, organic liquid properties, homogeneous catalysis | total charge and spin multiplicity. If you don't know what these are, you should be very careful if modeling charged or open-shell systems. This can be used to study radical chemistry or understand the impact of magnetic states on the structure of a molecule. All training data is aperiodic, so any periodic systems should be treated with some caution. Probably won't work well for inorganic materials. |
38+
| omc | [OMC25](https://arxiv.org/abs/2508.02651) | PBE+D3 as implemented in VASP. | Pharmaceutical packaging, bio-inspired materials, organic electronics, organic LEDs | UMA has not seen varying charge or spin multiplicity for the OMC task, and expects total_charge=0 and spin multiplicity=0 as model inputs. |
39+
| omat | [OMat24](https://arxiv.org/abs/2410.12771) | PBE/PBE+U as implemented in VASP using Materials Project suggested settings, except with VASP 54 pseudopotentials. No dispersion. | Inorganic materials discovery, solar photovoltaics, advanced alloys, superconductors, electronic materials, optical materials | UMA has not seen varying charge or spin multiplicity for the OMat task, and expects total_charge=0 and spin multiplicity=0 as model inputs. Spin polarization effects are included, but you can't select the magnetic state. Further, OMat24 did not fully sample possible spin states in the training data. |
4040
| oc20 | [OC20*](https://arxiv.org/abs/2010.09990) | RPBE as implemented in VASP, with VASP5.4 pseudopotentials. No dispersion. | Renewable energy, catalysis, fuel cells, energy conversion, sustainable fertilizer production, chemical refining, plastics synthesis/upcycling | UMA has not seen varying charge or spin multiplicity for the OC20 task, and expects total_charge=0 and spin multiplicity=0 as model inputs. No oxides or explicit solvents are included in OC20. The model works surprisingly well for transition state searches given the nature of the training data, but you should be careful. RPBE works well for small molecules, but dispersion will be important for larger molecules on surfaces. |
41-
| odac | [ODac23](https://arxiv.org/abs/2311.00341) | PBE+D3 as implemented in VASP, with VASP5.4 pseudopotentials. | Direct air capture, carbon capture and storage, CO2 conversion, catalysis | UMA has not seen varying charge or spin multiplicity for the ODAC task, and expects total_charge=0 and spin multiplicity=0 as model inputs. The ODAC23 dataset only contains CO2/H2O water absorption, so anything more than might be inaccurate (e.g. hydrocarbons in MOFs). Further, there is a limited number of bare-MOF structures in the training data, so you should be careful if you are using a new MOF structure. |
41+
| odac | [ODAC23](https://arxiv.org/abs/2311.00341) | PBE+D3 as implemented in VASP, with VASP5.4 pseudopotentials. | Direct air capture, carbon capture and storage, CO2 conversion, catalysis | UMA has not seen varying charge or spin multiplicity for the ODAC task, and expects total_charge=0 and spin multiplicity=0 as model inputs. The ODAC23 dataset only contains CO2/H2O water absorption, so anything more than that might be inaccurate (e.g. hydrocarbons in MOFs). Further, there is a limited number of bare-MOF structures in the training data, so you should be careful if you are using a new MOF structure. |
4242

4343
*Note: OC20 was updated from the original OC20 and recomputed to produce total energies instead of adsorption energies.
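
As a rough illustration of the task table in this hunk, the sketch below encodes each task's dataset and level of theory in a plain Python dictionary and records which task takes user-supplied charge and spin. The names `UMA_TASKS`, `accepts_charge_and_spin`, and `describe_task` are hypothetical helpers for this example only and are not part of the fairchem API; the entries are copied from the table rows.

```python
# Hypothetical lookup table summarizing the UMA task table above.
# This is an illustration only; it is not a fairchem data structure.
UMA_TASKS = {
    "omol": {"dataset": "OMol25", "theory": "wB97M-V/def2-TZVPD (ORCA6)", "charge_spin_input": True},
    "omc": {"dataset": "OMC25", "theory": "PBE+D3 (VASP)", "charge_spin_input": False},
    "omat": {"dataset": "OMat24", "theory": "PBE/PBE+U (VASP)", "charge_spin_input": False},
    "oc20": {"dataset": "OC20*", "theory": "RPBE (VASP)", "charge_spin_input": False},
    "odac": {"dataset": "ODAC23", "theory": "PBE+D3 (VASP)", "charge_spin_input": False},
}


def accepts_charge_and_spin(task: str) -> bool:
    """Return True if the table says the task takes total charge and spin multiplicity."""
    if task not in UMA_TASKS:
        raise ValueError(f"Unknown task {task!r}; expected one of {sorted(UMA_TASKS)}")
    return UMA_TASKS[task]["charge_spin_input"]


def describe_task(task: str) -> str:
    """Return a one-line summary of the dataset and level of theory for a task."""
    if task not in UMA_TASKS:
        raise ValueError(f"Unknown task {task!r}; expected one of {sorted(UMA_TASKS)}")
    entry = UMA_TASKS[task]
    return f"{task}: {entry['dataset']} at {entry['theory']}"


if __name__ == "__main__":
    for name in UMA_TASKS:
        print(describe_task(name), "| charge/spin input:", accepts_charge_and_spin(name))
```

Per the table, only the `omol` task was trained with varying charge and spin; for the other tasks the helper simply reports that these inputs are fixed.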

docs/molecules/datasets/omc25.md

Lines changed: 27 additions & 1 deletion
@@ -1,3 +1,29 @@
11
# OMC25
22

3-
The Open Molecular Crystals 2025 (OMC25) dataset was announced along with UMA, and comprises ~25 million calculations of organic molecular crystals from random packing of OE62 structures into various 3D unit cells. It is calculated at the PBE+D3 level of theory via VASP. More details and download information coming!
3+
The Open Molecular Crystals 2025 (OMC25) dataset comprises >25 million structures of organic molecular crystals from relaxation trajectories of random packings of OE62 molecules into various 3D unit cells using the Genarris 3.0 package. The dataset contains structures labeled with total energy (eV), forces (eV/Å), and stress (eV/Å^3) computed with VASP.
4+
5+
The training and validation splits of the OMC25 dataset are available for download from HuggingFace at https://huggingface.co/facebook/OMC25 under the CC BY 4.0 license, after applying for repository access on HuggingFace.
6+
7+
## Dataset format
8+
9+
The dataset is provided in ASE DB-compatible LMDB files (`*.aselmdb`).
10+
11+
## Level of theory
12+
13+
OMC25 was calculated at the PBE+D3 level via VASP. To reproduce the calculations, please use `fairchem.data.omc.scripts.create_vasp_inputs.py` to write compatible VASP inputs.
14+
15+
## Citing
16+
17+
We encourage users to cite this paper when using the OMC25 dataset or pretrained models for molecular crystals in their research.
18+
19+
```bibtex
20+
@misc{gharakhanyan2025openmolecularcrystals2025omc25dataset,
21+
title={Open Molecular Crystals 2025 (OMC25) Dataset and Models},
22+
author={Vahe Gharakhanyan and Luis Barroso-Luque and Yi Yang and Muhammed Shuaibi and Kyle Michel and Daniel S. Levine and Misko Dzamba and Xiang Fu and Meng Gao and Xingyu Liu and Haoran Ni and Keian Noori and Brandon M. Wood and Matt Uyttendaele and Arman Boromand and C. Lawrence Zitnick and Noa Marom and Zachary W. Ulissi and Anuroop Sriram},
23+
year={2025},
24+
eprint={2508.02651},
25+
archivePrefix={arXiv},
26+
primaryClass={physics.chem-ph},
27+
url={https://arxiv.org/abs/2508.02651},
28+
}
29+
```
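
The OMC25 stress labels are given in eV/Å^3, while stresses and pressures are often reported in GPa. The following is a minimal sketch of that unit conversion, assuming only the exact SI value of the electronvolt (1.602176634e-19 J) and 1 Å^3 = 1e-30 m^3. The function name `stress_ev_per_a3_to_gpa` is hypothetical and not part of fairchem or ASE, and no sign convention for the stress is assumed.

```python
# 1 eV/Å^3 = 1.602176634e-19 J / 1e-30 m^3 = 1.602176634e11 Pa = 160.2176634 GPa
EV_PER_A3_IN_GPA = 160.2176634


def stress_ev_per_a3_to_gpa(stress):
    """Convert a stress value, or a flat sequence of components, from eV/Å^3 to GPa."""
    if isinstance(stress, (int, float)):
        return stress * EV_PER_A3_IN_GPA
    return [component * EV_PER_A3_IN_GPA for component in stress]


if __name__ == "__main__":
    # Hypothetical stress components, for illustration only.
    example = [0.001, 0.001, 0.001, 0.0, 0.0, 0.0]
    print(stress_ev_per_a3_to_gpa(example))
```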

docs/molecules/models.md

Lines changed: 28 additions & 9 deletions
@@ -2,32 +2,51 @@
22

33
**2025 recommendation:** We suggest using the [UMA model](../core/uma), trained on all of the FAIR chemistry datasets, before using one of the checkpoints below. The UMA model has a number of nice features over the previous checkpoints:
44
1. It is state-of-the-art in out-of-domain prediction accuracy
5-
2. The UMA small model is an energy conserving and smooth checkpoint, so should work much better for vibrational calculations, molecular dynamics, etc.
5+
2. The UMA small model is an energy conserving and smooth checkpoint, so should work much better for vibrational calculations, molecular dynamics, etc.
66
3. The UMA model is most likely to be updated in the future.
77

88
## Baseline models in the OMol25 paper
99
As part of the OMol25 release, we released two sets of models:
1010
1. [preferred] UMA models trained on a range of FAIR chemistry datasets, available at [HuggingFace](https://huggingface.co/facebook/UMA)
1111
2. eSEN models trained only on OMol25, available at [HuggingFace](https://huggingface.co/facebook/OMol25/tree/main)
1212

13-
The UMA models will continue to be updated regularly and we expect those to remain the default and performant option for the forseeable future. The OMol25-only eSEN models are provided mostly as a base-line for models trained only on OMol25.
14-
15-
## License
16-
17-
Both models require users to agree to the FAIR Chemistry License as part of the HuggingFace model gating process.
13+
The UMA models will continue to be updated regularly and we expect those to remain the default and performant option for the foreseeable future. The OMol25-only eSEN models are provided mostly as a baseline for models trained only on OMol25.
1814

1915
## Citing
2016

21-
If you use the OMol25-trained eSEN models, please cite the following paper.
17+
If you use the OMol25-trained eSEN models, please cite the following paper.
2218

2319
```bib
2420
@misc{levine2025openmolecules2025omol25,
25-
title={The Open Molecules 2025 (OMol25) Dataset, Evaluations, and Models},
21+
title={The Open Molecules 2025 (OMol25) Dataset, Evaluations, and Models},
2622
author={Daniel S. Levine and Muhammed Shuaibi and Evan Walter Clark Spotte-Smith and Michael G. Taylor and Muhammad R. Hasyim and Kyle Michel and Ilyes Batatia and Gábor Csányi and Misko Dzamba and Peter Eastman and Nathan C. Frey and Xiang Fu and Vahe Gharakhanyan and Aditi S. Krishnapriyan and Joshua A. Rackers and Sanjeev Raja and Ammar Rizvi and Andrew S. Rosen and Zachary Ulissi and Santiago Vargas and C. Lawrence Zitnick and Samuel M. Blau and Brandon M. Wood},
2723
year={2025},
2824
eprint={2505.08762},
2925
archivePrefix={arXiv},
3026
primaryClass={physics.chem-ph},
31-
url={https://arxiv.org/abs/2505.08762},
27+
url={https://arxiv.org/abs/2505.08762},
28+
}
29+
```
30+
31+
## Baseline models in the OMC25 paper
32+
As part of the OMC25 release, we released an eSEN model trained only on OMC25, available at [HuggingFace](https://huggingface.co/facebook/OMC25). [preferred] UMA models trained on a range of FAIR chemistry datasets are available at [HuggingFace](https://huggingface.co/facebook/UMA).
33+
34+
## Citing
35+
36+
We encourage users to cite this paper when using the OMC25 dataset or pretrained models for molecular crystals in their research.
37+
38+
```bibtex
39+
@misc{gharakhanyan2025openmolecularcrystals2025omc25dataset,
40+
title={Open Molecular Crystals 2025 (OMC25) Dataset and Models},
41+
author={Vahe Gharakhanyan and Luis Barroso-Luque and Yi Yang and Muhammed Shuaibi and Kyle Michel and Daniel S. Levine and Misko Dzamba and Xiang Fu and Meng Gao and Xingyu Liu and Haoran Ni and Keian Noori and Brandon M. Wood and Matt Uyttendaele and Arman Boromand and C. Lawrence Zitnick and Noa Marom and Zachary W. Ulissi and Anuroop Sriram},
42+
year={2025},
43+
eprint={2508.02651},
44+
archivePrefix={arXiv},
45+
primaryClass={physics.chem-ph},
46+
url={https://arxiv.org/abs/2508.02651},
3247
}
3348
```
49+
50+
## License
51+
52+
All models require users to agree to the FAIR Chemistry License as part of the HuggingFace model gating process.
