|
1 | | -# Open Molecule 2025 Electronic Structures Dataset |
| 1 | +# OMol25 Electronic Structures |
2 | 2 |
|
3 | 3 | The Open Molecules 2025 (OMol25) dataset represents the largest dataset of its kind, with more than 100 million density functional theory (DFT) calculations at the ωB97M-V/def2-TZVPD level of theory, spanning several chemical domains including small molecules, biomolecules, metal complexes, and electrolytes. |
4 | 4 |
|
5 | 5 | At release, the OMol25 dataset provided structure energies, per-atom forces, and Lowdin/Mulliken charges and spins, where available. These properties were sufficient to train state-of-the-art machine learning interatomic potentials (MLIPs) and are already demonstrating incredible performance across a wide range of applications. However, to maximize the community benefit of these calculations, we have partnered with the [Department of Energy’s Argonne National Laboratory](https://www.anl.gov/) to provide access to the raw DFT outputs and additional files for the OMol25 dataset. |
6 | 6 |
|
7 | | -By releasing the ORCA output files, users will be able to parse NBO orbital/bonding information, reduced orbital populations, Fock matrices, and more. By releasing the ORCA GBW files, users will be able to run electronic structure post-processing in order to obtain higher quality partial charges and partial spins and a variety of more advanced electronic features that could be extremely valuable for physics-informed ML models. Finally, the release will provide critical high quality data for nascent ML models that train directly on electron densities. |
| 7 | +By releasing the [ORCA](https://www.faccts.de/docs/orca/6.0/manual/) output files, users will be able to parse NBO orbital/bonding information, reduced orbital populations, Fock matrices, and more. By releasing the ORCA GBW files, users will be able to run electronic structure post-processing in order to obtain higher quality partial charges and partial spins and a variety of more advanced electronic features that could be extremely valuable for physics-informed ML models. Finally, the release will provide critical high quality data for nascent ML models that train directly on electron densities. |
8 | 8 |
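As a small illustration of the kind of parsing the raw outputs make possible, the sketch below pulls the last `FINAL SINGLE POINT ENERGY` value (in Hartree) out of the text of an `orca.out` file. The function name `parse_final_energy` and the sample excerpt are hypothetical and not part of any OMol25 tooling; NBO, population, or Fock-matrix blocks would need their own dedicated parsers.

```python
import re
from typing import Optional

# ORCA reports the converged energy on a line of the form
# "FINAL SINGLE POINT ENERGY      -76.4321..." (value in Hartree).
_ENERGY_RE = re.compile(r"FINAL SINGLE POINT ENERGY\s+(-?\d+\.\d+)")


def parse_final_energy(text: str) -> Optional[float]:
    """Return the last reported single-point energy in Hartree, or None if absent."""
    matches = _ENERGY_RE.findall(text)
    return float(matches[-1]) if matches else None


# Hypothetical excerpt standing in for the contents of an orca.out file.
sample = """
-------------------------   --------------------
FINAL SINGLE POINT ENERGY       -76.432109876543
-------------------------   --------------------
"""
print(parse_final_energy(sample))
```

For a real calculation, the same function could be applied to the text read from the extracted `orca.out`, for example `parse_final_energy(Path("orca.out").read_text())`.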
|
9 | 9 | ## Data Description |
10 | 10 |
|
11 | | -The OMol25 dataset is broken into several training splits - All and 4M. The 4M split corresponds to a randomly sampled 4M subset of the full OMol25 dataset. Given the size of the full dataset, O(petabytes), we are first releasing all electronic structure and ORCA output data for the 4M split. Based on community interest, we will work to provide the full dataset. |
| 11 | +The OMol25 dataset is broken into several training splits - All and 4M. The 4M split corresponds to a randomly sampled 4M subset of the full OMol25 dataset. Given the size of the full dataset, O(petabytes), we are first releasing all electronic structure and ORCA output data for the 4M split. Based on community interest, we will work to provide the full dataset. |
12 | 12 |
|
13 | 13 | For each calculation, the following data is available: |
14 | 14 |
|
15 | | -* **orca.tar.zst**: Bundle of the raw ORCA outputs - including (orca.out, orca.inp orca.engrad, orca_property.txt, orca.xyz). To open: |
| 15 | +* **orca.tar.zst**: Bundle of the raw [ORCA](https://www.faccts.de/docs/orca/6.0/manual/) outputs - including (orca.out, orca.inp orca.engrad, orca_property.txt, orca.xyz). To open: |
16 | 16 |
|
17 | 17 | ``` |
18 | 18 | >> tar --zstd -xvf orca.tar.zst |
@@ -61,7 +61,7 @@ argonne_paths = [] |
61 | 61 | for idx in indices: |
62 | 62 | # ASE Atoms object that can be visualized/examined |
63 | 63 | atoms = dataset.get_atoms(idx) |
64 | | - # Check if this is a system you care about. |
| 64 | + # Check if this is a system you care about. |
65 | 65 | is_relevant = is_atoms_object_relevant(atoms) |
66 | 66 | if is_relevant: |
67 | 67 | # Extract the relative path that matches the Argonne cluster |