We provide the data in two formats: a processed version containing only the variables used in our paper for ML model evaluation, and the full raw VASP output.
The processed data contain the unrelaxed structures, energies, formation energies, HOMO, LUMO and derived variables.
The archive can be downloaded and viewed directly at the Constructor Research Platform.
Alternatively, the data are available in DVC:
- Clone the repository
- Ensure that DVC[S3] is installed, for example by running
    pip install dvc[s3]
- Download the datasets
    dvc pull -R processed-high-density processed-low-density datasets/processed/{high,low}_density_defects datasets/csv_cif/high_density_defects/{MoS2,WSe2,BP_spin,GaSe_spin,InSe_spin,hBN_spin}_500 datasets/csv_cif/low_density_defects/{MoS2,WSe2}
- _id: unique structure identifier
- descriptor_id: identifier of the defect type as specified in descriptors.csv
- defect_id: unused
- energy: total potential energy of the system as reported by VASP, eV
- energy_per_atom: total potential energy of the system divided by the number of atoms, eV
- fermi_level: Fermi level, eV
- homo: highest occupied molecular orbital (HOMO) energy, eV
- lumo: lowest unoccupied molecular orbital (LUMO) energy, eV
- normalized_homo: HOMO value normalised with respect to the host valence band maximum (VBM) (see section "DFT computations" in the paper), eV
- normalized_lumo: LUMO value normalised with respect to the host valence band maximum (VBM) (see section "DFT computations" in the paper), eV
- E_1: energy of the first Kohn–Sham orbital of the structure with defect (see section "DFT computations" in the paper), eV
- homo_lumo_gap: band gap, LUMO - HOMO, eV
- total_mag: total magnetisation
- *_{majority,minority}: the corresponding quantities computed for the majority and minority spin channels, for materials computed with spin
- band_gap: OBSOLETE
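As a quick sanity check on these columns, the sketch below reads a gzip-compressed CSV with the Python standard library and verifies that homo_lumo_gap equals lumo - homo. It assumes the columns are stored under exactly the names listed above and that the file is the defects.csv.gz mentioned in this README. The function names `load_defects`, `to_float` and `gap_mismatches` are illustrative and not part of the repository.

```python
import csv
import gzip
import math


def load_defects(path):
    """Read a gzip-compressed CSV into a list of dicts keyed by column name."""
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def to_float(value):
    """Convert a CSV cell to float; empty or missing cells become None."""
    if value is None or value.strip() == "":
        return None
    return float(value)


def gap_mismatches(rows, tolerance=1e-6):
    """Return the _id of every row where homo_lumo_gap differs from lumo - homo."""
    bad_ids = []
    for row in rows:
        homo = to_float(row.get("homo"))
        lumo = to_float(row.get("lumo"))
        gap = to_float(row.get("homo_lumo_gap"))
        if None in (homo, lumo, gap):
            continue  # skip rows where any of the three cells is empty
        if not math.isclose(lumo - homo, gap, abs_tol=tolerance):
            bad_ids.append(row["_id"])
    return bad_ids
```

A typical call would be `gap_mismatches(load_defects("defects.csv.gz"))`, with the path adjusted to wherever `dvc pull` placed the file. An empty list means every populated row is consistent within the tolerance.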
Same as defects.csv.gz, plus additional derived variables:
- formation_energy: defect formation energy, computed according to equation 1 from the paper
- formation_energy_per_site: defect formation energy divided by the number of defects, according to equation 2 from the paper
- *_{min,max}: the minimum and maximum of the quantities with respect to the different spin channels
The archive initial.tar.gz contains the unrelaxed structures in the CIF format. Names correspond to the unique identifiers _id in defects.csv.gz. Note that the structures were relaxed prior to computing the properties.
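The structures can be read straight from the archive without unpacking it to disk. The sketch below uses only the standard `tarfile` module and returns the raw CIF text keyed by structure identifier. It assumes that each member is named `<_id>.cif`, possibly inside a subfolder; that layout is an assumption and is not confirmed here. The function name `read_cif_texts` is illustrative.

```python
import tarfile
from pathlib import PurePosixPath


def read_cif_texts(archive_path, wanted_ids=None):
    """Map structure _id -> CIF text for the .cif members of a .tar.gz archive.

    Assumption: the member file name without the .cif suffix is the _id.
    """
    wanted = None if wanted_ids is None else set(wanted_ids)
    structures = {}
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile() or not member.name.endswith(".cif"):
                continue  # skip directories and non-CIF files
            structure_id = PurePosixPath(member.name).stem
            if wanted is not None and structure_id not in wanted:
                continue
            text = archive.extractfile(member).read().decode("utf-8")
            structures[structure_id] = text
    return structures
```

The returned text can then be passed to any CIF parser. Passing `wanted_ids` with the `_id` values of a subset of rows from defects.csv.gz avoids holding every structure in memory at once.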
- _id: unique identifier of the defect type, corresponds to the descriptor_id column in defects.csv
- description: short semantic abbreviation of the defect type
- base: chemical formula of the pristine material
- cell: supercell size
- defects: dictionary describing each point defect
Contains chemical potentials (in eV) of the elements, to be used in formation energy computation.
Contains the properties of pristine material.
- base: chemical formula of the pristine material
- cell_size: supercell size
- energy: total potential energy of the system, eV
- fermi: Fermi level, eV
- E_1: energy of the first Kohn–Sham orbital of the pristine structure (see section "DFT computations" in the paper), eV
- E_VBM: energy of the valence band maximum of the pristine structure
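The pristine energies and the elemental chemical potentials are the inputs needed to recompute formation energies. The sketch below uses the conventional expression for a neutral point defect, E_f = E_defect - E_pristine + (sum of chemical potentials of removed atoms) - (sum of chemical potentials of added atoms). We assume, without confirmation from this README, that this matches equation 1 of the paper, so check the sign convention against the paper before relying on it. The function names and arguments are hypothetical.

```python
def formation_energy(defect_energy, pristine_energy, removed, added, chemical_potentials):
    """Formation energy (eV) of a neutral defect configuration.

    Assumption: conventional formula, not verified against equation 1 of the paper.
    removed / added: element symbols of the atoms taken out of / put into the
    pristine supercell, one entry per atom.
    chemical_potentials: dict mapping element symbol to chemical potential in eV.
    """
    removed_term = sum(chemical_potentials[element] for element in removed)
    added_term = sum(chemical_potentials[element] for element in added)
    return defect_energy - pristine_energy + removed_term - added_term


def formation_energy_per_site(energy, n_defects):
    """Formation energy divided by the number of point defects in the structure."""
    if n_defects <= 0:
        raise ValueError("n_defects must be positive")
    return energy / n_defects
```

For a single vacancy, `removed` holds one symbol and `added` is empty. For a substitution, the host atom goes in `removed` and the substituting atom in `added`.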
Unit cells of the pristine materials used to produce the structures in the folder.
The raw VASP output, including the relaxation trajectories, is available in DVC:
- Clone the repository
- Ensure that DVC[S3] is installed, for example by running
    pip install dvc[s3]
- Download the VASP output:
    dvc pull -R datasets/raw_vasp/high_density_defects datasets/raw_vasp/dichalcogenides8x8_vasp_nus_202110
- Some of the data are packed into tar.gz archives, as their unpacked size is ~300 GB. You might want to use ratarmount to work with them.