Improve parsing capabilities of CifData class#1257
Conversation
2707090 to
4ee926b
Compare
The CifParser of pymatgen allows to set the parameter site_tolerance which will be used to determine whether two sites overlap and if created by symmetry they should be merged as one. By default this value is 1E-4 in v4.5.3, which is rather tight and often leads to a created structure with overlapping atoms. To prevent this one should be able to specify this parameter, however, unfortunately it is not supported by the other library 'ase' that AiiDA supports internally for structure generation from CifData nodes.
The CifParser of pymatgen will check the atomic occupations of the parsed cif and if they exceed unity, a warning will be emitted and the parsing of the structure fails. It is possible to give a certain tolerance to allow invalid occupations through the occupation_tolerance of the CifParser constructor. Any occupations between one and this tolerance will be rounded to one. Pymatgen does not distinguish between various parsing errors, any failure just raises a ValueError. However, we would like to distinguish between well defined errors such as the incorrect occupations case. As a solution, if the parsing fails, we attempt a second time, this time setting the occupation tolerance to an unrealistic high value. If the parsing now succeeds, the original failure was due to the occupations being exceeded and we raise a specific error InvalidOccupationsError. If the second parse attempt also failed, there was some unknown parsing error and we simply raise a ValueError.
Cif files, for example from COD, can not define any atomic sites or have atomic species that are not elemental, such as Deuterium, Tritium or even whole water molecules. Cif files without atomic sites or with these non elemental species cannot be converted to AiiDA StructureData objects. These new CifData properties make it easy to check whether the cif file has these issues, before we attempt to parse a structure from it using either ASE or pymatgen.
4ee926b to
098d500
Compare
|
|
||
| @optional_inline | ||
| def _get_aiida_structure_ase_inline(cif, parameters): | ||
| def _get_aiida_structure_ase_inline(cif, parameters=None): |
There was a problem hiding this comment.
defaults with non database objects such as None (here, parameters=None) should be avoided in optional_inline calculations...
|
|
||
| @optional_inline | ||
| def _get_aiida_structure_pymatgen_inline(cif=None, parameters=None): | ||
| def _get_aiida_structure_pymatgen_inline(cif, parameters=None): |
There was a problem hiding this comment.
same comment as above (on parameters=None)
| raise ValueError('pymatgen failed to provide a structure from the cif file') | ||
| else: | ||
| # If it now succeeds, non-unity occupancies were the culprit | ||
| raise InvalidOccupationsError('detected atomic sites with an occupation number exceeding the tolerance') |
There was a problem hiding this comment.
maybe a more explicit error message like
"detected atomic sites with a total occupation number higher than 1"
| """ | ||
| from aiida.common.constants import elements | ||
|
|
||
| known_species = [element['symbol'] for index, element in elements.items()] |
There was a problem hiding this comment.
known_species = [element['symbol'] for element in elements.values()]
I know, I'm picky today...
| known_species = [element['symbol'] for index, element in elements.items()] | ||
|
|
||
| for formula in self.get_formulae(): | ||
| species = parse_formula(formula).keys() |
There was a problem hiding this comment.
note that
parse_formula('D1 H Li3 Te2.5 H3')
(H is repeated)
returns
{'D': 1, 'H': 3, 'Li': 3, 'Te': 2.5}
which is ok here (we don't care about how many H atoms there are), but maybe not in general.
These converter functions are optional inlines, which means that if they are called with `store=True`, AiiDA will try to store the input arguments, but if the `parameters` were not specified, it will try to store the `None` default, which of course fails. The solution is to use `**kwargs`. With this a user is not required to specify an empty dictionary or `ParameterData` node, and the function can be used both as inline and as regular function.
Tests have shown that `pymatgen` is a lot more robust in parsing cif files and complains about files that have nonsensical data such as atomic occupations that exceed unity. We therefore swap the default converter engine from `ase` to `pymatgen`.
Fixes #1256
This PR addresses a couple of points and improvements to the parsing of
CifDatafilesinto
StructureDataobjects. It implements two new properties forCifData:has_atomic_siteshas_unknown_speciesThe former will check if there any atomic sites defined at all and the latter will check whether
any species in the formulae are unknown, which is to say, they are not elements listed in the
aiida.common.constants.elementsdictionary. In both cases, it will be impossible to createa
StructureDatanode out of thoseCifDataobjects and so one does not even have to tryto parse them with either
aseorpymatgen.Additionally, we improve the automatic structure converter
_get_aiida_structure_pymatgen_inlinethat uses
pymatgento parse the cif content to return apymatgenstructure object, which isthen used to create a
StructureDatanode. There are two improvements here:The
CifParserofpymatgenallows to set the parametersite_tolerancewhich will be usedto determine whether two sites overlap and if created by symmetry they should be merged as one.
By default this value is
1E-4inv4.5.3, which is rather tight and often leads to a created structurewith overlapping atoms. To prevent this one should be able to specify this parameter, however,
unfortunately it is not supported by the other library
asethat AiiDA supports internally for structuregeneration from
CifDatanodes.The
CifParserofpymatgenwill check the atomic occupations of the parsed cif and if theyexceed unity, a warning will be emitted and the parsing of the structure fails. It is possible to give
a certain tolerance to allow invalid occupations through the
occupation_toleranceof theCifParserconstructor. Any occupations between one and this tolerance will be rounded to one.
Pymatgendoesnot distinguish between various parsing errors, any failure just raises a
ValueError. However, wewould like to distinguish between well defined errors such as the incorrect occupations case.
As a solution, if the parsing fails, we attempt a second time, this time setting the occupation tolerance
to an unrealistic high value. If the parsing now succeeds, the original failure was due to the occupations
being exceeded and we raise a specific error
InvalidOccupationsError. If the second parse attemptalso failed, there was some unknown parsing error and we simply raise a
ValueError.