Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Argument check_occu is not working as intended (in _get_structure and parse_structures) #3816

Open
KunhuanLiu opened this issue May 9, 2024 · 15 comments · May be fixed by #3819
Open

Argument check_occu is not working as intended (in _get_structure and parse_structures) #3816

KunhuanLiu opened this issue May 9, 2024 · 15 comments · May be fixed by #3819
Labels

Comments

@KunhuanLiu
Copy link

Python version

3.11.8

Pymatgen version

2024.2.8

Operating system version

No response

Current behavior

Hi all,
I was trying to deal with a crystal structure (in .cif format) that had warnings on having occupancy issues:

"UserWarning: Some occupancies ([1.0, ... (x876)]) sum to > 1! If they are within the occupancy_tolerance, they will be rescaled. The current occupancy_tolerance is set to: 1.0
  warnings.warn(msg)

I am unsure if it is the structure's error given that many other programs (ase, VESTA, etc) were not reporting error, so I tried to turn off the occupancy check and check myself. When I called CifParser function parse_structures with check_occu = False, the same warning and errors were logged.
Checking the source code I think there is something implemented that is inconsistent with the algorithm, by a quick glance:
Under _get_structures():

 # If check_occu is True or the occupancy is greater than 0, create comp_d
            if not check_occu or occu > 0:
                coord = (x, y, z)
                match = get_matching_coord(coord)
                comp_dict = {el: max(occu, 1e-8)}

I think not check_occu is not consistent with the algorithm commented above.
Meanwhile, the warning I'm receiving comes from the following code snippet that is not switched off by "check_occu":

        if any(occu > 1 for occu in sum_occu):
            msg = (
                f"Some occupancies ({sum_occu}) sum to > 1! If they are within "
                "the occupancy_tolerance, they will be rescaled. "
                f"The current occupancy_tolerance is set to: {self._occupancy_tolerance}"
            )
            warnings.warn(msg)
            self.warnings.append(msg)

My temporarily hacky solution to my structure is to raise the occupancy_tolerance to 1000 to get the structure.

Expected Behavior

Sites that are too close to each other are kept without raising warning and errors that prevent getting the structures.

Minimal example

No response

Relevant files to reproduce this bug

No response

@KunhuanLiu KunhuanLiu added the bug label May 9, 2024
@DanielYang59
Copy link
Contributor

DanielYang59 commented May 9, 2024

I'm not a maintainer, but if you could provide the cif file and code snippet to reproduce this issue, that would be very helpful :)

@KunhuanLiu
Copy link
Author

KunhuanLiu commented May 9, 2024

Here is a minimally reproducible example with the structure (Crystal.cif) attached, highlighting that the argument check_occu cannot turn off the occupancy check. In this example structure, Site1 -- Site3 and Site4--6 are problematic positions in my structure. Atoms C4 and C5 mimic the normal atom sites in my structure.

To reproduce the bug:

from pymatgen.io.cif import CifParser
parser = CifParser("Crystal.cif", site_tolerance = 1e-4, occupancy_tolerance = 1.0)
b = parser.parse_structures(check_occu = False)

Error that I encountered:

Some occupancies ([2.0, 2.0, 2.0, 1.0, 1.0]) sum to > 1! If they are within the occupancy_tolerance, they will be rescaled. The current occupancy_tolerance is set to: 1.0
No structure parsed for section 1 in CIF.
...
in CifParser.parse_structures(self, primitive, symmetrized, check_occu, on_error)
   1219     warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
   1221 if len(structures) == 0:
-> 1222     raise ValueError("Invalid CIF file with no structures!")
   1223 return structures

ValueError: Invalid CIF file with no structures!

Expected behavior: occupancy would not be checked and a structure with all atoms should be given. I think code review could be more helpful to see the issue though.
Crystal.txt

@DanielYang59
Copy link
Contributor

DanielYang59 commented May 11, 2024

Hi @KunhuanLiu, I just had a look at this issue, and the inconsistency between comment and implementation you mentioned is indeed an issue:

pymatgen/pymatgen/io/cif.py

Lines 985 to 989 in 2e1c301

# If check_occu is True or the occupancy is greater than 0, create comp_d
if not check_occu or occu > 0:
coord = (x, y, z)
match = get_matching_coord(coord)
comp_dict = {el: max(occu, 1e-8)}

But in our case, that's not the cause of the failure.

Cause of code failure

The issue is the bad default value of occupancy_tolerance for CifParser (and lack of proper error message):

occupancy_tolerance: float = 1.0,

And the condition for scaling occupancy:

pymatgen/pymatgen/io/cif.py

Lines 1085 to 1086 in 2e1c301

if 1 < total_occu <= self._occupancy_tolerance:
all_species[idx] = species / total_occu

Which means the default occupancy_tolerance value would prevent invalid occupancy (in our case 2.0) being scaled, the thus the code breaks as Structure does not allow such invalid occupancy:

struct = Structure(lattice, all_species, all_coords, site_properties=site_properties, labels=all_labels)

I would talk to @janosh on changing the bad default value of occupancy_tolerance soon, and I appreciate your report.

To fix your code

To fix your code, you need to set the occupancy_tolerance to a value greater than the max occupancy (2.0 in your case), for example:

from pymatgen.io.cif import CifParser

parser = CifParser("Crystal_min.cif", occupancy_tolerance=2.0)

structures = parser.parse_structures(check_occu=False)

Hope it helps :)

@KunhuanLiu
Copy link
Author

Thanks @DanielYang59 , for investigating the case. Totally agree with your thoughts; I came up with a similar manual fix (I offered a test case where manually setting the occupancy tolerance to 2.0 can address the issue. In my structure, I had to use occupancy tolerance = 1000.) Of course, it will be difficult to manually check every structure (and set tolerance) for high throughput screening...

In addition to attending to the behavior of occupancy tolerance, I think the argument check_occu according to the API reference would be very useful if it can work properly: "site occupancy will not be checked, allowing unphysical occupancy != 1..."

Again thank you for the attention and time!

@DanielYang59
Copy link
Contributor

DanielYang59 commented May 11, 2024

In addition to attending to the behavior of occupancy tolerance, I think the argument check_occu according to the API reference would be very useful if it can work properly: "site occupancy will not be checked, allowing unphysical occupancy != 1..."

Maybe the behavior should be changed such that if not check_occu, any occupancy would be allowed with a warning message. What do you think? Feel free to comment on #3819.

Again thank you for the attention and time!

No worries at all.

@KunhuanLiu
Copy link
Author

Sorry about the delayed response -- I noticed another thing about the current check_occu behavior. I am sorry I cannot offer more help reviewing the algorithm right now.

Maybe the behavior should be changed such that if not check_occu, any occupancy would be allowed with a warning message. What do you think? Feel free to comment on #3819.

I think that's what I thought the argument should do based on the API documentation description. When not check_occu, all atom sites, regardless of occup_tolerance, will be returned.

I want to note the following behavior when I use check_occu = False though:

b = parser.parse_structures(check_occu = False)[0]
# from pymatgen.analysis.structure_analyzer import get_neighbors
for site in b:
    neighbors = b.get_neighbors(site, 0.1)  # 0.5 Å as an example threshold for overlap
    if neighbors:
        print(f"Overlap suspected near {site.label} at {site.frac_coords}:\n{neighbors}")

Returned atoms do not have labels:

Overlap suspected near C at [0.1191 0.5596 0.5713]:
[PeriodicNeighbor: C (-5.305, 16.0, 13.02) [0.1191, 0.5595, 0.5713]]
Overlap suspected near C at [0.4404 0.5595 0.5713]:
[PeriodicNeighbor: C (5.306, 16.0, 13.02) [0.4405, 0.5596, 0.5713]]
Overlap suspected near C at [0.4405 0.8809 0.5713]:
[PeriodicNeighbor: C (-0.001651, 25.19, 13.02) [0.4404, 0.8809, 0.5713]]
Overlap suspected near C at [0.4404 0.8809 0.5713]:
[PeriodicNeighbor: C (0.001651, 25.19, 13.02) [0.4405, 0.8809, 0.5713]]
Overlap suspected near C at [0.4405 0.5596 0.5713]:
[PeriodicNeighbor: C (5.305, 16.0, 13.02) [0.4404, 0.5595, 0.5713]]
Overlap suspected near C at [0.1191 0.5595 0.5713]:
[PeriodicNeighbor: C (-5.306, 16.0, 13.02) [0.1191, 0.5596, 0.5713]]

@DanielYang59
Copy link
Contributor

Hi thanks for following up. I have updated the behavior and docstring in #3819.

Meanwhile I have a question though, as per the docstring, "unphysical occupancy" refers to occupancy != 1, but the implementation just handles cases where occupancy > 1, is there any chance for the occupancy to be smaller than zero and still need to be included (I'm not aware of such cases)?

I would look into the site label soon, thanks for reporting.

@DanielYang59
Copy link
Contributor

DanielYang59 commented May 15, 2024

The label issue is fixed by 05c11d4. Thanks.

@esoteric-ephemera
Copy link
Contributor

esoteric-ephemera commented May 15, 2024

Hi @KunhuanLiu : the CIF file you provided does not appear to be valid, as the three sites at the end, Site6, Site7, and Site8, are duplicates of Site3, Site2, and Site1, respectively. Removing these latter three sites successfully produces a structure with pymatgen

Pymatgen does not currently include functionality to fix bad CIFs. This isn't an issue with the occupancy tolerance function

@KunhuanLiu
Copy link
Author

KunhuanLiu commented May 15, 2024

Hello @esoteric-ephemera , this ticket inquiries whether the implementation of the check_occu argument is consistent with the algorithm described in both the source code comment and the API documentation.

Removing these latter three sites successfully produces a structure with pymatgen

Example structure serves as a minimal example to highlight the impact of the issue.

Pymatgen does not currently include functionality to fix bad CIFs.

Thank you for sharing. Can you clarify what you mean? I am not requesting such functionality here. Can you comment on how or why check_occu = False should not return a structure when the perceived occupancy exceeds the user-set occu_tolerance? This can be important as @DanielYang59 has implemented several changes in #3819 and your review can be helpful. Just to make sure I am not proposing something based on misunderstanding of the intended behavior.

I also don't think that "bad cifs" (a term that should be defined more clearly) are the sole reasons for such errors. A structure with large unit cell and dense atoms can have atoms such as H that are "close" to each other under the 1e-4 default site tolerance (I cannot publish the structure here). If check_occu = False is not intended to keep the overlapping or nonphysical atoms, then the documentation, instead of the code, needs revision. Of course, it's then the user's responsibility to catch error or adjust tolerance parameters if they decide to continue their analysis on those structures.

This isn't an issue with the occupancy tolerance function

I never thought it was in this case. Setting a higher occupancy tolerance is an unorthodox trick to let pymatgen return my structure. The question concerns the proper behavior with the check_occu argument, and I will appreciate your comments on this

@esoteric-ephemera
Copy link
Contributor

@KunhuanLiu: the example is not a reasonable CIF because it contains multiply-occupied sites. The site occupancy represents fractionally what elements occupy a given site, e.g., a disordered structure will have at least two sites with occupancies < 1, and an ordered structure will have all site occupancies == 1. A site occupancy > 1 is not physical

A "bad CIF" in this context means a CIF that is either out of spec per the IUCr or is unphysical. The ICUr specifically states that the occupancy represents:

The fraction of the atom type present at this site.
The sum of the occupancies of all the atom types at this site
may not significantly exceed 1.0 unless it is a dummy site.

Right now, pymatgen does not support this kind of "dummy site" because it's not compatible with other functionality in pymatgen. pymatgen does support disordered structures, where at least two elements occupy the same site with fractional (< 1) occupancy.

check_occ is not intended to ensure non-overlapping sites. It is intended to ensure that the fractional composition on a given site does not exceed 1

@DanielYang59
Copy link
Contributor

DanielYang59 commented May 16, 2024

Hi @esoteric-ephemera thanks a lot for your comment.

Combine of sites

the CIF file you provided does not appear to be valid, as the three sites at the end, Site6, Site7, and Site8, are duplicates of Site3, Site2, and Site1, respectively.

Pymatgen does not currently include functionality to fix bad CIFs.

It appears CifParser is able to handle such cifs:

pymatgen/pymatgen/io/cif.py

Lines 306 to 329 in bb68c78

class CifParser:
"""
CIF file parser. Attempt to fix CIFs that are out-of-spec, but will issue warnings
if corrections applied. These are also stored in the CifParser's warnings attribute.
CIF file parser. Attempt to fix CIFs that are out-of-spec, but will issue warnings
if corrections applied. These are also stored in the CifParser's errors attribute.
"""
def __init__(
self,
filename: PathLike | StringIO,
occupancy_tolerance: float = 1.0,
site_tolerance: float = 1e-4,
frac_tolerance: float = 1e-4,
check_cif: bool = True,
comp_tol: float = 0.01,
) -> None:
"""
Args:
filename (PathLike): CIF file, gzipped or bzipped CIF files are fine too.
occupancy_tolerance (float): If total occupancy of a site is between 1 and occupancy_tolerance, the
occupancies will be scaled down to 1.
site_tolerance (float): This tolerance is used to determine if two sites are in the same position,
in which case they will be combined to a single disordered site. Defaults to 1e-4.

And:

pymatgen/pymatgen/io/cif.py

Lines 328 to 329 in bb68c78

site_tolerance (float): This tolerance is used to determine if two sites are in the same position,
in which case they will be combined to a single disordered site. Defaults to 1e-4.

The cif @KunhuanLiu provided:

 Site1  C 0.11910   0.55960   0.57130   1.0
 Site2	C 0.44040   0.55950   0.57130   1.000  << 
 Site3  C 0.44050   0.88090   0.57130   1.000
 C4  	C 0.07117   0.21573   0.93523   1.000
 C5	C 0.14457   0.21573   0.93523   1.000
 Site6	C 0.44040   0.88090   0.57130   1.000
 Site7  C 0.44050   0.55960   0.57130   1.000  << would be combined into Site2
 Site8  C 0.11910   0.55950   0.57130   1.000

Site2 and Site7 (and two other pairs similarly) are not duplicate, but within the site_tolerance, and were therefore combined into a single site (with occupancy summed).

Handle of "unphysical" (occupancy > 1) sites

Right now, pymatgen does not support this kind of "dummy site" because it's not compatible with other functionality in pymatgen.

In parse_structures, there is:

pymatgen/pymatgen/io/cif.py

Lines 1263 to 1266 in bb68c78

check_occu (bool): Whether to check site for unphysical occupancy > 1.
Useful for experimental results in which occupancy was allowed to
refine to unphysical values. Warning: unphysical occupancies are
incompatible with many pymatgen features. Defaults to True.

Which should allow user to parse cif with occupancy > 1, once check_occu is turned off?

@KunhuanLiu
Copy link
Author

KunhuanLiu commented May 16, 2024

@esoteric-ephemera Thanks so much for clarifying, and I apologize for my oversight on the fact that in a realistic crystal structure, no atom sites would have summed occupancy > 1. I understand and support the design that raises the error ValueError: Invalid CIF file with no structures! when there are sites with occupancy > occupancy_tolerance and not rescaled to 1.

I am still unsure what the algorithm should do when check_occu is set to FALSE, but according to @esoteric-ephemera it should still not return a bad CIF with site occupancy > 1 that is caused by merging atom sites.

I'm very thankful to the attention and help from @DanielYang59, and I want to make sure we are not changing the code implementation unnecessarily. It looks like I had some misunderstanding caused by warning messages and documentation.

To make sure we are aligned on the intended behavior, here's my (corrected) understanding:

  • site_tolerance: decides whether sites are merged and the resulted site has occupancy as sum of all the site occupancy
  • occupancy_tolerance: used to scale down sites with occupancy <= occupancy_tolerance
  • check_occu: setting to True (default value), check if any site has occupancy > 1; if yes, raise ValueError. Setting to False, ignores _atom_site_occupancy. However, it is intended to still raise ValueError when any site has occupancy >1. (Needs confirmation)
  • Observed in my testing: check_occu = False removes atom site labels.

@DanielYang59 thanks for your attention. Based on our discussion I think the original version has a very reasonable design and raises ValueError as intended. It sounds reasonable to me to refer the structures with occupancy > 1 as bad cifs. Changes around tolerance are not necessary. I think there can be some improvement on the documentation.

pymatgen/pymatgen/io/cif.py

Lines 1011 to 1018 in 2e1c301

if any(occu > 1 for occu in sum_occu):
msg = (
f"Some occupancies ({sum_occu}) sum to > 1! If they are within "
"the occupancy_tolerance, they will be rescaled. "
f"The current occupancy_tolerance is set to: {self._occupancy_tolerance}"
)
warnings.warn(msg)
self.warnings.append(msg)

Here, ({sum_occu}) sum to > 1! is confusing and can lead to a very very long warning message when there are lots of atoms. This happened in my case and made it difficult to correctly understand the warning message.
I suggest the following:

unphysical_occu = [occu for occu in sum_occu if occu > 1]
...         f"Some occupancies sum to > 1! If they are within " 
         "the occupancy_tolerance, they will be rescaled. " 
           f"The current occupancy_tolerance is set to: {self._occupancy_tolerance}. Sites with occupancy sum > 1: {unphysical_occu}" 

@esoteric-ephemera Can please you confirm that the behavior described above on check_occu is correct? I also wonder if this is a good place to close the ticket, or to open new issues with sharpened focus. Thanks for all the attention!

@esoteric-ephemera
Copy link
Contributor

Sorry for the delay. @DanielYang59:

Site2 and Site7 (and two other pairs similarly) are not duplicate, but within the site_tolerance, and were therefore combined into a single site (with occupancy summed).

They're duplicates in terms of position and chemical species, their labels differ. The right way to specify a site like this is with 0.5 occupancy on Site2 and 0.5 occupancy on Site7. That specifies a disordered structure with 50-50 occupancy C labelled Site2 and C labelled Site7

Which should allow user to parse cif with occupancy > 1, once check_occu is turned off?

It might let you do that but I'm not sure what it parses will be meaningful / usable in pymatgen. I think pymatgen should parse CIFs that conform (within a tolerance) to the IUCr standards and not extrapolate when a CIF deviates too far from those standards.

@KunhuanLiu:

Can please you confirm that the behavior described above on check_occu is correct? I also wonder if this is a good place to close the ticket, or to open new issues with sharpened focus. Thanks for all the attention!

This is right except occupancy_tolerance is used to allow for occupancies slightly larger than 1 (this often arises from rounding errors in hand-written CIFs, such as those from the ICSD)

check_occu = False removing the site labels is probably necessary because pymatgen just merges sites with >100% occupancy and the same chemical identity

As for closing the issue / @DanielYang59's PR: I would prefer to just amend the ValueError message for clarity - the site tolerance shouldn't be changed unless there's an in-spec CIF that isn't being parsed correctly

@DanielYang59
Copy link
Contributor

DanielYang59 commented May 22, 2024

Sorry for the delay. @DanielYang59:

All good. Thanks a lot for your comments :)

Site2 and Site7 (and two other pairs similarly) are not duplicate, but within the site_tolerance, and were therefore combined into a single site (with occupancy summed).

They're duplicates in terms of position and chemical species, their labels differ. The right way to specify a site like this is with 0.5 occupancy on Site2 and 0.5 occupancy on Site7. That specifies a disordered structure with 50-50 occupancy C labelled Site2 and C labelled Site7

Sorry I don't think they're duplicates, as their x/y coordinates differ (slightly by 1e-4):

 Site2	C 0.44040   0.55950   0.57130   1.000  
 Site7  C 0.44050   0.55960   0.57130   1.000  

So if the tolerance is set to a stricter value, they could be two individual sites. But in our case, the default tolerance is such that these two sites are indeed considered at the same position (but I would not consider them duplicates).

pymatgen/pymatgen/io/cif.py

Lines 328 to 329 in 14b7357

site_tolerance (float): This tolerance is used to determine if two sites are in the same position,
in which case they will be combined to a single disordered site. Defaults to 1e-4.

Which should allow user to parse cif with occupancy > 1, once check_occu is turned off?

It might let you do that but I'm not sure what it parses will be meaningful / usable in pymatgen. I think pymatgen should parse CIFs that conform (within a tolerance) to the IUCr standards and not extrapolate when a CIF deviates too far from those standards.

I'm unsure about the purpose of having such behavior either, but as per the docstring:

pymatgen/pymatgen/io/cif.py

Lines 1263 to 1266 in 14b7357

check_occu (bool): Whether to check site for unphysical occupancy > 1.
Useful for experimental results in which occupancy was allowed to
refine to unphysical values. Warning: unphysical occupancies are
incompatible with many pymatgen features. Defaults to True.

So I might assume there is a reason to allow for such cif files.

If you or anyone else with more knowledge on cif decide to remove this functionality, we might need to remove check_occu because it seems to causing confusing (for example this issue). Because occupancy would be checked no matter how it is set.

As for closing the issue / @DanielYang59's PR: I would prefer to just amend the ValueError message for clarity - the site tolerance shouldn't be changed unless there's an in-spec CIF that isn't being parsed correctly

Yes I have already updated the code with a more descriptive error message.

I didn't change the site tolerance but updated the behavior of check_occu to allow for any occupancy once turned off (to be consistent with the docstring). But certainly we need further confirmation on whether we want such behavior or not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants