Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pdb_selaltloc removes residues with (only) alternate locations #153

Open
mgiulini opened this issue Dec 6, 2022 · 22 comments · May be fixed by #156
Open

pdb_selaltloc removes residues with (only) alternate locations #153

mgiulini opened this issue Dec 6, 2022 · 22 comments · May be fixed by #156
Assignees
Labels

Comments

@mgiulini
Copy link

mgiulini commented Dec 6, 2022

Describe the bug
pdb_selaltloc removes residues in which all atoms show alternate locations

To Reproduce

grep -rn pdb4xoj.pdb -e "SER A  85"
pdb_selaltloc pdb4xoj.pdb > pdb4xoj-occ.pdb
grep -rn pdb4xoj-occ.pdb -e "SER A  85"

Expected behavior
pdb_selaltloc should not delete the residue but rather pick one of the two available alternate locations

Desktop (please complete the following information):

  • OS: Linux
  • Python Version 3.9

pdb4xoj.pdb.zip

@mgiulini mgiulini added the bug label Dec 6, 2022
@joaomcteixeira
Copy link
Member

Thanks for this @mgiulini
I will try to address it ASAP.
Cheers,

@mgiulini
Copy link
Author

thank you @joaomcteixeira
I just linked the corresponding issue in the artic3d repo

@joaomcteixeira
Copy link
Member

Hi, sorry for the delay in addressing this issue. I have been a bit overwhelmed on other fronts. I will address it in the next few weeks. It seems something that needs to be inspected carefully.

@mgiulini
Copy link
Author

mgiulini commented Feb 8, 2023

hi Joao! no worries, I could actually try to give a look into it by myself, as I would like to use pdb_selaltloc in another application, what do you say?

@amjjbonvin
Copy link
Member

amjjbonvin commented Feb 8, 2023 via email

@mgiulini
Copy link
Author

mgiulini commented Feb 8, 2023

I'll look into it!

@joaomcteixeira
Copy link
Member

Go for it. Look carefully at the way @JoaoRodrigues and I designed the tests. I suggest you first write the tests for this new case and then try to correct the code for that. Be sure first that the PDB is formatted well. Some issues with PDBs are unsolvable because the PDB itself is wrongly created.
Best,

@mgiulini
Copy link
Author

mgiulini commented Feb 8, 2023

I understood the problem, but I think it would take ages for me to solve it without breaking the code, as the algorithm is very complicated.

the problem occurs when flush_resloc_occ is called

def flush_resloc_occ(altloc_lines, **kw):
"""Flush the captured altloc lines by highest occupancy."""
# only the selected altloc is yieled
highest = 0.00
altloc = ' '
# detects which altloc identifier has the highest occupancy
for key, lines2flush in altloc_lines.items():
# we check only the first line because all atoms in one identifier
# should have the same occupancy value
occ = float(lines2flush[0][54:60])
if occ > highest:
altloc = key
highest = occ
for line2flush in altloc_lines[altloc]:
yield line2flush[:16] + ' ' + line2flush[17:]

You can clearly see from the code from the code that if the input dictionary altloc_lines has a key with highest possible occupancy (e.g., an atom with no alternate location, occupancy = 1.0), then the function will ignore all the other lines, no matter their occupancy. As an example, residues LYS84 and SER85 of the pdb 4xoj give rise to the following lines

{' ': ['ATOM    536  N   LYS A  84      -4.827  -2.055  19.735  1.00  9.68           N  \n', ... some other coordinates], 
'A': ['ATOM    537  CA ALYS A  84      -4.644  -3.431  19.279  0.70 10.05           C  \n', ... some other coordinates],
'B': ['ATOM    538  CA BLYS A  84      -4.643  -3.451  19.329  0.30 10.14           C  \n', ... some other coordinates]}


according to the function, the first key (' ') is selected and the others are simply ignored.

Possible solutions could be:

  • introduce an if clause in this function to always preserve the atoms with 1.0 (key = " "), and check only among the other keys which is the one has the highest occ
  • drop the flush_func_multi_residues logic and read the residues sequentially (according to their number)

I tried the second option, but it broke the tests..

@joaomcteixeira
Copy link
Member

I would say the first option sounds better. No problem accommodating the if clause.

Also, feel free to adventure yourself rewriting the inner code if you feel so 😈 . As long as the already implemented test cases pass and the API itself is kept. But first, i suggest trying the if-clause to see what happens. Good luck and many thanks

@joaomcteixeira
Copy link
Member

joaomcteixeira commented Mar 27, 2023

Thanks for all this input Marco. Addressing it in steps:

1)

according to the old implementation the residue ILE 25 should be excluded, but in my opinion that's wrong. (in #154 coment)

I revisited the tests, and they are correct to my knowledge on altlocs. ILE 25 is excluded if users don't give any option because it has the lowest occupancy. To select the ILE, the command has a trick; users have to specify the space:

pdb_selaltloc -' ' tests/data/dummy_altloc.pdb

In this way, ILE 25 is selected instead of LEU 25. Btw, occupancy should add up to one, and in the example, dummy numbers are used (0.61 and 0.90).

2)

There are several errors associated with this PDB when using pdb_selaltloc. On the one hand, SER 85 disappears. On the other hand, several atoms from other residues also disappear if we give -A or -B as options.

This PDB has an altloc formatting I have never seen before, that is: Atoms with alternate locations do NOT have the altloc identifier of the default. For example, the default is ' ' (space) for single location atom, while altlocs use directly to A and B and no ' '. I had never seen this before and completely overlooked this possibility when I developed the current version of selaltloc.

My question: Is this a correct altloc formatting? @JoaoRodrigues @amjjbonvin need your help with this.

If yes, I will need to rewrite the script considerably. It's not a problem, but I want to double-check with you before doing it.

Cheers!

@amjjbonvin
Copy link
Member

amjjbonvin commented Mar 28, 2023 via email

@joaomcteixeira
Copy link
Member

As usual, I went to bed and woke up thinking about it. I am tempted to say it is complicated (or even impossible) to cover this scenario in the current pdb_selaltloc implementation without twisting the algorithm. The current algorithm is not complex. It's an engine that relies on identifying when a new altloc group appears code. But yesterday I couldn't figure out a way to cover this case cleanly and without adding patch code here and there.

I am tempted to rewrite the whole script and return it to something similar to the original implementation: first collect, then flush; contrarily, to flush-on-the-fly as it is now. Obviously, considering all the new tests and edge cases encountered in the last years that contributed to this change in the algorithm.

I will keep you posted. I will work on this today.

@mgiulini
Copy link
Author

Thanks for all this input Marco. Addressing it in steps:

1)

according to the old implementation the residue ILE 25 should be excluded, but in my opinion that's wrong. (in #154 coment)

I revisited the tests, and they are correct to my knowledge on altlocs. ILE 25 is excluded if users don't give any option because it has the lowest occupancy. To select the ILE, the command has a trick; users have to specify the space:

pdb_selaltloc -' ' tests/data/dummy_altloc.pdb

In this way, ILE 25 is selected instead of LEU 25. Btw, occupancy should add up to one, and in the example, dummy numbers are used (0.61 and 0.90).

Hi there! not sure I agree about this: ILE 25 does not show any alternate locations, right? in my opinion it should be kept in the pdb by pdb_selaltloc, since in this case there's probably only a numbering issue..there're examples where two alternate locations correspond to different amino acids (such as residue 11 of https://www.ebi.ac.uk/pdbe/entry-files/pdb1alx.ent), but in that case you have different location labels (ATYR and BTRP). I would not remove full residues just because at the same residue ID there's another amino acid with some alternate locations. But this is up to your design choices of course:)

2)

There are several errors associated with this PDB when using pdb_selaltloc. On the one hand, SER 85 disappears. On the other hand, several atoms from other residues also disappear if we give -A or -B as options.

This PDB has an altloc formatting I have never seen before, that is: Atoms with alternate locations do NOT have the altloc identifier of the default. For example, the default is ' ' (space) for single location atom, while altlocs use directly to A and B and no ' '. I had never seen this before and completely overlooked this possibility when I developed the current version of selaltloc.

have you checked my solution in #154? probably not super elegant, but it might be a good starting point

@amjjbonvin
Copy link
Member

amjjbonvin commented Mar 28, 2023 via email

@mgiulini
Copy link
Author

It is there indeed (ILE25 and LEU25 are almost superimposed), but in the original file https://files.rcsb.org/view/3U7T.pdb ILE25 has the alternate location label A, so basically there are AILE and BLEU at position 25. It is in this case that it makes sense to select only one of them.

@joaomcteixeira
Copy link
Member

We are talking about the test case

The way I see it, ILE 25 and LEU 25 are two options of the same alt loc group

ATOM 357 N ILE A 25 -9.702 -5.727 -16.084 0.61 4.36 N
ANISOU 357 N ILE A 25 520 465 672 33 -18 2 N
ATOM 358 CA ILE A 25 -9.340 -5.286 -17.428 0.61 4.45 C
ANISOU 358 CA ILE A 25 533 529 630 -6 -50 -5 C
ATOM 359 C ILE A 25 -8.471 -4.014 -17.377 0.61 4.39 C
ANISOU 359 C ILE A 25 478 580 609 -1 -12 -14 C
ATOM 360 O ILE A 25 -7.519 -3.868 -18.141 0.61 5.13 O
ANISOU 360 O ILE A 25 566 725 660 -65 86 -77 O
ATOM 376 N ALEU A 25 -9.706 -5.722 -16.071 0.99 4.68 N
ANISOU 376 N ALEU A 25 553 548 676 10 -26 -1 N
ATOM 377 CA ALEU A 25 -9.351 -5.281 -17.410 0.99 4.87 C
ANISOU 377 CA ALEU A 25 554 645 651 -35 -48 -28 C
ATOM 378 C ALEU A 25 -8.459 -4.039 -17.346 0.99 4.69 C
ANISOU 378 C ALEU A 25 518 645 621 -13 -18 -33 C
ATOM 379 O ALEU A 25 -7.460 -3.953 -18.055 0.99 5.25 O
ANISOU 379 O ALEU A 25 583 759 652 -28 45 -75 O

It is the same as for ASER 22 and BPRO 22 at the beginning of the file, but in this case, the first alternative location is .

have you checked my solution in #154? Probably not super elegant, but it might be a good starting point

Yes, I saw. Thanks. But it's not clean in the sense of the implemented algorithm because it is a patch code trying to overcome a problem at the is_another_altloc_group function. And, it breaks two tests. We need something that is clean from scratch. I will try to do it today.

@mgiulini
Copy link
Author

The way I see it, ILE 25 and LEU 25 are two options of the same alt loc group

here is the question: should we consider the simple space " " an alternative location? I would not do that, but it's a design choice.

have you checked my solution in #154? Probably not super elegant, but it might be a good starting point

Yes, I saw. Thanks. But it's not clean in the sense of the implemented algorithm because it is a patch code trying to overcome a problem at the is_another_altloc_group function. And, it breaks two tests. We need something that is clean from scratch. I will try to do it today.

sure, as you wish

@JoaoRodrigues
Copy link
Member

JoaoRodrigues commented Mar 28, 2023 via email

@mgiulini
Copy link
Author

Hi Joao, yes, your solution makes sense..if the user submits a pdb with two amino acids having same chain-resnum..well, there's nothing you can do about it
the important thing is that the code should never delete a full residue (such as SER A 85 in the example) but rather pick one of the available altlocs

@joaomcteixeira
Copy link
Member

joaomcteixeira commented Mar 29, 2023

We are having some discussions on slack, but this one is relevant to register here:

In the case PDB @mgiulini sent. For example, for SER 85, altlocs A and B have the same occupancy of 0.50. What should we do for these cases when users run pdb_selaltloc without specifying a selection option, that is, select by occupancy?

  1. Should we select the first conformation?
  2. Should we select one of the conformations randomly?
  3. Should we output both and maintain the altloc character? (note altloc characters are deleted after selaltloc)

@JoaoRodrigues
Copy link
Member

JoaoRodrigues commented Mar 29, 2023 via email

@amjjbonvin
Copy link
Member

amjjbonvin commented Mar 29, 2023 via email

@mgiulini mgiulini reopened this Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants