Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] improved/corrected pdb_selaltloc tool #156

Open
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

joaomcteixeira
Copy link
Member

Improves pdb_selaltloc tool:

I think we should bump minor because of the importance of this update.
Thanks to everyone participating in the discussions 馃憦

I cannot add you as a reviewer @mgiulini; consider yourselves a reviewer 馃槈

@joaomcteixeira
Copy link
Member Author

Actions will work after #157

@rvhonorato rvhonorato requested review from mgiulini and removed request for rvhonorato March 31, 2023 09:49
pdbtools/pdb_selaltloc.py Outdated Show resolved Hide resolved
altloc_lines = dict()
res_per_loc = dict()
# flush lines because we enter a new chain
if chain != prev_chain:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really avoid storing all residues in a chain, especially since this operation is residue based. You can trigger a flush when you see a new residue. That saves a lot of memory, not that it matters much because chains are tiny, but keeps with the idea of generators/iterators.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expanding on this a little, I wonder if you can simplify the overall logic by doing the following. All ATOM/HETATM/ANISOU records have the same identifiers (chain, resnum, inscode). You can record these are you iterate since they must be in order. If they're not, it's really not our problem.

Once you hit a new residue, look back at the lines you accumulated for the residue. Track altlocs, if any in any of those lines. If there are any altlocs we want to keep/discard, do it then while flushing and ignoring those lines. This requires 3x iterations for a residue block if there are altlocs - once to store, once to process, once to flush - and you could cut this down to two by doing the processing as you store (basically keep a dict of {pdbname: {altloc: occupancy}}).

I think this might be simpler than what you have and has the added benefit that you also filter ANISOU lines that belong to discarded altlocs. Let me know what you think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really avoid storing all residues in a chain, especially since this operation is residue based. You can trigger a flush when you see a new residue. That saves a lot of memory, not that it matters much because chains are tiny, but keeps with the idea of generators/iterators.

While this is formally true, it is also true that there are some cases where altlocs are organized by altloc instead of residues. See test example. I know this is not the official format but if we added it to tests it is because we have seen it at some point.

Expanding on this a little (...)

You are right; we may optimize the code further, but I don't know if it can be simplified. In this PR, I aimed to reach the most robust algorithm, disregarding fine-tuned performance. I am positive there won't be memory issues because lines will be flushed on every chain, model or terminal line. Because pdb-tools is for PDBs and not for CIFs, there won't be more than 100k lines stored simultaneously in the dictionary; that's about 10MB of memory.

Considering the pressing nature of this issue, I suggest we merge this PR with [FEATURE] so all bugs are addressed, and then we work calmly on performance issues. I quickly measured the performance on a large PDB:

time 'pdb_selaltloc 7bqx.pdb'
pdb_selaltloc 7bqx.pdb: command not found

real	0m0,110s
user	0m0,090s
sys	0m0,025s

What do you think? 馃槈 馃幘

@mgiulini
Copy link

mgiulini commented Jul 3, 2023

hey there, I don't know whether this PR is closed, but I found a bug in the new implementation of the code. Take pdb file 2qxf: if you run the new selaltloc you'll get duplicate rows for all the residues that have 3 alternative locations, two of which with equal probability weight. An example is MET A 235:

ATOM   1924  N  AMET A 235      30.258  16.409   7.504  0.35 13.81           N
ATOM   1925  N  BMET A 235      30.226  16.401   7.511  0.35 13.80           N
ATOM   1926  N  CMET A 235      30.284  16.403   7.495  0.30 13.83           N
ATOM   1927  CA AMET A 235      31.514  15.799   7.009  0.35 13.83           C
ATOM   1928  CA BMET A 235      31.457  15.715   7.065  0.35 13.83           C
ATOM   1929  CA CMET A 235      31.587  15.874   6.971  0.30 13.85           C

becomes

ATOM   1924  N   MET A 235      30.258  16.409   7.504  0.35 13.81           N
ATOM   1925  N   MET A 235      30.226  16.401   7.511  0.35 13.80           N
ATOM   1927  CA  MET A 235      31.514  15.799   7.009  0.35 13.83           C
ATOM   1928  CA  MET A 235      31.457  15.715   7.065  0.35 13.83           C

I think that the problem lies here https://github.com/joaomcteixeira/pdb-tools/blob/0981c5a174e79490068bec327dc337cc3cb492d6/pdbtools/pdb_selaltloc.py#L287-L292

hope it helps!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pdb_selaltloc removes residues with (only) alternate locations
3 participants