Add the ability for the `PDBManager` to perform interface-based chain filtering #333

amorehead · 2023-08-18T20:24:21Z

What does this implement/fix? Explain your changes

This allows one to select PDB complex chains satisfying certain interface contact or hydrogen bonding constraints.
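
As a rough illustration of what an interface-contact constraint could look like, here is a minimal sketch and not the code added in this PR. It assumes each chain is given as a list of `(atom_name, xyz)` tuples and treats two chains as interacting when any pair of their `CA` or `C4'` atoms lies within a distance cutoff. The function names, the data layout, and the 8 Å default are all hypothetical choices made for the example.

```python
from itertools import combinations
from math import dist
from typing import Dict, List, Tuple

# Hypothetical data layout: one (atom name, xyz coordinate) tuple per atom.
Atom = Tuple[str, Tuple[float, float, float]]

# Representative atoms, mirroring the primary contact atoms named in this PR.
CONTACT_ATOMS = {"CA", "C4'"}


def chains_in_contact(
    chain_a: List[Atom], chain_b: List[Atom], cutoff: float = 8.0
) -> bool:
    """Return True if any representative atom pair is within `cutoff` angstroms."""
    coords_a = [xyz for name, xyz in chain_a if name in CONTACT_ATOMS]
    coords_b = [xyz for name, xyz in chain_b if name in CONTACT_ATOMS]
    return any(dist(p, q) <= cutoff for p in coords_a for q in coords_b)


def filter_interface_chains(
    chains: Dict[str, List[Atom]], cutoff: float = 8.0
) -> List[str]:
    """Keep only chain IDs that contact at least one other chain."""
    keep = set()
    for (id_a, atoms_a), (id_b, atoms_b) in combinations(chains.items(), 2):
        if chains_in_contact(atoms_a, atoms_b, cutoff):
            keep.update((id_a, id_b))
    return sorted(keep)


# Toy complex: chains A and B sit 5 angstroms apart, chain C is far away.
toy_complex = {
    "A": [("CA", (0.0, 0.0, 0.0))],
    "B": [("CA", (5.0, 0.0, 0.0))],
    "C": [("C4'", (50.0, 0.0, 0.0))],
}
interface_chains = filter_interface_chains(toy_complex)
```

The all-pairs loop is quadratic in the number of atoms, which is fine for a sketch; a spatial index would be the usual choice for large complexes.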


amorehead · 2023-08-18T20:25:09Z

graphein/ml/datasets/pdb_data.py

@@ -23,6 +25,11 @@
 )
 from graphein.utils.dependencies import is_tool

+PRIMARY_INTERCHAIN_CONTACT_ATOMS_FOR_FILTERING: List[str] = ["CA", "C4'"]
+SECONDARY_INTERCHAIN_CONTACT_ATOMS_NOT_FOR_FILTERING: List[str] = ["H"]
+PRIMARY_HYDROGEN_BOND_ATOMS_FOR_FILTERING: List[str] = ["N", "O", "N1", "N9", "N3", "C2", "C4", "C5", "C6"]


This should be vetted more carefully, as I initially chose these atom types heuristically.

What is this atom naming scheme? It doesn't ring any bells for me (see `ATOM_NUMBERING: Dict[str, int] = {` in graphein/graphein/protein/resi_atoms.py, line 276 in 281ce30).

We already have these constants:

graphein/graphein/protein/resi_atoms.py, line 858 in 281ce30: `HYDROGEN_BOND_DONORS: Dict[str, Dict[str, int]] = {`

graphein/graphein/protein/resi_atoms.py, line 880 in 281ce30: `HYDROGEN_BOND_ACCEPTORS: Dict[str, Dict[str, int]] = {`
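
For illustration, tables with that `Dict[str, Dict[str, int]]` shape (residue name to atom name to count) could be flattened into atom-name sets for filtering, roughly as sketched below. The `TOY_*` tables are small stand-ins written for this example and are not copied from `resi_atoms.py`; the helper names are hypothetical as well.

```python
from typing import Dict, Set

# Illustrative stand-ins with the same shape as the graphein constants:
# residue name -> atom name -> number of hydrogen bonds the atom can form.
TOY_DONORS: Dict[str, Dict[str, int]] = {
    "ARG": {"NE": 1, "NH1": 2, "NH2": 2},
    "SER": {"OG": 1},
}
TOY_ACCEPTORS: Dict[str, Dict[str, int]] = {
    "ASP": {"OD1": 2, "OD2": 2},
    "SER": {"OG": 2},
}


def atom_names(table: Dict[str, Dict[str, int]]) -> Set[str]:
    """Collect every atom name that appears in a residue -> atom table."""
    return {atom for atoms in table.values() for atom in atoms}


def hbond_atoms_for_residue(
    resname: str,
    donors: Dict[str, Dict[str, int]],
    acceptors: Dict[str, Dict[str, int]],
) -> Set[str]:
    """Atoms of one residue type that can donate or accept a hydrogen bond."""
    return set(donors.get(resname, {})) | set(acceptors.get(resname, {}))


donor_atom_names = atom_names(TOY_DONORS)
serine_atoms = hbond_atoms_for_residue("SER", TOY_DONORS, TOY_ACCEPTORS)
```

Looking atoms up per residue type, rather than through one flat list, would let the filter cover sidechain atoms without assuming that every residue carries the same atom names.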

> What is this atom naming scheme? It doesn't ring any bells for me (see `ATOM_NUMBERING: Dict[str, int] = {` in graphein/graphein/protein/resi_atoms.py, line 276 in 281ce30).

The N, CA, O, and H atoms correspond to regular protein vocabulary; all other types correspond to nucleic acid residue atoms. My initial goal with this PR was to make a generic dataset chain filter for protein-protein, protein-nucleic acid, and nucleic acid-nucleic acid interactions, inspired by the dataset curation technique RoseTTAFold2NA uses for protein-nucleic acid structure prediction (https://www.biorxiv.org/content/10.1101/2022.09.09.507333v1.full.pdf, page 8). I am essentially trying to reproduce this filtering logic with the PDBManager (minus all the sequence alignments), and I thought a PR would be in order.
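
To make the three interaction classes concrete, here is a small sketch of how a chain pair might be labelled from atom names alone, using `CA` as the protein marker and `C4'` as the nucleic acid marker. The function names and labels are hypothetical and this is not the PR's implementation.

```python
from typing import Iterable


def chain_kind(atom_names: Iterable[str]) -> str:
    """Guess the polymer type of a chain from its backbone atom names."""
    names = set(atom_names)
    has_protein_marker = "CA" in names   # alpha carbon of amino acid residues
    has_na_marker = "C4'" in names       # sugar carbon of nucleotides
    if has_protein_marker and not has_na_marker:
        return "protein"
    if has_na_marker and not has_protein_marker:
        return "nucleic_acid"
    return "unknown"  # neither marker, or both (e.g. a mixed chain)


def interaction_type(atoms_a: Iterable[str], atoms_b: Iterable[str]) -> str:
    """Label a chain pair with one of the three interaction classes."""
    labels = {
        ("protein", "protein"): "protein-protein",
        ("nucleic_acid", "protein"): "protein-nucleic acid",
        ("nucleic_acid", "nucleic_acid"): "nucleic acid-nucleic acid",
    }
    # Sort so that the lookup does not depend on argument order.
    key = tuple(sorted((chain_kind(atoms_a), chain_kind(atoms_b))))
    return labels.get(key, "unknown")


protein_atoms = ["N", "CA", "C", "O"]
rna_atoms = ["P", "C4'", "N1", "N3"]
pair_label = interaction_type(protein_atoms, rna_atoms)
```

One caveat: calcium ions also carry the atom name `CA` in PDB files, so a real filter would likely need to check the record type or residue name before trusting this marker.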

Per a suggestion from a colleague, I have removed the C atoms from the hydrogen bond calculation, as these atoms are very rarely involved in the formation of h-bonds in proteins and NAs.

> The N, CA, O, and H atoms correspond to regular protein vocabulary, however, all other types correspond to nucleic acid residue atoms.

Got it, bells ring for me now :)

So these H-bond definitions do not account for sidechain-X hbonds, only backbone-backbone hbonds?

Right. Here's a naive question on my part: How frequent would you say the occurrence of sidechain-X hbonds is? If they are pretty common, perhaps we can simply include more protein and nucleic acid (NA) atom types to the list here?

Seemingly quite common!

https://academic.oup.com/peds/article/13/4/227/1627008

Given how I have designed this filtering logic, I am assuming that each (protein or NA) residue potentially contains the following atoms: "N", "O", "N1", "N9", "N3". Given the prevalence of sidechain hbonds, which protein atom types (shared across all residue types) would be most reasonable to include so that we cover most of the possible hbonds mentioned in this article? The only other atom type I think we could include is the beta carbon (CB).
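
The assumption described above can be sketched as a purely geometric proxy: select whichever of the listed atoms a chain happens to contain and report cross-chain pairs that fall within a distance cutoff. This is a hedged illustration only. The 3.5 Å cutoff, the data layout, and the function name are assumptions, and a true hydrogen bond test would also check donor and acceptor roles and bond angles.

```python
from math import dist
from typing import List, Tuple

# Hypothetical data layout: one (atom name, xyz coordinate) tuple per atom.
Atom = Tuple[str, Tuple[float, float, float]]

# Atom names taken from the comment above; residues lacking some of them
# simply contribute fewer candidates.
HBOND_ATOMS = ("N", "O", "N1", "N9", "N3")


def hbond_candidate_pairs(
    chain_a: List[Atom], chain_b: List[Atom], cutoff: float = 3.5
) -> List[Tuple[str, str]]:
    """Return (atom in A, atom in B) name pairs closer than `cutoff` angstroms."""
    sel_a = [(name, xyz) for name, xyz in chain_a if name in HBOND_ATOMS]
    sel_b = [(name, xyz) for name, xyz in chain_b if name in HBOND_ATOMS]
    return [
        (name_a, name_b)
        for name_a, xyz_a in sel_a
        for name_b, xyz_b in sel_b
        if dist(xyz_a, xyz_b) <= cutoff
    ]


# Toy example: the protein N sits 3.0 angstroms from the nucleotide N1,
# while the protein O is about 4.24 angstroms away and is not reported.
protein_chain = [
    ("N", (0.0, 0.0, 0.0)),
    ("CA", (1.5, 0.0, 0.0)),
    ("O", (3.0, 0.0, 0.0)),
]
rna_chain = [("N1", (0.0, 0.0, 3.0)), ("C4'", (0.0, 0.0, 1.0))]
candidate_pairs = hbond_candidate_pairs(protein_chain, rna_chain)
```

Because the selection is by name only, adding sidechain atoms would amount to extending `HBOND_ATOMS`, at the cost of more atom pairs to compare.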


sonarcloud · 2023-09-12T03:08:57Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
1 Code Smell

No Coverage information
0.0% Duplication

e6e9658 · Add the ability for the PDBManager to perform interface-based chain… filtering

pre-commit-ci bot and others added 4 commits August 18, 2023 20:26

8c821f0 · [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)

4b39a4b · Remove C atoms from hydrogen bond calculation

6518dc7 · Merge branch 'a-r-j:master' into amorehead-pdbmanager-chain-interface…-filtering

ea2e2f3 · Merge branch 'a-r-j:master' into amorehead-pdbmanager-chain-interface…-filtering


Review comments, in order: amorehead · Aug 18, 2023; a-r-j · Aug 19, 2023 (two comments); amorehead · Aug 19, 2023 (two comments, the first edited); a-r-j · Aug 20, 2023; amorehead · Aug 20, 2023 (edited); a-r-j · Aug 20, 2023; amorehead · Aug 20, 2023; sonarcloud bot · Sep 12, 2023.
