unexpected behavior when calculating key interactions, and asking for help when using your code #3

Atomu2014 · 2024-04-14T10:54:48Z

Hi:

Recently I've been using PoseCheck to evaluate clash, strain energy and key interactions for SBDD models. And I found the following unexpected behaviors:

When loading CrossDocked test set proteins, a ValueError occurs: vdw radii for types: CO. These can be defined manually using the keyword 'vdwradii', which results in protein loading failed. I did a quick search to resolve this bug but cannot find relavant solutions, except that copilot suggests inserting the following code:

vdw_radii = {'CO': 1.7}  # Define the vdw radius for 'CO'. The value 1.7 is just an example, replace it with the correct value.
ag.guess_bonds(vdwradii=vdw_radii)

I'm not sure whether 1.7 is correct for CO, and what is ag, how to call guess_bonds?

Since doing PoseCheck on >10K molecules is quite slow, multi-processing is necessary. An error would occur if multiple processes load the same protein at the same time. After some debugging, I found the following code (starting at line 60 in posecheck/utils/loading.py):

tmp_path = pdb_path.split(".pdb")[0] + "_tmp.pdb"

# Call reduce to make tmp PDB with waters
reduce_command = f"{reduce_path} -NOFLIP  {pdb_path} -Quiet > {tmp_path}"
subprocess.run(reduce_command, shell=True)

# Load the protein from the temporary PDB file
prot = load_protein_prolif(tmp_path)
os.remove(tmp_path)

I guess when multiple processes accessing the same tmp_path, a process may unexpectedly remove the file before another process load it. Thus I made the following quick fix:

tmp_path = pdb_path.split(".pdb")[0] + "_tmp.pdb"
while os.path.exists(tmp_path):
    hash_code = str(hash(tmp_path))[:4]
    tmp_path = tmp_path[:-8] + '_' + hash_code + "_tmp.pdb"
print(tmp_path)

This quick fix works for me. Please check if this is correct and consider update your source code for better multi-process support.

Except for those bugs, I will appreciate your help in the following cases:

In your README tips, you said Reading and processing all the PDB files using reduce can take a while for a large test set. If you are running PoseCheck frequently, it might be worth pre-processing all proteins yourself using prot = posecheck.utils.loading.load_protein_from_pdb(pdb_path) and setting this directly within PoseCheck using pc.protein = prot. I guess you are mentioning the following code which could be quite time-consuming:

# Call reduce to make tmp PDB with waters
reduce_command = f"{reduce_path} -NOFLIP  {pdb_path} -Quiet > {tmp_path}"
subprocess.run(reduce_command, shell=True)

How about adding an interface to preprocess all protein files? And I'm curious about how much time can be saved if proteins are preprocessed?

Can you provide some formal code for calculating the exact numbers for key interactions? Since your interface pc.calculate_interactions() returns a complicated dataframe, I'm not sure whether my parsing is correct:

df = pc.calculate_interactions()
columns = np.array([column[2] for column in df.columns])
flags = np.array([df[column][0] for column in df.columns])

def count_inter(inter_type):
    if len(columns) == 0:
        return 0
    count = sum((columns == inter_type) & flags)
    return count

# ['Hydrophobic', 'HBDonor', 'VdWContact', 'HBAcceptor']
hb_donor = count_inter('HBDonor')
hb_acceptor = count_inter('HBAcceptor')
vdw = count_inter('VdWContact')
hydrophobic = count_inter('Hydrophobic')

Please verify my code and provide some formal guidelines.

When using PoseCheck, many warning messages will be printed like this:

/opt/conda/lib/python3.9/site-packages/MDAnalysis/topology/guessers.py:146: UserWarning: Failed to guess the mass for the following atom types: CO
  warnings.warn("Failed to guess the mass for the following atom types: {}".format(atom_type))
/opt/conda/lib/python3.9/site-packages/MDAnalysis/converters/RDKit.py:473: UserWarning: No `bonds` attribute in this AtomGroup. Guessing bonds based on atoms coordinates
  warnings.warn(
WARNING: atom  ZN from ZN will be treated as zinc
*WARNING*: Residues GLN 84  and HIS 90  in chain  B appear unbonded 
            and will be treated as a chain break
*WARNING*: Residues GLN 84  and HIS 90  in chain  B appear unbonded 
            and will be treated as a chain break
*WARNING*: Residues LEU 98  and ASP 101  in chain  B appear unbonded 
            and will be treated as a chain break
*WARNING*: Residues LEU 98  and ASP 101  in chain  B appear unbonded 
            and will be treated as a chain break
*WARNING*: Res "ZN" not in HETATM Connection Database. Hydrogens not added.
*WARNING*: Res "K" not in HETATM Connection Database. Hydrogens not added.
*WARNING*: Res "K" not in HETATM Connection Database. Hydrogens not added.

I think this information doesn't change the results, right? Do you know how to suppress these warning message? A verbose flag would be quite useful if this message can be turned off.

I think many of your followers are using CrossDocked dataset. A formal and efficient evaluation script for CrossDocked (many proteins, each protein associated with many molecules) would be beneficial. I would like to discuss with you and contribute to this script.

If you want discussion, drop me an email at kevinqu16 [at] gmail [dot] com.
That's all. Thank you!

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

unexpected behavior when calculating key interactions, and asking for help when using your code #3

unexpected behavior when calculating key interactions, and asking for help when using your code #3

Atomu2014 commented Apr 14, 2024

unexpected behavior when calculating key interactions, and asking for help when using your code #3

unexpected behavior when calculating key interactions, and asking for help when using your code #3

Comments

Atomu2014 commented Apr 14, 2024