Feature Labels for MolGraphConvFeaturizer #3926

Open
sumone-compbio opened this issue Mar 26, 2024 · 10 comments

@sumone-compbio

sumone-compbio commented Mar 26, 2024

❓ Questions & Help

Hi, I need help labeling the features produced by MolGraphConvFeaturizer. I went through the source code and listed the features below. Please correct me where my count for a feature is wrong:

  • Atom type: A one-hot vector of this atom, "C", "N", "O", "F", "P", "S", "Cl", "Br", "I", "other atoms" = 10
  • Formal charge: Integer electronic charge = 1
  • Hybridization: A one-hot vector of "sp", "sp2", "sp3" = 3
  • Hydrogen bonding: A one-hot vector of whether this atom is a hydrogen bond donor or acceptor = 1
  • Aromatic: A one-hot vector of whether the atom belongs to an aromatic ring = 1
  • Degree: A one-hot vector of the degree (0-5) of this atom = 6
  • Number of Hydrogens: A one-hot vector of the number of hydrogens (0-4) that this atom is connected to = 5

This sums to 27 features, but the featurizer returns 30. Please let me know what I am missing. Thank you.
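The tally above can be reproduced with a short sketch. It only encodes the widths as listed in the question; they are not verified against the DeepChem source, and the group names and the `label_features` helper are hypothetical names introduced for illustration.

```python
# Feature widths exactly as listed in the question (not verified against DeepChem).
listed_widths = {
    "atom_type": 10,
    "formal_charge": 1,
    "hybridization": 3,
    "hydrogen_bonding": 1,
    "aromatic": 1,
    "degree": 6,
    "num_hydrogens": 5,
}

OBSERVED_LENGTH = 30  # per-atom vector length reported in this thread


def label_features(widths):
    """Expand {group: width} into one label per vector position."""
    labels = []
    for group, width in widths.items():
        if width == 1:
            labels.append(group)
        else:
            labels.extend(f"{group}_{i}" for i in range(width))
    return labels


labels = label_features(listed_widths)
unaccounted = OBSERVED_LENGTH - len(labels)
print(f"labelled {len(labels)} positions, {unaccounted} unaccounted for")
```

Because the labels are positional, a group whose real width differs from the listed one would shift every label after it, so the three unaccounted positions need to be located before any label past the first wrong group can be trusted.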

@VaishnaviMudaliar

Hi @sumone-compbio, are you considering the following features as well?
- Chirality: A one-hot vector of the chirality, "R" or "S". (Optional)
- Partial charge: Calculated partial charge. (Optional)

@sumone-compbio
Author

Hi @VaishnaviMudaliar, sorry, but I am not using the optional features. Even with them excluded, the feature length is 30, and I need to label each of these features.

  • Atom type: A one-hot vector of this atom, "C", "N", "O", "F", "P", "S", "Cl", "Br", "I", "other atoms" = 10
  • Formal charge: Integer electronic charge = 1
  • Hybridization: A one-hot vector of "sp", "sp2", "sp3" = 3
  • Hydrogen bonding: A one-hot vector of whether this atom is a hydrogen bond donor or acceptor = 1
  • Aromatic: A one-hot vector of whether the atom belongs to an aromatic ring = 1
  • Degree: A one-hot vector of the degree (0-5) of this atom = 6
  • Number of Hydrogens: A one-hot vector of the number of hydrogens (0-4) that this atom is connected to = 5

This sums to 27 features, but the featurizer returns 30. Please let me know what I am missing. Thank you.

@sumone-compbio
Author

@VaishnaviMudaliar hi, is there any update? I really need to know the correct labels of these features for my thesis submission.

@rbharath
Member

rbharath commented Apr 6, 2024

@sumone-compbio Can you come by our office hours? (MWF at 9am PST)

@sumone-compbio
Author

@rbharath hi, I hope I'm not late. All I need to know is what the 30 default features of MolGraphConvFeaturizer are. Following the source code, I can only account for 27 of them, as shown in my comment above, and I don't understand what I am missing.

@rbharath
Member

@sumone-compbio Try checking the source: where are you getting the number 30 from? See in the code what is actually being put in. I don't have time right now to go through it, but I can show you where to look if you can join OH.

@sumone-compbio
Author

sumone-compbio commented Apr 19, 2024

@rbharath hi, the source code states that there are 30 atom (node) level features by default and 11 edge (bond) level features. Also, when I run this featurizer on SMILES strings, it does return 30 features for each atom.

https://github.com/deepchem/deepchem/blob/master/deepchem/feat/molecule_featurizers/mol_graph_conv_featurizer.py

@sumone-compbio
Author

@rbharath one more thing I would like to add: the atom order in the featurizer output (checked by comparing the first 10 elements of each atom feature vector) differs from the atom index you would get using mol.GetAtomWithIdx() in rdkit. You can verify this by comparing the positions of the atoms in the featurizer result with the function below, which simply writes each atom's index onto the atom in the mol image:

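`from rdkit import Chem

def mol_with_atom_index(mol):
    # Tag each atom with its RDKit index so the index shows in the drawing.
    for atom in mol.GetAtoms():
        atom.SetAtomMapNum(atom.GetIdx())
    return mol

smiles = "CCO"  # placeholder input; substitute your own SMILES
mol = Chem.MolFromSmiles(smiles)
mol_with_atom_index(mol)`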

My suggestion is to keep the indices the same to avoid misinterpreting results. For example, I am using GNNExplainer to highlight the substructures that contribute most to a prediction (antibiotic or not, etc.). If the indices from your featurizer differ from those of the rdkit mol object, the results can't be interpreted. Thank you.
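The comparison described above (reading the atom type from the first 10 entries of each feature vector and checking it against the RDKit atom order) could be sketched roughly as follows. This is a dependency-free illustration: the slot order is assumed from the list in the question, `decode_atom_type`, `first_mismatch` and `fake_vector` are hypothetical helpers, and the vectors are made up. In practice the vectors would come from the featurizer output and the reference symbols from something like `[a.GetSymbol() for a in mol.GetAtoms()]`.

```python
# Order of the atom-type one-hot as listed in the question (assumed, not verified).
ATOM_TYPES = ["C", "N", "O", "F", "P", "S", "Cl", "Br", "I", "other"]


def decode_atom_type(feature_vector):
    """Return the symbol whose slot is set in the first 10 entries."""
    one_hot = list(feature_vector[:len(ATOM_TYPES)])
    if sum(one_hot) != 1:
        raise ValueError("expected exactly one active atom-type slot")
    return ATOM_TYPES[one_hot.index(1)]


def first_mismatch(node_features, reference_symbols):
    """Index of the first atom whose decoded type differs from the reference, or None."""
    if len(node_features) != len(reference_symbols):
        raise ValueError("atom counts differ")
    for idx, (vector, symbol) in enumerate(zip(node_features, reference_symbols)):
        # Elements outside the named set fall into the catch-all slot.
        expected = symbol if symbol in ATOM_TYPES[:-1] else "other"
        if decode_atom_type(vector) != expected:
            return idx
    return None


def fake_vector(symbol):
    """Made-up 30-long vector with only the atom-type slot filled in."""
    vec = [0] * 30
    vec[ATOM_TYPES.index(symbol)] = 1
    return vec


# Illustrative data for ethanol (C, C, O); not real featurizer output.
node_features = [fake_vector("C"), fake_vector("C"), fake_vector("O")]
print(first_mismatch(node_features, ["C", "C", "O"]))
print(first_mismatch(node_features, ["C", "O", "C"]))
```

A check like this can only show that the orders differ. Matching element sequences do not prove the indices agree, since two atoms of the same element could still be swapped.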

@rbharath
Member

@sumone-compbio Can you come by our office hours? I would be happy to discuss it with you there.

@sumone-compbio
Author

@rbharath sure, thanks for the patience. I could also mail you if you want.
