Data splitting with enumerated SMILES #9

Open
lorenzoFabbri opened this issue Aug 6, 2019 · 2 comments

@lorenzoFabbri

I'm trying to use LSTMs to predict a molecular property. I was writing my own code but then I found out that OpenChem has more or less everything I need already.

I have a question about data splitting, with the caveat that I have not gone through the entire library.
When I was using my own code, I decided to use SMILES enumeration since my dataset is rather small. In doing so, I was wondering whether to keep all the SMILES strings of the same compound in the same set (either training or validation). It seems that OpenChem does not take this into consideration and the split is done randomly, so SMILES strings of the same compound can appear in both the training set and the validation set. Is my understanding correct? If so, isn't this a form of data leakage?
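To make the concern concrete, here is a minimal sketch of a compound-level split, in which every enumerated SMILES of a compound is assigned to the same subset. It is not OpenChem code: the `group_split` function, the `(compound_id, smiles, label)` record layout, and the toy labels are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def group_split(records, val_fraction=0.2, seed=0):
    """Split (compound_id, smiles, label) records by compound.

    All SMILES strings that share a compound_id end up in the same
    subset, so no compound is seen in both training and validation.
    """
    by_compound = defaultdict(list)
    for compound_id, smiles, label in records:
        by_compound[compound_id].append((compound_id, smiles, label))

    # Shuffle compounds, not individual SMILES strings.
    compound_ids = sorted(by_compound)
    random.Random(seed).shuffle(compound_ids)

    n_val = max(1, round(val_fraction * len(compound_ids)))
    val_ids = set(compound_ids[:n_val])

    train, val = [], []
    for cid in compound_ids:
        (val if cid in val_ids else train).extend(by_compound[cid])
    return train, val


# Hypothetical toy data: several SMILES per compound, arbitrary labels.
records = [
    ("ethanol", "CCO", 0),
    ("ethanol", "OCC", 0),
    ("ethanol", "C(O)C", 0),
    ("benzene", "c1ccccc1", 1),
    ("benzene", "C1=CC=CC=C1", 1),
    ("acetic_acid", "CC(=O)O", 0),
    ("acetic_acid", "OC(C)=O", 0),
    ("propane", "CCC", 1),
    ("propane", "C(C)C", 1),
]

train, val = group_split(records, val_fraction=0.25)
train_compounds = {cid for cid, _, _ in train}
val_compounds = {cid for cid, _, _ in val}
print("shared compounds:", train_compounds & val_compounds)
```

A purely random split over the rows of `records` would offer no such guarantee, which is the leakage scenario described above.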

Thank you.

@isayev
Collaborator

isayev commented Aug 6, 2019

Dear Lorenzo:
Thanks for using OpenChem! Please let us know about your experience.

> It seems that OpenChem does not take this into consideration and the split is done randomly

It's the practitioner's choice :) Our intention is that SMILES augmentation should be applied either after the split or on the fly during training.
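A rough sketch of the "augment after the split" order might look like the following. It is an illustration under stated assumptions, not OpenChem's implementation: `split_then_augment`, `toy_enumerate`, and the `TOY_VARIANTS` lookup table are hypothetical names, and the lookup table stands in for a real enumerator (with RDKit, for example, `Chem.MolToSmiles(mol, doRandom=True)` can generate randomized SMILES).

```python
import random


def split_then_augment(compounds, enumerate_fn, val_fraction=0.2,
                       n_variants=5, seed=0):
    """Split one-row-per-compound data first, then enumerate training only.

    compounds: list of (smiles, label), one entry per unique compound.
    enumerate_fn: callable (smiles, n) -> list of equivalent SMILES.
    """
    shuffled = list(compounds)
    random.Random(seed).shuffle(shuffled)

    n_val = max(1, round(val_fraction * len(shuffled)))
    val, train_base = shuffled[:n_val], shuffled[n_val:]

    # Augmentation touches the training compounds only.
    train = []
    for smiles, label in train_base:
        variants = {smiles, *enumerate_fn(smiles, n_variants)}
        train.extend((v, label) for v in sorted(variants))
    return train, val


# Hypothetical stand-in for a real SMILES enumerator: hand-written
# alternative SMILES for a few small molecules.
TOY_VARIANTS = {
    "CCO": ["OCC", "C(O)C"],
    "CC(=O)O": ["OC(C)=O", "C(C)(=O)O"],
    "CCC": ["C(C)C"],
    "CCN": ["NCC", "C(N)C"],
    "c1ccccc1": ["C1=CC=CC=C1"],
}


def toy_enumerate(smiles, n):
    return TOY_VARIANTS.get(smiles, [])[:n]


# Arbitrary toy labels, one row per compound.
compounds = [("CCO", 0), ("CC(=O)O", 0), ("CCC", 1), ("CCN", 1),
             ("c1ccccc1", 0)]
train, val = split_then_augment(compounds, toy_enumerate,
                                val_fraction=0.2, n_variants=2)
print(len(train), "training rows,", len(val), "validation rows")
```

Because the split happens on unique compounds before any enumeration, no variant of a validation compound can reach the training set. On-the-fly augmentation achieves the same effect by drawing a fresh variant each time a training example is loaded.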

@lorenzoFabbri
Author

lorenzoFabbri commented Aug 6, 2019

Thanks for the quick response.

Taking for instance the provided examples (e.g., Tox21), if I understand correctly, the compounds in the training set are enumerated while the compounds in the validation set are not. Correct?

Have you also tried enumerating the compounds in the validation set, and perhaps averaging the predictions for each compound?
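The averaging idea could be sketched as follows, assuming the model has already produced one prediction per enumerated SMILES. The `average_by_compound` helper and the numbers are hypothetical and only illustrate the bookkeeping.

```python
from collections import defaultdict
from statistics import mean


def average_by_compound(compound_ids, predictions):
    """Average per-SMILES predictions into one prediction per compound."""
    grouped = defaultdict(list)
    for cid, pred in zip(compound_ids, predictions):
        grouped[cid].append(pred)
    return {cid: mean(preds) for cid, preds in grouped.items()}


# Made-up predictions for enumerated SMILES of two validation compounds.
ids = ["ethanol", "ethanol", "ethanol", "benzene", "benzene"]
preds = [0.2, 0.4, 0.3, 0.9, 0.7]

averaged = average_by_compound(ids, preds)
for cid, value in averaged.items():
    print(f"{cid}: {value:.2f}")
```

Validation metrics would then be computed on `averaged`, one value per compound, rather than on the individual SMILES strings.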


To be honest, I was not able to make it work with my dataset. It's extremely similar to Tox21 (CSV file with label + SMILES) but I keep getting many errors. Unfortunately I did not keep track of all of them: a recurring one was RuntimeError: cuda runtime error: device-side assert triggered at.... Also, the provided code for Tox21 does not seem to work when the batch size is 1. I'll try again tomorrow. I think we can close this issue, though.
