New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add notebooks to document modules #834
base: main
Are you sure you want to change the base?
Conversation
@KnathanM thanks, I think these will be a great addition! We should probably link to these in the documentation pages to make them more accessible. |
That would be a really good addition. New users can quickly go through this to get a better understanding of how to use each module. Since more example notebooks will be added in the future, it may become more difficult to find a file among a bunch of examples. I think adding a README.md to demonstrate the content of each folder and file would be great (the current structure looks clear to me so it may not be necessary to add at this stage). Also, making the format style similar between the new notebooks and the existing ones would be great. |
a82d17d
to
22ce21a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file was actually moved to examples/tutorials/featurizers/molecule_featurizer.ipynb
and then edited.
I have added the extra tutorials for loss functions and models. I also rebased onto main. These notebooks are ready for a review for accuracy and style. In terms of style it is an open question if we want to edit our other example notebooks to have the same style, change mine to have the same style as our current notebooks, or leave them as they are. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KnathanM I have read through all the notebooks. I truly appreciate your effort in writing them down. I left some comments and thoughts on those examples. You can address those you think are reasonable first, and we can discuss the rest of them. Of course, feel free to let me know if you need any help.
- I am thinking about the necessity of providing the super_minimal.ipynb. I can see that it would be helpful for people who want a quick start, but the example provided in this case is not comprehensive enough for a regression task (e.g., it lacks target scaling and only uses a training set). I also feel this notebook is very similar to multi_task.ipynb. I am fine with keeping both, but perhaps we could consider loading a file, splitting it into train/validation/test sets, and scaling target for this minimal example.
- IMO, if it is not too much work, maintaining consistent styles across different notebooks and the codebase is a good practice. I also feel that placing most imports in the first cell would make the notebook clearer, but I understand it might be straightforward to import some of them later.
- In data/data_splitting.ipynb, you mention that datapoints can also be used as input in make_split_indices (i.e.,
make_split_indices(datapoints)
). Is that correct? - In data/datasets.ipynb, the “Datasets with custom featurizers” section does not show any cases of using custom featurizers. Perhaps we could come up with a case that uses another package to create its own atom/bond/molecule featurizers. I suggest adding a link to those notebooks in this customized part to the other featurizer notebooks or including a case here.
- Are "Scaling" and "Computations" necessary to include in the message_passing notebook? Furthermore, do we have the computational cost for the message passing layer in the chemprop paper?
- I am thinking about to have a README page here so it is easier to find the corresponding examples. Do you have any thoughts on this?
- Relative links to other notebooks should be available in markdown cells in Jupyter notebooks. If it is not difficult to add and doesn't have any potential issues (e.g., how it works on ReadTheDocs), it might be worth adding those references when you refer to another notebook.
- How do we maintain the compatibility of every notebook with any new functions added into the main branch is important. I don't know if we can run CI on those notebooks. If there are any technical issues in achieving this, we might need to find some way to maintain those examples properly.
Minor things:
- In data/data_splitting.ipynb, the block about generating MoleculeDataset can be removed. Also, you mention that we will receive a warning from Astartes when the sizes do not match. However, the warning message, "Requested validation size of 0.25, got 0.30. Requested test size of 0.25, got 0.30," seems incorrect. It looks like the validation size was actually 0.20 in the end. Of course, this is not related to you, but if you think it is a potential bug, we should open an issue on Astartes.
- In data/datapoints.ipynb, for reaction datapoints, it is mentioned that “extra atom features, bond features, and atom descriptors are not supported.” Is this currently not supported, or will it never be supported? If it might be implemented in the future, perhaps add “not supported yet.” In the sentence of "Note that if an Agent is provided, its graph is concatenated to the reactant graph with no edges connecting them," there is a typo in "concatenated" and does “Agent” mean a reagent, solvent, or catalyst indicated by
>
? - Both in the dataloaders and datasets notebooks,
dataset[0]
is used to show theDatum
. I think you can remove the one in dataloaders and move thecollate_batch()
example to the datasets notebook. - I am thinking about whether we should provide a case where atom and bond featurizers are truly customized. Current examples only adjust the range for each feature.
- In the molgraph_featurizer notebook, I think the last two cells can be modified, where the generation of MoleculeDatapoint seems unnecessary.
- As you may have noticed, some hyperparameters are saved twice. The ones I have noticed are 'X_d_transform', 'output_transform', 'criterion', 'V_d_transform', and 'graph_transform'. This is not related to this PR. I'm just listing all the parameters I have seen.
- The generation of MoleculeDataset objects (two cells in the section of “Input dimensions”) seems unnecessary in the message_passing notebook.
- Instead of modifying the source code to support customizable activation functions, I am thinking about providing the flexibility to accept both Activation and nn.Module objects as inputs for activation functions (following the format here ), so we can use any user-defined activation function. We can discuss this and possibly open another PR.
- In the metrics notebook,
BoundedMAEMetric
is imported twice. One of them should beBoundedMSEMetric
.
Thank you Shih-Cheng for the comprehensive review. I have moved files around and done a bunch of edits so let me know if you need help knowing what all changed.
Minor things
|
Thanks for your hard work in making those changes.
It is a trade-off between conciseness and readability. I don’t have a strong preference for either, so let’s keep it as it is now.
I would suggest using a list of mol objects since we document the input as
I am fine with moving them to the documentation. Without running CI tests, I think those notebooks might be missed when code changes. This would thus leave the responsibility to us to check whether the documentation is always updated. I think it would still be great to somehow run the testing on it so we don’t need to worry about the examples being outdated in the documentation. I am fine if we only want to update once before every release (IMO, it would be additional work for us to keep tracking this). I think after this PR merge, we can leave this question into Hao-Wei’s PR #840 and see if we can and want to do the testing on these files.
Any update on whether to support extra atom/bond features in reactions? As for the description of the agent, it is clear to me. Thanks for the explanation.
Do you mean PR #832? They only fixed some of them.
I can open one. |
Some of my other comments:
|
Description
I am working on adding more documentation. It seemed easiest to document each key part of Chemprop in a separate notebook. These notebooks are ready for review. I'd like to also add notebooks that document the message passing, aggregation, ffn, and MPNN, and loss/metric modules. But I am also open to there being better ways to document the fundamentals of Chemprop.
I also added a super minimal example of using Chemprop.
Questions
I expect this PR is related to #772, so how to best join the work is an open question.