Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add notebooks to document modules #834

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

KnathanM
Copy link
Contributor

Description

I am working on adding more documentation. It seemed easiest to document each key part of Chemprop in a separate notebook. These notebooks are ready for review. I'd like to also add notebooks that document the message passing, aggregation, ffn, and MPNN, and loss/metric modules. But I am also open to there being better ways to document the fundamentals of Chemprop.

I also added a super minimal example of using Chemprop.

Questions

I expect this PR is related to #772, so how to best join the work is an open question.

@kevingreenman kevingreenman added this to the v2.0.1 milestone Apr 24, 2024
@kevingreenman
Copy link
Member

@KnathanM thanks, I think these will be a great addition! We should probably link to these in the documentation pages to make them more accessible.

@shihchengli
Copy link
Contributor

That would be a really good addition. New users can quickly go through this to get a better understanding of how to use each module. Since more example notebooks will be added in the future, it may become more difficult to find a file among a bunch of examples. I think adding a README.md to demonstrate the content of each folder and file would be great (the current structure looks clear to me so it may not be necessary to add at this stage). Also, making the format style similar between the new notebooks and the existing ones would be great.

@KnathanM KnathanM marked this pull request as draft April 25, 2024 15:18
@KnathanM KnathanM marked this pull request as ready for review May 1, 2024 15:34
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file was actually moved to examples/tutorials/featurizers/molecule_featurizer.ipynb and then edited.

@KnathanM
Copy link
Contributor Author

KnathanM commented May 1, 2024

I have added the extra tutorials for loss functions and models. I also rebased onto main. These notebooks are ready for a review for accuracy and style. In terms of style it is an open question if we want to edit our other example notebooks to have the same style, change mine to have the same style as our current notebooks, or leave them as they are.

Copy link
Contributor

@shihchengli shihchengli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KnathanM I have read through all the notebooks. I truly appreciate your effort in writing them down. I left some comments and thoughts on those examples. You can address those you think are reasonable first, and we can discuss the rest of them. Of course, feel free to let me know if you need any help.

  1. I am thinking about the necessity of providing the super_minimal.ipynb. I can see that it would be helpful for people who want a quick start, but the example provided in this case is not comprehensive enough for a regression task (e.g., it lacks target scaling and only uses a training set). I also feel this notebook is very similar to multi_task.ipynb. I am fine with keeping both, but perhaps we could consider loading a file, splitting it into train/validation/test sets, and scaling target for this minimal example.
  2. IMO, if it is not too much work, maintaining consistent styles across different notebooks and the codebase is a good practice. I also feel that placing most imports in the first cell would make the notebook clearer, but I understand it might be straightforward to import some of them later.
  3. In data/data_splitting.ipynb, you mention that datapoints can also be used as input in make_split_indices (i.e., make_split_indices(datapoints)). Is that correct?
  4. In data/datasets.ipynb, the “Datasets with custom featurizers” section does not show any cases of using custom featurizers. Perhaps we could come up with a case that uses another package to create its own atom/bond/molecule featurizers. I suggest adding a link to those notebooks in this customized part to the other featurizer notebooks or including a case here.
  5. Are "Scaling" and "Computations" necessary to include in the message_passing notebook? Furthermore, do we have the computational cost for the message passing layer in the chemprop paper?
  6. I am thinking about to have a README page here so it is easier to find the corresponding examples. Do you have any thoughts on this?
  7. Relative links to other notebooks should be available in markdown cells in Jupyter notebooks. If it is not difficult to add and doesn't have any potential issues (e.g., how it works on ReadTheDocs), it might be worth adding those references when you refer to another notebook.
  8. How do we maintain the compatibility of every notebook with any new functions added into the main branch is important. I don't know if we can run CI on those notebooks. If there are any technical issues in achieving this, we might need to find some way to maintain those examples properly.

Minor things:

  1. In data/data_splitting.ipynb, the block about generating MoleculeDataset can be removed. Also, you mention that we will receive a warning from Astartes when the sizes do not match. However, the warning message, "Requested validation size of 0.25, got 0.30. Requested test size of 0.25, got 0.30," seems incorrect. It looks like the validation size was actually 0.20 in the end. Of course, this is not related to you, but if you think it is a potential bug, we should open an issue on Astartes.
  2. In data/datapoints.ipynb, for reaction datapoints, it is mentioned that “extra atom features, bond features, and atom descriptors are not supported.” Is this currently not supported, or will it never be supported? If it might be implemented in the future, perhaps add “not supported yet.” In the sentence of "Note that if an Agent is provided, its graph is concatenated to the reactant graph with no edges connecting them," there is a typo in "concatenated" and does “Agent” mean a reagent, solvent, or catalyst indicated by >?
  3. Both in the dataloaders and datasets notebooks, dataset[0] is used to show the Datum. I think you can remove the one in dataloaders and move the collate_batch() example to the datasets notebook.
  4. I am thinking about whether we should provide a case where atom and bond featurizers are truly customized. Current examples only adjust the range for each feature.
  5. In the molgraph_featurizer notebook, I think the last two cells can be modified, where the generation of MoleculeDatapoint seems unnecessary.
  6. As you may have noticed, some hyperparameters are saved twice. The ones I have noticed are 'X_d_transform', 'output_transform', 'criterion', 'V_d_transform', and 'graph_transform'. This is not related to this PR. I'm just listing all the parameters I have seen.
  7. The generation of MoleculeDataset objects (two cells in the section of “Input dimensions”) seems unnecessary in the message_passing notebook.
  8. Instead of modifying the source code to support customizable activation functions, I am thinking about providing the flexibility to accept both Activation and nn.Module objects as inputs for activation functions (following the format here ), so we can use any user-defined activation function. We can discuss this and possibly open another PR.
  9. In the metrics notebook, BoundedMAEMetric is imported twice. One of them should be BoundedMSEMetric.

@KnathanM
Copy link
Contributor Author

KnathanM commented May 20, 2024

Thank you Shih-Cheng for the comprehensive review. I have moved files around and done a bunch of edits so let me know if you need help knowing what all changed.

  1. I agree that it is similar enough to multi_task.ipynb that we'll probably be okay without it. I removed it.

  2. I felt it was clearer to put many imports in the code blocks where the imports are used. Do you disagree? I could be wrong and am willing to collect the imports all to the first cell if others think that is clearer.

  3. It is only kind of correct. make_split_indices asks for the first argument to be mols: Sequence[Chem.Mol] because for some of the split methods rely on the molecule structures. For other split methods (random, CV) we only use mols to get the number of datapoints. Passing in the list of datapoints is sufficient for that. I was divided on how to present this in the splitting tutorial because on one hand showing the user how to make the list of mol objects will always work regardless of split type, but on the other it could be confusing for users why they need to make a list of mol objects if they are just doing a random split. What do you think?

  4. I've added a link to the molgraph notebook which has a section about customizing the featurizer.

  5. After more thought, I agree that the scaling sections aren't really necessary as there is the separate scaling notebook. Also I originally added the note about computations in case someone went to the tutorial for the equations used. I think it is reasonable that they would go straight to the API docstring for that so I removed this section.

  6. As I was writing the README's, I felt that what I wrote for this PR aren't really tutorials as much as they are module documentation. So I moved all these module tutorials to the docs and added an index.rst file. I'm not sure if the docs auto build on this PR, but I have built them locally and can show you what they look like.

  7. Added, I think in all the relevant places. Let me know if you feel I missed some.

  8. With the tutorials moved to documentation, then this falls in the same category of needing to update documentation when the Chemprop code changes. These tutorial/doc notebooks don't need to be run except when the docs are updated, so I don't plan to include them in the CI.

Minor things

  1. I originally intended to show that the output of split_data_by_indicies is a list of datapoints which needs to be turned into datasets, but that isn't necessary for data splitting so I removed it. I opened an issue on astares about the warning message.

  2. I'm not sure myself whether extra atom/bond features will be supported for reactions in the future. I'll ask in our meeting today. Thanks for the typo check. The agent can be a reagent, solvent, catalyst, or anything in between the two ">" in the reaction smiles "REACTANT>AGENT>PRODUCT". Is that unclear or is there a better way to explain that?

  3. I removed the Datum showing from dataloaders, but left the collage_batch() example in dataloaders as it is used by the dataloader.

  4. I added an example in atom and bond featurizers for how to make a custom non-Chemprop featurizer, which I called a generic featurizer.

  5. I needed the datapoint to show how the featurization works, but now I think showing that isn't necessary as the featurization is done by MolGraphCache or MolGraphCacheOnTheFly and not the user.

  6. Yeah we should fix that soon. There is an open issue for it.

  7. Yes that could be more focused, so I removed the dataset creation.

  8. That sounds like a good idea to me. Do you want to open an issue to track that? Or I can.

  9. Good catch

@shihchengli shihchengli mentioned this pull request May 22, 2024
2 tasks
@shihchengli
Copy link
Contributor

Thanks for your hard work in making those changes.

I felt it was clearer to put many imports in the code blocks where the imports are used. Do you disagree? I could be wrong and am willing to collect the imports all to the first cell if others think that is clearer.

It is a trade-off between conciseness and readability. I don’t have a strong preference for either, so let’s keep it as it is now.

It is only kind of correct. make_split_indices asks for the first argument to be mols: Sequence[Chem.Mol] because for some of the split methods rely on the molecule structures. For other split methods (random, CV) we only use mols to get the number of datapoints. Passing in the list of datapoints is sufficient for that. I was divided on how to present this in the splitting tutorial because on one hand showing the user how to make the list of mol objects will always work regardless of split type, but on the other it could be confusing for users why they need to make a list of mol objects if they are just doing a random split. What do you think?

I would suggest using a list of mol objects since we document the input as Sequence[Chem.Mol], and perhaps also consider using another split method like scaffold_balanced. If you want, you could add a comment saying that for random and CV splits, the input can actually be a list of datapoints (it can be anything as long as the number of elements equals the number of datapoints). Actually, the Sequence[Chem.Mol] could be replaced with Sequence[MoleculeDatapoint], as a list of mols can be obtained by [datapoint.mol for datapoint in datapoints]. However, this would lead to another issue about how to only use the reactant for a ReactionDatapoint, which might be the reason it is implemented the current way and modifying it to a list of datapoints would make this function not generic enough (i.e., it fully relies on the datapoints objects and people cannot just call these functions in their program). I fully agree with your thoughts but I would want the description of the code and the examples we generated to be as consistent as possible. If you still want to use datapoints in the example, I suggest directly mentioning the CV and Random are the only two cases at the end of the sentence “Otherwise you can use the datapoints” in the notebook for clarification.

With the tutorials moved to documentation, then this falls in the same category of needing to update documentation when the Chemprop code changes. These tutorial/doc notebooks don't need to be run except when the docs are updated, so I don't plan to include them in the CI.

I am fine with moving them to the documentation. Without running CI tests, I think those notebooks might be missed when code changes. This would thus leave the responsibility to us to check whether the documentation is always updated. I think it would still be great to somehow run the testing on it so we don’t need to worry about the examples being outdated in the documentation. I am fine if we only want to update once before every release (IMO, it would be additional work for us to keep tracking this). I think after this PR merge, we can leave this question into Hao-Wei’s PR #840 and see if we can and want to do the testing on these files.

I'm not sure myself whether extra atom/bond features will be supported for reactions in the future. I'll ask in our meeting today. Thanks for the typo check. The agent can be a reagent, solvent, catalyst, or anything in between the two ">" in the reaction smiles "REACTANT>AGENT>PRODUCT". Is that unclear or is there a better way to explain that?

Any update on whether to support extra atom/bond features in reactions? As for the description of the agent, it is clear to me. Thanks for the explanation.

Yeah we should fix that soon. There is an open issue for it.

Do you mean PR #832? They only fixed some of them.

That sounds like a good idea to me. Do you want to open an issue to track that? Or I can.

I can open one.

@shihchengli
Copy link
Contributor

Some of my other comments:

  • In the dataset examples, there is a bullet point before “To make a dataset, you first need a list of datapoints.” I think this sentence should end with a period. A similar issue also occurs in data splitting, atom featurizers, bond featurizers, molecule MolGraph featurizers, message passing, aggregation, predictors, and scaling examples.
  • In the data splitting notebook, the link to the Astartes documentation directs to its GitHub repo. Is it supposed to link to its ReadTheDocs page?
  • In metrics, please add links for “See the source code for computation details of the metrics” and “See the loss function notebook for more details,” if it is possible to connect to their corresponding pages.
  • The “Reaction MolGraph Featurizers” example is not finished yet. I will review it when it is ready.
  • Consider adding notebooks for multiclass classifications in the tutorial. If you find some way to reduce overlap but can show how to do that, that would be great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants