
Added a contribution file
EBjerrum committed Sep 22, 2023
1 parent 21229a8 commit 0848f6d
Showing 3 changed files with 41 additions and 10 deletions.
33 changes: 33 additions & 0 deletions CONTRIBUTION.md
@@ -0,0 +1,33 @@
# Contribution

Thanks for your interest in contributing to the project. Please read on in the sections that apply.


## Installation
Clone and install in editable mode like this:

git clone git@github.com:EBjerrum/scikit-mol.git
pip install -e .

## Adding transformers
The project's transformers subclass the `BaseEstimator` and `TransformerMixin` classes from scikit-learn. The scikit-learn developer documentation describes the necessary requirements [https://scikit-learn.org/stable/developers/develop.html](https://scikit-learn.org/stable/developers/develop.html). Most notably:
* The arguments accepted by `__init__` should all be keyword arguments with a default value.
* Every keyword argument accepted by `__init__` should correspond to an attribute on the instance.
* There should be no logic in `__init__`, not even input validation, and the parameters should not be changed.

Scikit-learn depends on this for e.g. `.get_params()`, `.set_params()`, cloning and representation rendering to work.
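A minimal sketch of a conforming transformer could look like this (the class name, arguments and dummy output are purely illustrative, not an existing scikit-mol class):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class HypotheticalFingerprintTransformer(BaseEstimator, TransformerMixin):
    """Illustrative skeleton only, not part of scikit-mol."""

    def __init__(self, nBits=2048, radius=2):
        # Only assign the keyword arguments; no logic or validation here,
        # so that .get_params(), .set_params() and cloning work as expected.
        self.nBits = nBits
        self.radius = radius

    def fit(self, X, y=None):
        # Fingerprint transformers typically have nothing to learn from the data
        return self

    def transform(self, X):
        # X would be a sequence of RDKit molecules; a real implementation
        # would call RDKit here instead of returning zeros.
        return np.zeros((len(X), self.nBits), dtype=np.int8)
```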

### Docstrings
We should ultimately consolidate on the numpy docstring format [https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard](https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard) which is also used by SciPy and other scikits.
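For illustration, a numpy-style docstring on a toy function could look like this:

```python
import numpy as np


def smiles_to_lengths(smiles_list):
    """Compute the length of each SMILES string.

    Parameters
    ----------
    smiles_list : list of str
        SMILES strings to measure.

    Returns
    -------
    numpy.ndarray of shape (n_smiles,)
        The length of each SMILES string.

    Examples
    --------
    >>> smiles_to_lengths(["CCO", "c1ccccc1"])
    array([3, 8])
    """
    return np.array([len(s) for s in smiles_list])
```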

### Testing
New transformer classes should be covered by the pytest tests in the tests directory. There may be a need for tests specific to the new transformer, but it should also be added to the general tests that check the scikit-learn requirements, such as clonability.
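For instance, a clonability check could look roughly like this (the import path and fixture are assumptions for the sketch; the real general tests are parametrized over all transformers):

```python
import pytest
from sklearn.base import clone

# Import path assumed for this sketch
from scikit_mol.fingerprints import MorganFingerprintTransformer


@pytest.fixture
def morgan_transformer():
    return MorganFingerprintTransformer(nBits=1024, radius=2)


def test_clonability(morgan_transformer):
    cloned = clone(morgan_transformer)
    # clone() rebuilds the transformer from .get_params(), so all keyword
    # arguments must round-trip unchanged
    assert cloned.get_params() == morgan_transformer.get_params()
```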

### Tips
* We have observed that some external tools pass "exotic" types such as np.int64 when doing hyperparameter tuning. It is thus necessary to cast to standard Python types before making calls to RDKit functions. This behaviour is tested in the test_parameter_types test. A combined sketch of this and the following tips is shown after the list.

* `@property` getters and setters can be used if additional logic is needed when setting the attributes from the keyword arguments, while still adhering to the scikit-learn requirements above.

* Some RDKit features use objects such as generators which may not be picklable. If such an object is instantiated once and stored on the transformer, rather than recreated at each function call for individual molecules, it should be removed and recreated by overloading the `__getstate__()` and `__setstate__()` methods. See MHFingerprintTransformer for an example.
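A rough sketch combining these tips (all names are hypothetical; see MHFingerprintTransformer for the real pickling example):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class HypotheticalTransformer(BaseEstimator, TransformerMixin):
    """Illustrative only, combining the casting, property and pickling tips."""

    def __init__(self, nBits=2048):
        self.nBits = nBits  # routed through the property setter below

    @property
    def nBits(self):
        return self._nBits

    @nBits.setter
    def nBits(self, value):
        # Hyperparameter tuners may hand us e.g. np.int64;
        # cast to a plain int before the value ever reaches RDKit
        self._nBits = int(value)

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Pretend _generator is an unpicklable object cached on the instance
        if getattr(self, "_generator", None) is None:
            self._generator = object()  # placeholder for e.g. an RDKit generator
        return np.zeros((len(X), self.nBits), dtype=np.int8)

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("_generator", None)  # drop the unpicklable object
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._generator = None  # recreated lazily on the next transform call
```

Because `.get_params()` reads the value back through the property, cloning and hyperparameter search still see a plain int.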



8 changes: 3 additions & 5 deletions README.md
@@ -59,11 +59,6 @@ Bleeding edge

pip install git+https://github.com:EBjerrum/scikit-mol.git

Developers

git clone git@github.com:EBjerrum/scikit-mol.git
pip install -e .

## Documentation

There is a collection of notebooks in the notebooks directory which demonstrates some different aspects and use cases
@@ -76,6 +71,9 @@ There is a collection of notebooks in the notebooks directory which demonstrate
* [Integrated hyperparameter tuning of Scikit-Learn estimator and Scikit-Mol transformer](https://github.com/EBjerrum/scikit-mol/blob/main/notebooks/06_hyperparameter_tuning.ipynb)
* [Using parallel execution to speed up descriptor and fingerprint calculations](https://github.com/EBjerrum/scikit-mol/blob/main/notebooks/07_parallel_transforms.ipynb)

## Contributing

There is more information about how to contribute to the project in [CONTRIBUTION.md](https://github.com/EBjerrum/scikit-mol/CONTRIBUTION.md)

## BUGS
Probably still some; please check the issues at GitHub and report them there
10 changes: 5 additions & 5 deletions notebooks/06_hyperparameter_tuning.py
@@ -64,7 +64,7 @@

# %%

mol_list_train, mol_list_test, y_train, y_test = train_test_split(data.ROMol, data.pXC50, random_state=0)
mol_list_train, mol_list_test, y_train, y_test = train_test_split(data.ROMol, data.pXC50, random_state=42)


# %% [markdown]
@@ -111,10 +111,10 @@
# %%

param_dist = {'ridge__alpha': loguniform(1e-2, 1e3),
"morgantransformer__nBits": [256,512,1024,2048,4096],
'morgantransformer__radius':[1,2,3,4],
'morgantransformer__useCounts': [True,False],
'morgantransformer__useFeatures':[True,False]}
"morganfingerprinttransformer__nBits": [256,512,1024,2048,4096],
'morganfingerprinttransformer__radius':[1,2,3,4],
'morganfingerprinttransformer__useCounts': [True,False],
'morganfingerprinttransformer__useFeatures':[True,False]}

# %% [markdown]
# The report function was taken from [this example](https://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html#sphx-glr-auto-examples-model-selection-plot-randomized-search-py) from the scikit learn documentation.
