The goal of this project is to study the presence or absence of a certain property in molecules. More specifically, we have several thousand molecules represented as graphs, and we would like to predict a binary label indicating whether each molecule has the property of interest. We followed a supervised approach and implemented both the kernel methods and the graph kernels by hand.
A more detailed explanation of our results can be found in our report, report/report.pdf.
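To give an idea of what a graph kernel computes, here is a minimal sketch of a node histogram kernel, one of the simplest graph kernels: each graph is reduced to the counts of its node labels, and two graphs are compared with a dot product of those counts. The function names and the toy atom labels are illustrative assumptions, not the code found in src/kernels/.

```python
from collections import Counter


def node_histogram(labels):
    """Count how many times each node label occurs in one graph."""
    return Counter(labels)


def node_histogram_kernel(labels_a, labels_b):
    """Dot product of the two node-label histograms."""
    hist_a = node_histogram(labels_a)
    hist_b = node_histogram(labels_b)
    # A Counter returns 0 for labels it has never seen.
    return sum(count * hist_b[label] for label, count in hist_a.items())


# Two toy "molecules", described only by their node (atom) labels.
g1 = ["C", "C", "O", "H"]
g2 = ["C", "O", "O", "N"]
k12 = node_histogram_kernel(g1, g2)  # 2*1 for "C" plus 1*2 for "O"
```

Because the kernel is an explicit dot product between count vectors, the resulting Gram matrix is symmetric and positive semi-definite, which is what kernel methods require.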
Please refer to the following sections for more information about package usage.
In the following table, we present the model which allowed us to reach the best ROC AUC score of 0.89480 on the test set. It was trained using Kernel logistic regression with a regularization parameter of
Kernel | Label enrichment | Parameters |
---|---|---|
Counting | | |
Edge histogram | | |
Node histogram | | |
Geometric walk | | |
Shortest path | | |
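For readers unfamiliar with kernel logistic regression, the sketch below fits it with plain Newton iterations on a precomputed Gram matrix. It is an illustration under stated assumptions, not necessarily the solver used in src/models/logistic_regression.py: the function names, the labels in {0, 1}, the toy linear kernel and the value of lam are all placeholders. In particular, lam is not the regularization parameter of the model described above.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def fit_kernel_logreg(K, y, lam=0.1, n_iter=25):
    """Minimize the mean log-loss of sigmoid(K @ alpha) plus (lam / 2) * alpha^T K alpha.

    K is an (n, n) positive semi-definite Gram matrix, y holds labels in {0, 1}.
    lam and n_iter are placeholder values chosen for the toy example.
    """
    n = K.shape[0]
    alpha = np.zeros(n)
    for _ in range(n_iter):
        p = sigmoid(K @ alpha)   # current predicted probabilities
        w = p * (1.0 - p)        # per-sample curvature of the log-loss
        # A Newton step d solves (W K + n * lam * I) d = (p - y) + n * lam * alpha,
        # where W = diag(w).
        lhs = w[:, None] * K + n * lam * np.eye(n)
        rhs = (p - y) + n * lam * alpha
        alpha = alpha - np.linalg.solve(lhs, rhs)
    return alpha


def predict_proba(K_new, alpha):
    """K_new[i, j] is the kernel between new point i and training point j."""
    return sigmoid(K_new @ alpha)


# Toy usage with a linear kernel on four 1-D points (not molecule data).
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
K = np.outer(x, x)
alpha = fit_kernel_logreg(K, y)
probs = predict_proba(K, alpha)  # below 0.5 for the negatives, above 0.5 for the positives
```

Only the Gram matrix enters the computation, so the same routine could be fed any of the graph kernels listed in the table, or a sum of them.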
In order to use our package and run your own experiments, we advise you to set up a virtual environment.
You will need Python 3 and the virtualenv package:
pip3 install virtualenv
Then, create your virtual environment and switch to it:
python3 -m venv venv
source venv/bin/activate (Linux)
.\venv\Scripts\Activate.ps1 (Windows PowerShell)
Finally, install all the requirements:
pip3 install -r requirements.txt (Linux)
pip3 install -r .\requirements.txt (Windows PowerShell)
Note: Tested on Linux with Python 3.10.9 and on Windows.
Below, we give a brief tree view of our package.
.
├── doc # contains a generated documentation of src/ in html
├── report # contains our complete report in pdf format
├── src # source code
| ├── engine
| | ├── __init__.py
| | ├── gridsearch.py
| | ├── hub.py
| | └── objective.py
| ├── kernels
| | ├── __init__.py
| | ├── base.py # abstract class
| | ├── count.py
| | ├── edge_histogram.py
| | ├── geometric_walk.py
| | ├── node_histogram.py
| | ├── order_walk.py
| | ├── shortest_path.py
| | └── sum.py
| ├── models
| | ├── __init__.py
| | ├── base.py # abstract class
| | ├── logistic_regression.py
| | ├── ridge_regression.py
| | └── svc.py
| ├── utils
| | ├── __init__.py
| | ├── data.py
| | ├── functions.py
| | ├── graph_ops.py
| | ├── kernel_ops.py
| | └── misc.py
| ├── __init__.py
| └── start.py
├── README.md
├── data_analysis.ipynb # simple analysis of the dataset
└── requirements.txt # contains the required Python packages to run our files
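The kernels/ folder above contains an abstract base class (base.py) and a sum kernel (sum.py). The sketch below shows one way such a design can be organized. It is purely hypothetical: the class names Kernel, SizeKernel, SharedLabelKernel and SumKernel, their methods, and the two toy kernels are assumptions made for illustration and do not mirror the actual interfaces in src/kernels/.

```python
from abc import ABC, abstractmethod


class Kernel(ABC):
    """Hypothetical base class: a kernel compares two graphs and returns a number."""

    @abstractmethod
    def __call__(self, a, b):
        ...

    def gram(self, graphs):
        """Pairwise kernel values for a list of graphs, as a list of rows."""
        return [[self(a, b) for b in graphs] for a in graphs]


class SizeKernel(Kernel):
    """Toy kernel: product of the numbers of nodes."""

    def __call__(self, a, b):
        return float(len(a) * len(b))


class SharedLabelKernel(Kernel):
    """Toy kernel: number of distinct node labels the two graphs have in common."""

    def __call__(self, a, b):
        return float(len(set(a) & set(b)))


class SumKernel(Kernel):
    """A sum of valid kernels is itself a valid kernel."""

    def __init__(self, kernels):
        self.kernels = list(kernels)

    def __call__(self, a, b):
        return sum(k(a, b) for k in self.kernels)


# Graphs are reduced to lists of node labels in this toy example.
combined = SumKernel([SizeKernel(), SharedLabelKernel()])
value = combined(["C", "O"], ["C", "C", "N"])  # 2*3 from sizes plus 1 shared label
```

Summing kernels lets several views of a molecule (label counts, walks, shortest paths) contribute to a single similarity score without changing the model code.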
The main file to use for experimenting is src/start.py. The command is as follows:
python3 src/start.py [options]
--gridsearch-subset: Select a subset of the training set to speed up cross-validation. Default: 0.2.
--kernel-name: Name of the kernel, following the package usage. Default: "sum".
--model-name: Name of the model, following the package usage. Default: "logreg".
--normalize: If "True", kernel matrices are normalized to zero mean and unit variance. Default: "True".
--center: If "True", kernel matrices are centered. Default: "False".
--trials: Number of grid search trials. Default: 0.
--folds: Number of cross-validation splits. Default: 5.
--eval-metric: Evaluation metric for the grid search. Default: "roc_auc".
--verbose: Verbosity. Default: "False".
--submission: If "True", a submission file is produced after hyperparameter optimization. Default: "True".
--results-dir: Directory where submission files are stored. Default: "results".
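As background for the --center option, the sketch below shows the standard way to center a Gram matrix, which amounts to subtracting the mean of the (implicit) feature vectors without ever computing them. The function name center_gram is an assumption for illustration, and the package may implement centering differently.

```python
import numpy as np


def center_gram(K):
    """Return H K H with H = I - (1/n) * ones, i.e. the Gram matrix of centered features."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H


# Check on a linear kernel, where the features are known explicitly.
X = np.array([[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]])
K = X @ X.T
K_centered = center_gram(K)
X_centered = X - X.mean(axis=0)
same = np.allclose(K_centered, X_centered @ X_centered.T)
```

After centering, every row and column of the Gram matrix sums to zero.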
Example: In order to reproduce our results, simply run:
python3 src/start.py
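The --folds and --eval-metric options refer to k-fold cross-validation scored with ROC AUC. The sketch below illustrates both ideas for a precomputed Gram matrix. The helper names (roc_auc, kfold_indices, split_gram) are hypothetical and are not taken from src/engine/; the point is only that, with kernels, a fold is selected by slicing rows and columns of the Gram matrix.

```python
import numpy as np


def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count for one half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))


def kfold_indices(n, n_folds=5, seed=0):
    """Yield (train, validation) index arrays for a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    for i in range(n_folds):
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train, folds[i]


def split_gram(K, train, val):
    """Train block (train x train) and validation block (val x train) of a Gram matrix."""
    return K[np.ix_(train, train)], K[np.ix_(val, train)]


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of the 4 pairs are ranked correctly
```

The validation block has one row per held-out graph and one column per training graph, which is the shape a fitted kernel model needs to score unseen data.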
Complete documentation is available in the doc/src/ folder. If it has not been generated yet, run the following commands from the root folder:
pip3 install pdoc3
python3 -m pdoc -o doc/ --html --config latex_math=True --force src/
Then, open doc/src/index.html in your browser and follow the guide!