
Aims | Panel Topics | Course Schedule | Internal Links | External Links

Welcome to Machine Learning in the Molecular Sciences



Aims

The NYU-ECNU Center for Computational Chemistry at New York University Shanghai (a.k.a. NYU Shanghai) announced a summer school dedicated to machine learning and its applications in the molecular sciences, held in June 2017 at the NYU Shanghai Pudong Campus. Using a combination of technical lectures and hands-on exercises, the school aimed both to instruct attendees in the fundamentals of modern machine learning techniques and to demonstrate how these approaches can be applied to solve complex computational problems in chemistry, biology, and materials science. To promote the idea of free and open code, this project was built to help you understand the basic machine learning models mentioned below.

Panel-Topics

Fundamental topics covered include basic machine learning models such as kernel methods and neural-network optimization schemes, parameter learning and delta learning paradigms, clustering, and decision trees. Application areas feature machine learning models for representing and predicting properties of individual molecules and condensed phases, learning algorithms for bypassing explicit quantum chemical and statistical mechanical calculations, and techniques applicable to biomolecular structure prediction, bioinformatics, protein-ligand binding, and materials and molecular design, among others.

Course-Schedule

  • Monday, June 12

    8:45 - 9:00: Welcome and Introduction

    9:00 - 10:00: Introduction to Machine Learning (presented by Matthias Rupp)

    10:00 - 10:20: Coffee Break

    10:20 - 11:20: Kernel-based Regression (presented by Matthias Rupp)

    11:20 - 12:30: Dimensionality Reduction, Feature Selection, and Clustering Techniques (presented by Alex Rodriguez)

    12:30 - 14:00: Lunch Break

    14:00 - 15:00: Introduction to Neural Networks (presented by Mark Tuckerman)

    15:00 - 15:30: Coffee Break

    15:30 - 17:30: Practical Session: Clustering with Feature Selection and Validation (presented by Alex Rodriguez)

  • Tuesday, June 13

    9:00 - 10:00: Random Forests (presented by Yingkai Zhang)

    10:00 - 10:30: Coffee Break

    10:30 - 11:30: Learning Curves, Representations, and Training Sets I (presented by Anatole von Lilienfeld)

    11:30 - 12:30: Learning Curves, Representations, and Training Sets II (presented by Anatole von Lilienfeld)

    12:30 - 14:00: Lunch Break

    14:00 - 15:00: Review of Electronic Structure, Atomic, Molecular, and Crystal Representations (presented by Mark Tuckerman)

    15:00 - 15:30: Coffee Break

    15:30 - 17:30: Practical Session: Learning Curves (presented by Anatole von Lilienfeld)

  • Wednesday, June 14

    9:00 - 10:00: Predicting Properties of Molecules and Materials (presented by Matthias Rupp)

    10:00 - 10:30: Coffee Break

    10:30 - 11:30: Parameter Learning and Delta Learning (presented by Anatole von Lilienfeld)

    11:30 - 12:30: Learning Electronic Densities (presented by Mark Tuckerman); ML Models of Crystal Properties (presented by Anatole von Lilienfeld)

    12:30 - 14:00: Lunch Break

    14:00 - 15:30: Practical Session: Machine Learning and Property Prediction I (presented by Matthias Rupp)

    15:30 - 16:00: Coffee Break

    16:00 - 17:30: Practical Session: Machine Learning and Property Prediction II (presented by Matthias Rupp)

  • Thursday, June 15

    9:00 - 10:00: Machine Learning of Potential Energy Surfaces (presented by Ming Chen, California Institute of Technology)

    10:00 - 10:30: Coffee Break

    10:30 - 11:30: Machine Learning Based Enhanced Sampling (presented by Ming Chen)

    11:30 - 12:30: Machine Learning of Free Energy Surfaces (presented by Mark Tuckerman)

    12:30 - 14:00: Lunch Break

    14:00 - 15:00: Cluster-based Analysis of Molecular Simulations (presented by Alex Rodriguez)

    15:00 - 15:30: Coffee Break

    15:30 - 17:30: Practical Session: Neural Network Learning of Free Energy Surface (presented by Mark Tuckerman)

  • Friday, June 16

    9:00 - 10:00: Development of Protein-ligand Scoring Functions (presented by Yingkai Zhang)

    10:00 - 10:30: Coffee Break

    10:30 - 11:30: Machine Learning in Structural Biology I (presented by Yang Zhang)

    11:30 - 12:30: Machine Learning in Structural Biology II (presented by Yang Zhang)

    12:30 - 14:00: Lunch Break

    14:00 - 15:30: Practical Session: Random Forests and Scoring Functions (presented by Yingkai Zhang)

    15:30 - 16:00: Coffee Break

    16:00 - 17:30: Practical Session: Machine Learning for Structural Bioinformatics (presented by Yang Zhang)

Codes

  • Tuesday-June-13

    For Practical Session: Learning Curves, please run these commands in JupyterLab via Huawei Cloud:

        !pip install qml                                      # install the QML package
        !git clone https://github.com/qmlcode/tutorial.git    # fetch the exercise files
        ls
        cd tutorial
        ls
        %load exercise_2_1.py    # %load shows the exercise source in the cell
        %run exercise_2_1.py     # %run executes it
        %load exercise_2_2.py
        %run exercise_2_2.py
        %load exercise_2_3.py
        %run exercise_2_3.py
        %load exercise_2_4.py
        %run exercise_2_4.py
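
    The exercises fit kernel ridge regression models on progressively larger training sets and track how the prediction error decays with training-set size N. As a rough illustration of that workflow, here is a minimal sketch using scikit-learn and synthetic data (a stand-in, not the qml-based tutorial code itself):

        import numpy as np
        from sklearn.kernel_ridge import KernelRidge
        from sklearn.metrics import mean_absolute_error

        rng = np.random.default_rng(0)
        X = rng.normal(size=(2000, 10))    # stand-in for molecular representations
        y = np.sin(X).sum(axis=1)          # stand-in for a target property

        X_train, y_train = X[:1000], y[:1000]
        X_test, y_test = X[1000:], y[1000:]

        # On a log-log plot, the mean absolute error of a well-behaved model
        # falls roughly linearly with training-set size N -- the behaviour
        # the learning-curve exercises are designed to expose.
        for n in [25, 50, 100, 200, 400, 800]:
            model = KernelRidge(kernel="laplacian", alpha=1e-8, gamma=0.05)
            model.fit(X_train[:n], y_train[:n])
            mae = mean_absolute_error(y_test, model.predict(X_test))
            print(f"N = {n:4d}   MAE = {mae:.4f}")
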
  • Wednesday-June-14

    For Practical Session: Machine Learning and Property Prediction, please run these commands on Wolfram Cloud:

        (* Please adjust the following path to where you unpacked the reference implementation code from the supplementary material. *)
        AppendTo[$Path, FileNameJoin[{"Path", "to", "library"}]];  (* parent directory containing the QMMLPack directory *)
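
    Both property-prediction practicals build on kernel ridge regression. The few lines of numpy below summarize the generic method (a sketch of standard KRR, not QMMLPack's own API): the coefficients solve (K + lambda*I) alpha = y, and a new input is predicted as a kernel-weighted sum over the training set.

        import numpy as np

        def gaussian_kernel(A, B, sigma):
            # K[i, j] = exp(-|A_i - B_j|^2 / (2 sigma^2))
            d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2.0 * sigma ** 2))

        def krr_fit(X, y, sigma=1.0, lam=1e-8):
            # Solve (K + lam*I) alpha = y for the regression coefficients.
            K = gaussian_kernel(X, X, sigma)
            return np.linalg.solve(K + lam * np.eye(len(X)), y)

        def krr_predict(X_train, alpha, X_new, sigma=1.0):
            # f(x) = sum_i alpha_i k(x, x_i)
            return gaussian_kernel(X_new, X_train, sigma) @ alpha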
    
  • Thursday-June-15

    For Practical Session: Machine Learning of Free Energy Surfaces, please run these commands on a Linux system (a C++ compiler and the MKL library are needed to compile the code):

    1. Unpack the tar file:

        tar -xzvf Neural_network_practical_software.tar.gz
    

    2. Change to the directory created by unpacking and compile the source code. First, edit 'Makefile' and set the C and C++ compilers to the ones available on your system, e.g., 'gcc' and 'g++', or 'icc' if necessary. Then compile the code by typing

        make
    

    3. Create a training data set from the full dataset. Either of the following commands can be used:

        head -n N ala-dip-data_all.txt > ala-dip-data.txt
        tail -n N ala-dip-data_all.txt > ala-dip-data.txt
    

    Here N is the number of training points you wish to extract from the full dataset.

    4. Edit the 2nd, 3rd, 4th, and 5th lines of the file "INPUT.txt" if you want to change the calculation type, the number of conjugate gradient steps, the checkpointing frequency of the weights, and the number of conjugate gradient line-minimization steps. For the calculation type, '1' computes the neural network parameters starting from scratch, '-1' starts from an old parameter set contained in the file "weight.txt", and '0' performs a validation calculation of the neural network. A Python sketch of the underlying idea follows.
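
    Conceptually, the practical trains a feed-forward network to reproduce free-energy values as a function of the alanine dipeptide backbone dihedral angles. The sketch below illustrates that idea, with synthetic data and scikit-learn's MLPRegressor standing in for the course's compiled C++ code:

        import numpy as np
        from sklearn.neural_network import MLPRegressor

        rng = np.random.default_rng(1)
        phi, psi = rng.uniform(-np.pi, np.pi, (2, 4000))   # stand-in dihedral angles
        F = np.cos(phi) + 0.5 * np.cos(2.0 * psi)          # stand-in free energy

        # Encode each periodic angle as a (cos, sin) pair so the network does
        # not see an artificial discontinuity at +/- pi.
        X = np.column_stack([np.cos(phi), np.sin(phi), np.cos(psi), np.sin(psi)])

        net = MLPRegressor(hidden_layer_sizes=(40, 40), max_iter=2000, random_state=0)
        net.fit(X[:3000], F[:3000])        # train on the first 3000 points

        # Validate on the held-out remainder, as calculation type '0' does
        # in the practical's INPUT.txt.
        rmse = np.sqrt(np.mean((net.predict(X[3000:]) - F[3000:]) ** 2))
        print("validation RMSE:", rmse)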

Deployment

Machine Learning in Molecular Sciences

Internal-Links

  • Annual Conference on Neural Information Processing Systems (NIPS)

  • International Conference on Machine Learning (ICML)

  • Conference on Learning Theory (COLT)

External-Links

One of the exciting aspects of machine-learning (ML) techniques is their potential to democratize molecular and materials modelling through relatively inexpensive computation and a low barrier to entry. (Much as Pople's Gaussian software made quantum chemistry calculations broadly approachable.)

The success of machine-learning technology relies on three contributing factors: open data, open software and open education.

Open data:

Publicly accessible structure and property databases for molecules and solid materials.
Computed structures and properties:

AFLOWLIB (Structure and property repository from high-throughput ab initio calculations of inorganic materials)

Computational Materials Repository (Infrastructure to enable collection, storage, retrieval and analysis of data from electronic-structure codes)

GDB (Databases of hypothetical small organic molecules)

Harvard Clean Energy Project (Computed properties of candidate organic solar absorber materials)

Materials Project (Computed properties of known and hypothetical materials carried out using a standard calculation scheme)

NOMAD (Input and output files from calculations using a wide variety of electronic-structure codes)

Open Quantum Materials Database (Computed properties of mostly hypothetical structures carried out using a standard calculation scheme)

NREL Materials Database (Computed properties of materials for renewable-energy applications)

TEDesignLab (Experimental and computed properties to aid the design of new thermoelectric materials)

ZINC (Commercially available organic molecules in 2D and 3D formats)

Experimental structures and properties:

ChEMBL (Bioactive molecules with drug-like properties)

ChemSpider (Royal Society of Chemistry’s structure database, featuring calculated and experimental properties from a range of sources)

Citrination (Computed and experimental properties of materials)

Crystallography Open Database (Structures of organic, inorganic, metal–organic compounds and minerals)

CSD (Repository for small-molecule organic and metal–organic crystal structures)

ICSD (Inorganic Crystal Structure Database)

MatNavi (Multiple databases targeting properties such as superconductivity and thermal conductance)

MatWeb (Datasheets for various engineering materials, including thermoplastics, semiconductors and fibres)

NIST Chemistry WebBook (High-accuracy gas-phase thermochemistry and spectroscopic data)

NIST Materials Data Repository (Repository to upload materials data associated with specific publications)

PubChem (Biological activities of small molecules)

Open software:

Publicly accessible learning resources and tools related to machine learning.
General-purpose machine-learning frameworks:

Caret (Package for machine learning in R)

Deeplearning4j (Distributed deep learning for Java)

H2O.ai (Machine-learning platform written in Java that can be imported as a Python or R library)

Keras (High-level neural-network API written in Python)

Mlpack (Scalable machine-learning library written in C++)

Scikit-learn (Machine-learning and data-mining member of the scikit family of toolboxes built around the SciPy Python library)

Weka (Collection of machine-learning algorithms and tasks written in Java)

Machine-learning tools for molecules and materials:

Amp (Package to facilitate machine learning for atomistic calculations)

ANI (Neural-network potentials for organic molecules with Python interface)

COMBO (Python library with emphasis on scalability and efficiency)

DeepChem (Python library for deep learning of chemical systems)

GAP (Gaussian approximation potentials)

MatMiner (Python library for assisting machine learning in materials science)

NOMAD (Collection of tools to explore correlations in materials datasets)

PROPhet (Code to integrate machine-learning techniques with quantum-chemistry approaches)

TensorMol (Neural-network chemistry package)

Open education:

  • fast.ai is a course that is “making neural nets uncool again” by making them accessible to a wider community of researchers. One of the advantages of this course is that users start to build working machine-learning models almost immediately. However, it is not for absolute beginners, requiring a working knowledge of computer programming and high-school-level mathematics.

  • DataCamp offers an excellent introduction to coding for data-driven science and covers many practical analysis tools relevant to chemical datasets. This course features interactive environments for developing and testing code and is suitable for non-coders because it teaches Python alongside machine learning.

  • Academic MOOCs are useful courses for those wishing to get more involved with the theory and principles of artificial intelligence and machine learning, as well as the practice. The Stanford MOOC is popular, with excellent alternatives available from sources such as edX (see, for example, 'Learning from Data (introductory machine learning)') and Udemy (search for 'Machine Learning A–Z'). The underlying mathematics is the topic of a course from Imperial College London on Coursera.

  • Many machine-learning professionals run informative blogs and podcasts that deal with specific aspects of machine-learning practice. These are useful resources for general interest as well as for broadening and deepening knowledge. There are too many to provide an exhaustive list here, but we recommend machinelearningmastery and lineardigressions as a starting point.
