- A brief history of machine learning
- Main challenges
- Why predictions fail?
- Machine learning project checklist
- Resources
Machine Learning is the study of computer algorithms that improve automatically through experience. --Tom Mitchell
Adapted from section 1.2 of the book "Deep Learning with R".
- Probabilistic modelling was one of the earliest forms of machine learning, where the principles of statistics were applied to data analysis.
- The Naive Bayes algorithm is one of the best known methods for carrying out probabilistic modelling.
- Logistic regression is an algorithm with roots that date back to the early 19th century.
- The core ideas of neural networks were investigated back in the 1950s.
- The backpropagation algorithm was first introduced in the 1970s.
- Kernel methods, especially Support Vector Machines (SVMs), became popular in the 1990s.
- Decision trees date back to the 1960s.
- The first algorithm for random decision trees was created in 1995 and an extension of the algorithm, called Random Forests, was created in 2001.
- Gradient boosting, which originated in 1997, is a machine-learning technique based on ensembling weak prediction models, generally decision trees.
- In 2012, the SuperVision team led by Alex Krizhevsky and advised by Geoffrey Hinton was able to achieve a top-five accuracy of 83.6% on the ImageNet challenge.
- Most machine learning algorithms require a lot of data to work properly; if your sample is too small, sampling noise will have a larger effect.
- In addition, training data needs to be representative of the cases the model must generalise to, e.g. by sampling adequately from the different labels or ranges of values. Non-representative data can result from a flawed sampling method, although sometimes it is simply difficult to collect the necessary data.
- Poor-quality data makes it more difficult to detect underlying trends, and a lot of time is often required to "clean up" the data, e.g. dealing with missing values and removing or fixing outliers.
- Models are only capable of learning if there are enough relevant features. A critical part of machine learning is feature engineering, which typically involves feature selection (selecting the most useful features) and feature extraction (combining existing features to produce more useful ones).
- Avoiding overfitting, which is when the model performs well on the training data but does not generalise to new data; this usually occurs when the model is too complex relative to the amount and noisiness of the training data. Constraining a model to make it simpler and reduce the risk of overfitting is called regularisation. The amount of regularisation to apply during learning is controlled by a hyperparameter, which is a parameter of the learning algorithm rather than of the model: it is set before training and stays constant while the model is fitted.
- Avoiding underfitting, which is when the model is too simple to learn the underlying structure of the data.
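
To make the link between regularisation and its hyperparameter more concrete, here is a minimal sketch in plain Python. It assumes a one-feature linear model through the origin with an L2 (ridge) penalty; the function name `fit_ridge_1d`, the penalty strength `lam`, and the toy data are illustrative assumptions, not something taken from the sources above.

```python
def fit_ridge_1d(xs, ys, lam):
    """Fit y ~ w * x by minimising sum((y - w*x)**2) + lam * w**2.

    `lam` is the regularisation hyperparameter: it is chosen before
    fitting and is not learned from the data.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Closed-form minimiser of the penalised squared error.
    return sxy / (sxx + lam)


# Hypothetical toy data that roughly follows y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for lam in (0.0, 1.0, 10.0, 100.0):
    w = fit_ridge_1d(xs, ys, lam)
    print(f"lam={lam:6.1f}  w={w:.4f}")
```

As `lam` grows, the fitted weight shrinks towards zero: a small value leaves the model free to follow the training data closely (risking overfitting when the data are noisy), while a very large value constrains it so much that it underfits.
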
Reasons machine learning models fail to make correct predictions despite having enough data:
- Inadequate pre-processing of data
- Inadequate model validation (the procedure where a trained model is assessed with testing data)
- Use of an inappropriate model
- Unjustified extrapolation (making predictions on new data that is characteristically different from the training data)
- Overfitting the model to existing data
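
Two of these failure modes, inadequate validation and unjustified extrapolation, can be illustrated with a small sketch in plain Python. It assumes a hypothetical quadratic ground truth (`truth`), fits a straight line by ordinary least squares, and then compares the error on held-out points inside the training range with the error on points far outside it. All names and data here are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on the given points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def truth(x):
    """Hypothetical ground truth, unknown to the model."""
    return x * x


train_x = [0.0, 0.5, 1.0, 1.5, 2.0]
test_x = [0.25, 0.75, 1.25, 1.75]  # held out, same range as the training data
far_x = [8.0, 9.0, 10.0]           # characteristically different from the training data

slope, intercept = fit_line(train_x, [truth(x) for x in train_x])

in_range_error = mse(test_x, [truth(x) for x in test_x], slope, intercept)
far_error = mse(far_x, [truth(x) for x in far_x], slope, intercept)

print(f"held-out error inside the training range: {in_range_error:.4f}")
print(f"error when extrapolating:                 {far_error:.1f}")
```

The line looks acceptable when validated on held-out points from the same range, yet its error is far larger on the distant points. Validation on data resembling the training set therefore says little about how a model behaves on inputs unlike anything it was trained on.
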
A useful checklist for machine learning projects from the book "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" (3rd edition).
- Frame the problem and look at the big picture.
- Get the data.
- Explore the data to gain insights.
- Prepare the data to better expose the underlying data patterns to machine learning algorithms.
- Explore many different models and shortlist the best ones.
- Fine-tune your models and combine them into a great solution.
- Present your solution.
- Launch, monitor, and maintain your system.
Adapt this checklist to your project needs!
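
As one possible illustration of the "explore many different models and shortlist the best ones" step, the sketch below compares two candidate models with k-fold cross-validation in plain Python. The candidates (`fit_mean`, `fit_line`), the helper names, and the toy data are assumptions made for this example; a real project would more likely rely on an established library and would shuffle the data before splitting.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Least squares straight line through the training points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def cross_val_mse(fit, xs, ys, k):
    """Average held-out mean squared error over k folds."""
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [(ys[i] - model(xs[i])) ** 2 for i in test_idx]
        fold_scores.append(sum(errors) / len(errors))
    return sum(fold_scores) / len(fold_scores)


# Hypothetical toy data that roughly follows y = 2 * x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.9, 15.2, 16.8, 19.1]

candidates = {"mean baseline": fit_mean, "straight line": fit_line}
scores = {name: cross_val_mse(fit, xs, ys, k=5) for name, fit in candidates.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: cross-validated MSE = {score:.3f}")
print("shortlisted:", best)
```

Scoring every candidate on data it was not fitted to gives a fairer basis for shortlisting than training error alone, and including a trivial baseline shows whether a more complex model is actually learning something.
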
A list of useful resources for learning about machine learning with an emphasis on biological applications.
- Some Things Every Biologist Should Know About Machine Learning by Robert Gentleman
- How to Perform a Logistic Regression in R
- A gentle introduction to decision trees using R
- A gentle introduction to random forests using R
- Random Forest Regression and Classification in R and Python
- Fitting a Neural Network in R using the neuralnet package
- A Tour of The Top 10 Algorithms for Machine Learning Newbies
- Comparing supervised learning algorithms
- How to get better at data science
- What is Bayesian statistics by Sean Eddy
- What is a support vector machine? by William Noble
- What is a hidden Markov model? by Sean Eddy
- What are artificial neural networks by Anders Krogh
- Deep learning for computational biology by Angermueller et al.
- Machine learning applications in genetics and genomics by Maxwell Libbrecht and William Noble
- Conditional variable importance for random forests by Strobl et al.
- Ten quick tips for machine learning in computational biology by Davide Chicco
- The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
- An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
- Introduction to Machine Learning by Ethem Alpaydin
- Deep Learning with R by François Chollet with J. J. Allaire
- Medical Data for Machine Learning by Andrew Beam
- UCI Machine Learning Repository
- Kaggle Datasets
- A collection of microarray data by John Ramey
- Tom Mitchell's home page
- My notes on Random Forests
- No Free Lunch Theorems