A brief history of machine learning

Machine Learning is the study of computer algorithms that improve automatically through experience. --Tom Mitchell

Adapted from section 1.2 of the book "Deep Learning with R".

Probabilistic modelling was one of the earliest forms of machine learning, where the principles of statistics were applied to data analysis.
The Naive Bayes algorithm is one of the best known methods for carrying out probabilistic modelling.
Logistic regression is an algorithm with roots that date back to the early 19th century.
The core ideas of neural networks were investigated back in the 1950s.
The backpropagation algorithm was first introduced in the 1970s.
Kernel methods, especially Support Vector Machines (SVMs), became popular in the 1990s.
Decision trees date back to the 1960s.
The first algorithm for random decision trees was created in 1995 and an extension of the algorithm, called Random Forests, was created in 2001.
Gradient boosting is a machine-learning technique based on ensembling weak prediction models, generally decision trees, which originated in 1997.
In 2012, the SuperVision team led by Alex Krizhevsky and advised by Geoffrey Hinton was able to achieve a top-five accuracy of 83.6% on the ImageNet challenge.

Main challenges

Most machine learning algorithms require a lot of data to work properly; if your sample is too small, sampling noise will have a larger effect.
In addition, training data needs to be representative, i.e. it should sample evenly across the different labels or ranges of values, for a model to generalise. Non-representative data can result from a flawed sampling method (but sometimes it is just difficult to collect the necessary data).
Poor-quality data makes it more difficult to detect underlying trends, and a lot of time is required to "clean up" the data, e.g. dealing with missing values and removing or fixing outliers.
Models are only capable of learning if there are enough relevant features. A critical part of machine learning is feature engineering, which typically involves feature selection (selecting the most useful features) and feature extraction (combining existing features to produce more useful ones).
Avoiding overfitting, which is when the model performs well on the training data but does not generalise; this usually occurs when the model is too complex relative to the amount and noisiness of the training data. Constraining a model to make it simpler and reduce the risk of overfitting is called regularisation. The amount of regularisation to apply during learning can be controlled by a hyperparameter, which is a parameter of the learning algorithm rather than of the model: it is set before training and is not learned from the data.
Avoiding underfitting, which is when the model is too simple to learn the underlying structure of the data.
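
To make the regularisation point above concrete, here is a minimal Python sketch. It is not taken from the book or from this repository: the function name `fit_ridge_1d`, the toy data, and the `alpha` values are all assumptions chosen for illustration. It fits a one-weight linear model with an L2 (ridge) penalty. `alpha` is the hyperparameter (chosen before training), whereas the weight `w` is the model parameter (learned from the data). A one-weight model is far too small to overfit in any interesting way; the sketch only shows where the hyperparameter sits and how it constrains the fit.

```python
def fit_ridge_1d(xs, ys, alpha):
    """Fit y ~ w * x by minimising sum((y - w*x)**2) + alpha * w**2.

    Setting the derivative with respect to w to zero gives the closed form
    w = sum(x*y) / (sum(x*x) + alpha).
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


# Toy data (made up for illustration), roughly y = 2x plus noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# alpha is set by us before fitting; w is what the fit learns.
weights = {alpha: fit_ridge_1d(xs, ys, alpha) for alpha in (0.0, 10.0, 100.0)}

for alpha, w in weights.items():
    print(f"alpha={alpha:6.1f}  w={w:.4f}")
```

Larger values of `alpha` shrink `w` towards zero, i.e. they give a more constrained model. In practice `alpha` would be chosen by comparing error on held-out validation data rather than on the training data.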

Why do predictions fail?

Reasons machine learning models fail to make correct predictions despite having enough data:

Inadequate pre-processing of data
Inadequate model validation (the procedure where a trained model is assessed with testing data)
Use of an inappropriate model
Unjustified extrapolation (making predictions on new data that is characteristically different from the training data)
Overfitting the model to existing data
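
The validation and overfitting items above can be illustrated with a small hold-out experiment. The following Python sketch is an assumption-laden toy, not code from this repository: the helper names `knn_predict` and `mse`, the synthetic data, and the even/odd split are all invented for illustration. It fits a k-nearest-neighbours regressor and reports error on the training data and on held-out test data for several values of k.

```python
def knn_predict(train, query_x, k):
    """Predict y at query_x as the mean y of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - query_x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mse(train, data, k):
    """Mean squared error of k-NN (fitted on `train`) evaluated on `data`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)


# Deterministic toy data (made up): y = 2x plus a small repeating "noise" term.
points = [(i / 10, 2 * (i / 10) + ((i * 7) % 5 - 2) * 0.3) for i in range(40)]

# Hold-out split: even-indexed points for training, odd-indexed for testing.
train, test = points[0::2], points[1::2]

for k in (1, 5, len(train)):
    print(f"k={k:2d}  train MSE={mse(train, train, k):.3f}  "
          f"test MSE={mse(train, test, k):.3f}")
```

With k = 1 the model memorises the training set, so its training error is exactly zero while its test error is not: judging a model on the data it was fitted to hides overfitting. With k equal to the training set size the model always predicts the training mean, which is too simple and underfits. Only the held-out error gives a usable estimate of how the model behaves on new data.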

Machine learning project checklist

A useful checklist for machine learning projects, from Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd edition).

Frame the problem and look at the big picture.
Get the data.
Explore the data to gain insights.
Prepare the data to better expose the underlying data patterns to machine learning algorithms.
Explore many different models and shortlist the best ones.
Fine-tune your models and combine them into a great solution.
Present your solution.
Launch, monitor, and maintain your system.

Adapt this checklist to your project needs!

Resources

A list of useful resources for learning about machine learning with an emphasis on biological applications.

Presentations

Some Things Every Biologist Should Know About Machine Learning by Robert Gentleman

Tutorials

Online content

Papers

What is Bayesian statistics? by Sean Eddy
What is a support vector machine? by William Noble
What is a hidden Markov model? by Sean Eddy
What are artificial neural networks? by Anders Krogh
Deep learning for computational biology by Angermueller et al.
Machine learning applications in genetics and genomics by Maxwell Libbrecht and William Noble
Conditional variable importance for random forests by Strobl et al.
Ten quick tips for machine learning in computational biology by Davide Chicco

Books

The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
Introduction to Machine Learning by Ethem Alpaydin
Deep Learning with R by François Chollet with J. J. Allaire

Datasets

Medical Data for Machine Learning by Andrew Beam
UCI Machine Learning Repository
Kaggle Datasets
A collection of microarray data by John Ramey


davetang/machine_learning
