Machine learning and Databases at CAUP/IA in 2019

We have started!

Course overview

This course is an advanced course at CAUP during March and April 2019. Lectures take place on Mondays at 14:00 and practical classes on Thursdays at 10:00. Both last two hours, with a short break.

The aim of this course is to give you a good practical grasp of machine learning. I will not spend much time on algorithmic details; instead I will focus on how to use these methods in Python and discuss which methods are useful for which type of scientific question or research goal.

March 4 - Managing data and simple regression
  • Covering git and SQL
  • Introducing machine learning through regression techniques.
March 11 - Visualisation and inference methods
  • Visualisation of data: dos and don'ts
  • Classical inference
  • Bayesian inference
  • MCMC
March 18 - Density estimation and model choice
  • Estimating densities, parametric & non-parametric
  • Bias-variance trade-off
  • Cross-validation
  • Classification
March 25 - Dimensional reduction
  • Standardising data.
  • Principal Component Analysis
  • Manifold learning
April 8 - Ensemble methods, neural networks, deep learning
  • Local regression methods
  • Random forests and other ensemble methods
  • Neural networks & deep learning

Literature for the course

I expect that you have read through these two documents:

  • A couple of Python & Topcat pointers. This is a very basic document and might not contain much that is new to you. It does include a couple of tasks to try out; the solutions can be found in the [ProblemSets/0 - Pyton and Topcat](ProblemSets/0 - Pyton and Topcat) directory.

  • A reminder/intro to relevant math contains a summary of some basic facts from linear algebra and probability theory that are useful for this course.

Below you can find some books of use. The links from the titles get you to the Amazon page. If there are free versions of the books legally available online, I include a link as well.

-"Elements of Statistical Learning - Hastie et al, is a more advanced version of the Introduction to Statistical Learning with much the same authors. This is also freely available on the web.

Making a copy of the repository that you can edit

In this case you will want to fork the repository rather than just clone it. You can follow the instructions below (credit to Alexander Mechev for this) to create a fork of the repository:
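A minimal sketch of the fork workflow (YOURNAME is a placeholder for your own GitHub username; the fork itself is created through the GitHub web interface):

```shell
# After clicking "Fork" on the GitHub page, clone your fork:
git clone https://github.com/YOURNAME/MLD2019.git
cd MLD2019

# Track the course repository as "upstream" so you can pull in updates:
git remote add upstream https://github.com/jbrinchmann/MLD2019.git

# Later, fetch and merge new course material into your fork:
git fetch upstream
git merge upstream/master
```

This keeps your own edits in your fork while still letting you pull new material from the course repository.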

Software you need for the course

The course will make use of python throughout, so you need a recent version of python installed. I use python 3 by default but will try to make all scripts compatible with both python 2 and python 3. I recommend having at least these libraries installed:

  • numpy - for numerical calculations
  • astropy - because we are astronomers
  • scipy - because we are scientists
  • scikit-learn - machine learning library (imported as sklearn)
  • matplotlib - plotting (you can use alternatives of course)
  • pandas - nice handling of data
  • seaborn - nice plots

(the last two are really "nice to have" but if you can install the others then these are easy).
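As a quick sanity check, a small snippet along these lines (just a sketch) reports which of the packages above are importable in your python installation:

```python
import importlib

# Packages used in the course; the last two are "nice to have".
PACKAGES = ["numpy", "astropy", "scipy", "sklearn",
            "matplotlib", "pandas", "seaborn"]

def check_packages(names=PACKAGES):
    """Return a dict mapping each package name to its version string,
    "unknown" if it imports but reports no version, or None if missing."""
    found = {}
    for name in names:
        try:
            mod = importlib.import_module(name)
            found[name] = getattr(mod, "__version__", "unknown")
        except ImportError:
            found[name] = None
    return found

for name, version in check_packages().items():
    print(f"{name}: {version if version else 'NOT INSTALLED'}")
```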

You should also get astroML which has a nice web page at XX and a git repository at https://github.com/astroML/astroML

It turns out that the astroML distribution that is often picked up when you install it using a package manager (maybe also pip?) is outdated and does not work with new versions of sklearn. To check whether you have a problem, try:

from astroML.datasets import fetch_sdss_sspp

If this crashes with a complaint about a module GMM, you have the old version. To fix this the best way is probably to check out the git version of astroML linked above using e.g.:

git clone https://github.com/astroML/astroML.git

To use astroML in Anaconda you need to get it from the astropy channel. For a one-off you can do:

conda install -c astropy astroML

If you want to add the astropy channel permanently (which probably is a good idea), you can do:

conda config --add channels astropy
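To automate the version check described above, something like this sketch reports the result either way (fetch_sdss_sspp is the same import used in the check earlier):

```python
def check_astroml():
    """Report whether astroML imports cleanly against the installed sklearn."""
    try:
        # This import fails on the outdated astroML release, typically
        # complaining about sklearn's removed GMM module.
        from astroML.datasets import fetch_sdss_sspp  # noqa: F401
        return "astroML OK"
    except ImportError as err:
        return f"astroML problem: {err}"

print(check_astroml())
```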

Lecture 1 - links and information

The slides are available in the Lectures directory. You can find some files for creating tables in the ProblemSets/MakeTables directory.

Lecture 2-4

The slides are available in the Lectures directory.

Getting ready for deep learning in python

In the final problem class we will look at using deep learning in python. There are quite a few libraries for this around, but we will use the most common one, TensorFlow, together with the keras python package as our interface to it. Keras is a high-level interface and can also use other backends (Theano and CNTK) in addition to TensorFlow.

There are many pages that detail the installation of these packages and what you need for them. A good one with a bias towards Windows is this one. I will give a very brief summary here of how I set things up. This is not optimised for Graphical Processing Unit (GPU) work so for serious future work you will need to adjust this.

Create an environment in anaconda

I am going to assume you use anaconda for your python environment. If not, you will need to adapt this section a bit - for instance, use virtualenv instead of a conda environment. It is definitely better to keep your TensorFlow/keras setup separate from your default python working environment. Most of the packages are installed with pip rather than conda, so what I use is

conda create -n tensorflow pip python=3.6

This creates an environment called tensorflow which uses python 3.6 and pip for installation. To use this we need to activate it first:

conda activate tensorflow

(assuming you use bash - I do not, so I need to do some more tricks. Use bash). Your prompt should now change to include (tensorflow).

Install tensorflow and keras

I went for the simplest approach here:

pip install --upgrade tensorflow

This takes a while - the package is fairly large (71.6 MB in my installation) and it pulls in a fair number of additional packages.

pip install keras

This is quicker.

pip install ipython

because that is not installed by default (you can skip this if you prefer not to use ipython).

pip install jupyter

because my example is a jupyter notebook.

You will also want to install some other packages you are likely to need:

pip install matplotlib

pip install astropy

pip install pandas

pip install scikit-learn

pip install seaborn

and you might have others that you want to use, but this should set you up fairly well for deep learning.
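Once the installs finish, a short check like this (a sketch; it only verifies the import, not GPU support) confirms TensorFlow is visible to python inside the new environment:

```python
def check_tf():
    """Try to import TensorFlow and report its version, or note it is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return "tensorflow not installed - activate the environment and use pip"
    # Only report the version here, so the check works on both TF 1.x and 2.x.
    return f"tensorflow {tf.__version__} imports fine"

print(check_tf())
```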
