
Fundamental Machine Learning Algorithms II

... continuing from last year

We are continuing last year's highly popular project: the aim is to improve our implementations of fundamental ML algorithms.

The usual suspects

Update: This project is likely to be in high demand. As Shogun contains many algorithms, we might take two students. They would work on separate projects -- each student benchmarking and tuning their own distinct set of algorithms or framework parts. Note that we can also imagine one of you working together with the supervised learning pipeline project.

Collaboration and splitting require some preparation: we need a well-documented list of all the algorithms that need work. So if you're interested in this project, it would be a good idea to start benchmarking some of our algorithms, as sketched below. Eventually, rather than doing this with ad-hoc Python scripts, we would like to use an existing benchmarking system. This could be your first pull request :)
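
As a starting point, such a comparison can be a very short Python script. The sketch below, which assumes Shogun's Python bindings are importable as `modshogun` (the module name and class interfaces may differ between Shogun versions), times Shogun's KMeans against scikit-learn's on synthetic data; it only illustrates the kind of measurement, it is not a definitive benchmark.

```python
# Rough timing sketch: Shogun KMeans vs. scikit-learn KMeans on synthetic data.
# Assumes the Shogun Python bindings are importable as `modshogun`; the module
# name and interfaces may differ depending on your Shogun version.
import time
import numpy as np
from sklearn.cluster import KMeans as SKKMeans
from modshogun import RealFeatures, EuclideanDistance, KMeans as ShogunKMeans

n, d, k = 100000, 10, 8
rng = np.random.RandomState(0)
data = rng.randn(n, d)

# scikit-learn expects an (n_samples, n_features) matrix
start = time.time()
SKKMeans(n_clusters=k, n_init=1, max_iter=100).fit(data)
print("scikit-learn: %.2fs" % (time.time() - start))

# Shogun expects features as an (n_features, n_samples) matrix
start = time.time()
feats = RealFeatures(np.ascontiguousarray(data.T))
kmeans = ShogunKMeans(k, EuclideanDistance(feats, feats))
kmeans.train()
print("Shogun:       %.2fs" % (time.time() - start))
```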

But make sure you also look at the other projects -- the point is to get involved in Shogun itself. Subscribe to the mailing list to get a feeling for how many people will apply for each project.

Mentors

Difficulty & Requirements

Easy to Medium -- depends on you. You need:

  • ML Algorithms in C++
  • Re-factoring existing code / design patterns
  • Knowledge of very basic ML algorithms
  • Basic Linear Algebra
  • Desirable: Experience with other ML toolkits (preferably Python, such as scikit-learn, or C++, such as MLPack)
  • Desirable: Some initial experience with an existing benchmarking system
  • Desirable: Practical experience with ML
  • Desirable: Some experience with Shogun

Description

This project is about improving Shogun's implementations of "the usual suspects", i.e. basic ML algorithms that should be available in every toolbox (see below). The focus in Shogun often lies on cutting-edge algorithms, leaving the usual suspects too little attention. This results in implementations that are not competitive in terms of speed, scalability, and stability. We aim to identify such algorithms, benchmark them, and finally improve their efficiency, code cleanliness, and test coverage. We want Shogun's implementations to be (at least) as fast as the fastest third-party library!

Details

Algorithms involved would definitely include the following (though there are many more candidates); also check the improvements from last year:

  • Model selection
  • Preprocessing
  • Neural networks (our implementations are slow!)
  • Gaussian Processes (see also the GP project)
  • (Kernel) SVMs (in all the hundreds of variations we have)
  • Gaussian Mixture models

As an example, have a look at the (solved) issue #2987, which illustrates the problems with KMeans: it is slow, partly wrong, and does not scale. Another example is #3048, which shows the kinds of discussions we have around this topic. A sketch of how such correctness problems can be caught follows below.
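
To catch the "partly wrong" part, a correctness check can compare the objective value that Shogun's implementation reaches against a reference implementation. The sketch below does this for k-means by comparing the within-cluster sum of squares against scikit-learn; as above, the `modshogun` import and class names are assumptions that depend on your Shogun version.

```python
# Correctness sketch: compare the k-means objective (within-cluster sum of
# squares) reached by Shogun against scikit-learn as a reference.
# The `modshogun` import and class names are version-dependent assumptions.
import numpy as np
from sklearn.cluster import KMeans as SKKMeans
from modshogun import RealFeatures, EuclideanDistance, KMeans as ShogunKMeans

def inertia(data, centers):
    # Sum of squared distances from each point to its closest center
    d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

rng = np.random.RandomState(0)
data = rng.randn(5000, 5)
k = 10

sk = SKKMeans(n_clusters=k, n_init=1).fit(data)

feats = RealFeatures(np.ascontiguousarray(data.T))
km = ShogunKMeans(k, EuclideanDistance(feats, feats))
km.train()
centers = km.get_cluster_centers().T  # Shogun stores centers column-wise

print("scikit-learn objective:", sk.inertia_)
print("Shogun objective:      ", inertia(data, centers))
```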

In addition, we want to clean up the multi-core implementations of our algorithms: replacing old (messy) pthreads code with OpenMP, and improving thread safety, cache locality, etc. See this or this patch for examples.

Waypoints and initial work

After having identified a number of algorithms, the typical approach would be to

1.) Write a script/program to compare performance with an existing ML library on a challenging practical application. Such a benchmark should test various aspects: correctness, speed, different data sizes, different problem flavours. UPDATE: This step should now be done as part of mlpack's benchmarks. Ryan will be an additional mentor to help us. The results page can be found here. We highly appreciate any ideas to make this part as smooth as possible.

2.) Identify the most severe bottlenecks where Shogun does not perform well. These might be purely software-engineering questions, but can also depend on the mathematical formulation of the algorithms (see e.g. the PCA improvements; a small numerical illustration follows after this list).

3.) Re-write the code for the algorithms in question. Test it. Produce clean code that is easy to read (our HMM implementation is a good example of unreadable code).

4.) Give the whole implementation a general clean-up: documentation, unit testing, examples.

5.) Create an IPython notebook with a real-world application.

6.) Write a report showcasing Shogun's performance compared to other libraries.
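
As a small illustration of the "mathematical formulation" point in step 2: the snippet below computes the top PCA directions in two equivalent ways, once via the d x d covariance matrix and once via the n x n Gram matrix. When there are far more features than samples, the Gram route is much cheaper. This is plain numpy, independent of Shogun, and only meant to show how a reformulation can change the cost of an algorithm.

```python
# Two mathematically equivalent routes to the top PCA directions.
# When n_samples << n_features, working with the n x n Gram matrix is much
# cheaper than with the d x d covariance matrix. Plain numpy illustration.
import numpy as np

rng = np.random.RandomState(0)
n, d, n_components = 100, 2000, 5
X = rng.randn(n, d)
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the d x d covariance matrix (O(d^3))
C = Xc.T.dot(Xc) / n
evals, evecs = np.linalg.eigh(C)
W_cov = evecs[:, ::-1][:, :n_components]

# Route 2: eigendecomposition of the n x n Gram matrix (O(n^3)).
# If G v = lambda v with G = Xc Xc^T / n, then Xc^T v is an eigenvector of C
# with the same eigenvalue; normalising it recovers the principal directions.
G = Xc.dot(Xc.T) / n
gvals, gvecs = np.linalg.eigh(G)
gvals = gvals[::-1][:n_components]
gvecs = gvecs[:, ::-1][:, :n_components]
W_gram = Xc.T.dot(gvecs) / np.sqrt(n * gvals)

# Both span the same subspace (columns may differ in sign)
print(np.allclose(np.abs(W_cov.T.dot(W_gram)), np.eye(n_components), atol=1e-6))
```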

Optional

Once the Shogun implementations run competitively, we can look into gaining further speed-ups through multicore computation, approximations, and other means. We could also run a survey on Shogun's mailing list to identify methods that people would like to see improved. A third option would be to simplify existing interfaces to make our algorithms easier to use.

Why this is cool

This project offers the chance to learn about many fundamental ML algorithms from a practical perspective, with a focus on efficiency. As the usual suspects are the algorithms most used by Shogun users, it is likely that many people will run code that you wrote.

Useful resources

Entrance issues:
