
GSoC_2019_project_efficient_ml


Fundamental Machine Learning Algorithms III: Finding the bad guys

... continuing from 2017 GSoC

We are continuing the highly popular project of the last years: the aim is to improve our implementations of fundamental ML algorithms. As this year's focus is on the user experience with Shogun, we focus on finding the bad guys. Who are the bad guys? They are implementations of algorithms in Shogun that are embarrassingly bad in at least one of: runtime, memory efficiency, code style, API, documentation ... we don't want to embarrass ourselves ;)

While we don't need Shogun to be the fastest/best/prettiest library in every task, it at least should not suck. This project is about identifying and fixing all those "bad guys".

Mentors

Difficulty & Requirements

Medium to difficult: you need to dig into existing code, and you will need:

  • ML Algorithms in C++
  • Re-factoring existing code / design patterns
  • Knowledge of basic ML
  • Basic Linear Algebra, Shogun's linalg framework
  • Experience with other ML toolkits (preferably Python, such as scikit-learn, or C++, such as mlpack)
  • Desirable: Experience with the benchmarking system
  • Desirable: The ability to make algorithms more cache friendly (see the sketch after this list)
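
Regarding the last point: cache friendliness is often a matter of matching loop order to memory layout. The following minimal sketch is plain C++ for illustration, not Shogun code:

```cpp
#include <cstddef>
#include <vector>

// Sum all entries of a row-major matrix stored in one contiguous buffer.
// The inner loop walks memory in order, so the hardware prefetcher can
// stream whole cache lines.
double sum_cache_friendly(const std::vector<double>& m,
                          std::size_t rows, std::size_t cols)
{
    double s = 0.0;
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            s += m[i * cols + j]; // stride 1: contiguous access
    return s;
}

// Same arithmetic, but the inner loop jumps `cols` doubles per step,
// touching a new cache line on every access for large matrices.
double sum_cache_hostile(const std::vector<double>& m,
                         std::size_t rows, std::size_t cols)
{
    double s = 0.0;
    for (std::size_t j = 0; j < cols; ++j)
        for (std::size_t i = 0; i < rows; ++i)
            s += m[i * cols + j]; // stride `cols`: cache-hostile
    return s;
}
```

Both functions compute the same sum, but on matrices larger than the last-level cache the first is typically several times faster. Always check the actual storage order of the container at hand (for instance SGMatrix) before reordering loops.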

Details

Here are some examples of the topics that should be covered.

Runtime

Have a look at the benchmark comparisons of Shogun with other libraries in mlpack's benchmarking framework. You will notice that Shogun sometimes does quite well, as for KMeans:

| dataset | mlpy | scikit | shogun | weka | mlpack |
|---------|------|--------|--------|------|--------|
| corel-histogram | 3.59s | 0.73s | 1.11s | 19.43s | 1.92s |
| mnist | 119.83s | 46.13s | 16.02s | 1558.07s | 61.35s |

On the other hand, there are situations that are less than optimal, such as linear regression, where Shogun fails outright:

| dataset | mlpy | scikit | shogun | weka | mlpack |
|---------|------|--------|--------|------|--------|
| arcene | failure | 0.24s | failure | 3.16s | 0.42s |
| cosExp | 0.13s | 0.08s | failure | 17.42s | 0.13s |

Another one is linear ridge regression, where Shogun is extremely slow:

| dataset | scikit | shogun |
|---------|--------|--------|
| webpage | 1.94s | >9000s |

Again, we don't want Shogun to be the fastest candidate everywhere; we just don't want it to be the slowest by far.
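
For quick local checks outside the benchmarking framework, a small wall-clock harness is enough to produce the "before-after" numbers asked for below. Here is a minimal sketch in plain C++; the callable is a placeholder for whatever Shogun entry point you are profiling:

```cpp
#include <chrono>
#include <cstdio>

// Time an arbitrary training routine several times and report the best
// run, which filters out one-off noise such as cold caches.
template <typename Callable>
double best_wall_time_seconds(Callable train_model, int repetitions = 5)
{
    double best = 1e300;
    for (int r = 0; r < repetitions; ++r)
    {
        auto start = std::chrono::steady_clock::now();
        train_model(); // placeholder: e.g. the train() call under test
        auto stop = std::chrono::steady_clock::now();
        double seconds = std::chrono::duration<double>(stop - start).count();
        if (seconds < best)
            best = seconds;
    }
    return best;
}

int main()
{
    // Hypothetical usage; substitute the real call being measured.
    double t = best_wall_time_seconds([] { /* train the model here */ });
    std::printf("best of 5: %.3fs\n", t);
}
```

Reporting the best of several runs rather than the mean keeps one-off effects from distorting the comparison; for the numbers that end up in the benchmark tables, the mlpack benchmarking framework remains the canonical tool.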

Awkward API

Example: have a look at GMM. It has 3 train methods, awkward methods like get_nth_mean, multiple methods to apply it (::cluster, ::get_likelihood_example), etc. A first step would be to rename these methods to something sensible, or to remove them entirely (we have tags, so there is no need for getters/setters anymore). Next, GMM is nothing but an unsupervised learning algorithm, so it should support that interface, fit and predict, rather than offering its own methods. Finally, GMM is also a distribution that can be sampled from, so it should additionally expose a sampling API.
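
To sketch what this could look like, here is a rough target shape. All class and method names below are hypothetical, not Shogun's current API, and the Matrix/Vector aliases merely stand in for SGMatrix/SGVector:

```cpp
#include <vector>

// Placeholder types for the sketch; in Shogun these would be
// SGMatrix<float64_t> and SGVector<float64_t>.
using Matrix = std::vector<std::vector<double>>;
using Vector = std::vector<double>;

// Interface shared by all unsupervised machines (hypothetical).
class UnsupervisedMachine
{
public:
    virtual ~UnsupervisedMachine() = default;
    virtual void fit(const Matrix& X) = 0;             // the one way to train
    virtual Vector predict(const Matrix& X) const = 0; // e.g. cluster labels
};

// Anything that is also a distribution can be sampled from (hypothetical).
class SampleableDistribution
{
public:
    virtual ~SampleableDistribution() = default;
    virtual Matrix sample(int num_samples) const = 0;
    virtual Vector log_likelihood(const Matrix& X) const = 0;
};

// GMM fills both roles: it clusters and it is a distribution.
class GMM : public UnsupervisedMachine, public SampleableDistribution
{
public:
    explicit GMM(int num_components) : m_num_components(num_components) {}

    void fit(const Matrix& X) override;             // replaces the three train methods
    Vector predict(const Matrix& X) const override; // replaces ::cluster
    Matrix sample(int num_samples) const override;
    Vector log_likelihood(const Matrix& X) const override; // replaces ::get_likelihood_example

    // No get_nth_mean-style getters: parameters would be exposed through
    // the tag system instead, e.g. a hypothetical get("means").

private:
    int m_num_components;
};
```

The benefit of the split is that clustering code can depend on the machine interface alone, while sampling code can accept any distribution, so GMM no longer needs one-off methods for either role.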

We actually wrote some API desiderata for the user experience project, which overlaps with this project in terms of API. Think of it this way: you identify bad API and describe how it should look instead, the user experience student implements the basics needed to make your changes possible, and you then change the algorithm.

Documentation issues

Some bad examples:

You get the point...

First steps

  • Increase the coverage of Shogun in the benchmark framework. Ideally, all algorithms in the framework should have a Shogun entry
  • Make a priority list of algorithms where Shogun doesn't do well: runtime & memory
  • Make a list of badly documented or undocumented algorithm classes (missing @brief, one-sentence docs; see the documentation sketch after this list)
  • Make a list of algorithms with awkward API
  • Take a single instance and work on it until things are better.
  • Whenever you touch the internals, make sure to also polish: linalg usage, API, class design
  • Work on a one-by-one basis
  • Whenever you improve something, make sure to provide a "before-after" comparison.
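
For the documentation item above, the target shape is roughly the following: every public class carries a doxygen @brief, a short paragraph stating what is computed and how, and the key formula. The class below is only an illustration of the style (the math is standard ridge regression):

```cpp
/** @brief Linear ridge regression: Tikhonov-regularized least squares.
 *
 * Fits the weights \f$ w = \arg\min_w \|Xw - y\|_2^2 + \tau \|w\|_2^2 \f$,
 * solved in closed form via the normal equations
 * \f$ (X^\top X + \tau I) w = X^\top y \f$.
 *
 * The regularization strength \f$ \tau \ge 0 \f$ trades variance for bias;
 * \f$ \tau = 0 \f$ recovers ordinary least squares.
 */
class LinearRidgeRegression
{
    // ... members and methods ...
};
```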

Why this is cool

This project offers the chance to learn about many fundamental ML algorithms from a practical perspective, with a focus on usability and efficiency. As we want to start with important algorithms first, it is likely that many people will use (and appreciate) code that you wrote.

Useful resources
