
GSoC_2017_applications

Heiko Strathmann edited this page Feb 8, 2017 · 18 revisions

Applying Shogun to the real world

This year, we would like to try something new: application-based projects that focus more on using Shogun than on modifying it. In practice, they might be a mix of the two. The idea is that the projects are stand-alone and the result is something really cool.

Mentors

Difficulty & Requirements

Totally depends on what you are after.

Most important are

  • that you are extremely motivated and ready to work independently
  • some Shogun and Machine Learning basics
  • knowledge of the context of the chosen application

Description

Every line of code in SHOGUN has a long history and has gone through many brains and hands. This made SHOGUN what it is today: a powerful toolbox with a lot of features. But most of the code was written by researchers for their studies. Usually the focus was on "getting things done", proving awesome ideas, and optimizing them "as fast as possible".

As a drawback, people didn't care too much about software engineering aspects. In addition, lots of new technologies have shown up since some parts of the code were written, which allow us to do even cooler things with less code now.

We want this project to improve maintainability, stability, and beauty:

  • Making heavily used base classes more lightweight to improve performance and memory consumption.
  • Use new and cool technologies
  • New language features (think of C++1x)
  • and more

Is this project for you?

This project targets people with a C/C++ background, a sense for good software engineering, and an interest in reliable software. In return, you'll learn a lot about basic machine learning algorithms. There is some low-hanging fruit, of course, but if you're an advanced hacker, we have a lot of great ideas for how to push the project forward.

GSoC is a marathon, not a sprint. We expect "good" performance over the whole project, and we expect you to stay in contact with us. Get on board and commit to contributing actively, and we promise to bring you up to speed with the magic internals that are hidden in SHOGUN. :)

Details

Here are some sub-projects. We are open to more:

Serialization framework finishing touches

tl;dr: Dirty work with binary data. Beat the NIH out of here!

  • Working title: Dirty deeds done with binary data.
  • Alternative working title: Beat the NIH out of here!

Last year, we implemented a new cerealisation framework, which needs some love. And the old one needs to die! Deep-copy of objects? Checking equality? Dump objects to disk and get 'em back? All of that used to be done in here. Thousands of lines of code, countless switch-case statements, and more special per-class and per-data-type code than we want to maintain. There is only one good reason why we haven't tackled it yet: it used to work.

Smart pointers

tl;dr: We want to stop using SG_REF and use C++11 magic instead.
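To make the idea concrete, here is a minimal sketch of shared ownership with std::shared_ptr, assuming SG_REF stands for manual reference counting. The classes Kernel and Machine are hypothetical stand-ins for illustration, not Shogun's actual class hierarchy or API.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-in for a reference-counted Shogun object.
struct Kernel
{
    explicit Kernel(double w) : width(w) {}
    double width;
};

// Hypothetical consumer that shares ownership of a kernel.
class Machine
{
public:
    // Storing the shared_ptr keeps the kernel alive; no manual
    // reference-count increment is needed.
    void set_kernel(std::shared_ptr<Kernel> k) { m_kernel = std::move(k); }

    // Returning by value hands out another owning handle.
    std::shared_ptr<Kernel> get_kernel() const { return m_kernel; }

private:
    // Released automatically when the Machine is destroyed.
    std::shared_ptr<Kernel> m_kernel;
};
```

With this pattern, a forgotten decrement can no longer leak an object and a forgotten increment can no longer leave a dangling pointer, because the count follows the lifetime of the handles themselves.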

Ancient multi-threading code

tl;dr: We want to get rid of the old threading code, which is unusable, unmaintainable, and uncool. Replace it with OpenMP or similar.
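As an illustration of how little code an OpenMP replacement can need, the sketch below parallelizes a simple loop with a reduction. The function sum_of_squares is a hypothetical example, not existing Shogun code.

```cpp
#include <cstddef>
#include <vector>

// Sum of squares with an OpenMP reduction. If the compiler is invoked
// without OpenMP support (e.g. without -fopenmp on GCC/Clang), the pragma
// is ignored and the loop runs serially.
double sum_of_squares(const std::vector<double>& x)
{
    double total = 0.0;
    const long n = static_cast<long>(x.size());
#pragma omp parallel for reduction(+ : total)
    for (long i = 0; i < n; ++i)
    {
        const double v = x[static_cast<std::size_t>(i)];
        total += v * v; // each thread accumulates privately, then results are combined
    }
    return total;
}
```

Note that a parallel floating-point reduction may add terms in a different order than the serial loop, so results can differ in the last bits for general inputs.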

Progress bars and premature stopping

tl;dr: We want to have unified progress bars in Shogun (using SG_PROGRESS). It should be possible to prematurely stop algorithms in Shogun (and still get some results if that makes sense).
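One possible shape for this, sketched under assumptions: an iterative algorithm reports progress through a callback and checks a stop flag once per iteration, returning its current estimate if it is interrupted. All names here (IterativeResult, iterative_sqrt, the callback signature) are hypothetical and do not reflect how SG_PROGRESS works.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>

// Hypothetical result type: the current estimate plus how far we got.
struct IterativeResult
{
    double value;
    std::size_t iterations_done;
    bool completed;
};

// Hypothetical iterative algorithm (Newton's method for sqrt(a), a > 0)
// that reports progress and honours a stop request between iterations.
IterativeResult iterative_sqrt(
    double a, std::size_t max_iter, const std::atomic<bool>& stop_requested,
    const std::function<void(std::size_t, std::size_t)>& on_progress)
{
    double x = a;
    std::size_t i = 0;
    for (; i < max_iter; ++i)
    {
        if (stop_requested.load())
            return {x, i, false}; // stopped early: hand back the partial result
        x = 0.5 * (x + a / x);
        if (on_progress)
            on_progress(i + 1, max_iter); // e.g. drive a progress bar
    }
    return {x, i, true};
}
```

The stop flag is atomic so that it could be set from another thread or a signal handler wrapper while the algorithm runs; whether a partial result is meaningful depends on the algorithm.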

Bug fixes.

Shogun has many, many bugs, and we could actually fix some of them. Pick your favourite! https://github.com/shogun-toolbox/shogun/labels/BUG https://github.com/shogun-toolbox/shogun/labels/bugfixing

File readers.

tl;dr: File IO and parsing done right using modern C++.

SHOGUN contains tons (how many lines?) of code just to parse input data formats. The code basically works, but some of the parsers have minor bugs, most of them read like "C89 with classes", and static code analysis tells us we need to do something here.

Lots of things are possible here: refactoring, deduplication, a new API, less code, less NIH.
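For a flavour of what "modern C++" parsing could mean, here is a minimal sketch of a dense CSV reader built only on the standard library. The function read_dense_csv and its error-handling policy are assumptions for illustration, not part of Shogun's existing file readers.

```cpp
#include <cstddef>
#include <exception>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Parse delimiter-separated numeric rows from any input stream.
// Throws std::runtime_error on a malformed field or on ragged rows.
std::vector<std::vector<double>> read_dense_csv(std::istream& in, char delim = ',')
{
    std::vector<std::vector<double>> rows;
    std::string line;
    while (std::getline(in, line))
    {
        if (line.empty())
            continue; // skip blank lines
        std::vector<double> row;
        std::istringstream fields(line);
        std::string field;
        while (std::getline(fields, field, delim))
        {
            std::size_t used = 0;
            double value = 0.0;
            try
            {
                value = std::stod(field, &used);
            }
            catch (const std::exception&)
            {
                throw std::runtime_error("not a number: '" + field + "'");
            }
            if (used != field.size())
                throw std::runtime_error("trailing characters in: '" + field + "'");
            row.push_back(value);
        }
        if (!rows.empty() && row.size() != rows.front().size())
            throw std::runtime_error("inconsistent number of columns");
        rows.push_back(std::move(row));
    }
    return rows;
}
```

Taking a std::istream rather than a file name keeps the parser testable with in-memory strings and reusable for files, pipes, or compressed streams.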

Redesign data classes.

tl;dr: Being a software architect.

The foundation of every learning problem is the data structures used by all algorithms. Take Dense/Sparse Features, for instance, or Dense/Sparse Streaming features: duplicated functionality, special handling of feature classes in algorithm code, and online algorithms that are not possible on non-stream features.

Buzzword bingo: separation of concerns; finding invariants in the existing classes; redesigning the feature APIs; going back to the drawing board and analyzing what's really needed; gaining flexibility.

Waypoints and initial work

What's to be done here depends on you. The minimal goal is a small prototype to prove the idea of the topic you are working on. The full-fledged solution is, well, you guessed it: hard work and a lot of fame.

Optional

Whatever you can imagine

Why this is cool

It attempts to solve one of the biggest open problems we have in Shogun: being unable to move because we are chained to the framework. A modern, slim Shogun is the dream of every one of our developers :)

Useful resources

  • All core developers

GitHub issues, in particular

Data structures:

Get back to the main projects page.
