Heiko Strathmann edited this page Mar 30, 2017 · 15 revisions

Shogun detox 2

...continuing from last year

Mentors

Difficulty & Requirements

Medium. But requires a lot of initiative and willingness to dive into existing code (that is not pretty).

You need to know

  • Shogun's internals (to an extent)
  • C++
  • Software engineering principles

Description

Every line of code in SHOGUN has a long history and has gone through many brains and hands. This made SHOGUN what it is today: a powerful toolbox with a lot of features. But most of the code has been written by researchers for their studies. Usually the focus was on "getting things done": proving awesome ideas and optimizing them "as fast as possible".

As a drawback, people didn't care too much about software engineering aspects. In addition, lots of new technologies have shown up since some parts of the code were written, which allow us to do even cooler things with less code now.

We want this project to improve maintainability, stability, and beauty:

  • Make heavily used base classes more lightweight to improve performance and memory consumption.
  • Use new and cool technologies
  • New language features (think of C++1x)
  • and more

Is this project for you?

This project targets people with a C/C++ background, an idea of "good" software engineering, and a taste for reliable software. In return, you'll learn a lot about basic machine learning algorithms. Of course there are some low-hanging fruits, but if you're an advanced hacker, we have a lot of great ideas for how to push the project forward.

GSoC is a marathon, not a sprint. We expect "good" performance over the whole project, and we expect you to stay in contact with us. Get on board, commit to contributing actively, and we promise to bring you up to speed with the magic internals hidden in SHOGUN. :)

Details

Here are some sub-projects. We are open to more:

NOTE: A GSoC project will address multiple (or ideally all) of those topics.

Plugins and tags finalization

Sanuj did a great job in GSoC 2016 writing a new parameter framework. We are working on integrating it, and this needs some more effort: getting rid of old code, fully integrating the new code, tying things together with the rest of the framework, and moving towards a plugin architecture. Interesting topic!

Key points

  • Replace all SG_ADD with tags registration.
  • Make sure Shogun still works afterwards
  • Once SG_ADD is removed, serialization and equals will stop working, but they need to work, see serialization below.

Initial work:

  • Read the docs on tags
  • Register the member variables of a selected class as tags (without removing the SG_ADD yet)
  • Think about automating the refactoring.
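To make the idea of tag-based registration concrete, here is a minimal sketch. It is not Shogun's actual tag framework: `Tag`, `TaggedObject`, `register_param`, and `ToyClassifier` are hypothetical names invented for illustration, and the real API should be taken from the tags docs. The sketch only shows the general pattern of exposing members under a typed name, which is the role SG_ADD plays today.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <typeinfo>
#include <utility>

// Hypothetical typed tag: a parameter name bound to a value type.
template <typename T>
struct Tag {
    explicit Tag(std::string n) : name(std::move(n)) {}
    std::string name;
};

// Hypothetical stand-in for a base class such as CSGObject.
class TaggedObject {
public:
    TaggedObject() = default;
    // Non-copyable: the registry stores pointers to this object's members.
    TaggedObject(const TaggedObject&) = delete;
    TaggedObject& operator=(const TaggedObject&) = delete;
    virtual ~TaggedObject() = default;

    template <typename T>
    void set(const Tag<T>& tag, const T& value) { *lookup(tag) = value; }

    template <typename T>
    T get(const Tag<T>& tag) const { return *lookup(tag); }

protected:
    // Expose a member variable under the tag's name and type.
    template <typename T>
    void register_param(const Tag<T>& tag, T* member) {
        params_[tag.name] = Entry{member, &typeid(T)};
    }

private:
    struct Entry {
        void* ptr;
        const std::type_info* type;
    };

    // Find a parameter and check that the requested type matches.
    template <typename T>
    T* lookup(const Tag<T>& tag) const {
        auto it = params_.find(tag.name);
        if (it == params_.end())
            throw std::invalid_argument("unknown parameter: " + tag.name);
        if (*it->second.type != typeid(T))
            throw std::invalid_argument("wrong type for parameter: " + tag.name);
        return static_cast<T*>(it->second.ptr);
    }

    std::map<std::string, Entry> params_;
};

// Hypothetical class whose members are registered in its constructor.
class ToyClassifier : public TaggedObject {
public:
    ToyClassifier() {
        register_param(Tag<double>("C"), &c_);
        register_param(Tag<int>("max_iter"), &max_iter_);
    }

private:
    double c_ = 1.0;
    int max_iter_ = 100;
};

// Usage: parameters are reached by name and type, not by member access.
inline void tags_example() {
    ToyClassifier clf;
    clf.set(Tag<double>("C"), 2.5);
    assert(clf.get(Tag<double>("C")) == 2.5);
}
```

Because every parameter is reachable through one registry, generic features (such as serialization or equality checks) can be written once in the base class instead of once per class.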

Serialization framework finishing touches

Once tags are (more or less) integrated, serialization and equals will be next. Last year, Pan implemented a new cerealisation framework, which needs some love. And the old one needs to die! Deep-copy of objects? Checking equality? Dumping objects to disk and getting 'em back? All of that used to be done in here: thousands of lines of code, countless switch-case statements, and more special per-class and per-data-type code than we want to maintain. There is only one good reason why we haven't tackled it yet: it used to work.

Key points:

  • All SG_ADD instances need to be replaced (or extended, matter of discussion) with the new tag framework
  • Once a class's parameters are registered, the class needs to be serializable via cereal
  • Once that works, all old serialization code will be deleted
  • A new equals method should also be easy from here

Initial work:

  • Read and understand old serialization code (roughly), the cereal feature branch, and of course cereal docs
  • Draft a prototype:
      • Take a Shogun class
      • Register its parameters in the tags framework
      • Add a dump method to CSGObject that uses existing cereal code
      • Write a test to ensure serialization works.
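The prototype steps above can be sketched without any Shogun or cereal code. The example below is an assumption-laden illustration, not the cereal-based implementation: `DumpableObject`, `register_param`, `dump`, `equals`, and `ToyKernel` are hypothetical names, and the text format is made up. It only shows why a parameter registry lets a single generic `dump` and `equals` replace per-class code.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <sstream>
#include <string>

// Hypothetical base class: parameters are registered once, then a generic
// dump() and equals() work for every subclass.
class DumpableObject {
public:
    DumpableObject() = default;
    // Non-copyable: the registry captures pointers to this object's members.
    DumpableObject(const DumpableObject&) = delete;
    DumpableObject& operator=(const DumpableObject&) = delete;
    virtual ~DumpableObject() = default;

    // One "name=value" line per registered parameter, sorted by name.
    std::string dump() const {
        std::ostringstream out;
        for (const auto& p : writers_)
            out << p.first << '=' << p.second() << '\n';
        return out.str();
    }

    // Simplification: two objects are equal if their dumps are equal.
    bool equals(const DumpableObject& other) const {
        return dump() == other.dump();
    }

protected:
    // Store a closure that knows how to print the member's current value.
    template <typename T>
    void register_param(const std::string& name, const T* member) {
        writers_[name] = [member]() -> std::string {
            std::ostringstream s;
            s << *member;
            return s.str();
        };
    }

private:
    std::map<std::string, std::function<std::string()>> writers_;
};

// Hypothetical class with two registered parameters.
class ToyKernel : public DumpableObject {
public:
    ToyKernel(double width, int degree) : width_(width), degree_(degree) {
        register_param("width", &width_);
        register_param("degree", &degree_);
    }

private:
    double width_;
    int degree_;
};

// The kind of test the last prototype step asks for.
inline void serialization_example() {
    ToyKernel a(2.5, 3);
    ToyKernel b(2.5, 3);
    assert(a.equals(b));
}
```

A real implementation would hand each registered value to a cereal archive instead of printing it, and would compare values directly rather than comparing text.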

Smart pointers

tl;dr: We want to stop making use of SG_REF, but use C++11 magic instead.
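As a rough illustration of the "C++11 magic", the sketch below uses `std::shared_ptr` for shared ownership in place of manual reference counting. `Kernel` and `Machine` are hypothetical toy types, not Shogun classes. How smart pointers would be introduced into the real class hierarchy and the language interfaces is exactly the open question of this sub-project.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical toy type standing in for a reference-counted Shogun object.
struct Kernel {
    double width = 1.0;
};

// Hypothetical owner: holding a shared_ptr replaces a manual ref/unref pair.
class Machine {
public:
    void set_kernel(std::shared_ptr<Kernel> k) { kernel_ = std::move(k); }
    std::shared_ptr<Kernel> kernel() const { return kernel_; }

private:
    std::shared_ptr<Kernel> kernel_;
};

// The kernel is destroyed automatically once the last owner goes away.
inline void smart_pointer_example() {
    std::weak_ptr<Kernel> observer;
    {
        Machine machine;
        auto kernel = std::make_shared<Kernel>();
        observer = kernel;
        machine.set_kernel(kernel);
        assert(kernel.use_count() == 2);  // local variable + machine
    }
    assert(observer.expired());  // no manual unref was needed
}
```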

Ancient multi-threading code

tl;dr: We want to get rid of the old threading code that is unusable, unmaintainable, and uncool. Replace it with OpenMP or similar. Example
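For readers who have not used OpenMP, here is a minimal sketch of the style that could replace hand-written thread management. The `dot` function is a made-up example, not Shogun code. When compiled without OpenMP support (for example without `-fopenmp` on GCC or Clang), the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <vector>

// Dot product with an OpenMP reduction: each thread sums a slice of the
// range, and the partial sums are combined when the loop ends.
// Assumes both vectors have the same length.
inline double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    const long n = static_cast<long>(a.size());
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

inline void openmp_example() {
    const std::vector<double> ones(1000, 1.0);
    const std::vector<double> twos(1000, 2.0);
    assert(dot(ones, twos) == 2000.0);
}
```

There is no thread creation, joining, or per-thread parameter struct to maintain. Note that a parallel floating-point reduction may add terms in a different order than the serial loop, so results can differ in the last bits for general inputs.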

Progress bars and premature stopping

tl;dr: We want to have unified progress bars in Shogun (using SG_PROGRESS). It should be possible to prematurely stop algorithms in Shogun (and still get some results if that makes sense).
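One possible shape for this, sketched under assumptions: an iterative algorithm reports progress through a callback and checks a stop flag on every iteration, returning whatever it has computed so far. `StopToken`, `Result`, and `iterate` are hypothetical names for illustration, and the sketch does not use SG_PROGRESS.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical stop flag that another thread or a signal handler could set.
struct StopToken {
    std::atomic<bool> requested{false};
};

// Hypothetical result type that records whether the run finished.
struct Result {
    double value;
    std::size_t iterations;
    bool completed;
};

// Toy iterative algorithm (a partial harmonic sum) that reports progress
// and returns a partial result when a stop is requested.
inline Result iterate(
    std::size_t max_iter, const StopToken& stop,
    const std::function<void(std::size_t, std::size_t)>& on_progress) {
    Result r{0.0, 0, false};
    for (std::size_t i = 0; i < max_iter; ++i) {
        if (stop.requested.load())
            return r;  // premature stop: hand back what we have so far
        r.value += 1.0 / static_cast<double>(i + 1);
        r.iterations = i + 1;
        if (on_progress)
            on_progress(i + 1, max_iter);
    }
    r.completed = true;
    return r;
}

// Usage: the progress callback requests a stop after three iterations.
inline void stopping_example() {
    StopToken stop;
    const Result r = iterate(10, stop, [&stop](std::size_t done, std::size_t) {
        if (done == 3)
            stop.requested.store(true);
    });
    assert(r.iterations == 3);
    assert(!r.completed);
}
```

Keeping the callback and the stop flag separate from the algorithm means one progress bar implementation could serve every algorithm that accepts them.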

Bug fixes.

Shogun has many, many bugs; we could actually fix some of them. Pick your favourite! https://github.com/shogun-toolbox/shogun/labels/BUG https://github.com/shogun-toolbox/shogun/labels/bugfixing

File readers.

NOTE: This is such a big topic that we decided to not make it part of the project this year. tl;dr: File IO and parsing done right using modern C++.

SHOGUN contains tons (how many lines?) of code to just parse input data formats. The code is basically working, some of the parsers have minor bugs, most of them read like "C89 with classes", and static code analysis tells us we need to do something here.

Lots of things are possible here: refactoring, deduplication, a new API, less code, less NIH.

Redesign data classes.

tl;dr: Being a software architect.

The foundation of every learning problem is the data structures used by all algorithms. Take Dense/Sparse Features, for instance, or Dense/Sparse Streaming: duplicated functionality, special handling of feature classes in algorithm code, and online algorithms that are not possible on non-stream features.

Buzzword bingo: Iterators(!); separation of concerns; finding invariants in the existing classes; redesign of the features APIs; going back to the board and analyzing what's really needed; gaining flexibility.
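To illustrate why iterators are on the list, here is a small sketch using only the standard library. An online (single-pass) algorithm written against input iterators runs unchanged on in-memory data and on a stream. `running_mean` is a made-up example and says nothing about how Shogun's feature classes would actually be redesigned.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <sstream>
#include <vector>

// Online mean: needs only a single forward pass, so any input iterator works.
template <typename InputIt>
double running_mean(InputIt first, InputIt last) {
    double mean = 0.0;
    std::size_t n = 0;
    for (; first != last; ++first) {
        ++n;
        mean += (*first - mean) / static_cast<double>(n);
    }
    return mean;
}

inline void iterator_example() {
    // In-memory ("dense") data.
    const std::vector<double> dense = {1.0, 2.0, 3.0, 4.0};
    assert(running_mean(dense.begin(), dense.end()) == 2.5);

    // Streamed data, read one value at a time.
    std::istringstream stream("1 2 3 4");
    assert(running_mean(std::istream_iterator<double>(stream),
                        std::istream_iterator<double>()) == 2.5);
}
```

The algorithm contains no special handling for either data source; the difference lives entirely in the iterator.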

Waypoints and initial work

What's to be done here depends on you. The minimal goal is a small prototype to prove the idea of the topic you are working on. The full-fledged solution is, well, you guessed it: hard work and a lot of fame.

Extremely important: producing documents that cover the touched internals of Shogun to make future developers' lives easier.

Optional

Whatever you can imagine

Why this is cool

It attempts to improve one of the biggest open problems we have in Shogun: being unable to move because of being chained to the framework. A modern, slim Shogun is the dream of every one of our developers :)

Useful resources

  • All core developers

Github issues / pull requests, in particular

Data structures:
