
GSoC_2018_project_inside_blackbox

Fernando J. Iglesias García edited this page Apr 2, 2018 · 12 revisions

Inside that black box

...picking up from last year, featuring 2017 student Giovanni as a mentor.

This project picks up Giovanni's great initiative to "open the black box" of Shogun implementations. The idea is simple and appealing: when an expensive C++ algorithm is called from, e.g., a Python interface, the call blocks until it finishes, which can take a long time. The user has no control over the program's execution and therefore no idea how long it will take or, more importantly, how well the model is doing.

We would like Shogun's iterative algorithms to:

  • emit their progress (how much is done? how much is left?);
  • emit their internals (residuals, partial results, convergence diagnostics);
  • be stoppable (resulting in partially trained models that can be deployed!).

The foundations for this were laid last year. This year is about polishing and extending these ideas to the whole of Shogun and illustrating some cool use-cases. This project is also slightly related to the others (like the Detox or Usability projects), since the internal API for premature stopping and parameter observers may require a bit of refactoring to reflect upcoming changes to Shogun's classes and interfaces.

Mentors

Difficulty & Requirements

Easy to medium

You need to know:

  • C++;
  • Shogun (how its internals work);
  • RxCpp (a basic level is enough);
  • Machine learning code basics (what iterative algorithms are);
  • OpenMP (optional).

First steps

The first thing to do is to compile a list of algorithms that have an iterative nature. This can be done by grepping for our progress bar, the stopping mechanics (CSignal), and algorithms that already use RxCpp, and by generally reading the code base (there are lots of iterative algorithms that don't contain any of the mentioned mechanics; ICA is one example). Secondly, for each of them we will have to do the following:

  • Propose/discuss which actions are suitable when pausing/stopping the method;
  • For each of them, implement the corresponding methods (on_pause(), on_complete(), on_next());
  • Add a progress bar where possible;
  • Add a meta example that shows these new features, so that a prospective user can see how premature stopping works;
  • Write tests that make sure one can use partially trained models, i.e. that one can call apply/predict after having stopped training, that the instance is in a "good" state, that it is serializable, etc.

If you don't know where to start, do not worry ;) See the entrance task on testing iterative algorithms and the one regarding premature stopping.

Optional

  • For each of the algorithms, make them emit values so that information can be recorded and visualized with TensorBoard;

Refine the TensorBoard integration even further. TensorBoard can also visualize other types of data, specifically images, audio and text. Currently, Shogun supports only scalar and histogram visualization. We would like to find out whether the observable/observer code can be extended to output such data from Shogun's algorithms as well, so that it can be shown through TensorBoard.

Why this is cool

It makes Shogun much more applicable to large problems and improves the workflow of using Shogun. It will enhance the codebase and modernize our library by giving users the possibility to actually see what is going on inside their models. You will touch a lot of ML code, so you will learn how commonly used algorithms work and how they are implemented in Shogun. You will also come into contact with many new technologies that build the skeleton of the project's framework (SWIG, RxCpp, TensorFlow, TensorBoard, etc.), which will make your CV even cooler!

Useful resources
