
Inside the Black Box II


Picking up from 2018 and 2017, featuring former GSoC students Shubham and Giovanni as mentors.

This project picks up Shubham's great work to "open the black box" of Shogun implementations. The idea is simple and appealing: when an expensive C++ algorithm is called from e.g. the Python interface, the call blocks until it finishes, which can take a long time. The user has no control over the program execution, no idea how long it will take, and, more importantly, no idea how well the model is doing.

We now have mixin classes in Shogun that allow iterative algorithms to implement an API for stopping/continuing training. This year, we would like to work more along the lines of having each machine

  • emitting its progress (how much was done? how much is left?);
  • emitting its internals (residuals, partial results, convergence diagnostics);

The foundations for this have been laid over the last two years. This year is about polishing and extending these ideas to the whole of Shogun and illustrating some cool use-cases. This project is also slightly related to the others (like the Detox or Usability projects), since the internal API for premature stopping and parameter observers may require a bit of refactoring to reflect the upcoming changes to Shogun's classes and interfaces.

Another point is that we want most algorithms in Shogun to behave in a generic manner, in the sense that they are type independent. Generally, the train method can accept any type of features as a CFeatures* pointer; however, it is later assumed that the features provided are of a particular type. We introduced feature dispatching classes last year to enable this behaviour in a more automated way. This year we want to improve this idea and implement it Shogun-wide.
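To give a flavour of the dispatching idea, here is a minimal, self-contained sketch. All class and method names here are illustrative, not Shogun's actual API: a generic train() entry point inspects the primitive type once and forwards to a templated implementation.

```cpp
// Minimal sketch of feature dispatching (illustrative names, not Shogun's API):
// train() accepts a generic features pointer and forwards to a templated
// implementation for the concrete primitive type.
#include <stdexcept>
#include <type_traits>

enum class PrimitiveType { FLOAT32, FLOAT64 };

struct Features
{
    virtual ~Features() = default;
    virtual PrimitiveType ptype() const = 0;
};

template <class T>
struct DenseFeatures : Features
{
    PrimitiveType ptype() const override
    {
        return std::is_same<T, float>::value ? PrimitiveType::FLOAT32
                                             : PrimitiveType::FLOAT64;
    }
};

struct MyMachine
{
    void train(Features* f)   // generic, type-independent entry point
    {
        switch (f->ptype())   // dispatch once on the primitive type
        {
        case PrimitiveType::FLOAT32:
            train_templated(static_cast<DenseFeatures<float>*>(f));
            break;
        case PrimitiveType::FLOAT64:
            train_templated(static_cast<DenseFeatures<double>*>(f));
            break;
        default:
            throw std::runtime_error("unsupported feature type");
        }
    }

    template <class T>
    void train_templated(DenseFeatures<T>* /*features*/)
    {
        // type-specific training code lives here, written once as a template
    }
};
```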

Mentors

  • Shubham (github: shubham808, IRC: shubham808)
  • Giovanni (github: geektoni, IRC: geektoni)

Difficulty & Requirements

Medium

You need to know:

  • C++;
  • Shogun (how its internals work);
  • Mixins in C++ (the basics are enough; check out last year's work at the end of the post);
  • Python;
  • RxCpp (a basic level is enough);
  • Machine learning code basics (understanding of iterative algorithms).

First steps

The first thing to do is to expand the list of machines that have an iterative nature. This can be done by grepping for our progress bar, the IterativeMachine mixin, and algorithms that use RxCpp, and by generally reading the codebase (there are lots of algorithms that are iterative but don't contain any of the mentioned mechanics; one example is ICA).

Secondly, for each of them we will have to do the following:

  • Use existing machine members (e.g. m_w and bias of CLinearMachine for the weights and bias) instead of local copies. If corresponding local members are present, they must be removed. This makes sure the model updates its state every iteration.
  • Identify the main training loop. This is where the magic is happening.
  • Decouple the main training loop from the algorithm's implementation while making sure its behaviour does not change. This splits the logic into initialization and iteration phases (see the sketch after this list).
  • Write automatically generated tests that make sure one can use partially trained models, i.e. one can call predict after having stopped training, the instance is in a "good" state, it is serializable, etc. Currently, these tests are written for each iterative machine separately.
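As a rough illustration of what this split looks like, here is a minimal, self-contained sketch. The method names (init_model, iteration, continue_train) follow the spirit of Shogun's iterative machine mixin but are simplified, not the exact signatures:

```cpp
// Sketch of splitting train() into an initialization phase and per-iteration
// steps, so training can be stopped and later resumed. Names are illustrative.
#include <vector>

class ToyIterativeMachine
{
public:
    void train(const std::vector<double>& data)
    {
        init_model(data);        // one-time setup (allocate weights, reset counters)
        continue_train();        // run the main loop
    }

    void continue_train()        // can also be called to resume after a stop
    {
        while (m_iter < m_max_iter && !m_complete && !m_cancelled)
        {
            iteration();         // exactly one pass of the algorithm
            ++m_iter;            // progress bars / observers would hook in here
        }
    }

    void cancel() { m_cancelled = true; }   // e.g. triggered by a signal handler

private:
    void init_model(const std::vector<double>& data)
    {
        m_w.assign(data.size(), 0.0);
        m_iter = 0;
        m_complete = false;
        m_cancelled = false;
    }

    void iteration()
    {
        // toy update: shrink the weights; a real machine updates m_w, bias, etc.
        for (auto& w : m_w)
            w *= 0.9;
        m_complete = (m_iter + 1 >= m_max_iter);
    }

    std::vector<double> m_w;
    int  m_iter = 0;
    int  m_max_iter = 100;
    bool m_complete = false;
    bool m_cancelled = false;
};
```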

Optional

  • For each of the algorithms, make them emit values so that one can record information and visualize it with TensorBoard;

Refine the TensorBoard integration even further. TensorBoard can also visualize other types of data, more specifically images, audio, and text. Currently, Shogun only supports visualizing scalars and histograms. We would like to explore whether the observable/observer code could be extended to output such data from Shogun's algorithms as well, so that it can be shown through TensorBoard.
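To give a flavour of the observable/observer idea, here is a small RxCpp sketch (not Shogun's actual observer API) in which an algorithm pushes one value per iteration and a subscriber records it, e.g. to be forwarded to a TensorBoard writer:

```cpp
// Minimal RxCpp sketch of an algorithm emitting per-iteration values to a
// listener; the listener here just prints, but it could feed a TensorBoard log.
#include <rxcpp/rx.hpp>
#include <cstdio>

int main()
{
    // The algorithm side owns a subject and pushes one value per iteration.
    rxcpp::subjects::subject<double> training_loss;
    auto emitter = training_loss.get_subscriber();

    // The listener side subscribes before training starts.
    training_loss.get_observable().subscribe(
        [](double loss) { std::printf("loss observed: %f\n", loss); });

    // Fake training loop emitting a decreasing loss.
    for (int i = 1; i <= 5; ++i)
        emitter.on_next(1.0 / i);
    emitter.on_completed();
    return 0;
}
```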

Why this is cool

It makes Shogun much more applicable to large problems and improves the workflow of using it. It will enhance the codebase and modernize our library by giving users the possibility to actually see what is going on inside their models. You will touch a lot of ML code, and therefore you will learn how commonly used algorithms work and how they are implemented in Shogun. You will also come into contact with many new technologies that build the skeleton of the project's framework (SWIG, RxCpp, TensorFlow, TensorBoard, etc.), which will make your CV even cooler!

Useful resources
