GSoC_2020_project_detox

Detox++ ... +

...continuing from last year.

You like C++? Come on in!

This project continues our main focus of the last few years: modernizing Shogun internals and cleaning up the old parts of the project that are holding us back from cool advancements.

This year, we want to focus on data representations and the linear algebra API.

Mentors

Fernando (github: iglesias, IRC: feriglegarc)
Heiko (github: karlnapf, IRC: karlnapf)

Difficulty & Requirements

Medium to difficult

You need to know

C++
In particular, type systems & safety (i.e. a type-safe language, like C++ or Java)
Design patterns and software engineering principles
Linear algebra in computers, Eigen3, Shogun's linalg
The very basics of machine learning code

First steps

For every sub-project (see below):

Write down a list of classes/methods/concepts that will need to change (there are comments below)
Think about (and discuss) how every sub-project's problems could be solved efficiently
Write down pseudo-code of what the API should look like
Write down pseudo-code of what the internals would look like
Draft a minimal prototype of how you want to implement your change
Work on a one-by-one basis

Details

Here are some sub-projects. We are open to more:

NOTE: A GSoC project will address multiple (or ideally all) of these topics.

Modern C++

Many places in Shogun are based on, well, kinda old-school C99. While that was en vogue in 1999 (when Shogun started), it now is more like ... ridiculous. C++11 was a major milestone for the language and sparked continuous development of features that come in handy when building ML toolboxes.

One example is explicit template instantiation, as done here. A much better pattern for solving such problems is SFINAE/tag dispatch. Check out this cool blog post by a contributor, who actually deployed the pattern in a patch. We would like to see more of those modern concepts being used in Shogun, iteratively replacing the old ones.
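As a rough illustration of the two patterns (this is not taken from Shogun's code base; `describe` and `twice` are made-up names), a single template can pick a type-specific implementation at compile time:

```cpp
#include <string>
#include <type_traits>

// Tag dispatch: one public template forwards to an overload that is selected
// by a compile-time tag derived from the argument type.
namespace detail
{
    template <typename T>
    std::string describe_impl(T, std::true_type /* floating point */)
    {
        return "floating point";
    }

    template <typename T>
    std::string describe_impl(T, std::false_type /* everything else */)
    {
        return "not floating point";
    }
}

template <typename T>
std::string describe(T value)
{
    return detail::describe_impl(value, std::is_floating_point<T>{});
}

// SFINAE: this template only takes part in overload resolution for
// arithmetic types, so calling it with e.g. std::string does not compile.
template <typename T,
          typename = typename std::enable_if<std::is_arithmetic<T>::value>::type>
T twice(T value)
{
    return value + value;
}
```

Both variants only need C++11. No per-type instantiation list has to be maintained by hand; unsupported types are rejected by the compiler.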

Stateless Distance/Kernel API

See the API draft. We need a way to port the current caching mechanism in distance/kernel. Once we have the new distance/kernel API, we can adapt DistanceMachine to the new machine API (see the LinearMachine refactor for an example).
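The draft is the reference for the actual design. Purely as a sketch of what "stateless" could mean here (the names `GaussianKernel` and `kernel_matrix` and the `exp(-d^2 / width)` form are assumptions for illustration, not Shogun's API), the kernel object holds only its parameters and receives the data as arguments:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stateless kernel: stores its parameter, never the data.
struct GaussianKernel
{
    double width;

    // const call operator: evaluating the kernel does not modify it.
    double operator()(const std::vector<double>& a,
                      const std::vector<double>& b) const
    {
        assert(a.size() == b.size());
        double squared_distance = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i)
        {
            const double diff = a[i] - b[i];
            squared_distance += diff * diff;
        }
        return std::exp(-squared_distance / width);
    }
};

// The data is passed in at call time. The caller owns the returned matrix
// and is free to cache it outside of the kernel object.
template <typename Kernel>
std::vector<std::vector<double>>
kernel_matrix(const Kernel& kernel,
              const std::vector<std::vector<double>>& lhs,
              const std::vector<std::vector<double>>& rhs)
{
    std::vector<std::vector<double>> result(
        lhs.size(), std::vector<double>(rhs.size()));
    for (std::size_t i = 0; i < lhs.size(); ++i)
        for (std::size_t j = 0; j < rhs.size(); ++j)
            result[i][j] = kernel(lhs[i], rhs[j]);
    return result;
}
```

Because the kernel never stores the left- and right-hand side features, one instance can be used from several threads, and caching becomes a separate concern.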

Clean up StringFeatures

Fully add string features to the new API. See some initial work here.
Implement EmbededStringFeatures. StringFeatures support high-order mapping (turning several characters into real-valued data, which we call EmbededStringFeatures). Currently, StringFeatures can represent both kinds (original strings and mapped ones), which makes it confusing. We need to separate them into two types. After this, we need to figure out which kinds of StringFeatures each StringKernel and StringFeatures actually accepts, and then replace them with the correct feature types.

Re-designing `CFeatures`

We want to modernize Shogun's main data representation, CFeatures.

Immutable features

In order to make thread-safety easier, we plan to make the interface immutable. This means making all methods const. Everything that changes the state or content of a features object will have to create a copy first and return a new instance (that might share the underlying memory). The first step is to come up with a list of all non-const methods in features and decide which ones can be made const easily.
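A minimal sketch of the idea, assuming a hypothetical `ImmutableFeatures` class (not Shogun's `CFeatures`): every public method is const, and an operation that would otherwise modify the object returns a new instance that shares the underlying storage.

```cpp
#include <cstddef>
#include <memory>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical immutable feature container: all member functions are const.
class ImmutableFeatures
{
public:
    explicit ImmutableFeatures(std::vector<double> values)
        : m_data(std::make_shared<std::vector<double>>(std::move(values))),
          m_indices(identity(m_data->size()))
    {
    }

    std::size_t size() const { return m_indices.size(); }
    double get(std::size_t i) const { return (*m_data)[m_indices.at(i)]; }

    // Instead of subsetting in place, return a new instance restricted to
    // the given positions. The storage is shared, not copied.
    ImmutableFeatures view(const std::vector<std::size_t>& positions) const
    {
        std::vector<std::size_t> indices;
        indices.reserve(positions.size());
        for (std::size_t p : positions)
            indices.push_back(m_indices.at(p));
        return ImmutableFeatures(m_data, std::move(indices));
    }

    bool shares_storage_with(const ImmutableFeatures& other) const
    {
        return m_data == other.m_data;
    }

private:
    ImmutableFeatures(std::shared_ptr<const std::vector<double>> data,
                      std::vector<std::size_t> indices)
        : m_data(std::move(data)), m_indices(std::move(indices))
    {
    }

    static std::vector<std::size_t> identity(std::size_t n)
    {
        std::vector<std::size_t> indices(n);
        std::iota(indices.begin(), indices.end(), std::size_t{0});
        return indices;
    }

    std::shared_ptr<const std::vector<double>> m_data; // never written to
    std::vector<std::size_t> m_indices;
};
```

Since the shared storage is never written after construction, several threads can read from the original object and from its views without locking.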

Continuing from last year, we can further improve the feature class:

A common use case that needs mutable features is calling get_feature_matrix and then modifying the data. We can implement copy-on-write data structures (SGVector, SGMatrix).
There are several non-const math methods (dot, dense_dot, add_to_vec). We can use iterators (see below) and drop these methods.
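To make the copy-on-write idea concrete, here is a small single-threaded sketch. `CowVector` is a made-up class, not SGVector, and a real implementation would need more care around concurrent writers:

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical copy-on-write vector: copies are cheap and share storage
// until one of them is written to.
class CowVector
{
public:
    explicit CowVector(std::vector<double> values)
        : m_data(std::make_shared<std::vector<double>>(std::move(values)))
    {
    }

    std::size_t size() const { return m_data->size(); }
    double get(std::size_t i) const { return m_data->at(i); }

    // A write first detaches from all other owners by copying the storage.
    // Checking use_count() like this is only safe in single-threaded code.
    void set(std::size_t i, double value)
    {
        if (m_data.use_count() > 1)
            m_data = std::make_shared<std::vector<double>>(*m_data);
        m_data->at(i) = value;
    }

    bool shares_storage_with(const CowVector& other) const
    {
        return m_data == other.m_data;
    }

private:
    std::shared_ptr<std::vector<double>> m_data;
};
```

With such a type, handing out the feature matrix no longer exposes the features' own memory to modification: the caller's first write triggers a private copy.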

Iterators instead of direct memory access.

Features should not offer direct access to the underlying memory, i.e. the feature vector for CDenseFeatures. The reason is that such access requires knowing the basic word size (float32, float64) of the underlying data, which would complicate the algorithm code (which should be independent of the word size). As a consequence, we would like to remove all methods that return vectors/matrices/etc. This is a long list of changes, and we need to start by collecting all cases and discussing which ones to change first.

Instead, we would like to perform computation over features using an API based on iterators. We have already made a few transitions in this direction, for example see Perceptron. Have a look here for inspiration.
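The following sketch shows the general shape of such an API. The names `DenseFeatures` and `apply_linear` are placeholders and not the existing Shogun classes; the point is that the algorithm is written once against iteration and never names float32 or float64.

```cpp
#include <cassert>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical dense features that expose their vectors via iteration only.
template <typename T>
class DenseFeatures
{
public:
    using const_iterator =
        typename std::vector<std::vector<T>>::const_iterator;

    explicit DenseFeatures(std::vector<std::vector<T>> vectors)
        : m_vectors(std::move(vectors))
    {
    }

    const_iterator begin() const { return m_vectors.begin(); }
    const_iterator end() const { return m_vectors.end(); }

private:
    std::vector<std::vector<T>> m_vectors;
};

// Algorithm code: computes w^T x + b for every feature vector. It works for
// any feature type that can be iterated, whatever the word size.
template <typename Features>
std::vector<double> apply_linear(const Features& features,
                                 const std::vector<double>& weights,
                                 double bias)
{
    std::vector<double> outputs;
    for (const auto& feature_vector : features)
    {
        assert(feature_vector.size() == weights.size());
        outputs.push_back(std::inner_product(
            feature_vector.begin(), feature_vector.end(), weights.begin(),
            bias));
    }
    return outputs;
}
```

The same `apply_linear` can be called with `DenseFeatures<float>` and `DenseFeatures<double>`; the accumulation happens in double because the initial value passed to `std::inner_product` is a double.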

Cross-validation

Once immutable features are done, we would really like to see a multithreaded version of cross-validation implemented, using shared memory for the features (no cloning), see here. Here is a to-do list for this task.

Something along the lines of the following (could be more elegant, but this is to illustrate the purpose):

#pragma omp parallel for
for (int i = 0; i < num_folds; ++i)
{
  auto [inds_train, inds_test] = splitting->fold(i); // generate train/test indices
  auto feats_train = features->view(inds_train); // returns training view instance (thread-safe, no data copying)
  // ... same for validation set & labels

  auto fold_machine = machine->clone(); // clones learning machine instance (without data), cheap
  fold_machine->fit(feats_train, labels_train); // non-const call, but on my own instance, so thread-safe
  result[i] = evaluation(fold_machine->predict(feats_test), labels_test);
}
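For a version that actually compiles, here is a toy stand-in. Everything in it is an assumption made for illustration: `MeanMachine` replaces a real learning machine, the shared read-only data is just a label vector, folds are assigned round-robin, and the evaluation is the mean squared error.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a learning machine: predicts the training mean.
struct MeanMachine
{
    double mean = 0.0;

    void fit(const std::vector<double>& labels,
             const std::vector<std::size_t>& inds)
    {
        double sum = 0.0;
        for (std::size_t i : inds)
            sum += labels[i];
        mean = inds.empty() ? 0.0 : sum / inds.size();
    }

    double predict() const { return mean; }
};

// Returns the mean squared error of each fold.
std::vector<double> cross_validate(const std::vector<double>& labels,
                                   int num_folds)
{
    std::vector<double> result(num_folds);
    const MeanMachine machine{}; // prototype, only ever copied
    const std::size_t k = static_cast<std::size_t>(num_folds);

    // If OpenMP is not enabled, the pragma is ignored and the loop runs
    // serially with the same result.
    #pragma omp parallel for
    for (int fold = 0; fold < num_folds; ++fold)
    {
        // Index lists act as views: the label data itself is never copied.
        std::vector<std::size_t> inds_train, inds_test;
        for (std::size_t i = 0; i < labels.size(); ++i)
        {
            if (i % k == static_cast<std::size_t>(fold))
                inds_test.push_back(i);
            else
                inds_train.push_back(i);
        }

        MeanMachine fold_machine = machine;   // cheap per-fold "clone"
        fold_machine.fit(labels, inds_train); // writes only to its own state

        double squared_error = 0.0;
        for (std::size_t i : inds_test)
        {
            const double diff = fold_machine.predict() - labels[i];
            squared_error += diff * diff;
        }
        // Each iteration writes to its own slot, so no locking is needed.
        result[fold] =
            inds_test.empty() ? 0.0 : squared_error / inds_test.size();
    }
    return result;
}
```

The loop body only reads the shared data and only writes to objects it owns, which is exactly the property that immutable features are meant to guarantee.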

Split implementation from header files

Many parts of Shogun are split into .cpp and .h, which makes compilation/development much easier: changes in the implementation of a low-level data structure do not cause the whole project to be re-compiled. There are many cases, however, where this is not done (especially in templated code). This part of the project can serve as a nice initial contribution.

Optional

More topics that one could work on include: serialization, smart pointers, using std:: instead of our own data structures, and more. Let us know if something in particular is of interest to you. We might also change things around while the project is running ;)

Why this is cool

Cleaning up the internal APIs of Shogun will give you a huge exposure to advanced software concepts, and you can be sure to learn a lot about API design, algorithms, and good practices in software development. The project will make it much easier to develop clean code within Shogun, and as such make the project more attractive for scientists to implement their work in.
