GSoC_2017_project_kaggle

Shogun as a pipeline for supervised learning competitions

Mentors

Heiko (github: karlnapf, IRC: HeikoS)
Viktor (github: vigsterkr, IRC: wiking)
Soumyajit (github: lambday, IRC: lambday)

Difficulty & Requirements

Medium. The difficulty depends heavily on how ambitious you are in this project.

You need to know

Shogun's (C++) API and how to apply it
Shogun's internals (features, machine, model-selection, evaluation)
Machine Learning basics (supervised learning, ensemble methods!)
How Kaggle works (experience is a plus)

Description

Wouldn't it be cool if applying Shogun to Kaggle problems (or other prediction competitions) was really straightforward? If combining various feature representations & models was just one click away, and selecting the best ones was all done automatically while you grab a cup of coffee?

In fact, Shogun can do (almost) everything that is required to win a data-science competition (it did so many times in the past, though before Kaggle was founded ;) ). This project's goal is to take all the individual parts and glue them together. That is, to improve automation for building and evaluating combinations of features and models. Note that we want to integrate and improve rather than extend Shogun. Only if it turns out that we badly need a particular algorithm will we think about adding it.

Finally, we want to reproduce some of the best performing models out there and show how easy this is with Shogun.

As with all other projects, a Docker image, a Jupyter notebook, and meta examples are required!

Is this project for you?

You have participated in data-science competitions and know how painful it can be to build frameworks for supervised learning? You have written crappy Python pipelines and thought: "It would be cool if someone did this properly"? Then you are our student!

We are looking for hackers who are fluent (proven!) in data-science workflows ... and especially for those who are not afraid of digging into the Shogun framework and fixing problems on the fly (there will be problems :) ).

While we have a basic list of requirements and aims for this project, we highly encourage you to bring your own ideas. These could, for example, be concrete prediction problems. Your own ideas are a big plus for your application!

Note: we expect a large number of applicants for this project.

Details

The project will touch all of the below topics. In some cases Shogun can already do the things listed here, in others some work is required to enable them. In any case, we need a pretty API to build workflows.

Data IO: Read (multiple big) files efficiently, understand various data formats, stream data.
Features: Extract features from raw data, reduce dimensionality, stack multiple features of different origin, select a few features from many.
Models: Do all the algorithms that Shogun implements work in this pipeline? Do we want to add particular new ones? How do our algorithms scale? We will probably need to manage computational load (sub-sampling, divide-and-conquer approaches).
Ensembles: Most Kaggle competitions are won by ensemble methods. These seem to be the secret weapons.
Model selection: Shogun can already use cross-validation to learn hyperparameters with grid-search and gradient-based approaches. This will need some integration work to be useful in automated frameworks. There is also the question of whether all algorithms work with our model-selection framework.
Parallelisation: Most CPU cycles in such competitions go into model selection, so that is what should be parallelised. We can imagine future projects that take the developed infrastructure and embed it into distributed computing environments.
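To make the model-selection and parallelisation points concrete, here is a minimal sketch in plain Python of a grid search whose grid points are cross-validated concurrently. It does not use Shogun's API: the toy 1-D ridge model and the names `k_fold_indices`, `fit_ridge_1d`, `cv_error` and `grid_search` are all hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs for contiguous k-fold splits."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def fit_ridge_1d(xs, ys, lam):
    """Toy model (assumption): 1-D ridge regression without intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_error(lam, xs, ys, n_folds=5):
    """Mean squared error of the toy model, averaged over all folds."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(xs), n_folds):
        w = fit_ridge_1d([xs[i] for i in train_idx],
                         [ys[i] for i in train_idx], lam)
        fold_errors.append(mean([(ys[i] - w * xs[i]) ** 2 for i in test_idx]))
    return mean(fold_errors)


def grid_search(grid, xs, ys, n_workers=4):
    """Evaluate every grid point concurrently; return (best_lam, scores)."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        errors = list(pool.map(lambda lam: cv_error(lam, xs, ys), grid))
    scores = dict(zip(grid, errors))
    return min(scores, key=scores.get), scores


# Toy data: y = 2x exactly, so the unregularised fit should win.
xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x for x in xs]
best_lam, scores = grid_search([0.0, 1.0, 10.0, 100.0], xs, ys)
```

Each grid point is independent of the others, which is why model selection parallelises so easily. A thread pool is used here only to keep the sketch short; for CPU-bound pure-Python work a process pool, or parallelism inside the native library, would be the more realistic choice.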

Waypoints and initial work

Before the project starts:

Proof of concept for a pipeline that does all the required steps with a minimal number of methods involved.
Condense a list of core methods that should work within the pipeline.
Identify potential problems with existing implementations.
Come up with an API to compose workflows.
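As a starting point for discussion, a workflow-composition API could look roughly like the following plain-Python sketch. Everything here is an assumption made for illustration: `Pipeline`, `Standardise` and `NearestMean` are hypothetical names, not existing Shogun classes, and the fit/transform/predict convention is just one possible design.

```python
from statistics import mean, pstdev


class Standardise:
    """Feature step: rescale a 1-D feature to zero mean and unit variance."""

    def fit(self, xs, ys=None):
        self.mu = mean(xs)
        self.sigma = pstdev(xs) or 1.0  # guard against constant features
        return self

    def transform(self, xs):
        return [(x - self.mu) / self.sigma for x in xs]


class NearestMean:
    """Model step: assign each point to the class with the closest mean."""

    def fit(self, xs, ys):
        self.means = {label: mean([x for x, y in zip(xs, ys) if y == label])
                      for label in set(ys)}
        return self

    def predict(self, xs):
        return [min(self.means, key=lambda label: abs(x - self.means[label]))
                for x in xs]


class Pipeline:
    """Chain transformer steps and finish with a single model."""

    def __init__(self, steps, model):
        self.steps = steps
        self.model = model

    def fit(self, xs, ys):
        # Fit each step on the output of the previous one, then the model.
        for step in self.steps:
            xs = step.fit(xs, ys).transform(xs)
        self.model.fit(xs, ys)
        return self

    def predict(self, xs):
        # Apply the already-fitted steps, then ask the model for labels.
        for step in self.steps:
            xs = step.transform(xs)
        return self.model.predict(xs)


pipeline = Pipeline(steps=[Standardise()], model=NearestMean())
pipeline.fit([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0, 0, 0, 1, 1, 1])
predictions = pipeline.predict([2.5, 9.0])
```

The point of such a design is that the whole pipeline exposes the same fit/predict interface as a single model, so it could itself be passed to cross-validation or used as a member of an ensemble.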

While the project is running, starting from the minimal proof of concept, we can proceed iteratively and add methods one after another, ensuring all parts work as expected.

Optional

There is an incredible number of things that could be added to the project: new algorithms, working on distributed computing, reproducing the results of a number of competitions, ideas for composing workflows, etc.

Why this is cool

Shogun already has a lot of the methods required to work well in Kaggle & co. This project will make using them much easier. We also expect to find many problems (to be fixed), which is good for Shogun! For you students, this project is cool since you get a chance to embed your data-science skills into an open-source project so that other people can benefit from them. You will touch most parts of the framework, so expect a very diverse hacking experience, and expect to learn a lot about big code bases. A successful project will produce a showcase of a data-science workflow application, which you can show off in interviews :)

We have wanted to do this project for a long time, so we are excited that we finally decided to do it.

Useful resources
