
GSoC_2017_project_kaggle

Heiko Strathmann edited this page Feb 8, 2017 · 5 revisions

Shogun as a pipeline for supervised learning competitions

Mentors

Difficulty & Requirements

Medium. This depends heavily on how ambitious you are in this project.

You need to know

  • Shogun's (C++) API and how to apply it
  • Shogun's internals (features, machine, model-selection, evaluation)
  • Machine Learning basics (supervised learning, ensemble methods!)
  • How Kaggle works (experience is a plus)

Description

Wouldn't it be cool if applying Shogun to Kaggle problems (or other prediction competitions) was really straightforward? If combining various feature representations & models was just one click away, and selecting the best ones was all done automatically while you grab a cup of coffee?

In fact, Shogun can do (almost) everything that is required to win a data-science competition (it did so many times in the past, though before Kaggle was founded ;) ). This project's goal is to take all the individual parts and glue them together, that is, to improve automation for building and evaluating combinations of features and models. Note that we want to integrate and improve rather than extend Shogun. If it turns out that we badly need a particular algorithm, we might think about adding it.

Finally, we want to reproduce some of the best performing models out there, and show how easy that is with Shogun. Docker image & jupyter notebook & meta examples required!

Is this project for you?

You have participated in data-science competitions and know how painful it can be to build frameworks for supervised learning? You have written crappy Python pipelines and have had the thought: "it would be cool if someone did this properly". Then you are our student!

We are looking for hackers who are fluent (proven!) in data-science workflows ... and especially for those who are not afraid of digging into the Shogun framework and fixing problems on the fly (there will be problems).

We have a basic list of requirements for this project, but your own ideas are extremely welcome and a big plus for your application. These could, for example, be concrete prediction problems.

Note: we expect a larger number of applicants for this project.

Details

The project will definitely touch upon all of the topics below. In some cases Shogun can already do the things listed here; in others it might require some work to enable them. In any case, we need a pretty API to build workflows.

  • Data IO. Reading (multiple big) files efficiently, understanding various data formats, and streaming data.

  • Features. Extract features from raw data, reduce dimensionality, stack multiple features of different origin, select a few features from many.

  • Models. Do all the algorithms that Shogun implements work in this pipeline? Do we want to add particular new ones? How do our algorithms scale? We probably need to manage computational load (sub-sampling, divide-and-conquer approaches).

  • Ensembles. Most Kaggle competitions are won by ensemble methods. There seem to be secret weapons.

  • Model selection. Shogun can already use cross-validation to learn hyperparameters using grid search and gradient-based approaches. This will need some integration work to be useful in automated frameworks. There is also the open question of whether all algorithms work with our model-selection framework.

  • Parallelisation. Most CPU cycles in such competitions go into model selection, so that is the part that should be parallelised. We can imagine future projects that take the developed infrastructure and embed it into distributed computing environments.
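To make the model-selection and parallelisation points concrete, here is a minimal sketch of a grid search scored by k-fold cross-validation, with the candidate parameter settings evaluated concurrently. It is plain Python and does not use Shogun's API: the names `k_fold_indices`, `fit_ridge_1d`, `cv_error` and `grid_search` are hypothetical, and the one-dimensional ridge regression is only a stand-in for a real model. A thread pool is used to keep the example portable; for CPU-bound training, a process pool or the library's own parallelism would be the more realistic choice.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def k_fold_indices(n, k):
    """Split range(n) into k contiguous (train, test) index pairs."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


def fit_ridge_1d(xs, ys, lam):
    """Stand-in model: 1D ridge regression without intercept (closed form)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def cv_error(params, xs, ys, k=5):
    """Mean squared error of one parameter setting, averaged over k folds."""
    errors = []
    for train, test in k_fold_indices(len(xs), k):
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train],
                         params["lam"])
        errors.append(sum((ys[i] - w * xs[i]) ** 2 for i in test) / len(test))
    return sum(errors) / len(errors)


def grid_search(grid, xs, ys, k=5, workers=4):
    """Evaluate every combination in `grid` concurrently; return the best."""
    names = sorted(grid)
    candidates = [dict(zip(names, values))
                  for values in product(*(grid[name] for name in names))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda p: cv_error(p, xs, ys, k), candidates))
    best = min(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


if __name__ == "__main__":
    xs = [float(i) for i in range(1, 11)]
    ys = [2.0 * x for x in xs]  # noise-free toy data
    print(grid_search({"lam": [0.0, 1.0, 100.0]}, xs, ys))
```

Each candidate is independent of the others, which is why this step parallelises so easily; the same structure would carry over to a distributed setting by swapping the executor.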

Waypoints and initial work

Before the project starts:

  • Proof of concept for a pipeline that does all the required steps with a minimal number of methods involved.
  • Condense a list of core methods that should work within the pipeline.
  • Identify potential problems with existing implementations
  • Come up with an API to compose workflows
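As a starting point for discussing a workflow API, the following is one possible shape for composing steps, sketched in plain Python. Nothing here is Shogun's API: `Pipeline`, `Standardise` and `LeastSquares1D` are hypothetical names, and the convention that preprocessing steps expose `fit`/`transform` while the final model exposes `fit`/`predict` is an assumption made only for illustration.

```python
class Standardise:
    """Hypothetical preprocessing step: centre and scale 1D inputs."""

    def fit(self, xs, ys=None):
        self.mean = sum(xs) / len(xs)
        var = sum((x - self.mean) ** 2 for x in xs) / len(xs)
        self.std = var ** 0.5 or 1.0  # fall back to 1.0 for constant input
        return self

    def transform(self, xs):
        return [(x - self.mean) / self.std for x in xs]


class LeastSquares1D:
    """Hypothetical final model: y = w * x + b fitted by least squares."""

    def fit(self, xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        self.w = sxy / sxx if sxx else 0.0
        self.b = my - self.w * mx
        return self

    def predict(self, xs):
        return [self.w * x + self.b for x in xs]


class Pipeline:
    """Chain preprocessing steps and a final model behind one interface."""

    def __init__(self, steps, model):
        self.steps = list(steps)
        self.model = model

    def fit(self, xs, ys):
        # Fit each step on the output of the previous one, then the model.
        for step in self.steps:
            xs = step.fit(xs, ys).transform(xs)
        self.model.fit(xs, ys)
        return self

    def predict(self, xs):
        # Reuse the fitted steps; never refit on prediction data.
        for step in self.steps:
            xs = step.transform(xs)
        return self.model.predict(xs)


if __name__ == "__main__":
    pipeline = Pipeline([Standardise()], LeastSquares1D())
    pipeline.fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
    print(pipeline.predict([4.0]))
```

Because the whole pipeline is itself an object with `fit` and `predict`, it can be handed to a cross-validation or model-selection routine like any single model, which is the property an automated framework would rely on.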

While the project is running, starting from the minimal proof of concept, we can proceed iteratively and add methods one after another, ensuring all parts work as expected.

Optional

There is an incredible number of things that could be added to the project: new algorithms, work on distributed computing, reproducing the results of a number of competitions, ideas for composing workflows, etc.

Why this is cool

Shogun already has a lot of the methods required to work well in Kaggle & co. This project will make using them much easier. We also expect to find many problems (to be fixed), which is good for Shogun! For you students, this project is cool since you get a chance to embed your data-science skills into an open-source project so that other people can benefit from them. You will touch most parts of the framework, so expect a very diverse hacking experience, and expect to learn a lot about big code bases. A successful project will produce a showcase of a data-science workflow application, which you can show off in interviews :)

We have wanted to do this project for a long time, so we are excited to finally go ahead with it.

Useful resources
