GSoC_2020_project_kaggle

Shogun as a pipeline for supervised learning competitions

Mentors

Difficulty & Requirements

Medium. The difficulty depends largely on how ambitious you are in this project.

You need to know

  • Shogun's (C++) API and how to apply it
  • Shogun's internals (features, machine, model-selection, evaluation)
  • Machine Learning basics (supervised learning, ensemble methods!)
  • How Kaggle works (experience is a plus)

Description

Wouldn't it be cool if applying Shogun to Kaggle problems (or other prediction competitions) was really straightforward? If combining various feature representations & models was just one click away, and selecting the best ones was all done automatically while you grab a cup of coffee?

In fact, Shogun can do (almost) everything that is required to win a data-science competition (it did so many times in the past, though before Kaggle was founded ;) ). This project's goal is to take all the individual parts and glue them together, that is, to improve automation for building and evaluating combinations of features and models. Note that we want to integrate and improve rather than extend Shogun: only if it turns out that we badly need a particular algorithm will we consider adding it.
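
To make this concrete, below is a minimal, purely illustrative sketch in plain Python/NumPy of the kind of automation we mean: enumerate combinations of feature representations and models, evaluate each, and keep the best. The toy data, the representations and the nearest-mean "models" are all made-up stand-ins; designing the real, Shogun-backed API for this workflow is exactly the work of the project.

```python
# Toy "one click" workflow: try every representation/model combination on a
# holdout split and keep the best. None of this is Shogun API; it only
# illustrates the automation the project should provide.
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data: two informative dimensions plus noise.
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Two "feature representations": raw features and raw plus squared features.
representations = {
    "raw": lambda A: A,
    "squared": lambda A: np.hstack([A, A ** 2]),
}

# Two toy "models": nearest class mean under different distance exponents.
def nearest_mean_factory(p):
    def fit_predict(X_tr, y_tr, X_te):
        means = np.stack([X_tr[y_tr == c].mean(axis=0) for c in (0, 1)])
        d = np.abs(X_te[:, None, :] - means[None, :, :]) ** p
        return d.sum(axis=2).argmin(axis=1)
    return fit_predict

models = {"L1-mean": nearest_mean_factory(1), "L2-mean": nearest_mean_factory(2)}

# "One click": evaluate every combination and keep the best one.
tr, te = slice(0, 150), slice(150, None)
scores = {}
for r_name, rep in representations.items():
    for m_name, model in models.items():
        pred = model(rep(X[tr]), y[tr], rep(X[te]))
        scores[(r_name, m_name)] = (pred == y[te]).mean()

best = max(scores, key=scores.get)
print(f"best combination: {best}, holdout accuracy {scores[best]:.2f}")
```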

Finally, we want to reproduce some of the best performing models out there and show how easy this is with Shogun.

As for all other projects, a Docker image, a Jupyter notebook and meta examples are required!

Is this project for you?

Have you participated in data-science competitions and know how painful it can be to build frameworks for supervised learning? Have you written crappy Python pipelines and thought: "It would be cool if someone did this properly"? Then you are our student!

We are looking for hackers who are fluent (proven!) in data-science workflows ... and especially for those who are not afraid of digging into the Shogun framework and fixing problems on the fly (there will be problems :) ).

While we have a basic list of requirements and aims for this project, we highly encourage you to bring your own ideas; these could, for example, be concrete prediction problems. Your own ideas are a big plus for your application!

Note: we expect a large number of applicants for this project.

Details

The project will touch all of the below topics. In some cases Shogun can already do the things listed here, in others some work is required to enable them. In any case, we need a pretty API to build workflows.

  • Data IO: Reading (multiple, big) files efficiently, understanding various data formats, streaming data (a small streaming sketch follows this list).

  • Features: Extracting features from raw data, reducing dimensionality, stacking multiple features of different origins, selecting a few features from many.

  • Models: Do all the algorithms that Shogun implements work in this pipeline? Do we want to add particular new ones? How do our algorithms scale? Do we need to manage computational load and, if so, how (sub-sampling, divide-and-conquer approaches)?

  • Ensembles: Most Kaggle competitions are won by ensemble methods; they seem to be the secret weapon (see the averaging sketch after this list).

  • Model selection: Shogun can already use cross-validation to learn hyperparameters via grid- and gradient-based approaches. However, this needs some integration to be useful in automated frameworks. Plus, there is the question of whether all algorithms work with our model-selection framework (a cross-validated grid-search sketch follows this list).

  • Parallelisation: Most CPU cycles in such competitions go into model selection, which should therefore be done in a parallelised manner (see the parallelisation sketch after this list). We can imagine future projects that take the developed infrastructure and embed it into distributed computing environments.
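
For the Data IO point above, here is a small streaming sketch in plain Python (not an existing Shogun interface): read a large CSV file in fixed-size chunks instead of loading it at once. The file name and the update_statistics helper in the usage comment are hypothetical.

```python
# Stream a big CSV file in chunks of rows; assumes all non-header columns are numeric.
import csv
from typing import Iterator

import numpy as np

def stream_csv_chunks(path: str, chunk_rows: int = 10_000) -> Iterator[np.ndarray]:
    """Yield the numeric rows of `path` as float arrays of at most `chunk_rows` rows."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header line
        chunk = []
        for row in reader:
            chunk.append([float(v) for v in row])
            if len(chunk) == chunk_rows:
                yield np.asarray(chunk)
                chunk = []
        if chunk:
            yield np.asarray(chunk)

# Usage (hypothetical file and helper): accumulate running statistics without
# holding the full file in memory.
# for block in stream_csv_chunks("train.csv"):
#     update_statistics(block)
```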
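
For the Ensembles point above, a tiny illustration (plain NumPy, no Shogun classes) of the simplest ensemble that often helps in competitions: averaging the probability outputs of several already-trained models and thresholding the result. The three prediction vectors are made-up example numbers.

```python
# Average per-model probability predictions and return hard 0/1 labels.
import numpy as np

def average_ensemble(prediction_sets: list[np.ndarray]) -> np.ndarray:
    stacked = np.stack(prediction_sets)          # shape: (n_models, n_samples)
    return (stacked.mean(axis=0) >= 0.5).astype(int)

# Usage with three hypothetical model outputs on the same test set:
p1 = np.array([0.9, 0.2, 0.6])
p2 = np.array([0.8, 0.4, 0.4])
p3 = np.array([0.7, 0.1, 0.55])
print(average_ensemble([p1, p2, p3]))  # -> [1 0 1]
```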
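
For the Model selection point above, a generic sketch of cross-validated grid search. It is written against a hypothetical train(X, y, **params) callable that returns a prediction function, not against Shogun's actual model-selection classes; it only illustrates the loop that an automated framework needs to run.

```python
# Exhaustive grid search with K-fold cross-validation over a hypothetical
# `train` callable; returns the best parameter combination and its mean score.
import itertools

import numpy as np

def grid_search_cv(train, X, y, grid, n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    best_params, best_score = None, -np.inf
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        scores = []
        for k in range(n_folds):
            val_idx = folds[k]
            tr_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            predict = train(X[tr_idx], y[tr_idx], **params)
            scores.append(np.mean(predict(X[val_idx]) == y[val_idx]))
        if np.mean(scores) > best_score:
            best_params, best_score = params, np.mean(scores)
    return best_params, best_score

# Example with a dummy "model" that ignores its hyperparameter:
# X, y = np.random.default_rng(1).normal(size=(60, 3)), np.arange(60) % 2
# print(grid_search_cv(lambda A, b, c: (lambda B: np.zeros(len(B), dtype=int)),
#                      X, y, {"c": [0.1, 1.0]}))
```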
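
For the Parallelisation point above, a sketch of spreading independent candidate evaluations over processes with Python's standard concurrent.futures. The evaluate_candidate function is a dummy stand-in for "train and cross-validate one feature/model/parameter combination".

```python
# Evaluate independent model-selection candidates in parallel processes.
from concurrent.futures import ProcessPoolExecutor

def evaluate_candidate(candidate):
    # Placeholder: in a real pipeline this would run one full cross-validation.
    return candidate, sum(candidate) % 7  # dummy "score"

def parallel_model_selection(candidates, max_workers=4):
    """Evaluate all candidates in parallel and return the best-scoring one."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(evaluate_candidate, candidates))
    return max(results, key=lambda pair: pair[1])

if __name__ == "__main__":
    grid = [(c, d) for c in range(1, 4) for d in range(1, 4)]
    print(parallel_model_selection(grid))
```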

Waypoints and initial work

Before the project starts:

  • Make a proof of concept for a pipeline that does all the required steps with a minimal number of methods involved.
  • Condense a list of core methods that should work within the pipeline.
  • Identify potential problems with existing implementations.
  • Come up with an API to compose workflows.

While the project is running, starting from the minimal proof of concept, we can proceed iteratively and add methods one after another as we go, ensuring all parts work as expected.

Optional

There is an incredible number of things that could be added to the project: new algorithms, work on distributed computing, reproducing the results of a number of competitions, ideas for composing workflows, etc.

Why this is cool

Shogun already has a lot of the methods required to do well in Kaggle & Co. This project will make using them much easier. We also expect to stumble upon problems which - once they're fixed - improve the quality of Shogun! For you as a student, this project is cool because you get the chance to embed your data-science skills into an open-source project so that other people can benefit from them. You will touch most parts of the framework, so expect a very diverse hacking experience and expect to learn a lot about big code bases. A successful project will produce a showcase data-science workflow application that you can use to show off in interviews :)

We have long wanted to do this project and we are excited that this summer it is finally happening!

Useful resources
