
Web Tool for Disease Incidence Estimation with Shogun

k4rth33k edited this page Mar 11, 2020 · 11 revisions

Disease outbreaks are likely to remain a major issue for our society. The Western Africa Ebola virus epidemic (2013-2016) and the Zika virus epidemic are recent evidence of this. However, the next pandemic could be caused by a more common virus, like the flu (as is happening with the current Coronavirus epidemic). Being able to estimate how these diseases spread and their incidence in the population is therefore important for preventing and containing a possible epidemic. Predicting how diseases spread among the population of a given country is an example of applying machine learning techniques to a real-world problem. Several studies have shown that it is possible to estimate the incidence of certain diseases by looking at social networks (e.g. Twitter) and/or other sources of information. Major companies have also provided their own forecasting tools (e.g. Google Flu Trends).

This project aims to replicate the work of McIver and Brownstein [1] using Shogun and to create an interactive tool that can be used to monitor influenza-like illness (ILI) in near real-time. In that paper, the authors show that the incidence of influenza-like illness in the USA can be estimated from the page views of certain Wikipedia articles.

Description

This GSoC project can be divided into three parts:

  • Develop a machine learning model by using Wikipedia's data with Shogun;
  • Expose the model to the internet with some REST API and provide a web interface to access the results;
  • Prepare a final presentation/demo to show the results of the project.

The first part aims to replicate the results of the original paper using Shogun. Its output will be a complete script/notebook that presents the results, with a data analysis/visualization section and a description of the models used, including their strengths and weaknesses. To get an idea of what it should look like, you can have a look at several good Kaggle Notebooks (for instance, this one here about predicting house prices). Ideally, the student will experiment with different techniques to see which one best fits our purposes. The first part must also produce a serialized version of the final Shogun model so that it can be loaded and used again. The final model will also be uploaded to OpenML to ensure reproducibility and to make it available to the community.
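As a rough illustration of the modelling and serialization steps described above, here is a minimal sketch in plain Python. It is a stand-in, not Shogun code and not the method of the original paper: it fits a one-feature ordinary least squares model and stores the fitted parameters as JSON. The function names (`fit_least_squares`, `predict`) are hypothetical, and the numbers are made-up toy values chosen only so the fit is easy to check by hand.

```python
import json


def fit_least_squares(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares (one feature)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("the feature values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def predict(model, x):
    """Apply a fitted (or reloaded) model to a new feature value."""
    return model["slope"] * x + model["intercept"]


# ASSUMPTION: made-up toy data, not real page views or ILI rates.
weekly_views = [1.0, 2.0, 3.0, 4.0]  # page views, in thousands
ili_rate = [3.0, 5.0, 7.0, 9.0]      # ILI rate in percent (exactly 2x + 1)

model = fit_least_squares(weekly_views, ili_rate)

# Serialize the fitted parameters and load them back, mimicking the
# "serialized model that can be loaded and used again" requirement.
restored = json.loads(json.dumps(model))
```

In the real project the fitting and serialization would be done with Shogun's own regression models and serialization facilities; the sketch only shows the shape of the workflow (fit, persist, reload, predict).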

In parallel with building the model, the student may also take on software development tasks within the library itself. Shogun may lack some features needed for the project (or it could even have some bugs! 😨), and those will need to be implemented or fixed. However, this will not be the primary focus of the project.

The second part of the project aims to expose this model to the internet through an API. The ideal outcome would be a Docker container that deploys the model. A web interface must be available to show the current estimated ILI levels by fetching data from the Wikipedia Pageview API [3]. This web interface could show a small plot of the current estimated levels, or a more complex visualization (see [4] for an example).
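The data-fetching step mentioned above could look roughly like the following sketch. It assumes the per-article route of the Wikimedia REST pageviews API and a response of the form `{"items": [{"timestamp": ..., "views": ...}, ...]}`; both should be checked against the Pageview API documentation [3]. The helper names, the default parameters, and the User-Agent string are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# ASSUMPTION: base path of the per-article pageviews route; verify against [3].
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(article, start, end, project="en.wikipedia", granularity="daily"):
    """Build the request URL; start and end are date strings such as YYYYMMDD."""
    title = quote(article.replace(" ", "_"), safe="")
    return f"{BASE}/{project}/all-access/user/{title}/{granularity}/{start}/{end}"


def parse_views(payload):
    """Map each timestamp in a decoded JSON response to its view count."""
    return {item["timestamp"]: item["views"] for item in payload.get("items", [])}


def fetch_views(article, start, end):
    """Perform the HTTP request (needs network access; not called here)."""
    request = Request(
        pageviews_url(article, start, end),
        headers={"User-Agent": "ili-demo/0.1 (hypothetical example)"},
    )
    with urlopen(request, timeout=10) as response:
        return parse_views(json.load(response))
```

A Flask endpoint in the deployed container could call a function like `fetch_views`, feed the counts to the reloaded model, and return the estimate as JSON for the web interface to plot.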

Finally, the student will be required to prepare a short presentation/demo of the project, which will be presented to the other students/mentors at the end of GSoC.

Mentors

Requirements

You need to know:

  • C++
  • Python
  • Shogun (just a little bit 😉 )
  • Machine Learning Basics (understanding of regression models)
  • Docker and Flask (basic level)
  • HTML/CSS/JavaScript (basic level)

If you already have experience in working on machine learning projects (e.g., previous open-source contributions, coursework, etc.) then it would be a plus, but it is not mandatory.

Bear in mind that the focus of this project will be on the machine learning application of Shogun. Therefore, you do not necessarily need to possess phantasmagorical frontend/backend skills.

Why this is cool

This project will give you the opportunity to apply machine learning techniques to a real-world problem. It will also let you advance your skills in several areas (programming, data analysis, data visualization, web servers, etc.). You will develop a full-fledged system that automatically exposes a trained machine learning model online. Moreover, this project could be used in the future to showcase your abilities and strengthen your applications for jobs or university admissions.

First Steps

The first step would be to read up on the topic and do a little research into the solutions that are already available. If you do not know where to start looking, you can consult the list of papers/projects shown at the end of the page.
You should then produce a plan describing which techniques you intend to use (this can be discussed directly with the mentors) and what you imagine the web application will look like. Be creative! Don't be afraid to propose new ideas. In the end, this will be YOUR project and the mentor(s) will just help you to make it a reality 😉

This project requires close collaboration between you and your future mentor(s). To increase your chances of being selected, start interacting with them as soon as possible to discuss your ideas and the details.

Useful Resources

  1. Wikipedia Usage Estimates Prevalence of Influenza-Like Illness in the United States in Near Real-Time
  2. Wikipedia's Pageview + Influenza Incidence in Europe Dataset
  3. Wikipedia Pageview API
  4. Interactive Coronavirus Map

Related works
