Bloated dependencies and scope of Jupyter notebooks #59

jia200x · 2020-04-14T23:28:32Z

After an online session with @njdussai, I'm will try to summarize some issues in our current Jupyter pipeline, and how I think this could be improved:

Issues

Bloated dependencies

In theory only scipy, numpy and sklearn are direct dependencies of the Open-SIR package. We are currently pulling more packages than needed. Besides adding more "moving pieces" (e.g in @njdussai's laptop there were some issues with the Pipenv's Jupyter kernel), there are some problems of this approach:

Packages and its dependencies can be quite heavy. For instance, the pandas package downloads ~200 Mb of data. While this is OK for modern computers, we should take into consideration that several countries have an average internet speed below 10 Mbps. Also, the CI is getting slower and slower and the Docker images are heavier.
Unless there's a core feature that requires a dependency (for instance a plot function in the model class), IMO the user should have full control of what can be used for doing a certain task or not. There are valid reasons for this (a user might now want to upgrade a package to keep old support, or might use the opensir core in a different way). Specially if opensir may have several use cases (web users might require to install d3, some other people might want to use xlrd to export excel data, etc).

Scope of Jupyter notebooks

I'm a big fan of Jupyter notebooks for documentations and having an initial contact with the Open-SIR model. However, I think it would be beneficial for the end users and developers to define the scope of the Jupyter notebooks to something "production ready". IMO "core" notebooks should be used exclusively for documentation/teaching purposes, given that:

It's very hard to keep Jupyter notebooks up to standards. There's no way to run the linter there and there are a lot of non-secuential-execution bugs. Just today, I realized both Jupyter notebooks were broken (the SIR had a import models instead of import opensir.models and the SIR-X had a missing import sys in the beginning).
Reviewing a Jupyter notebook is hard. The "diffs" between consecutive notebooks are hard to trace. Jupyter notebooks have a lot of metadata that make them hard to read when reviewing (see https://github.com/open-sir/open-sir/pull/32/files#diff-c28928cf31250cfe43c74c5ed04e26a1R678)

See this article that shows some advantages and disadvantages of notebooks.

Proposed solutions

In order to target to different kind of users, simplify the usage and reduce the maintenance burden, I propose this:

Define the "minimal" dependencies for the OpenSIR package (`pip install opensir`).

This means, only the dependencies required to run the public API.

Prefer well known platforms for prototype notebooks

The SIR and SIR-X notebooks are good examples of official documentation (nbsphinx generate pages from those).

For prototyping and certain tutorials, I think we should consider using well known platforms like Kaggle for the following reasons:

Kaggle is a well known community. Adding prototype notebooks there might attract potential users of Open-SIR, or more people wanting to contribute with notebooks.
Most prototypes don't require the same coding standards as "core" code. Thus, maintaining and adding new prototypes would become much easier.

Of course, we would need to point to this documentation from the Sphinx and README, so users can jump directly to the workshops.

Add an explicit dependency to "opensir" and other packages in Jupyter notebooks and not the other way around

As discussed in another PR, we shouldn't encourage the end user to run the setup using Pipenv, since it's only intended to be used for development or fixed production environments. In practice, users without Python 3.7 cannot use Open-SIR with Pipenv.

Given that, it's totally OK to require explicit dependencies in Jupyter notebooks. This patterns is quite common (see e.g Kalman and Bayesian filter notebooks dependency on PyFilter)

I think the following usage for a first timer is quite standard:

pip install opensir

/* notebook A requires pandas */
pip install pandas
jupyter notebook notebook_A

More over, there will be cases where we will need a strict dependency with a specific Open SIR version.
For instance:

pip install opensir==1 0.3

/* Notebook B doesn't run with opensir 2.1.0 */
Jupyter notebook notebook_b

Regarding the Sphinx documentation, it is always possible to run "pip install opensir" in the CI before running nbsphinx.

All feedback is more than welcome. Please let me know what do you think.
Have a nice week!

The text was updated successfully, but these errors were encountered:

felipehuerta17 · 2020-04-17T15:53:06Z

Sorry for not replying before @jia200x

Bloated dependencies: I agree. Actually I was trying to have a look whether we can remove sklearn for the resample function in the confidence intervals, and implement our own. But it became clear that it was quite complex, so let's keep sklearn.
Agree! let's start with Kaggle as non-default packages can be installed as long as they are in Pypi:

https://www.kaggle.com/docs/notebooks#the-notebooks-environment

I think the notebook in #56, adding more comments, would be an excellent way to showcase Open-SIR. @gwenzel, have you ever used the Kaggle environment to install non-default packages? As dockerfiles can be provided, the notebooks can remain reproducible :)

The example that you posted motivates me to create a separate repository of tutorials and advanced applications of open-sir. This could serve as an interface, and many users would be able to contribute to that one as it will be far easier to contribute than to pass the CircleCI here for prototyping. Additionally, it will have more visibility, and will contain more "finished products" notebooks that can be git-cloned run immediately, so the lazy user can change the name of a country and look at the predictions.

jia200x · 2020-04-17T16:17:00Z

Bloated dependencies: I agree. Actually I was trying to have a look whether we can remove sklearn for the resample function in the confidence intervals, and implement our own. But it became clear that it was quite complex, so let's keep sklearn.

Scipy already has a function for resampling (from scipy.signal import resample). We could use that one.

Agree! let's start with Kaggle as non-default packages can be installed as long as they are in Pypi:

Ok, good! Then we should have the Pypi package soon :)

The example that you posted motivates me to create a separate repository of tutorials and advanced applications of open-sir. This could serve as an interface, and many users would be able to contribute to that one as it will be far easier to contribute than to pass the CircleCI here for prototyping. Additionally, it will have more visibility, and will contain more "finished products" notebooks that can be git-cloned run immediately, so the lazy user can change the name of a country and look at the predictions.

Exactly. I think that's the best way to go.

felipehuerta17 · 2020-04-17T16:43:14Z

@jia200x I saw that but it's not the same resample. The resample from scipy is for time-series to change the frequency. The one that we are looking for is to perform a random resampling with replacement. Thanks for checking though!

jia200x assigned leandrolanzieri, sasalatart, gwenzel and felipehuerta17 Apr 14, 2020

felipehuerta17 added the documentation Improvements or additions to documentation label Apr 17, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bloated dependencies and scope of Jupyter notebooks #59

Bloated dependencies and scope of Jupyter notebooks #59

jia200x commented Apr 14, 2020 •

edited

felipehuerta17 commented Apr 17, 2020

jia200x commented Apr 17, 2020

felipehuerta17 commented Apr 17, 2020

Bloated dependencies and scope of Jupyter notebooks #59

Bloated dependencies and scope of Jupyter notebooks #59

Comments

jia200x commented Apr 14, 2020 • edited

Issues