You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
After an online session with @njdussai, I'm will try to summarize some issues in our current Jupyter pipeline, and how I think this could be improved:
Issues
Bloated dependencies
In theory only scipy, numpy and sklearn are direct dependencies of the Open-SIR package. We are currently pulling more packages than needed. Besides adding more "moving pieces" (e.g in @njdussai's laptop there were some issues with the Pipenv's Jupyter kernel), there are some problems of this approach:
Packages and its dependencies can be quite heavy. For instance, the pandas package downloads ~200 Mb of data. While this is OK for modern computers, we should take into consideration that several countries have an average internet speed below 10 Mbps. Also, the CI is getting slower and slower and the Docker images are heavier.
Unless there's a core feature that requires a dependency (for instance a plot function in the model class), IMO the user should have full control of what can be used for doing a certain task or not. There are valid reasons for this (a user might now want to upgrade a package to keep old support, or might use the opensir core in a different way). Specially if opensir may have several use cases (web users might require to install d3, some other people might want to use xlrd to export excel data, etc).
Scope of Jupyter notebooks
I'm a big fan of Jupyter notebooks for documentations and having an initial contact with the Open-SIR model. However, I think it would be beneficial for the end users and developers to define the scope of the Jupyter notebooks to something "production ready". IMO "core" notebooks should be used exclusively for documentation/teaching purposes, given that:
It's very hard to keep Jupyter notebooks up to standards. There's no way to run the linter there and there are a lot of non-secuential-execution bugs. Just today, I realized both Jupyter notebooks were broken (the SIR had a import models instead of import opensir.models and the SIR-X had a missing import sys in the beginning).
See this article that shows some advantages and disadvantages of notebooks.
Proposed solutions
In order to target to different kind of users, simplify the usage and reduce the maintenance burden, I propose this:
Define the "minimal" dependencies for the OpenSIR package (pip install opensir).
This means, only the dependencies required to run the public API.
Prefer well known platforms for prototype notebooks
The SIR and SIR-X notebooks are good examples of official documentation (nbsphinx generate pages from those).
For prototyping and certain tutorials, I think we should consider using well known platforms like Kaggle for the following reasons:
Kaggle is a well known community. Adding prototype notebooks there might attract potential users of Open-SIR, or more people wanting to contribute with notebooks.
Most prototypes don't require the same coding standards as "core" code. Thus, maintaining and adding new prototypes would become much easier.
Of course, we would need to point to this documentation from the Sphinx and README, so users can jump directly to the workshops.
Add an explicit dependency to "opensir" and other packages in Jupyter notebooks and not the other way around
As discussed in another PR, we shouldn't encourage the end user to run the setup using Pipenv, since it's only intended to be used for development or fixed production environments. In practice, users without Python 3.7 cannot use Open-SIR with Pipenv.
Bloated dependencies: I agree. Actually I was trying to have a look whether we can remove sklearn for the resample function in the confidence intervals, and implement our own. But it became clear that it was quite complex, so let's keep sklearn.
Agree! let's start with Kaggle as non-default packages can be installed as long as they are in Pypi:
I think the notebook in #56, adding more comments, would be an excellent way to showcase Open-SIR. @gwenzel, have you ever used the Kaggle environment to install non-default packages? As dockerfiles can be provided, the notebooks can remain reproducible :)
The example that you posted motivates me to create a separate repository of tutorials and advanced applications of open-sir. This could serve as an interface, and many users would be able to contribute to that one as it will be far easier to contribute than to pass the CircleCI here for prototyping. Additionally, it will have more visibility, and will contain more "finished products" notebooks that can be git-cloned run immediately, so the lazy user can change the name of a country and look at the predictions.
Bloated dependencies: I agree. Actually I was trying to have a look whether we can remove sklearn for the resample function in the confidence intervals, and implement our own. But it became clear that it was quite complex, so let's keep sklearn.
Scipy already has a function for resampling (from scipy.signal import resample). We could use that one.
Agree! let's start with Kaggle as non-default packages can be installed as long as they are in Pypi:
Ok, good! Then we should have the Pypi package soon :)
The example that you posted motivates me to create a separate repository of tutorials and advanced applications of open-sir. This could serve as an interface, and many users would be able to contribute to that one as it will be far easier to contribute than to pass the CircleCI here for prototyping. Additionally, it will have more visibility, and will contain more "finished products" notebooks that can be git-cloned run immediately, so the lazy user can change the name of a country and look at the predictions.
@jia200x I saw that but it's not the same resample. The resample from scipy is for time-series to change the frequency. The one that we are looking for is to perform a random resampling with replacement. Thanks for checking though!
After an online session with @njdussai, I'm will try to summarize some issues in our current Jupyter pipeline, and how I think this could be improved:
Issues
Bloated dependencies
In theory only
scipy
,numpy
andsklearn
are direct dependencies of the Open-SIR package. We are currently pulling more packages than needed. Besides adding more "moving pieces" (e.g in @njdussai's laptop there were some issues with the Pipenv's Jupyter kernel), there are some problems of this approach:pandas
package downloads ~200 Mb of data. While this is OK for modern computers, we should take into consideration that several countries have an average internet speed below 10 Mbps. Also, the CI is getting slower and slower and the Docker images are heavier.plot
function in the model class), IMO the user should have full control of what can be used for doing a certain task or not. There are valid reasons for this (a user might now want to upgrade a package to keep old support, or might use the opensir core in a different way). Specially if opensir may have several use cases (web users might require to installd3
, some other people might want to usexlrd
to export excel data, etc).Scope of Jupyter notebooks
I'm a big fan of Jupyter notebooks for documentations and having an initial contact with the Open-SIR model. However, I think it would be beneficial for the end users and developers to define the scope of the Jupyter notebooks to something "production ready". IMO "core" notebooks should be used exclusively for documentation/teaching purposes, given that:
import models
instead ofimport opensir.models
and the SIR-X had a missingimport sys
in the beginning).See this article that shows some advantages and disadvantages of notebooks.
Proposed solutions
In order to target to different kind of users, simplify the usage and reduce the maintenance burden, I propose this:
Define the "minimal" dependencies for the OpenSIR package (
pip install opensir
).This means, only the dependencies required to run the public API.
Prefer well known platforms for prototype notebooks
The SIR and SIR-X notebooks are good examples of official documentation (
nbsphinx
generate pages from those).For prototyping and certain tutorials, I think we should consider using well known platforms like Kaggle for the following reasons:
Of course, we would need to point to this documentation from the Sphinx and README, so users can jump directly to the workshops.
Add an explicit dependency to "opensir" and other packages in Jupyter notebooks and not the other way around
As discussed in another PR, we shouldn't encourage the end user to run the setup using Pipenv, since it's only intended to be used for development or fixed production environments. In practice, users without Python 3.7 cannot use Open-SIR with Pipenv.
Given that, it's totally OK to require explicit dependencies in Jupyter notebooks. This patterns is quite common (see e.g Kalman and Bayesian filter notebooks dependency on PyFilter)
I think the following usage for a first timer is quite standard:
More over, there will be cases where we will need a strict dependency with a specific Open SIR version.
For instance:
Regarding the Sphinx documentation, it is always possible to run "pip install opensir" in the CI before running
nbsphinx
.All feedback is more than welcome. Please let me know what do you think.
Have a nice week!
The text was updated successfully, but these errors were encountered: