Proposal for a new HoloViz Datasets package #394
Comments
Great write-up! 'hvsampledata' would make it a bit clearer that this isn't data *about* any HoloViz thing, but sample datasets to work with alongside HoloViz tools. Although it's more of a mouthful, 'hvsampledata' would show up after hvplot in your editor :)
I like hvsampledata too. I'm wondering about the API options:
I'd also like to encourage discoverability within the IDE with autocomplete, e.g.
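The inline example above was lost in extraction; as an illustration only, here is a hedged sketch of what an autocomplete-friendly API could look like. The package name `hvsampledata`, the `penguins` callable, and the embedded rows are all hypothetical, not the actual proposal:

```python
# Hypothetical sketch only: one top-level callable per dataset, so typing
# `hvsampledata.<TAB>` in an IDE lists every available dataset. The
# package name, callable, and embedded data are invented for illustration.
import csv
import io

_PENGUINS_CSV = """\
species,bill_length_mm
Adelie,39.1
Gentoo,47.5
"""

def penguins():
    """Palmer penguins sample data (toy subset).

    Full description: https://datasets.holoviz.org/penguins/v1/penguins.csv
    """
    return list(csv.DictReader(io.StringIO(_PENGUINS_CSV)))
```

Because each dataset is a plain function, IDEs can surface both the names and the docstrings (which can link back to a documentation site) via ordinary autocomplete.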
I am 100% in favor of streamlining and simplifying how we access datasets. However, I'm not sure how the proposed datasets package would work, given the size of the datasets we use, e.g. for Datashader examples. Are you proposing that all the datasets would actually be stored within a single conda package? Seems like that would be an impractically large conda package. Maybe you could list the number and size of the datasets you'd expect to be included there?

Also, adding a new package is problematic because it would then need to go onto Anaconda Defaults for a defaults user to be able to get the examples, presumably? HoloViz tools in general are on defaults, which means anything they require also needs to be there. I don't think defaults is going to want a huge package, nor to add another package without a good reason.

Maybe we could split the difference and have a package on conda-forge, but then have HoloViews provide a function like the one in Bokeh, first checking if the datasets package is installed and, if not, downloading the data directly from the internet?
To be honest, I'd actually prefer a website describing a set of datasets and their URLs instead of a package. When I was a newbie I found those packages a bit like hocus pocus, probably because in the old days there was no description of the dataset and no type annotations, i.e. you did not really know what you got.
@jbednar, the small datasets will be in the package, but larger datasets will need to be downloaded and cached. Also, a +1 on that.

My opinion is that the package should have no dependencies at all, and let it be up to the other packages to specify the packages they need.
I like
I'm thinking about shipping the typically small datasets that are used on the website of hvPlot or Panel, like iris, penguins, gapminders, air_temperature, etc. But the package can expose larger datasets that will be downloaded on the fly, and cached.
Yes, this list is needed. Hmm, maybe each project maintainer could come up with their wishlist?
For what it's worth, for the larger datasets:

```python
import numpy as np
import pandas as pd

# Five clusters of Gaussian point data, at different locations and spreads
num = 10_000
np.random.seed(1)
dists = {
    cat: pd.DataFrame({
        "x": np.random.normal(x, s, num),
        "y": np.random.normal(y, s, num),
        "val": val,
        "cat": cat,
    })
    for x, y, s, val, cat in [
        ( 2,  2, 0.03, 10, "d1"),
        ( 2, -2, 0.10, 20, "d2"),
        (-2, -2, 0.50, 30, "d3"),
        (-2,  2, 1.00, 40, "d4"),
        ( 0,  0, 3.00, 50, "d5"),
    ]
}
df = pd.concat(dists, ignore_index=True)
df["cat"] = df["cat"].astype("category")
```
Bokeh does a pretty good job at showing what the datasets contain: https://docs.bokeh.org/en/latest/docs/reference/sampledata.html. By making the datasets available through a callable, we can add in their docstring a link to the site so you can quickly have access to more information.
Yep, that's why I like making the datasets available through a callable: it gives us some more flexibility.
Thanks for the excellent writeup @maximlt!
I agree with @Hoxbro on both counts: I think this package should be free to serve any format and any package using it should decide which ones they care about (which will probably be listed as dependencies already anyway).
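To make the "serve any format, without hard dependencies" idea concrete, here is a hedged sketch. The dataset name, CSV content, and `engine` parameter are invented for illustration; the point is that a loader callable can return several container types while the data package itself stays dependency-free, importing pandas lazily and only on demand:

```python
# Hedged sketch, not the actual proposal: shows a dataset callable that
# serves multiple formats. The package needs no hard dependencies because
# heavier libraries are imported inside the branch that uses them.
import csv
import io

_AIR_CSV = "time,temp\n2024-01-01,3.1\n2024-01-02,2.4\n"

def air_temperature(engine: str = "dicts"):
    """Toy loader; `engine` selects the returned container type."""
    if engine == "dicts":       # stdlib only, no dependency at all
        return list(csv.DictReader(io.StringIO(_AIR_CSV)))
    if engine == "pandas":      # optional dependency, imported on demand
        import pandas as pd
        return pd.read_csv(io.StringIO(_AIR_CSV))
    raise ValueError(f"unknown engine {engine!r}")
```

With this shape, it is the consuming package (hvPlot, Panel, ...) that declares pandas or xarray as a dependency and picks the engine it cares about, which matches the "no dependencies, let downstream decide" position above.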
Makes sense to me!
Yes, good idea. This is an optional extension we should plan for, even if we have trouble publishing such a large package. Note that we don't necessarily need to publish a large data package: we could get the package to air-gapped users some other way if they need it.

As for the naming, I hate all the names so I won't get involved. :-) I suppose a long and ugly (but descriptive) name like
+1 on
I initially had a concern that this doesn't say "holoviz", so that people looking at it in their environments or in the packages included in Anaconda Distribution will have no idea what it might be (whereas

But I think someone pointed out above that it would sort next to

The alternatives I'd imagine are
I wouldn't be so much in favor of that, to avoid increasing our maintenance burden. Besides that, it sounds like you're not against

Next, I'll submit a list of the most used datasets currently across the HoloViz websites, to figure out which lightweight datasets we should expose.
So after some not-perfect regex + pandas, here's the breakdown per package of the types/name of the datasets used:
And the 15 most used datasets:
Pretty sure the analysis isn't perfect and is somewhat missing, or not grouping, the files uploaded to our S3. I am sure the
Supersedes holoviz/hvplot#1274
Context

The HoloViz tools are very much data-oriented; pretty much every example across the board starts by either loading a dataset or creating one on the fly. Loading datasets is done in various ways:

- `bokeh.sampledata`. Bokeh ships some data and downloads some others. I don't know when we rely on the latter, but for sure we run `bokeh sampledata` quite often on the CI (to download these non-shipped datasets).
- `xarray.tutorial.open_dataset`
- hvPlot's `sample_data` module that exposes an Intake catalog and its datasets. It needs the optional deps `intake`, `intake_parquet`, `intake_xarray` and `s3fs`. The catalog is shipped with hvPlot but the datasets are fetched from the internet.
- `https://datasets.holoviz.org` (e.g. `https://datasets.holoviz.org/penguins/v1/penguins.csv`)
- Files stored in the repositories (e.g. `occupancy.csv` in `panel/examples/assets`)

This isn't great:
There has to be a better way :) We started to discuss how to improve the situation by creating a new HoloViz package dedicated to giving a unified and easier approach to loading datasets.
How others do it
Bokeh
Bokeh has a `sampledata` sub-package that gives access to its datasets. They ship 24 files (CSVs mostly). More datasets are available after being downloaded via either the CLI `bokeh sampledata` or the Python API `bokeh.sampledata.download()`. Downloaded files are stored in `$HOME/.bokeh/data` (it can be configured). Datasets are in their sub-modules, sometimes grouped.

Plotly

Plotly has a `data` sub-package that gives access to 11 datasets. They are all csv.gz files and are all shipped.

Altair

The `vega-datasets` package contains 17 shipped datasets and gives access to many more datasets that are downloaded from vega's CDN. Downloaded datasets are not cached. `vega-datasets` is not a dependency of `altair`; it has Pandas as its unique direct dependency. Its code and API are a little more elaborate.

Matplotlib
Matplotlib ships with about 15 datasets. The API is quite rudimentary:
Seaborn
Seaborn doesn't ship any datasets. Instead it offers the high-level `load_dataset` function that fetches files (CSVs mostly) from this GitHub repository: https://github.com/mwaskom/seaborn-data. Datasets are cached by default once downloaded. The cache location can be accessed via `sns.get_data_home()` and controlled by passing an argument to `load_dataset` or setting the env var `SEABORN_DATA`.

Xarray

Xarray has implemented an API similar to seaborn's, available in the `xarray.tutorial` module. The main difference with seaborn is that running `open/load_dataset` requires pooch, which can be installed with `pip install xarray[io]` along with other deps like `netCDF4`, `zarr` or `fsspec`. Pooch itself depends on `platformdirs`, `packaging` and `requests`. `xarray.tutorial` also has a `scatter_example_dataset` function that just generates a Dataset using Numpy functions.

Proposal
Name
Simon already reserved hvdata on PyPI, though I'm not sure we reached an agreement on the name of this package? I think Andrew called it `hvdataset` in his original issue (holoviz/hvplot#1274).

I find `hvdata` a little too close to `hvplot` in spirit and not super explicit. I prefer `hvdatasets`, even if I'm a bit annoyed it'll show up before `hvplot` in my editor.

API
Considering all the approaches above, I think my favorite API is when the datasets are available via a callable (plotly, vega-datasets) that returns a data object, as this enables autocompletion and allows us to augment their signature with optional parameters.
I also like what Plotly does by providing a unique interface to their datasets for Plotly and Plotly express.
Datasets
The package will ship a list (to be determined) of small datasets. They should be as small as possible. We need to watch the overall package size, setting a limit upfront would be a good idea.
Other datasets can be downloaded on the fly. They're cached.
Dependencies
I'm not sure. None, or maybe just `pandas`? `platformdirs` is good for finding the right cache paths on various platforms, but I think we can do without it. The package will have optional dependencies that will be required for its test suite to pass (e.g. xarray).

If we want to increase the likelihood of users being able to run code snippets without getting any errors, then the package should be added as a direct dependency of the HoloViz packages that use it? After all, Plotly, Bokeh and Matplotlib all ship some datasets with their packages (and so do others like plotnine, and possibly many more), so why not us?

Unfortunately, I don't think there's a way yet in the pip world to define default extras. But if there were, I'd go with that option, making a hypothetical default `datasets` extra, with an option for users not to install it (e.g. to reduce their environment size).

If we don't make it a direct dependency of some packages, then their documentation will have to explain that it needs to be installed (either directly, or via a new extra or the existing recommended one).
Documentation
The package will have a simple website built with the usual Sphinx stack. It'll list all the datasets it contains, with a description and their license.