Adding `reader_options` kwargs to open_virtual_dataset. #67

norlandrhagen · 2024-03-29T20:31:06Z

As a first step toward reading remote files (#61), this PR adds a reader_options argument to the open_virtual_dataset function.

These reader_options are passed into each Kerchunk file reader (SingleHdf5ToZarr, NetCDF3ToZarr, etc.) in read_kerchunk_references_from_file. Once open_virtual_dataset is replaced with the Xarray backend, we could pass them through backend_kwargs instead: ds = xr.open_dataset(fp, engine='virtualizarr', backend_kwargs={'reader_options': {'storage_options': {'anon': True}}}).
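
To make the forwarding described above concrete, here is a minimal sketch of a reader_options dict being unpacked into a reader's keyword arguments. FakeHdf5Reader and open_with_reader are hypothetical stand-ins written for illustration, not VirtualiZarr or Kerchunk code; the real readers each accept their own set of options.

```python
from typing import Optional


class FakeHdf5Reader:
    """Hypothetical stand-in for a Kerchunk-style reader (not a real class)."""

    def __init__(self, path: str, storage_options: Optional[dict] = None):
        self.path = path
        self.storage_options = storage_options or {}


def open_with_reader(path: str, reader_options: Optional[dict] = None):
    """Forward every entry of reader_options to the reader as a keyword argument."""
    reader_options = reader_options or {}
    # An option the reader does not accept raises TypeError at this call,
    # so the caller has to know which options each reader supports.
    return FakeHdf5Reader(path, **reader_options)


reader = open_with_reader(
    "s3://carbonplan-share/virtualizarr/local.nc",
    reader_options={"storage_options": {"anon": True}},
)
print(reader.storage_options)
```

Passing the dict through untouched keeps the wrapper small, at the cost of surfacing reader-specific errors directly to the caller.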

This approach relies on the user knowing what options are available in each Kerchunk file reader.

The following example, run against this PR's branch, should work. It points to a file in a public S3 bucket.

from virtualizarr import open_virtual_dataset
path = 's3://carbonplan-share/virtualizarr/local.nc'

vds = open_virtual_dataset(path)

edit: Tests are passing. Index generation and filetype guessing are now working.
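
As a rough illustration of filetype guessing, the sketch below inspects the leading bytes of an open binary file object. The name guess_filetype and its return labels are hypothetical, and this is not the PR's _automatically_determine_filetype; it only assumes the standard 8-byte HDF5 signature and the NetCDF classic `CDF` magic number. Because it takes any binary file-like object, the same check would apply to a local file or to one opened through fsspec.

```python
import io
from typing import BinaryIO

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # 8-byte HDF5 format signature


def guess_filetype(fileobj: BinaryIO) -> str:
    """Guess a file's type from its first bytes (illustrative labels only)."""
    magic = fileobj.read(8)
    fileobj.seek(0)  # rewind so a reader can start from the beginning
    if magic.startswith(HDF5_SIGNATURE):
        # NetCDF4 files are HDF5 containers; plain HDF5 files match too.
        return "netcdf4"
    if magic[:3] == b"CDF" and magic[3:4] in (b"\x01", b"\x02"):
        # Version byte 1 is classic NetCDF3, 2 is the 64-bit offset variant.
        return "netcdf3"
    raise ValueError(f"unrecognised magic bytes: {magic!r}")


print(guess_filetype(io.BytesIO(HDF5_SIGNATURE + b"rest of file")))
print(guess_filetype(io.BytesIO(b"CDF\x01" + b"\x00" * 16)))
```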

codecov · 2024-03-29T20:33:44Z

Codecov Report

Attention: Patch coverage is 16.66667%, with 10 lines in your changes missing coverage. Please review.

Project coverage is 74.31%. Comparing base (f226093) to head (4c6cb63).
Report is 20 commits behind head on main.

❗ Current head 4c6cb63 differs from pull request most recent head e6f047f. Consider uploading reports for the commit e6f047f to get more accurate results

| Files | Patch % | Lines |
|---|---|---|
| virtualizarr/tests/test_xarray.py | 25.00% | 6 Missing ⚠️ |
| virtualizarr/kerchunk.py | 0.00% | 4 Missing ⚠️ |

Additional details and impacted files

@@             Coverage Diff             @@
##             main      #67       +/-   ##
===========================================
- Coverage   90.18%   74.31%   -15.87%     
===========================================
  Files          14       14               
  Lines         998      946       -52     
===========================================
- Hits          900      703      -197     
- Misses         98      243      +145

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `74.31% <16.66%> (-15.87%)` | ⬇️ |

TomNicholas · 2024-04-30T19:05:32Z

@norlandrhagen I've merged main in here because @jbusecke is using this branch for testing with CMIP6 data in #93. It would be great to get this PR into main!

TomNicholas

Thanks @norlandrhagen ! I think this is a nice minimal addition to support reading from s3. I like how the fsspec stuff is generally kept separate too.

TomNicholas · 2024-05-03T22:32:08Z

virtualizarr/xarray.py

@@ -27,6 +28,7 @@ def open_virtual_dataset(
    loadable_variables: Optional[Iterable[str]] = None,
    indexes: Optional[Mapping[str, Index]] = None,
    virtual_array_class=ManifestArray,
+    reader_options: Optional[dict] = {'storage_options':{'key':'', 'secret':'', 'anon':True}},


When you normally point xr.open_dataset at an S3 URL, you don't need to pass reader_options, do you? Can we try to follow the signature of xr.open_dataset as closely as possible? (Maybe this is already as close as we can get.)

norlandrhagen · 2024-05-08

I could be wrong, but I thought you had to pass in some sort of fsspec/s3fs mapper.

For me this fails:

ds = xr.open_dataset('s3://carbonplan-share/virtualizarr/local.nc')
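
One common way around this, sketched below, is to open the object with fsspec first and hand xarray the resulting file-like object. The sketch assumes fsspec, s3fs, xarray, and h5netcdf are installed and that the file is NetCDF4/HDF5; protocol_of and open_remote_dataset are hypothetical helper names, and the commented-out call has not been verified against the bucket above.

```python
from urllib.parse import urlsplit


def protocol_of(url: str) -> str:
    """Return the URL scheme, treating bare paths as local files."""
    return urlsplit(url).scheme or "file"


def open_remote_dataset(url: str, **storage_options):
    """Open a remote NetCDF4 file through fsspec (hypothetical helper)."""
    # Imported lazily so this module loads without the optional dependencies.
    import fsspec
    import xarray as xr

    fs = fsspec.filesystem(protocol_of(url), **storage_options)
    # The h5netcdf engine accepts an open file-like object; .load() reads
    # the data into memory before the file handle is closed.
    with fs.open(url, "rb") as f:
        return xr.open_dataset(f, engine="h5netcdf").load()


# Example (requires network access and the optional dependencies):
# ds = open_remote_dataset("s3://carbonplan-share/virtualizarr/local.nc", anon=True)
```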

norlandrhagen · 2024-05-08T22:35:00Z

Looks like the docs + CI builds are failing with:

Installing pip dependencies: ...working... Pip subprocess error:
  Running command git clone --filter=blob:none --quiet https://github.com/TomNicholas/xarray.git /tmp/pip-install-rmq_i38f/xarray_61220b7470e0456fa17857a7af71f04b
  WARNING: Did not find branch or tag 'concat-avoid-index-auto-creation', assuming revision or ref.
  Running command git checkout -q concat-avoid-index-auto-creation
  error: pathspec 'concat-avoid-index-auto-creation' did not match any file(s) known to git
  error: subprocess-exited-with-error

@TomNicholas should we pin another version of Xarray?

TomNicholas · 2024-05-09T14:22:01Z

> should we pin another version of Xarray?

pydata/xarray#8872 was merged yesterday so now we should be able to release xarray, remove the xarray pin in virtualizarr, then release the first version of virtualizarr!

EDIT: Tracking the xarray release in pydata/xarray#9018

TomNicholas · 2024-05-13T23:52:53Z

This seems very close now @norlandrhagen ?

norlandrhagen · 2024-05-14T02:23:25Z

CI is passing now @TomNicholas!

TomNicholas · 2024-05-14T15:14:59Z

Thanks @norlandrhagen ! One final request: Can we add a quick explanatory line to the docs? Something like

To open remote files as virtual datasets, pass the reader_kwargs options, e.g.

vds = open_virtual_dataset("s3://fake-bucket/file.nc", reader_kwargs={whatever would be needed})

This would go on the usage page, either as a quick entry underneath the first Opening files as virtual datasets heading, or under a new heading at the bottom.
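
A concrete version of that docs line might look like the sketch below. It uses the argument name reader_options and the storage_options shape that appear in this PR's diff; the bucket path is the placeholder from the comment above, the helper names are hypothetical, and the final call needs virtualizarr, s3fs, and network access, so it is left commented out.

```python
def anonymous_s3_reader_options() -> dict:
    """Reader options for a public bucket, mirroring the default in this PR's diff."""
    return {"storage_options": {"key": "", "secret": "", "anon": True}}


def open_remote_virtual_dataset(path: str):
    """Open a remote file as a virtual dataset (hypothetical wrapper)."""
    # Imported lazily so the helper above can be used without virtualizarr.
    from virtualizarr import open_virtual_dataset

    return open_virtual_dataset(path, reader_options=anonymous_s3_reader_options())


# vds = open_remote_virtual_dataset("s3://fake-bucket/file.nc")
```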

TomNicholas · 2024-05-14T16:12:44Z

Thank you so much @norlandrhagen ! Will merge this now

jbusecke · 2024-05-15T17:17:28Z

Fantastic. Thanks so much. I'll refactor my code in the coming days.

adding reader_options kwargs to open_virtual_dataset

4c6cb63

norlandrhagen changed the title ~~Adding reader_options kwargs to open_virtual_dataset.~~ [Draft] Adding reader_options kwargs to open_virtual_dataset. Mar 29, 2024

TomNicholas mentioned this pull request Mar 29, 2024

Generating references from files in S3 (using kerchunk + fsspec) #61

Closed

jbusecke mentioned this pull request Apr 25, 2024

Real world use case: Virtualizarring CMIP6 data #93

Open

TomNicholas and others added 2 commits April 30, 2024 13:04

Merge branch 'main' into reader_options

adf311a

[pre-commit.ci] auto fixes from pre-commit.com hooks

ba5ac6d

for more information, see https://pre-commit.ci

TomNicholas reviewed Apr 30, 2024

TomNicholas and others added 21 commits April 30, 2024 13:06

fix typing

ea30914

modifies _automatically_determine_filetype to open file with fsspec to allows for reading of cloud storage

448800b

using UPath to get file protocol and open with fsspec

8c5dff7

tests passing locally. Reading over s3/local w+w/o indexes & guessing filetypes

6cd77ce

merge w/ main

f0daafe

add s3fs to test

ed3d0f4

typing school 101

beec724

anon

e669841

tying

09f89a6

test_anon update

e4db860

anon failing

ba8b1e3

double down on storage_options

b12d32c

fsspec nit

f9478b9

[pre-commit.ci] auto fixes from pre-commit.com hooks

6958b59

for more information, see https://pre-commit.ci

seting s3 defaults as empty to try to appease the cruel boto3 gods

aefa22d

merge

464ffd3

added fpath to SingleHDF5ToZarr

d108978

hardcode in empty storage opts for s3

5cc5ecd

hardcode default + unpack test

3509a1f

changed reader_options defaults

80cf22b

Merge branch 'main' into reader_options

a3fc72e

norlandrhagen and others added 2 commits May 3, 2024 15:08

updated docs install

0235f51

[pre-commit.ci] auto fixes from pre-commit.com hooks

1e9e2fe

for more information, see https://pre-commit.ci

norlandrhagen requested a review from TomNicholas May 3, 2024 21:30

TomNicholas reviewed May 3, 2024

TomNicholas reviewed May 3, 2024

norlandrhagen added 4 commits May 6, 2024 12:46

changed docstring type in utils to numpy style

55031f9

added TYPE_CHECKING for fsspec and s3fs mypy type hints

6a3d7be

merged w/ main and lint

5aec9db

fixed TYPE_CHECKING import

83b3c4b

pinned xarray to latest commit on github

a143cf4

norlandrhagen closed this May 9, 2024

norlandrhagen reopened this May 9, 2024

norlandrhagen changed the title ~~[Draft] Adding reader_options kwargs to open_virtual_dataset.~~ Adding reader_options kwargs to open_virtual_dataset. May 9, 2024

norlandrhagen added 3 commits May 13, 2024 12:53

merged w/ main to pin xarray and kerchunk

9d124ef

re-add upath

3a29b41

Merge branch 'main' into reader_options

b9c056a

merged w/ main

13fc295

ådds section to usage

4f766d9

TomNicholas reviewed May 14, 2024

Minor formatting nit of code example in docs

e6f047f

TomNicholas merged commit 8923b8c into main May 14, 2024
5 checks passed

TomNicholas deleted the reader_options branch May 14, 2024 17:08

ayushnag mentioned this pull request May 23, 2024

open_virtual_dataset with dmr++ #113

Open

6 tasks
