Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test fsspec roundtrip #42

Merged
merged 22 commits into from May 16, 2024
Merged

Test fsspec roundtrip #42

merged 22 commits into from May 16, 2024

Conversation

TomNicholas
Copy link
Owner

@TomNicholas TomNicholas commented Mar 18, 2024

This works (locally) except for two things:

  1. It requires this PR to xarray so as not to try to create pandas indexes for the dimension coordinates automatically Opt out of auto creating index variables pydata/xarray#8711

  2. The time coordinate is not being decoded, so the roundtripped-dataset only matches the original because I used decode_times=False when opening the original tutorial dataset. If I get rid of that this test fails with

        # assert equal to original dataset
>       xrt.assert_equal(roundtrip, ds)
E       AssertionError: Left and right Dataset objects are not equal
E       
E       Differing coordinates:
E       L * time     (time) float32 12kB 1.867e+06 1.867e+06 ... 1.885e+06 1.885e+06
E       R * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00

virtualizarr/tests/test_integration.py:28: AssertionError

@TomNicholas TomNicholas added the xarray Requires changes to xarray upstream label Mar 18, 2024
Copy link

codecov bot commented Mar 18, 2024

Codecov Report

Attention: Patch coverage is 71.42857% with 6 lines in your changes are missing coverage. Please review.

Project coverage is 88.92%. Comparing base (f226093) to head (43f5870).
Report is 22 commits behind head on main.

❗ Current head 43f5870 differs from pull request most recent head 5dbdfcd. Consider uploading reports for the commit 5dbdfcd to get more accurate results

Files Patch % Lines
virtualizarr/kerchunk.py 25.00% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #42      +/-   ##
==========================================
- Coverage   90.18%   88.92%   -1.26%     
==========================================
  Files          14       14              
  Lines         998      867     -131     
==========================================
- Hits          900      771     -129     
+ Misses         98       96       -2     
Flag Coverage Δ
unittests 88.92% <71.42%> (-1.26%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dcherian
Copy link

The time coordinate is not being decoded

I strongly recommend implementing data read just for catching errors in the time coordinate. It is incredibly common for time vectors to be encoded differently in different files in the "same dataset".

@TomNicholas
Copy link
Owner Author

TomNicholas commented Mar 18, 2024

What do you mean specifically? Like an optional argument to open_dataset_via_kerchunk that will actually load values from any coordinate named 'time'? And then check what about them?

Comment on lines +33 to +35
ds = xr.tutorial.open_dataset("air_temperature", decode_times=False).isel(
time=slice(None, 2000)
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ds = xr.tutorial.open_dataset("air_temperature", decode_times=False).isel(
time=slice(None, 2000)
)
# set up example xarray dataset
ds = xr.tutorial.open_dataset("air_temperature", decode_times=True).isel(time=slice(None, 2000))
del ds.time.encoding["units"]

This is the hard case, the two time variables will be encoded differently, and the concat makes no sense any more.

IIRC kerchunk supports this by decoding time and inlining bytes in the JSON. I think you'll want to at least raise an error here.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why will they be encoded differently?

In what case should I be raising an error? When the encoding attributes are different?

(Sorry I'm not very familiar with the encoding step in general)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why will they be encoded differently?

The encoding chooses units separately for each dataset since they are written separately.

This happens in the wild, and you'll want to catch it.

When the encoding attributes are different?

I think so. How do you handle someone trying to concat different files with different encodings? Are the encodings simply discarded and you use those from the first file?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay thanks.

How do you handle someone trying to concat different files with different encodings? Are the encodings simply discarded and you use those from the first file?

At the moment I'm not doing anything with encodings explicitly, and my tests have been testing with ManifestArrays created in-memory. So I'm not actually deliberately using them anywhere...

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #69 added the ability to load selected variables in as normal lazy numpy arrays, so they will get decoded by xarray automatically.

EDIT: But we can't use it to make this roundtripping test work until I implement writing those numpy arrays out "inlined" into the kerchunk reference files, see #62.

I also opened issue #68 to discuss the encoding problem more generally.

@TomNicholas TomNicholas mentioned this pull request May 13, 2024
@TomNicholas TomNicholas merged commit ca99d5a into main May 16, 2024
5 checks passed
@TomNicholas TomNicholas deleted the test_fsspec_roundtrip branch May 16, 2024 03:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
xarray Requires changes to xarray upstream
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants