Regression with Zarr: ReadOnlyError #135

Open
rabernat opened this issue Feb 20, 2023 · 14 comments · Fixed by #136

@rabernat
Member

Tests with the latest dev environment are failing with errors like this:


tmp_path = PosixPath('/private/var/folders/kl/7rfdrpx96bb0rhbnl5l2dnkw0000gn/T/pytest-of-rabernat/pytest-69/test_rechunk_group_mapper_temp7')
executor = 'python', source_store = 'mapper.source.zarr', target_store = <fsspec.mapping.FSMap object at 0x1174e3520>
temp_store = <fsspec.mapping.FSMap object at 0x1174e3400>

    @pytest.mark.parametrize(
        "executor",
        [
            "dask",
            "python",
            requires_beam("beam"),
            requires_prefect("prefect"),
        ],
    )
    @pytest.mark.parametrize("source_store", ["source.zarr", "mapper.source.zarr"])
    @pytest.mark.parametrize("target_store", ["target.zarr", "mapper.target.zarr"])
    @pytest.mark.parametrize("temp_store", ["temp.zarr", "mapper.temp.zarr"])
    def test_rechunk_group(tmp_path, executor, source_store, target_store, temp_store):
        if source_store.startswith("mapper"):
            fsspec = pytest.importorskip("fsspec")
            store_source = fsspec.get_mapper(str(tmp_path) + source_store)
            target_store = fsspec.get_mapper(str(tmp_path) + target_store)
            temp_store = fsspec.get_mapper(str(tmp_path) + temp_store)
        else:
            store_source = str(tmp_path / source_store)
            target_store = str(tmp_path / target_store)
            temp_store = str(tmp_path / temp_store)
    
>       group = zarr.group(store_source)

tests/test_rechunk.py:457: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../mambaforge/envs/rechunker/lib/python3.9/site-packages/zarr/hierarchy.py:1355: in group
    init_group(store, overwrite=overwrite, chunk_store=chunk_store,
../../../mambaforge/envs/rechunker/lib/python3.9/site-packages/zarr/storage.py:648: in init_group
    _init_group_metadata(store=store, overwrite=overwrite, path=path,
../../../mambaforge/envs/rechunker/lib/python3.9/site-packages/zarr/storage.py:711: in _init_group_metadata
    store[key] = store._metadata_class.encode_group_metadata(meta)  # type: ignore
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <zarr.storage.FSStore object at 0x1174e34c0>, key = '.zgroup', value = b'{\n    "zarr_format": 2\n}'

    def __setitem__(self, key, value):
        if self.mode == 'r':
>           raise ReadOnlyError()
E           zarr.errors.ReadOnlyError: object is read-only

../../../mambaforge/envs/rechunker/lib/python3.9/site-packages/zarr/storage.py:1410: ReadOnlyError

This is the cause of the test failures in #134.
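
For reference, a minimal sketch of the failing call pattern (the path is illustrative); on an affected zarr release this raises the same ReadOnlyError, because the fsspec mapper ends up wrapped in an FSStore whose mode is 'r':

import fsspec
import zarr

# Wrap a local path in an fsspec mapper, as the parametrized test does
store = fsspec.get_mapper("/tmp/source.zarr")

# On the affected zarr releases this raises zarr.errors.ReadOnlyError,
# because the mapper is normalized to a read-only FSStore internally
group = zarr.group(store)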

@rsignell-usgs
Member

Shoot, I'm still getting the read_only errors with 0.5.1:
https://nbviewer.org/gist/85a34aed6e432d0d8502841076bbab92

@rabernat
Member Author

rabernat commented Mar 14, 2023

I think you may be hitting a version of zarr-developers/zarr-python#1353 because you are calling

m = fs.get_mapper("")

Try updating to the latest zarr version, or else creating an FSStore instead.
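
A sketch of that workaround, assuming a zarr version that provides zarr.storage.FSStore (the path is illustrative; in the notebook above it would be whatever fs.get_mapper("") points at):

import zarr
from zarr.storage import FSStore

# Create the store explicitly in write mode rather than relying on
# zarr to convert an fsspec mapper, which can come out read-only
store = FSStore("target.zarr", mode="w")
group = zarr.group(store)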

@rsignell-usgs
Member

Okay, will do!

@rabernat
Member Author

It would be helpful to confirm which Zarr version you had installed.

@rsignell-usgs
Member

rsignell-usgs commented Mar 14, 2023

Hmm, zarr=2.13.6, the latest from conda-forge. I see that zarr=2.14.2 has been released though. I'll try pip installing that.

@rsignell-usgs
Member

rsignell-usgs commented Mar 15, 2023

Okay, with the latest zarr=2.14.2, I don't get the read_only errors.

But the workflow fails near the end of the rechunking process:


KilledWorker: Attempted to run task ('copy_intermediate_to_write-bca90f45d4dc080cca14b54ce5a10d1f', 2) on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was tls://10.10.105.181:35291. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.

The logs from those workers are not available on the dashboard, I guess because the workers died, right?

This rechunker workflow was working in December. Should I revert to the zarr and rechunker versions from that era?

@rabernat
Member Author

rabernat commented Mar 15, 2023

Ideally you would figure out what is going wrong and help us fix it, rather than rolling back to an earlier version. After all, you're a rechunker maintainer now! 😉

Are you sure that all your package versions match on your workers?
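
One way to check that (a sketch using dask.distributed's version check; the scheduler address is illustrative):

from distributed import Client

client = Client("tls://scheduler-address:8786")  # illustrative address

# With check=True this raises if the client, scheduler, and workers
# disagree on key package versions (python, dask, distributed, ...)
versions = client.get_versions(check=True)
print(list(versions["workers"]))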

@rsignell-usgs
Member

rsignell-usgs commented Mar 15, 2023

I'm certainly willing to try to help debug it, but don't really know where to start. If you have ideas, I'm game to try them.

One of the nice things about nebari/conda-store is that the notebook and the workers see the same environment (accessed from the conda-store pod), so the versions always match.

I added you to the ESIP Nebari deployment if you are interested in checking it out.

https://nebari.esipfed.org/hub/user-redirect/lab/tree/shared/users/Welcome.ipynb

https://nebari.esipfed.org/hub/user-redirect/lab/tree/shared/users/rsignell/notebooks/NWM/rechunk_grid/03_rechunk.ipynb

@rabernat
Member Author

I won't be able to log into the ESIP cluster to debug your failing computation. If you think there has been a regression in rechunker in the new release, I strongly encourage you to develop a minimal reproducible example and share it via the issue tracker.

If you have ideas, I'm game to try them.

My first idea would be to freeze every package version except rechunker in your environment, and then run the exact same workflow with only the rechunker version changing (say 0.5.0 vs. 0.5.1). Your example has a million moving pieces: Dask, Zarr, kerchunk, xarray, and so on. It's impossible to say whether your problem is caused by a change in rechunker unless you can isolate it. There have been very few changes to rechunker over the past year, and nothing that obviously would cause your dask workers to start running out of memory.
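
One way to make that comparison concrete (a small helper, purely illustrative) is to snapshot the installed versions in each environment and diff the two files; only the rechunker line should differ:

from importlib import metadata

def snapshot_environment(path):
    # Write "name==version" for every installed distribution, one per line
    lines = sorted(
        f"{dist.metadata['Name']}=={dist.version}" for dist in metadata.distributions()
    )
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

# Run once in the 0.5.0 environment and once in the 0.5.1 environment
snapshot_environment("env-rechunker-0.5.1.txt")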

@rsignell-usgs
Member

I've confirmed that my rechunking workflow runs successfully if I pin zarr=2.13.3:

cf_xarray                 0.8.0              pyhd8ed1ab_0    conda-forge
dask                      2023.3.1           pyhd8ed1ab_0    conda-forge
dask-core                 2023.3.1           pyhd8ed1ab_0    conda-forge
dask-gateway              2022.4.0           pyh8af1aa0_0    conda-forge
dask-geopandas            0.3.0              pyhd8ed1ab_0    conda-forge
dask-image                2022.9.0           pyhd8ed1ab_0    conda-forge
fsspec                    2023.3.0+5.gbac7529          pypi_0    pypi
intake-xarray             0.6.1              pyhd8ed1ab_0    conda-forge
jupyter_server_xarray_leaflet 0.2.3              pyhd8ed1ab_0    conda-forge
numcodecs                 0.11.0          py310heca2aa9_1    conda-forge
pint-xarray               0.3                pyhd8ed1ab_0    conda-forge
rechunker                 0.5.1                    pypi_0    pypi
rioxarray                 0.13.4             pyhd8ed1ab_0    conda-forge
s3fs                      2022.11.0       py310h06a4308_0  
xarray                    2023.2.0           pyhd8ed1ab_0    conda-forge
xarray-datatree           0.0.12             pyhd8ed1ab_0    conda-forge
xarray-spatial            0.3.5              pyhd8ed1ab_0    conda-forge
xarray_leaflet            0.2.3              pyhd8ed1ab_0    conda-forge
zarr                      2.13.3             pyhd8ed1ab_0    conda-forge
• If I change to zarr=2.13.6, I get the ReadOnlyError: object is read-only error.
• If I change to zarr=2.14.2, the dask workers die.

@rsignell-usgs
Member

rsignell-usgs commented Mar 15, 2023

@gzt5142 has a minimal reproducible example he will post shortly. But should this be raised as a zarr issue?

@rabernat
Member Author

Thanks a lot for looking into this Rich!

But should this be raised as a zarr issue?

How minimal is it? Can you decouple it from the dask and rechunker issues? Can you say more about what you think the root problem is?

@rsignell-usgs
Member

rsignell-usgs commented Mar 22, 2023

Unfortunately, it turns out the minimal example we created works fine -- it does not trigger the problem described here. :(

@rabernat
Member Author

I'm going to reopen this issue.

If there is a bug somewhere in our stack that is preventing rechunker from working properly, we really need to get to the bottom of it.

@rabernat reopened this Mar 31, 2023