fill_value is not preserved in the rechunked output #133

flamingbear opened this issue Feb 16, 2023 · 0 comments

Hi,

This is a follow-up to #131 (and an updated version of #132).

While comparing the source and target zarr stores from my regression tests, I noticed that the fill_value differs between the source and the target. My guess is that it isn't preserved by the rechunk, which can make the output stores much larger than they need to be.

Below is an updated version of my previous test script. It creates a degenerate case in which almost all of the data being rechunked has the same value.

If you run this script, you will see that the fill_value in "foo/bar/.zarray" changes from "fill_value": 1.0 in the source to "fill_value": null in the target. The on-disk size of the two stores also differs by well over an order of magnitude.
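
For anyone who wants to check this without opening the stores through zarr, here is a minimal sketch that reads the fill_value straight from the .zarray JSON files. It assumes the zarr v2 DirectoryStore layout described above (one .zarray file per array directory); the helper names read_fill_value and fill_values_match are made up for illustration and are not part of zarr or rechunker.

```python
import json
from pathlib import Path


def read_fill_value(store_dir, array_path):
    """Return the fill_value recorded in <store_dir>/<array_path>/.zarray.

    Assumes a zarr v2 directory store, where array metadata is plain JSON.
    """
    zarray_file = Path(store_dir) / array_path / ".zarray"
    with open(zarray_file) as f:
        return json.load(f)["fill_value"]


def fill_values_match(source_dir, target_dir, array_path):
    """Return True if both stores record the same fill_value for the array."""
    source_fill = read_fill_value(source_dir, array_path)
    target_fill = read_fill_value(target_dir, array_path)
    return source_fill == target_fill
```

After running the script, calling fill_values_match('testoutput/source.zarr', 'testoutput/target.zarr', 'foo/bar') should return False if the behavior described above holds.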

Thanks,
Matt

❯ du -hs *
 36K	source.zarr
3.1M	target.zarr
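
To see where the extra space goes, a rough sketch like the following counts the chunk files in each store. It assumes a zarr v2 directory store in which every file whose name does not start with ".z" is a chunk; store_summary is a hypothetical helper, and it sums file sizes, so its byte totals will not match the block-based numbers that du reports.

```python
from pathlib import Path


def store_summary(store_dir):
    """Count chunk files and total file bytes under a zarr v2 directory store.

    Assumption: metadata files are named .zarray, .zgroup, .zattrs or
    .zmetadata, and everything else is a chunk file.
    """
    files = [p for p in Path(store_dir).rglob("*") if p.is_file()]
    chunk_files = [p for p in files if not p.name.startswith(".z")]
    return {
        "chunk_files": len(chunk_files),
        "total_bytes": sum(p.stat().st_size for p in files),
    }
```

Comparing store_summary('testoutput/source.zarr') with store_summary('testoutput/target.zarr') shows how many chunks each store actually wrote to disk.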

Here's a script that demonstrates the issue.

import shutil

import zarr
from rechunker import rechunk


def run_create_input_store():
    # Start from a clean output directory.
    shutil.rmtree('testoutput/', ignore_errors=True)
    store = zarr.DirectoryStore('testoutput/source.zarr')
    root = zarr.group(store=store, overwrite=True)
    foo = root.create_group('foo')
    root.attrs['description'] = 'root description'
    foo.attrs['description'] = 'foo description'
    # ones() creates the array with fill_value=1.0; only one element differs.
    bar = foo.ones('bar', shape=(10000, 10000))
    bar[5000, 5000] = 3
    bar.attrs['description'] = 'foo description'
    zarr.consolidate_metadata(store)


def rechunkit():
    # Rechunk foo/bar into (1000, 1000) chunks and write the target store.
    openstore = zarr.open_consolidated('testoutput/source.zarr')
    array_plan = rechunk(openstore, {'foo/bar': (1000, 1000)},
                         '1GB',
                         'testoutput/target.zarr',
                         temp_store='testoutput/temp.zarr')
    array_plan.execute()
    zarr.consolidate_metadata('testoutput/target.zarr')


if __name__ == '__main__':
    run_create_input_store()
    rechunkit()
    print('Compare the .zmetadata files in both your source.zarr and target.zarr directories')
    print('You will see that the "fill_value" in the source is 1.0 and it is null in the target.')
    # Print the fill_value of foo/bar in the source, then in the target.
    source = zarr.open('testoutput/source.zarr')
    target = zarr.open('testoutput/target.zarr')
    print(source['foo']['bar'].fill_value)
    print(target['foo']['bar'].fill_value)
    