Batches to zarr #40

leifdenby · 2021-11-17T17:33:03Z

Add `BatchGenerator.to_zarr` and `BatchGenerator.from_zarr` to make it possible to save generated batches to zarr and later load them from zarr. By chunking along the batch dimension this enables fast data-loading at training time.

codecov · 2022-02-03T00:54:26Z

Codecov Report

Merging #40 (08a9e94) into main (34ca3f9) will decrease coverage by 3.04%.
The diff coverage is 85.71%.

@@             Coverage Diff             @@
##              main      #40      +/-   ##
===========================================
- Coverage   100.00%   96.95%   -3.05%     
===========================================
  Files            3        3              
  Lines          134      164      +30     
  Branches        30       38       +8     
===========================================
+ Hits           134      159      +25     
- Misses           0        2       +2     
- Partials         0        3       +3

| Impacted Files | Coverage Δ | |
|---|---|---|
| xbatcher/generators.py | `95.65% <85.71%> (-4.35%)` | ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jhamman · 2022-02-03T01:14:34Z

@leifdenby - thanks for opening this PR and apologies the review wasn't picked up sooner.

My main question/comment is about whether or not we want to think about serializing some of the batch-generator's attributes in the Zarr dataset. It seems like without too much effort, we could effectively reconstruct the full BatchGenerator attribute namespace.

Also, as an aside, this fits nicely within the caching-api's element in the xbatcher roadmap: https://xbatcher.readthedocs.io/en/latest/roadmap.html#caching-apis. We hadn't gotten there yet but I'm glad to see this moving.

leifdenby · 2022-02-08T16:08:15Z

> My main question/comment is about whether or not we want to think about serializing some of the batch-generator's attributes in the Zarr dataset. It seems like without too much effort, we could effectively reconstruct the full BatchGenerator attribute namespace.

Yes, that sounds like a good idea. Do you have a list of attributes in mind? Looking at `BatchGenerator.__init__`, maybe I should start with `input_dims`, `input_overlap`, `batch_dims` and `concat_input_dims`. I don't think it makes sense to store `preload_batch`, and of course not the source dataset `ds`. What do you think?
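A minimal sketch of how the four attributes listed above might be round-tripped through a Zarr store's attributes, which must be JSON-serializable. The helper names `pack_init_attrs` and `unpack_init_attrs`, the `xbatcher_` key prefix, and the example values are hypothetical and not part of this PR; only the attribute names come from `BatchGenerator.__init__`.

```python
import json

# Attribute names taken from the discussion; everything else here is an
# illustrative assumption, not xbatcher's actual implementation.
INIT_ATTRS = ("input_dims", "input_overlap", "batch_dims", "concat_input_dims")
PREFIX = "xbatcher_"  # hypothetical namespace to avoid clashing with user attrs


def pack_init_attrs(**init_kwargs):
    """Encode selected init arguments as JSON strings, one per attribute."""
    return {
        PREFIX + name: json.dumps(init_kwargs[name])
        for name in INIT_ATTRS
        if name in init_kwargs
    }


def unpack_init_attrs(attrs):
    """Invert pack_init_attrs, ignoring unrelated dataset attributes."""
    return {
        key[len(PREFIX):]: json.loads(value)
        for key, value in attrs.items()
        if key.startswith(PREFIX)
    }


init_kwargs = {
    "input_dims": {"x": 10},
    "input_overlap": {"x": 2},
    "batch_dims": {},
    "concat_input_dims": False,
}
attrs = pack_init_attrs(**init_kwargs)
restored = unpack_init_attrs(attrs)
```

In an xarray-based implementation the packed dictionary could be merged into the dataset's `attrs` before writing, so that loading can rebuild the same namespace. `preload_batch` and `ds` are deliberately left out, as suggested above.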

> Also, as an aside, this fits nicely within the caching-api's element in the xbatcher roadmap: https://xbatcher.readthedocs.io/en/latest/roadmap.html#caching-apis. We hadn't gotten there yet but I'm glad to see this moving.

Great! :) I've used this a few times now and it works well for me.

jhamman · 2022-02-09T06:03:09Z

> Looking at `BatchGenerator.__init__` maybe I should start with `input_dims`, `input_overlap`, `batch_dims` and `concat_input_dims`

Yes, this is exactly what I was thinking.

Also, looking at the code coverage report, it looks like we're in pretty good shape but could use a bit more testing on the edge cases. I'll leave a few more comments in the code to highlight areas that could use a test.

jhamman

a few pointers for possible test improvements

jhamman · 2022-02-09T06:04:16Z

xbatcher/generators.py

+        ds_all = xr.concat(batch_datasets, dim='batch_number').reset_index(
+            'sample'
+        )
+        if 'batch' in chunks:


test when 'batch' not in chunks

jhamman · 2022-02-09T06:04:47Z

xbatcher/generators.py

+        if 'batch' in chunks:
+            chunks['batch_number'] = chunks.pop('batch')
+
+        if len(chunks) > 0:


test when len(chunks) == 0
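The two untested paths flagged in these comments could be exercised along the following lines. This is a sketch only: `rename_batch_chunk` is a hypothetical stand-alone helper that mirrors the key-renaming lines in the diff, whereas the real logic lives inside `to_zarr` and would be tested through a `BatchGenerator` instance.

```python
def rename_batch_chunk(chunks):
    """Return a copy of `chunks` with a 'batch' key renamed to 'batch_number'.

    Hypothetical helper mirroring the diff; copying the dict first is a
    choice made for this sketch.
    """
    chunks = dict(chunks)
    if "batch" in chunks:
        chunks["batch_number"] = chunks.pop("batch")
    return chunks


def test_batch_key_is_renamed():
    assert rename_batch_chunk({"batch": 1}) == {"batch_number": 1}


def test_batch_not_in_chunks():
    # Other keys pass through untouched when 'batch' is absent.
    assert rename_batch_chunk({"sample": 32}) == {"sample": 32}


def test_empty_chunks():
    # With no entries, the `if len(chunks) > 0` branch in the diff is skipped.
    result = rename_batch_chunk({})
    assert result == {}
    assert len(result) == 0
```

Written as plain functions, these can be collected by pytest unchanged once the helper is swapped for calls into the real method.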

jhamman · 2022-02-09T06:06:28Z

xbatcher/generators.py

+        self.path = path
+
+    def __iter__(self):
+        for batch_id in self.ds_batches.batch_number.values:


I'm not exactly sure why, but codecov thinks something in this for loop is not being covered by the existing tests. Perhaps it's the empty iterable (`.values`), or it could be the `if` statement in line 194. Any thoughts?
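One way to check the empty-iterable hypothesis is a test that iterates over a store containing no batches. The sketch below uses a hypothetical stand-in class, `FakeBatchStore`, rather than the class from this PR, so it only shows the shape such a test might take.

```python
class FakeBatchStore:
    """Hypothetical stand-in that iterates over stored batch ids."""

    def __init__(self, batch_numbers):
        self.batch_numbers = list(batch_numbers)

    def __iter__(self):
        # Mirrors the loop in the diff: one item per stored batch id.
        for batch_id in self.batch_numbers:
            yield batch_id


def test_iterating_empty_store_yields_nothing():
    assert list(FakeBatchStore([])) == []


def test_iterating_store_yields_each_batch_once():
    assert list(FakeBatchStore([0, 1, 2])) == [0, 1, 2]
```

If coverage still reports a partial branch after an equivalent test against the real class, the `if` statement mentioned above would be the next place to look.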

RichardScottOZ

typo in computere by the way

leifdenby added 2 commits November 17, 2021 17:27

Functionality for storing batches with zarr

0e7b538

Add `BatchGenerator.to_zarr` and `BatchGenerator.from_zarr` to make it possible to save generated batches to zarr and later load them from zarr. By chunking along the batch dimension this enables fast data-loading at training time.

cleanup test

45e20ec

leifdenby mentioned this pull request Nov 17, 2021

Generating the batches seems slow #37

Open

leifdenby and others added 4 commits November 17, 2021 17:51

Apply linting etc with pre-commit

9083881

add zarr to dev-requirements

7b2341b

Merge branch 'main' into batches-to-zarr

66a9ec5

Merge branch 'main' into batches-to-zarr

dd4108c

jhamman reviewed Feb 9, 2022

leifdenby added 3 commits May 10, 2022 15:16

Merge branch 'main' of https://github.com/pangeo-data/xbatcher into b…

e7dca77

…atches-to-zarr

store init attrs and create BatchGeneratorBase

1ce312f

linting fixes

08a9e94

RichardScottOZ reviewed May 12, 2022

jhamman mentioned this pull request Oct 14, 2022

Cache batches #109

Open
