Optimize Data Reuse in StoreToZarr and Similar Processes #730

Open

moradology opened this issue Apr 10, 2024 · 1 comment

moradology (Contributor) commented Apr 10, 2024

Description

Currently, the process of rechunking data and ingesting it into a Zarr store encounters significant performance bottlenecks, primarily due to inefficient data reading strategies. Substantial effort and many network requests may be wasted rereading byte ranges from the source datasets. This inefficiency is particularly pronounced in workflows that transfer large datasets over the network or read from slow storage systems.

The Problem

When reading data from source files (e.g., NetCDF, Zarr) to write into an output Zarr store, the current implementation does not effectively reuse data that has already been read into memory. Instead, overlapping byte ranges might be read multiple times from the same or different source files, leading to unnecessary I/O operations and increased execution time.

This issue is compounded in scenarios where the chunking scheme of the output store does not align with that of the input files, necessitating partial reads of larger chunks and leading to both inefficiencies in data transfer and increased memory usage.

Ideas

  1. Intelligent Caching: Implement a caching mechanism that temporarily stores read chunks in memory. Subsequent write operations requiring the same byte ranges could then hit this cache instead of issuing additional reads against the source. We already do byte caching via fsspec constructs; we might be able to use the DoFn lifecycle to ensure that the cache is shared across units of work within a worker (see the DoFn sketch at the end of this issue).

  2. Graph-based Data Dependency Analysis: Construct a graph that models the dependencies between read and write operations. Nodes in the graph represent chunks in both the input and output datasets, while edges denote the data flow from read chunks to write chunks. Optimizing this graph could help schedule reads and writes in a way that maximizes data reuse (a minimal sketch of such a graph follows this list). There is surely prior art on this - anyone familiar?

  3. Heuristic-based Read Scheduling: As a further optimization, develop heuristics for read scheduling that prioritize work whose inputs are already in memory, so that LRU or similar cache-eviction policies remain effective.
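As a rough illustration of idea 2, here is a minimal sketch of the dependency graph for a single dimension. The chunk sizes, the interval-overlap test, and the plain-dict graph representation are all hypothetical stand-ins; a real implementation would work over the actual input/output chunk grids of the datasets involved.

```python
from collections import defaultdict

def chunk_ranges(total: int, chunk: int):
    """Yield half-open (start, stop) index ranges for a 1-D chunk grid."""
    for start in range(0, total, chunk):
        yield (start, min(start + chunk, total))

def build_dependency_graph(total: int, read_chunk: int, write_chunk: int):
    """Map each write chunk to the read chunks whose ranges overlap it."""
    graph = defaultdict(list)
    reads = list(chunk_ranges(total, read_chunk))
    for w in chunk_ranges(total, write_chunk):
        for r in reads:
            if r[0] < w[1] and w[0] < r[1]:  # half-open intervals overlap
                graph[w].append(r)
    return graph

# Any read chunk referenced by more than one write chunk is a reuse
# candidate: read it once, cache it, serve every dependent write from cache.
graph = build_dependency_graph(total=100, read_chunk=30, write_chunk=20)
reuse = defaultdict(list)
for w, rs in graph.items():
    for r in rs:
        reuse[r].append(w)
cache_candidates = {r: ws for r, ws in reuse.items() if len(ws) > 1}
```

Once this bipartite structure exists, a scheduler could order writes so that all consumers of a given read chunk run close together, bounding how long cached bytes must be retained.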

Illustration of caching strategy

Consider a scenario where two adjacent write chunks (W1) and (W2) in the output Zarr store depend on overlapping ranges of a read chunk [R1] from a source file. Currently, the overlapping portion of [R1] might be read twice, once for each write operation. An optimized approach would read [R1] once, cache it, and then use the cached data for both (W1) and (W2), effectively halving the read operations for this segment (see the code sketch after the diagram).

Read: [R1] -----> [Cache]
                     |
                    / \
Write:           (W1) (W2)
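A minimal sketch of that read-once behavior, using functools.lru_cache as a stand-in for the real byte cache; the in-memory SOURCE dict and read_range helper are purely illustrative:

```python
from functools import lru_cache

SOURCE = {"R1": bytes(range(256)) * 16}  # stand-in for a source file

@lru_cache(maxsize=128)
def read_range(chunk_id: str, start: int, stop: int) -> bytes:
    # Stand-in for an fsspec byte-range read; because the function is
    # memoized, repeated requests for the same range hit the source once.
    print(f"reading {chunk_id}[{start}:{stop}]")
    return SOURCE[chunk_id][start:stop]

# W1 and W2 both depend on the same range of R1: only the first call
# actually reads; the second is served from the cache.
w1 = read_range("R1", 0, 1024)
w2 = read_range("R1", 0, 1024)
assert w1 == w2
```

Note that an exact-match cache like this only helps when W1 and W2 request identical ranges; serving partially overlapping ranges requires a block-aligned cache (e.g., fsspec's blockcache), which is part of why idea 1 leans on existing fsspec constructs.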

We can likely use the DoFn setup step, which can initialize shared resources (even across bundles of work!) within a worker for a given stage of pipeline execution.
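A rough sketch of that pattern, assuming Apache Beam's apache_beam.utils.shared.Shared helper and fsspec's CachingFileSystem; CachedReadDoFn, its input format, and the cache location are hypothetical and not the actual StoreToZarr implementation:

```python
import apache_beam as beam
from apache_beam.utils import shared
import fsspec
from fsspec.implementations.cached import CachingFileSystem

class CachedReadDoFn(beam.DoFn):
    def __init__(self, shared_handle: shared.Shared):
        self._shared_handle = shared_handle

    def setup(self):
        # Acquire (or lazily create) one CachingFileSystem per worker;
        # Shared keeps the same instance alive across bundles on this
        # worker, so byte ranges cached by one bundle serve later ones.
        def make_fs():
            return CachingFileSystem(
                fs=fsspec.filesystem("https"),
                cache_storage="/tmp/fsspec-cache",  # hypothetical location
            )
        self._fs = self._shared_handle.acquire(make_fs)

    def process(self, element):
        url, start, stop = element
        with self._fs.open(url, "rb") as f:
            f.seek(start)
            yield f.read(stop - start)

# Usage: the Shared handle is created at pipeline-construction time and
# serialized into the DoFn.
handle = shared.Shared()
# ... | beam.ParDo(CachedReadDoFn(handle)) | ...
```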