Use chunk_iter to avoid slow performance for files with many chunks #67

Closed
rly opened this issue May 13, 2024 · 2 comments · Fixed by #68

Comments

@rly
Contributor

rly commented May 13, 2024

When working with datasets that have a large number of chunks and requesting that their chunk information be cached in the LINDI file (e.g., by setting num_dataset_chunks_threshold to a very large number or None), the performance of h5py/HDF5's get_chunk_info for retrieving chunk locations and offsets is terrible. This was reported here: h5py/h5py#2117

Since then, h5py 3.8 has been released, which adds the method h5py.h5d.DatasetID.chunk_iter(), a significantly faster way to retrieve chunk information. This method works only with HDF5 1.12.3 and above. Unfortunately, the latest h5py release on PyPI (3.11.0) bundles HDF5 1.12.2 in its Mac wheels, while the pre-built Linux and Windows wheels bundle HDF5 1.14.2. To use this faster method on Mac, we have to install the latest h5py from conda-forge or build it from source against HDF5 1.12.3 or later.
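
For reference, here is one way to check which HDF5 version a given h5py installation was built against and whether chunk_iter is available (this assumes hasattr is a reliable check, since chunk_iter is only compiled in when the underlying HDF5 supports it):

import h5py

# HDF5 version that this h5py build was compiled against, e.g., (1, 12, 2)
print(h5py.version.hdf5_version_tuple)

# chunk_iter is only present when h5py was built against HDF5 >= 1.12.3
print(hasattr(h5py.h5d.DatasetID, "chunk_iter"))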

Use case: this NWB file in Dandiset 000717 has just over 1 million chunks. Using the old method, getting the chunk info for the first 100 chunks takes about 15-23 seconds, and the time appears to be roughly, but not exactly, linear in the number of chunks requested. If we assume it is linear, then getting chunk info for ALL chunks would take about 56 hours. Using the new method, getting the chunk info for ALL chunks takes about 1-6 seconds. The variation between 1 and 6 seconds might depend on HDF5 caching.

I suggest we use chunk_iter if available and fall back to get_chunk_info, with a warning if there are a large number of chunks; a sketch of that logic follows the timing scripts below. I was going to suggest using tqdm to monitor chunk retrieval, but given the 1-6 second speed of the new method, I don't think it is necessary.

Timing the old method (get_chunk_info) on the first 100 chunks:

from tqdm import tqdm
import h5py
import timeit

url_or_path = "/Users/rly/Downloads/sub-R6_ses-20200206T210000_behavior+ophys.nwb"
with h5py.File(url_or_path, "r") as f:
    start_time = timeit.default_timer()
    h5_dataset = f["/acquisition/TwoPhotonSeries/data"]
    dsid = h5_dataset.id
    # Old method: one low-level HDF5 query per chunk
    # for i in tqdm(range(100)):
    for i in range(100):
        chunk_info = dsid.get_chunk_info(i)

    end_time = timeit.default_timer()
    elapsed_time = end_time - start_time
    print(f'Time elapsed: {elapsed_time} seconds')

Timing the new method (chunk_iter) on ALL chunks:

import h5py
import timeit

url_or_path = "/Users/rly/Downloads/sub-R6_ses-20200206T210000_behavior+ophys.nwb"
with h5py.File(url_or_path, "r") as f:
    start_time = timeit.default_timer()
    h5_dataset = f["/acquisition/TwoPhotonSeries/data"]
    dsid = h5_dataset.id
    # New method: a single pass over the chunk index via a callback
    stinfo = list()
    dsid.chunk_iter(stinfo.append)
    print(len(stinfo))
    print(stinfo[-1])

    end_time = timeit.default_timer()
    elapsed_time = end_time - start_time
    print(f'Time elapsed: {elapsed_time} seconds')
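
Here is a minimal sketch of the suggested fallback logic (collect_chunk_info is a hypothetical helper name, the warning threshold is illustrative rather than LINDI's actual num_dataset_chunks_threshold handling, and it assumes hasattr is a sufficient availability check since chunk_iter is only compiled in when HDF5 supports it):

import warnings
import h5py

def collect_chunk_info(dsid: h5py.h5d.DatasetID):
    """Return a list of StoreInfo objects (chunk offset, byte offset, size) for a chunked dataset."""
    stinfo = []
    if hasattr(dsid, "chunk_iter"):
        # Fast path: h5py >= 3.8 built against HDF5 >= 1.12.3 walks the chunk index once
        dsid.chunk_iter(stinfo.append)
    else:
        num_chunks = dsid.get_num_chunks()
        if num_chunks > 1000:  # illustrative threshold only
            warnings.warn(
                f"Retrieving chunk info for {num_chunks} chunks via get_chunk_info may be very slow. "
                "Install h5py built against HDF5 1.12.3+ to use the much faster chunk_iter."
            )
        # Slow path: one low-level HDF5 query per chunk
        for i in range(num_chunks):
            stinfo.append(dsid.get_chunk_info(i))
    return stinfo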
@magland
Collaborator

magland commented May 13, 2024

I'm glad you found that alternative method! I had been looking for a solution to this for a while (even delving into the HDF5 source code), but was not able to discover that alternative.

> If we assume it is linear, then to get chunk info for ALL chunks would take about 56 hours. Using the new method, getting the chunk info for ALL chunks takes about 1-6 seconds. The variation between 1 and 6 seconds might depend on HDF5 caching.

Incredible.

Okay so this should be high priority. Do you want me to take a crack at it?

@rly
Contributor Author

rly commented May 14, 2024

I started taking a crack at it. I'll let you know my progress in the morning. Might need some eyes on refactoring.
