Investigate why access performance isn't improved uniformly by repacking metadata #19

betolink opened this issue Aug 30, 2023 · 6 comments

betolink commented Aug 30, 2023

On the files we tested over Antarctica, repacking the metadata with h5repack didn't improve access times dramatically, especially for xarray and h5py. These granules contained a lot of data; each was around 6 GB, with ~7 MB of metadata. They were selected and processed using this notebook.

e.g. ATL03_20181120182818_08110112_006_02.h5 ~7GB in size and 7MB of metadata

Note: The S3 bucket with the original data is gone but can be easily recreated.

Figure: arr_mean_bar_plot.png (https://raw.githubusercontent.com/ICESAT-2HackWeek/h5cloud/1f3441190951e5a2da74611f1196a657db7035bd/notebooks/arr_mean_bar_plot.png)

However, for other granules with less data, repacking yielded a 10x improvement for xarray.

e.g. ATL03_20220201060852_06261401_005_01.h5 ~500MB in size and 3MB of metadata

After applying h5repack to both files, the access time for the first one does not improve for xarray, but for the second granule it drops from 1 minute to 5 seconds. Why?

import s3fs
import xarray as xr

s3 = s3fs.S3FileSystem(anon=True)
file = 's3://some-bucket/ATL03_20181120182818_08110112_006_02.h5'  # placeholder path

group = '/gt2l/heights'
variable = 'h_ph'

with s3.open(file, 'rb') as file_stream:
    ds = xr.open_dataset(file_stream, group=group, engine='h5netcdf')
    variable_mean = ds[variable].mean()

I'm going to repack the original files and put them on a more durable bucket, along with more examples from other NASA datasets.

Maybe @ajelenak has some clues on why this may be happening.

@ajelenak

Hi @betolink,

Repacking the file is the necessary first step, but then the instructions to use the features available in the repacked file must be passed to libhdf5. I know this can be done from h5py, but I have not yet verified whether the same is possible from xarray and h5netcdf. It probably is, because I've seen xarray code where backend storage engine options are set in the open_dataset() call.
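
For reference, a minimal sketch of passing those options from h5py (assuming h5py >= 3.3 built against HDF5 >= 1.10.1; the bucket path is a placeholder):

import h5py
import s3fs

s3 = s3fs.S3FileSystem(anon=True)

# Placeholder path; the original bucket is gone.
with s3.open('s3://some-bucket/ATL03_repacked.h5', 'rb') as f:
    # page_buf_size turns on libhdf5's page buffer. It only has an effect
    # on files written or repacked with the PAGE file space strategy, and
    # it must be at least as large as the file's page size.
    with h5py.File(f, 'r', page_buf_size=8 * 1024**2) as h5:
        h_ph = h5['/gt2l/heights/h_ph'][:]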

The variable mean calculation example reads all the data for the /gt2l/heights/h_ph dataset only once and then discards it, which means the available libhdf5 caches may not help much in this use case.

@betolink

The curious thing is that in some instances repacked files show faster access times than their non-repacked originals without passing any special parameters to h5py or xarray.

@ajelenak

That's probably because of the paged aggregation applied to the repacked file, which forces libhdf5 to make S3 requests only in units of the file page size. Those pages bring back much more data (likely quite a few chunks in one request) compared to the original file, where libhdf5 can make S3 requests for as little as 8 bytes.
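
Paged aggregation is what h5repack's -S and -G options control. A hedged sketch of that repacking step (file names are placeholders):

import subprocess

# -S PAGE selects the paged aggregation file space strategy;
# -G sets the file page size in bytes (8 MiB here).
subprocess.run(
    ['h5repack', '-S', 'PAGE', '-G', str(8 * 1024**2),
     'ATL03_original.h5', 'ATL03_repacked.h5'],
    check=True,
)

With this layout, any read that touches a page pulls the whole page, so one S3 request can serve many small metadata and chunk reads.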

@asteiker mentioned this issue Nov 2, 2023
@betolink

We had a very interesting conversation/brainstorming session with @ajelenak during AGU23. He is developing tools to trace the behavior of h5py over the network: https://github.com/ajelenak/ros3vfd-log-info. We'll use them to get a better idea of how repacking and paged aggregation impact file access times. I'm not sure if this tool can be used with h5py -> fsspec or just with the ros3 driver.

@ajelenak

Currently it can only parse libhdf5's ros3 driver logs. I was interested in those because they are the most accurate source of information about where in a file, and how many bytes, libhdf5 is reading. An fsspec log parser can certainly be added. Do you have one to share?
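
For anyone reproducing this, a hedged sketch of reading the same dataset through the ros3 driver (requires an HDF5 build with the ros3 VFD enabled; the URL is a placeholder):

import h5py

# Anonymous access to a public object; for private buckets, credentials
# can be supplied through the aws_region/secret_id/secret_key keywords.
with h5py.File(
    'https://some-bucket.s3.us-west-2.amazonaws.com/ATL03_repacked.h5',
    'r',
    driver='ros3',
) as h5:
    h_ph = h5['/gt2l/heights/h_ph'][:]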

@betolink

Working on it! @ajelenak, fsspec logs are too verbose, and I'm figuring out how we can filter them before they get flushed so they match what this tool needs.
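
Roughly the kind of filter I have in mind, using stdlib logging on the s3fs logger (the 'Fetch:' substring is an assumption about what s3fs's debug messages look like):

import logging

class FetchOnlyFilter(logging.Filter):
    # Hypothetical filter: keep only records that describe byte-range reads.
    def filter(self, record):
        return 'Fetch:' in record.getMessage()

logger = logging.getLogger('s3fs')
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler()
handler.addFilter(FetchOnlyFilter())
logger.addHandler(handler)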
