
Performance: reading h5 file is slow compared to h5py #267

Open
WindSoilder opened this issue Jan 16, 2024 · 2 comments

WindSoilder commented Jan 16, 2024

Hi, I have a case where I need to load some h5 files into memory as a cache. Each h5 file contains a lot of datasets, and each dataset contains thousands of rows of a 1-D compound array.

Sorry that I can't provide the original h5 file, but I can simulate a file with a similar structure to my use case.

Code to generate such an h5 file:

import h5py
import pandas as pd

f = h5py.File("tmp.h5", "w")

for i in range(1, 15001):
    dataset_name = str(i)
    data = pd.DataFrame(
        {key: [1] * 3000 for key in ["a1", "a2", "a3", "a4", "a5", "a6"]}
    )
    data = data.astype(
        {"a1": "<u4", "a2": "<f8", "a3": "<f8", "a4": "<f8", "a5": "<f8", "a6": "<u8"}
    )
    f.create_dataset(
        dataset_name, data=data.to_records(index=False), compression=9, shuffle=False
    )

Here is my generated h5 file: tmp.tar.gz

Reader code

Here is the Rust code:

use hdf5::{File, H5Type};
use ndarray::{s, Array1};
use std::collections::HashMap;
use std::error::Error;
use std::result::Result as StdResult;

#[derive(H5Type, Clone, PartialEq, Debug)]
#[repr(C)]
pub struct TmpData {
    a1: u32,
    a2: f64,
    a3: f64,
    a4: f64,
    a5: f64,
    a6: u64,
}

pub fn read_to_mem(path: &str) -> StdResult<HashMap<String, Array1<TmpData>>, Box<dyn Error>> {
    let file = File::open(path)?; // open for reading
    let mut result = HashMap::new();
    for dataset in file.datasets()? {
        let data = dataset.read_slice_1d::<TmpData, _>(s![..])?;
        result.insert(dataset.name(), data);
    }
    Ok(result)
}


fn main() {
    let _ = read_to_mem("tmp.h5").unwrap();
}

And here is the Python code:

import h5py
f = h5py.File("tmp.h5")
cache = {k: v[()] for k, v in f.items()}

For comparison, the hdf5-rust code takes 8m19s to read the whole file, while the h5py code takes about 30 seconds.
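For anyone wanting to reproduce the h5py side, here is a self-contained timing harness along the same lines. It builds a smaller stand-in file (100 datasets instead of 15000, with the same dtype as the generator above) so it runs quickly; the file name `tmp_small.h5` is made up for this sketch.

```python
import time

import h5py
import numpy as np

# Stand-in file: 100 datasets of 3000 packed records each, same dtype
# as the generator above (itemsize 44, no padding between fields).
dt = np.dtype([("a1", "<u4"), ("a2", "<f8"), ("a3", "<f8"),
               ("a4", "<f8"), ("a5", "<f8"), ("a6", "<u8")])
with h5py.File("tmp_small.h5", "w") as f:
    rec = np.zeros(3000, dtype=dt)
    for i in range(1, 101):
        f.create_dataset(str(i), data=rec, compression=9)

# Time the same read pattern as the issue: pull every dataset into memory.
start = time.perf_counter()
with h5py.File("tmp_small.h5") as f:
    cache = {k: v[()] for k, v in f.items()}
elapsed = time.perf_counter() - start
print(f"read {len(cache)} datasets in {elapsed:.3f}s")
```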

I've tried enabling the f16 feature, but had no luck.

Am I doing something wrong? Or how can I improve the performance?

@mulimoen (Collaborator)

I think this has popped up before (I can't find the issue); it was due to hdf5 doing a conversion of every compound element internally when it could have been a copy/no-op. You could try creating a flamegraph to verify this.

h5py might be reading the file in a different way than the naive approach used in this crate. We should look at their approach and copy it.

@aldanor (Owner) commented Jan 30, 2024

Numpy structured arrays produce packed layouts by default. You can check that .dtype.itemsize in your case equals 44, whereas the Rust struct you have is repr(C), so its size_of will be 48. No surprise, then: h5py does a direct read with zero work afterwards, whereas in Rust the layouts mismatch and every field has to be copied into its place. So, you'd want to do either:

  • Use align=True when creating the recarrays; then you can use them with a repr(C) struct
  • Use repr(packed) on the struct; then you can use it with packed arrays
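The layout mismatch described above can be checked directly in numpy; a small sketch, using the same field layout as the issue's dtype:

```python
import numpy as np

fields = [("a1", "<u4"), ("a2", "<f8"), ("a3", "<f8"),
          ("a4", "<f8"), ("a5", "<f8"), ("a6", "<u8")]

# numpy's default for structured dtypes: packed, fields laid out
# back-to-back with no padding.
packed = np.dtype(fields)
print(packed.itemsize)  # 44 = 4 + 4*8 + 8

# align=True inserts 4 bytes of padding after the u4 field, which is
# the layout a #[repr(C)] Rust struct (size_of == 48) expects.
aligned = np.dtype(fields, align=True)
print(aligned.itemsize)  # 48
print(aligned.fields["a2"][1])  # offset 8, not 4: a2 is 8-byte aligned
```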
