
Performance: reading h5 file is slow compared to h5py #267

Open
WindSoilder opened this issue Jan 16, 2024 · 2 comments

WindSoilder commented Jan 16, 2024

Hi, I have a case where I need to load some h5 files into memory as a cache. Each h5 file contains a lot of datasets, and each dataset contains thousands of rows of a 1-D compound array.

Sorry that I can't provide the original h5 file, but I can simulate a file with a similar structure to my use case.

Code to generate such an h5 file:

import h5py
import pandas as pd

f = h5py.File("tmp.h5", "w")

for i in range(1, 15001):
    dataset_name = str(i)
    data = pd.DataFrame(
        {key: [1] * 3000 for key in ["a1", "a2", "a3", "a4", "a5", "a6"]}
    )
    data = data.astype(
        {"a1": "<u4", "a2": "<f8", "a3": "<f8", "a4": "<f8", "a5": "<f8", "a6": "<u8"}
    )
    f.create_dataset(
        dataset_name, data=data.to_records(index=False), compression=9, shuffle=False
    )

Here is my generated h5 file: tmp.tar.gz

Reader code

Here is the Rust code:

use hdf5::{File, H5Type};
use ndarray::{s, Array1};
use std::collections::HashMap;
use std::error::Error;
use std::result::Result as StdResult;

#[derive(H5Type, Clone, PartialEq, Debug)]
#[repr(C)]
pub struct TmpData {
    a1: u32,
    a2: f64,
    a3: f64,
    a4: f64,
    a5: f64,
    a6: u64,
}

pub fn read_to_mem(path: &str) -> StdResult<HashMap<String, Array1<TmpData>>, Box<dyn Error>> {
    let file = File::open(path)?; // open for reading
    let mut result = HashMap::new();
    for dataset in file.datasets()? {
        let data = dataset.read_slice_1d::<TmpData, _>(s![..])?;
        result.insert(dataset.name(), data);
    }
    Ok(result)
}


fn main() {
    let _ = read_to_mem("tmp.h5").unwrap();
}

And here is the Python code:

import h5py
f = h5py.File("tmp.h5")
cache = {k: v[()] for k, v in f.items()}

For comparison, the hdf5-rust code takes 8m19s to read the whole file, while the h5py code takes about 30 seconds.
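For anyone wanting to reproduce the h5py side, here is a self-contained timing harness along the same lines. It builds a smaller stand-in file (100 datasets instead of 15000, with the same dtype as the generator above) so it runs quickly; the file name `tmp_small.h5` is made up for this sketch.

```python
import time

import h5py
import numpy as np

# Stand-in file: 100 datasets of 3000 packed records each, same dtype
# as the generator above (itemsize 44, no padding between fields).
dt = np.dtype([("a1", "<u4"), ("a2", "<f8"), ("a3", "<f8"),
               ("a4", "<f8"), ("a5", "<f8"), ("a6", "<u8")])
with h5py.File("tmp_small.h5", "w") as f:
    rec = np.zeros(3000, dtype=dt)
    for i in range(1, 101):
        f.create_dataset(str(i), data=rec, compression=9)

# Time the same read pattern as the issue: pull every dataset into memory.
start = time.perf_counter()
with h5py.File("tmp_small.h5") as f:
    cache = {k: v[()] for k, v in f.items()}
elapsed = time.perf_counter() - start
print(f"read {len(cache)} datasets in {elapsed:.3f}s")
```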

I've tried enabling the f16 feature, but had no luck.

Am I doing something wrong? Or how can I improve the performance?

@mulimoen (Collaborator)

I think this has popped up before (I can't find the issue); it was due to hdf5 doing a conversion of every compound element internally when it could have been a copy/no-op. You could try creating a flamegraph to verify this.

h5py might be reading the file in a different way than the naive approach used in this crate. We should look at their approach and copy it.

@aldanor (Owner) commented Jan 30, 2024

Numpy structured arrays produce packed layouts by default. You can check that .dtype.itemsize in your case equals 44, whereas the Rust struct you have is repr(C), so its size_of will be 48. No surprise, then: h5py does a direct read with zero work afterwards, whereas in Rust the layouts mismatch and every field has to be copied into its place. So, you'd want to do either:

  • Use align=True when creating the recarrays; then you can use them with a repr(C) struct
  • Use repr(packed) on the struct; then you can use it with packed arrays
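The layout mismatch described above can be checked directly in numpy; a small sketch, using the same field layout as the issue's dtype:

```python
import numpy as np

fields = [("a1", "<u4"), ("a2", "<f8"), ("a3", "<f8"),
          ("a4", "<f8"), ("a5", "<f8"), ("a6", "<u8")]

# numpy's default for structured dtypes: packed, fields laid out
# back-to-back with no padding.
packed = np.dtype(fields)
print(packed.itemsize)  # 44 = 4 + 4*8 + 8

# align=True inserts 4 bytes of padding after the u4 field, which is
# the layout a #[repr(C)] Rust struct (size_of == 48) expects.
aligned = np.dtype(fields, align=True)
print(aligned.itemsize)  # 48
print(aligned.fields["a2"][1])  # offset 8, not 4: a2 is 8-byte aligned
```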
