
Z5 performance #118

Open
weilewei opened this issue Jun 26, 2019 · 15 comments

@weilewei
Contributor

Hi,

I am not sure if this is the right question to ask here. How do you see the performance of parallel I/O for Z5 in the context of distributed computing compared with other I/O libraries (hdf5)? This is just a general question. Because I am thinking of integrating Z5 into HPX (https://github.com/STEllAR-GROUP/hpx), a C++ Standard Library for Concurrency and Parallelism, in the near future, I would like to see if there is any performance benchmark that I can refer to and any performance comparison that I can make. This could be my graduate study project.

Over this summer, I am working on a C API for Z5 (https://github.com/kmpaul/cz5test). The idea of this project is to compare Z5's performance with another parallel I/O library. The project is still in progress; I would love to share the results with you once they are available.

Any suggestions would be helpful. Thanks.

@clbarnes
Contributor

clbarnes commented Jun 26, 2019

I think @aschampion designed the benchmarks for rust-n5, which are here, to match some benchmarks being done on the Java reference implementation of N5, although I don't know where the Java benchmarks are. Not sure about the zarr side of things, although I personally see more of a future for zarr than for N5 as a format.

@constantinpape
Owner

constantinpape commented Jun 26, 2019

I am not sure if this is the right question to ask here.

No worries, this is a perfectly legit question to ask here.

How do you see the performance of parallel I/O for Z5 in the context of distributed computing compared with other I/O libraries (hdf5)?

In general, the big advantage of z5 (or, to be more precise, of n5 / zarr) compared to hdf5 is that it supports parallel write operations. It's important to note that this only works if individual chunks are not accessed concurrently (in z5, concurrent access to the same chunk leads to undefined behaviour). @clbarnes and I actually started to implement file locking for chunks quite a while back, but this has gone stale, mainly because there are several issues with file locking in general. If you want to know more about this, have a look at #65, #66 and #63.
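To illustrate what "no concurrent access to the same chunk" means in practice, here is a minimal sketch of chunk-aligned parallel writes from multiple threads. It assumes the z5 C++ API of this period (z5::createDataset and z5::multiarray::writeSubarray, as in the README example); exact headers and signatures may differ between versions, so treat it as an illustration rather than canonical usage.

#include <cstddef>
#include <thread>
#include <vector>

#include "xtensor/xarray.hpp"
#include "z5/dataset_factory.hxx"
#include "z5/multiarray/xtensor_access.hxx"

int main() {
    // zarr dataset of shape 400 x 100 x 100 with 100^3 chunks
    std::vector<std::size_t> shape = {400, 100, 100};
    std::vector<std::size_t> chunks = {100, 100, 100};
    auto ds = z5::createDataset("data.zr", "float32", shape, chunks, true);

    // each thread writes one disjoint, chunk-aligned block along the first axis,
    // so no chunk is ever touched by more than one thread
    auto writeBlock = [&ds](std::size_t zOffset) {
        xt::xarray<float>::shape_type blockShape = {100, 100, 100};
        xt::xarray<float> block(blockShape, static_cast<float>(zOffset));
        z5::types::ShapeType offset = {zOffset, 0, 0};
        z5::multiarray::writeSubarray<float>(ds, block, offset.begin());
    };

    std::vector<std::thread> threads;
    for (std::size_t z = 0; z < 400; z += 100) {
        threads.emplace_back(writeBlock, z);
    }
    for (auto& t : threads) {
        t.join();
    }
    return 0;
}

Because every block's offset and extent are multiples of the chunk shape, the threads never write to the same chunk, which is exactly the condition described above.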

In terms of single-threaded performance, z5 and hdf5 are roughly equal (I compared the Python performance against h5py at some point).

I am thinking of integrating Z5 into HPX (https://github.com/STEllAR-GROUP/hpx), a C++ Standard Library for Concurrency and Parallelism, in the near future.

That would be awesome!

I would like to see if there is any performance benchmark that I can refer to and any performance comparison that I can make.

Besides the java / n5 benchmarks that @clbarnes mentioned, here are the benchmarks I used while I developed the library, also comparing to hdf5:
https://github.com/constantinpape/z5/tree/master/src/bench

Since then, I started to set up an asv repository:
https://github.com/constantinpape/z5py-benchmarks
But this is still unfinished (any contributions would be very welcome!).

Over this summer, I am working on a C API for Z5 (https://github.com/kmpaul/cz5test).

I will give you some feedback on this in the related issue #68.

@weilewei
Contributor Author

Thanks for your explanations @constantinpape @clbarnes! I have been reading your comments and looking into this over the past few days.

In general, the big advantage of z5 (or, to be more precise, of n5 / zarr) compared to hdf5 is that it supports parallel write operations. It's important to note that this only works if individual chunks are not accessed concurrently (in z5, concurrent access to the same chunk leads to undefined behaviour). @clbarnes and I actually started to implement file locking for chunks quite a while back, but this has gone stale, mainly because there are several issues with file locking in general. If you want to know more about this, have a look at #65, #66 and #63.

It's great to see that z5 supports parallel writes; it might be a good fit for HPX, which can launch many threads and processes. It's also sad to see that hdf5 currently does not support parallel writes, and I just noticed that they are proposing a parallel design here: https://portal.hdfgroup.org/display/HDF5/Introduction+to+Parallel+HDF5. This also means that I probably cannot get any performance comparison between z5 and hdf5 in the context of parallel computing in the near future.

Also, I am not sure how to solve the file locking issue at this point. I will need to look into it in detail later.

In terms of single-threaded performance, z5 and hdf5 are roughly equal (I compared the Python performance against h5py at some point).
Thanks, good to know.

Besides the java / n5 benchmarks that @clbarnes mentioned, here are the benchmarks I used while I developed the library, also comparing to hdf5:
https://github.com/constantinpape/z5/tree/master/src/bench
Since then, I started to set up an asv repository:
https://github.com/constantinpape/z5py-benchmarks
But this is still unfinished (any contributions would be very welcome!).

Thanks for providing these benchmarks! I am trying to see if I can measure performance on the C/C++ side, as my projects are mainly in these two languages.

@constantinpape
Owner

It's also sad to see that hdf5 currently does not support parallel writes, and I just noticed that they are proposing a parallel design here: https://portal.hdfgroup.org/display/HDF5/Introduction+to+Parallel+HDF5.

My first answer was not quite precise with regard to parallel writing in HDF5. There is support for parallel writing, but it is not a feature that is available by default. As far as I am aware, there are two options to do this in HDF5, both with some downsides:

  • Parallel HDF5 (the link you posted): it only works with MPI and does not allow thread-based access; to the best of my knowledge it is not implemented in h5py, so it is not available in Python, and it is also not available in Java (see the sketch after this list).
  • The other option is to use region references (https://support.hdfgroup.org/HDF5/Tutor/reftoreg.html) and to parallelize over regions that are stored in separate files. This approach becomes very similar to the chunk-based storage of N5 / zarr, but it's probably less efficient if regions become too small (I have never checked this, so it's just my gut feeling).
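To make the first option concrete, here is a minimal sketch of the standard parallel HDF5 pattern (MPI-IO file driver via H5Pset_fapl_mpio, one hyperslab per rank, collective transfer). It requires an MPI-enabled HDF5 build; the file and dataset names are just placeholders.

#include <hdf5.h>
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // open the file collectively with the MPI-IO driver
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("parallel.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);

    // dataset with one row per rank
    const hsize_t ncols = 1024;
    hsize_t dims[2] = {static_cast<hsize_t>(size), ncols};
    hid_t filespace = H5Screate_simple(2, dims, nullptr);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_FLOAT, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    // each rank selects its own (disjoint) hyperslab
    hsize_t offset[2] = {static_cast<hsize_t>(rank), 0};
    hsize_t count[2] = {1, ncols};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, nullptr, count, nullptr);
    hid_t memspace = H5Screate_simple(2, count, nullptr);

    // collective write of the rank-local data
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    std::vector<float> row(ncols, static_cast<float>(rank));
    H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, row.data());

    H5Pclose(dxpl);
    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
    H5Fclose(file);
    MPI_Finalize();
    return 0;
}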

@weilewei
Contributor Author

weilewei commented Jul 1, 2019

I see. Thanks for sharing and answering. @constantinpape

@weilewei
Contributor Author

weilewei commented Jul 26, 2019

Sorry for bringing up another performance issue. Could you please take a look at the issue I opened here: xtensor-stack/xtensor#1695? Let me know if you have any suggestions. Thanks.

@constantinpape
Owner

Could you please take a look at the issue I opened here: QuantStack/xtensor#1695?

Thanks for bringing this up. I just came back from vacation; I had a quick look and I think I have some ideas. I will try to take a closer look and write something up tomorrow.

@constantinpape
Owner

constantinpape commented Aug 3, 2019

First, let me provide some context on the performance issue you brought up:

This header contains the functions to read / write a region of interest (ROI) of a dataset from / into an xtensor multiarray. I will focus on reading here; writing is mostly analogous.

Reading the ROI works as follows:

  • iterate over all chunks that overlap the ROI;
  • read each of these chunks into a (contiguous) buffer;
  • copy the relevant part of the buffer into the corresponding view of the output array. Here either the chunk lies completely inside the ROI (complete overlap), or only part of it overlaps.

For the first case (complete overlap) I have noticed that the naive way of copying via xtensor functionality

const auto bufView = xt::adapt(buffer, chunksShape);
view = bufView;

was a major bottleneck, so I implemented a function to copy from buffer to view myself.
This function does not work for 1d tensors though, and I did not bother to fix this or implement a separate copy function for 1d, so I just fall back to the naive xtensor copy here.
This seems to be the performance issue that you encountered.
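For illustration, the idea behind such a custom copy can be sketched in 2d: the chunk buffer is contiguous, and within each row the target view is contiguous in the underlying (row-major) array, so whole scanlines can be moved with std::copy instead of element-wise assignment. This is only a simplified sketch under those assumptions, not the actual z5 implementation.

#include <algorithm>
#include <cstddef>
#include <vector>

#include "xtensor/xtensor.hpp"
#include "xtensor/xview.hpp"

// copy a contiguous chunk buffer into a strided 2d view, one row at a time
template <typename T, typename VIEW>
void copyBufferToView2d(const std::vector<T>& buffer, VIEW&& view) {
    const std::size_t nRows = view.shape()[0];
    const std::size_t nCols = view.shape()[1];
    for (std::size_t row = 0; row < nRows; ++row) {
        // within a row the view is contiguous in the underlying array,
        // so a single std::copy moves the whole scanline
        const T* src = buffer.data() + row * nCols;
        std::copy(src, src + nCols, &view(row, 0));
    }
}

int main() {
    // output array and a 10 x 30 ROI view into it
    xt::xtensor<float, 2>::shape_type outShape = {100, 100};
    xt::xtensor<float, 2> out(outShape, 0.f);
    auto roi = xt::view(out, xt::range(10, 20), xt::range(30, 60));

    // chunk buffer that completely overlaps the ROI
    std::vector<float> chunkBuffer(10 * 30, 1.f);
    copyBufferToView2d(chunkBuffer, roi);
    return 0;
}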

I see two options to deal with this:

  1. Fix / expand copyViewToBuffer and copyBufferToView so that they also work for 1d arrays.
  2. Investigate how to improve performance within the xtensor functionality.

Option 1 should be straightforward: I think the functions would only need to be changed a bit, or a special function for the 1d case could be implemented.
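For the 1d case itself, a possible shape of the fix for the complete-overlap path (the name just mirrors the existing copyBufferToView; this is a sketch, not the actual z5 code):

#include <algorithm>
#include <vector>

// hypothetical 1d overload: the buffer is contiguous and the 1d view can be
// filled through its iterator, so no per-dimension stride logic is needed
template <typename T, typename VIEW>
inline void copyBufferToView1d(const std::vector<T>& buffer, VIEW&& view) {
    std::copy(buffer.begin(), buffer.end(), view.begin());
}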

Option 2 would be more interesting though: I only tried the naive approach with xtensor, i.e. I did not specify the layout types for the views into the array and buffer.
This might improve performance enough, maybe even to get rid of my custom copy functions completely. I am not quite sure how well this would work though, because the view into the multiarray is strided. Maybe @SylvainCorlay @wolfv or @JohanMabille could provide some insight here.

If we were to get rid of the custom copy functions completely, this needs to be benchmarked carefully, because I don't want to regress compared to the current performance.

@weilewei
Contributor Author

weilewei commented Aug 8, 2019

Thanks for the updates. @halehawk and I will look into this soon.

@weilewei
Contributor Author

Just FYI, my presentation of the summer intern project is online now: https://www2.cisl.ucar.edu/siparcs-2019-wei, where I report how we integrated Z5 into an earth model and compare the performance of Z5, netCDF4, and PnetCDF. I will keep you posted if we have any future publication.

@constantinpape
Owner

Just FYI, my presentation of the summer intern project is online now: https://www2.cisl.ucar.edu/siparcs-2019-wei, where I report how we integrated Z5 into an earth model and compare the performance of Z5, netCDF4, and PnetCDF.

Thanks for sharing this and great work!
I have one question: Which compression library did you use in z5 for the performance analysis (slide 10/11) and did you compare the compression ratios between z5 and netCDF?
Also, did you compare the performance of z5 and PnetCDF when you don't use compression in z5 (compressor=raw)?

I will keep you posted if we have any future publication.

Looking forward to it!

@weilewei
Contributor Author

weilewei commented Aug 21, 2019

I have one question: Which compression library did you use in z5 for the performance analysis (slide 10/11) and did you compare the compression ratios between z5 and netCDF?
Also, did you compare the performance of z5 and PnetCDF when you don't use compression in z5 (compressor=raw)?

We use zlib for compression. The compression ratio between z5 and netCDF is similar. No, I haven't tried the no-compression setting (compressor=raw) for the z5 vs. PnetCDF comparison yet (maybe we will try it later).

@halehawk

halehawk commented Aug 21, 2019 via email

@constantinpape
Owner

OK, thanks for the follow-up.

Maybe the output size is fairly small (float number 3192288 on each processor).

That's indeed fairly small.
In my experience, raw compression (i.e. no compression) can bring quite a speed-up.

@halehawk

halehawk commented Aug 21, 2019 via email
