Using zspy with database-format? #249

Open
magnunor opened this issue Apr 5, 2024 · 4 comments

Comments

@magnunor
Contributor

magnunor commented Apr 5, 2024

The go-to file format for saving large files in HyperSpy is currently .zspy. It uses the Zarr library and (by default) saves the individual chunks in a dataset as individual files, via zarr.NestedDirectoryStore. Since the data is stored in individual files, Python can both write and read the data in parallel. This makes it much faster than, for example, HDF5 files (.hspy).

However, one large downside of storing the data this way is that one can end up with several thousand individual files nested within a large number of folders. Sharing this directly with other people is tricky. While it is possible to zip the data, the default zip reader/writer in Windows seems to struggle when the number of files becomes too large. In addition, it is tedious if the receiver has to uncompress the data before they can visualize it.
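
To give an idea of the scale, here is a rough sketch for counting the files inside a .zspy folder (assuming a dataset like the one below has already been saved with NestedDirectoryStore):

import os

# Count every file inside a .zspy folder written with zarr.NestedDirectoryStore.
# For the 400 x 400 x 200 x 200 dataset below with (50, 50, 50, 50) chunks,
# this is on the order of a thousand chunk files plus a few metadata files.
folder = "001_test_save_nested_dir.zspy"
n_files = sum(len(filenames) for _, _, filenames in os.walk(folder))
print(n_files)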

Zarr has support for several database formats, some of which can handle parallel reading and/or writing. With these, it should be possible to keep the parallel read/write while ending up with only one or two files.

I am not at all familiar with these types of database formats, so I wanted to see how they perform, and whether they could be useful for working on and sharing large multidimensional datasets.

File saving

Making the dataset

import dask.array as da
import hyperspy.api as hs

# Lazy 4D test dataset: zeros everywhere except a random central region
dask_data = da.zeros(shape=(400, 400, 200, 200), chunks=(50, 50, 50, 50))
dask_data[:, :, 80:120, 80:120] = da.random.random((400, 400, 40, 40))
s = hs.signals.Signal2D(dask_data).as_lazy()

Saving the datasets:

import zarr

########################## LMDB key-value store (needs the lmdb package)
store = zarr.LMDBStore('001_test_save_lmdb.zspy')
s.save(store)

########################## Nested directory of chunk files (the .zspy default)
store = zarr.NestedDirectoryStore('001_test_save_nested_dir.zspy')
s.save(store)

########################## SQLite database in a single file
store = zarr.SQLiteStore('001_test_save_sqldb.zspy')
s.save(store)
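
Note: for the database-type stores (LMDBStore, SQLiteStore), the zarr documentation closes the store after writing to make sure everything is flushed to disk. I have not checked whether s.save() already does this internally, so to be safe one could add:

store.close()  # flush and close the LMDBStore / SQLiteStore after saving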

File loading

Then loading the same datasets

from time import time
import zarr
import hyperspy.api as hs

Note: run these separately, since the file is pretty large.

t0 = time()
store = zarr.LMDBStore("001_test_save_lmdb.zspy")
s = hs.load(store)
print("LMDB {0}".format(time() - t0))
t0 = time()
store = zarr.NestedDirectoryStore("001_test_save_nested_dir.zspy")
s = hs.load(store)
print("NestedDirectory {0}".format(time() - t0))
t0 = time()
store = zarr.SQLiteStore('001_test_save_sqldb.zspy')
s = hs.load(store)
print("SQLite {0}".format(time() - t0))

Results:

  • LMDBStore: write 14.12 s, read 30.52 s
  • NestedDirectoryStore: write 4.28 s, read 32.47 s
  • SQLiteStore: write 28.75 s, read 35.19 s
@ericpre
Member

ericpre commented Apr 5, 2024

Can you be more specific about the issue on Windows? Does it have to do with the number of files per directory, the nested structure of the directories, or the specific software being used on Windows?
I had issues with path length on Windows, but this can be fixed easily with a Windows setting.

I usually copy the folder without zipping and it works fine when synchronising using Dropbox, OneDrive, Nextcloud, etc. What are you using to share the data?

@CSSFrancis
Member

@magnunor Another thing to consider is whether Windows is trying to compress the data even further. I think on Linux systems it checks whether the underlying data is already compressed and won't "double" compress it, but it's quite possible that Windows doesn't handle that case nearly as well.

@CSSFrancis
Member

Something that I've been meaning to try is using an S3-like file system and the FSStore class. People seem to really like that for partial reads over a network, which might be of interest.

Another thing to consider is that the v3 specification includes support for "sharding", which should be quite interesting as well and, I think, improves performance on Windows computers.
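
For reference, something like this is what I have in mind for the FSStore idea (the bucket URL and storage options are only placeholders, and I have not tried or benchmarked this):

import zarr
import hyperspy.api as hs

# Hypothetical S3 location; the "s3://" protocol needs the s3fs package,
# and extra keyword arguments (like anon=True) are passed on to fsspec/s3fs.
store = zarr.storage.FSStore(
    "s3://some-bucket/001_test_save_nested_dir.zspy",
    mode="r",
    anon=True,
)
s = hs.load(store)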

@magnunor
Contributor Author

I usually copy the folder without zipping and it works fine when synchronising using Dropbox, OneDrive, Nextcloud, etc. What are you using to share the data?

Internal sharing is fine, but for example Zenodo or our website-based filesender can't handle folder structures (at least not easily).


I tested this a bit more, and the ZipStore seems to perform pretty well:

  • NestedDirectoryStore: saving 4.63 seconds, loading 31.7 seconds
  • ZipStore: saving 4.91 seconds, loading 30.7 seconds

The code

Saving the data:

from time import time
import zarr
import dask.array as da
import hyperspy.api as hs

dask_data = da.zeros(shape=(400, 400, 200, 200), chunks=(50, 50, 50, 50))
dask_data[:, :, 80:120, 80:120] = da.random.random((400, 400, 40, 40))

s = hs.signals.Signal2D(dask_data).as_lazy()

###########################
t0 = time()
store = zarr.NestedDirectoryStore('001_test_save_nested_dir.zspy')
s.save(store)
print("NestedDirectory store, save-time: {0}".format(time() - t0))

##########################
t0 = time()
store = zarr.ZipStore('001_test_save_zipstore.zspy')
s.save(store)
print("ZIP store, save-time: {0}".format(time() - t0))

Loading the data:

from time import time
import zarr
import hyperspy.api as hs

##############################
t0 = time()
store = zarr.NestedDirectoryStore("001_test_save_nested_dir.zspy")
s = hs.load(store)
print("NestedDirectory {0}".format(time() - t0))
"""

##############################
t0 = time()
store = zarr.ZipStore('001_test_save_zipstore.zspy')
s = hs.load(store)
print("ZIP {0}".format(time() - t0))
