large number of small file size #80

Open
apatlpo opened this issue Feb 9, 2021 · 18 comments

@apatlpo (Member) commented Feb 9, 2021

A usage question, I guess.

I'm working on an HPC system with a GPFS filesystem, and I've been told a couple of times by the sys admins (ping @guillaumeeb for confirmation) that I should not produce a (very) large number of small files (typically < 10 MB each).
This potentially puts a lower bound on the intermediate file size.

With rechunker, the knobs that control the intermediate file size are the memory per worker and the target chunk size.
The former is limited by the memory of your node (say 100 GB) and leads to an upper bound on the intermediate file size.

Does this mean that in some situations you have no choice but to apply rechunker in multiple passes, in order to complete a rechunk that is too ambitious given the constraints above?
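
For reference, here is roughly how these two knobs appear in a rechunker call (the paths, shapes and values below are made up for illustration):

import zarr
from rechunker import rechunk

source = zarr.open("source.zarr")          # e.g. chunked (1, 4320, 4320)

plan = rechunk(
    source,
    target_chunks=(8784, 24, 72),          # knob 1: desired output chunking
    max_mem="30GB",                        # knob 2: per-worker memory budget
    target_store="target.zarr",
    temp_store="temp.zarr",                # the intermediate chunks land here
)
plan.execute()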

@rabernat (Member) commented Feb 9, 2021

This is one of the downsides to the way rechunker and Zarr currently work. If you can think of a workaround, I'd love to know about it.

@apatlpo (Member Author) commented Feb 9, 2021

What would be great (as far as I'm concerned) is if rechunker were able to detect such a situation and directly propose a multi-pass approach.
If you think this would be a valuable feature (it is difficult to know how often this comes up), we can discuss it some more and I can lead a PR.
What do people think?

@rabernat (Member) commented Feb 9, 2021

How would you detect this situation? What criterion would you use?

@apatlpo (Member Author) commented Feb 9, 2021

The intermediate chunk size can probably be translated into an equivalent on-disk file size, which needs to stay above the minimum file size threshold.
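
For illustration, a minimal sketch of that check (the 10 MiB threshold, the chunk shape and the float32 itemsize are assumptions for the example):

import numpy as np

MIN_FILE_SIZE = 10 * 2**20                       # minimum recommended file size, 10 MiB
intermediate_chunks = (1, 192, 4320)             # hypothetical intermediate chunk shape
itemsize = np.dtype("float32").itemsize

chunk_bytes = itemsize * int(np.prod(intermediate_chunks))
if chunk_bytes < MIN_FILE_SIZE:
    print(f"intermediate chunks are only {chunk_bytes / 2**20:.1f} MiB; "
          "consider a multi-pass rechunk")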

@apatlpo (Member Author) commented Feb 9, 2021

I actually did this manually, by adjusting the target chunk size iteratively.

@rabernat (Member) commented Feb 9, 2021

But don't you care about the total number of intermediate files? AFAIK, that is the thing that stresses the filesystem, not the chunk size per se.

@apatlpo (Member Author) commented Feb 9, 2021

I had the feeling that as long as the files were large enough the filesystem was fine, no matter the number of files.
But maybe I don't understand how filesystems work well enough (or at all :) ).
@guillaumeeb any thoughts? (He's the one complaining when I produce small files on the CNES cluster...)

@rabernat (Member) commented Feb 9, 2021

I doubt they would have complained if you created one single 1kb file. So "this situation" is some function of file size and number of files. I'd like to understand that function better before we start brainstorming solutions.

@guillaumeeb (Member)

Ok, let me try to explain. We discussed this a bit in https://discourse.pangeo.io/t/best-practices-to-go-from-1000s-of-netcdf-files-to-analyses-on-a-hpc-cluster/588/13?u=geynard.

At CNES, we have a GPFS file system. It can hold 8.5 PiB of data, but it can also only record a limited amount of metadata (inodes). I'm not sure of the exact number, but we've converged on a limit for the mean file size of any given space on our storage: 1 MiB. This doesn't mean that you can't create smaller files, even lots of them, just that each project space gets an upper limit on inodes, and thus on the number of files and directories (which can be high if your allotted space is too). Be aware that some other computing centers are much more conservative about inode limits.

This GPFS file system also has a given bandwidth, up to 50 GiB/s for us (shared between all nodes and users), but more importantly a limit on the number of IOPS (input/output operations per second: file creation, random writes, ...) it can perform. This limit depends on the nature of the operation and a lot of other parameters, and is very hard to measure. But it is the one we hit the most (and our FS is really optimized for this): a single user with a few hundred cores can slow down the entire FS with bad IO. Two principal examples of bad IO:

  • Using the FS as a database, or having a parallel job iterate over some grid and randomly read values from a netCDF collection, byte by byte.
  • Generating a lot (say millions) of small files (< 1 MiB): the limitation won't be bandwidth, but IOPS.

So I hope this clarifies things. In the end you must be careful about two things:

  • the upper limit of inodes you can create in a given space;
  • the minimum chunk size you have to use in order to avoid overloading the FS with millions of IOPS. At CNES, I'd say that targeting at least 10 MiB is a good idea, but in some cases you could do with smaller chunks, 1 MiB being a hard limit.

To go deeper, if I had to translate that into pseudo-code:

# illustrative thresholds; n_chunks and chunk_bytes describe the intermediate dataset
MiB = 2**20
if n_chunks > SOME_HIGH_LIMIT:
    use_bigger_chunks()
if n_chunks > 100_000 and chunk_bytes < 10 * MiB:
    use_bigger_chunks()
if chunk_bytes < 1 * MiB:
    use_bigger_chunks()

Of course, all these limits will vary from one HPC center to another, but I'll bet no admin would enjoy lots of files of less than a MiB.

@rabernat (Member) commented Feb 9, 2021

Very interesting. Will digest for a bit.

The tradeoffs remind me a lot of this paper: Predicting and Comparing the Performance of Array Management Libraries, which compares HDF5 and Zarr.

@guillaumeeb (Member)

Note also that object stores behave quite differently, as you know very well @rabernat. For those, there is no metadata table and no inodes; the only pain point is write latency. This means you only want chunks big enough to avoid stacking up write or read calls, given the tens of milliseconds of latency on each one.

@rabernat (Member) commented Feb 9, 2021

To be honest, we really designed rechunker with object storage in mind, even though there are probably more users on HPC. Can you look at the algorithm and think of some strategy we could use to mitigate this inode problem? Zarr doesn't support consolidating multiple chunks into one file, although this has been discussed (I can't find the GitHub issue now). An alternative would be to try using TileDB as an intermediate storage format. TileDB should support concurrent writes to the same chunk, allowing us to use bigger chunks. To do that, we would need a PR to rechunker to implement a different array backend.

@apatlpo - it would be good to get the exact size of the input chunks, intermediate chunks, and target chunks for your use case. How many files / what size are we talking about? Is this LLC4320?

@shoyer (Collaborator) commented Feb 11, 2021

To be clear, object storage also has fixed overhead per object. It's less of an issue than in distributed filesystems, but you will still run into performance issues if you store lots of files smaller than 1 MB.

This sounds like a reason to revisit #36, since distributed execution engines like Spark and Dataflow are designed to handle spilling lots of small intermediate outputs to disk.

Within Zarr, one way to solve this would be to make an intermediate "storage" object that maps multiple chunks to different parts of the same underlying files.
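
For illustration, a minimal sketch of what such a wrapper could look like (this is not an existing Zarr or rechunker API; the class name and behaviour are hypothetical, and the in-memory index makes it single-process, write-then-read only):

import os
from collections.abc import MutableMapping

class PackedStore(MutableMapping):
    # Hypothetical sketch: pack many chunk keys into a few large shard files,
    # keeping an in-memory index of (filename, offset, length) per key.
    def __init__(self, path, keys_per_shard=1000):
        self.path = path
        self.keys_per_shard = keys_per_shard
        self.index = {}
        self._count = 0
        os.makedirs(path, exist_ok=True)

    def _shard(self):
        shard_id = self._count // self.keys_per_shard
        return os.path.join(self.path, f"shard-{shard_id:06d}.bin")

    def __setitem__(self, key, value):
        fname = self._shard()
        with open(fname, "ab") as f:        # append chunk bytes to the current shard
            offset = f.tell()
            f.write(value)
        self.index[key] = (fname, offset, len(value))
        self._count += 1

    def __getitem__(self, key):
        fname, offset, length = self.index[key]
        with open(fname, "rb") as f:
            f.seek(offset)
            return f.read(length)

    def __delitem__(self, key):
        del self.index[key]                 # shards are append-only; data stays in place

    def __iter__(self):
        return iter(self.index)

    def __len__(self):
        return len(self.index)

A real version would need the index persisted (or a deterministic key-to-offset layout) so that concurrent dask workers and later readers can find the chunks.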

@rabernat (Member)

I'm curious if #36 would help @apatlpo. Can they run Beam on their HPC system?

@shoyer (Collaborator) commented Feb 11, 2021

In theory Beam can run on top of Spark, but probably not quite as smoothly as on Cloud Dataflow. (I haven't tried it)

@guillaumeeb (Member)

> Can you look at the algorithm and think of some strategy we could use to mitigate this inode problem?

I took a quick look at it; the simplest thing I can think of at first is raising a warning if the intermediate chunks are smaller than a given size or if there are too many of them (a rough sketch of such a check follows below). The advice would then be to do an intermediate rechunking, or to use consolidated reads and writes?
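
A minimal sketch of that warning, as a standalone helper (the function name and thresholds are hypothetical, not part of rechunker; the limits follow the CNES numbers quoted above):

import math
import warnings

def check_intermediate_chunks(shape, int_chunks, itemsize,
                              min_bytes=10 * 2**20, max_files=100_000):
    # number of intermediate chunk files and their (uncompressed) size
    n_files = math.prod(math.ceil(s / c) for s, c in zip(shape, int_chunks))
    chunk_bytes = itemsize * math.prod(int_chunks)
    if chunk_bytes < min_bytes and n_files > max_files:
        warnings.warn(
            f"{n_files} intermediate chunks of ~{chunk_bytes / 2**20:.1f} MiB each; "
            "consider an extra rechunking pass or consolidating reads/writes"
        )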

As a second step, we could try to find a better algorithm for determining the intermediate chunk size than taking the minimum along each axis. E.g. if chunks are (4, 1) in the input and (1, 4) in the output, see whether we could use (2, 2) as the intermediate if we have enough memory. I think this is what consolidate_chunk() is doing, but maybe not on intermediate chunks? And consolidate_writes, as I understand it, changes the requested target_chunks size?

> To be clear, object storage also has fixed overhead per object. It's less of an issue than in distributed filesystems, but you will still run into performance issues if you store lots of files smaller than 1 MB.

Yeah, totally! Less impact on the overall service (and so on other users), but performance would also be affected!

> Spark and Dataflow are designed to handle spilling lots of small intermediate outputs to disk.

Spark could use the HPC nodes' local storage to spill intermediate chunks to disk, thus not touching the shared FS. Maybe it could also consolidate intermediate chunks on the shared FS into bigger files using its internal format, I'm not sure.

> I'm curious if #36 would help @apatlpo. Can they run Beam on their HPC system?
> In theory Beam can run on top of Spark, but probably not quite as smoothly as on Cloud Dataflow. (I haven't tried it)

We can run Spark on the CNES HPC system; it's a bit more complicated than with Dask (there is no dask-jobqueue equivalent, so we do it by hand). I've never tried Beam though.

@apatlpo (Member Author) commented Feb 12, 2021

@rabernat Yep, the original data is llc4320, with the shapes you know by heart I'm sure.
Here is a summary of a full transposition of one variable (only 1 face considered, for convenience):

{'time': 8784, 'face': 1, 'i_g': 72, 'j': 24}
Individual chunk size = 15.2 MB
Source data size: 		 8784x4320x4320 	 655.7GB
Source chunk size: 		 1x4320x4320 		 74.6MB
Source number of files: 		 8784
Intermediate chunk size: 	 1x192x4320 		 3.3MB
Intermediate number of files: 		 197640
Target chunk size: 		 8784x24x72 		 60.7MB
Target number of files: 		 10800

The above approach spills about 200,000 intermediate files that are between 1 and 3 MB, which is a bit smaller than what I was told to aim for on the CNES cluster.
max_mem is 30 GB.

So I do only a partial transposition instead, which looks like:

{'time': 2196, 'face': 1, 'i_g': 144, 'j': 48}
Individual chunk size = 15.2 MB
Source data size: 		 8784x4320x4320 	 655.7GB
Source chunk size: 		 1x4320x4320 		 74.6MB
Source number of files: 		8784
Intermediate chunk size: 	 1x768x4320 		 13.3MB
Intermediate number of files: 		 49410
Target chunk size: 		 2196x48x144 		 60.7MB
Target number of files: 		10800

i.e. about 50,000 files that are between 5 and 10 MB.

So this is not a blocking point at all in my case, as you can guess.
But this potential issue may deserve a mention in the docs.
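
For reference, the counts and sizes above can be roughly recomputed as follows (assuming float32 data and uncompressed chunks, so the sizes are approximate upper bounds):

import math

def n_chunks(shape, chunks):
    # one file per chunk along every dimension
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

shape = (8784, 4320, 4320)
itemsize = 4                                         # float32

print(n_chunks(shape, (1, 4320, 4320)))              # 8784 source files
print(n_chunks(shape, (8784, 24, 72)))               # 10800 target files
print(itemsize * math.prod((8784, 24, 72)) / 1e6)    # ~60.7 MB per target chunk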

@Thomas-Moore-Creative

> This is one of the downsides to the way rechunker and Zarr currently work. If you can think of a workaround, I'd love to know about it.

I'm also using Pangeo and rechunker mainly on HPC resources. This may be an ignorant or misguided comment, but would making use of the zarr ZipStore option help here? I don't know what penalty you pay in performance for zipping/unzipping, but just doing an rm -rf my_file.zarr on an HPC-hosted 2 TB zarr store in DirectoryStore format with 20,000 chunks isn't exactly fast, even on a BeeGFS parallel file system.
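
For what it's worth, a minimal ZipStore sketch (zarr v2 API; the path, shape and chunks are made up). All chunks live inside a single zip archive, so the filesystem only sees one file and one inode; the caveats are that keys cannot be rewritten or removed once written, and concurrent writes from multiple processes are not supported, which limits its use as a rechunker target:

import zarr

store = zarr.ZipStore("rechunked.zarr.zip", mode="w")
z = zarr.zeros((8784, 4320), chunks=(64, 512), dtype="f4", store=store)
z[:64, :512] = 1.0
store.close()                                   # flushes the zip central directory

# reading it back later
store = zarr.ZipStore("rechunked.zarr.zip", mode="r")
z = zarr.open(store)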

< I better get back to attempting to rechunk this 2TB dataset >
=/
