speed of cfa #736

Open
JonathanGregory opened this issue Mar 12, 2024 · 7 comments
Labels
question General question

Comments

@JonathanGregory

Dear @davidhassell and @sadielbartholomew

I recall David reporting, a few months ago, much faster times for cfa when processing pp files. I believe I've installed the latest version of cf-python and its dependencies:

>>> cf.environment(paths=False)
Platform: Linux-3.10.0-1160.108.1.el7.x86_64-x86_64-with-glibc2.17
HDF5 library: 1.12.2
netcdf library: 4.9.3-development
udunits2 library: libudunits2.so.0
esmpy/ESMF: not available
Python: 3.9.13
dask: 2023.7.0
netCDF4: 1.6.4
psutil: 5.9.0
packaging: 21.3
numpy: 1.25.1
scipy: 1.10.0
matplotlib: 3.5.2
cftime: 1.6.2
cfunits: 3.3.6
cfplot: not available
cfdm: 1.11.1.0
cf: 3.16.1

$ cfa
Using cf-python library version 3.16.1 at /home/users/sws02jmg/.local/lib/python3.9/site-packages/cf

In /storage/basic/baobab/jonathan/general/exprzb.000100 on the RACC I am executing cfa -f CFA4 -o nca *.pp. The directory contains 42,000 pp files, each containing one pp field. So far, it has been executing for a couple of hours. Should it take this long?

Best wishes and thanks

Jonathan

@JonathanGregory added the "bug" (Something isn't working) label on Mar 12, 2024
@davidhassell
Collaborator

Thanks, Jonathan. I shall investigate ...

@davidhassell
Collaborator

  • The aggregation was fast: ~240 seconds on my laptop, with a local copy of your data.

  • I've started the CFA write. Let's see how that goes: it's 17 minutes in already, and still going ... The output file is growing at 250 kB/minute, which seems quite slow to me, so I'll dig deeper into this.

However, at only 4 minutes to aggregate on the fly ...

@davidhassell
Collaborator

Here are my aggregate/write times:

  • Aggregate: 4 minutes
  • Write: 1 hour 39 minutes

In [19]: %time f = cf.read('*.pp')
CPU times: user 3min 45s, sys: 2.36 s, total: 3min 47s
Wall time: 3min 54s

In [20]: len(f)
4069

In [21]: %time cf.write(f, 'delme.nca', cfa=True)
CPU times: user 1h 37min 15s, sys: 56.8 s, total: 1h 38min 12s
Wall time: 1h 39min 22s

In [22]: !du -sh delme.nca
25M	delme.nca
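
As a rough consistency check on the figures quoted in this thread, the sketch below converts the observed growth rate (250 kB/minute) and the final file size (25M, from du -sh) into an expected write time. It assumes decimal units (25M treated as 25,000 kB) and a constant growth rate, both of which are simplifications; the function name is illustrative only.

```python
def expected_write_minutes(file_size_kb: float, growth_kb_per_min: float) -> float:
    """Minutes needed to write a file that grows at a constant rate."""
    return file_size_kb / growth_kb_per_min


# Figures quoted in this thread; 25M is treated as 25,000 kB (an assumption).
minutes = expected_write_minutes(25_000, 250)

# Measured wall time of the cf.write call: 1 h 39 min 22 s.
measured_minutes = 60 + 39 + 22 / 60

print(f"Expected from growth rate: {minutes:.0f} minutes")
print(f"Measured wall time:        {measured_minutes:.1f} minutes")
```

Under these assumptions the early growth rate and the final wall time agree closely, which suggests the write proceeded at a roughly steady pace throughout.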

@JonathanGregory
Author

Dear @davidhassell

Thanks for the tests. Four minutes is quick for aggregation. That is an impressive speedup, indeed. However, it's too long to wait for accessing a dataset when doing interactive analysis. If you could speed it up by another factor of 100, it would be fine. 😄

My test on racc-login-2 is still running. After nearly a day, it's written 1.9 Mbyte. Presuming it's trying to produce the same 25 Mbyte file as your test did, it will take more than three weeks to complete, which is too long to wait even for a batch job. Do you understand how it can take three weeks, or even 100 minutes, to write a netCDF file of 25 Mbyte? I haven't seen it yet, but I guess it probably contains a few hundred fields, doesn't it, metadata only of course.

Making the pph file for this directory takes about 10 minutes. This is simply a concatenation of the pp headers produced by reading all the pp files. du -sh pph gives 1.5M of actual disk space, du -sh --apparent-size pph gives 11M, which is what you'd expect for 42,000 headers of 256 bytes each plus block control words. Presumably it gets compressed by the file system owing to zeros and repetition. How can the CFA file be more than twice as big as the pph file? The aggregation should have made it much smaller, shouldn't it?
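
The 11M apparent size can be reproduced with a short calculation. This is a sketch under stated assumptions: each header is taken to be preceded and followed by one 4-byte block control word, and du is taken to report sizes in units of 1024*1024 bytes, rounded up. Neither detail is stated in the comment above, and the names are illustrative.

```python
import math

HEADER_BYTES = 256        # pp header size quoted above
CONTROL_WORD_BYTES = 4    # assumed: one 4-byte control word before and after each header
N_HEADERS = 42_000        # number of pp headers quoted above


def pph_apparent_bytes(n_headers: int) -> int:
    """Apparent size of a file holding n_headers headers plus control words."""
    return n_headers * (HEADER_BYTES + 2 * CONTROL_WORD_BYTES)


total = pph_apparent_bytes(N_HEADERS)
mib = total / 1024**2

print(f"{total} bytes = {mib:.2f} MiB, shown as {math.ceil(mib)}M if rounded up")
```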

Best wishes

Jonathan

@JonathanGregory
Author

JonathanGregory commented Mar 13, 2024

Some more information. I can ncdump the CFA4 file which is being generated, and I find it has so far produced 2800 fields. A file of 25M would therefore contain 37,000 fields, which is quite similar to 42,000. This seems to suggest that it's not aggregating at all, and producing one output CF field for each input pp field. There are 210 2D pp fields in each pp file, and 68 distinct stashcodes, so I think that after aggregation we should have 68 CF fields.
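
The 37,000 estimate appears to be a linear extrapolation from the partially written file. The sketch below reproduces it, assuming the partial file was still about 1.9 Mbyte (the size quoted in the previous comment, not restated here) and that the file size grows in proportion to the number of fields; the function name is illustrative.

```python
def extrapolate_field_count(fields_so_far: int,
                            size_so_far_mb: float,
                            final_size_mb: float) -> float:
    """Scale the field count linearly with file size."""
    return fields_so_far * final_size_mb / size_so_far_mb


# 2800 fields so far; 1.9 Mbyte written is an assumption; 25 Mbyte final size.
estimate = extrapolate_field_count(2800, 1.9, 25)

print(f"Estimated fields in the finished file: {estimate:.0f}")
```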

... We've just discussed this. Your experiment shows that it's not aggregating the specific humidity fields. That would explain why there are so many output fields. It does aggregate all the others, you say, yet it still takes 30 minutes to write the 67 (I suppose) aggregated fields to the CFA file, without data.

@JonathanGregory
Author

cfa has finished! It was only about 2 days, not 3 weeks, probably because most of the fields were aggregated, as you found yesterday. In the end there are 4072 fields in the file, which can be explained as 4000 for non-aggregated specific humidity, and 72 for the aggregated quantities. The file is 4.7 Mbyte actual disk space, 26.7 Mbyte apparent disk space, probably the same as yours. As we discussed yesterday, it's another question why the file took 1.5 h to write on your laptop, but 2 days on RACC, which is not generally slow for writing netCDF. But perhaps we will understand this soon.
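
The difference between actual and apparent disk space can also be checked from Python. This is a minimal sketch, assuming a Linux or similar Unix system where os.stat reports st_blocks in 512-byte units; the helper name disk_usage is illustrative, and the demonstration uses a temporary sparse file rather than the real CFA file.

```python
import os
import tempfile


def disk_usage(path: str) -> tuple[int, int]:
    """Return (apparent_bytes, allocated_bytes) for a file."""
    st = os.stat(path)
    # st_size is the apparent size; st_blocks counts allocated 512-byte
    # units (an assumption that holds on Linux and most Unix systems).
    return st.st_size, st.st_blocks * 512


# Demonstration: a 10 MiB file created by truncation, with no data written.
with tempfile.NamedTemporaryFile() as tmp:
    tmp.truncate(10 * 1024 * 1024)
    tmp.flush()
    apparent, allocated = disk_usage(tmp.name)

print(f"apparent: {apparent} bytes, allocated: {allocated} bytes")
```

On a file system that supports sparse files or compression, the allocated figure may be much smaller than the apparent one, which is the same effect as the gap between du -sh and du -sh --apparent-size.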

@davidhassell added the "question" (General question) label and removed the "bug" (Something isn't working) label on Mar 14, 2024
@davidhassell
Collaborator

Part of this is addressed by #737 (ensuring we write 71 fields as intended, rather than 4069!), but that is not the whole story. Tests are ongoing, and I'll write up the answer soon.
