Faster ISD #113

Open
zmoon opened this issue May 6, 2023 · 1 comment

Comments

@zmoon
Member

zmoon commented May 6, 2023

Currently, loading ISD data is pretty slow. Youhua reported that it took somewhat over an hour to get all US sites for a period with one unique year.

Andrew Lambert of NRL said that with his code he was able to do something similar (vis data only though?) loading from Amazon S3 in 3 minutes or so. So there is a lot of room for improvement.


I did some initial benchmarking on my desktop, picking a site-year with a larger file size: site 01010099999, year 2020, 748K. I loaded it with pd.read_fwf [1], specifying widths and dtypes (and also using header=None, compression="gzip"). 3 options for this fixed-width (FW) file:

☝️ So there is a factor of 2 speed gain in this case by using the S3 URL instead. But note not all the processing that monetio does is included.
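
For reference, here is a minimal sketch of the kind of pd.read_fwf call described above, run on a small synthetic gzipped file so it works offline. The field names and widths are assumptions (my reading of the first 41 characters of the ISD control section), the sample values are made up, and this is not the code that was benchmarked.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd

# Assumed layout of the first 41 characters of an ISD record (control section).
# Names are hypothetical; widths should be checked against the ISD format docs.
FIELDS = [
    ("n_var_chars", 4),
    ("usaf", 6),
    ("wban", 5),
    ("date", 8),
    ("time", 4),
    ("source_flag", 1),
    ("lat", 6),
    ("lon", 7),
]
NAMES = [name for name, _ in FIELDS]
WIDTHS = [width for _, width in FIELDS]


def read_isd_control(path):
    """Read the leading control fields of a gzipped fixed-width file."""
    df = pd.read_fwf(
        path,
        widths=WIDTHS,
        names=NAMES,
        header=None,
        dtype=str,  # keep IDs like "010100" as strings so leading zeros survive
        compression="gzip",
    )
    # Combine date and time, and apply the assumed scale factor of 1000 to lat/lon
    df["time"] = pd.to_datetime(df["date"] + df["time"], format="%Y%m%d%H%M")
    df["lat"] = pd.to_numeric(df["lat"]) / 1000
    df["lon"] = pd.to_numeric(df["lon"]) / 1000
    return df.drop(columns="date")


# Synthetic records (made-up values, not real observations)
records = [
    ("0185", "010100", "99999", "20200101", "0000", "4", "+69293", "+016144"),
    ("0185", "010100", "99999", "20200101", "0100", "4", "+69293", "+016144"),
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.gz"
    with gzip.open(path, "wt") as f:
        for rec in records:
            f.write("".join(rec) + "\n")
    df = read_isd_control(path)

print(df)
```

Characters beyond the listed widths are ignored by pd.read_fwf, so the same call could be extended field by field rather than needing the full record layout up front.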

Loading this site-year similarly

ish.add_data(pd.date_range("2020/01/01", "2021/01/01", freq="D")[:-1], site="01010099999", resample=False)

(though more processing done, e.g. 99999 -> NaN) with monetio currently takes ~ 15 s. Probably by leveraging pd.read_fwf etc. like above instead of the current method, we can speed the processing up.
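
The "99999 -> NaN" step can be done in a vectorized way after loading. A rough sketch, assuming hypothetical column names and per-column sentinel values (these are placeholders for illustration, not the names or codes monetio uses):

```python
import pandas as pd

# Hypothetical column names and missing-value sentinels, for illustration only
MISSING = {"ceiling_m": 99999, "wdir_deg": 999, "temp_c_x10": 9999}


def mask_missing(df, missing=MISSING):
    """Return a copy of df with each column's sentinel value replaced by NaN."""
    out = df.copy()
    for col, sentinel in missing.items():
        if col in out:
            # Cast to float so NaN can be stored, keep values that are not the sentinel
            out[col] = out[col].astype(float).where(out[col] != sentinel)
    return out


raw = pd.DataFrame(
    {
        "ceiling_m": [22000, 99999, 300],
        "wdir_deg": [180, 999, 90],
        "temp_c_x10": [-52, 9999, 13],
    }
)
clean = mask_missing(raw)
print(clean)
```

Keeping the sentinels per column matters because a value like 999 can be a legitimate reading in one field and the missing code in another.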

Also note that loading the CSV (s3://noaa-global-hourly-pds/2020/01010099999.csv) instead of the FW text seems not that much slower. Will have to compare with all the processing included etc.
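
To make the bucket layout concrete, here is a small sketch that builds the per-year CSV locations for a site, following the `<year>/<site>.csv` pattern in the URL above. The helper names are hypothetical, and the HTTPS form is an assumption based on the usual virtual-hosted-style S3 URL, so check it before relying on it.

```python
from datetime import date

BUCKET = "noaa-global-hourly-pds"


def unique_years(start, end):
    """Years touched by the inclusive date range [start, end]."""
    return list(range(start.year, end.year + 1))


def csv_urls(site, start, end, scheme="s3"):
    """Build one CSV URL per unique year for a site (hypothetical helper)."""
    if scheme == "s3":
        template = "s3://{bucket}/{year}/{site}.csv"
    elif scheme == "https":
        # Assumed virtual-hosted-style S3 URL, not confirmed in this issue
        template = "https://{bucket}.s3.amazonaws.com/{year}/{site}.csv"
    else:
        raise ValueError(f"unsupported scheme: {scheme!r}")
    return [
        template.format(bucket=BUCKET, year=year, site=site)
        for year in unique_years(start, end)
    ]


urls = csv_urls("01010099999", date(2020, 1, 1), date(2020, 12, 31))
print(urls)
```

Reading an `s3://` URL directly with pd.read_csv needs s3fs installed, and for a public bucket typically `storage_options={"anon": True}`.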

Also note that https://www.ncei.noaa.gov/data/global-hourly/archive/ has compiled files (all sites, presumably) by year. These are just .tar.gzs of all the CSV or FW files though.
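
If the yearly archives turn out to be useful, the member files can be read straight out of the .tar.gz without unpacking to disk. A sketch using only the standard library; the member naming (`<site>.csv`) is an assumption, and the demo archive below is built in memory rather than downloaded:

```python
import io
import tarfile


def iter_members(fileobj, suffix=".csv"):
    """Yield (name, raw bytes) for each regular file in a .tar.gz ending with suffix."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(suffix):
                yield member.name, tar.extractfile(member).read()


def make_demo_archive(files):
    """Build a small in-memory .tar.gz from a {name: bytes} mapping (demo only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


archive = make_demo_archive(
    {"01010099999.csv": b"STATION,DATE\n", "README.txt": b"not a data file\n"}
)
found = dict(iter_members(archive))
print(list(found))
```

Each yielded bytes object could be wrapped in io.BytesIO and passed to pd.read_csv, which would avoid one HTTP request per site.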


cc: @ytangnoaa @bbakernoaa

Footnotes

  1. pd.read_fwf args are not all documented in its docstring, but they are mostly the same as pd.read_csv, with the differences noted here

@zmoon zmoon added the enhancement New feature or request label May 6, 2023
@zmoon zmoon self-assigned this May 6, 2023
@zmoon zmoon added this to the v0.3 milestone May 10, 2023
This was referenced Jun 5, 2023
@zmoon
Member Author

zmoon commented Jan 18, 2024

Remember, ISH-Lite is in the bucket as well
