Wishlist #18

Open
1 of 8 tasks
abarciauskas-bgse opened this issue Aug 11, 2023 · 3 comments

Comments

@abarciauskas-bgse
Contributor

abarciauskas-bgse commented Aug 11, 2023

  • Add repack preprocessing times to test_workflow.ipynb
  • Add flatgeobuf preprocessing times to test_workflow.ipynb
  • Track fix to sliderule and re-run processing times notebook when sliderule is working
  • Complete creating tests for the ArrMean and SubsetMean for remaining data formats
  • Scale up the test runs and report statistically significant results. Investigate whether S3 caching affects the timings.
  • Provide instructions on how to update the dataset list (say we want to run the test for different ATL03 granules or even need to produce the test data in a different bucket)
  • Integrate h5coro as a backend to xarray and include that option in the tests
  • Investigate why repacking makes results slower than the original h5
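
One possible way to approach the "statistically significant" task above is a percentile bootstrap on the difference in median run time between two formats. The sketch below is only an illustration under stated assumptions: the function name `bootstrap_median_diff` and its parameters are hypothetical, it uses only the Python standard library, and it assumes each input is a list of repeated wall-clock timings (in seconds) for the same test.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_median_diff(
    a: Sequence[float],
    b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-bootstrap interval for median(a) - median(b).

    `a` and `b` are repeated timings (seconds) for two variants.
    Hypothetical helper, not part of any benchmark in this repository.
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each set of timings with replacement.
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(statistics.median(resampled_a) - statistics.median(resampled_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

If the returned interval excludes zero, the two sets of timings probably differ by more than run-to-run noise. With only a handful of runs per format the interval is coarse, so this is better suited to the scaled-up runs than to a single pass of the notebook.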
@weiji14 weiji14 pinned this issue Aug 13, 2023
@weiji14
Member

weiji14 commented Aug 15, 2023

  • Integrate h5coro as a backend to xarray and include that option in the tests

Besides h5coro (C++ based), I'd also like to add hidefix (Rust-based) into the benchmark comparison. It looks like there's a backend engine already at https://github.com/gauteh/hidefix/blob/v0.6.3/python/hidefix/xarray.py#L40, related issue at pydata/xarray#7446. Will be easier to test once the conda-forge package is ready.

@asteiker
Member

asteiker commented Nov 2, 2023

@andypbarrett, @betolink, and @asteiker are reviewing this issue to determine what else is needed to provide robust recommendations for ATL03 reformatting. Should we break out the relevant tasks into separate issues? Other notes:

@jpswinski
Contributor

In SlideRule performance testing I've seen a 25% performance difference between the first read of an object in S3 and subsequent reads of the same object. I've also seen a 10% to 20% performance swing depending on the time of day we hit S3. So, for our performance testing, we try to run the tests at the same time of day (early morning seems to be more stable), and we rerun them until the timing numbers come down and stabilize. That doesn't capture the performance most people experience (they typically hit our system from midday to late afternoon, and SlideRule usually reads the data after the S3 caches have cooled), but it does let us compare apples to apples when we change the code and want to know how the change affected performance.
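
The repeat-until-stable approach described above could be sketched roughly as follows. This is a minimal illustration, not SlideRule's actual test harness: `time_reads` is a hypothetical helper, and `read` stands in for whatever zero-argument callable opens and reads the S3 object under test. It reports the first call separately from the later ones, on the assumption that the first call is the closest thing to a cold read.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_reads(read: Callable[[], object], repeats: int = 5) -> Dict[str, float]:
    """Call `read` `repeats` times and separate the first call from the rest.

    Hypothetical helper: `read` is any zero-argument callable that performs
    the read being benchmarked (for example, opening one granule from S3).
    """
    if repeats < 2:
        raise ValueError("need at least two repeats to separate cold and warm reads")
    durations: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        read()
        durations.append(time.perf_counter() - start)
    cold, warm = durations[0], durations[1:]
    return {
        "cold_s": cold,  # first call; only truly cold if the object was not read recently
        "warm_median_s": statistics.median(warm),
        "warm_min_s": min(warm),
        "warm_max_s": max(warm),
    }
```

Comparing `cold_s` against `warm_median_s` gives a rough view of the caching effect, and a narrow gap between `warm_min_s` and `warm_max_s` suggests the timings have stabilized. Recording the time of day alongside each result would help account for the daily swing mentioned above.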

@abarciauskas-bgse abarciauskas-bgse unpinned this issue Apr 28, 2024