
Create "Optimising memory use" notebook in Frequently_used_code #879

Open
robbibt opened this issue Sep 28, 2021 · 2 comments

Comments

@robbibt
Collaborator

robbibt commented Sep 28, 2021

We should create a Frequently_used_code notebook that documents some useful techniques for optimising memory use when analysing DEA data.

@Kirill888 has lots of useful tools for doing that here: https://github.com/opendatacube/odc-tools/blob/develop/libs/algo/odc/algo/_masking.py

E.g. you can use something like fmask_to_bool to produce a boolean mask from fmask flags:
https://github.com/opendatacube/odc-tools/blob/develop/libs/algo/odc/algo/_masking.py#L517

Then pass that to erase_bad to set those "bad" values to the data's nodata value (still in the original data type):
https://github.com/opendatacube/odc-tools/blob/develop/libs/algo/odc/algo/_masking.py#L97

Then finally convert it to floats at the end using to_float (this is the first time the nodata values are set to NaN):
https://github.com/opendatacube/odc-tools/blob/develop/libs/algo/odc/algo/_masking.py#L204

The idea behind those funcs is to keep things as dask arrays and int datatypes until the last possible moment, so that memory use stays under control. I'm not entirely sure, though, whether there are options there for computing things like means/medians etc. on the data in its original data type (taking the custom nodata values into account), but this would also be good to include as these are very common workflows.
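To make the three steps concrete, here's a minimal numpy-only sketch of the same pattern (this is *not* the odc.algo implementation; the band values, flag categories and nodata sentinel are made up for illustration):

```python
import numpy as np

# Hypothetical uint16 band with a custom nodata value of 0, plus an
# fmask-style categorical QA band (say 2 = cloud, 3 = shadow).
NODATA = 0
data = np.array([[120, 340], [0, 560]], dtype="uint16")
fmask = np.array([[1, 2], [1, 3]], dtype="uint8")

# Step 1: build a boolean "bad pixel" mask from the categorical flags
# (the fmask_to_bool step).
bad = np.isin(fmask, [2, 3])

# Step 2: set bad pixels to nodata while STAYING in uint16
# (the erase_bad step) -- no float copy of the data is made yet.
erased = np.where(bad, np.uint16(NODATA), data)

# Step 3: only at the very end, convert to float32 and map nodata to NaN
# (the to_float step). This is the first time memory per pixel grows.
as_float = erased.astype("float32")
as_float[erased == NODATA] = np.nan
```

In a real workflow `data` would be a dask-backed xarray.DataArray, so each step stays lazy as well as integer-typed until the final conversion.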

@robbibt
Collaborator Author

robbibt commented Sep 28, 2021

From Kirill:

It's a bit messy (some internals are exposed that should not be, and docs quality is not uniform). Probably the best place to get an overview of what's available is here:
https://github.com/opendatacube/odc-tools/blob/develop/libs/algo/odc/algo/__init__.py#L19
There are things like

  • enum_to_bool
  • to_float / from_float
  • apply_numexpr
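As a rough illustration of what the to_float / from_float pair does (a numpy sketch only, with a made-up nodata sentinel, not the odc.algo code):

```python
import numpy as np

NODATA = -999  # hypothetical custom nodata sentinel

def to_float(x, nodata, dtype="float32"):
    # Convert integer data to float, replacing nodata with NaN.
    out = x.astype(dtype)
    out[x == nodata] = np.nan
    return out

def from_float(x, nodata, dtype="int16"):
    # Inverse: replace NaN with the nodata sentinel and cast back.
    return np.where(np.isnan(x), nodata, x).astype(dtype)

ints = np.array([100, -999, 250], dtype="int16")
floats = to_float(ints, NODATA)
round_trip = from_float(floats, NODATA)  # recovers the original ints
```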

There are no nodata-aware reduction functions. Maybe these are supported by masked arrays in numpy? But really the bigger problem is not so much the representation and handling of missing values; the bigger problem is that integer math can be hard to reason about and implement correctly (without silent overflows). So I prefer to convert to float, then use the nan{mean,sum,...} family of functions, followed by conversion back to integer.
to_float is also useful for plotting, as NaNs are automatically transparent whereas nodata values are not.
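The overflow hazard and the suggested float-then-reduce workflow can be sketched like this (plain numpy, made-up values):

```python
import numpy as np

NODATA = 0
data = np.array([[40000, 50000], [0, 60000]], dtype="uint16")

# The silent-overflow hazard: elementwise uint16 addition wraps around,
# so 50000 + 60000 gives 44464 instead of 110000.
wrapped = data[0] + data[1]

# The suggested workflow instead: convert to float with NaN for nodata,
# reduce with the nan-aware functions, then cast back to integer.
f = data.astype("float32")
f[data == NODATA] = np.nan
mean = np.nanmean(f, axis=0)  # nodata pixels are simply ignored
back = np.nan_to_num(mean, nan=NODATA).astype("uint16")
```

Here the nodata pixel in the first column is excluded from the mean rather than dragging it towards zero, which is exactly what a nodata-aware integer reduction would have to do by hand.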

@cbur24
Collaborator

cbur24 commented Oct 7, 2021

A step-by-step example of using some of these tools is available in the Cloud and Pixel Quality Masking notebook in deafrica-sandbox-notebooks, including the mask_cleanup function, which is pretty handy. It doesn't explicitly reference memory optimisation, but it might provide some boilerplate code for starting this notebook.
