Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Return true minimum dtype when np.inf/nan present & handle empty arrays #2724

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

groutr
Copy link
Contributor

@groutr groutr commented Jan 23, 2023

When get_minimum_dtype was called with np.inf or np.nan, it would return float64. This PR fixes it to only do the range check on the finite values.

Additionally, I return the smallest dtype when the containers are empty, though, it could be argued to let the exceptions bubble up. Another approach would be to return None (ie GDT_Unknown). What would be preferred approach? My opinion is that either the smallest dtype (bool) or None should be returned.

ping @sgillies

Split off of #2722

@snowman2
Copy link
Member

Thanks for these changes. I think in the scenario when you don't have any data to go off of, it is safer to pick the largest dtype. I think that is why numpy uses float64 for an empty array.

>>> import numpy
>>> arr = numpy.array([])
>>> arr
array([], dtype=float64)

For that reason, I suggest we follow the logic numpy is using and instead default to float64 for empty arrays.

@snowman2
Copy link
Member

After thinking about it further, I think what you have makes sense for the "minimum" dtype. So, ignore my previous comment.

@snowman2 snowman2 changed the title Return true minimum dtype when np.inf/nan present BUG: Return true minimum dtype when np.inf/nan present & handle empty arrays Jan 25, 2023
@snowman2 snowman2 added the bug label Jan 25, 2023
Copy link
Member

@snowman2 snowman2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @groutr 👍

@snowman2 snowman2 added this to the 1.4.0 milestone Jan 25, 2023
@groutr
Copy link
Contributor Author

groutr commented Jan 25, 2023

@snowman2 After looking over this PR some more, I realized there is some other subtle issues with get_minimum_dtype.

  1. Bool should be return if range [0, 1] (was returning uint8)
  2. Never returns complex dtype.

I restructured get_minimum_dtype to handle both of those issues and be more comprehensible.

After thinking it over more, None might be a more consistent return (or maybe even raise a ValueError?) on an empty container. There are two ways of thought, and I'm not sure which is the preferred way: 1) No values mean an unknown dtype (GDT_Unknown) and therefore return None/raise an error or 2) No values mean any dtype is possible and therefore return the smallest/minimum dtype.
If get_minimum_dtype returned None or raised an error, it means every call would need to be checked and calling code would look like

# returning None
dtype = get_minimum_dtype([])
if dtype:
    np.can_cast(dtype, 'float32')
else:
    # no values, or some unrecognized dtype (not int/float/complex)

# raise an error
try:
    dtype = get_minimum_dtype([])
    np.can_cast(dtype, 'float32')
except ValueError:
    print("No values???")
else:
    print("Unknown dtype (not int/float/complex)

I could see the constant checks after calling get_minimum_dtype could be annoying.

@snowman2
Copy link
Member

Bool should be return if range [0, 1] (was returning uint8)

See #2657

Boolean is not a GDAL type in the mapping:

dtype_fwd = {
0: None, # GDT_Unknown
1: ubyte, # GDT_Byte
2: uint16, # GDT_UInt16
3: int16, # GDT_Int16
4: uint32, # GDT_UInt32
5: int32, # GDT_Int32
6: float32, # GDT_Float32
7: float64, # GDT_Float64
8: complex_int16, # GDT_CInt16
9: complex64, # GDT_CInt32
10: complex64, # GDT_CFloat32
11: complex128, # GDT_CFloat64
}

In rasterize, it doesn't have the boolean dtype:

valid_dtypes = (
'int16', 'int32', 'uint8', 'uint16', 'uint32', 'float32', 'float64'
)

Boolean appears to only be used for masks. From what I can tell, uint8 appears to be correct.

@snowman2
Copy link
Member

@groutr, I think you have raised some valid points. I think the inf/nan changes are good to go. However, the empty array logic appears to need more thought/discussion. Mind moving the empty array changes/discussion to a separate PR?

@snowman2
Copy link
Member

Never returns complex dtype.

Mind moving these changes into a separate PR as well? I thin this topic is "complex" enough to be on its own 😄

@groutr
Copy link
Contributor Author

groutr commented Jan 26, 2023

@groutr, I think you have raised some valid points. I think the inf/nan changes are good to go. However, the empty array logic appears to need more thought/discussion. Mind moving the empty array changes/discussion to a separate PR?

Let's get #2734 merged first. Having several different open PRs against the same function is really confusing for me. With that PR merged, it would remove the non-array path and this PR could be simplified.

Regarding the bool return type, I was going off the docstring for get_minimum_dtype where it says the return is a "rasterio dtype string". What that is exactly isn't clear, but I took it to mean the these:

bool_ = 'bool'
ubyte = uint8 = 'uint8'
sbyte = int8 = 'int8'
uint16 = 'uint16'
int16 = 'int16'
uint32 = 'uint32'
int32 = 'int32'
uint64 = 'uint64'
int64 = 'int64'
float32 = 'float32'
float64 = 'float64'
complex_ = 'complex'
complex64 = 'complex64'
complex128 = 'complex128'
complex_int16 = "complex_int16"

@snowman2
Copy link
Member

snowman2 commented Feb 6, 2023

What that is exactly isn't clear, but I took it to mean the these

Your logic makes sense. However, since bool doesn't map to a GDAL dtype, this will cause troubles in the rasterize function it is is used in as it is looking for the rasterio dtype with a GDAL dtype equivalent.

To help make the function clearer, I am thinking that it may be a good idea to update the docstring.

@groutr
Copy link
Contributor Author

groutr commented May 2, 2023

@sgillies There are several questions in this PR that need guidance/clarification.

  1. What is the proper return for empty values? I feel None is the appropriate return value. Bool is another possible return or we could raise an error.
  2. Should get_minimum_dtype consider complex values? What about bool values?
  3. What exactly is a rasterio dtype string? Are they numpy dtypes or GDAL types? Should we consider the mapping idea proposed here: rasterio.dtypes improvements #2763 (comment)?

@sgillies sgillies removed this from the 1.4.0 milestone Dec 21, 2023
@sgillies
Copy link
Member

sgillies commented Jan 5, 2024

@groutr @snowman2 how about this?

  1. get_minimum_dtype() should be calculated using np.nanmin() and np.nanmax().
  2. operation on an empty array should give a ValueError like numpy does: ("ValueError: zero-size array to reduction operation minimum which has no identity"). But maybe a little more clear 😄

Removing implicit use of this function from rasterize() makes its ergonomics more important, yes? We're kind of asking users to call it in their own code instead. What would you think about changing the function def to take multiple values so that a developer can plug in their dataset's min and max values without creating a list? Behavior like Python's min() or max(). For example get_minimum_dtype(-9999, 15).

@sgillies sgillies added this to the 1.4.0 milestone Jan 5, 2024
@sgillies sgillies removed this from the 1.4.0 milestone Apr 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants