Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Promote Python datetime and timedelta objects to relevant types in broadcasting #3109

Open
jpivarski opened this issue May 7, 2024 · 2 comments
Labels
bug The problem described is something that must be fixed

Comments

@jpivarski
Copy link
Member

Description of new feature

Problem

Suppose you have an array of datetimes:

>>> array = ak.Array(
...     np.array([datetime.fromisoformat(f"20{x:02d}-01-01") for x in range(20, 25)], "datetime64[us]")
... )
>>> array.show(type=True)
type: 5 * datetime64[us]
[2020-01-01T00:00:00.000000,
 2021-01-01T00:00:00.000000,
 2022-01-01T00:00:00.000000,
 2023-01-01T00:00:00.000000,
 2024-01-01T00:00:00.000000]

You can broadcast it with other NumPy datetimes:

>>> array - np.datetime64(datetime.now())
<Array [-137263721257166 microseconds, ...] type='5 * timedelta64[us]'>

but you can't broadcast it with Python datetimes:

>>> array - datetime.now()
Traceback (most recent call last):
...
  File "/home/jpivarski/irishep/awkward/src/awkward/_nplikes/array_module.py", line 220, in _get_nep_50_dtype
    assert isinstance(obj, (int, complex, float))
AssertionError

Now, NumPy itself doesn't do this:

>>> np_array - datetime.now()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
numpy.core._exceptions._UFuncBinaryResolutionError: ufunc 'subtract' cannot use operands with types dtype('<M8[us]') and dtype('O')

and we don't want to get into the many time formats of Arrow, Pandas, and the rest, but since datetime64 and timedelta64 are among Awkward's primitive types, you'd think (users are likely to think) that the corresponding Python objects should be promoted.

Solution

The traceback ended here:

def _get_nep_50_dtype(
self, obj: Any
) -> DType | type[int] | type[complex] | type[float]:
if hasattr(obj, "dtype"):
return obj.dtype
elif isinstance(obj, bool):
return np.dtype(np.bool_)
else:
assert isinstance(obj, (int, complex, float))
return type(obj)

(a helper function that assumes all of the primitive types are booleans, integers, and (real & complex) floating point numbers, leaving out datetimes and timedeltas). But that's not the only issue. In order to do this right and promote Pythonic data containing temporal data,

  • ArrayBuilder must be aware of temporal data (it is),
  • from_iter has to recognize temporal data (it does),
  • and that all has to go through to_layout, which it does:
>>> ak.to_layout([datetime.now()])
<NumpyArray dtype='datetime64[us]' len='1'>['2024-05-07T16:58:35.871271']</NumpyArray>

Okay, I had thought that there would be a lot of work to do, but most of this looks like it's already implemented. Maybe it's just one naive helper function, and generalizing that one function could completely fix this. If that's true, then this isn't a big project, but a small project. Also, this wouldn't be a feature but a bug-fix. I'll change the label now.

Cc: @alexander-held, who brought this up on Slack.

@jpivarski jpivarski added the bug The problem described is something that must be fixed label May 7, 2024
@jpivarski jpivarski added this to P4 in Finalization May 7, 2024
@agoose77
Copy link
Collaborator

agoose77 commented May 8, 2024

The ufunc logic first passes through ak._connect.numpy's array_ufunc function. There, we have a cast function that you can fiddle with:

>>> from datetime import datetime

>>> ak.behavior["__cast__", datetime] = lambda x: ak.to_layout(x)
>>> array - datetime.now()
<Array [-137324468250000 microseconds, ...] type='5 * timedelta64[us
]'>

It occurs to me that we might want __cast__ to apply to more than just ufuncs; perhaps to_layout should look up cast functions.

But, I digress. Casting is a "solution" here, but the question really remains about expectations. My instinct is that this should require coercion because datetime and np.datetime64 are not exactly the same (the latter has no timezone information for one thing).

@jpivarski
Copy link
Member Author

By "casting," I meant a reinterpret-cast. And yeah, no timezones in NumPy. One of the things that make Python datetimes != NumPy datetimes != Arrow datetimes != Pandas datetimes.

Some timezone would have to be assumed, and it shouldn't be locale-dependent, otherwise code that works on a computer in the U.S. wouldn't work in Europe. (Or wouldn't work on the same laptop, after a transatlantic flight!)

Right now, what the code does is it takes the datetime object, constructs a `datetime(1970, 1, 1, 0, 0, 0), and subtracts the two to get microseconds since epoch, which it then feeds to ArrayBuilder:

else if (py::isinstance(obj, py::module::import("datetime").attr("datetime"))) {
auto datetime = py::module::import("datetime");
auto time_since_epoch = obj - datetime.attr("datetime")(1970, 1, 1, 0, 0, 0);
auto resolution_microseconds = datetime.attr("timedelta")("microseconds"_a=1);
auto time_since_epoch_us = time_since_epoch.attr("__floordiv__")(resolution_microseconds).cast<int64_t>();
self.datetime(time_since_epoch_us, "datetime64[us]");
}

So if the datetime actually has a timezone,

>>> from datetime import datetime
>>> from pytz import timezone
>>> obj = datetime(2011, 8, 15, 8, 15, 12, 0, tzinfo=timezone("US/Central"))

the above code would try to subtract a timezone-naive datetime from a timezone-aware datetime:

>>> obj - datetime(1970, 1, 1, 0, 0, 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't subtract offset-naive and offset-aware datetimes

which is a Python error, uncaught in pybind11 code, and so I think that would cause a segfault.


Nope! It's fine—no segfault:

>>> ak.from_iter([{"a": [{"b": [{"c": [obj, obj, obj]}]}]}])
Traceback (most recent call last):
  File "/home/jpivarski/irishep/awkward/src/awkward/_dispatch.py", line 39, in dispatch
    gen_or_result = func(*args, **kwargs)
  File "/home/jpivarski/irishep/awkward/src/awkward/operations/ak_from_iter.py", line 70, in from_iter
    return _impl(iterable, highlevel, behavior, allow_record, initial, resize, attrs)
  File "/home/jpivarski/irishep/awkward/src/awkward/operations/ak_from_iter.py", line 100, in _impl
    builder.fromiter(iterable)
TypeError: can't subtract offset-naive and offset-aware datetimes

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/awkward/src/awkward/_dispatch.py", line 38, in dispatch
    with OperationErrorContext(name, args, kwargs):
  File "/home/jpivarski/irishep/awkward/src/awkward/_errors.py", line 85, in __exit__
    self.handle_exception(exception_type, exception_value)
  File "/home/jpivarski/irishep/awkward/src/awkward/_errors.py", line 95, in handle_exception
    raise self.decorate_exception(cls, exception)
TypeError: can't subtract offset-naive and offset-aware datetimes

This error occurred while calling

    ak.from_iter(
        [{'a': [{'b': [{'c': [datetime.datetime(2011, 8, 15, 8, 15, 12, tzinf...
    )

I don't see where the Python exception catching happens and gets propagated correctly, but it gets handled somewhere.


I think this isn't a bad error to get. NumPy datetimes, our internal format, are timezone-naive. If you try to use timezone-aware datetimes, this error is the best thing that can happen. Anything else, such as assuming a locale or assuming UTC, would be a subtle error. We can just say that we don't support timezone-aware datetimes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug The problem described is something that must be fixed
Projects
Development

No branches or pull requests

2 participants