> It would be interesting to do a race between them, recognizing that the Arrow functions do not need to copy string data and the Numba functions do.

Arrow does need to copy if the Awkward strings are not contiguous, I think.

(In light testing, the Arrow string kernels are not much faster than a Python loop, which is maybe not surprising since the basic algorithms are probably the same. This is for operations that don't create lists/strings as output, so the upfront costs of making Python strings happen before benchmarking... Maybe the bulk of the time is in Unicode lookups, not any memory copies.)
That's true: an Awkward ListArray has to be copied when making list arrays in Arrow, but an Awkward ListOffsetArray does not. ListOffsetArrays are more common, though—any Awkward operation that needs to rewrite a list array would rewrite it in a way that makes it contiguous.
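To make the contiguity point concrete, here is a minimal stdlib-only sketch. It does not use Awkward's or Arrow's APIs: `is_contiguous` and `to_offsets` are hypothetical helper names, and plain `bytes` plus Python lists stand in for the `content`, `starts`/`stops`, and `offsets` buffers.

```python
def is_contiguous(starts, stops):
    """True if every list begins exactly where the previous one ended."""
    return all(stops[i] == starts[i + 1] for i in range(len(starts) - 1))


def to_offsets(content, starts, stops):
    """Copy the referenced slices into a fresh buffer, returning (offsets, content)."""
    offsets = [0]
    chunks = []
    for start, stop in zip(starts, stops):
        chunks.append(content[start:stop])
        offsets.append(offsets[-1] + (stop - start))
    return offsets, b"".join(chunks)


content = b"onetwothree"

# ListOffsetArray-like layout: one offsets buffer, so starts/stops are adjacent views of it.
offsets = [0, 3, 6, 11]
contiguous = is_contiguous(offsets[:-1], offsets[1:])

# ListArray-like layout: independent starts/stops that reorder and skip data.
starts, stops = [6, 0], [11, 3]
needs_copy = not is_contiguous(starts, stops)
new_offsets, new_content = to_offsets(content, starts, stops)
```

In the second layout the strings `three` and `one` are not adjacent in `content`, so an offsets-based consumer can only receive them after the copy in `to_offsets`.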
Version of Awkward Array
HEAD
Description and code to reproduce
When iterating over an Awkward Array of strings in Numba, we present them as Numba's internal lowered string objects, but we do it by creating Python strings (new `PyObject*` object, which has to be DECREFed and GIL-protected).

awkward/src/awkward/_connect/numba/layout.py
Lines 74 to 89 in 461b990
The above implementation uses `pyapi.to_native_value` to avoid having to even know what Numba's lowered string type is, but Numba does have a lowered string type.

I think it's the `nb.types.UnicodeType`, which has a concrete instantiation as `nb.types.unicode_type`. For the equivalent of a Python `bytes`, there's `nb.types.Bytes`, but I don't know if that one has a concrete instantiation.

This seems to be the way to make a `nb.types.Bytes` object, in a lowered context:

https://github.com/numba/numba/blob/f4c4afcb180193064a0c985246c7c71d007c6b1d/numba/cpython/charseq.py#L222-L237
There's a `_make_constant_bytes` helper that would show how to do that. But also notice that it's extracting data from the `UnicodeType` by `create_struct_proxy`, so maybe `UnicodeType` is just a `StructModel` with fields `data` and `kind`.

Aha, here's the full `StructModel`:

https://github.com/numba/numba/blob/f4c4afcb180193064a0c985246c7c71d007c6b1d/numba/cpython/unicode.py#L77-L90
The `pyapi.to_native_value` that I was using is probably `numba.cpython.unicode.unbox_unicode_str`, which gives a hint about how to set all of those fields (maybe need to follow `numba.core.pythonapi.string_as_string_size_and_kind` for the definitions).

Need to get the reference-counts right, and they're not Python reference counts (NRT).
Probably still need to allocate memory for the Numba unicode object and copy the data from the Awkward Array into it, rather than anything zero-copy. Numba will try to delete that memory when the unicode object goes out of scope.
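One reason a copy looks hard to avoid: Awkward stores string content as UTF-8, while Numba's unicode data (as far as I can tell from the unboxing code) follows CPython's fixed-width layout with a `kind` of 1, 2, or 4 bytes per character. The sketch below is plain Python, not lowered Numba code. `numba_like_fields` is a hypothetical name, and the dictionary keys `data`, `length`, `kind`, and `is_ascii` are an assumption about which struct fields matter; the real struct carries more (NRT bookkeeping, for instance), which this ignores.

```python
import sys

# Fixed-width code units are stored in native byte order.
_SUFFIX = "le" if sys.byteorder == "little" else "be"


def numba_like_fields(content, start, stop):
    """Compute the per-string values a fixed-width unicode struct would need."""
    text = content[start:stop].decode("utf-8")
    max_cp = max(map(ord, text), default=0)
    if max_cp < 0x100:
        kind, codec = 1, "latin-1"
    elif max_cp < 0x10000:
        kind, codec = 2, "utf-16-" + _SUFFIX
    else:
        kind, codec = 4, "utf-32-" + _SUFFIX
    return {
        "data": text.encode(codec),  # the copied, transcoded buffer
        "length": len(text),         # number of characters, not bytes
        "kind": kind,                # bytes per character
        "is_ascii": max_cp < 0x80,
    }


strings = ["abc", "héllo", "日本", "😀"]
encoded = [s.encode("utf-8") for s in strings]
content = b"".join(encoded)
offsets = [0]
for chunk in encoded:
    offsets.append(offsets[-1] + len(chunk))

fields = [
    numba_like_fields(content, offsets[i], offsets[i + 1])
    for i in range(len(strings))
]
```

Only the pure-ASCII case has a `data` buffer byte-identical to the UTF-8 slice; every other string has to be transcoded, so a zero-copy view would at best cover ASCII-only strings.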
Numba's unicode type has all of the unicode functions, like the ones we imported from Arrow in `ak.str.*`. It would be interesting to do a race between them, recognizing that the Arrow functions do not need to copy string data and the Numba functions do.

cc @martindurant and @douglasdavis
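A possible shape for such a race, as a stdlib-only sketch: `race` is a hypothetical helper, and the two candidates are pure-Python stand-ins rather than an `ak.str.*` function or a Numba-compiled loop. Each candidate is called once before timing to check that the results agree, which would also keep JIT compilation out of the measurement for a compiled candidate.

```python
import timeit


def race(candidates, data, repeat=5, number=10):
    """Check that all candidates agree, then return best seconds per call for each."""
    results = {name: fn(data) for name, fn in candidates.items()}  # also a warm-up call
    expected = next(iter(results.values()))
    for name, result in results.items():
        if result != expected:
            raise ValueError(f"candidate {name!r} disagrees with the others")
    return {
        name: min(timeit.repeat(lambda fn=fn: fn(data), repeat=repeat, number=number)) / number
        for name, fn in candidates.items()
    }


def count_upper_loop(strings):
    """Stand-in for an explicit per-string loop."""
    total = 0
    for s in strings:
        if s.isupper():
            total += 1
    return total


def count_upper_map(strings):
    """Stand-in for a vectorized kernel call."""
    return sum(map(str.isupper, strings))


words = ["AWKWARD", "numba", "ARROW", "unicode"] * 1000
timings = race({"loop": count_upper_loop, "map": count_upper_map}, words)
```

Taking the minimum over repeats reduces noise from other processes; the absolute numbers depend on the machine, so only the ratio between candidates is meaningful.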