Stop making Numba strings through PyObject* #2704

Open
jpivarski opened this issue Sep 8, 2023 · 2 comments
Labels
performance Works, but not fast enough or uses too much memory

@jpivarski
Member

Version of Awkward Array

HEAD

Description and code to reproduce

When iterating over an Awkward Array of strings in Numba, we present them as Numba's internal lowered string objects, but we do it by creating Python strings (new PyObject* object, which has to be DECREFed and GIL-protected).

pyapi = context.get_python_api(builder)
gil = pyapi.gil_ensure()
strptr = builder.bitcast(rawptr_cast, pyapi.cstring)
if viewtype.type.parameters["__array__"] == "string":
    kind = context.get_constant(numba.types.int32, pyapi.py_unicode_1byte_kind)
    pystr = pyapi.string_from_kind_and_data(kind, strptr, strsize_cast)
else:
    pystr = pyapi.bytes_from_string_and_size(strptr, strsize_cast)
out = pyapi.to_native_value(rettype, pystr).value
pyapi.decref(pystr)
pyapi.gil_release(gil)

The above implementation uses pyapi.to_native_value to avoid having to even know what Numba's lowered string type is, but Numba does have a lowered string type.
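
To see the Python-object round trip outside of a lowering function, here is a minimal sketch using ctypes. It assumes that pyapi.string_from_kind_and_data and pyapi.bytes_from_string_and_size are thin wrappers around CPython's PyUnicode_FromKindAndData and PyBytes_FromStringAndSize (I believe they are, but check numba.core.pythonapi). The name make_pystr is hypothetical, and this only illustrates the object creation that the issue wants to avoid, not Awkward's actual code path.

```python
import ctypes

# CPython C-API constant: PyUnicode_1BYTE_KIND == 1
PY_UNICODE_1BYTE_KIND = 1

# PyObject* PyUnicode_FromKindAndData(int kind, const void* buffer, Py_ssize_t size)
_from_kind_and_data = ctypes.pythonapi.PyUnicode_FromKindAndData
_from_kind_and_data.restype = ctypes.py_object
_from_kind_and_data.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_ssize_t]

# PyObject* PyBytes_FromStringAndSize(const char* v, Py_ssize_t len)
_bytes_from_string_and_size = ctypes.pythonapi.PyBytes_FromStringAndSize
_bytes_from_string_and_size.restype = ctypes.py_object
_bytes_from_string_and_size.argtypes = [ctypes.c_char_p, ctypes.c_ssize_t]


def make_pystr(raw, is_string):
    """Hypothetical helper: build a new Python str or bytes from a raw buffer.

    Both branches allocate a brand-new Python object and copy the data into
    it; ctypes.pythonapi holds the GIL and manages the returned reference,
    which is what gil_ensure/decref/gil_release do by hand in lowered code.
    """
    if is_string:
        # With the 1-byte kind, each input byte becomes one code point.
        return _from_kind_and_data(PY_UNICODE_1BYTE_KIND, raw, len(raw))
    return _bytes_from_string_and_size(raw, len(raw))
```

Calling make_pystr(b"hello", True) gives a str and make_pystr(b"hello", False) gives a bytes; either way a PyObject is created just so that it can be unboxed again.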

I think it's the nb.types.UnicodeType, which has a concrete instantiation as nb.types.unicode_type. For the equivalent of a Python bytes, there's nb.types.Bytes, but I don't know if that one has a concrete instantiation.


This seems to be the way to make a nb.types.Bytes object, in a lowered context:

https://github.com/numba/numba/blob/f4c4afcb180193064a0c985246c7c71d007c6b1d/numba/cpython/charseq.py#L222-L237

There's a _make_constant_bytes helper that would show how to do that. But also notice that it's extracting data from the UnicodeType by create_struct_proxy, so maybe UnicodeType is just a StructModel with fields data and kind.

Aha, here's the full StructModel:

https://github.com/numba/numba/blob/f4c4afcb180193064a0c985246c7c71d007c6b1d/numba/cpython/unicode.py#L77-L90

The pyapi.to_native_value that I was using is probably numba.cpython.unicode.unbox_unicode_str, which gives a hint about how to set all of those fields (maybe need to follow numba.core.pythonapi.string_as_string_size_and_kind for the definitions).

We'd need to get the reference counts right, and these are Numba Runtime (NRT) reference counts, not Python ones.

Probably still need to allocate memory for the Numba unicode object and copy the data from the Awkward Array into it, rather than anything zero-copy. Numba will try to delete that memory when the unicode object goes out of scope.
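
The allocate-and-copy step can be sketched in plain Python, under the assumption that the strings live in one content buffer indexed by an offsets array (ListOffsetArray style). The name copy_string_at is hypothetical, and a bytearray stands in for the memory that the Numba unicode object would own; the real thing would be an NRT allocation filled by a memcpy in lowered code.

```python
from typing import Sequence


def copy_string_at(content, offsets: Sequence[int], i: int) -> bytearray:
    """Copy the bytes of string i into a freshly allocated, owned buffer."""
    start, stop = offsets[i], offsets[i + 1]
    # memoryview slicing is zero-copy; bytearray(...) does the single copy
    # into memory owned by the result, independent of the source buffer.
    return bytearray(memoryview(content)[start:stop])


# Three strings stored back to back: "one", "two", "three"
content = bytearray(b"onetwothree")
offsets = [0, 3, 6, 11]

owned = copy_string_at(content, offsets, 1)

# Overwriting the source afterwards does not affect the copy.
content[3:6] = b"TWO"
```

Because the copy owns its memory, it can be freed when the string goes out of scope without touching the Awkward Array's buffer.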

Numba's unicode type has all of the unicode functions, like the ones we imported from Arrow in ak.str.*. It would be interesting to do a race between them, recognizing that the Arrow functions do not need to copy string data and the Numba functions do.

cc @martindurant and @douglasdavis

@jpivarski jpivarski added the performance Works, but not fast enough or uses too much memory label Sep 8, 2023
@martindurant
Contributor

> It would be interesting to do a race between them, recognizing that the Arrow functions do not need to copy string data and the Numba functions do.

Arrow does need to copy if the Awkward strings are not contiguous, I think.

(In light testing, the Arrow string kernels are not much faster than a Python loop, which is maybe not surprising since the basic algorithms are probably the same. This is for operations that don't create lists/strings as output, so the upfront cost of making Python strings is paid before benchmarking... Maybe the bulk of the time is in unicode lookups, not in any memcopies.)

@jpivarski
Member Author

That's true: an Awkward ListArray has to be copied when making list arrays in Arrow, but an Awkward ListOffsetArray does not. ListOffsetArrays are more common, though: any Awkward operation that needs to rewrite a list array would rewrite it in a way that makes it contiguous.
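
The distinction can be sketched with plain Python lists standing in for the index buffers. This is only an illustration of the idea, not Awkward's or Arrow's implementation, and the names is_contiguous and to_offsets are hypothetical: separate starts/stops (ListArray style) can point anywhere in the content, so they only collapse to a single offsets array without a copy when each list ends exactly where the next begins.

```python
def is_contiguous(starts, stops):
    """True if every list ends exactly where the next one begins."""
    return all(stops[i] == starts[i + 1] for i in range(len(starts) - 1))


def to_offsets(starts, stops, content):
    """Return (offsets, content) in offsets form, copying only when needed."""
    if len(starts) == 0:
        return [0], content
    if is_contiguous(starts, stops):
        # Zero-copy: reuse the same content buffer, just merge the indexes.
        return [starts[0]] + list(stops), content
    # Otherwise, gather the pieces into a new contiguous buffer (a copy).
    new_content = bytearray()
    offsets = [0]
    for start, stop in zip(starts, stops):
        new_content += content[start:stop]
        offsets.append(len(new_content))
    return offsets, bytes(new_content)


content = b"onetwothree"

# Contiguous and in order: the content buffer is passed through untouched.
offsets_a, content_a = to_offsets([0, 3, 6], [3, 6, 11], content)

# Out of order with a gap: the content has to be rewritten.
offsets_b, content_b = to_offsets([6, 0], [11, 3], content)
```

In the first call content_a is the same object as content; in the second, content_b is a new buffer holding the bytes of "three" followed by "one".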

@jpivarski jpivarski added this to Unprioritized in Finalization Jan 19, 2024
@jpivarski jpivarski removed this from Unprioritized in Finalization Jan 19, 2024