Stop making Numba strings through PyObject* #2704

jpivarski · 2023-09-08T17:11:39Z

Version of Awkward Array

HEAD

Description and code to reproduce

When iterating over an Awkward Array of strings in Numba, we present them as Numba's internal lowered string objects, but we do it by first creating Python strings (a new PyObject* for each one, which has to be DECREFed and GIL-protected).

awkward/src/awkward/_connect/numba/layout.py

Lines 74 to 89 in 461b990

    
    pyapi = context.get_python_api(builder)
    gil = pyapi.gil_ensure()
    strptr = builder.bitcast(rawptr_cast, pyapi.cstring)
    if viewtype.type.parameters["__array__"] == "string":
        kind = context.get_constant(numba.types.int32, pyapi.py_unicode_1byte_kind)
        pystr = pyapi.string_from_kind_and_data(kind, strptr, strsize_cast)
    else:
        pystr = pyapi.bytes_from_string_and_size(strptr, strsize_cast)
    out = pyapi.to_native_value(rettype, pystr).value
    pyapi.decref(pystr)
    pyapi.gil_release(gil)

The above implementation uses pyapi.to_native_value to avoid having to even know what Numba's lowered string type is, but Numba does have a lowered string type.

I think it's the nb.types.UnicodeType, which has a concrete instantiation as nb.types.unicode_type. For the equivalent of a Python bytes, there's nb.types.Bytes, but I don't know if that one has a concrete instantiation.
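One way to check which types Numba infers for Python `str` and `bytes` values is `numba.typeof`. The sketch below is not from the Awkward codebase; it assumes Numba is installed (and returns `None` otherwise), and the helper name `describe_string_types` is made up for illustration.

```python
try:
    import numba
except ImportError:  # Numba is optional for this sketch
    numba = None


def describe_string_types():
    """Report the Numba types inferred for a str and a bytes value.

    Returns None when Numba is not available.
    """
    if numba is None:
        return None
    str_type = numba.typeof("abc")
    bytes_type = numba.typeof(b"abc")
    return {
        "str_type": str_type,
        # Is the inferred type the unicode_type singleton?
        "str_is_unicode_type": str_type == numba.types.unicode_type,
        "str_is_UnicodeType": isinstance(str_type, numba.types.UnicodeType),
        "bytes_type": bytes_type,
        # Bytes is parameterized, so compare by class rather than by instance.
        "bytes_is_Bytes": isinstance(bytes_type, numba.types.Bytes),
    }


print(describe_string_types())
```

Printing `bytes_type` shows the parameters that the inferred bytes type carries, which is the information needed to build a matching instance by hand.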

This seems to be the way to make a nb.types.Bytes object, in a lowered context:

https://github.com/numba/numba/blob/f4c4afcb180193064a0c985246c7c71d007c6b1d/numba/cpython/charseq.py#L222-L237

There's a _make_constant_bytes helper that would show how to do that. But also notice that it's extracting data from the UnicodeType by create_struct_proxy, so maybe UnicodeType is just a StructModel with fields data and kind.

Aha, here's the full StructModel:

https://github.com/numba/numba/blob/f4c4afcb180193064a0c985246c7c71d007c6b1d/numba/cpython/unicode.py#L77-L90

The pyapi.to_native_value that I was using is probably numba.cpython.unicode.unbox_unicode_str, which gives a hint about how to set all of those fields (maybe need to follow numba.core.pythonapi.string_as_string_size_and_kind for the definitions).

We need to get the reference counts right, and they're Numba Runtime (NRT) reference counts, not Python reference counts.

Probably still need to allocate memory for the Numba unicode object and copy the data from the Awkward Array into it, rather than anything zero-copy. Numba will try to delete that memory when the unicode object goes out of scope.
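To make the copy concrete: Awkward stores string data as UTF-8 bytes, while CPython-style unicode objects (which Numba's struct mirrors) use a fixed width of 1, 2, or 4 bytes per code point. The pure-Python sketch below shows only the bookkeeping a native fill would need. The function name and the returned dictionary are illustrative, the `length` and `is_ascii` names are my assumption about the struct's fields beyond `data` and `kind`, and a real implementation would do this in lowered code with NRT-allocated memory.

```python
import sys


def utf8_to_fixed_width(raw: bytes) -> dict:
    """Convert UTF-8 bytes to a fixed-width buffer plus its metadata."""
    text = raw.decode("utf-8")
    max_cp = max(map(ord, text), default=0)

    # Pick the narrowest width that can hold every code point.
    if max_cp < 0x100:
        kind = 1
    elif max_cp < 0x10000:
        kind = 2
    else:
        kind = 4

    # Copy each code point into a native-endian, fixed-width slot.
    data = b"".join(ord(ch).to_bytes(kind, sys.byteorder) for ch in text)

    return {
        "kind": kind,
        "length": len(text),        # number of code points, not bytes
        "is_ascii": max_cp < 0x80,
        "data": data,
    }


print(utf8_to_fixed_width("héllo".encode("utf-8")))
```

For pure-ASCII input the output buffer is byte-for-byte identical to the input, so the copy in that case is a plain `memcpy`; anything outside ASCII changes the byte count.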

Numba's unicode type has all of the unicode functions, like the ones we imported from Arrow in ak.str.*. It would be interesting to do a race between them, recognizing that the Arrow functions do not need to copy string data and the Numba functions do.

cc @martindurant and @douglasdavis


martindurant · 2023-09-28T15:07:57Z

> It would be interesting to do a race between them, recognizing that the Arrow functions do not need to copy string data and the Numba functions do.

Arrow does need to copy if the Awkward strings are not contiguous, I think.

(In light testing, the Arrow string kernels are not much faster than a Python loop, which is maybe not surprising since the basic algorithms are probably the same. This is for operations that don't create lists/strings as output, so the upfront cost of making Python strings is paid before benchmarking... Maybe the bulk of the time is in unicode lookups, not any memcopies.)

jpivarski · 2023-09-28T16:47:45Z

That's true: an Awkward ListArray has to be copied when making list arrays in Arrow, but an Awkward ListOffsetArray does not. ListOffsetArrays are more common, though—any Awkward operation that needs to rewrite a list array would rewrite it in a way that makes it contiguous.
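The distinction can be illustrated without Awkward itself. In this minimal sketch (plain Python, hypothetical helper names), a ListOffsetArray-style layout describes the strings with one `offsets` array over a contiguous buffer, which is also how Arrow lays out strings. A ListArray-style layout uses independent `starts` and `stops` that may be out of order or leave gaps, so the bytes have to be gathered into a new buffer first.

```python
def is_contiguous(starts, stops):
    """True if every list begins exactly where the previous one ended."""
    return all(stops[i] == starts[i + 1] for i in range(len(starts) - 1))


def to_offsets(content: bytes, starts, stops):
    """Return (content, offsets, copied) in an offsets-based layout.

    The buffer is reused when the lists are already contiguous and
    rebuilt by gathering the pieces otherwise.
    """
    if len(starts) == 0:
        return content, [0], False

    if is_contiguous(starts, stops):
        # Zero-copy path: offsets are the starts plus the final stop.
        return content, list(starts) + [stops[-1]], False

    # Copy path: gather each piece in order and rebuild the offsets.
    pieces = [content[a:b] for a, b in zip(starts, stops)]
    offsets = [0]
    for piece in pieces:
        offsets.append(offsets[-1] + len(piece))
    return b"".join(pieces), offsets, True


buffer = b"onetwothree"
print(to_offsets(buffer, [0, 3, 6], [3, 6, 11]))   # contiguous: buffer reused
print(to_offsets(buffer, [6, 0, 3], [11, 3, 6]))   # reordered: pieces copied
```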

jpivarski added the performance (Works, but not fast enough or uses too much memory) label Sep 8, 2023

jpivarski added this to Unprioritized in Finalization Jan 19, 2024

jpivarski assigned ianna Jan 19, 2024

jpivarski removed this from Unprioritized in Finalization Jan 19, 2024
