Allow StringArray[python] to be backed by numpy StringDType in numpy 2.0 #58578

Draft
wants to merge 72 commits into
base: main

Conversation

lithomas1
Member

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Just testing for CI again.

# in the next iteration when the created str object is GC'ed,
# clobbering the value of v
#if values.dtype.kind == "T":
v = strdup(v)
Member

I'm a bit wary of managing the lifecycle this way. So the existing implementation has no ownership of the string lifecycle, right? It's probably easier to make that a StringView hash table, then create a dedicated String hash table which does copy.

This is another case where using C++ would be a better language choice than tempita (see also https://github.com/pandas-dev/pandas/pull/57730/files)

Member Author

Yeah, editing anything in tempita kinda sucks in general.

But yes, I think the existing implementation doesn't have ownership of the Python string objects.

Turning this into a StringViewHashTable and subclassing it sounds good to me.
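To make the view/owning split discussed above concrete, here is a minimal C sketch. It is not the pandas hash table: `key_table`, `table_add`, `table_destroy`, and `key_survives_reuse` are hypothetical names, and a fixed-size array stands in for the real khash-based storage. The only point is the difference between storing a borrowed pointer and storing a copy that the table must later free.

```c
#define _POSIX_C_SOURCE 200809L  /* for strdup */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_CAP 8

/* Hypothetical key store; a stand-in for the real hash table. */
typedef struct {
    const char *keys[TABLE_CAP];
    size_t n;
    int owns_keys;  /* 0: "view" table (borrows), 1: owning table (copies) */
} key_table;

void table_init(key_table *t, int owns_keys) {
    assert(t != NULL);
    t->n = 0;
    t->owns_keys = owns_keys;
}

/* Returns 0 on success, -1 if the table is full or the copy fails. */
int table_add(key_table *t, const char *key) {
    if (t->n == TABLE_CAP) return -1;
    if (t->owns_keys) {
        char *copy = strdup(key);  /* owning table keeps its own copy */
        if (copy == NULL) return -1;
        key = copy;
    }
    t->keys[t->n++] = key;  /* view table stores the caller's pointer */
    return 0;
}

/* Only the owning table frees keys; the view table never allocated any. */
void table_destroy(key_table *t) {
    if (t->owns_keys) {
        for (size_t i = 0; i < t->n; i++) free((void *)t->keys[i]);
    }
    t->n = 0;
}

/* Adds a key from a scratch buffer, overwrites the buffer, and reports
   whether the stored key still reads "first" (1) or was clobbered (0). */
int key_survives_reuse(int owns_keys) {
    char buf[16];
    key_table t;
    table_init(&t, owns_keys);
    strcpy(buf, "first");
    if (table_add(&t, buf) != 0) return -1;
    strcpy(buf, "second");  /* simulate the backing memory being reused */
    int survived = strcmp(t.keys[0], "first") == 0;
    table_destroy(&t);
    return survived;
}
```

With the view table the stored key follows whatever the buffer now holds, which mirrors the clobbering described in the code comment under review; the owning table pays for a copy and takes on the duty of freeing it.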

Contributor

You could get the UTF-8 string data from the array entry directly via the NumPy C API, without going through PyArray_GETITEM:

https://numpy.org/neps/nep-0055-string_dtype.html#packing-and-loading-strings

NumPy's Cython bindings don't cover this API yet, but it's on my list of things to do. It probably makes sense to manage the allocators with a context manager, for example.

I also see that the new C API isn't yet covered in the C API docs and I need to make sure there are docs for the stringdtype C API before the 2.0 release happens. Thank you for prompting me to notice that oversight!

# in the next iteration when the created str object is GC'ed,
# clobbering the value of v
#if values.dtype.kind == "T":
v = strdup(v)
Member

This probably leaks in the current implementation

Member Author

@lithomas1, May 8, 2024

I should be freeing these strings in __dealloc__ if I didn't mess this up.

EDIT: Never mind, I'm stupid 😓

Member

No, I wouldn't put it there either. __dealloc__ is the inverse of __cinit__; any memory allocations performed outside of those functions need to be managed with their own explicit lifecycle.
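One way such a leak could arise, sketched in plain C: if every value is duplicated before insertion but the table only keeps the pointer for keys it has not seen, the duplicate for a repeated key is never stored and never freed. This is an assumption about the failure mode, not a diagnosis of the PR, and `str_set`, `set_put`, `set_clear`, and `insert_all` are hypothetical names rather than pandas or khash functions.

```c
#define _POSIX_C_SOURCE 200809L  /* for strdup */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SET_CAP 16

/* Hypothetical string set that owns the keys it stores. */
typedef struct {
    char *keys[SET_CAP];
    size_t n;
} str_set;

/* Takes ownership of `key` only when it is newly inserted.
   Returns 1 if inserted, 0 if already present, -1 if the set is full. */
int set_put(str_set *s, char *key) {
    assert(s != NULL && key != NULL);
    for (size_t i = 0; i < s->n; i++) {
        if (strcmp(s->keys[i], key) == 0) return 0;
    }
    if (s->n == SET_CAP) return -1;
    s->keys[s->n++] = key;
    return 1;
}

/* Explicit release of every stored copy, paired with the inserts
   rather than with object construction. */
void set_clear(str_set *s) {
    for (size_t i = 0; i < s->n; i++) free(s->keys[i]);
    s->n = 0;
}

/* Duplicates each value before inserting it. Returns how many copies
   were freed here because the set did not take ownership of them. */
size_t insert_all(str_set *s, const char *const *values, size_t count) {
    size_t released = 0;
    for (size_t i = 0; i < count; i++) {
        char *copy = strdup(values[i]);
        if (copy == NULL) break;
        if (set_put(s, copy) != 1) {
            free(copy);  /* without this, repeated keys leak their copy */
            released++;
        }
    }
    return released;
}
```

The release of rejected copies happens at the insert site, and the release of stored copies happens in `set_clear`, so each allocation has one clearly named owner regardless of when the enclosing object is torn down.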

@mroeschke added the labels Strings (String extension data type and string data) and Compat (pandas objects compatability with Numpy or Python functions) on May 8, 2024

value = value._ndarray

# np.where will not preserve the StringDType
# TODO: ask Nathan about this
Contributor

I opened numpy/numpy#26420 for this.

Labels
Compat pandas objects compatability with Numpy or Python functions Strings String extension data type and string data
4 participants