
PDEP-14: Dedicated string data type for pandas 3.0 #58551

Open · jorisvandenbossche wants to merge 18 commits into main

Conversation

Member

@jorisvandenbossche commented May 3, 2024

Following the discussion in #57073, this proposes a possible solution to get a string dtype in pandas 3.0 (essentially writing out my compromise attempt at #57073 (comment) as a formal proposal).
This also covers the issue tracking the required work for the string dtype in #54792.

Abstract

This PDEP proposes to introduce a dedicated string dtype that will be used by default in pandas 3.0:

  • In pandas 3.0, enable a "string" dtype by default, using PyArrow if available or otherwise the numpy object-dtype alternative.
  • The default string dtype will use NaN-based missing value semantics, consistent with the other default data types.

This will give users a long-awaited proper string dtype for 3.0, while 1) not (yet) making PyArrow a hard dependency, but still a dependency used by default, and 2) leaving room for future improvements (different missing value semantics, using NumPy 2.0 or nanoarrow, etc).
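For context, the proposed default behaviour can already be previewed today: on pandas ≥ 2.1 with PyArrow installed, the existing opt-in flag gives the dtype this PDEP proposes to enable by default. A minimal sketch:

```python
# Preview of the proposed default, assuming pandas >= 2.1 with pyarrow installed.
import pandas as pd

pd.options.future.infer_string = True  # opt-in today; the default in 3.0 per this PDEP

ser = pd.Series(["a", "b", None])
print(ser.dtype)        # a dedicated string dtype instead of object
print(ser[2])           # missing value is NaN (NumPy semantics), not pd.NA
print(ser.str.upper())  # string methods work on the new dtype
```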

Sub-discussions:

cc @pandas-dev/pandas-core @pandas-dev/pandas-triage

Contributor

@bashtage left a comment

A good attempt at providing the compromise that is being asked for.

Some possible names that spring to mind: pyarrow_legacy, pyarrow_nan

default in pandas 3.0:

* In pandas 3.0, enable a "string" dtype by default, using PyArrow if available
or otherwise the numpy object-dtype alternative.
Contributor

Should you allow the possibility of a NumPy 2 improved type for pandas 3? With a hierarchy arrow -> np 2 -> np object?

Member Author

This proposal does not preclude any further improvements for the numpy-based string dtype using numpy 2.0. A few lines below I explicitly mention it as a future improvement and in the "Object-dtype "fallback" implementation" section as well.

I just don't want to explicitly commit to anything for pandas 3.0 related to that, given it is hard to judge right now how well it will work / how much work it is to get it ready (not only our own implementation, but also support in the rest of the ecosystem). If it is ready by 3.0, then we can evaluate that separately, but this proposal doesn't stand or fall with it.

Regardless of whether to also use numpy 2.0, we have to agree on 1) making a "string" dtype the default for 3.0, 2) the missing value behaviour to use for this dtype, and 3) whether to provide an alternative for PyArrow (in which case we need the object-dtype version anyway since we also can't require numpy 2.0). I would like the proposal to focus on those aspects.

After acceptance of PDEP-10, two aspects of the proposal have been under
reconsideration:

- Based on user feedback, it has been considered to relax the new `pyarrow`
Contributor

Is it worth mentioning why this has been objected to? As far as I am aware virtually all objections are due to the installation size effect, and not performance or compatibility.

Member Author

I can certainly mention something, but would prefer to keep that brief to focus here on the strings context and not trigger discussion here about the merits of those objections.
(for example, it's not only installation size, but also the difficulty to install from source in case there are no wheels)

Member Author

Added "(mostly around installation complexity and size)"

reconsideration:

- Based on user feedback, it has been considered to relax the new `pyarrow`
requirement to not be a _hard_ runtime dependency. In addition, NumPy 2.0 can
Member

I don't think NumPy 2.0 will reduce the need to make pyarrow a dependency for strings; as far as I am aware it is not natively returned by any I/O operation and it has a completely different string architecture than pyarrow, so there is no zero-copy capability. Those seem like they either will require a large amount of string copying or a hefty amount of updates to make it natively work with our I/O, as well as with the larger Arrow ecosystem. That's a huge amount of things to gloss over

Member Author

I don't think NumPy 2.0 will reduce the need to make pyarrow a dependency for strings

I think it can do that if your motivation for wanting pyarrow is the better performance compared to object-dtype. In that case, numpy 2.0's StringDType can give you a part of the speedup, without requiring pyarrow.
The discussion in #57073 also started from that point of view, mentioning numpy 2.0 as an alternative to requiring pyarrow, so based on that my feeling is that what I wrote here is correct (or at least seen as such by some people).

But you are completely right that there are a lot of things that would need to be implemented to make it fully usable for us. That's also the reason that this PDEP does not say to use numpy 2.0, but defers that as a possible future enhancement, to discuss later. And you are also right that it has drawbacks compared to an Arrow-based solution (using the Arrow memory layout, but not necessarily using the pyarrow package), another reason for me personally to again defer that to a separate discussion.

I just wanted to mention it for the complete context of the string dtype history and discussion. Now, I already mention its existence in the previous paragraph, so could keep it shorter here.
(and if you have any concrete suggestions to word this better, I am all ears!)

topic.

In the first place, we need to acknowledge that most users should not need to
use storage-specific options. Users are expected to specify `pd.StringDtype()`
Member

So we are reusing pd.StringDtype() in this case right? Is that going to break existing use cases where users have relied on that using pd.NA as a sentinel?

Member Author

So we are reusing pd.StringDtype() in this case right?

Yes, and that is what already happens since pandas 2.1 with future.infer_string enabled

Is that going to break existing use cases where users have relied on that using pd.NA as a sentinel?

Yes, I mentioned that in the "Backwards compatibility" section

Member

Ah thanks - sorry for overlooking that. So I think it goes without saying then that if we go this route we no longer will declare pd.StringDtype() experimental? Or are we still trying to keep that reservation knowing even this is not considered a long term design decision?

Member Author

So I think it goes without saying then that if we go this route we no longer will declare pd.StringDtype() experimental?

Yep, given the proposal is to enable this by default, I think that is indeed saying to remove the experimental label (I can mention that somewhere explicitly if that helps)

Or are we still trying to keep that reservation knowing even this is not considered a long term design decision?

Once we have a "string", we will always have one, I think. That aspect is the long term decision this PDEP is proposing. We might change later the missing value semantics, but that doesn't mean the string dtype proposed here is still experimental (just like our default "int64" dtype is not experimental). At the time that we would decide to enable new missing value semantics by default, then "string" will "simply" start meaning something differently.

@jbrockmendel
Member

ValueError: Could not find PDEP number in 'PDEP: Dedicated string data type for pandas 3.0'. Please make sure to write the title as: 'PDEP-num: PDEP: Dedicated string data type for pandas 3.0'.

Currently, the `StringDtype(storage="pyarrow_numpy")` is used, where
"pyarrow_numpy" is a rather confusing option.

TODO see if we can come up with a better naming scheme
Member

StringDtype(storage="pyarrow", semantics="numpy")? or instead of semantics, could use "na_value=np.nan`

Member

If I'm understanding correctly about the motivation for the change in dtype (improved overall user experience), then moving forward I suspect that when we can have improved/native dtypes for other data types (nested, date, etc.) the same logic would need to apply, i.e. we would need to have variants of these with NumPy semantics.

Now this probably falls under PDEP-13, but if we have semantics as an argument (that users would see and use) we could still end up with columns using different missing value indicators?

Member

StringDtype(storage="pyarrow", semantics="numpy")? or instead of semantics, could use "na_value=np.nan`

or maybe "nullable=[True|False]"

However, at the moment, we distinguish the nullable data types for the other dtypes (int, float, etc) with capitalization and so for consistency could also consider string/String as the dtypes.

Member

PDEP-13 proposes StringDtype(backend="pyarrow", na_marker=np.nan). I think the repr should just be updated to reflect that; trying to sift through the meaning of int versus Int versus int[pyarrow] compared to string versus string[pyarrow] versus string[pyarrow_numpy] I think would be a distraction for this proposal

Member Author

StringDtype(storage="pyarrow", semantics="numpy")? or instead of semantics, could use "na_value=np.nan`

@jbrockmendel good point that we can also use other keywords than just storage to make the distinction

if we have semantics as a argument (that users would see and use) we could still end up with columns using different missing value indicators?

Only if users explicitly specify a non-default value for this, and never by default. This is the same with whatever option we come up with (eg also when using dtype_backend="pyarrow" or explicitly asking for one of the masked dtypes with dtype=Int64 or .. you can end up with a DataFrame with columns with mixed semantics)

we distinguish the nullable data types for the other dtypes (int, float, etc) with capitalization and so for consistency could also consider string/String as the dtypes.

Yeah, unfortunately, to be consistent with the other dtypes where we use capitalization, it would need to be "string" for the new NaN-based dtype, and "String" for the "nullable" NA-based variant. So that doesn't help with backwards compatibility, because "string" right now means the nullable dtype. Given that, I would personally not use capitalization here (which is also only a solution for the string alias naming, not for the StringDtype(..) API)


To keep the sub-discussions manageable, I moved this specific topic out of this inline comment thread, and into its own issue: #58613
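For illustration only, here is a self-contained mock (not the pandas API; which keyword to use is exactly what is being debated in #58613) of how a keyword could replace encoding the variant in the storage name:

```python
# Hypothetical mock, not pandas code: the missing-value variant is selected via a
# keyword (na_value) instead of being baked into the storage name ("pyarrow_numpy").
from dataclasses import dataclass
from typing import Any

import numpy as np

_NA = object()  # stand-in for pd.NA, to keep this mock self-contained


@dataclass(frozen=True)
class MockStringDtype:
    storage: str = "pyarrow"   # "pyarrow" or "python"
    na_value: Any = np.nan     # np.nan (default semantics) or the NA sentinel

    def __repr__(self) -> str:
        semantics = "NA" if self.na_value is _NA else "NaN"
        return f"string[{self.storage}, {semantics}]"


print(MockStringDtype())                         # today's "pyarrow_numpy" variant
print(MockStringDtype("python"))                 # the proposed object-dtype fallback
print(MockStringDtype("pyarrow", na_value=_NA))  # the existing NA-based dtype
```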


- Created: May 3, 2024
- Status: Under discussion
- Discussion:
Member

I see no reason not to use #57073 as the discussion issue as any further discussion will be here and #57073 can now focus on whether to reject PDEP-10 and what to do about the planned improvements to other dtypes.

My assumption is that approval of this PDEP should not, in itself, be a justification to overturn the PDEP-10 decision even though they are very much related and the implementation of the fallback option is only applicable if PDEP-10 is formally rejected.

@rhshadrach changed the title from "PDEP: Dedicated string data type for pandas 3.0" to "PDEP-14: Dedicated string data type for pandas 3.0" on May 4, 2024
@rhshadrach
Member

rhshadrach commented May 4, 2024

@jorisvandenbossche - I've renamed this PDEP-14 to fix the doc build job. The docs build automatically picks up added PDEP PRs for the website, and they need a number for that to succeed.

[introduced in pandas 2.1](https://pandas.pydata.org/docs/whatsnew/v2.1.0.html#whatsnew-210-enhancements-infer-strings)
that is still backed by PyArrow but follows the default missing values semantics
pandas uses for all other default data types (and using `NaN` as the missing
value sentinel) ([GH-54792](https://github.com/pandas-dev/pandas/issues/54792)).
Member

The pyarrow_numpy StringArray also returns numpy arrays as results for some operations.

I think this is also important to mention.

Member Author

At this point, I haven't yet mentioned that the original StringDtype returns masked arrays from operations (only that it uses pd.NA). I only mention that when going more in detail on this topic in the "Missing value semantics" subsection. Given that, I would also leave it here to the generic "missing value semantics" for the new variant as well (to not make the background section even longer. I can certainly expand the "Missing value semantics" section if needed)


To avoid a hard dependency on PyArrow for pandas 3.0, this PDEP proposes to keep
a "fallback" option in case PyArrow is not installed. The original `StringDtype`
backed by a numpy object-dtype array of Python strings can be used for this, and
Member

It would be nice to clarify that this is a separate dtype from the original string[python] dtype, just to make it clear that the original StringDtype is not changing (and still will return masked arrays, and use pd.NA as its missing sentinel)

Member Author

I tried to clarify in the test that it is indeed a new variant of the string dtype and uses a subclass to reuse most code


For pandas 3.0, this is the most realistic option given this implementation is
already available for a long time. Beyond 3.0, we can still explore further
improvements such as using nanoarrow or NumPy 2.0, but at that point that is an
Member

@lithomas1 · May 4, 2024

I would drop this bit about nanoarrow (given it is not explained/introduced in the paragraphs beforehand).

If you want to add an explanation above, that's also fine with me.

Member Author

I added a link to the discussion issues for both numpy 2.0 and nanoarrow, so people can find more explanation there if they want.

flag in pandas 2.1 (by `pd.options.future.infer_string = True`).

Some small enhancements or fixes (or naming changes) might still be needed and
can be backported to pandas 2.2.x.
Member

This part of the plan worries me a little.

Maybe it would be better to cut off a 2.3 from 2.2.x.

I think there's a significant proportion of the downloads for 2.2 that aren't on the latest patch release.
I think there's ~ 1/3 of the downloads that are fetching 2.2.0.

Member

@lithomas1 · May 4, 2024

Also,
it would be good to mention which version of pandas is expected to have infer_string be able to infer to the object fallback option.

Member

a 2.3 release (maybe around the same time as 3.0rc) sounds reasonable.

If the features/bugfixes added to 2.3 are limited to the string dtype then we shouldn't need many patch releases. We may not need to backport string-dtype fixes made for 3.0, as these features will be behind a flag in 2.3 and so shouldn't break existing code.

On the other hand, as these features are behind a flag, maybe releasing a 2.3 would not gain the field testing we hope for.

And therefore, instead of doing a 2.3, planning for at least a couple of release candidates for 3.0 would better achieve this.

Member

@jorisvandenbossche

Thoughts on this?

Member Author

Maybe it would be better to cut off a 2.3 from 2.2.x.

Yes, if we still plan to add a deprecation warning and change the naming scheme in StringDtype, calling that 2.3.0 sounds like the best option (I had been planning to propose doing a 2.3.0 (from the 2.2.x branch) anyway to bump the warning for CoW from DeprecationWarning to FutureWarning)


1. Delaying has a cost: it further postpones introducing a dedicated string
dtype that has massive benefits for our users, both in usability as (for the
significant part of the user base that has PyArrow installed) in performance.
Member

I don't think we can just claim this. I don't disagree, but this should be backed up more.

At least from the feedback received from #57073 and the other issue, there's at least a significant part of the user base that doesn't use strings.

There's also a significant chunk of the population that can't install pyarrow (due to size requirements or exotic platforms or whatever).

Member

I am not sure this argument is that convincing either, although for slightly different reasons. I don't think we need to feel rushed for the next release

Member Author

I don't think we can just claim this. I don't disagree, but this should be backed up more.

@lithomas1 can you clarify which part of the paragraph you think requires more backing up?
The fact that I say a "significant" part of our user base has pyarrow installed?

I don't think we can ever know exact numbers for this, but one data point is that pandas currently has 210M monthly downloads and pyarrow has 120M monthly downloads. Of course not all of those pyarrow users are also using pandas, but let's just assume that half of those pyarrow downloads come from people using pandas; that would mean that around 30% of our users already have pyarrow installed, which I would consider a "significant part".
(and my guess is that for people working with larger datasets, where the speed of pyarrow becomes more important, this percentage will be higher, for example because of using the parquet IO)

But anyway, we are never going to know this exact number, but IMO we do know that a significant part of our userbase has pyarrow and will benefit from using that by default.

there's at least a significant part of the user base that doesn't use strings.

Yes, and then this PDEP is not relevant for them. But the fact that some users don't use strings is not a reason to avoid improving the life of those users that do use strings (so I just don't really see how this is a relevant argument).

There's also a significant chunk of the population that can't install pyarrow

Yes, and this PDEP addresses that by allowing a fallback when pyarrow is not installed.

I am not sure this argument is that convincing either, although for slightly different reasons.

@WillAyd can you then clarify which other reasons?

Member

My other reason is that I don't think there is ever a rush to get a release out; we have historically never operated that way

Member Author

I don't think there is ever a rush to get a release out; we have historically never operated that way

For the last six years, we have roughly released a new feature release every six months. We indeed never rush a specific release if there is something holding it up for a bit, but historically we have been releasing somewhat regularly.

At this point, a next feature release will be 3.0 given the amount of changes we already made on the main branch that require the next release cut from main to be 3.0 and not 2.3 (enforced deprecations etc).
(we can cut a 2.3 release from the 2.2.x maintenance branch, which we might want to do for several reasons, but not counting that as a feature release for this discussion, as that will not actually contain features)

So I would say there is not necessarily a rush to do a release with a default "string" dtype (that is up for debate, i.e. this PDEP), but there is some rush to get a 3.0 release out. In the meaning that I think we don't want to delay 3.0 for like half a year or longer.

So for me delaying the string dtype, essentially means not including it in 3.0 but postponing it to pandas 4.0 (I should maybe be clearer in the paragraph above about that).

And then I try to argue in the text here that postponing it to 4.0 has a cost (or, missed benefit), because we have an implementation we could use for a default string dtype in pandas 3.0, and postponing it means users will keep using the sub-optimal object dtype for longer, for (IMO) no good reason.

Member

I don't think we can just claim this. I don't disagree, but this should be backed up more.

@lithomas1 can you clarify which part of the paragraph you think requires more backing up? The fact that I say a "significant" part of our user base has pyarrow installed?

It'd be nice to add how much perf benefits Arrow strings are expected to bring (e.g. 20%? 2x? 10x?).
Putting in the part about how many users have pyarrow would also help.

It'd also be good to elaborate on the usability part. IIUC, the main benefit here is not having to manually check each element to see whether your object dtype'd column contains strings (since I think all the string methods work on object dtype'd columns).

I think it's also fair to amend this part to say "massive benefits to users that use strings" (instead of in general).

Member

Benchmarks are going to be highly dependent on usage and context. If working in an Arrow native ecosystem, the speedup of strings may be a factor over 100x. If working in a space where you have to copy back and forth a lot with NumPy, that number goes way down.

I think trying to set expectations on one number / benchmark for performance is futile, but generally Arrow only helps, and makes it so that we as developers don't need to write custom I/O solutions (eg: ADBC Drivers, parquet, read_csv with pyarrow all work with Arrow natively with no extra pandas dev effort)

Member Author

It'd be nice to add how much perf benefits Arrow strings are expected to bring (e.g. 20%? 2x? 10x?).

Benchmarks are going to be highly dependent on usage and context.

Indeed, for single operations you can easily get a >10x speedup, but of course a typical workflow does not consist of just string operations, and the overall speedup depends a lot on the context (see this slide for one small example comparison: https://phofl.github.io/pydata-berlin/pydata-berlin-2023/intro.html#74, and this blogpost from Patrick showing the benefit in a dask example workflow: https://towardsdatascience.com/utilizing-pyarrow-to-improve-pandas-and-dask-workflows-2891d3d96d2b).

but generally Arrow only helps, and makes it so that we as developers don't need to write custom I/O solutions

That is often true, except for strings ;).
For strings, the faster compute kernels will still give a lot of value even if your IO wasn't done through Arrow (and a lot more value than using pyarrow gives for numeric data)

Member

@simonjayhawkins left a comment

Thanks @jorisvandenbossche for the PDEP.

I am generally in agreement with the motivation for this PDEP on the proviso that any approval is not rejecting PDEP-10. The motivation of accepting PDEP-10 by the team members could have been related to the perceived maintenance burden, a more performant string dtype, interoperability, having better default inference for other data types or maybe some other reason. This current PDEP only addresses one aspect of that decision.

One other aspect that is not mentioned here and was not mentioned in PDEP-10 is the consequences of choosing PyArrow as a backend. Bearing in mind that it was felt that the implications of using nullable semantics for default dtypes were not discussed, I wonder whether we should have a section that discusses the other implications of choosing PyArrow in this PDEP, e.g. the implications of choosing 1d immutable arrays as the backend.

web/pandas/pdeps/00xx-string-dtype.md (outdated, resolved)
Comment on lines 105 to 106
4. We update installation guidelines to clearly encourage users to install
pyarrow for the default user experience.
Member

and do we consider adding a performance warning to the fallback also?

Member Author

and do we consider adding a performance warning to the fallback also?

I personally wouldn't do that always / for each method, because that would be super noisy (and in some cases, like smallish data, it doesn't matter that much, so getting those warnings would be annoying).

If we wanted to warn users to gently push them towards installing pyarrow, I think we could do a warning but only 1) raise it once, and 2) only when doing one of the string operations on a big enough dataset (with some threshold).

Now, your question reminds me that the current pyarrow-backed string dtype has those fallback warnings for very specific cases, which I personally think we should stop doing when it becomes the default dtype. Given this is already for the existing implementation (and to keep the many discussion lines here a bit more limited), I opened a separate issue for this: #58581.
(but if there is agreement on that other issue, can of course briefly mention that here later)
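Purely as an illustration of the "warn once, only above some size threshold" idea mentioned above, a hypothetical sketch (none of these names or thresholds exist in pandas):

```python
# Hypothetical sketch of a once-only performance warning for the object-dtype
# fallback; the function, flag, and threshold are made up for illustration.
import warnings

_PERF_WARNING_EMITTED = False
_MIN_ROWS_FOR_WARNING = 1_000_000  # arbitrary threshold


def maybe_warn_slow_string_path(n_rows: int) -> None:
    """Warn at most once, and only for large enough data."""
    global _PERF_WARNING_EMITTED
    if _PERF_WARNING_EMITTED or n_rows < _MIN_ROWS_FOR_WARNING:
        return
    _PERF_WARNING_EMITTED = True
    warnings.warn(
        "Using the object-dtype string fallback; install pyarrow for faster "
        "string operations.",
        UserWarning,
        stacklevel=2,
    )
```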

Member

Fair point. Given the recent user feedback on adding the deprecation warning for the PyArrow requirement, maybe not having any warnings is wise.

that the current pyarrow-backed string dtype has those fallback warnings for very specific cases, which I personally think we should stop doing when it becomes the default dtype.

+1

web/pandas/pdeps/00xx-string-dtype.md (outdated, resolved)

Comment on lines 184 to 185
dtype that has massive benefits for our users, both in usability as (for the
significant part of the user base that has PyArrow installed) in performance.
Member

Suggested change
dtype that has massive benefits for our users, both in usability as (for the
significant part of the user base that has PyArrow installed) in performance.
dtype that has massive benefits for our users, both in usability and, for users that already have PyArrow installed or have no issues installing PyArrow, in performance.

web/pandas/pdeps/00xx-string-dtype.md (outdated, resolved)

jorisvandenbossche and others added 2 commits May 5, 2024 13:55
Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>
Comment on lines 100 to 101
if installed, and otherwise falls back to an in-house functionally-equivalent
(but slower) version.
Contributor

Isn't the "in-house functionally-equivalent (but slower) version" the current implementation based on numpy 1.x in version 2.2, but we now make the dtype string instead of object ?

Member Author

In the section below that expands on this object-dtype based implementation ("Object-dtype "fallback" implementation"), there is a bit longer explanation and I also link to the open PR implementing this: #58451

It is based on the current StringDtype / StringArray (using object dtype under the hood), and not directly on how object-dtype columns work right now. But anyway, both use the same implementation for the string accessor methods, and this new variant will also use that same implementation.

(and fwiw, this is not specific to numpy 1.x, it will also work on numpy 2.x, it just uses object dtype)

Contributor

I have to admit that I am pretty confused by these various implementations and the corresponding semantics used to describe them.

I'd suggest having a summary of what exists today in pandas 2.2 (strings with object dtype and np.nan, the current StringDtype with pd.NA, the "experimental" pyarrow-based implementations with both pd.NA and np.nan available, and anything else), what would become available due to this PDEP, and how it might change in the future depending on how we decide to handle missing values, as well as nanoarrow and NumPy 2.0 strings.

Member Author

Current situation (pandas 2.2) is:

  • Current default: object dtype with np.nan or None
  • Experimental opt-in string dtypes using pd.NA: StringDtype() with storage being "python" (default, object dtype under the hood) or "pyarrow"
  • Future string dtypes using NaN (behind pd.options.future.infer_string = True): StringDtype() with storage being "pyarrow_numpy"
    • This is the dtype that is essentially being proposed in this PDEP, but already exists since pandas 2.1 (#54792)

And then this PDEP also proposes adding an extra option for the third bullet point, i.e. having a StringDtype() using NaN but backed by an object-dtype array instead of pyarrow, which I dubbed (for now) "python_numpy" in the PR adding this (#58451), but that name is still being discussed.

As a starter (still thinking about how to make this clearer in the PDEP), does the above clarify it for you?

Note that the above listing leaves out pd.ArrowDtype(<some pyarrow string type>), because that is not really relevant for the discussion right now and I am not sure mentioning it will help, but that is yet another way to store strings.
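To make the listing above concrete, a short sketch using the constructors as they exist in pandas 2.2 (assuming PyArrow is installed):

```python
# pandas 2.2 with pyarrow installed; comments describe the missing value sentinel.
import pandas as pd

pd.Series(["a", None])                                  # default: object dtype, None/np.nan
pd.Series(["a", None], dtype="string")                  # StringDtype(storage="python"), pd.NA
pd.Series(["a", None], dtype="string[pyarrow]")         # StringDtype(storage="pyarrow"), pd.NA
pd.Series(["a", None], dtype=pd.StringDtype("pyarrow_numpy"))  # pyarrow-backed, np.nan

pd.options.future.infer_string = True                   # opt-in future behaviour (>= 2.1)
pd.Series(["a", None])                                  # inferred as the "pyarrow_numpy" variant
```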

Contributor

As a starter (still thinking about how to make this clearer in the PDEP), does the above clarify it for you?

Yes, although I'd suggest for the PDEP a table that indicates the constructor version of the dtype (e.g., dtype=object, or dtype=pd.StringDtype()), and the string representation of the dtype.

Note that the above listing leaves out pd.ArrowDtype(<some pyarrow string type>), because that is not really relevant for the discussion right now and I am not sure mentioning it will help, but that is yet another way to store strings.

Is that equivalent to having a pure "pyarrow" backed string that uses pd.NA as the null semantics? If not, where does that fit in? Or is it not available?

Member Author

Is that equivalent to having a pure "pyarrow" backed string that uses pd.NA as the null semantics? If not, where does that fit it? Or is it not available?

StringDtype(storage="pyarrow") is also a "pure pyarrow backed string that uses NA", to be clear. But so
StringDtype(storage="pyarrow")and ArrowDtype(pa.(large_)string()) are essentially equivalent, except for using a different dtype and array class.

can be backported to pandas 2.2.x.

The variant using numpy object-dtype could potentially also be backported to
2.2.x to allow easier testing.
Contributor

Don't you mean "numpy 2.0 string-dtype"? Because the "numpy object-dtype" is currently there? This labeling is confusing.

Given the discussions elsewhere about what the names for the dtypes now are ("string[pyarrow]", "string[pyarrow_numpy]", etc.), which I can't keep track of, I think the nomenclature for the string dtypes should be specified, comparing what is in 2.2 to what would be implemented as a result of this PDEP.

The possible strings are confusing. Which strings can be used in a dtype argument in a constructor or astype()? Which strings would be seen by users when they do Series.dtype? There is what currently exists in pandas 2.2, and what would exist based on this PDEP, and I'm not seeing what the conclusion of that would be.

Member Author

Because the "numpy object-dtype" is currently there?

No, it is not yet there (see my answer to your previous comment)


1. Delaying has a cost: it further postpones introducing a dedicated string
dtype that has massive benefits for our users, both in usability as (for the
significant part of the user base that has PyArrow installed) in performance.
2. In case we eventually transition to use `pd.NA` as the default missing value
Member

the challenges around this will not be unique to the string dtype and
therefore not a reason to delay this.

I might be missing the intent but I don't understand why the larger issue of NA handling means we should be faster to implement this

Member Author

I don't understand why the larger issue of NA handling means we should be faster to implement this

It's not a reason to do it "faster", but I meant to say that the discussion regarding NA is not a reason to do it "slower" (to delay introducing a dedicated string dtype)

Member

I think the flip side is that if we aren't careful about the NA handling we can introduce some new keywords / terminology that makes it very confusing in the long run (which is essentially one of the problems with our strings naming conventions)

As a practical example, if we decided we wanted semantics= as a keyword argument to StringDtype in this PDEP to move the NA discussion along, that might be counter-productive when we look at more data types and decide semantics= was not a clear way to allow datetime data types to support pd.NaT as the missing value.

(not saying the above is necessarily the truth, just cherry picking from conversation so far)

Member Author

That's one reason that I personally would prefer not introducing a keyword specifically for the missing value semantics, for now (just for this PDEP / the string dtype). I just listed some options in #58613, and I think we can do without it.


Wouldn't adding even more variants of the string dtype make things only more
confusing? Indeed, this proposal unfortunately introduces more variants of the
string dtype. However, the reason for this is to ensure the actual default user
Member

This just retroactively clarifies the reasoning for string[pyarrow_numpy] to have existed in the first place, right? Or is it supposed to be hinting at some other feature that the implementation details of the PDEP are proposing?

Member Author

Yes, it's indeed explaining why we did this, which is of course "retroactively" given I was asked to write this PDEP partly for changes that have already been released. So a big part of the PDEP is retroactive in that sense (which is not necessarily helping to write it clearly ...).

Member

Or is it supposed to be hinting at some other feature that the implementation details of the PDEP are proposing?

However, more importantly, the PDEP makes this (the already added dtype) the default in 3.0. It would remain behind the future flag for the next release if enough people feel we are not ready.

One other backwards incompatible change is present for early adopters of the
existing `StringDtype`. In pandas 3.0, calling `pd.StringDtype()` will start
returning the new default string dtype, while up to now this returned the
experimental string dtype using `pd.NA` introduced in pandas 1.0. Those users
Member

Historically you would get this by using dtype="string" too right? I'm a little wary that we are underestimating the scope of how breaking this could be; I didn't even realize we considered that dtype experimental all this time

Member

This has been available (as pyarrow backed) since 1.3, so almost three years (July 2, 2021). Even though it is considered experimental, if the new string dtype is not accepted for 3.0, then maybe a deprecation warning should be added? (We could also do this if it is decided a 2.3 release is needed?)

Member Author

A deprecation warning about what exactly?

Member Author

I'm a little wary that we are underestimating the scope of how breaking this could be

The scope of changing NaN to NA for all users is much bigger though (essentially what was decided in PDEP-10 if we would follow it strictly to the letter).
And similarly if we would in the future change NaN/NaT semantics to NA for all dtypes, the scope will be much bigger (because once that is enabled by default, for example a user that was doing dtype="float64" will probably get the new NA behaviour while now it uses NaN), but we are still considering that (granted, it's exactly those details that we have to discuss a lot more in detail (elsewhere) and figure out, though).

I know that this is not necessarily a good argument to justify this breaking change (because we certainly should be wary of the scope of those breaking changes), but I do want to point out again that the choice in this PDEP to use NaN semantics is to reduce the scope of the breaking changes for most users (at the expense of increasing the scope of breaking changes for the smaller subset of users that was already using dtype="string").

If we don't want to make dtype="string" breaking, then either we need to come up with a different name for the dtype (not using "string", like "utf8" or "text"), or we need to delay introducing a default string dtype until after we have agreement on the NA discussions.

And personally I think "string" is by far the best name (and I find the small breakage worth it for being able to use that name), and as I argued elsewhere (and in the Why not delay introducing a default string dtype? section in the PDEP text), I think it is valuable for our users to not wait with adding a dedicated string dtype until we are ready with the NA discussion and implementation.

Member

at the expense of increasing the scope of breaking changes for the smaller subset of users that was already using dtype="string"

This is where I am a little uncomfortable - I don't know how to measure the size of that, but I am wary of assuming it is not a significant number of users. The fact that "string" returns NA as a missing value is a documented difference in our code base:

https://pandas.pydata.org/docs/dev/user_guide/text.html#behavior-differences

And its usage has been promoted for quite some time:

https://stackoverflow.com/a/60553529/621736
https://towardsdatascience.com/why-we-need-to-use-pandas-new-string-dtype-instead-of-object-for-textual-data-6fd419842e24
https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.1.0.html#all-dtypes-can-now-be-converted-to-stringdtype

If we don't want to make dtype="string" breaking, then either we need to come up with a different name for the dtype (not using "string", like "utf8" or "text"), or either we need to delay introducing a default string dtype until after we have agreement on the NA discussions.

Yeah, none of these options are great... but out of them I would still probably prefer waiting. I think right now we are marching down a path where "string" missing values are:

  1. Returning pd.NA today
  2. Returning np.nan with this PDEP (granted those changes are already in main)
  3. Going back to returning pd.NA with the NA PDEP

Contributor

But personally I think dtype="string" meaning something different than the default string dtype you get without specifying the dtype is going to be very confusing ...

I think we have to carefully specify what the user specifies in a dtype argument and how that gets interpreted, versus what we return as the dtype when they look at Series.dtype.

So we could have a mapping that says

| User specifies `dtype=`      | pandas returns `Series.dtype`               |
|------------------------------|---------------------------------------------|
| Unspecified                  | "string[pyarrow_numpy]" OR "string[python]" |
| "string"                     | "string[pyarrow]"                           |
| StringDtype("pyarrow")       | "string[pyarrow]"                           |
| StringDtype("python")        | "string[python]"                            |
| StringDtype("pyarrow_numpy") | "string[pyarrow_numpy]"                     |

The first row depends on whether pyarrow is installed.
For the second, third and fifth rows, if pyarrow is not installed, we raise an Exception.

Separately, we can then debate what the values in the second column should look like in #58613 . I personally am not a fan of "pyarrow_numpy"

Member

No, my answer to your example snippet was trying to explain how I would ensure this does not break (returning a bool column instead of an object-dtype column with True/False/NaN will ensure that filtering keeps working).

Ah OK - I didn't realize you were proposing that change be a part of this PDEP, I just thought it was an idea you had for the future. But that's a completely new behavior... and it raises the question of whether we go back and change dtype=object to have that same behavior, or have dtype="string" exclusively have it. Ultimately we end up with the same issue

Member

Yeah, I also agree with Will that it's not fair to change this without warning for people already using "string".
(pd.NA is also a big selling point of dtype="string")

Maybe a good compromise would be to use string[pyarrow] under the hood for those users (if they had it installed)?

If we were to move ahead with the move to nullable dtypes in general, I worry that this changing of the na value for dtype="string" from pd.NA -> np.nan -> pd.NA will cause a lot of confusion.

If we were to do 2.3 (like I suggested below), this might be addressable there (with a deprecation).

Member Author

Still adding some deprecation warnings in 2.x for current users of StringDtype is something we certainly could do. I am personally ambivalent about it, but fine with adding it if others think that is better (I do think it might become quite noisy, and it also does not change the fact that 3.0 would switch from NA to NaN)

The warning message could then point people to enable pd.options.future.infer_string = True in case they only care about having the (faster) string dtype, or otherwise update their dtype specification if they want the NA instead of NaN version.

Member Author

I think we have to carefully specify what the user specifies in a dtype argument and how that gets interpreted, versus what we return as the dtype when they look at Series.dtype.

So we could have a mapping that says

I created a variant of that table #58613 (comment) with a concrete proposal

For the second, third and fifth rows, if pyarrow is not installed, we raise an Exception.

(for clarity, this "second" row referred to specifying a dtype with "string")
If you explicitly ask for pyarrow, then yes, raising an exception is fine and expected. But a generic "string" (or StringDtype()) has to mean "whatever string dtype is the default", and so cannot raise an exception if pyarrow is not installed, but should return the object-dtype based fallback.
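A minimal standalone sketch of that resolution rule (a hypothetical helper for illustration, not pandas code):

```python
# Hypothetical helper, not pandas code: map a user dtype request to a storage backend.
def resolve_string_storage(request: str | None, pyarrow_installed: bool) -> str:
    """A generic request means "the default"; an explicit pyarrow request may raise."""
    if request in (None, "string"):
        # generic request: best available backend, never an error
        return "pyarrow" if pyarrow_installed else "python"
    if request == "string[pyarrow]":
        if not pyarrow_installed:
            raise ImportError("pyarrow is required for the pyarrow-backed string dtype")
        return "pyarrow"
    if request == "string[python]":
        return "python"
    raise ValueError(f"unknown string dtype request: {request!r}")
```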

@jorisvandenbossche
Member Author

One of the concrete discussion points is the API design of the StringDtype(..) constructor and the way to distinguish the various variants of the dtype (i.e. the current "pyarrow_numpy" naming we introduced in #54533 / #54792).
To keep that sub-discussion manageable, I opened a dedicated issue for that specific topic: #58613

@jbrockmendel
Member

I'm with Joris pretty much across the board on this. I'm pretty sure @phofl will be too.

Member

@MarcoGorelli left a comment

On board with the idea!

Making a nullable dtype available by default would require a lot more discussion which won't happen in time for 3.0 (which I'd also rather not delay too much). In particular, I know some people here who feel quite strongly (but with opposite opinions) about the NaN vs null topic, and that'll require further discussion

Happy to discuss nullable dtypes by default for the 2025 major release. Though in particular I'm curious about your (Joris') thoughts on whether you'd eventually be happy making PyArrow required if PyArrow dtypes were the default wherever possible

I see Will's point about making changes to StringDtype, although given its experimental status, I think it'd be OK. And I also like Irv's 'String' suggestion, which feels consistent with 'int64' vs 'Int64'

Thanks for all the effort you put into this proposal and to answering so many questions

web/pandas/pdeps/0014-string-dtype.md (outdated, resolved)
@jorisvandenbossche
Member Author

jorisvandenbossche commented May 20, 2024

Thanks all for the feedback.
Pushed another update with minor text updates addressing some comments, and specifically added the suggestion to add a capitalized "String" alias to make the change smaller for users that want to keep using the NA-variant (dtype="string" to dtype="String" instead of dtype=pd.StringDtype(na_value=pd.NA)), and that indeed makes it consistent with how we capitalize the string aliases for the other nullable dtypes at the moment.

Happy to discuss nullable dtypes by default for the 2025 major release. Though in particular I'm curious about your (Joris') thoughts on whether you'd eventually be happy making PyArrow required if PyArrow dtypes were the default wherever possible

@MarcoGorelli Interesting question, but let's leave that for another thread to answer that ;) (this one is already long enough)

@simonjayhawkins summary of my response to your comment (#58551 (comment)):

  • I don't think there is good reason to believe most users (that would benefit from it) already use the string dtype, especially not the pyarrow-backed version.
  • I don't think we are even considering making numpy 2.0 a requirement for pandas 3.0, so any more concrete discussion related to that is out of scope for this PDEP (see also my answer to Kevin's comment from 2 weeks ago: PDEP-14: Dedicated string data type for pandas 3.0 #58551 (comment))

Will put my more detailed response I started to write up in a collapsed section, to reduce the wall of text a bit when scrolling through this PR.

If one was to argue that the users that benefit most from a dedicated string dtype are already aware of the "experimental" string dtype that has been pyarrow backed for almost 3 years

I don't think this would be true. I also don't have any concrete data to back this up, but I would argue that most likely a majority of users that would benefit from a dedicated string dtype is not already using it (and especially not the pyarrow-backed version, as you need to additionally opt-in to that beyond dtype="string").

So I personally do think that this proposal will give a significant benefit for many users not already using the dtype.

Some partial data points: on StackOverflow, searching for "StringDtype" with tag pandas, there is no explicit usage of the "pyarrow" storage, but only of "string" or StringDtype(). Searching explicitly for "string[pyarrow]" does not give that many relevant results. And when searching generally for "string dtype", the most relevant or most viewed questions again don't mention pyarrow.
(it might be interesting to do a search on eg kaggle notebooks)

However, I was also under the impression that the intention of the solution as initially proposed was to help get the 3.0 released unblocked if the PyArrow dependency requirement was dropped.

That is still the intention.

If we approve this PDEP with the modifications and that results in pandas 3.0 being released much later than planned

Pandas 3.0 will be released later than planned regardless, as it was originally planned for last month. But I still think that with the current modifications (i.e. mostly adding a keyword to the StringDtype constructor), we can have an RC for 3.0 by the end of June, if there is agreement on the proposed changes.

move closer to the point where the NumPy native string solution may become a usable solution

Regardless of pandas 3.0 being released now or in a few months, I don't think we are ready (or should consider) to require numpy 2.0 for pandas 3.0.
If we want to add another variant of the dtype using numpy's 2.0 string dtype under the hood, that is perfectly fine, and this PDEP does not preclude any of that. Neither does it say we should not, at some point after pandas 3.0, switch the object-dtype based fallback to the numpy-2.0 string dtype based fallback (when we are ready to require numpy 2.0 and feel the numpy string dtype is stable enough).
See also my answer to Kevin from last week: #58551 (comment)

Assuming that I could safely say that nobody really likes any fallback solution for either performance, consistency, complexity, confusion or maintenance reasons then we should probably include in this PDEP the deprecation plan for the fallback.

Whether the fallback for the pyarrow-based string dtype uses numpy object dtype or numpy-2.0 string dtype, that does not change anything regarding consistency or confusion for users (it only helps for performance). And given the object-dtype based implementation has existed for many years, I would say that in terms of code complexity and short term maintenance for us, the object-dtype based one is far easier than a new dtype using numpy-2.0 strings under the hood.

which is expected to happen first, having PyArrow as a required dependency or having the minimum version of NumPy as 2.0?

This PDEP does not care about that. The only thing the PDEP describes is a proposal about if we want to have a default string dtype for pandas 3.0 on the relatively short term (like, this year) and do not (yet) want to require pyarrow as a hard dependency (for pandas 3.0). If we want to require pyarrow as a hard dependency or numpy 2.0 in a later pandas release, that is totally up for debate, but out of scope for this PDEP.

Contributor

@Dr-Irv Dr-Irv left a comment

One comment about the "String" alias and the table.

| `StringDtype("python")` | `StringDtype(storage="python", na_value=np.nan)` | "string" | (2) |
| `StringDtype("pyarrow", na_value=pd.NA)` | `StringDtype(storage="pyarrow", na_value=pd.NA)` | "string[pyarrow]" | |
| `StringDtype("python", na_value=pd.NA)` | `StringDtype(storage="python", na_value=pd.NA)` | "string[python]" | |
| `StringDtype(na_value=pd.NA)` | `StringDtype(storage="pyarrow" \| "python", na_value=pd.NA)` | "string[pyarrow]" or "string[python]" | (1) |
Contributor

Shouldn't this include "String" as the string alias?

If the goal is to just list the string alias that we show as the dtype of the column, it's fine, but maybe the "String alias" column should say "String representation of dtype" and then you have a column called "User string alias" that would include all the aliases a user could use to specify the dtype, which would include "String"

In addition, maybe consider that the NA variants return "String[pyarrow]" or "String[python]" as the dtype which makes it clear that it is using the pd.NA variant via capitalization.

Member Author

Yeah, I wasn't entirely sure how to integrate "String" without making the table even more complex.

Now, I can indeed change "string[python]" / "string[pyarrow]" to the capitalized versions "String[python]" / "String[pyarrow]", and mention "String" in the first column as an alias for pd.StringDtype(na_value=pd.NA). EDIT: updated the table this way.

I was originally planning to keep "string[python]" / "string[pyarrow]" meaning what they do today for backwards compatibility (and since I mention that this is not allowed for the new NaN-variant of the dtype). Now that I have updated the proposal to use the capitalized "String" for the NA-variant, it is of course a bit confusing if the lower-case versions with explicit storage in the alias still mean the NA-variant.

@jorisvandenbossche
Member Author

jorisvandenbossche commented May 22, 2024

For the backwards compatibility for existing users of dtype="string", what would people think about providing a specific option that those users can enable to keep using the NA-variant of the string dtype by default? (something that came up in the dev call today)

Assume this is something like pd.options.mode.use_string_dtype_with_na = True (exact name to bike-shed, but I first want to see whether this would be helpful). Users that are already using the current StringDtype with NA, and that would like to keep using that instead of the future default, could enable this option without otherwise having to update their code to specifically choose this variant (i.e. they don't have to change dtype="string" to dtype="String").

That would make updating to pandas 3.0 potentially easier for those users. Note, though, that this option would also not preserve the current behaviour exactly. Right now, such users only get StringDtype where they explicitly ask for it (by specifying dtype="string", calling df.convert_dtypes(), or specifying dtype_backend in an IO method), while in pandas 3.0 they would also get the NA-variant everywhere we infer a string dtype (and where you currently still get object dtype).
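A rough sketch of how such an opt-in could look in user code (the option name is taken from the suggestion above and is purely hypothetical, not an existing pandas option):

```python
import pandas as pd

# Hypothetical option name from the comment above, not an existing pandas option
pd.options.mode.use_string_dtype_with_na = True

# 2.x-style code would keep getting the NA-variant without edits ...
ser = pd.Series(["a", "b", None], dtype="string")   # would stay pd.NA-backed

# ... but, unlike today, columns inferred as strings would also get the
# NA-variant instead of object dtype
df = pd.DataFrame({"col": ["x", "y", None]})
```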

On the other hand, that is yet another (and very specific) option, while the code changes required for dtype="string" to dtype="String" is also not that big.

This also came up before in this thread, see #58551 (comment), but in the context of providing a general option to opt in to NA-dtypes (not specific to the string dtype). While we certainly will want such a global option at some point, the main problem is that we are not ready to provide a full implementation of such an option right now (e.g. the constructors are not yet set up to infer the nullable numeric/bool dtypes, and also not all dtypes would then follow this option), and it will probably be confusing to have a publicized option that is only partially implemented.

@WillAyd
Member

WillAyd commented May 23, 2024

I don't think an option with partially implemented features like that is a good idea. Our type system in its current state is an endless array of aliases / types with diverging behavior; I don't think adding more aliases and options makes things any easier, just kind of shuffles the problems around.

I still generally don't understand why we feel the need to break backwards compatibility with dtype="string"; this PDEP can achieve its objective without doing that and without introducing new aliases if it just repurposes the dtype=str constructor. We already have a difference between dtype=str and dtype="string" today; I don't see the value in adding yet another dtype="String" while changing the behavior of dtype="string".

@Dr-Irv
Contributor

Dr-Irv commented May 24, 2024

I don't think an option with partially implemented features like that is a good idea.

The idea behind the option is that if you have 2.x code that uses pd.StringDtype() or ”string”, then 3.0 code with that option turned on would work as it does today, i.e. it would be as “partially implemented” as it is today and you’d only need a one-line change to existing code to retain behavior.

@jorisvandenbossche
Member Author

this PDEP can achieve its objective without doing that and introducing new aliases if it just repurposes the dtype=str constructor.

We can indeed document and use dtype=str as the way to specify the default string dtype (we should allow this anyway, also under the current proposal), and that would indeed reduce the backwards incompatible changes quite a bit.
The reasons this would not be my preferred solution (i.e. to only use it, and not "string", for the default dtype):

  • If we use dtype=str | dtype="str" for the default NaN-variant and keep dtype="string" for the NA-variant, then I think also the string representation of the dtype should be "str" (e.g. what we show in the output of df.dtypes or the repr of a series). Because otherwise if users see string as the dtype description, they would rightfully expect they can do dtype="string" to get that dtype.
  • That means that users can see both "str" and "string" as dtype descriptions, and personally I think that explaining "string" vs "String" is easier than "str" vs "string" (because it is more consistent with "int64" vs "Int64")
  • For people that use the explicit constructor and not the string alias, i.e. pd.StringDtype(), this would still be backwards incompatible, because without any arguments it should IMO still give the default dtype. I suppose dtype="string" is used quite a bit more than dtype=pd.StringDtype(), so only keeping dtype="string" backwards compatible would already help a lot, but it is one more inconsistency to explain.

To be honest, I don't think these are necessarily very strong arguments; they come down more to preference (I think "string" is the better name, so I would prefer to have that for the default dtype that most users will see), at which point the trade-off with the back-compat issues is maybe harder to justify.
So if others would also prefer or be on board with going for dtype="str" for the default dtype, I could certainly go along with that as well.
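For comparison, a plain summary of the two alias schemes being weighed here; the mapping below is an illustration based on this discussion, not a decided API:

```python
# Which alias would select which string-dtype variant under each scheme.
alias_schemes = {
    '"string" / "String" (current PDEP text)': {
        '"string"': "NaN-variant (new default)",
        '"String"': "pd.NA-variant (opt-in)",
    },
    '"str" / "string" (alternative discussed here)': {
        '"str" or str': "NaN-variant (new default)",
        '"string"': "pd.NA-variant (unchanged from 2.x)",
    },
}

for scheme, aliases in alias_schemes.items():
    print(scheme)
    for alias, meaning in aliases.items():
        print(f"  dtype={alias} -> {meaning}")
```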

@WillAyd
Member

WillAyd commented May 24, 2024

  • If we use dtype=str | dtype="str" for the default NaN-variant and keep dtype="string" for the NA-variant, then I think also the string representation of the dtype should be "str" (e.g. what we show in the output of df.dtypes or the repr of a series). Because otherwise if users see string as the dtype description, they would rightfully expect they can do dtype="string" to get that dtype

Definitely agree on this point - in our current release I find it confusing that the repr shows dtype="string" but .dtype returns "string[pyarrow_numpy]".

  • That means that users can see both "str" and "string" as dtype descriptions, and personally I think that explaining "string" vs "String" is easier than "str" vs "string" (because it is more consistent with "int64" vs "Int64")

Definitely understand this argument, but in the current PDEP design there is an inconsistency anyway: dtype=int and dtype=float actually return int/float types whereas dtype=str does not return a string dtype, and this PDEP also breaks the pd.IntDtype(), pd.FloatDtype(), pd.StringDtype() NA consistency.

With respect to capitalization, the semantics of that are not going to scale well over time, so I'm hesitant to put more overloaded meaning into it. Particularly as we think about adding first-class support for aggregate types - should List[string] work the same as List[String], or should the former not be allowed? Are we going to bother with a list[string] or list[String] at all?

  • For people that use the explicit constructor and not the string alias, i.e. pd.StringDtype(), this would still be backwards incompatible, because without any arguments it should IMO still give the default dtype. I suppose dtype="string" is used quite a bit more than dtype=pd.StringDtype(), so only keeping dtype="string" backwards compatible would already help a lot, but it is one more inconsistency to explain.

I was still hoping that we wouldn't change the pd.StringDtype() constructor either - is that a hard requirement?

@simonjayhawkins
Member

Definitely agree on this point - in our current release I find it confusing that the repr shows dtype="string" but .dtype returns "string[pyarrow_numpy]".

The original thinking here was to make code portable. A user could write out their string data (with either the PyArrow or NumPy object backend) with the dtype "string", and it could be read in whether or not the data receiver had PyArrow installed.

So this was in keeping with the idea that the PyArrow backend was an implementation detail, and that the API and behavior of the object-backed and PyArrow-backed string arrays should be identical and interchangeable.

However, with the advent of the ArrowExtensionArray using string[pyarrow] as the repr (to be consistent with the other Arrow types), this now adds to the confusion.


@jorisvandenbossche
Member Author

I was still hoping that we wouldn't change the pd.StringDtype() constructor either - is that a hard requirement?

My thinking here was that if we provide a StringDtype() constructor that is used for the default dtype, then the "default" call to it (without any arguments) should ideally give you the default dtype.
Of course, we could just not document pd.StringDtype() at all for the default dtype (and only point users to dtype=str or dtype="str" for specifying the default string dtype), and keep pd.StringDtype() (as it is documented) mainly for the opt-in NA-variant of the dtype.
In practice we would still need pd.StringDtype(storage="python"|"pyarrow", na_value=np.nan) for testing, but if it is only for testing, it is maybe fine that those arguments are not the default (although not ideal, because it will leak into user code at some point I think).


What are other people's thoughts on using "str" and "string" instead of "string" and "String" as the string aliases for the dtype (for the NaN and NA variants, respectively)?

@jbrockmendel
Member

jbrockmendel commented Jun 4, 2024 via email

@WillAyd
Member

WillAyd commented Jun 4, 2024

Just to clarify, I only ever suggested for dtype=str to map to the new type, since that is an existing valid construction that has np.nan nullability semantics. Changing dtype=str improves existing code without breaking dtype="string", and from an end user's perspective it still signals intent that they want a string data type. Continuing to map that to object when we have a more proper string implementation doesn't make sense to me. I assume dtype="str" is a far less common construction, so no strong opinion on that, but I would think it also signals intent that you don't want object.

Our type aliases are already a mess... the more we change, the worse off we will be. Having to teach someone that "Int", "Float" and "string" provide NA before 3.x, but "Int", "Float" and "String" are required during the 3.x series, possibly reverting back to the old behavior in a future release, is really confusing. Then to say that IntDtype, FloatDtype, and StringDtype provided NA behavior up until 3.x, but StringDtype() then changed back to np.nan, doubles down on that.

Going through all this API churn is not value added to users, and is super confusing

@Dr-Irv
Contributor

Dr-Irv commented Jun 4, 2024

Going through all this API churn is not value added to users, and is super confusing

I agree with Will. Based on the above discussion, here's a proposal that I think is a compromise, and which probably has warts that people will shut down:

  1. Keep dtype="string" the same as in 2.x, i.e., using pd.NA. Add in dtype="String" for symmetry with "int"/"Int", "float"/"Float", but it is equivalent to dtype="string" today. Announce that "string" and "String" will be deprecated in a future release (or we could just skip creating dtype="String").
  2. Create dtype="str" and dtype=str to use np.nan semantics with pyarrow if installed, otherwise python strings if not.
  3. Create dtype="Str" to use pd.NA, and uses pyarrow if installed, otherwise python strings if not.
  4. Keep StringDtype() as it is today - no change in the API, but announce it will be deprecated in a future release.
  5. Create StrDtype() that has all the controls for specifying pyarrow, python, np.nan and pd.NA.

What this results in is an API under which any 2.x code works as it does today, with no changes needed from users for 3.0. BUT they are given a deprecation warning indicating that they will have to use "str" or StrDtype() in a future release.

Anyone wanting the new behavior uses "str"/str or StrDtype().

Note that the naming of StrDtype() and using str matches IntDtype() and int, FloatDtype() and float, i.e., we are using the python name for the type.

Any "default" behavior (e.g., in I/O readers, inferring dtypes) would use "str" not "string"

Net result is that we remove the word "string" from the current vocabulary, and replace it with "str" because we deprecate the word "string"
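A sketch of what this proposal would look like in user code; every new name here ("Str", StrDtype) comes from the comment above and is hypothetical, not an existing pandas API, so those lines are shown as comments:

```python
import pandas as pd

# 1. / 4. unchanged 2.x behaviour: pd.NA-backed string dtype (to be deprecated later)
pd.Series(["a", None], dtype="string")
pd.Series(["a", None], dtype=pd.StringDtype())

# 2. proposed new default: np.nan semantics, PyArrow-backed if installed
pd.Series(["a", "b"], dtype="str")        # note: today this still gives object dtype

# 3. hypothetical alias: pd.NA semantics, PyArrow-backed if installed
# pd.Series(["a", "b"], dtype="Str")

# 5. hypothetical constructor with full control over storage and missing value
# pd.StrDtype(storage="python", na_value=pd.NA)   # same as pd.StringDtype() today
```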

@jorisvandenbossche
Member Author

I don't think we need to move long term to use "str" instead of "string", or at least we don't have to decide that right now. So if we go for "str" as the string alias for the NaN-variant of the dtype right now, and keep "string" for the NA-variant, then at the point in the future where the NA-variant might become the default, we can still decide then whether we want to keep using "string" for the dtype repr (and make "str" just an alias of that) or the other way around (use "str" for the repr and make "string" an alias).

3. Create dtype="Str" to use pd.NA, and uses pyarrow if installed, otherwise python strings if not.

I don't think there is a need to already introduce another name. People have been using "string" and they can continue to do that for now if they want the NA variant? Why would we cause the code churn of using a different name?

4. Keep StringDtype() as it is today - no change in the API, but announce it will be deprecated in a future release.

Given you can't yet (or should not yet) act on that deprecation, and we are also not yet certain about how the transition to NA dtypes will look exactly, I am not sure there is a good reason to already make announcements related to that.

I am starting to get convinced that using "str" instead of "string" for the new default dtype would be a good idea to help the backwards compatibility story, but then I would not go any further than that and just leave it at those two names (and not add other new aliases like "String" or "Str")

@jorisvandenbossche
Member Author

Continuing to map (dtype=str) that to object when we have a more proper string implementation doesn't make sense to me.

To be clear, even if we would eventually go with dtype="string" for the default dtype anyway (i.e. the current state of the PDEP text in this PR), I think we should map dtype=str to mean the default string dtype, instead of object dtype. dtype=str currently indeed means "give me string data" (just using object dtype, because that's how it works), and we should keep that meaning but use the proper dtype when it is available.
The same is probably true for any other alias we currently map to "ensure string data in object dtype", so that also includes things like "str", "U", np.str_. This is essentially the same as how we map dtype=int to the default int64 dtype (and not to object dtype with Python integers).

(this is not actually implemented like that right now when enabling the future behaviour with pd.options.future.infer_string = True, but I would consider that a missing piece in the implementation and had been planning to open an issue/PR for it)
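A small illustration of the intended mapping, by analogy with dtype=int; the output comments show today's behaviour, while the target for dtype=str describes the proposal, not current pandas:

```python
import pandas as pd

# Today, asking for string data this way still gives object dtype:
print(pd.Series(["a", "b"], dtype=str).dtype)   # object
# (per the comment above, "str", "U" and np.str_ are treated the same way)

# The analogy: dtype=int already maps to the default integer dtype ...
print(pd.Series([1, 2], dtype=int).dtype)       # int64 (the platform default integer)
# ... so under the proposal, dtype=str would likewise map to the default string dtype.
```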

@Dr-Irv
Contributor

Dr-Irv commented Jun 4, 2024

I am starting to get convinced that using "str" instead of "string" for the new default dtype would be a good idea to help the backwards compatibility story, but then I would not go any further than that and just leave it at those two names (and not add other new aliases like "String" or "Str")

So would dtype="string" keep the current 2.x behavior? If so, why not use pd.StrDtype() to represent the new 3.x behavior and let pd.StringDtype() represent the old 2.x behavior? That's basically what I'm suggesting, which would mean all 2.x code would still work, and we deprecate pd.StringDtype() to force people to change if they are using that class.

In the future we can make dtype="string" and dtype=str and dtype="str" mean the same thing (strings with pd.NA).

@WillAyd
Member

WillAyd commented Jun 4, 2024

I think pd.StrDtype() might end up in a no man's land. All pandas types that follow that construction pattern today use NA semantics, and I don't think we are going to introduce an equivalent constructor for the types that would be returned from any pd.StrDtype operations

@jbrockmendel
Member

jbrockmendel commented Jun 4, 2024 via email

@Dr-Irv
Contributor

Dr-Irv commented Jun 5, 2024

I think pd.StrDtype() might end up in a no man's land. All pandas types that follow that construction pattern today use NA semantics, and I don't think we are going to introduce an equivalent constructor for the types that would be returned from any pd.StrDtype operations

In my proposal, this is temporary, at least for one release.

With pd.StrDtype(), you'd have the arguments storage and na_value that would allow you to get the equivalent of what pd.StringDtype() does today, i.e., pd.StrDtype(storage="python", na_value=pd.NA) is the same as pd.StringDtype(). But pd.StrDtype() would be equivalent to pd.StrDtype(storage="python" | "pyarrow", na_value=np.nan).

So the default behavior of pd.StrDtype() would give you np.nan semantics, but you could still get the equivalent of what pd.StringDtype() does today to remove the deprecation warning by calling pd.StrDtype(na_value=pd.NA). If we then deprecate pd.StringDtype(), then in the future we can change the default behavior of pd.StrDtype() to be na_value=pd.NA whenever we're ready to make everything use pd.NA for missing values.
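The equivalences being proposed, written out; pd.StrDtype is a hypothetical constructor from this proposal (shown as comments), while pd.StringDtype is today's class:

```python
import pandas as pd

# Today's opt-in dtype, pd.NA-based (storage follows pd.options.mode.string_storage,
# which defaults to "python"):
today = pd.StringDtype()

# Under the proposal above (hypothetical constructor):
# pd.StrDtype(storage="python", na_value=pd.NA)   # same as pd.StringDtype() today
# pd.StrDtype()                                   # np.nan; pyarrow if installed, else python
# pd.StrDtype(na_value=pd.NA)                     # opt back in to pd.NA explicitly
```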

@WillAyd
Member

WillAyd commented Jun 5, 2024

Do you mind expanding on why you think we would deprecate pd.StringDtype at some point? I am under the impression this PDEP would still offer pd.StringDtype(na_value=pd.NA|np.nan) but the default na_value would remain pd.NA. If we wanted pd.StrDtype I assumed that would just be an alias for pd.StringDtype(na_value=np.nan)

@jorisvandenbossche
Member Author

So would dtype="string" keep the current behavior 2.x? If so, why not use pd.StrDtype() to represent the new 3.x behavior and let pd.StringDtype() represent the old 2.x behavior? That's basically what I'm suggesting

If we choose "str" for the new default dtype, then yes dtype="string" would keep the current behaviour. While that's maybe the core of what you were suggesting, you were also suggesting a lot of other things on top of that (adding dtype="Str" as an alias for dtype="string", deprecating StringDtype), and that's what I was responding to.

@Dr-Irv
Contributor

Dr-Irv commented Jun 5, 2024

Do you mind expanding on why you think we would deprecate pd.StringDtype at some point? I am under the impression this PDEP would still offer pd.StringDtype(na_value=pd.NA|np.nan) but the default na_value would remain pd.NA. If we wanted pd.StrDtype I assumed that would just be an alias for pd.StringDtype(na_value=np.nan)

I'm thinking of the future state. Let's assume that we go for pd.NA semantics across the board in pandas 4.0. We'd then have pd.IntDtype(), pd.FloatDtype() and pd.StrDtype(), all defaulting to using pd.NA for missing values. There would be no need for pd.StringDtype() because pd.StrDtype() would have arguments that do the same thing.

If in pandas 3.0, we tell people who are using pd.StringDtype() that it is being deprecated, they migrate their code to use pd.StrDtype(na_value=pd.NA). That code will have the same behavior in 3.0 as 4.0. The difference in 3.0 vs. 4.0 in pd.StrDtype() is the default value of na_value changing from np.nan to pd.NA.

So with 3.0, any code that uses pd.StringDtype() still works, with a deprecation warning, and there is a migration path to a future state that uses pd.NA everywhere. And if we decide not to make pd.NA the default everywhere, people who start using pd.StrDtype(na_value=pd.NA) will have working code as it works today in pandas 2.x.

In essence, pd.StrDtype() is the "new string type", and pd.StringDtype() is the "old string type", and there is a migration path from old to new that is pretty clean, IMHO.

@jorisvandenbossche
Member Author

"str"/"string" seems much worse confusion-wise than "string"/"String".

@jbrockmendel on the other hand, we do have a similar naming situation with "bool" vs "boolean" (although in this case "bool" is an actual numpy dtype with no missing value support, it's similar in the "default dtype vs opt-in NA-variant" sense)

I certainly prefer "string" as the dtype name (long term), but in the end I think a newcomer not aware of the differences can be confused about either of those options, while both are "explainable" (and we will need to do a good job doing that in the docs).
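For reference, the existing "bool" vs "boolean" split mentioned above, which does work in current pandas:

```python
import pandas as pd

print(pd.Series([True, False]).dtype)                  # bool: numpy dtype, no missing values
print(pd.Series([True, None], dtype="boolean").dtype)  # boolean: nullable, uses pd.NA
```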

@jorisvandenbossche
Member Author

I'm thinking of the future state. Let's assume that we go for pd.NA semantics across the board in pandas 4.0. We'd then have pd.IntDtype(), pd.FloatDtype() and pd.StrDtype(), all defaulting to using pd.NA for missing values. There would be no need for pd.StringDtype() because pd.StrDtype() would have arguments that do the same thing.

@Dr-Irv I think you could perfectly swap StringDtype and StrDtype in the above paragraph and still have a good argument. In such a 4.0, there would be no need for StrDtype because the current StringDtype would give exactly the same thing.

If we land at that future point, we could also just make StrDtype an alias of StringDtype, and things would also keep working as-is without requiring people that already use StringDtype() to change their code.

And if we don't like having an alias, we can at that point discuss and decide to deprecate one of the two. But already asking users of pd.StringDtype to change to pd.StrDtype(na_value=pd.NA) in pandas 3.x seems unnecessary when keeping pd.StringDtype working as-is is an option.

In any case, a potential deprecation is not a necessary part of what needs to be done for 3.0, so I would prefer to leave that out of scope for this discussion (it's already difficult enough ;))

@Dr-Irv
Contributor

Dr-Irv commented Jun 5, 2024

And if we don't like having an alias, we can at that point discuss and decide to deprecate one of both. But already asking users of pd.StringDtype to change to pd.StrDtype(na_value=pd.NA) in pandas 3.x seems unnecessary when keeping pd.StringDtype working as is.

In any case, a potential deprecation is an unnecessary part of what needs to be done for 3.0, so I would prefer to leave that out of scope for this discussion (it's already difficult enough ;))

I see your point.

My proposal is to eventually get rid of pd.StringDtype() and have everyone migrate to pd.StrDtype() (or make pd.StringDtype() an alias for a specific pd.StrDtype() behavior). In that sense, pd.StringDtype() is "temporary" - it is there to provide a transition path to pd.StrDtype().

Your proposal is the reverse. Have people use pd.StrDtype() temporarily, keep the default behavior of pd.StringDtype() the same as it is in pandas 2.x, and then pd.StrDtype() becomes the alias for pd.StringDtype() in the future. So in that sense, pd.StrDtype() is "temporary" - it is there to provide a transition path to pd.StringDtype().

I made these proposals so that 2.x code that uses pd.StringDtype() with no arguments will not have a behavior change. Whether in the future pd.StrDtype() or pd.StringDtype() is the recommended dtype constructor (with one being an alias for the other) probably doesn't matter, although I think it is easier to remove docs that recommend pd.StringDtype() now and then recommend pd.StrDtype() in the future, as opposed to introducing pd.StrDtype() now, and in the future recommend pd.StringDtype().
