PDEP-13: The pandas Logical Type System #58455

New file: ``web/pandas/pdeps/0013-logical-type-system.md`` (197 additions)
# PDEP-13: The pandas Logical Type System

- Created: 27 Apr 2024
- Status: Under discussion
- Discussion: [#58141](https://github.com/pandas-dev/pandas/issues/58141)
- Author: [Will Ayd](https://github.com/willayd)
- Revision: 2

## Abstract

This PDEP proposes a logical type system for pandas which will decouple user semantics (i.e. _this column should be an integer_) from the pandas implementation/internals (i.e. _we will use NumPy/pyarrow/X to store this array_). By decoupling these through a logical type system, the expectation is that this PDEP will:

* Abstract underlying library differences from end users
* More clearly define the types of data pandas supports
* Allow pandas developers more flexibility to manage type implementations
* Pave the way for continued adoption of Arrow in the pandas code base

## Background

When pandas was originally built, the data types that it exposed were a subset of the NumPy type system. Starting back in version 0.23.0, pandas introduced extension types, which it also began to use internally to create arrays instead of relying exclusively upon NumPy. Over the course of the 1.x releases, pandas began using pyarrow for string storage, and in version 1.5.0 introduced the high-level ``pd.ArrowDtype`` wrapper.

While these new type systems have brought about many great features, they have surfaced three major problems.

### Problem 1: Inconsistent Type Naming / Behavior

There is no better example of our current type system being problematic than strings. Let's assess the number of string variations a user could create (this is a non-exhaustive list):

```python
dtype=object
dtype=str
dtype="string"
dtype=pd.StringDtype()
dtype=pd.StringDtype("pyarrow")
dtype="string[pyarrow]"
dtype="string[pyarrow_numpy]"
dtype=pd.ArrowDtype(pa.string())
dtype=pd.ArrowDtype(pa.large_string())
```

``dtype="string"`` was the first truly new string implementation starting back in pandas 0.23.0, and it is a common pitfall for new users not to understand that there is a huge difference between that and ``dtype=str``. The pyarrow strings have trickled in in more recent memory, but also are very difficult to reason about. The fact that ``dtype="string[pyarrow]"`` is not the same as ``dtype=pd.ArrowDtype(pa.string()`` or ``dtype=pd.ArrowDtype(pa.large_string())`` was a surprise [to the author of this PDEP](https://github.com/pandas-dev/pandas/issues/58321).

While some of these are aliases, the main reason we have so many different string dtypes is that we have historically used NumPy and built custom missing value solutions around the ``np.nan`` marker, which are incompatible with the ``pd.NA`` sentinel introduced a few years back. Our ``pd.StringDtype()`` uses the ``pd.NA`` sentinel, as do our pyarrow based solutions; bridging these into one unified solution has proven challenging.

To try and smooth over the different missing value semantics and how they affect the underlying type system, the status quo has been to add yet another string dtype. ``string[pyarrow_numpy]`` was an attempt to use pyarrow strings but adhere to NumPy nullability semantics, under the assumption that the latter offers maximum backwards compatibility. However, being the only data type that uses pyarrow for storage but NumPy for nullability handling, it just adds more inconsistency to how we handle missing data, a problem we have been attempting to solve since the pandas2 discussions. The name ``string[pyarrow_numpy]`` is not descriptive to end users, and unless the type is inferred, it requires users to explicitly call ``.astype("string[pyarrow_numpy]")``, again putting the burden on end users to know what ``pyarrow_numpy`` means and to understand the missing value semantics of both systems.

PDEP-14 has been proposed to smooth this over by changing our ``pd.StringDtype()`` to be an alias for ``string[pyarrow_numpy]``. This would at least offer some abstraction to end users who just want strings, but on the flip side it would break behavior for users who have already opted into ``dtype="string"`` or ``dtype=pd.StringDtype()`` and the related ``pd.NA`` missing value marker over the four years of their existence.

A logical type system can help us abstract away all of these issues. At the end of the day, this PDEP assumes a user simply wants a _string_ data type. If they call ``Series.str.len()`` against a Series of that type with missing data, they should get back a Series with an integer data type.
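
As a short illustration of the status quo this paragraph is reacting to (current pandas; the output comments are indicative):

```python
import pandas as pd

obj_ser = pd.Series(["ab", None])  # object dtype; missing value kept as None
print(obj_ser.str.len().dtype)     # float64 -- lengths get coerced to float

str_ser = pd.Series(["ab", None], dtype="string")
print(str_ser.str.len().dtype)     # Int64 -- integer result, pd.NA for missing
```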

### Problem 2: Inconsistent Constructors

The second problem is that the conventions for constructing types from the various _backends_ are inconsistent. Let's review string aliases used to construct certain types:

| logical type | NumPy   | pandas extension | pyarrow                  |
|--------------|---------|------------------|--------------------------|
| int          | "int64" | "Int64"          | "int64[pyarrow]"         |
| string       | N/A     | "string"         | N/A                      |
| datetime     | N/A     | "datetime64[us]" | "timestamp[us][pyarrow]" |

"string[pyarrow]" is excluded from the above table because it is misleading; while "int64[pyarrow]" definitely gives you a pyarrow backed string, "string[pyarrow]" gives you a pandas extension array which itself then uses pyarrow. Subtleties like this then lead to behavior differences (see [issue 58321](https://github.com/pandas-dev/pandas/issues/58321)).

If you wanted to be more explicit about using pyarrow, you could use the ``pd.ArrowDtype`` wrapper. But this unfortunately exposes gaps when trying to use that pattern across all backends:

| logical type | NumPy    | pandas extension | pyarrow                           |
|--------------|----------|------------------|-----------------------------------|
| int          | np.int64 | pd.Int64Dtype()  | pd.ArrowDtype(pa.int64())         |
| string       | N/A      | pd.StringDtype() | pd.ArrowDtype(pa.string())        |
| datetime     | N/A      | ???              | pd.ArrowDtype(pa.timestamp("us")) |

It would stand to reason in this approach that you could use a ``pd.DatetimeDtype()`` but no such type exists (there is a ``pd.DatetimeTZDtype`` which requires a timezone).
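
A quick demonstration of the gap (current pandas; the exact error message may vary by version):

```python
import pandas as pd

pd.DatetimeTZDtype(tz="UTC")  # works: a timezone-aware datetime dtype
try:
    pd.DatetimeTZDtype()      # there is no timezone-naive counterpart
except Exception as exc:
    print(exc)                # raises because a 'tz' is required
```
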
> **Member:** xref #24662

### Problem 3: Lack of Clarity on Type Support

The third issue is that the extent to which pandas may support any given type is unclear. Issue [#58307](https://github.com/pandas-dev/pandas/issues/58307) highlights one example. It would stand to reason that you could interchangeably use a pandas datetime64 and a pyarrow timestamp, but that is not always true. Another common example is the use of NumPy fixed length strings, which users commonly try to use even though we claim no real support for them (see [#57645](https://github.com/pandas-dev/pandas/issues/57645)).
> **Member:** Suggested change: replace "we claim" with "pandas claim".


## Assessing the Current Type System(s)

A best effort at visualizing the current type system(s), with types that we currently "support" or reasonably may want to support, is shown [in this comment](https://github.com/pandas-dev/pandas/issues/58141#issuecomment-2047763186). Note that this does not include the ["pyarrow_numpy"](https://github.com/pandas-dev/pandas/pull/58451) string data type or the string data type that uses the NumPy 2.0 variable length string data type (see [comment](https://github.com/pandas-dev/pandas/issues/57073#issuecomment-2080798081)), as they are under active discussion.

## Proposal

### Proposed Logical Types

Derived from the hierarchical visual in the previous section, this PDEP proposes that pandas support at least all of the following _logical_ types, excluding any type widths for brevity (review discussion follows the list):

- Signed Integer
- Unsigned Integer
- Floating Point
- Fixed Point
- Boolean
- Date
- Datetime
- Duration
- CalendarInterval
- Period
- Binary
- String
- Map
- List
- Struct
- Interval
- Object


> **Member:** IIUC we already support all of these, just not with all three of the storages. Is this paragraph suggesting that we implement all of these for all three (four if we include sparse) storages?
>
> **Member Author (@WillAyd, Apr 28, 2024):** Not at all - this is just suggesting that these be the logical types we offer, whether through the factory functions or class-based design that follows.
>
> In today's world it is the responsibility of the user to know what may or may not be offered by the various backends and pick from them accordingly. With dates/datetimes being a good example, a user would have to explicitly pick from some combination of ``pd.ArrowDtype(pa.date32())``, ``pd.ArrowDtype(pa.timestamp(<unit>))`` and ``datetime64[unit]``, since we don't offer any logical abstraction.
>
> In the model of this PDEP, we would offer something to the effect of ``pd.date()`` and ``pd.datetime()``, abstracting away the type implementation details. This would be helpful for us as developers to solve things like #58315.

> **Member:** does this correspond to pyarrow's decimal128?
>
> **Member Author:** Either decimal128 or decimal256.
>
> **Member:** Of course those names don't have to match exactly with the eventual APIs, but I would personally just call this "Decimal" (which maps to both Python and Arrow).
>
> **Member Author:** Sounds good to me.
>
> **Contributor:** If you call it "Decimal", is there possible confusion with the python standard library decimal?
>
> **Member Author:** Hmm, even if there is I'm not sure that would matter. Do you think it's that different today from the distinction between a python built-in int versus an "int64"?
>
> **Contributor:** yes, because the python decimal type lets you specify any number of digits for the precision, as well as rounding. So I don't know what the correspondence between pyarrow's decimal128 and decimal256 is. Just concerned that if you name this "decimal", there is potential confusion for users.
>
> **Member Author:** Yea, the arrow types also have a user-configurable precision / scale. The decimal128 variant offers up to 38 units of precision and 256 goes up to 76. Taking a step back from Python/Arrow, I think the problem you are hinting at is not new. Databases like postgres also offer Decimal types with different internal representations. So there might be some translation / edge cases to handle, but generally I think calling it Decimal as a logical type signals the intent of the type.
>
> **Member:** One can argue the same for "integer", which for Python is of unlimited size, while our ints are bound by the max of int64. In the case of the decimal stdlib module there are a few more differences, but regardless of that, if we would add a decimal / fixed point dtype to pandas in the future, I would personally still call it "decimal" (also because python uses that name, regardless of some subtle differences).


One of the major problems this PDEP has tried to highlight is the historical tendency of our team to "create more types" to solve existing problems. To minimize the need for that, this PDEP proposes re-using our existing extension types where possible, and only adding new ones where they do not exist.

The existing extension types which will become our "logical types" are:

- pd.StringDtype()
- pd.IntXXDtype()
- pd.UIntXXDtype()
- pd.FloatXXDtype()
- pd.BooleanDtype()
- pd.PeriodDtype(freq)
- pd.IntervalDtype()

To satisfy all of the types highlighted above, this would require the addition of:

- pd.DecimalDtype()
- pd.DateDtype()
- pd.DatetimeDtype(unit, tz)
- pd.DurationDtype()
- pd.CalendarIntervalDtype()
- pd.BinaryDtype()
- pd.MapDtype() # or pd.DictDtype()
- pd.ListDtype()
- pd.StructDtype()
- pd.ObjectDtype()
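
As a rough sketch only, construction might look like the following; none of these constructors exist in pandas today, and the spellings are taken from the list above rather than any final API:

```python
# Hypothetical constructors proposed above (names/signatures not final):
pd.DateDtype()
pd.DatetimeDtype(unit="us", tz="UTC")
pd.ListDtype(pd.StringDtype())
pd.MapDtype()  # or pd.DictDtype(); the naming is an open question in this draft
```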

The storage / backend for each of these types is left as an implementation detail. The fact that ``pd.StringDtype()`` may be backed by Arrow while ``pd.PeriodDtype()`` continues to be a custom solution is of no concern to the end user. Over time this will allow us to adopt more Arrow behind the scenes without breaking the front end for our end users, while _still_ giving us the flexibility to produce data types that Arrow will not implement (e.g. ``pd.ObjectDtype()``).

The methods of each logical type are expected in turn to yield another logical type. This can enable us to smooth over differences between the NumPy and Arrow worlds, while also leveraging the best of both backends. To illustrate, let's look at some methods where the return type today deviates for end users depending on whether they are using NumPy-backed or Arrow-backed data types. The equivalent PDEP-13 logical data type is presented in the last column:

| Method | NumPy-backed result | Arrow Backed result type | PDEP-13 result type |
|--------------------|---------------------|--------------------------|--------------------------------|
| Series.str.len() | np.float64 | pa.int64() | pd.Int64Dtype() |
| Series.str.split() | object | pa.list_(pa.string()) | pd.ListDtype(pd.StringDtype()) |
| Series.dt.date | object | pa.date32() | pd.DateDtype() |

The ``Series.dt.date`` example is worth an extra look - with a PDEP-13 logical type system in place we would theoretically have the ability to keep our default ``pd.DatetimeDtype()`` backed by our current NumPy-based array but leverage pyarrow for the ``Series.dt.date`` solution, rather than having to implement a DateArray ourselves.
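
To make that concrete, here is a hedged sketch; ``pd.DateDtype`` does not exist today and the commented behavior is aspirational:

```python
import pandas as pd

ser = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02"]))
result = ser.dt.date  # today: object dtype (NumPy-backed) or date32 (pyarrow)

# Under this PDEP, result.dtype would be the logical pd.DateDtype(),
# regardless of which backend stores the underlying datetime data.
```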

While this PDEP proposes reusing existing extension types, it also necessitates extending those types with extra metadata:

```python
from typing import Literal


class BaseType:

    @property
    def data_manager(self) -> Literal["numpy", "pyarrow"]:
        """
        Who manages the data buffer - NumPy or pyarrow.
        """
        ...

    @property
    def physical_type(self):
        """
        For logical types which may have different implementations, what is the
        actual implementation? For pyarrow strings this may mean pa.string() versus
        pa.large_string() versus pa.string_view(); for NumPy this may mean object
        or their 2.0 string implementation.
        """
        ...

    @property
    def na_marker(self) -> "pd.NA | np.nan | pd.NaT":
        """
        Sentinel used to denote missing values.
        """
        ...
```

> **Member:** would this then get mixed into ExtensionDtype? if not, what is this used for?
>
> **Member Author:** I suppose it depends on whether we as a team want to go down the path of offering constructors like ``pd.string()`` or ``pd.StringDtype()``; with the former approach this would not mix into the ExtensionDtype, since that would be closer to a physical type. If we went with the latter approach, we would be repurposing / clarifying the scope of the pandas extension type system to be totally logical.
>
> **Member Author:** As I think through it more, it may make sense for these to be separate; the existing ExtensionDtype refers to a structure that pandas wants to contain a particular set of values. For extension types it is possible for there not to be a logical type representation in pandas. Maybe this should be renamed to BaseLogicalType to clarify.
>
> **Member:** then I'm unclear on what we're doing with it? and whether EA authors are supposed to implement something for it
>
> **Member Author:** That's a good point - for third party EAs we would not be able to offer a logical type, since we don't know what the EA is. Let's see what other teammates think.
>
> **Member Author:**
> > What is the .dtype that users interact with? That's the logical type?
>
> Yea, that's where my head is at - essentially they would see / interact with a subclass of the BaseType shown here.
>
> > For external extension dtypes, I think it will be very hard to extend this concept of logical types to multiple external libraries, like your example of two independent external libraries implementing the same logical type.
>
> Yea, this is very tricky. I don't know what the best answer is, but I don't really want to muddy the waters between an Extension type and a Logical type; I think these are two totally different things. But this PDEP would need to clarify further how users would use Extension types knowing there is no equivalent logical type for them (or maybe we need to have them register themselves as a logical type?).
>
> **Member:**
> > I don't really want to muddy the waters between an Extension type and a Logical type; I think these are two totally different things
>
> I still don't fully follow why they are totally different things. I know our own types are messy (with the mixture of numpy, numpy-based extension dtypes and pyarrow extension dtypes), but the external extension dtypes that I know are almost all to be considered "logical" types (eg geometries, tensors, ...).
>
> **Member Author:** Because our third party extension arrays give EA authors theoretical control over data buffers, masks, and algorithms. A logical type should have none of these.
>
> **Member:** That's definitely true; in that sense an EA is also a physical type. But at the same time, in the idea that a user generally specifies the logical type, all external EAs are in practice also a logical type. So I understand that those are different concepts, but I think in practice the external EA serves both purposes, and so what I don't fully see is 1) how it would be possible for external EAs to distinguish those concepts, and 2) what the value would be of doing so.
>
> **Member Author:** Your point 1) would definitely need to be developed further. Off the top of my head, maybe an extension array would need to define an attribute like ``logical_type`` going forward. If the logical type is "foo", we would let that be constructed using ``pd.foo`` or ``pd.FooDtype``.
>
> I think the value in 2) is that we could in theory also allow a third party EA to make its way into our logical type system. There is no shortage of string implementations that can be surmised; maybe there is some value in having an EA register yet another "string" logical type?


> **Member:** are there use cases for this other than the pyarrow strings? i.e. is it really needed for the general case?
>
> **Member Author:** Good question. Other areas where I think it would matter would be:
>
> - differentiation between decimal128 / decimal256
> - any parametrized type (e.g. datetime types with different units / timezones, container types with different value types)
>
> **Member:** makes sense, thanks. is there a reason to include the decimalN case but not (int|float|uint)N? is that distinct from itemsize?
>
> it's still unclear to me why this is needed in the general case. the way it is presented here suggests this is something I access on my dtype object, so in the DatetimeTZDtype case I would check ``dtype.physical_type.tz`` instead of ``dtype.tz``? or I could imagine standardizing the dtype constructors so you would write ``dtype = pd.datetime(physical_type=something_that_specifies_tz)`` instead of ``dtype=pd.datetime(tz=tz)``. Neither of these seems great, so I feel like I'm missing something.
>
> **Member Author:** That's a good question about the decimal case. If coming from a database world it's not uncommon to have to pick between different integer sizes, but I think it is rare to pick between different decimal sizes. With decimals you would usually just select your precision, from which we could dispatch to a 128 or 256 bit implementation... but yea, I'm not sure of the best way to approach that.
>
> Doing something like ``pd.datetime(physical_type=something_that_specifies_tz)`` is definitely not what this PDEP suggests users should do. This PDEP suggests the 99% use case would be ``pd.datetime(tz=tz)``, and a user should not care if we use pandas or pyarrow underneath that.
>
> Using #58315 as an example again, if we used the ``ser.dt.date`` accessor on a series with a logical type of ``pd.datetime()``, then the user should expect back a ``pd.date()``. That gives us a lot of flexibility as library authors - maybe we still want ``pd.datetime()`` to be backed by our datetime extension by default, but we could then use pyarrow's date implementation to back ``pd.date()`` by default rather than having to build that from scratch.
>
> In today's world we are either asking users to explicitly use ``pa.timestamp()`` to get back a true date type from ``ser.dt.date``, or they use the traditional pandas extension type and get back ``dtype=object`` when they use the accessor.
>
> **Member:**
> > This PDEP suggests the 99% use case would be pd.datetime(tz=tz)
>
> This seems reasonable, but then I'm not clear on why physical_type is in the base class.
>
> > but could then use pyarrow's date implementation to back pd.date() by default
>
> Doesn't this then give the user mixed-and-matched propagation behavior that we agreed was unwanted?
>
> **Member Author:**
> > I think it would be easier to discuss if there were more clarity on what this base class is used for. With that very big caveat, my impression is that the relevant keyword/attribute will vary between string/decimal/datetime/whatever and this doesn't belong in the base class at all.
>
> Are you referring to the physical_type keyword or something else? Every logical type instance should have a physical type underneath. So the "string" type could be physically implemented as a ``pa.string()``, ``pa.large_string()``, object, etc. The "datetime" type could be physically implemented as a ``pa.timestamp`` or the existing pandas extension datetime. Right now a decimal would only be backed by either a ``pa.decimal128`` or a ``pa.decimal256``.
>
> > If I have a non-pyarrow pd.datetime Series and I do ser.dt.date and get a pd.date Series that under the hood is date32[pyarrow], then I have mixed semantics.
>
> In the logical type world that is totally fine. It is up to our implementation to manage the fact that a datetime can be implemented either as a ``pa.timestamp`` or our own datetime type, and a date can only be implemented as a ``pa.date32``, but that is not a concern of the end user. I am definitely not suggesting we develop a non-pyarrow date type.
>
> > BTW I'm nitpicking a bunch, but I'm generally positive on the central idea here.
>
> The feedback is very helpful and it's good to flesh this conversation out, so no worries. Much appreciated.
>
> **Member:** While I certainly understand the idea of a physical type, I am not sure such a simple attribute will fit all cases. For example, for our own masked Int64Dtype, what's the physical type? np.int64? (but that's only for the data array, while there is also a mask array) And is this physical type then always either a numpy type or a pyarrow type?
>
> **Member Author (@WillAyd, Apr 28, 2024):** I would consider the physical type in that case to be the ``pd.core.arrays.integer.Int64Dtype``, but maybe we should add terminology to this PDEP? In the scope of this document a logical type has no control over the layout of a data buffer or mask; it simply serves as a conceptual placeholder for the idea of a type. The physical type, by contrast, does have at least some control over how its data buffer and mask are laid out.
>
> **Member:**
> > Are you referring to the physical_type keyword or something else? [...]
>
> The "more clarity" request was for what we do with BaseType. I guess examples for both usage and construction with, say, ``pd.DatetimeTZDtype(unit="ns", tz="US/Pacific")`` would also be helpful.
>
> > In the logical type world that is totally fine. [...]
>
> You're saying mixed semantics are fine here, but in #58315 you agreed we should avoid that. What changed? Having consistent/predictable semantics is definitely a concern of the end user.
>
> **Member Author:**
> > The "more clarity" request was for what we do with BaseType. [...]
>
> I think it should be as simple as:
>
> ```python
> >>> ser = pd.Series(dtype=pd.datetime(unit="ns", tz="US/Pacific"))
> >>> ser.dtype
> pd.DatetimeDtype(unit="ns", tz="US/Pacific")
> ```
>
> Or
>
> ```python
> >>> ser = pd.Series(dtype=pd.DatetimeDtype(unit="ns", tz="US/Pacific"))
> >>> ser.dtype
> pd.DatetimeDtype(unit="ns", tz="US/Pacific")
> ```
>
> > You're saying mixed semantics are fine here, but in #58315 you agreed we should avoid that. What changed?
>
> I think interoperating between pyarrow / numpy / our own extension types behind the scenes is perfectly fine. It would be problematic to have those implementation details spill out into the end user space though.
>
> When a user says "I want a datetime type" they should just get something that works like a datetime type. When they then say ``ser.dt.date`` they should get back something that works like a date. Having logical types works as a contract with the end user to promise we will deliver that. Whether we use NumPy, pyarrow, or something else underneath to do that is no concern of the end user.

"""
For logical types which may have different implementations, what is the
actual implementation? For pyarrow strings this may mean pa.string() versus
pa.large_string() versrus pa.string_view(); for NumPy this may mean object
or their 2.0 string implementation.
"""
...

@property
def na_marker -> pd.NA|np.nan|pd.NaT:
"""
Sentinel used to denote missing values
"""
...
```
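As a hedged illustration of how this metadata might surface, assuming the properties sketched above were added to the existing ``pd.StringDtype`` (none of them exist today):

```python
import pandas as pd

dtype = pd.StringDtype()
dtype.data_manager   # hypothetically "pyarrow" or "numpy" -- an implementation detail
dtype.physical_type  # hypothetically pa.large_string() for an Arrow-backed string
dtype.na_marker      # pd.NA by default; read-only (see the next section)
```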

``na_marker`` is expected to be read-only (see next section). Advanced users that have a particular need for a storage type may be able to construct the data type via ``pd.StringDtype(data_manager=np)`` to assert NumPy managed storage. While the PDEP allows constructing in this fashion, operations against that data make no guarantee that they will respect the storage backend, and are free to convert to whichever storage the internals of pandas considers optimal (Arrow will typically be preferred).

### Missing Value Handling
> **Member Author:** @jbrockmendel wanted to ping this off you. When we talked I said I would stick with np.nan as the missing value marker, but since we are reusing the ``pd.StringDtype()`` in this version of the proposal, I don't think we should move that away from ``pd.NA``. It's not the default, but maybe this ``pd.na_marker="legacy"`` serves as a good middle ground?
>
> **Member:** if there's a global option, it should go through ``pd.set_option`` / ``pd.get_option``
>
> **Member:** why ``data_manager`` instead of ``storage``?
>
> **Member Author:** The only reason I wanted to include the word data is because we are strictly referring to the data buffer and nothing else. I don't have a strong preference on the terminology of this though.


Missing value handling is a tricky area, as developers are split between ``pd.NA`` and ``np.nan`` semantics, and the transition path from one to the other is not always clear.

Because this PDEP proposes reuse of the existing pandas extension type system, the default missing value marker will consistently be ``pd.NA``. However, to help with backwards compatibility for users that heavily rely on the equality semantics of np.nan, an option of ``pd.na_marker = "legacy"`` can be set. This would mean that the missing value indicator for logical types would be:

| Logical Type       | Default Missing Value | Legacy Missing Value |
|--------------------|-----------------------|----------------------|
| pd.BooleanDtype()  | pd.NA                 | np.nan               |
| pd.IntXXDtype()    | pd.NA                 | np.nan               |
| pd.FloatXXDtype()  | pd.NA                 | np.nan               |
| pd.StringDtype()   | pd.NA                 | np.nan               |
| pd.DatetimeDtype() | pd.NA                 | pd.NaT               |


> **Member:** i feel like you're not engaging with what joris and i tried to explain about "you need to do it all at once" on Wednesday's call.
>
> **Member Author:** Sorry, missed this before - what do you think is missing in this section?
>
> **Member:** This section dramatically expands the scope of this PDEP, to encompass a lot of what would be PDEP-NA. If you want to make PDEP-NA, great, go for it. Until then, this one should either a) be tightly scoped so as to not depend on PDEP-NA, or b) be explicitly "this is on the back-burner until after PDEP-NA". I prefer the first option, but you do you.
>
> **Member Author:** Is your objection just to the default of pd.NA, or to the more general approach of allowing na_marker to allow different nullability semantics while still using the same logical type? I am indifferent if we felt the need to start with the "legacy" approach of mixed sentinels, but I think the latter na_marker is a critical piece to execute this PDEP.
>
> **Member:**
> > Is your objection just to the default of pd.NA
>
> Yes, that is a deal-breaker for me until we have a dedicated PDEP about it.
>
> **Member Author:** Do you think the ``na_marker="legacy"`` makes sense? Curious to know if you think the NA marker for logical types that have no NumPy equivalent (ex: ListDtype) should still be np.nan or pd.NA.
>
> **Member:** Bikeshedding naming aside, I'd be fine with that as long as it is the default until we do an all-at-once switchover following PDEP-NA.

> **Member:** Period, timedelta64
>
> **Member Author:** Sure, I can make this a fully exhaustive list. For the Legacy missing value on BinaryDtype, ListDtype, StructDtype and MapDtype, do you think np.nan or pd.NA makes more sense?


However, all data types for which there is no legacy NumPy-backed equivalent will continue to use ``pd.NA``, even in "legacy" mode. Legacy is provided only for backwards compatibility, but pd.NA usage is encouraged going forward to give users one exclusive missing value indicator.
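
For illustration, a sketch of how the proposed escape hatch might be used; ``pd.na_marker`` does not exist today, and reviewers have suggested routing any global option through ``pd.set_option``:

```python
import pandas as pd

# Hypothetical opt-in to legacy missing value semantics:
pd.na_marker = "legacy"

# A pd.Float64Dtype() column would then surface np.nan for missing values,
# while a type with no NumPy-backed equivalent (e.g. the proposed
# pd.ListDtype()) would continue to surface pd.NA.
```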

### Transitioning from Current Constructors

To maintain a consistent path forward, _all_ constructors with the implementation of this PDEP are expected to map to the logical types. This means that providing ``np.int64`` as the data type argument makes no guarantee that you actually get a NumPy managed storage buffer; pandas reserves the right to optimize as it sees fit and may decide instead to just use pyarrow.

The theory behind this is that the majority of users are not expecting anything in particular from NumPy when they say ``dtype=np.int64``. The expectation is that a user just wants _integer_ data, and the ``np.int64`` specification owes to the legacy of pandas' evolution.
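
A hedged sketch of the intended mapping; the commented behavior is aspirational, not current pandas:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 2, 3], dtype=np.int64)
# Today this guarantees a NumPy int64 buffer. Under this PDEP it would only
# guarantee the logical integer type (e.g. pd.Int64Dtype()), with pandas free
# to choose NumPy or pyarrow for the underlying storage.
```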

This PDEP makes no guarantee that we will stay that way forever; it is certainly reasonable that a few years down the road we deprecate and fully stop supporting backend-specific constructors like ``np.int64`` or ``pd.ArrowDtype(pa.int64())``. However, for the execution of this PDEP, such an initiative is not in scope.

## PDEP-13 History


- 27 April 2024: Initial version
- 10 May 2024: First revision