BUG: incorrect assumption of full-name month format when "May" is initial month #58328

matthew-brett · 2024-04-19T09:24:55Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
pd.to_datetime(['31 May 2023', '1 Apr 2023'])

Issue Description

When the first date entries have "May" as the month, the parser assumes the months are written out as "month as locale’s full name" (format code %B), and then, when it hits an abbreviated month (correct format code %b - abbreviated name), it gives an error:

ValueError: time data "1 Apr 2023" doesn't match format "%d %B %Y", at position 1. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

Expected Behavior

It would be very good if the parser remained agnostic as to whether the month specifier was abbreviated until after it had seen a month other than "May".

Installed Versions

INSTALLED VERSIONS

commit : 2fbabb1
python : 3.10.12.final.0
python-bits : 64
OS : Darwin
OS-release : 23.4.0
Version : Darwin Kernel Version 23.4.0: Fri Mar 15 00:11:05 PDT 2024; root:xnu-10063.101.17~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 3.0.0.dev0+736.g2fbabb11db
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.2.0
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.23.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

The text was updated successfully, but these errors were encountered:

Aloqeely · 2024-04-20T04:15:37Z

Thanks for the report! This is a duplicate of #57521 which got closed for "not a bug"
But I think we should improve this, I can work on a fix but want to get @rhshadrach's opinion first

rhshadrach · 2024-04-20T10:28:05Z

Thanks for the ping @Aloqeely. That issue was closed by its author while I sought the opinion of @MarcoGorelli. I'd like to do that here as well!

MarcoGorelli · 2024-04-20T10:40:34Z

thanks for the ping - I wouldn't consider this a bug, this looks intended

if this was auto-inferred as %b, then someone else would complain when it would raise on "April"

But I think we should improve this

how, what do you suggest?

asishm · 2024-04-20T10:47:08Z

I'd agree in general with @rhshadrach's comments in the linked issue (specifically around the edge cases where say most of the entries are May but one is April etc.). In the meantime, I think a doc update which contains language from the release notes (see below) would be helpful. Maybe add this specific caveat around inferring May which could lead to either %b or %B in the notes section.

As of version 2.0.0, parsing will use a consistent format, determined by the first non-NA value (unless the user specifies a format, in which case that is used).

That language is present in the current docs under, but the fact that this param is deprecated doesn't make it very clear as to what is the current behavior.

infer_datetime_formatbool, default False
If True and no format is given, attempt to infer the format of the datetime strings based on the first non-NaN element, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x.
Deprecated since version 2.0.0: A strict version of this argument is now the default, passing it has no effect.

matthew-brett · 2024-04-20T10:49:31Z

Just to say - I've read the previous issue - but I do think this is a bug in the guess of the initial date format, unless there is, for some reason, an absolute requirement that the guess is not revisited, and cannot be improved by later entries. I mean, I think it is reasonable to consider a risky guess as a bug. Here, when the first date has "May" as the month, the chances of full-month-format (%B) being correct are - difficult to estimate - it would depend on how frequent full-month and abbreviated-month formats are in the wild. My guess is that abbreviated month formats are more common, so the current guess is more often wrong than right. Which is not to suggest that the default be reversed, but that it is in principle reasonable to consider ways of updating the guess if, as here, there is a high probability that the guess is wrong.

matthew-brett · 2024-04-20T10:54:27Z

I don't know the code, and if that's an issue - I can take a look. I guess the problem is that there isn't currently a way of updating the guess once an error arises? If not - would that be practical to implement in restricted circumstances with high risk guesses and obvious alternatives, as here?

MarcoGorelli · 2024-04-20T10:56:25Z

tbh I wouldn't be opposed to a simple solution which considers more than just the first element - it needs to be simple though, this code's already quite complex

asishm · 2024-04-20T11:01:14Z

A 2-pass approach might work. If the first pass errors out, then re-run with format='mixed' - and should also raise a PerformanceWarning

matthew-brett · 2024-04-20T11:04:55Z

@MarcoGorelli - thanks - I do not know the Pandas code-base very well, but if no-one else is interested to look at this, I could give it a go.

Aloqeely · 2024-04-20T11:14:14Z

But I think we should improve this

how, what do you suggest?

Using the first element that doesn't have May as month. Opened #58336

rhshadrach · 2024-04-20T11:33:30Z

tbh I wouldn't be opposed to a simple solution which considers more than just the first element - it needs to be simple though, this code's already quite complex

Another requirement I'd add is predictable - not only in behavior but performance. It should be easy to describe to a user what pandas is doing to infer, and the procedure should be in the docstring.

matthew-brett · 2024-04-20T11:36:56Z

Just thinking aloud here, but I wonder whether there is some way of maintaining an "alternative" format that would be triggered by an error. As in:

31-May-2023 -> guessed %B, alternative %b
1-Apr-2023 -> error for %B, switch to alternative %b, recast previous and continue

but

31-May-2023 -> guessed %B, alternative %b
1-January-2023 -> remove alternative as invalid
1-Apr-2023 -> raise usual error

This could also mean automatic correct parsing of the old dd-mm-yyyy issue:

12-10-2023 -> guessed mm-dd-yyyy, alternative dd-mm-yyyy
13-10-2023 -> error for mm-dd-yyyy, switch to dd-mm-yyyy, recast all previous.

Where:

12-10-2023 -> guessed mm-dd-yyyy, alternative dd-mm-yyyy
12-13-2023 -> remove alternative as invalid
13-10-2023 -> raise usual error

Of course this would have some parsing-time cost, but this would only be for a few elements, typically, before the alternative can be excluded.

Aloqeely · 2024-04-20T12:42:13Z

The general "alternative format" idea sounds great but there will be an endless amount of alternative formats, and I'm not sure if it is considered a "simple solution", maybe you can try tweaking with it and see if it's easily doable.

matthew-brett · 2024-04-20T14:11:35Z

I'll have a look at the code - thanks.

Sketching here, I think it need only catch some common cases, and others could argue for further extensions as they run into them. Then the code might look something like:

...
found_fmt = guess_format(first_entry)
alternatives = _find_alternatives(found_fmt)
...
for current_entry in entries:
    ...
    if alternatives:
        alternatives = _prune_alternatives(current_entry, alternatives)
    try:
        out = parse_entry(current_entry, found_fmt)
    except ValueError as e:
        if alternatives:
            return dp.to_datetime(series, format=alternatives)
        raise e

MarcoGorelli · 2024-04-21T14:52:30Z

Using the first element that doesn't have May as month. Opened #58336

thanks - tbh I feel uneasy about special-casing 'may' like this

Maybe just take the first 5 non-null elements, guess the format for each, and take a majority vote? I think this'd be pretty simple and easy to explain

In the end, this "guess datetime format" is just provided for convenience, if you want reliable parsing you should pass a format explicitly

Aloqeely · 2024-04-21T16:00:07Z

thanks - tbh I feel uneasy about special-casing 'may' like this

I agree, Brett also mentioned (in my PR) that it wouldn't work for other locale's month names

Maybe just take the first 5 non-null elements, guess the format for each, and take a majority vote? I think this'd be pretty simple and easy to explain

I'm a big fan of your suggestion and Brett's suggestion and think we should implement both, I.E. take first 5 non-null elements, for each element see all possible guesses, and take the first format that works for all 5. Is this good?

Now I don't know if we can accomplish this without complicating the code, @matthew-brett are you able to do it? If not, I can just implement the idea of checking the first 5 non-null elements and call it a day

asishm · 2024-04-21T16:53:51Z

I'm not necessarily a fan of using first K because there's nothing fundamentally different from K=1 (which is today's behavior) and you would still run into issues if "April" pops up on the 6th value instead of within the first K.

Aloqeely · 2024-04-21T18:38:57Z

What do you suggest?

Aloqeely · 2024-04-21T19:02:47Z

After having this conversation, one thing is clear to me, not only do we have to describe the procedure pandas does to infer (as Mr. Shadrach suggested), we should also highly encourage the use of the format parameter.

matthew-brett · 2024-04-23T12:04:15Z

@Aloqeely - thanks for looking at this - yes - I'm happy to take a go at this, and try and come up with something suitably simple, to make it as likely as possible that the guess will be correct. I have about 2 weeks of heavy work just now, but I will get to it soon afterwards.

matthew-brett added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 19, 2024

Aloqeely mentioned this issue Apr 20, 2024

BUG: to_datetime(..., infer_datetime_format=True) fails if argument is np.str_ #48969

Closed

3 tasks

Aloqeely mentioned this issue Apr 20, 2024

BUG: Fix possible wrong format guess if first element had May as month #58336

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: incorrect assumption of full-name month format when "May" is initial month #58328

BUG: incorrect assumption of full-name month format when "May" is initial month #58328

matthew-brett commented Apr 19, 2024

INSTALLED VERSIONS

Aloqeely commented Apr 20, 2024 •

edited

rhshadrach commented Apr 20, 2024

MarcoGorelli commented Apr 20, 2024

asishm commented Apr 20, 2024 •

edited

matthew-brett commented Apr 20, 2024

matthew-brett commented Apr 20, 2024 •

edited

MarcoGorelli commented Apr 20, 2024

asishm commented Apr 20, 2024

matthew-brett commented Apr 20, 2024

Aloqeely commented Apr 20, 2024 •

edited

rhshadrach commented Apr 20, 2024 •

edited

matthew-brett commented Apr 20, 2024 •

edited

Aloqeely commented Apr 20, 2024

matthew-brett commented Apr 20, 2024 •

edited

MarcoGorelli commented Apr 21, 2024

Aloqeely commented Apr 21, 2024

asishm commented Apr 21, 2024

Aloqeely commented Apr 21, 2024

Aloqeely commented Apr 21, 2024

matthew-brett commented Apr 23, 2024

BUG: incorrect assumption of full-name month format when "May" is initial month #58328

BUG: incorrect assumption of full-name month format when "May" is initial month #58328

Comments

matthew-brett commented Apr 19, 2024

Pandas version checks

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

INSTALLED VERSIONS

Aloqeely commented Apr 20, 2024 • edited

rhshadrach commented Apr 20, 2024

MarcoGorelli commented Apr 20, 2024

asishm commented Apr 20, 2024 • edited

matthew-brett commented Apr 20, 2024

matthew-brett commented Apr 20, 2024 • edited

MarcoGorelli commented Apr 20, 2024

asishm commented Apr 20, 2024

matthew-brett commented Apr 20, 2024

Aloqeely commented Apr 20, 2024 • edited

rhshadrach commented Apr 20, 2024 • edited

matthew-brett commented Apr 20, 2024 • edited

Aloqeely commented Apr 20, 2024

matthew-brett commented Apr 20, 2024 • edited

MarcoGorelli commented Apr 21, 2024

Aloqeely commented Apr 21, 2024

asishm commented Apr 21, 2024

Aloqeely commented Apr 21, 2024

Aloqeely commented Apr 21, 2024

matthew-brett commented Apr 23, 2024

Aloqeely commented Apr 20, 2024 •

edited

asishm commented Apr 20, 2024 •

edited

matthew-brett commented Apr 20, 2024 •

edited

Aloqeely commented Apr 20, 2024 •

edited

rhshadrach commented Apr 20, 2024 •

edited

matthew-brett commented Apr 20, 2024 •

edited

matthew-brett commented Apr 20, 2024 •

edited