Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: incorrect assumption of full-name month format when "May" is initial month #58328

Open
3 tasks done
matthew-brett opened this issue Apr 19, 2024 · 20 comments
Open
3 tasks done
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@matthew-brett
Copy link
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
pd.to_datetime(['31 May 2023', '1 Apr 2023'])

Issue Description

When the first date entries have "May" as the month, the parser assumes the months are written out as "month as locale’s full name" (format code %B), and then, when it hits an abbreviated month (correct format code %b - abbreviated name), it gives an error:

ValueError: time data "1 Apr 2023" doesn't match format "%d %B %Y", at position 1. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

Expected Behavior

It would be very good if the parser remained agnostic as to whether the month specifier was abbreviated until after it had seen a month other than "May".

Installed Versions

INSTALLED VERSIONS

commit : 2fbabb1
python : 3.10.12.final.0
python-bits : 64
OS : Darwin
OS-release : 23.4.0
Version : Darwin Kernel Version 23.4.0: Fri Mar 15 00:11:05 PDT 2024; root:xnu-10063.101.17~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 3.0.0.dev0+736.g2fbabb11db
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.2.0
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.23.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

@matthew-brett matthew-brett added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 19, 2024
@Aloqeely
Copy link
Contributor

Aloqeely commented Apr 20, 2024

Thanks for the report! This is a duplicate of #57521 which got closed for "not a bug"
But I think we should improve this, I can work on a fix but want to get @rhshadrach's opinion first

@rhshadrach
Copy link
Member

Thanks for the ping @Aloqeely. That issue was closed by its author while I sought the opinion of @MarcoGorelli. I'd like to do that here as well!

@MarcoGorelli
Copy link
Member

thanks for the ping - I wouldn't consider this a bug, this looks intended

if this was auto-inferred as %b, then someone else would complain when it would raise on "April"

But I think we should improve this

how, what do you suggest?

@asishm
Copy link
Contributor

asishm commented Apr 20, 2024

I'd agree in general with @rhshadrach's comments in the linked issue (specifically around the edge cases where say most of the entries are May but one is April etc.). In the meantime, I think a doc update which contains language from the release notes (see below) would be helpful. Maybe add this specific caveat around inferring May which could lead to either %b or %B in the notes section.

As of version 2.0.0, parsing will use a consistent format, determined by the first non-NA value (unless the user specifies a format, in which case that is used).

That language is present in the current docs under, but the fact that this param is deprecated doesn't make it very clear as to what is the current behavior.

infer_datetime_formatbool, default False
If True and no format is given, attempt to infer the format of the datetime strings based on the first non-NaN element, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x.
Deprecated since version 2.0.0: A strict version of this argument is now the default, passing it has no effect.

@matthew-brett
Copy link
Contributor Author

Just to say - I've read the previous issue - but I do think this is a bug in the guess of the initial date format, unless there is, for some reason, an absolute requirement that the guess is not revisited, and cannot be improved by later entries. I mean, I think it is reasonable to consider a risky guess as a bug. Here, when the first date has "May" as the month, the chances of full-month-format (%B) being correct are - difficult to estimate - it would depend on how frequent full-month and abbreviated-month formats are in the wild. My guess is that abbreviated month formats are more common, so the current guess is more often wrong than right. Which is not to suggest that the default be reversed, but that it is in principle reasonable to consider ways of updating the guess if, as here, there is a high probability that the guess is wrong.

@matthew-brett
Copy link
Contributor Author

matthew-brett commented Apr 20, 2024

I don't know the code, and if that's an issue - I can take a look. I guess the problem is that there isn't currently a way of updating the guess once an error arises? If not - would that be practical to implement in restricted circumstances with high risk guesses and obvious alternatives, as here?

@MarcoGorelli
Copy link
Member

tbh I wouldn't be opposed to a simple solution which considers more than just the first element - it needs to be simple though, this code's already quite complex

@asishm
Copy link
Contributor

asishm commented Apr 20, 2024

A 2-pass approach might work. If the first pass errors out, then re-run with format='mixed' - and should also raise a PerformanceWarning

@matthew-brett
Copy link
Contributor Author

@MarcoGorelli - thanks - I do not know the Pandas code-base very well, but if no-one else is interested to look at this, I could give it a go.

@Aloqeely
Copy link
Contributor

Aloqeely commented Apr 20, 2024

But I think we should improve this

how, what do you suggest?

Using the first element that doesn't have May as month. Opened #58336

@rhshadrach
Copy link
Member

rhshadrach commented Apr 20, 2024

tbh I wouldn't be opposed to a simple solution which considers more than just the first element - it needs to be simple though, this code's already quite complex

Another requirement I'd add is predictable - not only in behavior but performance. It should be easy to describe to a user what pandas is doing to infer, and the procedure should be in the docstring.

@matthew-brett
Copy link
Contributor Author

matthew-brett commented Apr 20, 2024

Just thinking aloud here, but I wonder whether there is some way of maintaining an "alternative" format that would be triggered by an error. As in:

31-May-2023 -> guessed %B, alternative %b
1-Apr-2023 -> error for %B, switch to alternative %b, recast previous and continue

but

31-May-2023 -> guessed %B, alternative %b
1-January-2023 -> remove alternative as invalid
1-Apr-2023 -> raise usual error

This could also mean automatic correct parsing of the old dd-mm-yyyy issue:

12-10-2023 -> guessed mm-dd-yyyy, alternative dd-mm-yyyy
13-10-2023 -> error for mm-dd-yyyy, switch to dd-mm-yyyy, recast all previous.

Where:

12-10-2023 -> guessed mm-dd-yyyy, alternative dd-mm-yyyy
12-13-2023 -> remove alternative as invalid
13-10-2023 -> raise usual error

Of course this would have some parsing-time cost, but this would only be for a few elements, typically, before the alternative can be excluded.

@Aloqeely
Copy link
Contributor

The general "alternative format" idea sounds great but there will be an endless amount of alternative formats, and I'm not sure if it is considered a "simple solution", maybe you can try tweaking with it and see if it's easily doable.

@matthew-brett
Copy link
Contributor Author

matthew-brett commented Apr 20, 2024

I'll have a look at the code - thanks.

Sketching here, I think it need only catch some common cases, and others could argue for further extensions as they run into them. Then the code might look something like:

...
found_fmt = guess_format(first_entry)
alternatives = _find_alternatives(found_fmt)
...
for current_entry in entries:
    ...
    if alternatives:
        alternatives = _prune_alternatives(current_entry, alternatives)
    try:
        out = parse_entry(current_entry, found_fmt)
    except ValueError as e:
        if alternatives:
            return dp.to_datetime(series, format=alternatives)
        raise e

@MarcoGorelli
Copy link
Member

Using the first element that doesn't have May as month. Opened #58336

thanks - tbh I feel uneasy about special-casing 'may' like this

Maybe just take the first 5 non-null elements, guess the format for each, and take a majority vote? I think this'd be pretty simple and easy to explain

In the end, this "guess datetime format" is just provided for convenience, if you want reliable parsing you should pass a format explicitly

@Aloqeely
Copy link
Contributor

thanks - tbh I feel uneasy about special-casing 'may' like this

I agree, Brett also mentioned (in my PR) that it wouldn't work for other locale's month names

Maybe just take the first 5 non-null elements, guess the format for each, and take a majority vote? I think this'd be pretty simple and easy to explain

I'm a big fan of your suggestion and Brett's suggestion and think we should implement both, I.E. take first 5 non-null elements, for each element see all possible guesses, and take the first format that works for all 5. Is this good?

Now I don't know if we can accomplish this without complicating the code, @matthew-brett are you able to do it? If not, I can just implement the idea of checking the first 5 non-null elements and call it a day

@asishm
Copy link
Contributor

asishm commented Apr 21, 2024

I'm not necessarily a fan of using first K because there's nothing fundamentally different from K=1 (which is today's behavior) and you would still run into issues if "April" pops up on the 6th value instead of within the first K.

@Aloqeely
Copy link
Contributor

What do you suggest?

@Aloqeely
Copy link
Contributor

After having this conversation, one thing is clear to me, not only do we have to describe the procedure pandas does to infer (as Mr. Shadrach suggested), we should also highly encourage the use of the format parameter.

@matthew-brett
Copy link
Contributor Author

@Aloqeely - thanks for looking at this - yes - I'm happy to take a go at this, and try and come up with something suitably simple, to make it as likely as possible that the guess will be correct. I have about 2 weeks of heavy work just now, but I will get to it soon afterwards.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants