Improve parsing of author information #521

yggi49 · 2022-11-12T21:31:52Z

Instead of relying on regular expressions, this patch leverages Python’s builtin email.utils.parseaddr() functionality to parse an RFC-822-compliant email address string into its name and address parts.

This should also resolve issues with special characters in the name part; see for example:

This is a followup to python-poetry/poetry PR #1040, as advised by @Secrus’ comment.

If this PR gets accepted, there will be two additional steps:

Refactor poetry.console.commands.init.InitCommand._validate_author() to use the new parse_author() helper, see https://github.com/python-poetry/poetry/blob/master/src/poetry/console/commands/init.py#L441
Remove AUTHOR_REGEX from poetry.core.packages.package.

Added tests for changed code.
Updated documentation for changed code.

Instead of relying on regular expressions, this patch leverages Python’s builtin `email.utils.parseaddr()` functionality to parse an RFC-822-compliant email address string into its name and address parts. This should also resolve issues with special characters in the name part; see for example Poetry issues python-poetry#370 and #798. python-poetry/poetry#370 python-poetry/poetry#798

Secrus · 2022-11-13T11:54:13Z

tests/utils/test_helpers.py

@@ -118,3 +119,60 @@ def test_utils_helpers_readme_content_type(
    readme: str | Path, content_type: str
 ) -> None:
    assert readme_content_type(readme) == content_type
+
+
+def test_utils_helpers_parse_author():


please make it into parametrized test

Done; see commit 8286618.

Secrus · 2022-11-13T11:57:25Z

src/poetry/core/packages/package.py

@@ -32,6 +33,8 @@

    T = TypeVar("T", bound="Package")

+# TODO: once poetry.console.commands.init.InitCommand._validate_author
+# uses poetry.core.utils.helpers.parse_author, this can be removed.
 AUTHOR_REGEX = re.compile(r"(?u)^(?P<name>[- .,\w\d'’\"():&]+)(?: <(?P<email>.+?)>)?$")


Could you stack this upon #517, please?

finswimmer

I didn't know about email.utils before. If we can make it fly, it is clearly preferred over an own regex implementation 👍

I played around a bit with it. For any reasons with blabla <some@example.com I receive blabla as name and some@example.com as email. Which is surprising, because the closing > is missing. 🤔

finswimmer · 2022-11-13T18:43:32Z

src/poetry/core/utils/helpers.py

+    if "@" not in address:
+        return address, None
+    name, email = parseaddr(address)
+    if not name and "@" not in email:
+        return email, None
+    return name or None, email or None


IMO we can remove all the if blocks.

yggi49 · 2022-11-14T11:58:25Z

It seems it will be hard (if not impossible) to make this backward compatible, given the existing tests for valid and invalid author names. test_package_authors_valid() creates a few combinations that contradict RFC-822; on the other hand, test_package_author_names_invalid() has a couple of cases that would actually be valid names.

What’s the general concept in Poetry for author and maintainer strings? Judging from the test suite, it appears that a name is mandatory, whereas an email address is optional. Then again, the ValueError raised in poetry.core.packages.package.Package._get_author() explicitly says that author strings “Must be in the format: John Smith <john@example.com>”, indicating that an email address must be provided as well. (The same also goes for poetry.core.packages.package.Package._get_maintainer().)

For reference: According to the PyPA specification (derived from PEP-621), author and maintainer information can be either name-only, email-only, or name and email:

If only name is provided, the value goes in Author or Maintainer as appropriate.

If only email is provided, the value goes in Author-email or Maintainer-email as appropriate.

If both email and name are provided, the value goes in Author-email or Maintainer-email as appropriate, with the format {name} <{email}>.

Multiple values should be separated by commas.

Furthermore, RFC-822-compliance for names is mentioned as well:

The name value MUST be a valid email name (i.e. whatever can be put as a name, before an email, in RFC 822) and not contain commas. The email value MUST be a valid email address.

TL;DR: I think it needs to be decided whether to adopt the PyPA spec (at least in this regard, which would be a breaking change), or if Poetry wants to continue having its own concept of what constitutes valid author/maintainer strings. In the latter case, this PR can probably be closed without further ado.

finswimmer · 2022-11-23T08:17:52Z

Hello @yggi49,

thanks for these details 👍 . IMO the goal must be PyPA spec compliance. But maybe in two steps.

Within the current scope of your PR we should switch from the regex implementation to email.utils, but keep the author name mandatory and the email optional. This results in less strict rules for the names - which is backwards compatible. Tests must be adopted accordingly.

In a second PR we can than make name optional as well (unless other maintainers have a veto) and also make sure that name/email goes into the correct metadata fields as described in https://packaging.python.org/en/latest/specifications/declaring-project-metadata/#authors-maintainers.

fin swimmer

yggi49 · 2022-11-23T09:33:45Z

Thank you for providing guidance, @finswimmer. I am going to tweak the logic accordingly (name mandatory, email address optional), and adjust the tests as needed.

neersighted · 2022-11-23T19:59:35Z

I'm not a fan of how the tests are structured in this PR; I would like something much more thorough and incorporated to the existing tests (see the tests/additions in #517).

See python-poetry#521 (comment)

yggi49 · 2022-11-24T08:28:21Z

@finswimmer, I adapted the logic towards PyPA-compliance, with a mandatory name and an optional email address. Check out the test cases for the new parse_author() helper on what is supported and what is disallowed; I hope this matches your expectations. The non-RFC-conformant cases were added due to them being discussed in python-poetry/poetry PR #1040.

@neersighted, my primary goal was to factor out author parsing into a single separate helper method and provide a comprehensive list of test cases for it, to showcase the expected behavior. Initially, I didn’t want to mess around too much with what was there already, but merely adapted the cases of existing tests accordingly. Personally, I think that the tests in test_package.py—test_package_authors_invalid(), test_package_authors_valid(), and test_package_author_names_invalid()—might no longer be needed and could be removed, as everything should already be covered by the tests for parse_author().

finswimmer · 2022-11-24T08:46:41Z

Thanks for making progress on that @yggi49 👍

I'm still a bit irritated by the behavior of email.utils.parseaddr() in some cases, e.g.:

email.utils.parseaddr("Me [Some Department] <me@example.com>")
('', 'Me')

I would expect ('Me [Some Department] ', 'me@example.com').

email.utils.parseaddr("Me <me@example.com")
('Me', 'me@example.com')

I would expected either a parsing error (preferred) or ('Me <me@example.com, '').

I don't understand why it behaves like this. Expected? Bug? How do we want to handle this?

We could check if we can reconstruct the address that goes into parseaddr, with the returned value and throw an error if not. Bu that would mean, that the example above is really invalid. Where are the specs for this? 🤔

fin swimmer

yggi49 · 2022-11-24T10:02:17Z

If brackets should be part of the name, the name must be quoted—i.e., it should read "Me [Some Department]" <me@example.com>. The same goes for white space and other special characters: ()<>@,;:\"..

Python’s address parsing is implemented in email._parseaddr.AddrlistClass. I didn’t really dive into it, and neither did I do a deep dive into the RFC, so I cannot tell what the unquoted Me [Some Department] <me@example.com> actually means. (Email address strings can get quite complicated, covering a couple of arcane use cases, and I don’t think it’s worth reinventing the wheel here.)

I did implement a poor-man’s reverse check into the parse_author() helper, though, and calling parse_author("Me [Some Department] <me@example.com>") will raise a ValueError.

finswimmer · 2022-11-25T08:06:07Z

It's a rabbit hole ... 🐇

I asked at the PyPA discord server about the behavior of email.utils.parseaddr(). And as you've already find out, quoting (double quotes only!) is the key. They came up with another solution for parsing name and email that do also a validation:

>>> from email.headerregistry import AddressHeader
>>> parsed = AddressHeader.value_parser("Me [Some Department] <me@example.com>")
>>> len(parsed.all_defects) > 0
True

>>> parsed = AddressHeader.value_parser('"Me [Some Department]" <me@example.com>')
>>> parsed.all_defects
[]
>>> parsed.mailboxes[0].addr_spec
'me@example.com'
>>> parsed.mailboxes[0].display_name
'Me [Some Department]'

So if we detect an @ in the address we might go this way to parse out name and email. But ...

I'm not sure if the current supported values for a name would already require to quote the name. If yes, this would be indeed a breaking change if we now say: please quote your name.
Require to quote the name in pyproject.toml is ugly. Maybe it should be the responsible of the backend to quote the name if needed when writing to Author-email metadata. But if the name is not quoted in the pyproject.toml, we need some kind of pre-processing to add some, before we can parse out name and email as described above.

This all leads to the question: Is it all worth it? I don't know and does not have a strong opinion about it. Maybe some of the other maintainers?

yggi49 · 2022-11-25T16:43:15Z

I'm not sure if the current supported values for a name would already require to quote the name. If yes, this would be indeed a breaking change if we now say: please quote your name.

@finswimmer, I already mentioned that this is going to be the case, since the existing regular expression for the name part (?P<name>[- .,\w\d'’\"():&]+) allows several special characters—specifically (),.:", plus space—that the RFC requires to be quoted.

Maybe it’s easier to approach this entire topic from the other side, and make a basic strategic decision for this project first: is PyPA compatiblity the goal, or does Poetry want to keep doing it its own way?

In case of the former, this is inevitably going to introduce a breaking change. Plus, you won’t have to worry whether quoted names in pyproject.toml are ugly or not, since that decision has already been made by someone else, as quoted above.

I don’t think, though, that the breaking change would hurt too much, as the majority of cases probably still could go unquoted. After all, only names that contain special characters require quoting.
If Poetry wants to pursue its own rules, it should then be specified what is supported and what isn’t by defining the test cases for the parse_author() helper, and adapt the implementation accordingly.

sonarcloud · 2022-12-05T12:31:03Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
1 Code Smell

No Coverage information
0.0% Duplication

clemenskaposi-cpi · 2023-02-08T13:30:00Z

@finswimmer, any updates on this front?

Secrus requested changes Nov 13, 2022

View reviewed changes

finswimmer requested changes Nov 13, 2022

View reviewed changes

Parametrize tests for parse_author helper

8286618

neersighted mentioned this pull request Nov 18, 2022

1.3.0 Release python-poetry/poetry#6972

Closed

Make author parsing PyPA-compatible, with mandatory name

ad7e115

See python-poetry#521 (comment)

Merge branch 'main' into 'improve-author-parsing'

2d03267

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve parsing of author information #521

Improve parsing of author information #521

yggi49 commented Nov 12, 2022

Secrus Nov 13, 2022

yggi49 Nov 14, 2022

Secrus Nov 13, 2022

finswimmer left a comment

finswimmer Nov 13, 2022

yggi49 commented Nov 14, 2022 •

edited

finswimmer commented Nov 23, 2022

yggi49 commented Nov 23, 2022

neersighted commented Nov 23, 2022

yggi49 commented Nov 24, 2022 •

edited

finswimmer commented Nov 24, 2022 •

edited

yggi49 commented Nov 24, 2022 •

edited

finswimmer commented Nov 25, 2022

yggi49 commented Nov 25, 2022

sonarcloud bot commented Dec 5, 2022

clemenskaposi-cpi commented Feb 8, 2023

Improve parsing of author information #521

Are you sure you want to change the base?

Improve parsing of author information #521

Conversation

yggi49 commented Nov 12, 2022

Secrus Nov 13, 2022

Choose a reason for hiding this comment

yggi49 Nov 14, 2022

Choose a reason for hiding this comment

Secrus Nov 13, 2022

Choose a reason for hiding this comment

finswimmer left a comment

Choose a reason for hiding this comment

finswimmer Nov 13, 2022

Choose a reason for hiding this comment

yggi49 commented Nov 14, 2022 • edited

finswimmer commented Nov 23, 2022

yggi49 commented Nov 23, 2022

neersighted commented Nov 23, 2022

yggi49 commented Nov 24, 2022 • edited

finswimmer commented Nov 24, 2022 • edited

yggi49 commented Nov 24, 2022 • edited

finswimmer commented Nov 25, 2022

yggi49 commented Nov 25, 2022

sonarcloud bot commented Dec 5, 2022

clemenskaposi-cpi commented Feb 8, 2023

yggi49 commented Nov 14, 2022 •

edited

yggi49 commented Nov 24, 2022 •

edited

finswimmer commented Nov 24, 2022 •

edited

yggi49 commented Nov 24, 2022 •

edited