"Found the next address, delimited by a comma" #70

Kermer · 2018-01-03T12:12:36Z

MailAddressCollection.Parse fails to parse address when it contains comma in the display name. It splits single address into multiple ones which leads to having some invalid address(es) (first part of display name) and valid address with incorrect displayname.

In the example above it's From address, so when I use MailMessage.From I get the first, invalid, address from the 1st screenshot.

The text was updated successfully, but these errors were encountered:

Kermer · 2018-01-03T12:19:18Z

I can add that the raw "From" header looks like this

jstedfast · 2018-01-03T12:22:59Z

Seems like the parser is parsing mail addresses after decoding the header which is causing this sort of problem (unquoted commas in address headers are meant to split addresses).

The parser should be fixed to parse the headers and then decode the names.

Comma in the "displayName" was splitting 1 address into multiple, invalid ones. Fixes bertjohnson#70

jstedfast · 2018-01-04T09:53:39Z

The patch is wrong :-\

Don't decode the address until after you have completely parsed it and determined which part is the name and which part is the address and then you only want to decode the name portion.

Also: yikes, that address parser code is very... brute force. It should really be token-based like the parser I wrote in MimeKit.

Main loop for parsing address lists: https://github.com/jstedfast/MimeKit/blob/master/MimeKit/InternetAddressList.cs#L546

Logic for parsing an individual address (used by above loop): https://github.com/jstedfast/MimeKit/blob/master/MimeKit/InternetAddress.cs#L555

You'll also note that MimeKit's parser handles unquoted commas in the name portion of the address as well (see https://github.com/jstedfast/MimeKit/blob/master/MimeKit/InternetAddress.cs#L623).

jstedfast · 2018-01-04T09:56:00Z

Keep in mind that the rfc2047 encoded-word token, once decoded, might contain <>'s or @'s or any other character that will screw up the logic in the current MailAddressCollection.Parse() method.

jstedfast · 2018-01-04T10:04:35Z

For anyone that wants to try a different approach to my parser in MimeKit, you could always try parsing the address list in reverse. This approach probably helps simplify the parser logic a bit because parsing forward makes it difficult to know what the tokens belong to (is it the name token? or is it the local-part of an addr-spec? hard to know until I consume a few more tokens...).

For example, consider the following BNF grammar:

address         =       mailbox / group
mailbox         =       name-addr / addr-spec
name-addr       =       [display-name] angle-addr
angle-addr      =       [CFWS] "<" addr-spec ">" [CFWS] / obs-angle-addr
group           =       display-name ":" [mailbox-list / CFWS] ";"
                        [CFWS]
display-name    =       phrase
word            =       atom / quoted-string
phrase          =       1*word / obs-phrase
addr-spec       =       local-part "@" domain
local-part      =       dot-atom / quoted-string / obs-local-part
domain          =       dot-atom / domain-literal / obs-domain
obs-local-part  =       word *("." word)

Now consider the following email address: "Joe Example" <joe@example.com>

The first token you read will be "Joe Example" and you might think that that token indicates that it is the display name, but it doesn't. All you know is that you've got a quoted-string token. A quoted-string can be part of a phrase or it can be (a part of) the local-part of the address itself. You must read at least 1 more token before you'll be able to figure out what it actually is (obs-local-part makes things slightly more difficult). In this case, you'll get a '<' which indicates the start of an angle-addr, allowing you to assume that the quoted-string you just got is indeed the display-name.

If, however, you parse the address in reverse, things become a little simpler because you know immediately what to expect the next token to be a part of.

It's been an approach that I've been meaning to try, but since my current parser in MimeKit works well, I haven't been able to summon the enthusiasm to rewrite it (as the old saying goes, "if it ain't broke, don't fix it").

jstedfast · 2018-01-04T10:14:30Z

It may be a good idea to test against all of the address examples in my unit tests as well: https://github.com/jstedfast/MimeKit/blob/master/UnitTests/InternetAddressListTests.cs

For example, the patch here will likely break for addresses like this:

To: "Worthington, William" <william@the-worthingtons.com>

You can't just assume that all commas split addresses.

Kermer · 2018-01-04T10:47:06Z

With the newest commit (forget the first one) from the PR, but sadly tested only using Mozilla Thunderbird client, I tested out both encoded (=?iso-8859-2? [...] ?= <mail@ex.com>) and unencoded (" [...] " <mail@ex.com>) addresses.

I gotta agree that the Parse method is a bit messy, but after that fix it seems to work fine for me.
e.g.

To: "Worthington, William" <william@the-worthingtons.com>
Was and is working fine, as it's not encoded, so symbols between quotes are not threated as special symbols
To: =?utf-8?Q?Worthington=2C_William?= <william@the-worthingtons.com>
(decoded: Worthington, William <william@the-worthingtons.com>)
When it doesn't have quotes around it (which is usually the case in encoded names) - earlier it would be split into Worthington and William <william@the-worthingtons.com (thanks to that comma).
If it would be From instead of To you would get invalid address on MailMessage.From
With this fix it's now decoded to "Worthington, William" <william@the-worthingtons.com> and after that we basically go back to point 1.

I'm saw some emails encoded as =?iso-8859-2?B? and they seemed to be working fine as well.

jstedfast · 2018-01-04T12:31:00Z

Okay, cool.

Kermer · 2018-01-10T09:32:16Z

Ok. While this fixed the address issue, it introduced a new bug as apparently Functions.DecodeMailHeader wasn't the best place to put it in 😢. Namely now other headers get quoted and another interesting thing is that the subject headers (so probably other headers as well) are split into parts of max. 32 chars.
So instead of getting subject like
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean et sagittis nisl.
it becomes
"Lorem ipsum dolor sit amet, cons""ectetur adipiscing elit. Aenean ""et sagittis nisl."

Kermer · 2018-01-10T09:50:10Z

Above gets patched with e6cd1fc , hopefully without any further surprises.

Kermer pushed a commit to Kermer/OpaqueMail that referenced this issue Jan 3, 2018

Split address headers before decoding them

f3864b3

Comma in the "displayName" was splitting 1 address into multiple, invalid ones. Fixes bertjohnson#70

Kermer mentioned this issue Jan 4, 2018

Place display name between quotes after decoding. #71

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"Found the next address, delimited by a comma" #70

"Found the next address, delimited by a comma" #70

Kermer commented Jan 3, 2018 •

edited

Kermer commented Jan 3, 2018

jstedfast commented Jan 3, 2018

jstedfast commented Jan 4, 2018

jstedfast commented Jan 4, 2018

jstedfast commented Jan 4, 2018 •

edited

jstedfast commented Jan 4, 2018

Kermer commented Jan 4, 2018

jstedfast commented Jan 4, 2018

Kermer commented Jan 10, 2018

Kermer commented Jan 10, 2018

"Found the next address, delimited by a comma" #70

"Found the next address, delimited by a comma" #70

Comments

Kermer commented Jan 3, 2018 • edited

Kermer commented Jan 3, 2018

jstedfast commented Jan 3, 2018

jstedfast commented Jan 4, 2018

jstedfast commented Jan 4, 2018

jstedfast commented Jan 4, 2018 • edited

jstedfast commented Jan 4, 2018

Kermer commented Jan 4, 2018

jstedfast commented Jan 4, 2018

Kermer commented Jan 10, 2018

Kermer commented Jan 10, 2018

Kermer commented Jan 3, 2018 •

edited

jstedfast commented Jan 4, 2018 •

edited