Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Found the next address, delimited by a comma" #70

Open
Kermer opened this issue Jan 3, 2018 · 10 comments
Open

"Found the next address, delimited by a comma" #70

Kermer opened this issue Jan 3, 2018 · 10 comments

Comments

@Kermer
Copy link

Kermer commented Jan 3, 2018

MailAddressCollection.Parse fails to parse address when it contains comma in the display name. It splits single address into multiple ones which leads to having some invalid address(es) (first part of display name) and valid address with incorrect displayname.
image
image
In the example above it's From address, so when I use MailMessage.From I get the first, invalid, address from the 1st screenshot.

@Kermer
Copy link
Author

Kermer commented Jan 3, 2018

I can add that the raw "From" header looks like this
image

@jstedfast
Copy link

Seems like the parser is parsing mail addresses after decoding the header which is causing this sort of problem (unquoted commas in address headers are meant to split addresses).

The parser should be fixed to parse the headers and then decode the names.

Kermer pushed a commit to Kermer/OpaqueMail that referenced this issue Jan 3, 2018
Comma in the "displayName" was splitting 1 address into multiple, invalid ones.
Fixes bertjohnson#70
@jstedfast
Copy link

The patch is wrong :-\

Don't decode the address until after you have completely parsed it and determined which part is the name and which part is the address and then you only want to decode the name portion.

Also: yikes, that address parser code is very... brute force. It should really be token-based like the parser I wrote in MimeKit.

Main loop for parsing address lists: https://github.com/jstedfast/MimeKit/blob/master/MimeKit/InternetAddressList.cs#L546

Logic for parsing an individual address (used by above loop): https://github.com/jstedfast/MimeKit/blob/master/MimeKit/InternetAddress.cs#L555

You'll also note that MimeKit's parser handles unquoted commas in the name portion of the address as well (see https://github.com/jstedfast/MimeKit/blob/master/MimeKit/InternetAddress.cs#L623).

@jstedfast
Copy link

Keep in mind that the rfc2047 encoded-word token, once decoded, might contain <>'s or @'s or any other character that will screw up the logic in the current MailAddressCollection.Parse() method.

@jstedfast
Copy link

jstedfast commented Jan 4, 2018

For anyone that wants to try a different approach to my parser in MimeKit, you could always try parsing the address list in reverse. This approach probably helps simplify the parser logic a bit because parsing forward makes it difficult to know what the tokens belong to (is it the name token? or is it the local-part of an addr-spec? hard to know until I consume a few more tokens...).

For example, consider the following BNF grammar:

address         =       mailbox / group
mailbox         =       name-addr / addr-spec
name-addr       =       [display-name] angle-addr
angle-addr      =       [CFWS] "<" addr-spec ">" [CFWS] / obs-angle-addr
group           =       display-name ":" [mailbox-list / CFWS] ";"
                        [CFWS]
display-name    =       phrase
word            =       atom / quoted-string
phrase          =       1*word / obs-phrase
addr-spec       =       local-part "@" domain
local-part      =       dot-atom / quoted-string / obs-local-part
domain          =       dot-atom / domain-literal / obs-domain
obs-local-part  =       word *("." word)

Now consider the following email address: "Joe Example" <joe@example.com>

The first token you read will be "Joe Example" and you might think that that token indicates that it is the display name, but it doesn't. All you know is that you've got a quoted-string token. A quoted-string can be part of a phrase or it can be (a part of) the local-part of the address itself. You must read at least 1 more token before you'll be able to figure out what it actually is (obs-local-part makes things slightly more difficult). In this case, you'll get a '<' which indicates the start of an angle-addr, allowing you to assume that the quoted-string you just got is indeed the display-name.

If, however, you parse the address in reverse, things become a little simpler because you know immediately what to expect the next token to be a part of.

It's been an approach that I've been meaning to try, but since my current parser in MimeKit works well, I haven't been able to summon the enthusiasm to rewrite it (as the old saying goes, "if it ain't broke, don't fix it").

@jstedfast
Copy link

It may be a good idea to test against all of the address examples in my unit tests as well: https://github.com/jstedfast/MimeKit/blob/master/UnitTests/InternetAddressListTests.cs

For example, the patch here will likely break for addresses like this:

To: "Worthington, William" <william@the-worthingtons.com>

You can't just assume that all commas split addresses.

@Kermer
Copy link
Author

Kermer commented Jan 4, 2018

With the newest commit (forget the first one) from the PR, but sadly tested only using Mozilla Thunderbird client, I tested out both encoded (=?iso-8859-2? [...] ?= <mail@ex.com>) and unencoded (" [...] " <mail@ex.com>) addresses.

I gotta agree that the Parse method is a bit messy, but after that fix it seems to work fine for me.
e.g.

  1. To: "Worthington, William" <william@the-worthingtons.com>
    Was and is working fine, as it's not encoded, so symbols between quotes are not threated as special symbols
  2. To: =?utf-8?Q?Worthington=2C_William?= <william@the-worthingtons.com>
    (decoded: Worthington, William <william@the-worthingtons.com>)
    When it doesn't have quotes around it (which is usually the case in encoded names) - earlier it would be split into Worthington and William <william@the-worthingtons.com (thanks to that comma).
    If it would be From instead of To you would get invalid address on MailMessage.From
    With this fix it's now decoded to "Worthington, William" <william@the-worthingtons.com> and after that we basically go back to point 1.

I'm saw some emails encoded as =?iso-8859-2?B? and they seemed to be working fine as well.

@jstedfast
Copy link

Okay, cool.

@Kermer
Copy link
Author

Kermer commented Jan 10, 2018

Ok. While this fixed the address issue, it introduced a new bug as apparently Functions.DecodeMailHeader wasn't the best place to put it in 😢. Namely now other headers get quoted and another interesting thing is that the subject headers (so probably other headers as well) are split into parts of max. 32 chars.
So instead of getting subject like
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean et sagittis nisl.
it becomes
"Lorem ipsum dolor sit amet, cons""ectetur adipiscing elit. Aenean ""et sagittis nisl."

@Kermer
Copy link
Author

Kermer commented Jan 10, 2018

Above gets patched with e6cd1fc , hopefully without any further surprises.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants