[WIP] Ignore entities. #273

Open · wants to merge 2 commits into base: main
Conversation

benubois
Contributor

This change makes Mercury ignore HTML entities and special characters.

Here's an example of the output change.

This is a big change in behavior and breaks about 80 tests. I'm happy to go through and update the tests, but wanted to make sure that's the right direction.

A lot of the tests rely on the old default behavior, where cheerio would convert HTML entities back to characters. For example, the title:

AT&amp;T just declared war on an open internet (and us)

Would now come through as exactly that, instead of the old behavior where it would result in:

AT&T just declared war on an open internet (and us)

I think this is a better behavior, but it would be surprising if you were relying on the old way.

The big advantage here is that now everything that was getting converted to entities (just about every non-ASCII character) will be left alone.

@adampash
Contributor

Hmm, this is an interesting question, and you're right, it certainly changes expectations. One other thing I noticed just by looking at the test output: it's also complicated by the browser version, where that cheerio option isn't relevant. (In the browser, cheerio is replaced by a slightly shimmed version of jQuery.)

> The big advantage here is that now everything that was getting converted to entities (just about every non-ASCII character) will be left alone.

In terms of the output, what are the cons of the current behavior if you were to decode just the one time, on initial load?

Also, it's definitely a big decision to put the decoding on the user. Ideally we'd want Mercury to return the output as human-friendly as possible, since, even though Mercury is for computers, the goal is more or less to let our software see the page more like humans do.

I'm very interested in your thoughts on this, though, and really appreciate the perspective.

@benubois
Contributor Author

> One other thing I noticed just by looking at the test output: it's also complicated by the browser version, where that cheerio option isn't relevant. (In the browser, cheerio is replaced by a slightly shimmed version of jQuery.)

I think the browser version (assuming the Mercury Reader extension for Chrome is using that) already works without any entity encoding conversions.

[screenshot: 2019-02-13 10:28 AM]

> In terms of the output, what are the cons of the current behavior if you were to decode just the one time, on initial load?

Actually, it looks like passing decodeEntities: false only to cheerio.load() and not $.html() has the same effect as passing it to both.
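
A minimal sketch of the behavior being described, using cheerio's decodeEntities option (the sample markup is illustrative, not from the test suite):

```js
const cheerio = require('cheerio');

const input = '<p>café &amp; “naïveté”</p>';

// Default behavior: entities are decoded on load, and non-ASCII
// characters are re-encoded as entities when serialized.
const $decoded = cheerio.load(input);
console.log($decoded.html($decoded('p'))); // e.g. '<p>caf&#xE9; &amp; &#x201C;na&#xEF;vet&#xE9;&#x201D;</p>'

// Passing decodeEntities: false at load time is enough; the
// serialized output keeps the original characters and entities.
const $raw = cheerio.load(input, { decodeEntities: false });
console.log($raw.html($raw('p'))); // '<p>café &amp; “naïveté”</p>'
```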

Doing the reverse, passing decodeEntities: false only on output, causes the most surprising behavior. I parsed this XSS Filter Evasion Cheat Sheet with Mercury, and the result was that entities that were already HTML-encoded were converted back into unencoded characters. For example, `&gt;` got converted to `>`. On this page this is especially bad because it contains so many JavaScript code samples. These became executable once they went through Mercury.

> Also, it's definitely a big decision to put the decoding on the user. Ideally we'd want Mercury to return the output as human-friendly as possible, since, even though Mercury is for computers, the goal is more or less to let our software see the page more like humans do.

Maybe a hybrid would be best? What if only the content key adopted the new behavior of decodeEntities: false, and all other keys retained the current behavior. I would assume content always gets used in an HTML context (browsers/web views) but other keys like title might be used in other contexts (like native views) where HTML encoded characters would look wrong unless the authors knew to HTML decode them.

@adampash
Contributor

> Maybe a hybrid would be best? What if only the content key adopted the new behavior of decodeEntities: false, and all other keys retained the current behavior. I would assume content always gets used in an HTML context (browsers/web views) but other keys like title might be used in other contexts (like native views) where HTML encoded characters would look wrong unless the authors knew to HTML decode them.

This seems to me like a fair compromise. With the exception of content, everything else is just plain text.

The only other thing to consider is the newly added contentType option, which allows the user to request that the content be converted to markdown or plain text:

https://github.com/postlight/mercury-parser/blob/b044cfa958014beae03c6f69a3b32499abe765af/src/mercury.js#L87-L92

At that point it might be sensible to also adopt the new behavior before returning.
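
For reference, the option is used through the public API along these lines (a sketch; the URL is a placeholder):

```js
const Mercury = require('@postlight/mercury-parser');

// contentType selects how the content key is serialized:
// 'html' (the default), 'markdown', or 'text'.
Mercury.parse('https://example.com/article', { contentType: 'markdown' })
  .then(result => console.log(result.content));
```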

@adampash
Contributor

@benubois Thanks again for looking into this. Just let me know when you've got something you'd like a review on.

@benubois
Contributor Author

@adampash actually I'm a bit stuck on how best to proceed and could use your help! My latest attempt was to decode entities back into characters for the text keys.

However, this resulted in a number of side effects. For example, some extractors look for HTML attributes that contain JSON. When decodeEntities: false is passed, these attributes look like this:

value="{"postID":34002,"authorType":"Fan Contributor","isVideo":false,"postType":"post"}"

But in order for the extraction to work, it needs decodeEntities: true so the value can be parsed as JSON:

value='{"postID":34002, "authorType": "Fan Contributor", "isVideo": false, "postType":"post"}

It also causes some problems with date parsing. Maybe because some encoded entities look like years? i.e. `&#8212;`

I can think of a few options for how to proceed:

  1. Continue with this idea of decoding on output. There are about 9 failing tests, but some of them seem tricky to resolve.
  2. Revert back to decodeEntities: true as the default, but also do a second extraction just for the content with decodeEntities: false. Not really sure what would be needed to make this happen; seems like a big change.

Any thoughts on these approaches or alternatives that come to mind?

@adampash
Contributor

Hmm, you're right, this is a tough one to resolve. Even in the failing tests for the images, there are query parameters in the URL that actually make a difference in what the image looks like. (I.e., cropping params.)

I'm leaning toward option 2, though like you said, it's potentially a big change, and loading cheerio twice is its own potential performance hit (though it may be negligible; in my experience, cheerio is pretty fast).

If we went that route, Resource.create may return two documents: the decoded and non-decoded docs. Then you'd pass those to the RootExtractor, which would just need to make sure to pass the latter specifically to the content extractor.
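
A rough sketch of that shape (the two-document return is hypothetical, not the current Resource.create signature):

```js
const cheerio = require('cheerio');

// Hypothetical: load the fetched HTML twice and return both docs.
function createDocs(html) {
  return {
    $: cheerio.load(html), // entities decoded, for title/author/etc.
    $content: cheerio.load(html, { decodeEntities: false }), // content only
  };
}

// RootExtractor would route $content to the content extractor
// and $ to everything else.
```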

All things considered, it doesn't seem like it'd be a huge change (unless I'm missing something or being overly optimistic). If you have time, do you want to look into it? If you could also take a stab at whether or not you're noticing a performance hit, that would be interesting to see. (I'm also optimistic that it might not be that big of a deal.)

Either way, let me know what you think! And thanks again for looking into this.

@benubois
Contributor Author

Before going further with this I wanted to check if it's needed.

Since the content is displayed as HTML, what are the downsides if the characters are left encoded?

Would be interested to hear your thoughts @HenryQW.

@HenryQW

HenryQW commented Feb 21, 2019

It’s probably fine for small use cases. I just recalled my old solution to this problem:

https://github.com/HenryQW/OpenCC.henry.wang/blob/07909d16abe4644efd068a01cf8acdef136871a0/route.js#L7-L21

The downside is the performance: as I’ve tested, running replace() with multiple regexes over the HTML is really computationally expensive.
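
The general shape of that approach, for illustration (not the contents of the linked file):

```js
// Illustrative chained-replace entity decoding; each pass rescans
// the whole HTML string, which is what makes it expensive.
function decodeBasicEntities(html) {
  return html
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, '&'); // decode &amp; last to avoid double-decoding
}
```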

My original plan was to use Mercury in RSSHub.

@adampash
Contributor

adampash commented Feb 26, 2019

@benubois If this is only a problem for reading the HTML before it's actually rendered on a page, I agree that maybe it's not a huge concern. The one use case where I might want to handle it more gracefully is the new alternate output options for content type:

https://github.com/postlight/mercury-parser/blob/ed14203e9788ca3975060d6eb1f1d33642ac15d7/src/mercury.js#L87-L92

What do you think about that?

@mtashley mtashley closed this Aug 21, 2019
@mtashley mtashley reopened this Aug 23, 2019