pageText fails on CNN articles #163

Open
gregobad opened this issue Oct 25, 2021 · 10 comments
@gregobad

Our extension (https://code.stanford.edu/gjmartin/beyond_the_paywall) uses pageText.content.js to extract full text of news articles. It consistently fails, returning an empty string, on articles from cnn.com. Here is an example:

https://www.cnn.com/2021/10/08/politics/white-house-privilege-january-6-committee/index.html

My guess is that the unusual URL format (with index.html at the end) trips up the isArticle check.

@jonathanmayer
Contributor

The WebScience pageText module relies on Mozilla Readability to parse articles. That library is the de facto standard for article parsing in JavaScript. And, unfortunately, it appears to have a years-old incompatibility with detecting articles on CNN.com owing to unusual <div> formatting: mozilla/readability#420.

The good news is, this issue is probably easy to fix. Readability operates in two passes: a fast heuristic to determine whether the webpage is likely an article that can be parsed, followed by a slow parsing step. The issue on CNN.com is solely with the heuristic; I confirmed that parsing works fine, other than some punctuation formatting oddities.

We could easily add a dataset of URL match patterns that we know are likely articles and that are false negatives for the Readability heuristic. If a webpage loads with a URL that matches the dataset, we would skip the heuristic step and immediately invoke article parsing. We need to update how pageText handles article detection anyway, because our current approach relies on a Firefox-specific API for the Readability heuristic: https://github.com/mozilla-rally/web-science/blob/main/src/pageText.js#L190.
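A rough sketch of that bypass (the knownArticlePatterns list and the shouldParse/parseArticle helpers are illustrative, not the actual pageText internals; it assumes the npm @mozilla/readability package, which exports both the heuristic and the parser):

```javascript
import { Readability, isProbablyReaderable } from "@mozilla/readability";

// Hypothetical starter list: URL patterns for pages we know are articles
// but that the Readability heuristic misclassifies (false negatives),
// e.g. CNN's /YYYY/MM/DD/.../index.html article URLs.
const knownArticlePatterns = [
  /^https:\/\/(www\.)?cnn\.com\/\d{4}\/\d{2}\/\d{2}\/.+\/index\.html/
];

function shouldParse(url, doc) {
  // Skip the fast heuristic entirely for known false negatives.
  if (knownArticlePatterns.some(pattern => pattern.test(url))) {
    return true;
  }
  return isProbablyReaderable(doc);
}

function parseArticle(doc) {
  // Readability mutates the DOM it is given, so parse a clone.
  return new Readability(doc.cloneNode(true)).parse();
}
```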

@rhelmer, thoughts?

@jonathanmayer
Contributor

Hm, this might be a different issue. Are you able to replicate this problem locally with your study extension?

@gregobad
Author

For this page, no; it seems to work consistently when I access it locally.

@jonathanmayer
Contributor

Are there specific webpages or websites where this issue is particularly common? And are these page visits of at least a few seconds, such that the pageText functionality should run?

@gregobad
Author

The example I sent is one where we noticed it happening several times in the data. But there doesn't seem to be much pattern to the pages where it occurs. And we've confirmed that at least some of these visits are longer than a few seconds.

@jonathanmayer
Contributor

Ok, thanks for checking. After looking further through the pageText implementation and the codebase for your study, my suspicion is that Readability parsing is sometimes failing on these pages. I don't see another code path that would result in the pageText.onTextParsed event firing with empty title, content, and textContent fields.

I think a fix here is easy, too. We could generate fields in the pageText.onTextParsed event that are independent of the Readability library, reflecting the document title (document.title) and HTML (document.documentElement.outerHTML). If client-side Readability parsing fails for any reason, you could send those values and use server-side parsing of the HTML as a backup. The only downside I see is a possible performance penalty, since serializing the DOM to HTML can be slow on large or complex pages, but we can try to minimize jank with careful scheduling.
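A minimal sketch of that fallback, assuming we can extend the onTextParsed payload (the buildTextPayload helper and the documentTitle/documentHTML field names are illustrative, not the current pageText API):

```javascript
import { Readability } from "@mozilla/readability";

// Hypothetical helper: assemble the event payload from Readability
// output plus Readability-independent fallback fields.
function buildTextPayload(doc) {
  let parsed = null;
  try {
    // Readability mutates the DOM it is given, so parse a clone;
    // parse() returns null when it cannot extract an article.
    parsed = new Readability(doc.cloneNode(true)).parse();
  } catch (e) {
    // Readability can throw on unusual markup; fall through to fallbacks.
  }
  return {
    title: parsed ? parsed.title : "",
    content: parsed ? parsed.content : "",
    textContent: parsed ? parsed.textContent : "",
    // Readability-independent values for server-side parsing as a backup.
    documentTitle: doc.title,
    documentHTML: doc.documentElement.outerHTML
  };
}

// Serializing outerHTML can be slow on large or complex pages, so defer
// the work until the browser is idle to reduce jank.
requestIdleCallback(() => {
  const payload = buildTextPayload(document);
  // ...send payload over the extension's messaging channel
});
```

Sound reasonable?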

@knowtheory

Reposting Greg's earlier comment:

Thanks, Jonathan, this is helpful. Probably related to the same underlying issue, we've also noticed that the same page, visited at different times or by different users, will sometimes return parsed text and sometimes not. Here's an example (page text truncated for readability). The first visit returned parsed text but the second and third did not. We checked and there doesn't seem to be any relationship to visit duration or time of day.

visit_id,pioneer_id,url,title,text_content,url_content
6dad52b4-94ae-447e-96f8-168fc10f33e3,xxx,https://www.fastcompany.com/90676758/firefox-suggest-wikipedia-ebay-results-mozilla-google,Firefox’s new feature is part of an ambitious plan to change how we search,"At first glance, a new Firefox feature called Suggest doesn’t seem like a big deal. Type into the browser’s address bar, and you might see suggested links from Wikipedia or shopping results from eBay. In the future, you might be able to peek at the weather, or perform quick mathematical calculations, similar to what Google offers in its Chrome browser today. Even if those features save you some time, they won’t really change how you browse...",https://www.fastcompany.com/90676758/firefox-suggest-wikipedia-ebay-results-mozilla-google
9a52edda-e3fd-4d8c-860a-d8c42257c098,xxx,https://www.fastcompany.com/90676758/firefox-suggest-wikipedia-ebay-results-mozilla-google,,,
0d7b2787-636e-4db1-a1b5-378df0ceb13a,yyy,https://www.fastcompany.com/90676758/firefox-suggest-wikipedia-ebay-results-mozilla-google,,,

mozilla-rally deleted a comment from gregobad Oct 28, 2021
@gregobad
Author

Thanks @jonathanmayer. This seems to happen in a pretty small fraction of pages, and we often have multiple versions of the same page from different users that we could use as backup. So, saving the full HTML is probably overkill in the vast majority of instances. If there is a specific case that causes Readability parsing to fail, that's worth fixing, but I could not detect any pattern in our recent data.

@rhelmer
Contributor

rhelmer commented Nov 1, 2021

> We could easily add a dataset of URL match patterns that we know are likely articles and that are false negatives for the Readability heuristic. If a webpage loads with a URL that matches the dataset, we would skip the heuristic step and immediately invoke article parsing. [...] @rhelmer, thoughts?

@jonathanmayer I think this would work, but can we do it without needing to maintain the "fails heuristic but actually parses" dataset of URLs? Would it be harmful to skip the fast heuristic step (isProbablyReaderable) and always run the slow parsing step?

@jonathanmayer
Contributor

My assumption has been that we don't want to incur the performance penalty of parsing unless we have some confidence that it'll work. But maybe that assumption was wrong. From some very quick and ad hoc testing, the heuristic step is ~1ms and the parsing step is ~50-100ms on my 6.5-year-old notebook. That might be tolerable, especially if we use requestIdleCallback.
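For reference, a quick way to reproduce that ad hoc comparison (assuming the @mozilla/readability package is in scope; timings will vary by machine and page):

```javascript
import { Readability, isProbablyReaderable } from "@mozilla/readability";

// Time the fast heuristic and the full parse separately on the live page.
const t0 = performance.now();
const readerable = isProbablyReaderable(document);
const t1 = performance.now();
const article = new Readability(document.cloneNode(true)).parse();
const t2 = performance.now();

console.log(`heuristic: ${(t1 - t0).toFixed(1)} ms (result: ${readerable})`);
console.log(`parse: ${(t2 - t1).toFixed(1)} ms`);
```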
