pageText fails on CNN articles #163

Open
gregobad opened this issue Oct 25, 2021 · 10 comments
@gregobad

Our extension (https://code.stanford.edu/gjmartin/beyond_the_paywall) uses pageText.content.js to extract full text of news articles. It consistently fails, returning an empty string, on articles from cnn.com. Here is an example:

https://www.cnn.com/2021/10/08/politics/white-house-privilege-january-6-committee/index.html

My guess is that the unusual URL format (with index.html at the end) trips up the isArticle check.

@jonathanmayer
Contributor

The WebScience pageText module relies on Mozilla Readability to parse articles. That library is the de facto standard for article parsing in JavaScript. And, unfortunately, it appears to have a years-old incompatibility with detecting articles on CNN.com owing to unusual <div> formatting: mozilla/readability#420.

The good news is, this issue is probably easy to fix. Readability operates in two passes: a fast heuristic to determine whether the webpage is likely an article that can be parsed, followed by a slow parsing step. The issue on CNN.com is solely with the heuristic; I confirmed that parsing works fine, other than some punctuation formatting oddities.

We could easily add a dataset of URL match patterns that we know are likely articles and that are false negatives for the Readability heuristic. If a webpage loads with a URL that matches the dataset, we would skip the heuristic step and immediately invoke article parsing. We need to update how pageText handles article detection anyway, because our current approach relies on a Firefox-specific API for the Readability heuristic: https://github.com/mozilla-rally/web-science/blob/main/src/pageText.js#L190.
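A rough sketch of that bypass (the knownArticlePatterns list and the shouldParse/parseArticle helpers are illustrative, not the actual pageText internals; it assumes the npm @mozilla/readability package, which exports both the heuristic and the parser):

```javascript
import { Readability, isProbablyReaderable } from "@mozilla/readability";

// Hypothetical starter list: URL patterns for pages we know are articles
// but that the Readability heuristic misclassifies (false negatives),
// e.g. CNN's /YYYY/MM/DD/.../index.html article URLs.
const knownArticlePatterns = [
  /^https:\/\/(www\.)?cnn\.com\/\d{4}\/\d{2}\/\d{2}\/.+\/index\.html/
];

function shouldParse(url, doc) {
  // Skip the fast heuristic entirely for known false negatives.
  if (knownArticlePatterns.some(pattern => pattern.test(url))) {
    return true;
  }
  return isProbablyReaderable(doc);
}

function parseArticle(doc) {
  // Readability mutates the DOM it is given, so parse a clone.
  return new Readability(doc.cloneNode(true)).parse();
}
```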

@rhelmer, thoughts?

@jonathanmayer
Contributor

Hm, this might be a different issue. Are you able to replicate this problem locally with your study extension?

@gregobad
Author

For this page, no; it seems to work consistently when I access it locally.

@jonathanmayer
Contributor

Are there specific webpages or websites where this issue is particularly common? And are these page visits of at least a few seconds, such that the pageText functionality should run?

@gregobad
Author

The example I sent is one where we noticed it happening several times in the data. But there doesn't seem to be much pattern to the pages where it occurs. And we've confirmed that at least some of these visits are longer than a few seconds.

@jonathanmayer
Contributor

Ok, thanks for checking. After looking further through the pageText implementation and the codebase for your study, my suspicion is that Readability parsing is sometimes failing on these pages. I don't see another code path that would result in the pageText.onTextParsed event firing with empty title, content, and textContent fields.

I think a fix here is easy, too. We could generate fields in the pageText.onTextParsed event that are independent of the Readability library, reflecting the document title (document.title) and HTML (document.documentElement.outerHTML). If client-side Readability parsing fails for any reason, you could send those values and use server-side parsing of the HTML as a backup. The only downside I see is a possible performance penalty, since serializing the DOM to HTML can be slow on large or complex pages, but we can try to minimize jank with careful scheduling.
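A minimal sketch of that fallback, assuming we can extend the onTextParsed payload (the buildTextPayload helper and the documentTitle/documentHTML field names are illustrative, not the current pageText API):

```javascript
import { Readability } from "@mozilla/readability";

// Hypothetical helper: assemble the event payload from Readability
// output plus Readability-independent fallback fields.
function buildTextPayload(doc) {
  let parsed = null;
  try {
    // Readability mutates the DOM it is given, so parse a clone;
    // parse() returns null when it cannot extract an article.
    parsed = new Readability(doc.cloneNode(true)).parse();
  } catch (e) {
    // Readability can throw on unusual markup; fall through to fallbacks.
  }
  return {
    title: parsed ? parsed.title : "",
    content: parsed ? parsed.content : "",
    textContent: parsed ? parsed.textContent : "",
    // Readability-independent values for server-side parsing as a backup.
    documentTitle: doc.title,
    documentHTML: doc.documentElement.outerHTML
  };
}

// Serializing outerHTML can be slow on large or complex pages, so defer
// the work until the browser is idle to reduce jank.
requestIdleCallback(() => {
  const payload = buildTextPayload(document);
  // ...send payload over the extension's messaging channel
});
```

Sound reasonable?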

@knowtheory

Reposting Greg's earlier comment:

Thanks, Jonathan, this is helpful. Probably related to the same underlying issue, we've also noticed that the same page, visited at different times or by different users, will sometimes return parsed text and sometimes not. Here's an example (page text truncated for readability). The first visit returned parsed text but the second and third did not. We checked and there doesn't seem to be any relationship to visit duration or time of day.

visit_id,pioneer_id,url,title,text_content,url_content
6dad52b4-94ae-447e-96f8-168fc10f33e3,xxx,https://www.fastcompany.com/90676758/firefox-suggest-wikipedia-ebay-results-mozilla-google,Firefox’s new feature is part of an ambitious plan to change how we search,"At first glance, a new Firefox feature called Suggest doesn’t seem like a big deal. Type into the browser’s address bar, and you might see suggested links from Wikipedia or shopping results from eBay. In the future, you might be able to peek at the weather, or perform quick mathematical calculations, similar to what Google offers in its Chrome browser today. Even if those features save you some time, they won’t really change how you browse...",https://www.fastcompany.com/90676758/firefox-suggest-wikipedia-ebay-results-mozilla-google
9a52edda-e3fd-4d8c-860a-d8c42257c098,xxx,https://www.fastcompany.com/90676758/firefox-suggest-wikipedia-ebay-results-mozilla-google,,,
0d7b2787-636e-4db1-a1b5-378df0ceb13a,yyy,https://www.fastcompany.com/90676758/firefox-suggest-wikipedia-ebay-results-mozilla-google,,,

mozilla-rally deleted a comment from gregobad Oct 28, 2021
@gregobad
Author

Thanks @jonathanmayer. This seems to happen in a pretty small fraction of pages, and we often have multiple versions of the same page from different users that we could use as backup. So, saving the full HTML is probably overkill in the vast majority of instances. If there is a specific case that causes Readability parsing to fail, that's worth fixing, but I could not detect any pattern in our recent data.

@rhelmer
Contributor

rhelmer commented Nov 1, 2021

> We could easily add a dataset of URL match patterns that we know are likely articles and that are false negatives for the Readability heuristic. If a webpage loads with a URL that matches the dataset, we would skip the heuristic step and immediately invoke article parsing. [...] @rhelmer, thoughts?

@jonathanmayer I think this would work, but can we do it without needing to maintain the "fails heuristic but actually parses" dataset of URLs? Would it be harmful to skip the fast heuristic step (isProbablyReaderable) and always run the slow parsing step?

@jonathanmayer
Contributor

My assumption has been that we don't want to incur the performance penalty of parsing unless we have some confidence that it'll work. But maybe that assumption was wrong. From some very quick and ad hoc testing, the heuristic step is ~1ms and the parsing step is ~50-100ms on my 6.5-year-old notebook. That might be tolerable, especially if we use requestIdleCallback.
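For reference, a quick way to reproduce that ad hoc comparison (assuming the @mozilla/readability package is in scope; timings will vary by machine and page):

```javascript
import { Readability, isProbablyReaderable } from "@mozilla/readability";

// Time the fast heuristic and the full parse separately on the live page.
const t0 = performance.now();
const readerable = isProbablyReaderable(document);
const t1 = performance.now();
const article = new Readability(document.cloneNode(true)).parse();
const t2 = performance.now();

console.log(`heuristic: ${(t1 - t0).toFixed(1)} ms (result: ${readerable})`);
console.log(`parse: ${(t2 - t1).toFixed(1)} ms`);
```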
