pageText fails on CNN articles #163
Comments
The good news is, this issue is probably easy to fix. Readability operates in two passes: a fast heuristic to determine whether the webpage is likely an article that can be parsed, followed by a slow parsing step. The issue on CNN.com is solely with the heuristic; I confirmed that parsing works fine, other than some punctuation formatting oddities. We could easily add a dataset of URL match patterns that we know are likely articles and that are false negatives for the Readability heuristic. If a webpage loads with a URL that matches the dataset, we would skip the heuristic step and immediately invoke article parsing. We would need to update how the heuristic step is invoked accordingly. @rhelmer, thoughts?
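A minimal sketch of what that allowlist could look like. The pattern list, the `matchesKnownArticlePattern` helper, and the CNN regex (guessed from the example URL in this issue) are all hypothetical, not existing WebScience code:

```javascript
// Hypothetical allowlist of URL patterns for pages we know are articles
// but that are false negatives for the Readability heuristic.
const knownArticlePatterns = [
  // CNN article URLs look like /YYYY/MM/DD/section/slug/index.html
  /^https:\/\/(www\.)?cnn\.com\/\d{4}\/\d{2}\/\d{2}\/.+/,
];

function matchesKnownArticlePattern(url) {
  return knownArticlePatterns.some((pattern) => pattern.test(url));
}

// On page load: if the URL matches, skip the heuristic and invoke the
// Readability parse directly; otherwise run the fast heuristic as usual.
```

The dataset could live alongside the extension and be extended as more false-negative sites are reported.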
Hm, this might be a different issue. Are you able to replicate this problem locally with your study extension?
For this page, no; it seems to work consistently when I access it locally.
Are there specific webpages or websites where this issue is particularly common? And are these page visits of at least a few seconds, such that the
The example I sent is one where we noticed it happening several times in the data. But there doesn't seem to be much pattern to the pages where it occurs. And we've confirmed that at least some of these visits are longer than a few seconds.
Ok, thanks for checking. After looking further through the I think a fix here is easy, too. We could generate fields in the
Reposting Greg's earlier comment:
Thanks @jonathanmayer. This seems to happen in a pretty small fraction of pages, and we often have multiple versions of the same page from different users that we could use as backup. So, saving the full HTML is probably overkill in the vast majority of instances. If there is a specific case that causes Readability parsing to fail, that's worth fixing, but I could not detect any pattern in our recent data. |
@jonathanmayer I think this would work, but can we do it without needing to maintain the "fails heuristic but actually parses" dataset of URLs? Would it be harmful to simply skip the fast heuristic step?
My assumption has been that we don't want to incur the performance penalty of parsing unless we have some confidence that it'll work. But maybe that assumption was wrong. From some very quick and ad hoc testing, the heuristic step is ~1ms and the parsing step is ~50-100ms on my 6.5-year-old notebook. That might be tolerable, especially if we use
Our extension (https://code.stanford.edu/gjmartin/beyond_the_paywall) uses `pageText.content.js` to extract the full text of news articles. It consistently fails, returning an empty string, on articles from cnn.com. Here is an example:
https://www.cnn.com/2021/10/08/politics/white-house-privilege-january-6-committee/index.html
My guess is the unusual URL format (with `index.html` at the end) trips up the `isArticle` check.