Removing related links at end of article/sidebar on news websites? #584

Open
rahulbot opened this issue May 3, 2024 · 2 comments
Labels
bug Something isn't working

Comments

rahulbot commented May 3, 2024

Here in the Media Cloud project, we're seeing poor content extraction on a variety of pages that list links to other "related" stories at the end of the article. Our use case is extracting only the article content as text. Do you have advice on tweaks that would improve this? This might be the opposite of #518, because we do not want related links as part of the content.

Here's sample code with real examples, parsed in a way that closely matches our usage. The function returns True if the supplied text appears in the extracted content (an erroneous result, in our use case). Each example incorrectly includes text from a "related links" type of callout that appears after the article content. Any advice is appreciated.

import trafilatura
import requests
MEDIA_CLOUD_USER_AGENT = 'Mozilla/5.0 (compatible; mediacloud academic archive; mediacloud.org)'

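def is_text_in_webpage_content(txt: str, url: str) -> bool:
    # Fetch the page using the Media Cloud user agent.
    req = requests.get(url, headers={'User-Agent': MEDIA_CLOUD_USER_AGENT}, timeout=30)
    # Extract the main content, leaving out images and comments.
    parsed = trafilatura.bare_extraction(req.text, only_with_metadata=False, url=url,
                                         include_images=False, include_comments=False)
    if parsed is None:  # nothing could be extracted from the page
        return False
    content_text = parsed['text']
    return txt in content_text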

print(is_text_in_webpage_content(
    'Thai Official',  # item on bottom of page in "Latest News" section
    'https://www.ibtimes.co.uk/falling-inflation-shifts-focus-when-ecb-could-cut-rates-1722106'))
print(is_text_in_webpage_content(
    'HIV from Terrence Higgins to Today',  # <li> under the "listen on sounds" banner after article
    'https://www.bbc.co.uk/sport/football/67640638'))
print(is_text_in_webpage_content(
    'Madhuri Dixit',  # title of an item in the featured movie below the main content area
    'https://timesofindia.indiatimes.com/videos/lifestyle/fashion/10-indian-saris-every-woman-should-have-in-her-wardrobe/videoshow/105809845.cms'))
print(is_text_in_webpage_content(
    'Immigration, Ukraine',  # title of an item in the "most popular" sidebar content
    'https://www.bfmtv.com/cote-d-azur/nice-25-personnes-expulsees-lors-d-operations-anti-squat-menees-dans-le-quartier-des-liserons_AN-202312150639.html'))

adbar commented May 6, 2024

Hi @rahulbot, thanks for your feedback. I'll need to check the webpages and the current approach to see if I can find a way to exclude related links. It can be confusing since links are sometimes part of the article and sometimes not.

In the meantime, can you try using favor_precision=True on those pages? This option allows for more restrictive content filtering.
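
One way to check the effect of this option is to run the extraction twice, with and without favor_precision, and see whether the unwanted text survives. The sketch below is only illustrative: extract_text and compare_modes are hypothetical helper names, and it calls trafilatura.extract (which returns plain text, or None when nothing is extracted) instead of the bare_extraction call used in the report.

```python
from typing import Callable, Dict, Optional


def extract_text(html: str, url: Optional[str] = None,
                 favor_precision: bool = False) -> str:
    """Return the extracted main text, or '' when nothing is extracted."""
    # Imported lazily so compare_modes can also be exercised with a stub extractor.
    import trafilatura

    text = trafilatura.extract(html, url=url, include_comments=False,
                               favor_precision=favor_precision)
    return text or ""


def compare_modes(html: str, unwanted: str, url: Optional[str] = None,
                  extractor: Callable[..., str] = extract_text) -> Dict[str, bool]:
    """Report whether `unwanted` appears in the output of each extraction mode."""
    return {
        "default": unwanted in extractor(html, url=url, favor_precision=False),
        "favor_precision": unwanted in extractor(html, url=url, favor_precision=True),
    }
```

For example, calling compare_modes(req.text, 'Thai Official', url=url) on the first page from the report would show whether that "Latest News" item is still extracted in each mode. Results depend on the trafilatura version and on the live page content.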

adbar changed the title from "best approaches to removing related links at end of article/sidebar?" to "Removing related links at end of article/sidebar on news websites?" on May 6, 2024

rahulbot commented May 7, 2024

Oh, using favor_precision=True helps resolve it in 3/4 of the cases. We'll do some more testing on that internally to assess it. Thanks for the tip.
