Removing related links at end of article/sidebar on news websites? #584

rahulbot · 2024-05-03T17:19:12Z

Over here in the Media Cloud project we're seeing poor performance on the content extraction task for a variety of pages that include links to other "related" stories at the end of article content. Our use case is trying to extract only article content as text. Do you have advice on tweaks to make to improve that performance? This might be the opposite of #518, because we do not want related links as part of content.

Here's sample code with real examples, parsed in a way that closely mirrors our usage. The function returns True if the supplied text is included in the extracted content, which is the erroneous result for our use case. Each of these pages incorrectly includes text from a "related links" type callout that appears after the article content. Any advice appreciated.

import trafilatura
import requests
MEDIA_CLOUD_USER_AGENT = 'Mozilla/5.0 (compatible; mediacloud academic archive; mediacloud.org)'

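def is_text_in_webpage_content(txt: str, url: str) -> bool:
    """Return True if txt appears in the content extracted from the page at url."""
    resp = requests.get(url, headers={'User-Agent': MEDIA_CLOUD_USER_AGENT}, timeout=30)
    parsed = trafilatura.bare_extraction(resp.text, only_with_metadata=False, url=url,
                                         include_images=False, include_comments=False)
    if parsed is None:  # nothing could be extracted from the page
        return False
    content_text = parsed['text'] or ''
    return txt in content_text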

print(is_text_in_webpage_content(
    'Thai Official',  # item on bottom of page in "Latest News" section
    'https://www.ibtimes.co.uk/falling-inflation-shifts-focus-when-ecb-could-cut-rates-1722106'))
print(is_text_in_webpage_content(
    'HIV from Terrence Higgins to Today',  # <li> under the "listen on sounds" banner after article
    'https://www.bbc.co.uk/sport/football/67640638'))
print(is_text_in_webpage_content(
    'Madhuri Dixit',  # title of an item in the featured movie below the main content area
    'https://timesofindia.indiatimes.com/videos/lifestyle/fashion/10-indian-saris-every-woman-should-have-in-her-wardrobe/videoshow/105809845.cms'))
print(is_text_in_webpage_content(
    'Immigration, Ukraine',  # title of an item in the "most popular" sidebar content
    'https://www.bfmtv.com/cote-d-azur/nice-25-personnes-expulsees-lors-d-operations-anti-squat-menees-dans-le-quartier-des-liserons_AN-202312150639.html'))

adbar · 2024-05-06T15:40:17Z

Hi @rahulbot, thanks for your feedback. I'll need to check the webpages and the current approach to see if I can find a way to exclude related links. It can be confusing, since links are sometimes part of the article and sometimes not.

In the meantime, can you try using favor_precision=True on those pages? This option allows for more restrictive content filtering.
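For reference, a minimal sketch of what that could look like on already-downloaded HTML. The helper name `snippet_leaks` and its `extractor` parameter are hypothetical and only there so the check can be tried with a stub; by default it falls back to `trafilatura.extract` with `favor_precision=True`.

```python
from typing import Callable, Optional


def snippet_leaks(html: str, snippet: str, favor_precision: bool = True,
                  extractor: Optional[Callable[..., Optional[str]]] = None) -> bool:
    """Return True if `snippet` appears in the text extracted from `html`.

    `snippet_leaks` and `extractor` are illustrative names, not part of
    trafilatura itself.
    """
    if extractor is None:
        # Imported lazily so the helper can also be exercised with a stub.
        import trafilatura
        extractor = trafilatura.extract
    text = extractor(html, include_comments=False, include_images=False,
                     favor_precision=favor_precision)
    # extract() returns None when no main content is found.
    return snippet in (text or "")
```

Calling `snippet_leaks(html, 'Thai Official')` on the downloaded page source would then report whether the "Latest News" item still ends up in the output under the stricter setting.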

rahulbot · 2024-05-07T14:50:25Z

Oh, using favor_precision=True helps resolve it in 3/4 of the cases. We'll do some more testing on that internally to assess it. Thanks for the tip.
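A rough sketch of how that kind of assessment could be tallied, assuming the pages have already been fetched. The `count_leaks` name and the `(snippet, html)` case format are made up for illustration; the `extractor` argument is expected to behave like `trafilatura.extract` (accepting `favor_precision` and returning a string or None).

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Extractor = Callable[..., Optional[str]]


def count_leaks(cases: Iterable[Tuple[str, str]], extractor: Extractor) -> Dict[str, int]:
    """Count how many (snippet, html) cases leak the snippet under each setting."""
    counts = {"total": 0, "default": 0, "favor_precision": 0}
    for snippet, html in cases:
        counts["total"] += 1
        for label, flag in (("default", False), ("favor_precision", True)):
            # Treat a failed extraction (None) as empty text.
            text = extractor(html, favor_precision=flag) or ""
            if snippet in text:
                counts[label] += 1
    return counts
```

With real data this would be called as `count_leaks(cases, trafilatura.extract)`, and comparing the "default" and "favor_precision" counts shows how many cases the stricter setting resolves.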

rahulbot mentioned this issue May 3, 2024

investigate appearance of other article headlines in article content mediacloud/story-indexer#278

Closed

adbar added the "bug" (Something isn't working) label May 6, 2024

adbar changed the title ~~best approaches to removing related links at end of article/sidebar?~~ Removing related links at end of article/sidebar on news websites? May 6, 2024
