Link section missed at bottom of page #518

Open
adbar opened this issue Mar 1, 2024 Discussed in #516 · 3 comments
Labels
bug Something isn't working

Comments

@adbar
Owner

adbar commented Mar 1, 2024

Discussed in #516

Originally posted by mertdeveci5 February 29, 2024
I read that this might be a feature request, so I'm sharing it here in case someone has figured it out.

When using extract, I pass include_links=True. However, the links on the website are not scraped for some reason. I'm not sure whether I'm using this the wrong way, so I'd appreciate anyone pointing me in the right direction.

Example:

# import the necessary functions
from trafilatura import fetch_url, extract, sitemaps
from rich import print as rprint
# grab a HTML file to extract data from
URL = "https://jam.dev/careers"
downloaded = fetch_url(URL)

sitemap = sitemaps.sitemap_search(URL)

# output main content and comments as plain text
result = extract(downloaded)

# change the output format to XML (allowing for preservation of document structure)
result = extract(downloaded, include_links=True, output_format="xml")

# discard potential comments and change the output to JSON
result_json = extract(downloaded, output_format="json", include_comments=False)

rprint(sitemap)
rprint(result)

Here most of the text is scraped EXCEPT the section where the job listings appear. It is critical to get this content, though.

@adbar adbar added the bug Something isn't working label Mar 1, 2024
@adbar
Owner Author

adbar commented Mar 1, 2024

Usually the bottom section contains unwanted links, but here there is actual content to be found. In particular, with include_links enabled, relevant parts are missing.

@mertdeveci5

Thanks for opening a bug issue for this. I'm also wondering whether there are settings I can play around with, since you mentioned "unwanted links". This might be related to a bug in there, as it could be that these links are wrongly detected as unwanted.

@adbar
Owner Author

adbar commented Mar 14, 2024

You could try favor_recall=True as a parameter to the extraction function.

The culprit would be here; obviously the approach is limited, as the fixed thresholds cannot work in every case:

def delete_by_link_density(subtree, tagname, backtracking=False, favor_precision=False):
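To illustrate the idea behind this function (this is a simplified, hypothetical sketch, not trafilatura's actual implementation): a subtree is dropped when the share of its text that sits inside links exceeds a fixed threshold, which is exactly why a legitimate list of job links can be misclassified as navigation.

```python
# Hypothetical illustration of link-density pruning using only the stdlib.
from html.parser import HTMLParser

class LinkDensity(HTMLParser):
    """Accumulate total text length and text length inside <a> tags."""
    def __init__(self):
        super().__init__()
        self.total = 0
        self.linked = 0
        self.in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total += n
        if self.in_link:
            self.linked += n

def link_density(fragment: str) -> float:
    parser = LinkDensity()
    parser.feed(fragment)
    return parser.linked / parser.total if parser.total else 1.0

def should_prune(fragment: str, threshold: float = 0.8) -> bool:
    # A fixed threshold like this cannot work all the time: a job-listings
    # block is legitimate content that happens to be made of links.
    return link_density(fragment) >= threshold

nav = "<ul><li><a href='/a'>Home</a></li><li><a href='/b'>About</a></li></ul>"
jobs = "<div><p>Open roles at our company:</p><a href='/jobs/1'>Engineer</a></div>"
print(should_prune(nav), should_prune(jobs))  # → True False
```

All the text in the navigation fragment is linked (density 1.0), so it is pruned; the listings fragment has enough surrounding prose to fall under the threshold, but a pure list of job links would not.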
