Link section missed at bottom of page #518

Open
adbar opened this issue Mar 1, 2024 Discussed in #516 · 3 comments
Labels
bug Something isn't working

Comments

@adbar
Owner

adbar commented Mar 1, 2024

Discussed in #516

Originally posted by mertdeveci5 February 29, 2024
I read that this might be a feature request, so I'm sharing it here in case someone has figured it out.

When using extract, I pass include_links=True. However, the links on the website are not scraped for some reason. I'm not sure whether I'm using this the wrong way, so I'd appreciate anyone pointing me in the right direction.

Example:

# import the necessary functions
from trafilatura import fetch_url, extract, sitemaps
from rich import print as rprint
# grab a HTML file to extract data from
URL = "https://jam.dev/careers"
downloaded = fetch_url(URL)

sitemap = sitemaps.sitemap_search(URL)

# output main content and comments as plain text
result = extract(downloaded)

# change the output format to XML (allowing for preservation of document structure)
result = extract(downloaded, include_links=True, output_format="xml")

# discard potential comments and change the output to JSON
result_json = extract(downloaded, output_format="json", include_comments=False)

rprint(sitemap)
rprint(result)

Here most of the text is scraped EXCEPT the section where the job listings appear. It is critical to get this content, though.

@adbar adbar added the bug Something isn't working label Mar 1, 2024
@adbar
Owner Author

adbar commented Mar 1, 2024

Usually the bottom section contains unwanted links, but here there is actual content to be found. In particular, with include_links enabled, relevant parts are missing.

@mertdeveci5

Thanks for opening a bug issue for this. I'm also wondering whether there are settings I can play around with, since you mentioned "unwanted links". This might be related to a bug in there, as it could be that these links are wrongly detected as unwanted.

@adbar
Owner Author

adbar commented Mar 14, 2024

You could try favor_recall=True as a parameter to the extraction function.

The culprit would be here; obviously the approach is limited, as the fixed thresholds cannot work in every case:

def delete_by_link_density(subtree, tagname, backtracking=False, favor_precision=False):
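To illustrate the idea behind this function (this is a simplified, hypothetical sketch, not trafilatura's actual implementation): a subtree is dropped when the share of its text that sits inside links exceeds a fixed threshold, which is exactly why a legitimate list of job links can be misclassified as navigation.

```python
# Hypothetical illustration of link-density pruning using only the stdlib.
from html.parser import HTMLParser

class LinkDensity(HTMLParser):
    """Accumulate total text length and text length inside <a> tags."""
    def __init__(self):
        super().__init__()
        self.total = 0
        self.linked = 0
        self.in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total += n
        if self.in_link:
            self.linked += n

def link_density(fragment: str) -> float:
    parser = LinkDensity()
    parser.feed(fragment)
    return parser.linked / parser.total if parser.total else 1.0

def should_prune(fragment: str, threshold: float = 0.8) -> bool:
    # A fixed threshold like this cannot work all the time: a job-listings
    # block is legitimate content that happens to be made of links.
    return link_density(fragment) >= threshold

nav = "<ul><li><a href='/a'>Home</a></li><li><a href='/b'>About</a></li></ul>"
jobs = "<div><p>Open roles at our company:</p><a href='/jobs/1'>Engineer</a></div>"
print(should_prune(nav), should_prune(jobs))  # → True False
```

All the text in the navigation fragment is linked (density 1.0), so it is pruned; the listings fragment has enough surrounding prose to fall under the threshold, but a pure list of job links would not.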
