Originally posted by mertdeveci5 February 29, 2024
I read that this might be a feature request, so I'm sharing it here in case someone has figured it out.
When using extract, I pass include_links=True. However, the links on the website are not scraped for some reason. I'm not sure whether I'm using this the wrong way, so I would appreciate anyone pointing me in the right direction.
Example:
# import the necessary functions
from trafilatura import fetch_url, extract, sitemaps
from rich import print as rprint
# grab a HTML file to extract data from
URL = "https://jam.dev/careers"
downloaded = fetch_url(URL)
sitemap = sitemaps.sitemap_search(URL)
# output main content and comments as plain text
result = extract(downloaded)
# change the output format to XML (allowing for preservation of document structure)
result = extract(downloaded, include_links=True, output_format="xml")
# discard potential comments and change the output to JSON
extract(downloaded, output_format="json", include_comments=False)
rprint(sitemap)
rprint(result)
Here most of the text is extracted EXCEPT the section where the job listings appear. It is critical for me to get this content, though.
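Since fetch_url only retrieves the static HTML, a useful first check is whether the missing job listings are present in the downloaded markup at all: if the page injects them with JavaScript, no extractor setting will recover them. Below is a minimal stdlib sketch of that check; the sample HTML and paths are illustrative, not taken from jam.dev:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect every href attribute found in raw HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Illustrative snippet standing in for the downloaded page source
html_source = """
<html><body>
  <main><a href="/careers/engineer">Software Engineer</a></main>
  <footer><a href="/privacy">Privacy</a></footer>
</body></html>
"""

collector = LinkCollector()
collector.feed(html_source)
print(collector.links)  # → ['/careers/engineer', '/privacy']
```

If the expected links are absent from the raw HTML, the listings are rendered client-side, and a headless browser (or the site's own API) would be needed before running extract.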
Usually the bottom section of a page contains unwanted links, but here there is actual content to be found. Especially with include_links enabled, relevant parts are missing.
Thanks for opening a bug issue for this. Since you mentioned "unwanted links", I'm also wondering whether there are settings I could experiment with. This might be related to a bug where these links are wrongly detected as unwanted.
Discussed in #516