Include links and Include formatting do not work together properly #511
Comments
Hi @ibestvina, this is a known issue. I'm not primarily working with these options and added them after feature requests, so the interaction between options can be patchy at times. I'm open to accepting PRs on the topic.
This is indeed a big issue: anything with a link is not scraped, which leaves out a lot of the page. Are there any PRs on this that we can help complete? This is critical for a scraper.
@mertdeveci5 There are no PRs at the moment, as it's not my main focus and nobody else seems to be contributing on this. Do you need both formatting and links? Links alone work fine; that would be the critical function for a scraper, e.g. in an SEO context (where Trafilatura is used).
Links themselves. To give you the full context: I tried to scrape jam.dev/careers. Trafilatura can scrape everything except the links at the bottom, where the actual job postings are listed. I tried it with a lot of websites, but for half of them it did not work. I couldn't figure out if I am doing something wrong.
This is another issue then: not a problem between extraction options but (probably) a case where the extractor misses the relevant section of the page. edit: see #518
version: 1.7.0.
Please see Problem 3 below as the main issue I am reporting. The first two problems are given just to make sure I didn't completely misunderstand how the library is supposed to work. Sorry for a very messy issue; it seems like any little change I make to the inputs completely changes the output.
Starting with the code:
I get results as expected:
Problem 1
Setting include_links=True does not change this output at all. I would expect the link to be included as a markdown slug url, but maybe I am misunderstanding what include_links does.
Problem 2
Setting include_formatting=True does not change the output either.
Problem 3 (main issue)
Setting <div class="content"> changes the above behavior, and now include_links and include_formatting on their own seem to work; however, the paragraph is always duplicated (see output below).
More importantly, if both include_formatting=True and include_links=True are set, then all the bold text jumps to the end of the paragraph and links are ignored.
Here is the code with changes applied to highlight the main issue I am reporting:
Output:
Additional note: this seems to only happen if there is no space between <em> and <a. When a space is added, links and formatting are completely ignored.
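For anyone who wants to narrow this down, here is a rough sketch of how the cases described above could be walked through systematically. The sample HTML is a stand-in written for illustration and is not the reporter's original snippet; build_html, describe_output and run_matrix are hypothetical helper names. It reads "no space between <em> and <a" literally (the link opens right after the <em> tag), which is only one possible interpretation. The checks assume links would show up in Markdown form, [text](url), as the reporter expected, and treat a duplicated paragraph as an exactly repeated non-empty line, which may be too naive. The only Trafilatura call used is extract with the include_links and include_formatting options named in this issue.

```python
from itertools import product

# Placeholder values for illustration only (not taken from the report).
URL = "https://example.org/page"
LINK_TEXT = "a link"


def build_html(content_class, space_between):
    """Build a small test page; the markup is an assumed stand-in."""
    div_open = '<div class="content">' if content_class else "<div>"
    gap = " " if space_between else ""
    return (
        "<html><body>"
        f"{div_open}"
        "<p>Plain text with <b>bold words</b> and "
        f'<em>{gap}<a href="{URL}">{LINK_TEXT}</a></em>'
        " before the sentence ends.</p>"
        "</div></body></html>"
    )


def describe_output(text, link_text, url):
    """Report two of the symptoms described in the issue for one output."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return {
        # Assumes a link would be rendered as a Markdown link.
        "has_markdown_link": f"[{link_text}]({url})" in text,
        # Naive check: the same non-empty line appears more than once.
        "has_duplicate_line": len(lines) != len(set(lines)),
    }


def run_matrix():
    """Try every combination of page variant and extraction option."""
    try:
        from trafilatura import extract
    except ImportError:
        print("trafilatura is not installed; skipping extraction")
        return []
    results = []
    for content_class, space, links, fmt in product([False, True], repeat=4):
        html = build_html(content_class, space)
        # extract() can return None when nothing is extracted.
        text = extract(html, include_links=links, include_formatting=fmt) or ""
        case = {"class": content_class, "space": space,
                "include_links": links, "include_formatting": fmt}
        results.append((case, describe_output(text, LINK_TEXT, URL)))
    return results


if __name__ == "__main__":
    for case, symptoms in run_matrix():
        print(case, symptoms)
```

Running it prints one line per combination, which should make it easier to see which pairing of page variant and options triggers the duplication or drops the link. It does not check for the bold text moving to the end of the paragraph; that would need a comparison against the expected word order.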