Include links and Include formatting do not work together properly #511

ibestvina · 2024-02-21T13:46:46Z

version: 1.7.0.

Please see Problem 3 below for the main issue I am reporting. The first two problems are included only to make sure I haven't completely misunderstood how the library is supposed to work. Sorry for the messy issue; it seems that any small change I make to the input completely changes the output.

Starting with the code:

from trafilatura import extract

html = """
<!DOCTYPE html>
<html>
<body>
	<div>
		<h1>
			This is the title of the page
		</h1>
		<p>
			This is a paragraph, and it contains a <em><a href="https://www.example.com/"> bolded link to some page</a>, some additional bolded text</em> and some text that is not bolded.
		</p>
	</div>
</body>
</html>
"""
    
result = extract(
    html,
    output_format="md",
    include_links=False,
    include_formatting=False,
)
print(result)

I get results as expected:

This is a paragraph, and it contains a bolded link to some page, some additional bolded text and some text that is not bolded.

Problem 1
Setting include_links=True does not change this output at all. I would expect the link to be rendered as a Markdown link, i.e. [text](url), but maybe I am misunderstanding what include_links does.

Problem 2
Setting include_formatting=True does not change the output either.

Problem 3 (main issue)
Changing the wrapper to <div class="content"> changes the behavior above: include_links and include_formatting now seem to work on their own, but the paragraph is always duplicated (see output below).

More importantly, if both include_formatting=True and include_links=True are set, all the bold text jumps to the end of the paragraph and the links are ignored.
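For reference, here is a minimal sketch of the rendering one might expect when both options are enabled. It uses only the standard library and is not Trafilatura's implementation: the InlineMarkdown class and the to_markdown helper are hypothetical names made up for this illustration, and only <em> and <a> are handled.

```python
from html.parser import HTMLParser


class InlineMarkdown(HTMLParser):
    """Toy converter for <em> and <a> only (illustration, not Trafilatura code)."""

    def __init__(self, include_links=True, include_formatting=True):
        super().__init__()
        self.include_links = include_links
        self.include_formatting = include_formatting
        self.parts = []   # finished output pieces
        self.link = None  # (href, [label pieces]) while inside an <a> element

    def _emit(self, text):
        # Text inside a link goes into the link label, everything else to the output.
        target = self.link[1] if self.link else self.parts
        target.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "em" and self.include_formatting:
            self._emit("*")
        elif tag == "a" and self.include_links and self.link is None:
            self.link = (dict(attrs).get("href") or "", [])

    def handle_endtag(self, tag):
        if tag == "em" and self.include_formatting:
            self._emit("*")
        elif tag == "a" and self.link is not None:
            href, pieces = self.link
            self.link = None
            label = " ".join("".join(pieces).split())
            self.parts.append(f"[{label}]({href})")

    def handle_data(self, data):
        self._emit(data)

    def result(self):
        # Collapse runs of whitespace (tabs, newlines) into single spaces.
        return " ".join("".join(self.parts).split())


def to_markdown(fragment, include_links=True, include_formatting=True):
    parser = InlineMarkdown(include_links, include_formatting)
    parser.feed(fragment)
    parser.close()
    return parser.result()


# The paragraph from the report, without the surrounding page.
FRAGMENT = (
    'This is a paragraph, and it contains a <em><a href="https://www.example.com/">'
    " bolded link to some page</a>, some additional bolded text</em>"
    " and some text that is not bolded."
)

print(to_markdown(FRAGMENT))
```

With both flags on, the emphasis markers stay around the link and the trailing text, and the link keeps its URL. The sketch ignores whitespace edge cases (for example, formatting without links leaves a space after the opening asterisk), so it only illustrates the expected shape of the output.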

Here is the code with changes applied to highlight the main issue I am reporting:

from trafilatura import extract

html = """
<!DOCTYPE html>
<html>
<body>
	<div class="content">
		<h1>
			This is the title of the page
		</h1>
		<p>
			This is a paragraph, and it contains a <em><a href="https://www.example.com/"> bolded link to some page</a>, some additional bolded text</em> and some text that is not bolded.
		</p>
	</div>
</body>
</html>
"""
    
result = extract(
    html,
    output_format="md",
    include_links=True,
    include_formatting=True,
)
print(result)

Output:

# This is the title of the page
This is a paragraph, and it contains a
* and some text that is not bolded.
bolded link to some page, some additional bolded text*
This is a paragraph, and it contains a
* and some text that is not bolded.
bolded link to some page, some additional bolded text*

Additional note: this seems to happen only if there is no space between <em> and <a>. When a space is added, links and formatting are completely ignored.
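The duplication could be flagged automatically in a regression test. The following is a small standard-library sketch; duplicated_lines is a hypothetical helper name, and the sample text is the output pasted above rather than a fresh run of the library.

```python
from collections import Counter


def duplicated_lines(text):
    """Return {line: count} for non-empty lines that occur more than once."""
    counts = Counter(line.strip() for line in text.splitlines() if line.strip())
    return {line: n for line, n in counts.items() if n > 1}


# Output copied from the report above.
REPORTED_OUTPUT = """\
# This is the title of the page
This is a paragraph, and it contains a
* and some text that is not bolded.
bolded link to some page, some additional bolded text*
This is a paragraph, and it contains a
* and some text that is not bolded.
bolded link to some page, some additional bolded text*
"""

for line, n in duplicated_lines(REPORTED_OUTPUT).items():
    print(f"{n}x {line}")
```

An empty result would indicate that no line is repeated. Pages that legitimately repeat a line would need a looser check.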


adbar · 2024-02-21T16:25:41Z

Hi @ibestvina, this is a known issue. I'm not primarily working with these options and added them after feature requests, so the interactions between options can be patchy at times. I'm open to accepting PRs on the topic.

mertdeveci5 · 2024-02-29T14:25:03Z

This is indeed a big issue, as anything with a link is not scraped, which leaves out a lot of the page. Are there any PRs on this that we can help complete? This is critical for a scraper.

adbar · 2024-03-01T12:36:36Z

@mertdeveci5 There are no PRs at the moment, as it's not my main focus and nobody else seems to be contributing on this. Do you need both formatting and links? Links alone work fine, and that would be the critical function for a scraper, e.g. in an SEO context (where Trafilatura is used).

mertdeveci5 · 2024-03-01T16:32:49Z

Links themselves. To give you the full context: I tried to scrape jam.dev/careers

Trafilatura can scrape everything except the links at the bottom, where the actual job postings are listed. I tried it with a lot of websites, but for half of them it did not work. I couldn't figure out whether I am doing something wrong.

adbar · 2024-03-01T17:59:43Z

This is another issue then, not a problem between extraction options but (probably) a case where the extractor misses the relevant section of the page.

edit: see #518

adbar added the bug label on Feb 21, 2024
