Include links and Include formatting do not work together properly #511

Open
ibestvina opened this issue Feb 21, 2024 · 5 comments
Labels
bug Something isn't working

Comments

ibestvina commented Feb 21, 2024

version: 1.7.0.

Please see Problem 3 below as the main issue I am reporting. The first two problems are included only to check that I haven't completely misunderstood how the library is supposed to work. Apologies for the messy issue; it seems that any small change to the inputs completely changes the output.

Starting with the code:

from trafilatura import extract

html = """
<!DOCTYPE html>
<html>
<body>
	<div>
		<h1>
			This is the title of the page
		</h1>
		<p>
			This is a paragraph, and it contains a <em><a href="https://www.example.com/"> bolded link to some page</a>, some additional bolded text</em> and some text that is not bolded.
		</p>
	</div>
</body>
</html>
"""
    
result = extract(
    html,
    output_format="md",
    include_links=False,
    include_formatting=False,
)
print(result)

I get results as expected:

This is a paragraph, and it contains a bolded link to some page, some additional bolded text and some text that is not bolded.

Problem 1
Setting include_links=True does not change this output at all. I would expect the link to be rendered as a Markdown link, but maybe I am misunderstanding what include_links does.

Problem 2
Setting include_formatting=True does not change the output either.
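
To make Problems 1 and 2 easier to reproduce, the four flag combinations can be run side by side. This is only a minimal sketch: `compare_options` and `print_comparison` are hypothetical helper names (not part of Trafilatura), and the extraction function is passed in as an argument on the assumption that it accepts the same keyword arguments used in the snippet above.

```python
from itertools import product


def compare_options(html, extract_fn, output_format="md"):
    """Call extract_fn once per (include_links, include_formatting) combination.

    extract_fn is assumed to accept the keyword arguments used in this issue.
    Returns a dict mapping (include_links, include_formatting) to the output.
    """
    results = {}
    for links, formatting in product([False, True], repeat=2):
        results[(links, formatting)] = extract_fn(
            html,
            output_format=output_format,
            include_links=links,
            include_formatting=formatting,
        )
    return results


def print_comparison(results):
    """Print each output under a header naming the flags that produced it."""
    for (links, formatting), text in results.items():
        print(f"--- include_links={links}, include_formatting={formatting} ---")
        print(text)


# Hypothetical usage, assuming trafilatura is installed:
#   from trafilatura import extract
#   print_comparison(compare_options(html, extract))
```

Comparing the four outputs directly should show whether either flag has any effect for a given input.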

Problem 3 (main issue)
Changing the <div> to <div class="content"> changes the behavior above: include_links and include_formatting now seem to work on their own, but the paragraph is always duplicated (see output below).

More importantly, if both include_formatting=True and include_links=True are set, all the bold text jumps to the end of the paragraph and the links are ignored.

Here is the code with changes applied to highlight the main issue I am reporting:

html = """
<!DOCTYPE html>
<html>
<body>
	<div class="content">
		<h1>
			This is the title of the page
		</h1>
		<p>
			This is a paragraph, and it contains a <em><a href="https://www.example.com/"> bolded link to some page</a>, some additional bolded text</em> and some text that is not bolded.
		</p>
	</div>
</body>
</html>
"""
    
result = extract(
    html,
    output_format="md",
    include_links=True,
    include_formatting=True,
)
print(result)

Output:

# This is the title of the page
This is a paragraph, and it contains a
* and some text that is not bolded.
bolded link to some page, some additional bolded text*
This is a paragraph, and it contains a
* and some text that is not bolded.
bolded link to some page, some additional bolded text*

Additional note: this seems to only happen if there is no space between <em> and <a. When space is added, links and formatting are completely ignored.
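
The two symptoms in the output above (the repeated paragraph and the missing URL) can also be checked programmatically, which may help when testing a fix. The sketch below uses only the standard library; `duplicated_lines` and `contains_url` are hypothetical helper names, and `REPORTED_OUTPUT` is the output copied from this report.

```python
from collections import Counter

# Output copied verbatim from the report above.
REPORTED_OUTPUT = """# This is the title of the page
This is a paragraph, and it contains a
* and some text that is not bolded.
bolded link to some page, some additional bolded text*
This is a paragraph, and it contains a
* and some text that is not bolded.
bolded link to some page, some additional bolded text*
"""


def duplicated_lines(text):
    """Return the non-empty lines that occur more than once, in first-seen order."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    counts = Counter(lines)
    seen = []
    for line in lines:
        if counts[line] > 1 and line not in seen:
            seen.append(line)
    return seen


def contains_url(text, url):
    """Return True if the URL appears anywhere in the extracted text."""
    return url in text


if __name__ == "__main__":
    print(duplicated_lines(REPORTED_OUTPUT))
    print(contains_url(REPORTED_OUTPUT, "https://www.example.com/"))
```

An exact line comparison is deliberately simple; it flags the duplication reported here but would not catch near-duplicates that differ in whitespace inside a line.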

adbar added the bug label Feb 21, 2024
Owner

adbar commented Feb 21, 2024

Hi @ibestvina, this is a known issue. I'm not primarily working with these options and added them after feature requests, so the interaction between options can be patchy at times. I'm open to accepting PRs on the topic.

@mertdeveci5

This is indeed a big issue, as anything with a link is not scraped, which leaves out a lot of the page. Are there any PRs on this that we could help complete? This is critical for a scraper.

Owner

adbar commented Mar 1, 2024

@mertdeveci5 There are no PRs at the moment, as it's not my main focus and nobody else seems to be contributing on this. Do you need both formatting and links? Links alone work fine; that would be the critical function for a scraper, e.g. in an SEO context (where Trafilatura is used).

@mertdeveci5

Just the links. To give you the full context: I tried to scrape jam.dev/careers

Trafilatura can scrape everything except the links at the bottom, where the actual job postings are listed. I tried it with a lot of websites, but for half of them it did not work. I couldn't figure out whether I am doing something wrong.

Owner

adbar commented Mar 1, 2024

This is another issue then, not a problem between extraction options but (probably) a case where the extractor misses the relevant section of the page.

edit: see #518
