include_links option mixes texts and links #476

Open
hugoobauer opened this issue Jan 12, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@hugoobauer

When I activate the "links" option, some links are not correctly extracted. When I was looking for the cause of the problem, I noticed that there are text shifts when the option is activated.

Here's an example on https://www.menshealth.com/entertainment/g42398628/best-movies-2023/ :
image
Above is a screenshot of the web page. Below is the extracted part showing the bug. You can see that the link text "Cabin at the End of the World" after "based on the novel" has disappeared from its original position and has been moved to the position of the next link, a little further on, after "(led by", where "Dave Bautista" should be.
image

I only started examining the package today, so I don't have any hints yet.

@adbar adbar added the question Further information is requested label Jan 15, 2024
@adbar
Owner

adbar commented Jan 15, 2024

Hi @hugoobauer, I don't know how you generated the output, so I can't reproduce it exactly, but my impression is that in TXT output (and not in XML) a new line is added before the link, which may be the issue here:

Knock at the Cabin—based on the novel
[Cabin at the End of the World](https://www.amazon.com/Cabin-End-World-Novel/dp/0062679104)—finds a family taking a trip

What do you think?

@hugoobauer
Author

Hi @adbar,
I just figured out that the issue occurs when the --formatting parameter is combined with the --links parameter. You can reproduce it from the command line, for example:
trafilatura -u "https://www.menshealth.com/entertainment/g42398628/best-movies-2023/" --formatting --links
The bug also occurs with the --xml parameter.

If the parameters are used individually, there is no offset in the text. The problem only arises when --formatting and --links are used at the same time.

@adbar adbar added bug Something isn't working and removed question Further information is requested labels Jan 16, 2024
@adbar
Owner

adbar commented Jan 16, 2024

Thanks for the further details. There is a mismatch in the way formatting and links are handled here. At first sight I'm not sure which part of the code to change, so I'll leave the thread open.

@hugoobauer
Author

hugoobauer commented Jan 16, 2024

It looks like it comes from the handle_textnode function.

image

This node meets the condition element.text is None, so element.text, element.tail = element.tail, '' is applied to the node.
With my limited understanding of the code, I have the impression that this function doesn't take the depth of the elements into account. Since the link sits inside a "formatting" node, one level deeper, the text "finds a family [...] (led by", which was in the tail, is moved in front of the link (into element.text). I hope this helps.
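The effect described above can be sketched with the standard library's ElementTree, which uses the same text/tail model. This is only an illustration on a simplified fragment: the markup is shortened, `swap_tail_if_leaf` is a hypothetical helper rather than a function from the package, and the sketch reproduces only the reordering, not the paragraph split.

```python
import xml.etree.ElementTree as ET

MARKUP = (
    '<p>based on the novel '
    '<em><a href="#">Cabin at the End of the World</a></em>'
    ' finds a family (led by <a href="#">Dave Bautista</a>).</p>'
)

# Simplified stand-in for the paragraph discussed in this issue.
p = ET.fromstring(MARKUP)
em = p.find("em")

# <em> has no text of its own (its content sits in the child <a>),
# and the words after </em> are stored as the tail of <em>.
assert em.text is None
assert em.tail == " finds a family (led by "

before = "".join(p.itertext())

# The swap quoted above, applied without checking for child elements.
if em.text is None:
    em.text, em.tail = em.tail, ""

after = "".join(p.itertext())
print(before)
print(after)


def swap_tail_if_leaf(element):
    """Hypothetical guard: move the tail only when there are no children."""
    if element.text is None and len(element) == 0:
        element.text, element.tail = element.tail, ""


# With the guard, the <em> that wraps a link is left untouched.
p_guarded = ET.fromstring(MARKUP)
swap_tail_if_leaf(p_guarded.find("em"))
guarded = "".join(p_guarded.itertext())
print(guarded)
```

After the unguarded swap, the tail text is read before the link text, which matches the shifted order seen in the output. Whether a child check like the one in the guard is the right fix for the package itself is an open assumption.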

@adbar
Owner

adbar commented Jan 17, 2024

I tried to isolate the problem. Does this replicate it well enough? The title of the novel is misplaced (as you say) and the paragraph gets broken in two.

Input:

<html>
<body>
<article>
<p><em>Knock at the Cabin</em>—based on the novel <em><a href="https://www.amazon.com/Cabin-End-World-Novel/dp/0062679104?linkCode=ogi&amp;tag=menshealth-auto-20&amp;ascsubtag=">Cabin at the End of the World</a>—</em>finds a family taking a trip to a secluded cabin only to be confronted by violent conspiracy theorists (led by <a href="https://www.menshealth.com/entertainment/a37712403/dave-bautista-dune-movie-interview/">Dave Bautista</a>)who hold them hostage for increasingly wild reasons.</p>
</article>
</body>
</html>

Output (--formatting --links --xml):

<doc categories="" tags="" fingerprint="5adafb74d0d7300e">
  <main>
    <p><hi rend="#i">Knock at the Cabin</hi>—based on the novel <hi rend="#i">finds a family taking a trip to a secluded cabin only to be confronted by violent conspiracy theorists (led by Cabin at the End of the World—</hi></p>
    <p><ref target="https://www.menshealth.com/entertainment/a37712403/dave-bautista-dune-movie-interview/">Dave Bautista</ref>)who hold them hostage for increasingly wild reasons.</p>
  </main>
  <comments/>
</doc>
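A minimal example like this one could be turned into an order check along the following lines. This is a sketch using only the standard library: the two strings are shortened copies of the input and output above (link targets replaced by "#", some wording trimmed), and `flat_text` and `phrase_order` are hypothetical helper names, not part of the package.

```python
import xml.etree.ElementTree as ET


def flat_text(markup: str) -> str:
    """Join all text nodes in document order and collapse whitespace."""
    root = ET.fromstring(markup)
    return " ".join("".join(root.itertext()).split())


def phrase_order(text: str, phrases: list[str]) -> list[str]:
    """Return the phrases sorted by where they first appear in the text."""
    return sorted(phrases, key=text.index)


# Shortened copies of the input paragraph and of the XML output above.
source = (
    '<p><em>Knock at the Cabin</em>\u2014based on the novel '
    '<em><a href="#">Cabin at the End of the World</a>\u2014</em>'
    'finds a family taking a trip (led by <a href="#">Dave Bautista</a>)'
    'who hold them hostage.</p>'
)
extracted = (
    '<main><p><hi rend="#i">Knock at the Cabin</hi>\u2014based on the novel '
    '<hi rend="#i">finds a family taking a trip (led by '
    'Cabin at the End of the World\u2014</hi></p>'
    '<p><ref target="#">Dave Bautista</ref>)who hold them hostage.</p></main>'
)

phrases = ["Cabin at the End of the World", "finds a family", "Dave Bautista"]
expected = phrase_order(flat_text(source), phrases)
observed = phrase_order(flat_text(extracted), phrases)

print(expected)
print(observed)
print("order preserved:", expected == observed)
```

On the output shown above the check reports that the order is not preserved, because "finds a family" now comes before the title of the novel.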

@hugoobauer
Author

Yes, your minimal example looks good! You could also remove the 3 empty <em></em> after the last link to reduce noise; I don't think they serve much purpose.
