include_links option mixes texts and links #476
Comments
Hi @hugoobauer, I don't know how you generated the output, so I can't reproduce it exactly, but my impression is that in TXT output (and not in XML) a new line is added before the link, which may be the issue here:
What do you think?
Hi @adbar, if the parameters are used individually, there is no offset in the text. However, if
Thanks for the further details, there is a mismatch in the way formatting and links are handled here. At first sight I'm not sure which part of the code to change, so I'll leave the thread open.
It looks like it comes from the This node meets the condition
I tried to isolate the problem, does that replicate it efficiently enough? The title of the novel is misplaced (as you say) and the paragraph gets broken in two. Input:
Output (
Yes, your minimal example looks good! You could also remove the 3 empty
When I activate the "links" option, some links are not correctly extracted. When I was looking for the cause of the problem, I noticed that there are text shifts when the option is activated.
Here's an example on https://www.menshealth.com/entertainment/g42398628/best-movies-2023/ :
Above is a screenshot of the web page. Below is the part extracted with the bug. You can see that the link text "Cabin at the End of the World", which should follow "based on the novel", has disappeared from that position and has been moved to the place of the next link a little further on, "(led by Dave Bautista)".
I only started examining the package today, so I don't have any hints yet.
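To make "text shifts" checkable, here is a minimal sketch of the invariant I would expect: once the link markup is stripped, the output with links should be identical to the output without links. This is not trafilatura's code; it uses only the standard library, and the sample HTML, the example.org URLs, and the helper names (`render`, `strip_links`) are all made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Matches a Markdown-style link "[text](url)" and captures the text part.
LINK_RE = re.compile(r"\[([^\]]*)\]\([^)]*\)")

# Hypothetical snippet loosely modelled on the paragraph described above.
SAMPLE = (
    '<p>It is based on the novel '
    '<a href="https://example.org/novel">Cabin at the End of the World</a>'
    ' and has a strong cast '
    '<a href="https://example.org/cast">(led by Dave Bautista)</a>.</p>'
)


def render(elem, with_links):
    """Serialize an element in document order, optionally marking up links."""
    if elem.tag == "a" and with_links:
        body = f"[{''.join(elem.itertext())}]({elem.get('href', '')})"
    else:
        body = (elem.text or "") + "".join(render(c, with_links) for c in elem)
    # The tail is the text that follows the closing tag; it has to come
    # after the element's own content, otherwise text ends up displaced.
    return body + (elem.tail or "")


def strip_links(text):
    """Replace every "[text](url)" with just "text"."""
    return LINK_RE.sub(r"\1", text)


root = ET.fromstring(SAMPLE)
plain = render(root, with_links=False)
linked = render(root, with_links=True)

# Invariant: enabling links must not move or drop any text.
print(strip_links(linked) == plain)
```

The same comparison could be run on two real extraction results (one with links, one without) to find the first position where they diverge.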