Extract more text #488
Comments
I just tried making a change in the file https://github.com/adbar/trafilatura/blob/master/trafilatura/core.py, and I could then get all the text with index 1., 2., and so on. Can anyone explain why the loop breaks when len(result_body) > 1? Thank you.
Thank you for your feedback. The output is weird because the text is contained by a My guess is that it's a similar problem as multiple How to tackle these segments is an open question, see #432 and #487. Feel free to try something out and draft a pull request if you're interested.
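To illustrate the question about the loop, here is a toy sketch in plain Python. It is not trafilatura's actual code: the selectors, the page segments, and both function names are hypothetical stand-ins. It only shows the general effect of stopping at the first candidate that yields more than one element, compared with collecting from every candidate.

```python
from typing import Callable, List

# Hypothetical stand-in: a "selector" takes the page segments and returns
# the ones it matches (in a real extractor this would be an XPath query).
Selector = Callable[[List[str]], List[str]]


def first_match(selectors: List[Selector], segments: List[str]) -> List[str]:
    """Stop at the first selector whose result has more than one element."""
    result: List[str] = []
    for select in selectors:
        result = select(segments)
        if len(result) > 1:
            break  # later selectors are never tried
    return result


def collect_all(selectors: List[Selector], segments: List[str]) -> List[str]:
    """Run every selector and keep each matched segment once, in order."""
    result: List[str] = []
    for select in selectors:
        for segment in select(segments):
            if segment not in result:
                result.append(segment)
    return result


# Made-up page: the numbered text is split across two containers.
PAGE = ["intro", "1. first point", "2. second point", "3. third point"]
SELECTORS: List[Selector] = [
    lambda segs: segs[:2],  # first container: intro and point 1
    lambda segs: segs[2:],  # second container: points 2 and 3
]

print(first_match(SELECTORS, PAGE))
print(collect_all(SELECTORS, PAGE))
```

With the early break, only the first container's segments come back. Collecting from every selector returns more text, but in a real extractor it would also risk pulling in unwanted content, which may be part of why handling multiple segments is described as an open question.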
Hey @adbar, Thank you.
@felipehertzer Can you try adding it to your PR in #509?
@adbar I tested the
@felipehertzer Yes, let's try that.
For this URL:
url = "https://www.aia.com/en/health-wellness/healthy-living/healthy-mind/Managing-financial-stress"
I use:
import trafilatura
downloaded = trafilatura.fetch_url(url)
trafilatura.bare_extraction(downloaded, url=url)
I get the text, and the result is good. However, it only contains the text under index 1., while the website has text under indexes 1., 2., 3., 4., and 5.
Even though I used favor_recall=True, nothing changed.
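One way to check an extraction result for this symptom is to look for the numbered items in the extracted text. The following is a minimal sketch using only the standard library. The helper names `found_indexes` and `missing_indexes` and the sample string are made up for illustration, and the sketch assumes each numbered item starts its own line as "N. ".

```python
import re
from typing import Iterable, List


def found_indexes(text: str) -> List[int]:
    """Return the sorted numbers that begin a line as 'N. ' in the text."""
    matches = re.findall(r"^\s*(\d+)\.\s", text, flags=re.MULTILINE)
    return sorted({int(m) for m in matches})


def missing_indexes(text: str, expected: Iterable[int]) -> List[int]:
    """Return the expected numbers that do not begin any line in the text."""
    present = set(found_indexes(text))
    return [n for n in expected if n not in present]


# Made-up sample standing in for extracted text that stops after item 1.
sample = "Introduction\n1. First numbered item\nSome body text."

print(found_indexes(sample))
print(missing_indexes(sample, range(1, 6)))
```

The same check could be applied to the text returned by the extraction call above, to compare runs with and without favor_recall=True.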
Thank you, however, for this library, it really is better than bs4!