Extract more text #488

Open
vulinh48936 opened this issue Jan 26, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@vulinh48936

vulinh48936 commented Jan 26, 2024

For this URL:

url = "https://www.aia.com/en/health-wellness/healthy-living/healthy-mind/Managing-financial-stress"

I use:

downloaded = trafilatura.fetch_url(url)
trafilatura.bare_extraction(downloaded, url=url)

I get the text, and the result is good. However, the output only contains the text under item 1., while the web page has items 1. through 5.

Even though I used favor_recall=True, nothing changed.

Thank you for this library, though; it really is better than bs4!

@vulinh48936
Author

vulinh48936 commented Jan 26, 2024

I just tried changing

if len(result_body) > 1:
    LOGGER.debug(expr)
    break

in the file https://github.com/adbar/trafilatura/blob/master/trafilatura/core.py, and I could then get all the text for items 1., 2., and so on.

Can anyone explain why the loop breaks when len(result_body) > 1?

Thank you.

@adbar added the bug label on Jan 26, 2024
@adbar
Owner

adbar commented Jan 26, 2024

Thank you for your feedback. The output is odd because the text sits inside a <div class="cmp-section__content"> element, which the rule-based XPath expressions do not match because that class is rare or not really meaningful. So the extractor looks for text elements and gets confused, because the original article uses multiple <div class="text"> elements where only one is expected.

My guess is that this is a similar problem to multiple <article> elements. The len(result_body) > 1 check is used because adding further elements usually introduces noise (teasers at the bottom, unrelated text, etc.).

How to tackle these segments is an open question, see #432 and #487. Feel free to try something out and draft a pull request if you're interested.
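
The trade-off described above can be illustrated with a small, self-contained sketch. It is not trafilatura's actual code: the function names, the list-of-strings model of a subtree, and the sample text are all assumptions made for illustration. It only shows why stopping once len(result_body) > 1 drops later sections, and why collecting everything tends to pull in a teaser.

```python
from typing import List


def first_match(subtrees: List[List[str]]) -> List[str]:
    """Stop at the first candidate that yields more than one text chunk.

    Hypothetical stand-in for the `len(result_body) > 1` exit condition.
    """
    result_body: List[str] = []
    for chunks in subtrees:
        # Keep only non-empty chunks from this candidate subtree.
        result_body.extend(chunk for chunk in chunks if chunk.strip())
        if len(result_body) > 1:
            break
    return result_body


def collect_all(subtrees: List[List[str]]) -> List[str]:
    """Keep every non-empty chunk from every candidate (no early exit)."""
    return [chunk for chunks in subtrees for chunk in chunks if chunk.strip()]


# Made-up sample: two content sections followed by an unrelated teaser.
SUBTREES = [
    ["1. First tip", "Body of the first tip."],
    ["2. Second tip", "Body of the second tip."],
    ["You may also like ..."],
]

if __name__ == "__main__":
    print(first_match(SUBTREES))   # only the first section survives
    print(collect_all(SUBTREES))   # all sections, plus the teaser
```

With this sample, the early exit keeps the first section only, while the greedy variant recovers the second section at the cost of including the teaser line.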

@felipehertzer
Contributor

Hey @adbar,
I have a similar problem with the site Stuff: only half of the content is extracted because they use the class 'stuff-article', which is very odd. I tried adding 'or contains(@class, '-article')' and it worked, but I am not sure how broadly it will match. Do you have any other suggestions?

Thank you.

@adbar
Owner

adbar commented Feb 15, 2024

@felipehertzer Can you try adding it to your PR in #509? ends-with(@class, '-article') could work, but I don't remember whether it's supported by LXML.

@felipehertzer
Contributor

@adbar I tested ends-with and LXML does not seem to support it. Do you want me to include contains(@class, "-article") instead?

@adbar
Owner

adbar commented Feb 16, 2024

@felipehertzer Yes, let's try that.
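
To get a feel for how much broader contains() is than an ends-with test, the two predicates can be mirrored in plain Python on a few class values. This is only a sketch: the helper names are hypothetical, and every class string except 'stuff-article' is made up. ends-with() is an XPath 2.0 function, while lxml implements XPath 1.0; the usual XPath 1.0 workaround is substring(@class, string-length(@class) - string-length('-article') + 1) = '-article', which the second helper imitates.

```python
SUFFIX = "-article"


def xpath_contains(class_attr: str, needle: str = SUFFIX) -> bool:
    """Mirror XPath contains(@class, needle): a plain substring test."""
    return needle in class_attr


def xpath_ends_with(class_attr: str, suffix: str = SUFFIX) -> bool:
    """Mirror the XPath 1.0 substring()/string-length() suffix idiom."""
    # XPath positions are 1-based; a start below 1 selects the whole string.
    start = len(class_attr) - len(suffix) + 1
    return class_attr[max(start, 1) - 1:] == suffix


# Hypothetical class attribute values, apart from 'stuff-article'.
SAMPLES = [
    "stuff-article",
    "stuff-article wide",
    "related-article-teaser",
    "article",
]

if __name__ == "__main__":
    for value in SAMPLES:
        print(f"{value!r}: contains={xpath_contains(value)}, "
              f"ends_with={xpath_ends_with(value)}")
```

The substring test also matches values such as 'related-article-teaser', which is the kind of breadth to watch for; the suffix test is stricter but misses multi-class attributes where '-article' is not the last token.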
