Extract more text #488

vulinh48936 · 2024-01-26T09:40:10Z

For this URL:
url = "https://www.aia.com/en/health-wellness/healthy-living/healthy-mind/Managing-financial-stress"
I use:
downloaded = trafilatura.fetch_url(url)
trafilatura.bare_extraction(downloaded, url=url)

I get the text, and the result is good. However, it only contains the text of numbered item 1, while the web page has numbered items 1 through 5.

Even though I used favor_recall=True, nothing changed.

Thank you, however, for this library, it really is better than bs4!

vulinh48936 · 2024-01-26T10:26:03Z

I just tried changing

if len(result_body) > 1:
    LOGGER.debug(expr)
    break

in the file https://github.com/adbar/trafilatura/blob/master/trafilatura/core.py, and with that change I could get all the text for items 1, 2, and so on.

Can anyone explain why the loop breaks when len(result_body) > 1?

Thank you.

adbar · 2024-01-26T12:03:47Z

Thank you for your feedback. The output is weird because the text is contained in a <div class="cmp-section__content"> element, which the rule-based XPath expressions don't find because that class is rare or not really meaningful. So the extractor looks for text elements and gets confused because the original article uses multiple <div class="text"> elements where only one is expected.

My guess is that this is similar to the problem with multiple <article> elements. The len(result_body) > 1 check is there because adding more elements usually introduces noise (teasers at the bottom, unrelated text, etc.).

How to tackle these segments is an open question, see #432 and #487. Feel free to try something out and draft a pull request if you're interested.
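To illustrate the trade-off described above, here is a minimal sketch using only the standard library. It is not trafilatura's code: the CANDIDATE_PATHS list, the select_body helper, and the sample markup are all hypothetical. It only shows why taking the first matching container keeps item 1 alone, while taking every match also keeps item 2 (and, on a real page, possibly teasers).

```python
import xml.etree.ElementTree as ET

# Hypothetical candidate expressions, ordered from most to least specific.
# These are illustrative and not the expressions trafilatura actually uses.
CANDIDATE_PATHS = [
    ".//article",
    ".//div[@class='post-content']",
    ".//div[@class='text']",
]


def select_body(root):
    """Return (expression, elements) for the first expression that matches."""
    for path in CANDIDATE_PATHS:
        matches = root.findall(path)
        if matches:
            # Stop at the first hit: later expressions are never tried.
            return path, matches
    return None, []


# Simplified, well-formed stand-in for a page with several text containers.
PAGE = """<html><body>
<div class="text"><p>1. First tip</p></div>
<div class="text"><p>2. Second tip</p></div>
<div class="teaser"><p>Read more</p></div>
</body></html>"""

root = ET.fromstring(PAGE)
path, matches = select_body(root)

# Keeping only the first matching element drops the remaining sections.
first_only = "".join(matches[0].itertext()).strip()

# Keeping every match recovers them, at the risk of pulling in noise.
all_matches = ["".join(m.itertext()).strip() for m in matches]
```

With this sample, first_only holds only the first section and all_matches holds both. Which behavior is preferable depends on how often the extra matches are real content rather than teasers.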

felipehertzer · 2024-02-15T02:47:35Z

Hey @adbar,
I have a similar problem with the site Stuff: only half of the content is extracted, because they use the class 'stuff-article', which is very odd. I tried adding "or contains(@class, '-article')" and it worked, but I'm not sure how broadly that expression will match. Do you have any other suggestions?

Thank you.

adbar · 2024-02-15T12:56:15Z

@felipehertzer Can you try adding it to your PR in #509? ends-with(@class, '-article') could work, but I don't remember whether it's supported by lxml.

felipehertzer · 2024-02-15T22:26:33Z

@adbar I tested ends-with and lxml does not seem to support it. Do you want me to include contains(@class, "-article") instead?

adbar · 2024-02-16T15:21:05Z

@felipehertzer Yes, let's try that.
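For reference, ends-with() is an XPath 2.0 function, while lxml evaluates XPath 1.0. A suffix test can still be emulated in XPath 1.0 with substring() and string-length(). The sketch below compares that emulation with contains(); the sample markup and class names other than 'stuff-article' are made up for illustration, and this is not necessarily what the pull request ended up using.

```python
from lxml import etree, html

# Made-up markup: one real article container plus two look-alikes.
doc = html.fromstring("""<html><body>
<div class="stuff-article"><p>Body text</p></div>
<div class="article-list"><p>Teasers</p></div>
<div class="stuff-article-footer"><p>Footer</p></div>
</body></html>""")

SUFFIX = "-article"

# ends-with() is not defined in XPath 1.0, so lxml rejects the expression.
try:
    doc.xpath(f"//div[ends-with(@class, '{SUFFIX}')]")
    ends_with_supported = True
except etree.XPathEvalError:
    ends_with_supported = False

# XPath 1.0 emulation: compare the last len(SUFFIX) characters of @class.
ENDS_WITH = (
    "//div[substring(@class, string-length(@class) - "
    f"string-length('{SUFFIX}') + 1) = '{SUFFIX}']"
)
# Broader alternative discussed in the thread.
CONTAINS = f"//div[contains(@class, '{SUFFIX}')]"

strict = [d.get("class") for d in doc.xpath(ENDS_WITH)]
broad = [d.get("class") for d in doc.xpath(CONTAINS)]
```

Here the emulated suffix test selects only the 'stuff-article' element, whereas contains() also selects 'stuff-article-footer'. Note that the suffix test compares the whole attribute value, so it would miss an element whose class attribute lists further class names after the one ending in '-article'.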

adbar added the bug label on Jan 26, 2024.
