Trafilatura seems to pick up the cookie consent text instead of the article text. For reference, goose3 does not have this issue. I'd appreciate any input on what's causing this.
Thanks
Hi @praveng, thanks for your feedback. There is indeed a problem here: an aside element is removed during cleaning although it contains the main text, which therefore cannot be retrieved afterwards.
I am not sure why this happens, maybe a parsing or cleaning issue. I started drafting a workaround in #571.
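To illustrate the failure mode described above, here is a minimal standard-library sketch. It is not Trafilatura's actual cleaning code: the page markup, the `strip_tags` helper, and the `visible_text` helper are all hypothetical and only mimic a cleaner that drops aside elements before extraction.

```python
import xml.etree.ElementTree as ET

# Hypothetical page layout (an assumption for illustration): the article
# body is wrapped in <aside>, while the cookie banner sits in a plain <div>.
PAGE = """
<html><body>
  <div id="cookie-banner"><p>We use cookies to improve your experience.</p></div>
  <aside class="content"><p>Article text that should be extracted.</p></aside>
</body></html>
"""


def strip_tags(root, tags):
    """Remove every element whose tag is in `tags`, together with its subtree."""
    # Iterate over a snapshot so that removals do not disturb the traversal.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in tags:
                parent.remove(child)


def visible_text(root):
    """Join all non-empty text nodes of the tree into one string."""
    return " ".join(t.strip() for t in root.itertext() if t.strip())


root = ET.fromstring(PAGE.strip())
before = visible_text(root)

# A naive cleaning step that treats <aside> as boilerplate.
strip_tags(root, {"aside"})
after = visible_text(root)

print("before cleaning:", before)
print("after cleaning: ", after)
```

Once the aside subtree is gone, the cookie banner is the only text left for any later extraction step, which would match the symptom reported here.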
The failure is consistent across all articles on said sites; examples:
https://www.kiss1023.ca/2024/04/17/beyonce-is-bringing-her-fans-of-colour-to-country-music-will-they-be-welcomed-in/
https://www.kissottawa.com/2024/04/22/harmony-in-action-pop-stars-making-a-difference-for-the-environment/
https://www.country600.com/2024/04/22/embrace-the-earth-10-meaningful-activities-to-celebrate-earth-day/
https://www.chymfm.com/2024/04/22/fans-want-kelly-clarkson-to-be-a-disney-princess-voice-after-her-latest-kellyoke-pick/
And dozens more related sites.