Make cascade of different content extractors explicit and configurable #538

Open
adbar opened this issue Apr 3, 2024 · 0 comments
Labels
enhancement New feature or request

adbar commented Apr 3, 2024

So far, Trafilatura is entwined with a version of readability-lxml. It also uses jusText as a fallback before triggering the baseline extraction as a last resort. This combination is robust and performs well in the benchmark; however, it would be beneficial to refactor the code so that the extractor chain is exposed and can be configured.

The current configuration can be written as follows:

  • fast mode: ["trafilatura", "baseline"]
  • normal mode: ["trafilatura+readability", "justext", "baseline"]
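
To make the idea concrete, here is a minimal sketch of what an explicit, configurable cascade could look like. It is not Trafilatura's actual code or API: `CASCADES`, `run_cascade`, the `registry` mapping and the `min_length` threshold are all hypothetical names chosen for illustration, and the extractors are stand-in callables that return text or `None`.

```python
from typing import Callable, Dict, List, Optional, Tuple

# An extractor takes raw HTML and returns the extracted text, or None on failure.
Extractor = Callable[[str], Optional[str]]

# Hypothetical presets mirroring the two modes listed above.
CASCADES: Dict[str, List[str]] = {
    "fast": ["trafilatura", "baseline"],
    "normal": ["trafilatura+readability", "justext", "baseline"],
}


def run_cascade(
    html: str,
    steps: List[str],
    registry: Dict[str, Extractor],
    min_length: int = 1,
) -> Tuple[Optional[str], Optional[str]]:
    """Try each named extractor in order and return (step_name, text) for the
    first result with at least `min_length` non-whitespace-trimmed characters.
    Returns (None, None) if every step fails."""
    for name in steps:
        extractor = registry[name]  # raises KeyError for an unknown step name
        text = extractor(html)
        if text is not None and len(text.strip()) >= min_length:
            return name, text
    return None, None


if __name__ == "__main__":
    # Stand-in extractors (assumptions, not the real libraries).
    demo_registry: Dict[str, Extractor] = {
        "trafilatura": lambda html: None,  # pretend the main extractor failed
        "baseline": lambda html: "fallback text",
    }
    print(run_cascade("<html></html>", CASCADES["fast"], demo_registry))
    # prints ('baseline', 'fallback text')
```

With a design along these lines, a user could pass a custom list of step names to reorder or drop extractors, while the two presets keep the current behaviour as the default.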

adbar added the enhancement (New feature or request) label on Apr 3, 2024