focused_crawl returns nothing #589
The task is complex, and the focused crawler integrated into Trafilatura does not solve every problem. I cannot answer this question in general. Do you have a precise example for me to reproduce?
Here is my code with an example input. Code:
P.S.: the page has nothing to do with my task; I am sharing a random URL that fails in the extraction process.
If you set the logging level to debug, you'll see that the download fails (403 error), so there are no links to extract.
How can I fix it? I want to be able to get the news links from the website.
You have to use a more capable download utility to make sure you get the full content; then you can run Trafilatura on the HTML.
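One way to follow this advice is to fetch the page yourself and extract the links from the returned HTML. The sketch below uses only the standard library; the names (`fetch_html`, `extract_links`, `LinkCollector`) and the browser-like User-Agent workaround are my own assumptions, not part of Trafilatura's API, and some sites blocking the default Python client may still require a full browser-automation tool:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.request


class LinkCollector(HTMLParser):
    """Collects href targets from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def fetch_html(url):
    # Some servers answer 403 to the default Python User-Agent,
    # so send a browser-like one (a workaround, not a guarantee).
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

The downloaded HTML can then be passed to Trafilatura (or any other extractor) instead of letting the crawler fetch the page itself.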
Thank you, Adrien!
Hello,
focused_crawl fails to extract URLs from certain websites. Is there any parameter or method to overcome this problem?