
focused_crawl returns nothing #589

Closed
bezir opened this issue May 7, 2024 · 6 comments
Labels: feedback (Feedback from users requested)

Comments


bezir commented May 7, 2024

Hello,

focused_crawl cannot harvest URLs from certain websites. Is there any parameter or method to overcome this problem?

adbar added the feedback label May 8, 2024
adbar (Owner) commented May 8, 2024

The task is complex and the focused crawler integrated in Trafilatura does not solve all problems. I cannot answer this question in general. Do you have a precise example for me to reproduce?

bezir (Author) commented May 9, 2024

Here is my code with an example input.

Code:

import json
from trafilatura.spiders import focused_crawler

def save_links_to_file(links, path):
    # Helper implied by the original snippet: dump the collected links as JSON
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sorted(links), f, ensure_ascii=False, indent=2)

def crawl_homepage(homepage_url, max_iteration, output_file):
    # First call initializes the crawl frontier from the homepage
    to_visit, known_links = focused_crawler(homepage_url, max_seen_urls=1)

    i = 0
    while i < max_iteration:
        # Resume crawling with the state from the previous iteration
        to_visit, known_links = focused_crawler(homepage_url, max_seen_urls=10, max_known_urls=300000, todo=to_visit, known_links=known_links)
        print("LEN", len(known_links))
        save_links_to_file(known_links, output_file + ".json")  # Save links to file after every iteration
        i += 1
    print(f"Finished crawling. Total iterations: {i}")

homepage_url = "https://tr.motorsport.com"

P.S.: the page has nothing to do with my task; I'm sharing a random URL that fails in the extraction process.

adbar (Owner) commented May 13, 2024

If you set the logging level to DEBUG, you'll see that the download fails (403 error), so there are no links to extract.
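For reference, debug output can be enabled with the standard library's logging module (a minimal sketch, not part of the original thread; the "trafilatura" logger name follows Python's usual module-based logger convention):

```python
import logging

# Send DEBUG-level messages to stderr so failed downloads (e.g. HTTP 403)
# show up in the log instead of failing silently.
logging.basicConfig(level=logging.DEBUG, format="%(levelname)s:%(name)s:%(message)s")
logging.getLogger("trafilatura").setLevel(logging.DEBUG)
```

With this in place, running the crawl again prints the download attempts and their HTTP status codes.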

adbar closed this as not planned May 13, 2024
bezir (Author) commented May 14, 2024

How can I fix it? I want to be able to get the news links from the website.

adbar (Owner) commented May 14, 2024

You have to use a more complex download utility to make sure you get the full content, then you can use Trafilatura on the HTML.
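One possible approach is to fetch the page yourself with browser-like request headers and then hand the HTML to Trafilatura (a hedged sketch using only the standard library; the `BROWSER_HEADERS` values and `fetch_html` helper are assumptions, not Trafilatura API, and sites behind JavaScript or bot protection may need a headless browser instead):

```python
from urllib.request import Request, urlopen

# Browser-like headers often avoid blanket 403 responses triggered by the
# default Python user agent; adjust as needed for the target site.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch_html(url, timeout=30):
    """Download a page with browser-like headers and return the HTML as text."""
    req = Request(url, headers=BROWSER_HEADERS)
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")

# Example usage (network access required):
# html = fetch_html("https://tr.motorsport.com")
# if html:
#     import trafilatura
#     print(trafilatura.extract(html))
```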

bezir (Author) commented May 14, 2024

Thank you Adrien!
