
focused_crawl returns nothing #589

Closed
bezir opened this issue May 7, 2024 · 6 comments
Labels: feedback (Feedback from users requested)

Comments


bezir commented May 7, 2024

Hello,

focused_crawl cannot harvest URLs from certain websites. Is there any parameter or method to overcome this problem?

adbar added the feedback label May 8, 2024
adbar (Owner) commented May 8, 2024

The task is complex and the focused crawler integrated in Trafilatura does not solve all problems. I cannot answer this question in general. Do you have a precise example for me to reproduce?

bezir (Author) commented May 9, 2024

Here is my code with an example input.

Code:

import json
from trafilatura.spiders import focused_crawler

def save_links_to_file(links, path):
    # Helper implied by the original snippet: dump the collected links as JSON
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sorted(links), f, ensure_ascii=False, indent=2)

def crawl_homepage(homepage_url, max_iteration, output_file):
    # First call initializes the crawl frontier from the homepage
    to_visit, known_links = focused_crawler(homepage_url, max_seen_urls=1)

    i = 0
    while i < max_iteration:
        # Resume crawling with the state from the previous iteration
        to_visit, known_links = focused_crawler(homepage_url, max_seen_urls=10, max_known_urls=300000, todo=to_visit, known_links=known_links)
        print("LEN", len(known_links))
        save_links_to_file(known_links, output_file + ".json")  # Save links to file after every iteration
        i += 1
    print(f"Finished crawling. Total iterations: {i}")

homepage_url = "https://tr.motorsport.com"

P.S.: the page has nothing to do with my task; I'm sharing a random URL that fails in the extraction process.

adbar (Owner) commented May 13, 2024

If you set the logging level to DEBUG, you'll see that the download fails (403 error), so there are no links to extract.
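For reference, debug output can be enabled with the standard library's logging module (a minimal sketch, not part of the original thread; the "trafilatura" logger name follows Python's usual module-based logger convention):

```python
import logging

# Send DEBUG-level messages to stderr so failed downloads (e.g. HTTP 403)
# show up in the log instead of failing silently.
logging.basicConfig(level=logging.DEBUG, format="%(levelname)s:%(name)s:%(message)s")
logging.getLogger("trafilatura").setLevel(logging.DEBUG)
```

With this in place, running the crawl again prints the download attempts and their HTTP status codes.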

adbar closed this as not planned May 13, 2024
bezir (Author) commented May 14, 2024

How can I fix it? I want to be able to get the news links from the website.

adbar (Owner) commented May 14, 2024

You have to use a more complex download utility to make sure you get the full content, then you can use Trafilatura on the HTML.
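One possible approach is to fetch the page yourself with browser-like request headers and then hand the HTML to Trafilatura (a hedged sketch using only the standard library; the `BROWSER_HEADERS` values and `fetch_html` helper are assumptions, not Trafilatura API, and sites behind JavaScript or bot protection may need a headless browser instead):

```python
from urllib.request import Request, urlopen

# Browser-like headers often avoid blanket 403 responses triggered by the
# default Python user agent; adjust as needed for the target site.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch_html(url, timeout=30):
    """Download a page with browser-like headers and return the HTML as text."""
    req = Request(url, headers=BROWSER_HEADERS)
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")

# Example usage (network access required):
# html = fetch_html("https://tr.motorsport.com")
# if html:
#     import trafilatura
#     print(trafilatura.extract(html))
```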

bezir (Author) commented May 14, 2024

Thank you Adrien!
