Crawling a whole blog #592

TomLucidor · 2024-05-13T02:03:22Z

Assuming that a WordPress blog (or equivalent) has little anti-scraping protection, what is the fastest way to crawl the whole site, similar to what HTTrack (GUI full-site scraper), Heritrix (Internet Archive's crawler), and grab-site (ArchiveTeam's crawler) do? If the blog is updated regularly, how does one avoid saving duplicate pages on later runs?
Q1: Have you heard of Ferret, Crawlee, and AutoScraper? It seems people are automating even more of this now.
Q2: What do you think of Playwright and Puppeteer? Could they be used to offload the page fetching, with Trafilatura then handling the article extraction?
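
To make the duplicate-page part of the question concrete, here is a minimal sketch of one possible approach: normalize each discovered URL and keep a set of already-seen URLs that persists between runs. It uses only the Python standard library and is not Trafilatura's own mechanism. The function names, the JSON state file, and the list of tracking parameters are all assumptions made for illustration.

```python
import json
from pathlib import Path
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of query parameters that do not change page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}


def normalize_url(url):
    """Map common variants of the same page onto one canonical string."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"  # treat /post and /post/ as equal
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS
    ))
    # The fragment (#...) is dropped: it never reaches the server.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))


def load_seen(state_file):
    """Load the set of normalized URLs saved by a previous run, if any."""
    state_file = Path(state_file)
    if state_file.exists():
        return set(json.loads(state_file.read_text(encoding="utf-8")))
    return set()


def save_seen(state_file, seen):
    """Persist the seen-set so the next run can skip known pages."""
    Path(state_file).write_text(json.dumps(sorted(seen)), encoding="utf-8")


def filter_new(urls, seen):
    """Return only URLs not seen before, and record them in `seen`."""
    fresh = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            fresh.append(url)
    return fresh
```

On each run, the crawler would call `load_seen`, pass every discovered link through `filter_new`, download only the returned URLs, and finish with `save_seen`. This only skips pages by URL; detecting posts that were edited after the first crawl would need something extra, such as comparing a hash of the extracted text.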

Replies: 0 comments