Crawling a whole blog #592
Unanswered
TomLucidor asked this question in Q&A
Replies: 0 comments
Assuming a WordPress blog (or equivalent) has little anti-scraping protection, what is the fastest way to mirror the whole site, similar to what HTTrack (GUI full-site scraper), Heritrix (the Internet Archive's archival crawler), and grab-site (ArchiveTeam's archival crawler) do? If the blog is updated regularly, how does one avoid saving duplicate pages on later crawls?
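To make the duplicate-pages part of the question concrete, here is a minimal sketch of one common approach: normalize each URL, hash each page body, and skip anything already seen. It uses only the Python standard library. The names `normalize_url` and `SeenPages`, and the choice to drop `utm_*` query parameters, are assumptions made for illustration; they do not come from any of the tools mentioned above.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: query parameters with these prefixes never change the page content.
TRACKING_PREFIXES = ("utm_",)


def normalize_url(url: str) -> str:
    """Return a canonical form of the URL so trivial variants compare equal."""
    parts = urlsplit(url)
    # Drop tracking parameters and sort the rest for a stable ordering.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    )
    # Treat "/post/" and "/post" as the same page; keep "/" for the root.
    path = parts.path.rstrip("/") or "/"
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


class SeenPages:
    """Hypothetical in-memory record of URLs and page bodies already saved."""

    def __init__(self) -> None:
        self.urls: set[str] = set()
        self.hashes: set[str] = set()

    def is_new(self, url: str, html: str) -> bool:
        """Return True and record the page if neither URL nor body was seen."""
        key = normalize_url(url)
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if key in self.urls or digest in self.hashes:
            return False
        self.urls.add(key)
        self.hashes.add(digest)
        return True


seen = SeenPages()
print(seen.is_new("https://example.org/post/", "<p>Hello</p>"))
print(seen.is_new("https://example.org/post?utm_source=feed", "<p>Hello</p>"))
```

For a blog that is crawled repeatedly, the two sets would need to be saved to disk between runs. Note that this sketch treats an edited post at a known URL as already seen, so picking up revisions would need a different rule, such as comparing hashes per URL.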
Q1: Have you heard of Ferret, Crawlee, and AutoScraper? It seems people are automating scraping even more now.
Q2: What do you think of Playwright and Puppeteer? Could fetching and rendering be offloaded to them before article extraction with Trafilatura?