
Script to scrape URLs from a webpage and archive them on the Wayback Machine.


4rnv/Scrappy


A Python script to scrape URLs from a webpage and archive them on the Wayback Machine. It uses Beautiful Soup to parse a page for anchor tags and then saves the linked URLs using the Archive.org API. The name is meant to read as scrap.py, as in scraping plus Python.
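
The link-collection step can be sketched roughly as follows. This is an illustration, not the script's actual code: it uses the standard library's html.parser in place of Beautiful Soup so that it runs without extra dependencies (with Beautiful Soup, the equivalent lookup would be soup.find_all("a", href=True)). The names AnchorCollector and extract_links are hypothetical, and resolving relative links, skipping non-HTTP schemes, and dropping duplicates are assumptions about sensible behavior rather than something the README states.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects absolute http(s) URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return  # anchor without a usable href
        # Assumption: relative links are resolved against the page URL.
        absolute = urljoin(self.base_url, href)
        # Assumption: only web URLs are worth archiving (skips mailto:, javascript:, ...).
        if absolute.startswith(("http://", "https://")) and absolute not in self.links:
            self.links.append(absolute)


def extract_links(html, base_url):
    """Return the unique http(s) links found in `html`, in document order."""
    parser = AnchorCollector(base_url)
    parser.feed(html)
    return parser.links
```

In use, the page's HTML would be downloaded first and then passed to extract_links together with the page's own URL, so that relative links such as /about become full addresses.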

Usage

Clone this repo or download it as a ZIP. Install the modules listed in requirements.txt using pip install -r requirements.txt. Then run the script in your terminal and follow the on-screen instructions.

IMPORTANT: time.sleep(5) delays the archival of each URL by 5 seconds. This avoids overloading the API with too many requests, which can cause the server to refuse the connection. A healthy gap between requests prevents that.
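
A minimal sketch of such a throttled archival loop is shown below. It is not the script's own code. It assumes the Wayback Machine's Save Page Now URL prefix (https://web.archive.org/save/) as the way to request a snapshot, which may differ from the call the script actually makes. The names save_to_wayback and archive_all, the User-Agent string, and the injectable save and sleep parameters are all hypothetical, added so the loop can be tried without touching the network.

```python
import time
import urllib.request

# Assumption: Save Page Now URL prefix; the real script may use a different API call.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def save_to_wayback(url, timeout=30.0):
    """Ask the Wayback Machine to snapshot `url`; return the HTTP status code."""
    request = urllib.request.Request(
        SAVE_ENDPOINT + url,
        headers={"User-Agent": "scrappy-sketch"},  # hypothetical identifier
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def archive_all(urls, save=save_to_wayback, delay=5.0, sleep=time.sleep):
    """Archive each URL in turn, pausing `delay` seconds between requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # the gap that keeps the server from refusing connections
        try:
            results[url] = save(url)
        except OSError as exc:  # urllib's URLError and HTTPError subclass OSError
            results[url] = exc  # record the failure and keep going
    return results
```

Pausing before every request except the first keeps the same 5-second spacing between calls while avoiding a pointless wait at the start. Recording a failure instead of raising lets one refused connection pass without aborting the remaining URLs.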
