Note: This repository was archived by the owner on Dec 21, 2023, and is now read-only.

agentsoz/webscraper

A simple web scraper script

Usage:

./webscraper.sh

The webscraper.sh script takes as input a text file (./urls.txt) containing one URL per line, such as:

https://some/web/page.html
https://another/web/page.html
https://yet/another/web/page.html

and saves the contents of each page to a separate text file in an output directory (./outdir). The output directory will be created if it does not already exist.

Saved files are named using the base name of the URL, i.e., everything after the last /, and with a .txt extension added. So, for example, https://some/web/page.html will be saved as page.html.txt.
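
For reference, here is a minimal sketch of the core logic as described above; the actual webscraper.sh may differ in its details, and the variable names are illustrative:

#!/usr/bin/env bash
# Sketch of the behaviour described above: read ./urls.txt,
# fetch each URL with wget, convert the HTML to plain text with
# html2text, and save the result under ./outdir.

urls_file="./urls.txt"
outdir="./outdir"

mkdir -p "$outdir"                # create the output directory if missing

while IFS= read -r url; do
  [ -z "$url" ] && continue       # skip blank lines
  name="${url##*/}"               # base name: everything after the last /
  wget -q -O - "$url" | html2text > "$outdir/$name.txt"
done < "$urls_file"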

WARNING: This means that URLs with the same base name, such as page.html in all of the examples above, will be saved to the same file, with later downloads overwriting earlier ones! A future version could accept a second column in the URLs file specifying the output file name for each URL.
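
Such a hypothetical two-column format, not supported by the current script, might look like:

https://some/web/page.html first-page.txt
https://another/web/page.html second-page.txt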

WARNING: Windows users, please ensure that the URLs file you provide is saved with Unix line endings; if the file was created on Windows, convert it with dos2unix.
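
For example, to convert the URLs file in place:

dos2unix urls.txt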

This Bash script relies on the following utilities being available: wget and html2text.
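
On Debian-based systems, for example, both can be installed with:

sudo apt-get install wget html2text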
