Note: This repository was archived by the owner on Dec 21, 2023, and is now read-only.

agentsoz/webscraper

A simple web scraper script

Usage:

./webscraper.sh

The webscraper.sh script takes as input a text file (./urls.txt) containing one URL per line, such as:

https://some/web/page.html
https://another/web/page.html
https://yet/another/web/page.html

and saves the contents of each page to a separate text file in an output directory (./outdir). The output directory will be created if it does not already exist.

Saved files are named using the base name of the URL, i.e., everything after the last /, and with a .txt extension added. So, for example, https://some/web/page.html will be saved as page.html.txt.
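
For reference, here is a minimal sketch of the core logic as described above; the actual webscraper.sh may differ in its details, and the variable names are illustrative:

#!/usr/bin/env bash
# Sketch of the behaviour described above: read ./urls.txt,
# fetch each URL with wget, convert the HTML to plain text with
# html2text, and save the result under ./outdir.

urls_file="./urls.txt"
outdir="./outdir"

mkdir -p "$outdir"                # create the output directory if missing

while IFS= read -r url; do
  [ -z "$url" ] && continue       # skip blank lines
  name="${url##*/}"               # base name: everything after the last /
  wget -q -O - "$url" | html2text > "$outdir/$name.txt"
done < "$urls_file"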

WARNING: This means that URLs with the same base name, such as page.html in all of the examples above, will be saved to the same file, with later downloads overwriting earlier ones! A future version could accept a second column in the URLs file specifying the output file name for each URL.
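
Such a hypothetical two-column format, not supported by the current script, might look like:

https://some/web/page.html first-page.txt
https://another/web/page.html second-page.txt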

WARNING: Windows users, please ensure that the URLs file you provide is saved with Unix line endings; if the file was created on Windows, convert it with dos2unix.
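
For example, to convert the URLs file in place:

dos2unix urls.txt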

This Bash script relies on the following utilities being available: wget and html2text.
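
On Debian-based systems, for example, both can be installed with:

sudo apt-get install wget html2text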
