
domain_scraper

Scrapes domains, either from a single input URL or from a file listing domains, for broken links, valid email addresses, and valid social media links. Uses BeautifulSoup.

Usage

  • Check URLs from a text file, scraping each for emails and social media links. Common paths on each input domain, such as contact and team pages, are also checked and added to a queue of new URLs to scrape. Broken links are not stored, but they are printed to STDOUT at runtime. All valid, unique email addresses and social media links are saved to a file as they are found, so data is preserved in the event of an error. (A sketch of this scrape pass follows the command list below.)

  • Scrape for emails & social media links, checking for promising new links to scrape
$ ./domain_scraper.py [INPUT FILE] --scrape-n
  • Same as above, but do not check for new links to add to the queue
$ ./domain_scraper.py [INPUT FILE] --scrape
  • Check for broken links only, across all URLs on the same domain, based on one main input URL
$ ./domain_scraper.py --url [URL TO SCRAPE]
  • Check URLs from a text file for broken links
$ ./domain_scraper.py [INPUT FILE] --check
  • Extract name associations from an email list (used with results from scraping)
$ ./domain_scraper.py [INPUT FILE] --extract
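
For illustration, here is a minimal sketch of the kind of scrape pass described above, using requests and BeautifulSoup with a simple email regex. The function name, patterns, and path list are assumptions for the example, not the script's actual internals.

#!/usr/bin/env python3
# Illustrative sketch only; names and patterns are assumptions.
import re
import requests
from bs4 import BeautifulSoup

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SOCIAL_HOSTS = ("twitter.com", "facebook.com", "linkedin.com", "instagram.com")
COMMON_PATHS = ("/contact", "/team", "/about")  # promising pages to enqueue

def scrape_page(url):
    """Return (emails, social_links, new_urls) found on one page."""
    resp = requests.get(url, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    emails = set(EMAIL_RE.findall(resp.text))
    social = {a["href"] for a in soup.find_all("a", href=True)
              if any(host in a["href"] for host in SOCIAL_HOSTS)}
    # queue common paths on the same domain for a later pass
    new_urls = {url.rstrip("/") + path for path in COMMON_PATHS}
    return emails, social, new_urls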

Data storage

The email and social media scraper writes its results to a file during runtime:

  • Results are appended to the file as they are found, rather than at exit (see the sketch below).
  • Specific errors are not written to the file; they are printed to STDOUT instead.
  • Files are stored under the path ./file_storage
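
A rough sketch of this append-as-you-go pattern, assuming one output file per result category under ./file_storage (the helper name and filename are hypothetical):

import os

STORAGE_DIR = "./file_storage"
_seen = set()  # in-memory record of what has already been written

def save_result(line, filename="scraped_emails.txt"):
    """Append each unique result immediately so data survives a crash."""
    if line in _seen:
        return
    _seen.add(line)
    os.makedirs(STORAGE_DIR, exist_ok=True)
    with open(os.path.join(STORAGE_DIR, filename), "a") as f:
        f.write(line + "\n")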

Example file & file cleanup

How to clean up a file copied from a .csv

$ cat example_file_bad_format.txt
https://google.com/^Mhttps://cecinestpasun.site/^Mhttps://google.com/^Mhttp://www.davidjohncoleman.com/wp-content/uploads/2017/06/headshot-retro.png

# replace ^M (carriage return) characters left over from the .csv file
$ tr '\r' '\n' < example_file_bad_format.txt > example_file.txt

# remove duplicate links
$ awk '!seen[$0]++' example_file.txt > example_file_no_repeats.txt

$ cat example_file_no_repeats.txt
https://google.com/
https://cecinestpasun.site/
http://www.davidjohncoleman.com/wp-content/uploads/2017/06/headshot-retro.png
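
If tr and awk are not available, the same cleanup can be done in Python; a minimal sketch (file names match the example above):

def clean_url_file(src="example_file_bad_format.txt",
                   dst="example_file_no_repeats.txt"):
    """Convert carriage returns to newlines and drop duplicates, keeping order."""
    seen, lines = set(), []
    with open(src) as f:
        for line in f.read().replace("\r", "\n").splitlines():
            line = line.strip()
            if line and line not in seen:
                seen.add(line)
                lines.append(line)
    with open(dst, "w") as f:
        f.write("\n".join(lines) + "\n")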

Author

Contributors

License

MIT License
