ashaheedq/vgchartzScrape: a web scraping project for capturing data from VGChartz

vgchartzfull is a Python script that scrapes with BeautifulSoup and uses multiprocessing. Proxies are enabled in the script; they can be disabled by changing the proxy setting to False.
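As a rough illustration of the proxy switch, the sketch below builds the keyword arguments for a `requests.get` call and adds the `proxies` mapping only when the flag is on. The names `USE_PROXIES` and `request_kwargs` and the example proxy address are hypothetical and are not taken from vgchartzfull.py; the actual script may structure this differently.

```python
from typing import Optional

# Hypothetical flag; the real script's variable name may differ.
USE_PROXIES = True


def request_kwargs(proxy: Optional[str], use_proxies: bool = USE_PROXIES) -> dict:
    """Build keyword arguments for requests.get, with or without a proxy."""
    kwargs = {"timeout": 10}
    if use_proxies and proxy:
        # requests expects a mapping from URL scheme to proxy URL.
        kwargs["proxies"] = {"http": proxy, "https": proxy}
    return kwargs
```

With `requests` installed, a call might look like `requests.get(url, **request_kwargs("http://203.0.113.5:8080"))`; passing `use_proxies=False` sends the request directly.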

It creates a dataset based on data from http://www.vgchartz.com/gamedb/

The dataset is saved as vgsales-%Y-%m-%d_%H_%M_%S.csv.
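The file name pattern uses standard `strftime` codes, so a timestamped name can be produced as in this minimal sketch. The function name `output_filename` is an assumption made for illustration; only the pattern itself comes from the README.

```python
from datetime import datetime
from typing import Optional


def output_filename(now: Optional[datetime] = None) -> str:
    """Return a name such as vgsales-2020-01-02_03_04_05.csv."""
    now = now or datetime.now()
    # Literal text in the format string is passed through unchanged.
    return now.strftime("vgsales-%Y-%m-%d_%H_%M_%S.csv")
```

Because the timestamp goes down to the second, each run normally writes a new file instead of overwriting the previous one.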

You will need to have the following dependencies installed:

BeautifulSoup4 
pandas
numpy
requests
unidecode
user_agent

Thanks to:

Free proxies: 1 2 3 4

Added multiprocessing for faster results, with a maximum of 24 workers.
Added proxies to avoid being blocked.
Handles a couple of exceptions.
Scraped data is saved before an unexpected error is raised.
Add the option to continue where we left off after an unexpected error.
The clean version removes the print statements, which should result in better performance.
Optimize it.
Create a log file.
Convert the script to a class or use Scrapy; reference:
- https://edmundmartin.com/multi-threaded-crawler-in-python/
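Two of the notes above, the 24-worker cap and saving scraped data before an error propagates, can be sketched together as follows. This is an illustrative outline under stated assumptions, not the code in vgchartzfull.py: `worker_count`, `save_rows`, `scrape_all`, and the `fetch_page` callable are hypothetical names, and the real script uses pandas for output where this sketch uses the standard `csv` module to stay self-contained.

```python
import csv
import multiprocessing as mp
from typing import Callable, List

MAX_WORKERS = 24  # upper bound mentioned in the notes above


def worker_count(n_tasks: int) -> int:
    """Use at most MAX_WORKERS processes, and never more than there are tasks."""
    return max(1, min(MAX_WORKERS, n_tasks))


def save_rows(rows: List[list], path: str) -> None:
    """Write whatever has been collected so far to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        csv.writer(fh).writerows(rows)


def scrape_all(pages: List[int], fetch_page: Callable[[int], List[list]],
               out_path: str) -> List[list]:
    """Fetch pages in parallel; save partial results even if a worker fails."""
    rows: List[list] = []
    try:
        with mp.Pool(worker_count(len(pages))) as pool:
            # imap yields results in page order and re-raises worker errors here.
            for page_rows in pool.imap(fetch_page, pages):
                rows.extend(page_rows)
    finally:
        # Runs on success and on failure, before any exception propagates.
        save_rows(rows, out_path)
    return rows
```

`fetch_page` has to be a top-level function so that `multiprocessing` can pickle it, and on platforms that spawn processes the call to `scrape_all` should sit under an `if __name__ == "__main__":` guard.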

Repository files:

.gitignore
README.md
clean-vgchartzfull.py
proxies_gen.py
vgchartzfull.py
