ashaheedq/vgchartzScrape

 
 


vgchartzfull is a Python script that scrapes with BeautifulSoup and uses multiprocessing. Proxy support is built into the script and can be disabled by changing the proxy setting to False.
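As a rough illustration of how such a proxy switch could work, the sketch below builds the keyword arguments for a `requests.get` call and only includes a `proxies` mapping when the switch is on. The names `USE_PROXIES` and `build_request_kwargs`, the timeout value, and the proxy address format are assumptions made for this example and are not taken from the script.

```python
# Hypothetical flag name; the script's actual setting may be named differently.
USE_PROXIES = True


def build_request_kwargs(proxy_address, use_proxies=USE_PROXIES):
    """Return keyword arguments for requests.get, with or without a proxy."""
    kwargs = {"timeout": 10}  # assumed timeout in seconds
    if use_proxies and proxy_address:
        proxy_url = f"http://{proxy_address}"
        # requests expects a mapping from URL scheme to proxy URL.
        kwargs["proxies"] = {"http": proxy_url, "https": proxy_url}
    return kwargs
```

With `requests` installed, a call would then look like `requests.get(url, **build_request_kwargs("10.0.0.1:8080"))`. Passing `use_proxies=False` sends the request directly.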

It creates a dataset based on data from http://www.vgchartz.com/gamedb/

The dataset is saved as vgsales-%Y-%m-%d_%H_%M_%S.csv.
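The pattern in the file name is a `strftime` format string. A minimal sketch of how a name like this can be generated follows; the helper name `dataset_filename` is made up for the example, and the script may build the name differently.

```python
from datetime import datetime


def dataset_filename(now=None):
    """Build the timestamped CSV name from the current (or a given) time."""
    now = now or datetime.now()
    return now.strftime("vgsales-%Y-%m-%d_%H_%M_%S.csv")


print(dataset_filename())
```

Each run produces a new file name, so earlier datasets are not overwritten.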

You will need to have the following dependencies installed:

BeautifulSoup4 
pandas
numpy
requests
unidecode
user_agent

Thanks to:

Free proxies: 1 2 3 4

  • added multiprocessing for faster results with a maximum of 24 workers.
  • added proxies to avoid being blocked
  • handles a couple of exceptions
  • scraped data is saved before an unexpected error is raised
  • add the option to continue where we left off due to an unexpected error
  • the clean version removes the print statements, which should result in better performance
  • optimize it
  • create a log file
  • convert the script to a class or use scrapy, reference
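
Two of the points above, the capped worker pool and saving scraped data before an unexpected error propagates, can be sketched as follows. This is only an illustration under stated assumptions: `scrape_page` is a hypothetical placeholder for the real per-page scraping function, the helper names are invented, and the standard `csv` module stands in for whatever the script uses to write its output.

```python
import csv
import multiprocessing as mp

MAX_WORKERS = 24  # the cap mentioned in the list above


def worker_count(n_tasks, cpu_count=None):
    """Use no more workers than the cap, the CPU count, or the task count."""
    cpus = cpu_count or mp.cpu_count()
    return max(1, min(MAX_WORKERS, cpus, n_tasks))


def save_rows(rows, path):
    """Write the collected rows to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


def collect(results, path):
    """Gather rows; save what was gathered even if iteration fails."""
    rows = []
    try:
        for row in results:
            rows.append(row)
    finally:
        # Runs on success and on error; an error is re-raised afterwards.
        save_rows(rows, path)
    return rows


def scrape_page(page):
    """Hypothetical placeholder for the real per-page scraper."""
    return [page, f"game-{page}"]


def run(pages, path):
    """Scrape pages in parallel and keep partial results on failure."""
    with mp.Pool(worker_count(len(pages))) as pool:
        return collect(pool.imap(scrape_page, pages), path)
```

On platforms that start worker processes with spawn (Windows, macOS), `run(...)` should be called from inside an `if __name__ == "__main__":` guard.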