The goal of any job seeker is to land a position. Data science is a very active area of the job market, with many open positions and many candidates looking to fill them. This project aims to assist job seekers by analyzing Data Scientist job postings on Indeed.com in six cities across the United States. Those cities are:
- Charlotte, NC
- Chicago, IL
- Los Angeles, CA
- New York, NY
- Phoenix, AZ
- San Francisco, CA
Additional cities for later consideration:
- Atlanta, GA
- Austin, TX
- Boston, MA
- Seattle, WA
- Washington, DC
For detailed insights, please see the PPTX files in this repo.
- Enter the desired locations and a suitable proxy in the `config.py` file in the `indeed` sub-folder
  - For a good, free proxy, refer to https://www.us-proxy.org/
- Run the main scraper to get the job postings: `scrapy crawl indeed_spider`
  - Results will be placed in the `data` sub-folder as `indeed_spider.csv`
- Run the secondary scraper to resolve the original posting URL: `scrapy crawl redirect_spider`
  - Results will be placed in the `data` sub-folder as `redirect_spider.csv`
- Open the three Jupyter Notebooks to analyze the results:
  - Job_Description_Word_Cloud.ipynb : Produces a naïve word cloud
  - Job_Statistics_Calculations.ipynb : Produces statistical analysis from posting metadata
  - Job_Text_Analysis.ipynb : Produces natural language processing analysis of the job descriptions
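
To give a feel for the kind of input a naïve word cloud starts from, here is a minimal sketch of counting word frequencies in scraped job descriptions. It is not taken from the notebooks: the column names (`title`, `city`, `description`), the small stop-word set, and the helper `word_frequencies` are all assumptions made for illustration, and an in-memory sample stands in for `data/indeed_spider.csv`.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word set; a real analysis would use a fuller list.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "with"}


def word_frequencies(rows, text_column="description"):
    """Count lowercase word occurrences across one text column of the rows."""
    counts = Counter()
    for row in rows:
        text = (row.get(text_column) or "").lower()
        # Keep letters plus '+' and '#' so terms like "c++" or "c#" survive.
        words = re.findall(r"[a-z][a-z+#]*", text)
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


# Stand-in for data/indeed_spider.csv; the column names are assumptions.
sample_csv = io.StringIO(
    "title,city,description\n"
    "Data Scientist,Chicago,Experience with Python and SQL\n"
    "Data Scientist,Phoenix,Python and machine learning in the cloud\n"
)

freqs = word_frequencies(csv.DictReader(sample_csv))
print(freqs.most_common(3))
```

To try this against real scraper output, replace `sample_csv` with an opened file handle and adjust `text_column` to whatever header the CSV actually uses.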