- Scrape the Data
- Original
- Poly
- Flask App
- Extras
The hardest part of job hunting is when the recruiter turns to you and says
“So, what kind of salary are you looking for?”
Upon hearing this question, my heart dropped and I started to flip through the storybook of my life. I realized that when put on the spot, I only think of the negative parts of my life instead of all the great things that I have done.
I know that we are all affected by our prior experiences, sometimes with more emphasis on the mistakes than the successes. With this in mind, how can I objectively measure my experiences and skills? In other words, can I build a tool that accurately quantifies my achievements, skills, and experience in a way that translates to a dollar amount?
This was a great idea! The model would take either a resume or a job posting as input and output a salary or salary range, all without me having to worry about the subjective nature of my experiences.
I decided to use Google Jobs as my source of jobs because it aggregates postings from a bunch of other job platforms. Additionally, it posts salary estimates from “Glassdoor”, “Built in NYC”, and “PayScale”.
I downloaded ChromeDriver to use with Selenium. You can use this blog post by Atindra Bandi to learn more about Selenium and how it can be used.
After inspecting the Google job platform, I was able to identify which data I would pull from the job posting. I wanted the:
- Title
- Company name
- Job posting
- Location
- Estimated salary (From Glassdoor or the others)
I started by creating a function that takes a search term as input and has Selenium open Chrome and run a Google job search for that term. See the code here
Now that we have the proper job data, we can continue to pull more jobs from different search words. I created a function that can pull up to 150 jobs at a time but it will require a little bit of manual scrolling and input.
After running the function, you will be asked to enter a search word (no quotes required). Once the page is opened, scroll on the list of jobs until the page stops refreshing. Once you are done, go back to the script, and type a ‘y’ to proceed. The script will go through each job and pull the necessary data and it will return a DataFrame.
After pulling over 6,000 jobs, it was time to go through the data and clean it up. Our data had 2 issues:
1. Only about 3,000 rows had estimated salary data.
2. The estimated salary data was a range of values (EX: $75,000 - $120,000).

In order to fix this, I dropped all rows that did not have a salary, and I created a function to go through each cell in the salary column and find the average salary for the range (EX: $75,000 - $120,000 averages to $97,500). Once it got all of the averages, I created a new column in the DataFrame to store all of these new numbers.
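The averaging step can be sketched roughly as follows. This is a minimal illustration rather than the project's actual function: the name `salary_midpoint` and the regular expression are assumptions, and it simply averages every dollar amount it finds in the cell.

```python
import re
from typing import Optional


def salary_midpoint(salary_text: str) -> Optional[float]:
    """Average the dollar amounts found in a salary string (hypothetical helper)."""
    # Capture amounts such as "$75,000"; the thousands separators are removed below.
    raw_amounts = re.findall(r"\$\s*(\d[\d,]*(?:\.\d+)?)", salary_text)
    amounts = [float(value.replace(",", "")) for value in raw_amounts]
    if not amounts:
        # No salary present: the caller can drop this row.
        return None
    return sum(amounts) / len(amounts)


print(salary_midpoint("$75,000 - $120,000"))  # 97500.0
```

Applied to a pandas column, a helper like this could be passed to `Series.apply` to build the new average-salary column.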
In order to run text through a statistical model, we have to turn the text into numbers. The cleaning technique that I used for the text consisted of tokenizing, lemmatizing, removing stop words, and vectorizing. I decided to use scikit-learn’s TfidfVectorizer to turn the text into numbers based on the frequency of each word in the document and in the entire corpus (all of our documents).
The hyper-parameters that I decided to use for the vectorizer were:
- ngram_range = (1,3)
- max_df = .85
- min_df = .15
- binary = True
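As a rough illustration, the settings listed above could be passed to scikit-learn's `TfidfVectorizer` like this. The n-gram setting goes through the `ngram_range` parameter. The eight short postings are a made-up stand-in for the cleaned job text, chosen so that the `min_df` and `max_df` cutoffs have a visible effect.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the cleaned job postings.
postings = [
    "software engineer python aws",
    "data engineer python sql",
    "machine learning engineer python",
    "data engineer sql spark",
    "frontend engineer javascript react",
    "backend engineer java",
    "site reliability engineer kubernetes",
    "software engineer javascript",
]

vectorizer = TfidfVectorizer(
    ngram_range=(1, 3),  # unigrams, bigrams, and trigrams
    max_df=0.85,         # drop terms found in more than 85% of documents
    min_df=0.15,         # drop terms found in fewer than 15% of documents
    binary=True,         # term counts become 0/1 before IDF weighting
)
features = vectorizer.fit_transform(postings)

print(features.shape)
print(sorted(vectorizer.vocabulary_))
```

In this toy corpus "engineer" appears in every posting, so `max_df` removes it, while terms that occur in only one posting fall below `min_df`.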
Now that the data was ready to be put through a stats model, I performed a train_test_split on it and initially ran it through a LinearRegression. This gave me a baseline score against which I could compare the accuracy of my other models, and a model I could use for inference. I wound up also using Ridge, Lasso, Random Forest, Gradient Boost, and even a Neural Network.
For all of these models, my metric was Root Mean Square Error.
| Model | RMSE |
|---|---|
| Glassdoor’s normal range | ± $32,214 |
| Linear Regression | ± $30,595 |
| Lasso | ± $30,458 |
| Ridge | ± $30,560 |
| Random Forest | ± $19,290 |
| Gradient Boost | ± $18,379 |
| Neural Network | ± $28,736 |
The model predicts salaries with a lower error than Glassdoor!
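For reference, the ± figures above are Root Mean Square Error values in dollars. A minimal sketch of the metric in plain Python, with a hypothetical function name and made-up salaries, looks like this:

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root Mean Square Error between actual and predicted salaries."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Square each error, average the squares, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Two made-up salaries, each predicted $10,000 off.
print(rmse([100_000, 80_000], [90_000, 90_000]))  # 10000.0
```

Because the errors are squared before averaging, a few very large misses raise the RMSE more than many small ones.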
My next step was to do some feature engineering and feature selection.
I made a new notebook to keep this separate from my original work.
In addition to what was done in the data acquisition process before, I used scikit-learn's PolynomialFeatures. This multiplies each column by every other column, including itself. This helps because words can have a different meaning when they are used with other words.
Because the words were being multiplied, I realized that the vectorizer would need to be tweaked and the binary hyper-parameter would need to be set to False.
After performing this, we ended up with over 39,000 features.
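To make the column multiplication concrete, here is a small plain-Python sketch of the pairwise products for a single row. It assumes degree 2, which the description above implies but does not state, and the function name is hypothetical. scikit-learn's `PolynomialFeatures(degree=2)` also keeps a constant column and the original columns by default, which this sketch leaves out.

```python
from itertools import combinations_with_replacement
from typing import List, Sequence


def pairwise_products(row: Sequence[float]) -> List[float]:
    """Multiply every column by every column at or after it, itself included."""
    index_pairs = combinations_with_replacement(range(len(row)), 2)
    return [row[i] * row[j] for i, j in index_pairs]


# Columns a=2 and b=3 produce a*a, a*b, b*b.
print(pairwise_products([2.0, 3.0]))  # [4.0, 6.0, 9.0]
```

With n input columns this yields n(n+1)/2 products, which is why the feature count grows so quickly.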
Like before, now that the data was ready to be put through a stats model, I performed a train_test_split on it and ran it through all 6 of the models.
| Model | RMSE |
|---|---|
| Glassdoor’s normal range | ± $32,214 |
| Linear Regression | ± $25,390 |
| Lasso | ± $27,704 |
| Ridge | ± $24,862 |
| Random Forest | ± $18,057 |
| Gradient Boost | ± $18,033 |
| Neural Network | ± $15,194 |
I worked on a Flask App that allows a user to get their estimated salary in 3 easy steps.
- Either input text or drop in a resume/job posting (.txt, .pdf, or .docx).
- Choose a model that you would like to be used on your text.
- Press SUBMIT
The app takes the text, prepares the text for the model, runs it through the desired model, and outputs an estimated salary and a range ( Salary ± RMSE ).
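The final output step (Salary ± RMSE) can be sketched as a small formatting helper. The function name and output wording are assumptions, not the app's actual code. The example values are the Neural Net Poly prediction and margin from my own resume run.

```python
def format_estimate(prediction: float, rmse: float) -> str:
    """Format a predicted salary with its ± RMSE range (hypothetical helper)."""
    # Clamping the lower bound at zero is a choice made for this sketch.
    low = max(prediction - rmse, 0)
    high = prediction + rmse
    return f"${prediction:,.0f} (range ${low:,.0f} to ${high:,.0f})"


print(format_estimate(87_074, 15_193))  # $87,074 (range $71,881 to $102,267)
```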
I then decided to run my resume through each model to see what it predicted.
| | Linear Regression | Lasso | Ridge | Random Forest | Gradient Boost | Neural Net | Linear Regression Poly | Lasso Poly | Ridge Poly | Random Forest Poly | Gradient Boost Poly | Neural Net Poly |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| margin | 29,566 | 30,009 | 29,512 | 19,087 | 19,043 | 27,505 | 25,390 | 27,703 | 24,861 | 18,056 | 18,032 | 15,193 |
| worth | 84,534 | 82,372 | 78,737 | 95,026 | 97,914 | 93,068 | 87,647 | 76,896 | 82,104 | 116,140 | 115,486 | 87,074 |
You can see a video of the Flask app being used here
As you can see, the Root Mean Squared Error scores for the models are over 18% better than in our original work and over 33% better than Glassdoor's!
I am happy with how this project turned out and I see a lot of room to grow this project.
Some of my ideas are:
- Running the model on other types of jobs (Not just tech)
- Having the Flask app output suggestions such as:
  - Suggested job title based on your resume.
  - Suggested words to use in your resume.
  - Suggested job postings that best fit your resume.
  - Suggested skills to acquire to make yourself more valuable.
Link to my presentation (Google Slides)
Link to presentation pdf (No animations :( )