Job & Resume Salary Predictor

How much am I worth?

The hardest part of job hunting is when the recruiter turns to you and says
“So, what kind of salary are you looking for?”

Upon hearing this question, my heart dropped and I started to flip through the storybook of my life. I realized that when put on the spot, I only think of the negative parts of my life instead of all the great things that I have done.

I know that we are all affected by our prior experiences, sometimes with more emphasis on the mistakes than the successes. With this in mind, how can I objectively measure my experiences and skills? In other words, can I build a tool that will accurately quantify my achievements, skills, and experience in a way that translates to a dollar amount?

This was a great idea: a model that takes either a resume or a job posting as input and outputs a salary or salary range, all without me having to worry about the subjective nature of my experiences.

Original

Data Acquisition

I decided to use Google Jobs as my source of jobs because it aggregates postings from a bunch of other job platforms. Additionally, it posts salary estimates from “Glassdoor”, “Built in NYC”, and “PayScale”.

I downloaded a Chrome driver to be used with Selenium. You can use this blog post by Atindra Bandi to learn more about Selenium and how it can be used.

After inspecting the Google job platform, I was able to identify which data I would pull from the job posting. I wanted the:

Title
Company name
Job posting
Location
Estimated salary (From Glassdoor or the others)

I started by creating a function that takes a search term as input and uses Selenium to open Chrome and run a Google job search for that term. See the code here

Now that we have the proper job data, we can continue to pull more jobs from different search words. I created a function that can pull up to 150 jobs at a time but it will require a little bit of manual scrolling and input.

After running the function, you will be asked to enter a search word (no quotes required). Once the page is opened, scroll on the list of jobs until the page stops refreshing. Once you are done, go back to the script, and type a ‘y’ to proceed. The script will go through each job and pull the necessary data and it will return a DataFrame.

After pulling over 6,000 jobs, it was time to go through the data and clean it up. Our data had 2 issues:

1. Only about 3,000 rows had estimated salary data.
2. The estimated salary data was a range of values (EX: $75,000 - $120,000).

Data Cleaning

In order to fix this, I dropped all rows that did not have a salary, and I created a function to go through each cell in the salary column and find the average salary for the range (EX: $75,000 - $120,000 = $97,500). Once it got all of the averages, I created a new column in the DataFrame to store all of these new numbers.
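The range-averaging step can be sketched roughly as follows. This is a minimal illustration, not the project's actual function: the name `average_salary` is hypothetical, and it assumes the salary text contains plain dollar figures like the example above.

```python
import re


def average_salary(salary_text):
    """Return the midpoint of a salary range string, or None if no number is found."""
    # Pull out every numeric figure, e.g. "$75,000" -> "75,000".
    amounts = re.findall(r"\$?(\d[\d,]*(?:\.\d+)?)", salary_text)
    values = [float(a.replace(",", "")) for a in amounts]
    if not values:
        return None
    # For a two-value range this is the midpoint; a single value is returned as-is.
    return sum(values) / len(values)


print(average_salary("$75,000 - $120,000"))  # 97500.0
```

With pandas, a function like this could be applied to the salary column with `.apply` to build the new column of averages.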

In order to run text through a statistical model, we have to turn our text into numbers. The cleaning technique that I used for the text consisted of tokenizing, lemmatizing, removing stop words, and vectorizing. I decided to use scikit-learn's TfidfVectorizer to turn the text into numbers based on the frequency of each word in the document and across the entire corpus (all of our documents).
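To make the first cleaning steps concrete, here is a small standard-library sketch of tokenizing and stop-word removal. It is an illustration only: the `clean_text` name and the tiny `STOP_WORDS` set are assumptions, and lemmatizing is left out because it needs an NLP library such as NLTK or spaCy.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "for", "with", "is", "are"}


def clean_text(text):
    """Lowercase the text, split it into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


print(clean_text("Experience with Python and SQL is required for the role."))
# ['experience', 'python', 'sql', 'required', 'role']
```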

The hyper-parameters that I decided to use for the vectorizer were:

- ngram_range = (1, 3)
- max_df = .85
- min_df = .15
- binary = True
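As a rough sketch, these settings map onto scikit-learn's `TfidfVectorizer` as shown below (the n-gram setting is called `ngram_range` in scikit-learn). The `postings` list is made-up toy text, not the project's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for cleaned job postings (not the project's data).
postings = [
    "python sql machine learning",
    "python data analysis sql",
    "python machine learning statistics",
    "excel data analysis reporting",
    "sql reporting dashboards",
    "java backend services",
    "python machine learning deployment",
    "data analysis python",
]

vectorizer = TfidfVectorizer(
    ngram_range=(1, 3),  # unigrams, bigrams, and trigrams
    max_df=0.85,         # ignore terms found in more than 85% of postings
    min_df=0.15,         # ignore terms found in fewer than 15% of postings
    binary=True,         # record presence/absence instead of raw counts (IDF still applies)
)
X = vectorizer.fit_transform(postings)

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

With eight toy postings, `min_df=0.15` drops any term that appears in only one posting, so rare words such as "statistics" never become features.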

Statistical Modeling

Now that the data was ready to be put through a stats model, I performed a train_test_split on it and initially ran it through a LinearRegression. This gave me a “base score” that I could use to judge the accuracy of my other models, and for inference. I wound up using Ridge, Lasso, Random Forest, Gradient Boost, and even a Neural Network.
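A baseline run along these lines might look like the sketch below. The data is synthetic (random features and made-up salaries), and the split size and random seeds are assumptions rather than the settings used in this project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 "postings" with 5 numeric features each.
rng = np.random.default_rng(42)
X = rng.random((200, 5))
weights = np.array([20_000, 15_000, 10_000, 5_000, 0])
y = 60_000 + X @ weights + rng.normal(0, 5_000, 200)

# Hold out 25% of the rows for scoring.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit the baseline and score it with root mean squared error on the held-out rows.
baseline = LinearRegression().fit(X_train, y_train)
rmse = float(np.sqrt(mean_squared_error(y_test, baseline.predict(X_test))))
print(f"Baseline RMSE: ± ${rmse:,.0f}")
```

The other regressors can be swapped in for `LinearRegression` and scored the same way, which keeps the comparison between models consistent.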

For all of these models, my metric was Root Mean Square Error.
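For reference, the standard definition of Root Mean Square Error over $n$ test postings is given below, where $y_i$ is the averaged salary for posting $i$ and $\hat{y}_i$ is the model's prediction. Because the error is measured in the same units as the target, each score can be read as a dollar margin.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$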

Glassdoor’s normal range ± $32,214

Linear Regression ± $30,595

Lasso ± $30,458

Ridge ± $30,560

Random Forest ± $19,290

Gradient Boost ± $18,379

Neural Network ± $28,736

The model predicts salaries with a lower error than Glassdoor!

Feature Engineered (Poly)

My next step was to do some feature engineering and feature selection.

I made a new notebook to keep this separated from my original work.

Data Cleaning

In addition to what was done in the data acquisition process before, I used scikit-learn's PolynomialFeatures. This multiplies each column by every other column, including itself. This helps because words can have a different meaning when they are used with other words.

Because the words were being multiplied, I realized that the vectorizer would need to be tweaked and the binary hyper-parameter would need to be set to False.

After performing this, we ended up with over 39,000 features ($198^2$). This is a lot of features to throw into a model, so I turned to scikit-learn's Principal Component Analysis (PCA). This technique projects the features onto a smaller set of components while still keeping 95% of the variance. After running our 39,000 features through PCA, we came out with 964 features.
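The expand-then-reduce step can be sketched on synthetic data as below. The matrix sizes and random seed are assumptions chosen to keep the example fast. Note that scikit-learn's degree-2 expansion emits each pairwise product once, so for d input columns it produces d(d+3)/2 columns when the bias column is excluded; how the feature count above was computed is not shown in this README.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for a TF-IDF matrix: 300 documents, 10 term columns.
rng = np.random.default_rng(0)
X = rng.random((300, 10))

# Degree-2 expansion: original columns plus every pairwise product, squares included.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# A float n_components keeps the fewest components that reach that share of variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_poly)

print(X.shape, X_poly.shape, X_reduced.shape)
print(round(float(pca.explained_variance_ratio_.sum()), 3))
```

The PCA output columns are linear combinations of the expanded features rather than a subset of them, so individual components no longer correspond to single words.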

Statistical Modeling

Like before, now that the data was ready to be put through a stats model, I performed a train_test_split on it and ran it through all 6 of the models.

Glassdoor’s normal range ± $32,214

Linear Regression ± $25,390

Lasso ± $27,704

Ridge ± $24,862

Random Forest ± $18,057

Gradient Boost ± $18,033

Neural Network ± $15,194

Flask App

I worked on a Flask app that allows a user to get their estimated salary in 3 easy steps:

1. Either input text or drop in a resume/job posting (.txt, .pdf, or .docx).
2. Choose the model that you would like to be used on your text.
3. Press SUBMIT.

The app takes the text, prepares the text for the model, runs it through the desired model, and outputs an estimated salary and a range (Salary ± RMSE).
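The range itself is simple arithmetic. Here is a minimal sketch, assuming a hypothetical helper named `salary_range` and using the Neural Net Poly figures reported in this README:

```python
def salary_range(prediction, rmse):
    """Return (low, estimate, high) for a predicted salary and a model's RMSE."""
    return prediction - rmse, prediction, prediction + rmse


low, estimate, high = salary_range(87_074, 15_193)
print(f"${estimate:,.0f} (${low:,.0f} to ${high:,.0f})")
# $87,074 ($71,881 to $102,267)
```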

I then decided to run my resume through each model to see what it predicted.

| Model | Margin (± $) | Worth ($) |
| --- | --- | --- |
| Linear Regression | 29,566 | 84,534 |
| Lasso | 30,009 | 82,372 |
| Ridge | 29,512 | 78,737 |
| Random Forest | 19,087 | 95,026 |
| Gradient Boost | 19,043 | 97,914 |
| Neural Net | 27,505 | 93,068 |
| Linear Regression Poly | 25,390 | 87,647 |
| Lasso Poly | 27,703 | 76,896 |
| Ridge Poly | 24,861 | 82,104 |
| Random Forest Poly | 18,056 | 116,140 |
| Gradient Boost Poly | 18,032 | 115,486 |
| Neural Net Poly | 15,193 | 87,074 |

This is a pretty accurate estimation for entry-level jobs as a Data Scientist / Data Analyst in New York City.

You can see a video of the Flask app being used here

Conclusion

As you can see, the Root Mean Squared Error of the feature-engineered models, averaged across the six models, is over 18% lower than in our original work and over 33% lower than Glassdoor's!
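One reading that reproduces these percentages is to average the RMSE across the six models in each round and compare the averages. The check below uses the scores listed earlier in this README; treating the comparison as an average is an assumption, not something stated in the original analysis.

```python
from statistics import mean

glassdoor = 32_214
original = [30_595, 30_458, 30_560, 19_290, 18_379, 28_736]
poly = [25_390, 27_704, 24_862, 18_057, 18_033, 15_194]


def pct_lower(new, old):
    """Percentage by which `new` is lower than `old`."""
    return (old - new) / old * 100


print(round(pct_lower(mean(poly), mean(original)), 1))  # 18.2
print(round(pct_lower(mean(poly), glassdoor), 1))       # 33.1
```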

I am happy with how this project turned out and I see a lot of room to grow this project.

Some of my ideas are:

- Running the model on other types of jobs (not just tech)
- Having the Flask app output suggestions such as:
  - Suggested job title based on your resume.
  - Suggested words to use in your resume.
  - Suggested job postings that best fit your resume.
  - Suggested skills to acquire to make yourself more valuable.

Link to my presentation (Google Slides)

Link to presentation pdf (No animations :( )

jbibi1296/Resume-Salary-Predictor
