by: Daniel Ford, Glady Barrios, Kevin Smith
[Project Description] [Project Goal] [Project Planning] [Key Findings] [Data Dictionary] [Data Acquire and Prep] [Data Exploration] [Modeling] [Conclusion]
In this project we collected 100 README files from GitHub, using the search term "Spotify", and used machine learning classification models to predict each repository's primary programming language. Our main goal is to identify terms that predict a README's primary language on GitHub.
This project involves data cleaning, preparation, exploration, and modeling.
The goal of this project is to use natural language processing and classification models to identify terms for predicting a README's primary language on GitHub.
- What are the top 5 programming languages when searching for 'Spotify' repos on GitHub?
- What are the most common words in READMEs returned by a search for 'Spotify'?
- Within these top 5 programming languages, what are the most common words in each?
- What are some common bigrams in each of these languages?
- Here is a link to our Canva Presentation
- This is a short 5 minute presentation
- Our final Notebook contains the specific details of the code used in our presentation
Attribute | Definition | Data Type |
---|---|---|
Repo | The user and repository name | Object |
readme_contents | The contents of the README file | Object |
language | The repository's primary programming language | Object |
lemmatized | The cleaned and lemmatized README text | Object |
- To acquire the data, we first scraped the links to the individual repos from GitHub by iterating through the pages of search results and grabbing the 10 links from each page.
- Once this step was completed, we populated a dataframe with the name of the repo, its main coding language, and the contents of its README file, using functions that gathered those pieces of data from the GitHub API.
- Both processes were wrapped in a TQDM progress bar because the acquisition took about 45 minutes altogether; the progress bar let us know whether the function was still working or had timed out.
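The link-scraping step above can be sketched as follows. This is a minimal illustration, not the code from acquire.py: the `href` pattern is an assumption about GitHub's search-page markup, and `fetch_page` in the comment is a hypothetical helper.

```python
import re

def extract_repo_links(html: str) -> list:
    """Pull repo links out of a search-results page.

    The href pattern is an assumption about the page markup;
    the real scraper in acquire.py may parse the page differently.
    """
    candidates = re.findall(r'href="(/[\w.-]+/[\w.-]+)"', html)
    seen, links = set(), []
    for path in candidates:  # de-duplicate while preserving order
        if path not in seen:
            seen.add(path)
            links.append("https://github.com" + path)
    return links

# In the full pipeline, each results page would be fetched and parsed
# inside a tqdm-wrapped loop, e.g.:
#   for page in tqdm(range(1, 11)):
#       links.extend(extract_repo_links(fetch_page(page)))  # fetch_page is hypothetical
```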
To prepare the data we:
- lowercased the README contents to avoid case sensitivity
- removed inconsistencies in Unicode character encoding
- removed special characters, i.e. non-alphanumeric characters that add extra noise
- tokenized the data
- stemmed the data
- applied lemmatization
- removed unnecessary stopwords
- dropped rows where the README contents were null
- generated additional features for exploration and modeling, such as README length and word counts
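The cleaning steps above can be sketched with the standard library alone. This is an illustrative version, not prepare.py itself: the stopword set is abbreviated, and the real pipeline presumably uses nltk's full English stopword list plus its stemmer and lemmatizer.

```python
import re
import unicodedata

# Abbreviated stopword set for illustration only; the project's
# prepare.py would use a full English stopword list (e.g. nltk's).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}

def basic_clean(text: str) -> str:
    """Lowercase, normalize Unicode to ASCII, and strip special characters."""
    text = text.lower()
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8")
    return re.sub(r"[^a-z0-9'\s]", "", text)

def tokenize_and_filter(text: str) -> list:
    """Whitespace-tokenize and drop stopwords. The real pipeline also
    applies stemming and lemmatization after this step."""
    return [tok for tok in text.split() if tok not in STOPWORDS]
```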
- Python files used for exploration:
- acquire.py
- prepare.py
The top 5 programming languages are:
- JavaScript
- Python
- TypeScript
- Shell
- C#
The ten most commonly used words each appear between 770 and nearly 3,000 times
Each of the top 5 programming languages has relatively distinctive bigrams among its most-used word pairs
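Bigram counting like the above can be done with `collections.Counter`; a minimal sketch on a made-up token list:

```python
from collections import Counter

def top_bigrams(tokens: list, n: int = 5) -> list:
    """Count adjacent word pairs and return the n most common."""
    bigrams = zip(tokens, tokens[1:])  # pair each token with its successor
    return Counter(bigrams).most_common(n)

# Made-up example tokens, not project data
words = "spotify web api spotify web player".split()
```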
- We ran the following classification models:
- Logistic Regression Model
- Random Forest
- Stochastic Gradient Descent (SGD)
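A text-classification pipeline of this kind typically vectorizes the lemmatized README text with TF-IDF and feeds it to a classifier. A minimal sketch using scikit-learn on a tiny made-up corpus (not the project data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus -- NOT the project data. In the notebook,
# the lemmatized README text is the input and `language` is the target.
docs = [
    "def self pip python pandas import",
    "const let npm node javascript function",
    "import python def pandas dataframe",
    "npm javascript node browser const",
]
langs = ["Python", "JavaScript", "Python", "JavaScript"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(docs, langs)
prediction = model.predict(["python pandas pip"])[0]
```

`RandomForestClassifier` or `SGDClassifier` can be swapped into the same pipeline to reproduce the other two models in the table below.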
Model | Accuracy (Train) | Accuracy (Test) | Difference |
---|---|---|---|
Logistic Regression Model | 0.82 | 0.61 | 0.21 |
Stochastic Gradient Descent (SGD) | 0.97 | 0.58 | 0.39 |
Random Forest | 0.87 | 0.62 | 0.25 |
- Model 1 (Logistic Regression) results:
  - 82% accuracy on Train
  - 61% accuracy on Test
- Model 2 (SGD) results:
  - 97% accuracy on Train
  - 58% accuracy on Test
- Model 3 (Random Forest) results:
  - 87% accuracy on Train
  - 62% accuracy on Test
- After testing multiple models, our best performer was Random Forest at 62% accuracy on the test set.
- Through exploration we found that each programming language has its own distinctive distribution of words.
- For the time being, the Random Forest model should be used until a better model can be developed.
- With a many-model approach that treats each language as its own target variable, we may be able to create a model that is much more accurate at pinpointing select programming languages from README text.
- Using many models may also help us detect more obscure languages along with the more commonly used ones.
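The many-model idea described above corresponds to a one-vs-rest setup, which scikit-learn supports directly: one binary classifier per language. A minimal sketch on made-up snippets (not the project data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Made-up snippets standing in for lemmatized README text
docs = [
    "def self pip python", "import pandas python def",
    "const npm node javascript", "function browser javascript const",
    "interface tsc types typescript", "angular typescript interface types",
]
langs = ["Python", "Python", "JavaScript", "JavaScript",
         "TypeScript", "TypeScript"]

# OneVsRestClassifier fits one binary LogisticRegression per language
ovr = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
ovr.fit(docs, langs)
```

Each per-language classifier can then be inspected or tuned on its own, which is what could make this approach better at rarer languages.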