
Spotify Natural Language Processing Project

Kalpana group 5

by: Daniel Ford, Glady Barrios, Kevin Smith

Daniel

Glady

Kevin


[Project Description] [Project Goal] [Project Planning] [Key Findings] [Data Dictionary] [Data Acquire and Prep] [Data Exploration] [Modeling] [Conclusion]


Project Description:

In this project we collected 100 README files from GitHub repositories returned by the search term "Spotify" and used machine learning classification models to predict each repository's primary programming language. Our main goal is to identify terms that predict a README's primary language on GitHub.

This project involves data cleaning, preparation, exploration, and modeling.

[Back to top]


Project Goal:

The goal of this project is to use natural language processing and classification models to identify terms for predicting a README's primary language on GitHub.

[Back to top]


Project Planning:

[Back to top]

Project Outline:

Initial Questions

  • What are the top 5 programming languages among 'Spotify' repos on GitHub?
  • What are the most common words in Spotify-related READMEs?
  • Within those top 5 programming languages, what are the most common words?
  • What are some common bigrams in each of those languages' READMEs?

Need to haves (Deliverables):

  • Here is a link to our Canva Presentation
    • This is a short 5-minute presentation
  • Our final notebook containing the specific details of the code necessary for our presentation

Data Dictionary

[Back to top]

| Attribute | Definition | Data Type |
|---|---|---|
| repo | The user/name of the repository | object |
| readme_contents | The text contents of the README | object |
| language | The repo's primary programming language | object |
| lemmatized | The cleaned and lemmatized README text | object |

Data Acquisition and Preparation

Acquiring the Data

  • To acquire the data, we first scraped the links to the individual repos from GitHub by iterating through the pages of search results and grabbing the 10 links from each page.

  • Once this step was completed, we populated a dataframe with the name of each repo, its main coding language, and the contents of its README file, using functions that gathered those pieces of data from the GitHub API.

  • Both processes were wrapped in a tqdm progress bar because acquisition took about 45 minutes altogether; the progress bar let us know whether the function was still working or had timed out.
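The link-scraping step can be sketched roughly as follows. `extract_repo_links` is a hypothetical helper, not the repo's actual acquire.py code, and the regex is a simplification of GitHub's search-page markup:

```python
import re
from typing import List

def extract_repo_links(search_html: str) -> List[str]:
    """Pull repository links out of a GitHub search-results page.

    Repos appear as links of the form href="/owner/repo"; this
    regex-based sketch collects them and returns full URLs,
    deduplicated while preserving order. A real scraper would
    also filter out non-repo paths like /features or /about.
    """
    pattern = re.compile(r'href="(/[\w.-]+/[\w.-]+)"')
    seen, links = set(), []
    for match in pattern.finditer(search_html):
        path = match.group(1)
        if path not in seen:
            seen.add(path)
            links.append("https://github.com" + path)
    return links

# Iterating the result pages would then look roughly like:
# for page in tqdm(range(1, 11)):
#     html = requests.get(SEARCH_URL, params={"q": "spotify", "p": page}).text
#     links.extend(extract_repo_links(html))
```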

Preparing/Wrangling the Data

To prepare the data we:

  • lowercased the README contents to avoid case sensitivity
  • removed inconsistencies in unicode character encoding
  • removed special characters, such as non-alphanumeric characters that could cause extra noise
  • tokenized the data
  • stemmed the data
  • applied lemmatization
  • removed unnecessary stopwords
  • dropped rows where the README contents were null
  • generated additional features for exploration and modeling, such as README length and word counts
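The preparation steps above can be sketched with the standard library alone. The project's prepare.py presumably uses nltk for tokenizing, stemming, lemmatizing, and stopwords, so this is only a minimal stand-in with an abbreviated stopword list:

```python
import re
import unicodedata

# A small stand-in stopword list; the project likely used
# nltk.corpus.stopwords plus project-specific extra words.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for"}

def basic_clean(text: str) -> str:
    """Lowercase, normalize unicode, and drop special characters."""
    text = text.lower()
    text = (unicodedata.normalize("NFKD", text)
            .encode("ascii", "ignore")
            .decode("utf-8"))
    return re.sub(r"[^a-z0-9'\s]", "", text)

def tokenize(text: str) -> list:
    """Whitespace tokenization; nltk's ToktokTokenizer is the usual choice."""
    return text.split()

def remove_stopwords(tokens: list) -> list:
    return [t for t in tokens if t not in STOPWORDS]

def prepare(text: str) -> str:
    """Full pipeline minus stemming/lemmatizing, which need nltk
    (PorterStemmer / WordNetLemmatizer) and its downloaded corpora."""
    tokens = remove_stopwords(tokenize(basic_clean(text)))
    return " ".join(tokens)
```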

[Back to top]


Data Exploration:

[Back to top]

  • Python files used for exploration:
    • acquire.py
    • prepare.py

Takeaways from exploration:

The top 5 programming languages are:

  • JavaScript
  • Python
  • TypeScript
  • Shell
  • C#

The ten most commonly used words each appear between roughly 770 and nearly 3,000 times across the corpus

Each of the top 5 programming languages has its own relatively distinctive set of most-used bigrams
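Word and bigram frequencies like these can be computed with `collections.Counter`; a minimal sketch (the function names here are illustrative, not from the repo's exploration code):

```python
from collections import Counter

def top_words(docs, n=10):
    """Most common single words across a list of prepared READMEs."""
    counts = Counter(word for doc in docs for word in doc.split())
    return counts.most_common(n)

def top_bigrams(text, n=10):
    """Most common adjacent word pairs in one language's combined text."""
    words = text.split()
    bigrams = Counter(zip(words, words[1:]))
    return bigrams.most_common(n)
```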


Modeling:

[Back to top]

Model Preparation:

Models Used:

  • We ran the following classification models:
    • Logistic Regression Model
    • Random Forest
    • Stochastic Gradient Descent (SGD)
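A minimal sketch of how these three classifiers might be wired to TF-IDF features with scikit-learn. The hyperparameters shown are illustrative defaults, not the project's actual settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

def build_models():
    """TF-IDF features feeding each of the three classifiers we compared."""
    return {
        "Logistic Regression": make_pipeline(
            TfidfVectorizer(), LogisticRegression(max_iter=1000)),
        "Stochastic Gradient Descent (SGD)": make_pipeline(
            TfidfVectorizer(), SGDClassifier(random_state=42)),
        "Random Forest": make_pipeline(
            TfidfVectorizer(), RandomForestClassifier(random_state=42)),
    }
```

Each pipeline is fit on the lemmatized README text with the primary language as the target, then scored on train and test splits.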

Selecting the Best Model:

The table below collects all modeling results for easy comparison:

| Model | Accuracy (Train) | Accuracy (Test) | Difference |
|---|---|---|---|
| Logistic Regression | 0.82 | 0.61 | 0.21 |
| Stochastic Gradient Descent (SGD) | 0.97 | 0.58 | 0.39 |
| Random Forest | 0.87 | 0.62 | 0.25 |
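The comparison table can be reproduced in pandas, with the Difference column computed as train accuracy minus test accuracy (a rough overfitting indicator):

```python
import pandas as pd

# Accuracy figures from our modeling results
results = pd.DataFrame({
    "Model": ["Logistic Regression",
              "Stochastic Gradient Descent (SGD)",
              "Random Forest"],
    "Accuracy (Train)": [0.82, 0.97, 0.87],
    "Accuracy (Test)": [0.61, 0.58, 0.62],
})
results["Difference"] = (results["Accuracy (Train)"]
                         - results["Accuracy (Test)"]).round(2)

# The best model is the one with the highest test accuracy
best = results.loc[results["Accuracy (Test)"].idxmax(), "Model"]
```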

Model 1: Logistic Regression

  • Model 1 results:
    • 82% accuracy on Train
    • 61% accuracy on Test

Model 2: Stochastic Gradient Descent (SGD)

  • Model 2 results:
    • 97% accuracy on Train
    • 58% accuracy on Test

Model 3 : Random Forest

  • Model 3 results:
    • 87% accuracy on Train
    • 62% accuracy on Test

Random Forest performed the best, at 62% test accuracy.


Conclusion:

Key Items

  • After testing multiple models, our best performer was Random Forest, at 62% test accuracy

  • Through exploration we found that each programming language does have its own unique distribution of words

Recommendations

  • We recommend deploying this Random Forest model for now, until a stronger model can be developed

What's Next?

  • With a many-model approach that treats each language as its own target variable, we may be able to build models that are much more accurate at pinpointing select programming languages from README text.

  • That approach may also help the models detect more obscure languages alongside the commonly used ones.

About

Predicting the main coding language used in repos dealing with Spotify, based on NLP analysis of their READMEs.
