Welcome!

The Data Scientist Nanodegree program tackles topics such as building machine learning models, running data pipelines, designing experiments and recommendation engines, communicating effectively, and deploying data applications.


Write a Data Science Blog Post: Analysis of the 2020 Stack Overflow Developer Survey

Project_1: The project consists of choosing a subject of interest, finding and analyzing the data, and writing a nontechnical data science blog post. The process follows the CRISP-DM methodology. The data explored comes from the 2020 Stack Overflow Developer Survey.

In the first part, the salaries of data developers and other developers are compared using a Z-test for independent means. In the second part, a machine learning model based on a Random Forest Classifier is used to predict job satisfaction for data developers. The work is done in a Jupyter Notebook; the code is written in Python 3 using NumPy and Pandas.
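
As a rough illustration of the first step, here is a minimal sketch of a two-sample Z-test for independent means; the salary figures below are placeholders, while the actual analysis uses the compensation column from the survey data.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
# Placeholder salary samples; the actual analysis splits the survey's
# compensation column by developer type.
data_dev = rng.normal(95_000, 30_000, size=500)
other_dev = rng.normal(88_000, 28_000, size=2_000)

# Two-sample Z-test for independent means with unequal sample sizes
se = np.sqrt(data_dev.var(ddof=1) / len(data_dev)
             + other_dev.var(ddof=1) / len(other_dev))
z = (data_dev.mean() - other_dev.mean()) / se
p = 2 * norm.sf(abs(z))  # two-sided p-value
print(f"z = {z:.2f}, p = {p:.4f}")
```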

Link: Project1


Disaster Response Pipeline and Web App

Project_2: Given a large set of text documents (disaster messages), perform multi-label classification using supervised machine learning methods. The outcome is the list of categories a message typed into the web app belongs to.

A Random Forest Classifier is used as a benchmark model. The final model is based on an AdaBoost Classifier wrapped in a MultiOutput Classifier and tuned via grid search with cross-validation. The work is done in Jupyter notebooks using the Python data science libraries NumPy and Pandas; visualizations are created in Matplotlib and Plotly, and the text is analyzed with the NLTK NLP library. A Flask web app is created and deployed on the Heroku platform.
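
A minimal sketch of the final model's structure in scikit-learn, assuming a TF-IDF text representation and an illustrative parameter grid (the project tunes its own features and parameters):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# AdaBoost wrapped in MultiOutputClassifier: one boosted model per category
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(AdaBoostClassifier())),
])

# Hypothetical grid; the parameter names address the wrapped AdaBoost estimator
param_grid = {
    "clf__estimator__n_estimators": [50, 100],
    "clf__estimator__learning_rate": [0.5, 1.0],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
# search.fit(messages, category_matrix)  # X: raw texts, Y: binary label matrix
```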

Link: Project2


Recommendations with IBM

Project_3: We analyze the interactions that users have with articles on the IBM Watson Studio platform and make recommendations to them about new articles. The following recommenders are built: rank-based, user-user collaborative filtering, content-based, and matrix factorization (a sketch of the last one follows below).

The work is done in Jupyter notebooks using Python data science libraries, including scikit-learn; visualizations are created in Matplotlib, and the text is analyzed with the NLTK NLP library.
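
A minimal sketch of an SVD-based matrix factorization recommender on a toy user-item matrix; the real project builds this matrix from the platform's interaction logs:

```python
import numpy as np

# Toy user-item matrix (1 = user interacted with the article)
user_item = np.array([
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
], dtype=float)

# Truncated SVD: keep k latent features and reconstruct predicted affinities
u, s, vt = np.linalg.svd(user_item, full_matrices=False)
k = 2
preds = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

# Recommend the unseen article with the highest predicted affinity
user = 0
scores = np.where(user_item[user] == 0, preds[user], -np.inf)
print("recommend article:", int(np.argmax(scores)))
```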

Link: Project3


User Activity Based Churn Prediction with PySpark on an AWS EMR Cluster

Project_4: We investigate and predict churn for a fictional music platform called Sparkify. This is a binary classification problem in which the algorithm has to identify which users are most likely to churn. The best-performing classifiers are a Multilayer Perceptron and a Gradient Boosted Tree. The results are further improved with a stacking model that uses a Linear Regression meta-classifier.

The code is written in an Anaconda Jupyter Notebook with a Python 3 kernel. Additional libraries and modules used are PySpark, Pandas, NumPy, Matplotlib, and Seaborn. The model is trained on the full dataset on an AWS EMR cluster.
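
A minimal sketch of one of the classifiers in PySpark, assuming hypothetical per-user activity features (the project engineers its own aggregates from the event log):

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("sparkify-churn").getOrCreate()

# Hypothetical engineered features: label (1 = churned) plus two activity aggregates
df = spark.createDataFrame(
    [(0.0, 12.0, 3.0), (1.0, 2.0, 9.0), (0.0, 20.0, 1.0), (1.0, 1.0, 7.0)],
    ["label", "songs_per_day", "thumbs_down"],
)

# Assemble the feature columns into the single vector column Spark ML expects
assembler = VectorAssembler(
    inputCols=["songs_per_day", "thumbs_down"], outputCol="features"
)
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=10)
model = gbt.fit(assembler.transform(df))
```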

Link: Project4