Welcome!

The Data Scientist Nanodegree program tackles topics such as building machine learning models, running data pipelines, designing experiments and recommendation engines, communicating effectively, and deploying data applications.


Write a Data Science Blog Post: Analysis of the 2020 Stack Overflow Developer Survey

Project_1: The project consists of choosing a subject of interest, finding and analyzing the data, and writing a nontechnical data science blog post. The process follows the CRISP-DM methodology. The data explored comes from the 2020 Stack Overflow Developer Survey.

In the first part, the salaries of data developers and other developers are compared using a Z-test for independent means. In the second part, a machine learning model based on a Random Forest Classifier is used to predict job satisfaction for data developers. The work is done in a Jupyter Notebook; the code is written in Python 3 using NumPy and Pandas.
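
As a rough illustration of the first step, here is a minimal sketch of a two-sample Z-test for independent means; the salary figures below are placeholders, while the actual analysis uses the compensation column from the survey data.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
# Placeholder salary samples; the actual analysis splits the survey's
# compensation column by developer type.
data_dev = rng.normal(95_000, 30_000, size=500)
other_dev = rng.normal(88_000, 28_000, size=2_000)

# Two-sample Z-test for independent means with unequal sample sizes
se = np.sqrt(data_dev.var(ddof=1) / len(data_dev)
             + other_dev.var(ddof=1) / len(other_dev))
z = (data_dev.mean() - other_dev.mean()) / se
p = 2 * norm.sf(abs(z))  # two-sided p-value
print(f"z = {z:.2f}, p = {p:.4f}")
```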

Link: Project1


Disaster Response Pipeline and Web App

Project_2: Given a large set of text documents (disaster messages), perform multi-label classification using supervised machine learning methods. The outcome is the list of categories a message typed into the web app belongs to.

A Random Forest Classifier is used as a benchmark model. The final model is based on an AdaBoost Classifier wrapped in a MultiOutput Classifier and tuned via grid search with cross-validation. The work is done in Jupyter notebooks using the Python data science libraries NumPy and Pandas; visualizations are created in Matplotlib and Plotly, and the text is analyzed with the NLTK NLP library. A Flask web app is created and deployed on the Heroku platform.
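
A minimal sketch of the final model's structure in scikit-learn, assuming a TF-IDF text representation and an illustrative parameter grid (the project tunes its own features and parameters):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# AdaBoost wrapped in MultiOutputClassifier: one boosted model per category
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(AdaBoostClassifier())),
])

# Hypothetical grid; the parameter names address the wrapped AdaBoost estimator
param_grid = {
    "clf__estimator__n_estimators": [50, 100],
    "clf__estimator__learning_rate": [0.5, 1.0],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
# search.fit(messages, category_matrix)  # X: raw texts, Y: binary label matrix
```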

Link: Project2


Recommendations with IBM

Project_3: We analyze the interactions that users have with articles on the IBM Watson Studio platform and make recommendations to them about new articles. The following recommenders are built: rank-based, user-user collaborative filtering, content-based, and matrix factorization (a sketch of the last one follows below).

The work is done in Jupyter notebooks using Python data science libraries, including scikit-learn; visualizations are created in Matplotlib, and the text is analyzed with the NLTK NLP library.
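
A minimal sketch of an SVD-based matrix factorization recommender on a toy user-item matrix; the real project builds this matrix from the platform's interaction logs:

```python
import numpy as np

# Toy user-item matrix (1 = user interacted with the article)
user_item = np.array([
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
], dtype=float)

# Truncated SVD: keep k latent features and reconstruct predicted affinities
u, s, vt = np.linalg.svd(user_item, full_matrices=False)
k = 2
preds = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

# Recommend the unseen article with the highest predicted affinity
user = 0
scores = np.where(user_item[user] == 0, preds[user], -np.inf)
print("recommend article:", int(np.argmax(scores)))
```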

Link: Project3


User Activity Based Churn Prediction with PySpark on an AWS EMR Cluster

Project_4: We investigate and predict churn for a fictional music platform called Sparkify. This is a binary classification problem in which the algorithm has to identify which users are most likely to churn. The best-performing classifiers are a Multilayer Perceptron and a Gradient Boosted Tree. The results are further improved with a stacking model that uses a Linear Regression meta-classifier.

The code is written in an Anaconda Jupyter Notebook with a Python 3 kernel. Additional libraries and modules used are PySpark, Pandas, NumPy, Matplotlib, and Seaborn. The model is trained on the full dataset on an AWS EMR cluster.
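
A minimal sketch of one of the classifiers in PySpark, assuming hypothetical per-user activity features (the project engineers its own aggregates from the event log):

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("sparkify-churn").getOrCreate()

# Hypothetical engineered features: label (1 = churned) plus two activity aggregates
df = spark.createDataFrame(
    [(0.0, 12.0, 3.0), (1.0, 2.0, 9.0), (0.0, 20.0, 1.0), (1.0, 1.0, 7.0)],
    ["label", "songs_per_day", "thumbs_down"],
)

# Assemble the feature columns into the single vector column Spark ML expects
assembler = VectorAssembler(
    inputCols=["songs_per_day", "thumbs_down"], outputCol="features"
)
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=10)
model = gbt.fit(assembler.transform(df))
```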

Link: Project4