
Python Portfolio

A portfolio of data science projects, either original work or adapted from other sources for learning purposes.

The projects in this repo are presented as Jupyter Notebooks or .py files.

For detailed code examples and a brief explanation of each project, please read through the readme.md file below.

Note: Data used in the projects is for demo purposes only.


Motivation

This repository was originally intended as a record of my personal projects and learning process, but I found that aspiring data professionals could benefit from this collection of materials, as it contains numerous real-life data science examples and notebooks created by @hyunjoonbok.

As Python is nowadays the go-to language for data science, I have tried to use Python to its full potential, not only for simple EDA but also for building complex ML/DL models.

The examples below make heavy use of popular industry frameworks (e.g. PyTorch, Fast.ai, H2O, gradient boosting, etc.) to produce ready-to-use results.




Projects

In this notebook, we measure customer LTV (Lifetime Value) over any custom timeframe. We work through an example using a real-world marketing dataset provided by Kaggle. We cover the concept of LTV, preprocess and visualize the data to surface basic findings, and finally build a model that predicts 3-month CLV from each customer group's features.
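
A minimal sketch of the modeling step, assuming a transaction-level DataFrame; the file name and columns (CustomerID, InvoiceDate, Revenue) are placeholders, not the notebook's exact schema:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Hypothetical transaction log: CustomerID, InvoiceDate, Revenue
df = pd.read_csv("transactions.csv", parse_dates=["InvoiceDate"])

# Hold out the last 3 months as the CLV target window
cutoff = df["InvoiceDate"].max() - pd.DateOffset(months=3)
past, future = df[df["InvoiceDate"] <= cutoff], df[df["InvoiceDate"] > cutoff]

# Simple RFM-style features from the observation window
features = past.groupby("CustomerID").agg(
    frequency=("InvoiceDate", "count"),
    monetary=("Revenue", "sum"),
    recency=("InvoiceDate", lambda s: (cutoff - s.max()).days),
)
# Target: revenue each customer generated in the following 3 months
target = future.groupby("CustomerID")["Revenue"].sum().rename("clv_3m")
data = features.join(target).fillna({"clv_3m": 0.0})

X_train, X_test, y_train, y_test = train_test_split(
    data[["frequency", "monetary", "recency"]], data["clv_3m"], random_state=42
)
model = GradientBoostingRegressor().fit(X_train, y_train)
print("R^2 on held-out customers:", model.score(X_test, y_test))
```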

Cohort Analysis

Cohort Analysis - Customer Retention (1)
Cohort Analysis - Customer Retention (2)
Cohort Analysis - Customer Segmentation (1)
Cohort Analysis - Customer Segmentation (2)

Suppose we have a company selling a product and we want to know how well it sells. We have data to analyze, but what kind of analysis can we do? We can segment customers based on their buying behavior in the market. These notebooks introduce several ways to segment users and better understand their retention, using the K-Means algorithm in Python. Using industry marketing data, we create cohorts to understand metrics like customer retention rates, average quantity purchased, average price, etc. The notebooks cover the full analysis (preprocessing + visualization + interpretation) needed to do customer segmentation step by step.
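
A minimal sketch of the clustering step, assuming RFM features have already been computed (the file and column names are placeholders):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-customer RFM table: Recency, Frequency, MonetaryValue
rfm = pd.read_csv("rfm_features.csv")

# Standardize features so no single scale dominates the distance metric
scaled = StandardScaler().fit_transform(rfm[["Recency", "Frequency", "MonetaryValue"]])
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10).fit(scaled)
rfm["Segment"] = kmeans.labels_

# Inspect segment-level averages to interpret each cohort
print(rfm.groupby("Segment").mean())
```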

In business marketing scenarios, it is important to track conversions from the user base relative to the advertising money spent. But it is even more valuable to know **how** each of those conversions was made, so that further actions can be taken on an ongoing basis (e.g. budget adjustments). The notebook introduces the concept of Markov chains to understand attribution in marketing. Using transition probabilities, we can identify the statistical impact a single channel has on our total conversions.
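
A minimal sketch of estimating the transition probability matrix from observed customer journeys; the toy paths below are illustrative, not the notebook's data:

```python
import pandas as pd

# Toy journeys: each path runs from "start" to "conversion" or "null" (no sale)
journeys = [
    ["start", "email", "search", "conversion"],
    ["start", "social", "null"],
    ["start", "search", "conversion"],
]

# Count observed transitions between consecutive touchpoints
transitions = pd.DataFrame(
    [(p[i], p[i + 1]) for p in journeys for i in range(len(p) - 1)],
    columns=["from", "to"],
)
counts = transitions.groupby(["from", "to"]).size().unstack(fill_value=0)

# Row-normalize counts into a transition probability matrix
prob = counts.div(counts.sum(axis=1), axis=0)
print(prob)
```

With such a matrix in hand, a channel's contribution is typically measured by its removal effect: recompute the conversion probability with that channel taken out of the chain and compare.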

The notebook implements basket analysis on real-world data using Python. It goes over the concept, along with key terms and metrics, aimed at giving a sense of what "association" in a rule means and some ways to quantify the strength of that association. The entire data mining process (preprocessing + visualization + interpretation) is clearly explained.
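
A minimal sketch of association-rule mining using mlxtend (a library assumption; the notebook's implementation may differ):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical one-hot basket data: rows are transactions, columns are items
baskets = pd.DataFrame(
    {"bread": [1, 1, 0, 1], "butter": [1, 1, 0, 0], "jam": [0, 1, 1, 1]},
    dtype=bool,
)
# Find itemsets appearing in at least half of the baskets
frequent = apriori(baskets, min_support=0.5, use_colnames=True)
# Derive rules and quantify their strength via support/confidence/lift
rules = association_rules(frequent, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```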

The purpose of the notebook is to build a basic understanding of the methods behind a common NLP project. It introduces 3 different approaches that can be taken when performing text mining, from concept to actual code implementation: 1) frequency of co-occurrence of two words, 2) statistical methods of extracting connections, 3) Word2vec (DL). It also describes 2 prerequisite steps to take before performing text mining: 1) select the target word, 2) choose the context: decide what the sentence is about.
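
As a taste of the third approach, here is a minimal Word2vec sketch using gensim (a library assumption; the toy sentences are illustrative):

```python
from gensim.models import Word2Vec

# Tokenized toy corpus; real text mining would preprocess a full corpus first
sentences = [
    ["data", "science", "is", "fun"],
    ["python", "makes", "data", "science", "easy"],
]

# Train a small embedding model and query for nearby words
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=42)
print(model.wv.most_similar("data"))  # words closest in embedding space
```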

Machine Learning / Deep Learning with H2O

Complete guide to Machine Learning with H2O (AutoML)
Machine Learning Regression problem with H2O (XGBoost & Deeplearning)

In this notebook, we use a subset of the Freddie Mac Single-Family dataset to predict the interest rate for a loan using H2O's XGBoost and Deep Learning models. We explore how to use these models for a regression problem and demonstrate how to use H2O's grid search to tune the hyper-parameters of both. To do this, we build two regression models, an XGBoost model and a Deep Learning model, that help us find the interest rate a loan should be assigned; complete the tutorial to see how we achieved those results. We also go over H2O's AutoML solution, which automates the machine learning workflow, including light data preparation such as imputing missing data, standardizing numeric features, and one-hot encoding categorical features. It also provides automatic training, hyper-parameter optimization, model search, and selection under time, space, and resource constraints. H2O's AutoML further improves performance by stacking an ensemble of models: it trains one stacked ensemble on all previously trained models and another on the best model of each family.
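
A minimal sketch of the AutoML workflow described above; the file path and target column are placeholders rather than the notebook's exact code:

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()
loans = h2o.import_file("loan_level_subset.csv")  # hypothetical path
train, test = loans.split_frame(ratios=[0.8], seed=42)

# AutoML handles imputation, encoding, model search, and stacking internally
aml = H2OAutoML(max_models=20, max_runtime_secs=600, seed=42)
aml.train(y="INTEREST_RATE", training_frame=train)  # hypothetical target column

print(aml.leaderboard.head())              # models ranked by CV metric
print(aml.leader.model_performance(test))  # best model on held-out data
```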

Explainable Machine Learning with SHAP

Understand Classification Model with SHAP
Understand Regression Model with SHAP
SHAP Decision Plots in Depth

SHAP (SHapley Additive exPlanations) is a game-theoretic approach to explaining the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions. Here, we look at the implementation of Tree SHAP, a fast and exact algorithm to compute SHAP values for trees and ensembles of trees. We have 3 basic examples (regression / classification / more in-depth graphics) that can be applied to visualizing a model.
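
A minimal Tree SHAP sketch on a stand-in model and dataset (the notebooks use their own data):

```python
import shap
import xgboost
from sklearn.datasets import load_breast_cancer

# Fit a tree ensemble on a built-in dataset as a stand-in example
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = xgboost.XGBClassifier(n_estimators=100).fit(X, y)

# Tree SHAP: fast, exact Shapley values for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global view: which features drive predictions across the whole dataset
shap.summary_plot(shap_values, X)
```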

In this notebook, we use Apache Spark's machine learning library (MLlib) with PySpark to tackle an NLP problem and show how to simulate Doc2Vec inside the Spark environment. Apache Spark is a well-known distributed computing system for scaling up data processing solutions, and it provides a machine learning library called MLlib. We use Spark MLlib to look at 3,297 labeled sentences and classify them into 5 different categories.
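
A minimal sketch of such a pipeline; the input file and column names are assumptions, and Spark ML's Word2Vec (which averages word vectors per document) stands in for Doc2Vec:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, Word2Vec, StringIndexer
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("doc2vec-like").getOrCreate()
df = spark.read.csv("sentences.csv", header=True)  # hypothetical: text, category

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    # Averaged word vectors give a fixed-size document embedding
    Word2Vec(inputCol="words", outputCol="features", vectorSize=100),
    StringIndexer(inputCol="category", outputCol="label"),
    LogisticRegression(maxIter=20),
])
model = pipeline.fit(df)
```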

Apache Spark is quickly gaining steam, both in the headlines and in real-world adoption, mainly because of its ability to process streaming data. With so much data being processed daily, it has become essential to be able to stream and analyze it in real time. We use Spark MLlib to classify crime descriptions into 33 categories.

Mercari (Japan's biggest shopping app) would like to offer pricing suggestions to sellers, but this is not easy because sellers can put just about anything, or any bundle of things, on Mercari's marketplace. In this machine learning project, we build a model that automatically suggests appropriate product prices. Here we build a complete price recommendation model leveraging LightGBM.
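
A minimal LightGBM regression sketch in the spirit of this project; the feature set is a placeholder, since the real notebook engineers text and categorical features first:

```python
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("train.tsv", sep="\t")    # hypothetical Mercari-style file
X = data[["item_condition_id", "shipping"]]  # assumed numeric features
y = data["price"]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)
model = lgb.LGBMRegressor(n_estimators=1000, learning_rate=0.05)
# Early stopping on a validation set guards against overfitting
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
          callbacks=[lgb.early_stopping(50)])
```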

We utilize PyTorch's embedding layers to build a simple recommendation system. The model predicts user ratings for specific movies.
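
A minimal sketch of an embedding-based rating model (the dimensions are illustrative, not the notebook's exact architecture):

```python
import torch
import torch.nn as nn

class MatrixFactorization(nn.Module):
    def __init__(self, n_users, n_movies, dim=32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.movie_emb = nn.Embedding(n_movies, dim)

    def forward(self, user_ids, movie_ids):
        # Predicted rating = dot product of user and movie embeddings
        return (self.user_emb(user_ids) * self.movie_emb(movie_ids)).sum(dim=1)

model = MatrixFactorization(n_users=1000, n_movies=500)
ratings = model(torch.tensor([0, 1]), torch.tensor([10, 42]))
print(ratings)
```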

We build a complete ML model (a binary classification problem with imbalanced classes) leveraging Spark's computation. The full ML cycle (EDA, feature engineering, model building) is covered. In-memory computation and parallel processing are among the major reasons Apache Spark has become so popular in the big data industry for handling data products at scale and performing faster analysis.

EDA-focused regression model building to predict a ship's crew size. The CSV dataset is included in the same folder.

This notebook shows 3 different approaches that can be taken when performing text mining, from concept to actual code implementation. Text mining is an approach to finding a relationship between two words in a given sentence. Such relationships can be found using: 1) frequency of co-occurrence of two words, 2) statistical methods of extracting connections, 3) Word2vec (DL).

We build a text classifier for Reuters News (available through sklearn) based on an LSTM, using TensorFlow.

A simple MLP model in TensorFlow to solve the text classification problem. Here, we use the texts_to_matrix() function in Keras to perform text classification.
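
A minimal sketch of texts_to_matrix() feeding a small MLP (toy texts here; the notebook uses the Reuters corpus):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["stocks rally on earnings", "team wins championship game"]
labels = np.array([0, 1])

tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(texts)
# One row per document; "tfidf" weights terms by informativeness
X = tokenizer.texts_to_matrix(texts, mode="tfidf")

model = keras.Sequential([
    keras.layers.Input(shape=(1000,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, labels, epochs=2)
```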

Recommendation System - Collaborative Filtering

FastAI Implementation
Surprise Library Implementation

Experiments with the MovieLens 100K data to provide movie recommendations for users under different settings (item-based, user-based, etc.).
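
A minimal sketch of the Surprise side of this comparison; toggling user_based switches between user-based and item-based filtering:

```python
from surprise import Dataset, KNNBasic
from surprise.model_selection import cross_validate

# Built-in MovieLens 100K; downloaded on first use
data = Dataset.load_builtin("ml-100k")

# user_based=False -> item-based neighborhood model
algo = KNNBasic(sim_options={"name": "cosine", "user_based": False})
cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)
```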

An introduction to TabNet, a neural-net-based algorithm ready to use for machine learning problems on tabular datasets (the most common kind in Kaggle competitions). A PyTorch implementation with toy examples (the adult census income dataset and the forest cover type dataset) is shown in this notebook, along with the basic architecture and workflow.
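
A minimal sketch of the pytorch-tabnet API on synthetic data (the notebook uses the census income and forest cover type datasets):

```python
import numpy as np
from pytorch_tabnet.tab_model import TabNetClassifier

# Synthetic stand-in data: 10 numeric features, binary target
rng = np.random.default_rng(42)
X_train = rng.normal(size=(1000, 10)).astype(np.float32)
y_train = rng.integers(0, 2, 1000)
X_valid = rng.normal(size=(200, 10)).astype(np.float32)
y_valid = rng.integers(0, 2, 200)

clf = TabNetClassifier()
clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], max_epochs=10)
preds = clf.predict(X_valid)
```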

An image segmentation task with the Oxford-IIIT Pet Dataset to build a model that generates masks around pet images and eventually segments the image itself. Built using MobileNetV2 pretrained on ImageNet.

A CNN model using TensorFlow that recognizes Rock-Paper-Scissors. Built using MobileNetV2 pretrained on ImageNet.

A PyTorch implementation of N-shot learning. We look at image classification of word images in many different languages (the Omniglot dataset) and build a model that determines which of the evaluation-set classes a sample belongs to.

Concepts and code for fast unsupervised anomaly detection with generative adversarial networks (GANs), which is widely used in real-time anomaly detection applications. Uses the DCGAN architecture, a well-established GAN model.

Transfer learning explained. We modify the last few layers of a pretrained network to fit our own dataset.
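
A minimal sketch of the last-layer swap, assuming a torchvision backbone and a hypothetical 10-class target dataset:

```python
import torch.nn as nn
from torchvision import models

# Load a pretrained backbone (torchvision >= 0.13 weights API)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False  # freeze pretrained weights

# Replace the final fully connected layer with one sized to our classes
num_classes = 10  # assumption: 10 target classes
model.fc = nn.Linear(model.fc.in_features, num_classes)
# Only model.fc's parameters will now receive gradients during training
```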

A simple walkthrough of training loops and metrics used in learning with PyTorch, followed by a complete example using the CIFAR-10 dataset.

A complete guide to recommendation systems using collaborative filtering via matrix factorization. Concepts used in industry are explained; we compare models/metrics and build a prediction algorithm.

Style transfer in practice with PyTorch, using a pretrained VGG19 model.

Going through complete modeling steps in PyTorch on the MNIST dataset, to grasp a general idea of PyTorch concepts.

How to use TensorBoard in a Jupyter notebook when training a model in PyTorch.
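
A minimal sketch of the logging side; inside Jupyter you would then load TensorBoard with %load_ext tensorboard and %tensorboard --logdir runs:

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/demo")  # logs go to ./runs/demo
for step in range(100):
    loss = 1.0 / (step + 1)          # stand-in for a real training loss
    writer.add_scalar("train/loss", loss, step)
writer.close()
```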

A 3-way polarity (positive, neutral, negative) sentiment analysis system for Google Play app reviews. Uses PyTorch to fetch reviews in JSON, preprocess the data, create a PyTorch DataLoader, and train/evaluate the model. We evaluate the errors and test on raw text data at the end.

Building a fraud detection model using sample credit card transaction data from Kaggle. The data is highly imbalanced, so the notebook shows how to adjust sampling to address the problem (one approach is sketched below). Then we check the important metrics for evaluation (FP/TP, precision/recall, etc.).

Reference: [Kaggle CreditCard data](https://www.kaggle.com/mlg-ulb/creditcardfraud/)
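
A minimal sketch of one way to handle the class imbalance described above, using class weights in scikit-learn (the notebook's exact resampling strategy may differ):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("creditcard.csv")  # Kaggle file referenced above
X, y = df.drop(columns=["Class"]), df["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# class_weight="balanced" upweights the rare fraud class during fitting
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_train, y_train)

# Precision/recall matter far more than accuracy on imbalanced data
print(classification_report(y_test, clf.predict(X_test)))
```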

A PyTorch version of building a CNN model to classify images of a language's script. Complete model building, from loading/defining/transforming the data to creating and training the model. From [Bengali.AI Handwritten Grapheme Classification](https://www.kaggle.com/c/bengaliai-cv19) in Kaggle.

From Walmart sales data, forecast daily sales for the next 28 days. The data covers stores in three US states (California, Texas, and Wisconsin) and includes item-level, department, product-category, and store details. We preprocess the given data (feature engineering / hyperparameter optimization) and use an LGB/XGB ensemble to generate the final submission. From [M5 Forecasting - Accuracy](https://www.kaggle.com/c/m5-forecasting-accuracy/overview) in Kaggle.

Forecasting the outcomes of the remaining 2020 March Madness NCAAW games. Covers team-by-team season game results data. Preprocessing of the tabular data and an LGB/XGB ensemble generate the final submission. From [Google Cloud & NCAA® ML Competition 2020-NCAAW](https://www.kaggle.com/c/google-cloud-ncaa-march-madness-2020-division-1-womens-tournament/overview) in Kaggle. *Update: this competition was cancelled in Mar. 2020 due to COVID-19.*

A 2-way polarity (positive, negative) classification system for tweets. Uses the Fast.ai framework to fine-tune a language model and build a classification model with close to 80% accuracy.
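
A minimal sketch of this flow with the fastai v2 text API (the notebook may use an earlier version; the file and column names are placeholders):

```python
import pandas as pd
from fastai.text.all import (
    TextDataLoaders, text_classifier_learner, AWD_LSTM, accuracy
)

df = pd.read_csv("tweets.csv")  # hypothetical: text, label columns
dls = TextDataLoaders.from_df(df, text_col="text", label_col="label")

# Fine-tune a pretrained AWD-LSTM language model for classification
learn = text_classifier_learner(dls, AWD_LSTM, metrics=accuracy)
learn.fine_tune(4)
```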


  • Machine Learning

     Library / Tools: Keras, Tensorflow, fast.ai, pandas, numpy, xgboost, lightgbm, scikit-learn, optuna, Seaborn, Matplotlib

Predicting a customer's income level: a simple ML classification problem tackled with the Fast.ai API, applicable to almost any tabular data to quickly achieve a good baseline model in a few lines of code. Also covers collaborative filtering, where the task is to predict how much a user will like a certain item; here I looked at the MovieLens dataset to predict the rating (from 0 to 5) a user would give a particular movie.

### [(Kaggle) Handwritten_Image_Classification (Grapheme language)](https://github.com/hyunjoonbok/Python-Projects/blob/master/Fast.ai/%5BKaggle%5D%20(Fast.ai)%20Handwritten_Image_Classification%20(Grapheme%20language).ipynb): 

Uses Fast.ai to build a CNN model to classify images of a language's script. From [Bengali.AI Handwritten Grapheme Classification](https://www.kaggle.com/c/bengaliai-cv19) in Kaggle. Includes loading images, generating a custom loss function, and training & testing the data using Fast.ai.

Forecasting the total ride time of taxi trips in New York City. Covers both Fast.ai and LGB versions of solving the problem. From [New York City Taxi Trip Duration](https://www.kaggle.com/c/nyc-taxi-trip-duration) in Kaggle.



  • Time Series

     Library / Tools: Keras, Tensorflow, fast.ai, pandas, numpy, xgboost, lightgbm, scikit-learn, optuna, Seaborn, Matplotlib
    

    Using Fast.ai to expand tabular data, utilizing many of the columns to predict store sales under different conditions such as promotions, seasons, holidays, etc. Insights are from [Rossmann Store Sales](https://www.kaggle.com/c/rossmann-store-sales)

    From Walmart sales data, forecast daily sales for the next 28 days. The data covers stores in three US states (California, Texas, and Wisconsin) and includes item-level, department, product-category, and store details. We preprocess the given data (feature engineering / hyperparameter optimization) and use an LGB/XGB ensemble to generate the final submission. From [M5 Forecasting - Accuracy](https://www.kaggle.com/c/m5-forecasting-accuracy/overview) in Kaggle.


  • NLP/TextClassification

     Library / Tools: Pytorch, transformers, fast.ai, tqdm, pandas, numpy, pygments, google_play_scraper, albumentations, joblib, xgboost, lightgbm, scikit-learn, optuna, Seaborn, Matplotlib
    

    Uses PyTorch to encode/tokenize/train/evaluate the model. The simplest version.

    Uses a large BERT model (takes longer to train).
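
A minimal sketch of the tokenize-and-classify step with Hugging Face transformers (the checkpoint and label count here are assumptions):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Tokenize a single review and run it through the classifier head
inputs = tokenizer("This app is great!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities
```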


  • Miscellaneous

     Library / Tools: pandas, numpy, elasticsearch, datetime
    

    Uses Python to pull data directly from the ELK stack. The data originally comes in JSON format; we convert it to a DataFrame and perform simple EDA / visualization.
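
A minimal sketch of the pull-and-flatten step (the cluster URL, index name, and query are placeholders; elasticsearch-py 8.x keyword style):

```python
import pandas as pd
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster
resp = es.search(index="logs-demo", query={"match_all": {}}, size=1000)

# Each hit's "_source" holds the original JSON document
records = [hit["_source"] for hit in resp["hits"]["hits"]]
df = pd.json_normalize(records)
print(df.head())
```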


Technologies

  • Fast.ai
  • Pytorch
  • Keras
  • CV2
  • tqdm
  • pandas
  • numpy
  • albumentations
  • datetime
  • xgboost
  • lightgbm
  • scikit-learn
  • optuna
  • Seaborn
  • Matplotlib
  • elasticsearch
  • And More...



Contact

Created by @hyunjoonbok - feel free to contact me!