- Machine Learning Foundations (機器學習基石)
- https://www.youtube.com/watch?v=qGzjYrLV-4Y&index=34&list=PLXVfgk9fNX2I7tB6oIINGBmW50rrmFTqf
- https://www.youtube.com/watch?v=2LfdSCdcg1g&index=35&list=PLXVfgk9fNX2I7tB6oIINGBmW50rrmFTqf
- https://www.youtube.com/watch?v=lj2jK1FSwgo&index=36&list=PLXVfgk9fNX2I7tB6oIINGBmW50rrmFTqf
- https://www.youtube.com/watch?v=tF1HTirYbtc&index=37&list=PLXVfgk9fNX2I7tB6oIINGBmW50rrmFTqf
- Scikit-learn
http://scikit-learn.org/stable/modules/linear_model.html#linear-model
- Ordinary Least Squares
- Ridge Regression
- Lasso
- Slides & Code
- Link to PPT
- Link to Demo Code(Data)
- Link to Demo Code_fran's review. Feel free to contact me with any questions or for further details.
1. Decision Tree Hypothesis
2. Decision Tree Algorithm
3. Decision Tree Heuristics in C&RT
4. Decision Tree in Action
- Slides & Code
- Link to PPT
- Link to code
- Link to Demo Code_fran's review. Feel free to contact me with any questions or for further details.
- Lecture: https://www.youtube.com/watch?v=tH9FH1DH5n0 PDF: pdf
- Ensemble: Bagging and Boosting
- Two families of ensemble methods are usually distinguished:
- In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced. Examples: Bagging methods, Forests of randomized trees
- By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble. Examples: AdaBoost, Gradient Tree Boosting
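The variance-reduction idea behind averaging methods can be sketched in pure Python; the toy "base estimator" here (a sample mean fit on a bootstrap resample) and all names are illustrative, not scikit-learn API:

```python
import random

random.seed(0)
data = [random.gauss(10.0, 2.0) for _ in range(200)]  # noisy observations

def fit_mean(sample):
    """A deliberately simple 'base estimator': predict the sample mean."""
    return sum(sample) / len(sample)

def bagged_predict(data, n_estimators=50):
    """Bagging: fit each estimator on a bootstrap resample, then average
    the independent predictions to reduce variance."""
    preds = []
    for _ in range(n_estimators):
        boot = [random.choice(data) for _ in range(len(data))]
        preds.append(fit_mean(boot))
    return sum(preds) / len(preds)

print(bagged_predict(data))  # close to the sample mean, with lower variance
```

Each bootstrap estimator is noisy on its own, but their average is much more stable, which is exactly the point of the averaging family above.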
- GBDT
- Lecture https://www.youtube.com/watch?v=aX6ZiIWLjdk&index=42&list=PLXVfgk9fNX2IQOYPmqjqWsNUFl2kpk1U2
- PPT https://github.com/KPIxLILU/Machine-Learning-Workshop/blob/master/GBM.pdf
- Link to Demo Code_fran's review_Titanic. Feel free to contact me with any questions or for further details.
- XGBOOST
- The first half introduces Gradient Boosting (GB)
- Read these three articles first
- https://hk.saowen.com/a/e997166f37dc6022138607838ec7c83ba6f89b2d5d11fe248e0925968b410f33
- https://hk.saowen.com/a/7214d5cc99d98d81736f766d77cd568dae07aadf85f027a1e5acdd57839e7f91
- http://www.52cs.org/?p=429
- Then read this one last
- Installation-Guide for LightGBM
- Documentation for LightGBM
- Link to PPT
- Link to Demo Code(Data)
- Link to Demo Code_fran's review_HomeCredit_lightGBM_GridSearch, Link to Demo Code_fran's review_HomeCredit_lightGBM_bayes_opt. Feel free to contact me with any questions or for further details.
- Factorization Machines
- Fran's extracurricular mini-lesson: "Not sure what O(MN) or O(N^2) means?" Big-O is an asymptotic notation used to describe the growth order of sequences and series; even "tending to infinity" comes in many different tiers, determined by factors such as a function's highest-order term. Computer science borrows this concept to describe the time complexity of algorithms, which helps when analyzing the order-of-magnitude differences between algorithms. Here are the examples that helped me most: (1) In classical collaborative filtering, suppose M customers have preferences over N products; the overall computation is then O(MN), ignoring data sparsity. (2) Bubble Sort mindlessly compares adjacent pairs in order, swapping them whenever they are out of order and comparing onward; in the worst case an element is compared all the way to position n, turns out to be the smallest, and must then be moved back n positions to reach the front, so the overall computation is O(N^2). (3) Matrix multiplication is O(N^3). References: Big-O notation, time complexity
- The Factorization Machines paper, by Steffen Rendle.
- A good FM article recommended by 貸款三少
- A good FFM article recommended by 貸款三少
- A good article recommended by 貸款三少: 深入FFM原理与实践 (An In-depth Look at FFM: Principles and Practice)
- Pre-class material (read if you have time): a first look at the evolution of product recommendation algorithms and their strengths and weaknesses. Amazon.com Recommendations: Item-to-Item Collaborative Filtering
- KPIxLILU@Kaggle
- Data: Please use the Titanic Dataset or the Home Credit Dataset. About the Titanic data sources: some have asked what gender_submission.csv is. It is a sample submission that demonstrates predicting survival using only the gender information. So all we need to do is fill this table's Survived column with our model's predictions, and it can be uploaded to Kaggle. (Supplemented by Fran)
- Method: unrestricted (e.g. Random Forest, XGBoost, LightGBM, Ensemble, Stacking...)
- Demo: group discussion and sharing on 8/6; the presenter can tell everyone in advance which dataset will be used
- Link to Demo by Rex
- Reference sites for the Titanic feature-engineering approaches Erik shared with the group
- Subject Extraction with Jieba
- Label Annotations of Image Materials with GOOGLE VISION API
- Textual information of Image Materials with GOOGLE DRIVE API
- Titanic sharing and discussion.
1. Google Colab introduction
2. Esun Toydatasets sharing
3. Kaggle API upload
1.Wikipedia-PCA:https://en.wikipedia.org/wiki/Principal_component_analysis
2.An intuitive explanation of PCA(provided by Jesse Wu):http://mengnote.blogspot.com/2013/05/an-intuitive-explanation-of-pca.html
3.Textbook: the Data Science Handbook (資料科學家手冊)
4.Gram_Schmidt Process:https://en.wikipedia.org/wiki/Gram–Schmidt_process
5.sklearn.pca:http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
Q1: Is it possible for many different data points to end up with exactly the same principal-component values?
A: Yes. 1. When running PCA we may use only a subset of the variables, so after projection some observations happen to agree on the selected variables, while the variables on which they differ may have been dropped for having small variance. 2. PCA is an orthogonal linear transform, and an orthogonal transform is necessarily one-to-one, so if all the principal-component values are identical, the original data points were in fact identical.
Q2: Are the vectors produced by PCA guaranteed to be mutually orthogonal?
A: Yes. In the derivation of PCA, the eigenvectors are all put through the Gram-Schmidt process, so they are guaranteed to be mutually orthogonal.
Q3: Does random forest feature importance mean the same thing as variable selection via PCA?
A: Not quite. Random forest feature importance measures how important each variable is to the target y, which is a supervised notion; PCA instead looks for a coordinate system, takes the high-variance directions of the mapped data as the new axes, and never considers the relationship to any y, only the differences among the data points themselves.
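Q2's orthogonality claim is easy to verify numerically; a minimal sketch using NumPy's eigendecomposition of the covariance matrix (the random data and all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 3-D data: isotropic noise pushed through a triangular mixing matrix.
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.5, 0.1],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])

Xc = X - X.mean(axis=0)                 # center the data
cov = Xc.T @ Xc / (len(Xc) - 1)         # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric matrix -> orthonormal eigenvectors

# Columns of eigvecs are the principal directions; check pairwise orthogonality.
gram = eigvecs.T @ eigvecs
print(np.allclose(gram, np.eye(3)))     # True: the directions are orthonormal
```

The directions with the largest eigenvalues are the high-variance axes PCA keeps, which is exactly the unsupervised selection described in Q3's answer.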
- fran's review & demo code. Feel free to contact me with any questions or for further details.
- TBA
https://docs.google.com/spreadsheets/d/1wpOsiMSn2PTUX4KsdIOibhelGktq6vB5H1pypc9tYr8/edit#gid=0
- Topics mentioned: bases and orthogonal bases, eigenvalues and eigenvectors, vector space, etc.
- [Backpropagation](http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/BP.pdf)
- Deep_Learning: see two_layer_net and the dataset in ch04. Reference link here
- fran's review_TwoLayerNN_demo. Feel free to contact me with any questions or for further details.
- Intro
- A plain-language introduction to Hash Tables
- fran's demo about Hash Tables. Feel free to contact me with any questions or for further details.
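As a companion to the plain-language intro, a minimal separate-chaining hash table sketch in pure Python (all names are illustrative):

```python
class HashTable:
    """A tiny hash table that resolves collisions by separate chaining:
    each bucket holds a list of (key, value) pairs."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # hash() maps the key to an integer; modulo picks one bucket.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:              # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

t = HashTable()
t.put("apple", 3)
t.put("banana", 5)
t.put("apple", 7)      # overwrites the earlier value
print(t.get("apple"))  # 7
```

With a good hash function and short chains, put/get stay O(1) on average, which is the property the intro article is explaining.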
- Categorical Feature Support - Optimal Split for Categorical Features. Official documentation here
- Compared with label encoding, LightGBM does not simply assign each category an arbitrary number; it handles categories with something closer to an embedding concept
- Compared with CatBoost: at fit time, LightGBM defaults to categorical_feature='auto' and automatically handles features whose dtype is category, whereas CatBoost requires the indices of the categorical variables to be specified explicitly
- Compared with CatBoost, LightGBM allows missing values in categorical variables; CatBoost does not.
- feature_histogram (the basis on which categorical features are labeled with positive integers) - LightGBM sorts the histogram (for a categorical feature) according to its accumulated values (sum_gradient / sum_hessian) and then finds the best split on the sorted histogram. Official code here, Wikipedia on the Hessian matrix
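The quoted sorting rule (order categories by sum_gradient / sum_hessian, then search for the best split on the sorted order) can be sketched in pure Python; the statistic values and variable names below are made up for illustration and are not LightGBM internals:

```python
# Per-category accumulated gradient statistics (illustrative values only).
stats = {
    "A": {"sum_gradient": -4.0, "sum_hessian": 2.0},   # ratio -2.0
    "B": {"sum_gradient":  3.0, "sum_hessian": 3.0},   # ratio  1.0
    "C": {"sum_gradient": -1.0, "sum_hessian": 4.0},   # ratio -0.25
}

# Sort categories by sum_gradient / sum_hessian, as the docs describe.
order = sorted(stats, key=lambda c: stats[c]["sum_gradient"] / stats[c]["sum_hessian"])
print(order)  # ['A', 'C', 'B']

# On the sorted order, each prefix is a candidate "left" group for a split,
# so only len(order) - 1 splits need evaluating instead of all 2^k subsets.
candidate_splits = [(order[:i], order[i:]) for i in range(1, len(order))]
print(candidate_splits)
```

This reduction from exponentially many subsets to a linear scan over the sorted histogram is what makes the "optimal split for categorical features" tractable.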
- Fran's demo code; highlights as follows: Sampling Version, Threshold Version
- Includes code for making a single cell print multiple outputs
- Uses import seaborn as sns to plot the distribution of fraud over time
- Code that, after predicting probabilities, converts them into a DataFrame and then applies a threshold. By default, an ordinary predict assigns an observation to a class when its probability is >= 0.5. (I verified this with Logistic Regression.)
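The default 0.5 cutoff versus a custom threshold can be sketched in a few lines of Python (the probability values are made up for illustration):

```python
# Predicted positive-class probabilities from some model (illustrative values).
probas = [0.10, 0.45, 0.50, 0.73, 0.98]

def to_labels(probas, threshold=0.5):
    """Default predict() behaviour: probability >= 0.5 -> class 1.
    Raising the threshold trades recall for precision."""
    return [1 if p >= threshold else 0 for p in probas]

print(to_labels(probas))       # [0, 0, 1, 1, 1]  default 0.5 cutoff
print(to_labels(probas, 0.8))  # [0, 0, 0, 0, 1]  stricter threshold
```

For imbalanced problems like fraud detection, tuning this threshold on predicted probabilities usually matters more than the raw predict output.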
- Two kinds of feature importance: 'gain', 'split'. Feel free to contact me with any questions or for further details.
- TBA
- TBA
- fran's review & demo code_CNN_MNIST. Feel free to contact me with any questions or for further details.
- folder of fran's review. Feel free to contact me with any questions or for further details.
- Link to demo code
- EDA highlights summary
- References:
- Other materials