
We come from different academic majors, such as computer science, statistics, mathematics, and management science. These differences spark inspiring discussions that lead to varied and creative application scenarios and deepen our understanding of the methodology behind the algorithms. Our goa…

Machine-Learning-Workshop

Semester 1

Week 1: Machine Learning Foundations (機器學習基石) Overview 泰瑋

  1. https://hackmd.io/6YkQWaLARNWOMQmMxEnLHg

Week 2: Linear Regression 宛誼 (6/21)

  • Machine Learning Foundations (機器學習基石)
  1. https://www.youtube.com/watch?v=qGzjYrLV-4Y&index=34&list=PLXVfgk9fNX2I7tB6oIINGBmW50rrmFTqf
  2. https://www.youtube.com/watch?v=2LfdSCdcg1g&index=35&list=PLXVfgk9fNX2I7tB6oIINGBmW50rrmFTqf
  3. https://www.youtube.com/watch?v=lj2jK1FSwgo&index=36&list=PLXVfgk9fNX2I7tB6oIINGBmW50rrmFTqf
  4. https://www.youtube.com/watch?v=tF1HTirYbtc&index=37&list=PLXVfgk9fNX2I7tB6oIINGBmW50rrmFTqf
  • Scikit-learn

http://scikit-learn.org/stable/modules/linear_model.html#linear-model

  1. Ordinary Least Squares
  2. Ridge Regression
  3. Lasso (a minimal usage sketch of these three models follows after this list)
  • Slides & Code
  1. Link to PPT
  2. Link to Demo Code(Data)
  3. Link to Demo Code_fran's review. Feel free to contact me with any questions and further details.
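
As referenced above, a minimal scikit-learn sketch of the three linear models; the synthetic data and alpha values are illustrative, not from the workshop notebooks:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

# Synthetic regression data: y depends on the first two features plus noise.
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [
    ("Ordinary Least Squares", LinearRegression()),
    ("Ridge (L2 penalty)", Ridge(alpha=1.0)),
    ("Lasso (L1 penalty)", Lasso(alpha=0.1)),
]
for name, model in models:
    model.fit(X_train, y_train)
    # Regularization shrinks the coefficients of the irrelevant features.
    print(name, "| test R^2:", round(model.score(X_test, y_test), 3),
          "| coef:", np.round(model.coef_, 2))
```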

Week 3: Logistic Regression Rex (6/25)

  1. Link to PDF
  2. Link to Demo Code(Data)

Week 4: Classification And Regression Tree (CART) 信賢Erik (7/2)

Lecture outline: (1) Decision Tree Hypothesis, (2) Decision Tree Algorithm, (3) Decision Tree Heuristics in C&RT, (4) Decision Tree in Action. (A minimal scikit-learn CART sketch follows after the links below.)

  1. Link to PPT
  2. Link to code
  3. Link to Demo Code_fran's review. Feel free to contact me with any questions and further details.
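
A minimal CART sketch with scikit-learn, as mentioned in the outline above; the iris data and the depth limit are illustrative, not part of the workshop materials:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42)

# CART grows binary splits; gini impurity is the default criterion.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print("test accuracy:", round(tree.score(X_test, y_test), 3))
print(export_text(tree, feature_names=iris.feature_names))
```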

Week 5: Random Forest & Ensemble fran&昱睿 (7/12)

Lecture: https://www.youtube.com/watch?v=tH9FH1DH5n0&t= | PDF: pdf

  • Ensemble: Bagging and Boosting
  • Two families of ensemble methods are usually distinguished:
  1. In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced. Examples: Bagging methods, Forests of randomized trees.
  2. By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble. Examples: AdaBoost, Gradient Tree Boosting. (A small sketch contrasting the two families follows this list.)
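
A small sketch contrasting the two families described above; this is my own illustration, and the dataset and hyperparameters are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Averaging: many trees built independently on bootstrap samples,
# predictions averaged to reduce variance.
rf = RandomForestClassifier(n_estimators=200, random_state=0)

# Boosting: weak learners (stumps) built sequentially, each focusing on the
# examples the previous ones got wrong, to reduce bias.
ada = AdaBoostClassifier(n_estimators=200, random_state=0)

for name, model in [("Random Forest (averaging)", rf), ("AdaBoost (boosting)", ada)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, "mean CV accuracy:", round(scores.mean(), 3))
```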

Week 6: Gradient Boosting Machine (GBM) & eXtreme Gradient Boosting (XGBoost) 璧羽&芳妤 (7/16)

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3885826/

  • GBDT
  1. Lecture https://www.youtube.com/watch?v=aX6ZiIWLjdk&index=42&list=PLXVfgk9fNX2IQOYPmqjqWsNUFl2kpk1U2
  2. PPT https://github.com/KPIxLILU/Machine-Learning-Workshop/blob/master/GBM.pdf
  3. Link to Demo Code_fran's review_Titanic. Feel free to contact me with any questions and further details.

https://blog.csdn.net/u011094454/article/details/78948989

  • Read these three articles first

https://hk.saowen.com/a/e997166f37dc6022138607838ec7c83ba6f89b2d5d11fe248e0925968b410f33
https://hk.saowen.com/a/7214d5cc99d98d81736f766d77cd568dae07aadf85f027a1e5acdd57839e7f91
http://www.52cs.org/?p=429

  • Then read this one last

https://medium.com/@cyeninesky3/xgboost-a-scalable-tree-boosting-system-%E8%AB%96%E6%96%87%E7%AD%86%E8%A8%98%E8%88%87%E5%AF%A6%E4%BD%9C-2b3291e0d1fe

Week 7: XGBoost and LightGBM Jesse&宛誼 (7/23)

  1. Installation-Guide for LightGBM
  2. Documentation for LightGBM
  3. Link to PPT
  4. Link to Demo Code(Data)
  5. Link to Demo Code_fran's review_HomeCredit_lightGBM_GridSearch, Link to Demo Code_fran's review_HomeCredit_lightGBM_bayes_opt. Feel free to contact me with any questions and further details.

Week 8: Factorization Machines and KPIxLILU@Kaggle fran&泰瑋 (7/30) FM seed advisor: 宛誼

  • Factorization Machines
  1. Fran's extracurricular mini-lesson: "What do O(MN) and O(N^2) mean?" Big-O is an asymptotic notation used to describe the order of growth of sequences and series; even "infinitely large" comes in many different levels, determined by factors such as a function's highest-order terms. Computer science adopts this concept to describe the time complexity of algorithms, which helps analyze the order-of-magnitude differences between them. Some examples that helped me understand: (1) Traditional collaborative filtering assumes M customers have preferences over N products, so the overall computation is O(MN), ignoring data sparsity. (2) Bubble sort just compares adjacent pairs in order, swapping them whenever they are out of order and comparing again; in the worst case an element is compared all the way to the n-th position, turns out to be the smallest, and must be compared back n times to reach the front, so the overall cost is O(N^2). (3) Matrix multiplication is O(N^3). References: Big-O notation, time complexity. (A tiny bubble-sort sketch follows after this list.)
  2. The Factorization Machines paper by Steffen Rendle
  3. FM article recommended by 貸款三少
  4. FFM article recommended by 貸款三少
  5. Article recommended by 貸款三少: 深入FFM原理与实践 (an in-depth look at FFM theory and practice)
  6. Pre-class material (read when you have time): a first look at the evolution of product recommendation algorithms and their pros and cons. Amazon.com Recommendations: Item-to-Item Collaborative Filtering
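
As mentioned in item 1, a tiny bubble-sort illustration of the O(N^2) behaviour; this is my own sketch, not from the workshop code:

```python
import random

def bubble_sort(values):
    """Sort a list while counting pairwise comparisons."""
    arr = list(values)
    comparisons = 0
    for i in range(len(arr)):
        # After each pass the largest remaining element has bubbled to the end,
        # so the inner loop can stop one position earlier each time.
        for j in range(len(arr) - 1 - i):
            comparisons += 1
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
    return arr, comparisons

# The comparison count grows roughly like n^2 / 2, i.e. O(N^2).
for n in (10, 100, 1000):
    _, c = bubble_sort(random.sample(range(n * 10), n))
    print(f"n = {n:4d} -> {c} comparisons (~n^2/2 = {n * n // 2})")
```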

Week 9: Sharing I(Feature engineering & Model tuning) Rex&Erik (8/6)

  • KPIxLILU@Kaggle
  1. Data: Please use the Titanic Dataset or the Home Credit Dataset. About the Titanic data sources: someone asked what gender_submission.csv is. It is a sample submission that predicts survival based only on gender. So all we need to do is write our model's predicted Survived column into a table with the same format, and it can be uploaded to Kaggle. (Supplement by Fran; a minimal sketch of this step follows after this list.)
  2. Method: unrestricted (e.g., Random Forest, XGBoost, LightGBM, Ensemble, Stacking...)
  3. Demo: we will discuss and share together on 8/6; presenters can tell everyone in advance which data they will use.
  4. Link to Demo by Rex
  5. Reference site for the Titanic feature engineering approach that Erik shared
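
A minimal sketch of the submission step described in item 1; file names follow Kaggle's Titanic competition, and the feature subset and model are illustrative placeholders rather than a recommended solution:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("train.csv")   # Kaggle Titanic training data
test = pd.read_csv("test.csv")     # Kaggle Titanic test data

# Illustrative feature subset with no missing values in the Titanic data.
features = ["Pclass", "SibSp", "Parch"]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train[features], train["Survived"])

# Same layout as gender_submission.csv: PassengerId plus the predicted Survived.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": model.predict(test[features]),
})
submission.to_csv("submission.csv", index=False)  # upload this file to Kaggle
```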

Week 10: Sharing II (Vision API Guide & EDM feature engineering) Fran (8/13)

  1. Subject Extraction with Jieba
  2. Label Annotations of Image Materials with GOOGLE VISION API
  3. Textual information of Image Materials with GOOGLE DRIVE API
  4. Titanic sharing and discussion.
  • Demo code. Feel free to contact me with any questions and further details.

Week 11: Sharing III (Unauthorized_CC_TXN) Peggy (8/24)

1. Google Colab introduction

2. Esun toy datasets sharing

3. Kaggle API upload

PCA reference: https://medium.com/@chih.sheng.huang821/%E6%A9%9F%E5%99%A8-%E7%B5%B1%E8%A8%88%E5%AD%B8%E7%BF%92-%E4%B8%BB%E6%88%90%E5%88%86%E5%88%86%E6%9E%90-principle-component-analysis-pca-58229cd26e71

Week 12: Introduction to Principal Component Analysis (PCA) (8/27) Yurei

1. Wikipedia-PCA: https://en.wikipedia.org/wiki/Principal_component_analysis

2. An intuitive explanation of PCA (provided by Jesse Wu): http://mengnote.blogspot.com/2013/05/an-intuitive-explanation-of-pca.html

3. Textbook: 資料科學家手冊 (the data science handbook)

4. Gram-Schmidt Process: https://en.wikipedia.org/wiki/Gram–Schmidt_process

5. sklearn.decomposition.PCA: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

Q1: Is it possible for many data points to end up with exactly the same principal component values?

A: Yes, it is possible. 1. When running PCA we may keep only some of the variables, so after the projection certain observations can happen to coincide on the selected variables, while the variables where they differ are dropped because their variance is small. 2. PCA is an orthogonal linear map (orthogonal transform), and an orthogonal map is one-to-one, so if all the principal component values are exactly the same, the original data points must have been identical.

Q2: Are the vectors PCA computes guaranteed to be mutually orthogonal?

A: Yes. In the derivation of PCA the eigenvectors are orthogonalized with the Gram-Schmidt Process, so they are guaranteed to be mutually orthogonal.

Q3: Does the feature importance of a random forest mean the same thing as variable selection with PCA?

A: Not quite. A random forest's feature importance measures how much each variable matters for the target y, so it is supervised; PCA instead looks for a coordinate system, taking the high-variance directions of the projected data as the axes, without considering any relationship to a target y and looking only at differences among the data themselves.
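
A short sketch illustrating Q2 and Q3 above: the fitted components are orthonormal, and PCA only looks at variance, never at a target y. The toy data and number of components are my own illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(500, 4) @ np.diag([5.0, 2.0, 1.0, 0.1])  # unequal variances

pca = PCA(n_components=4).fit(X)

# The component vectors should be mutually orthonormal (identity Gram matrix).
gram = pca.components_ @ pca.components_.T
print("components orthonormal:", np.allclose(gram, np.eye(4), atol=1e-10))

# PCA is unsupervised: the ranking below reflects variance only, not any target.
print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
```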

Week 13: Sharing IV (Fraud Detection) 芳妤 (9/3)

  • TBA

Week 14: Discussion (9/10)

https://docs.google.com/spreadsheets/d/1wpOsiMSn2PTUX4KsdIOibhelGktq6vB5H1pypc9tYr8/edit#gid=0

Semester 2

Week 15: Introduction to Linear Algebra Yurei (9/17) (OPT)

  1. MIT OPENCOURSEWARE
  • Topics mentioned: bases and orthogonal bases, eigenvalues and eigenvectors, vector spaces, etc.
  2. 3Blue1Brown - YouTube

Week 16: An Introduction to Neural Networks Yurei (9/25)

  1. Brief Introduction of Deep Learning
  2. [Backpropagation](http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/BP.pdf)
  3. Deep_Learning: read two_layer_net and the dataset in ch04. Reference link here
  4. Yurei's Demo Code
  5. Keras Demo Code
  6. fran's review_TwoLayerNN_demo. Feel free to contact me with any questions and further details.

Week 17: Introduction to Inner Product Space and Hash Table Yurei&Fran (10/1)

Inner Product Space
Hash Table

Week 18: A detailed look at lightGBM parameters and their effect on predictions Fran (10/8)

lightGBM
  1. Categorical Feature Support - Optimal Split for Categorical Features. Official documentation here.
  • Compared with label encoding, lightGBM does not simply assign a number to each category; its handling is closer to the idea of an embedding.
  • Compared with catboost: at the fit stage, lightGBM defaults to categorical_feature='auto' and automatically handles features whose dtype is category, whereas catboost requires you to specify the indices of the categorical features.
  • Compared with catboost: lightGBM allows missing values in categorical features; catboost does not.
  2. feature_histogram (the basis on which categorical features are mapped to positive-integer labels) - LightGBM sorts the histogram (for a categorical feature) according to its accumulated values (sum_gradient / sum_hessian) and then finds the best split on the sorted histogram. Official code here; Wikipedia on the Hessian matrix.
  3. Fran's demo code; key items below (a small sketch of these points follows after this list): Sampling Version, Threshold Version
  • Contains a cell with code that prints multiple results
  • import seaborn as sns to plot the distribution of time versus fraudulent transactions
  • Code that turns predicted probabilities into a dataframe and then applies a threshold. By default, predict assigns an observation to a class when its probability is >= 0.5 (I verified this with Logistic Regression).
  • Two kinds of feature importance: 'gain' and 'split'. Feel free to contact me with any questions and further details.
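
A hedged sketch of the points above, using toy data; the column names and the 0.3 threshold are illustrative. It lets lightGBM pick up a pandas category column via the default categorical_feature='auto', converts predicted probabilities into a dataframe with a custom threshold, and reads both importance types:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

rng = np.random.RandomState(0)
df = pd.DataFrame({
    "amount": rng.exponential(100, size=5000),
    "channel": pd.Categorical(rng.choice(["web", "app", "store", None], size=5000)),
})
y = (df["amount"] > 150).astype(int).values  # toy target

# With the default categorical_feature='auto', columns of dtype "category"
# are handled natively (no one-hot needed), and missing categories are allowed.
clf = lgb.LGBMClassifier(n_estimators=50, random_state=0)
clf.fit(df, y)

# Turn predicted probabilities into a dataframe and apply a custom threshold
# instead of the implicit 0.5 used by predict().
pred = pd.DataFrame({"proba": clf.predict_proba(df)[:, 1]})
pred["label_at_0.3"] = (pred["proba"] >= 0.3).astype(int)
print(pred.head())

# 'split' counts how many times a feature is used; 'gain' sums its split gains.
for imp_type in ("split", "gain"):
    print(imp_type, clf.booster_.feature_importance(importance_type=imp_type))
```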

Week 19: xlearn hands-on Jesse (10/18)

  • TBA
Fran's extracurricular mini-lesson
  1. demo_abt_Arguments&KeywordArguments
  2. Checking performance via F1 cumulative results (starting from cell 22)

Week 20: How catboost handles categorical variables and the effect of tuning its parameters Rex (10/22)

  • TBA
Fran's extracurricular mini-lesson
  1. demo_abt_chunksize_read
  2. demo_abt_SLEEP()

Week 21: An introduction to CNN with Keras and Pytorch 泰瑋 (10/29)

  1. Hung-yi Lee's CNN video here
  2. [Data Analysis & Machine Learning] Lecture 5.1: An Introduction to Convolutional Neural Networks (資料分析&機器學習 第5.1講 卷積神經網絡介紹)
  3. fran's review & demo code_CNN_MNIST. Feel free to contact me with any questions and further details.

Week 22: Kaggle Share: What's Cooking? & Introduction to Regular Expressions Erik (11/5)

  1. Using TF-IDF
  2. nltk
  3. What's cooking
  4. Regular expression (re) practice exercises
  5. Folder of fran's review. Feel free to contact me with any questions and further details.

Week 23: Utilizing Embedding Techniques with proNet and Xlearn 宛誼 (11/14)

  1. proNet
  2. Link to PPT
  3. Link to Code

Week 24: How Recurrent Neural Networks and Long Short-Term Memory Work 芳妤 (11/23)

RNN
  1. Link to PPT
  2. Hung-yi Lee's RNN Part I (this week's topic)
  3. Hung-yi Lee's RNN Part II
RNN & LSTM
  1. how_rnns_lstm_work

Break

Week 26: How to do EDA? (data exploration and plotting) Peggy (12/7)

  1. Link to demo code
  2. EDA key-points summary
  3. References:

Week 27: Imbalanced-learn 瑞河 (12/12)

  1. Python packages_imbalanced-learn
  2. kaggle_resampling-strategies-for-imbalanced-datasets
  3. Other materials
