- Machine Learning Foundations (機器學習基石)
- https://www.youtube.com/watch?v=qGzjYrLV-4Y&index=34&list=PLXVfgk9fNX2I7tB6oIINGBmW50rrmFTqf
- https://www.youtube.com/watch?v=2LfdSCdcg1g&index=35&list=PLXVfgk9fNX2I7tB6oIINGBmW50rrmFTqf
- https://www.youtube.com/watch?v=lj2jK1FSwgo&index=36&list=PLXVfgk9fNX2I7tB6oIINGBmW50rrmFTqf
- https://www.youtube.com/watch?v=tF1HTirYbtc&index=37&list=PLXVfgk9fNX2I7tB6oIINGBmW50rrmFTqf
- Scikit-learn
http://scikit-learn.org/stable/modules/linear_model.html#linear-model
- Ordinary Least Squares
- Ridge Regression
- Lasso
- Slides & Code
- Link to PPT
- Link to Demo Code(Data)
- Link to Demo Code_fran's review. Feel free to contact me with any questions or for further details.
1. Decision Tree Hypothesis
2. Decision Tree Algorithm
3. Decision Tree Heuristics in C&RT
4. Decision Tree in Action
- Slides & Code
- Link to PPT
- Link to code
- Link to Demo Code_fran's review. Feel free to contact me with any questions or for further details.
- Lecture: https://www.youtube.com/watch?v=tH9FH1DH5n0 PDF: pdf
- Ensemble: Bagging and Boosting
- Two families of ensemble methods are usually distinguished:
- In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced. Examples: Bagging methods, Forests of randomized trees
- By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble. Examples: AdaBoost, Gradient Tree Boosting
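The variance-reduction idea behind averaging methods can be sketched in pure Python; the toy "base estimator" here (a sample mean fit on a bootstrap resample) and all names are illustrative, not scikit-learn API:

```python
import random

random.seed(0)
data = [random.gauss(10.0, 2.0) for _ in range(200)]  # noisy observations

def fit_mean(sample):
    """A deliberately simple 'base estimator': predict the sample mean."""
    return sum(sample) / len(sample)

def bagged_predict(data, n_estimators=50):
    """Bagging: fit each estimator on a bootstrap resample, then average
    the independent predictions to reduce variance."""
    preds = []
    for _ in range(n_estimators):
        boot = [random.choice(data) for _ in range(len(data))]
        preds.append(fit_mean(boot))
    return sum(preds) / len(preds)

print(bagged_predict(data))  # close to the sample mean, with lower variance
```

Each bootstrap estimator is noisy on its own, but their average is much more stable, which is exactly the point of the averaging family above.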
- GBDT
- Lecture https://www.youtube.com/watch?v=aX6ZiIWLjdk&index=42&list=PLXVfgk9fNX2IQOYPmqjqWsNUFl2kpk1U2
- PPT https://github.com/KPIxLILU/Machine-Learning-Workshop/blob/master/GBM.pdf
- Link to Demo Code_fran's review_Titanic. Feel free to contact me with any questions or for further details.
- XGBOOST
- The first half introduces Gradient Boosting (GB)
- Read these three articles first
- https://hk.saowen.com/a/e997166f37dc6022138607838ec7c83ba6f89b2d5d11fe248e0925968b410f33
- https://hk.saowen.com/a/7214d5cc99d98d81736f766d77cd568dae07aadf85f027a1e5acdd57839e7f91
- http://www.52cs.org/?p=429
- Then read this one last
- Installation-Guide for LightGBM
- Documentation for LightGBM
- Link to PPT
- Link to Demo Code(Data)
- Link to Demo Code_fran's review_HomeCredit_lightGBM_GridSearch, Link to Demo Code_fran's review_HomeCredit_lightGBM_bayes_opt. Feel free to contact me with any questions or for further details.
- Factorization Machines
- Fran's extracurricular mini-lesson: "Not sure what O(MN) or O(N^2) means?" Big-O is an asymptotic notation used to describe the growth order of sequences and series; even "tending to infinity" comes in many different tiers, determined by factors such as a function's highest-order term. Computer science borrows this concept to describe the time complexity of algorithms, which helps when analyzing the order-of-magnitude differences between algorithms. Here are the examples that helped me most: (1) In classical collaborative filtering, suppose M customers have preferences over N products; the overall computation is then O(MN), ignoring data sparsity. (2) Bubble Sort mindlessly compares adjacent pairs in order, swapping them whenever they are out of order and comparing onward; in the worst case an element is compared all the way to position n, turns out to be the smallest, and must then be moved back n positions to reach the front, so the overall computation is O(N^2). (3) Matrix multiplication is O(N^3). References: Big-O notation, time complexity
- The Factorization Machines paper, by Steffen Rendle.
- A good FM article recommended by 貸款三少
- A good FFM article recommended by 貸款三少
- A good article recommended by 貸款三少: 深入FFM原理与实践 (An In-depth Look at FFM: Principles and Practice)
- Pre-class material (read if you have time): a first look at the evolution of product recommendation algorithms and their strengths and weaknesses. Amazon.com Recommendations: Item-to-Item Collaborative Filtering
- KPIxLILU@Kaggle
- Data: Please use the Titanic Dataset or the Home Credit Dataset. About the Titanic data sources: some have asked what gender_submission.csv is. It is a sample submission that demonstrates predicting survival using only the gender information. So all we need to do is fill this table's Survived column with our model's predictions, and it can be uploaded to Kaggle. (Supplemented by Fran)
- Method: unrestricted (e.g. Random Forest, XGBoost, LightGBM, Ensemble, Stacking...)
- Demo: group discussion and sharing on 8/6; the presenter can tell everyone in advance which dataset will be used
- Link to Demo by Rex
- Reference sites for the Titanic feature-engineering approaches Erik shared with the group
- Subject Extraction with Jieba
- Label Annotations of Image Materials with GOOGLE VISION API
- Textual information of Image Materials with GOOGLE DRIVE API
- Titanic sharing and discussion.
1. Google Colab introduction
2. Esun Toydatasets sharing
3. Kaggle API upload
1.Wikipedia-PCA:https://en.wikipedia.org/wiki/Principal_component_analysis
2.An intuitive explanation of PCA(provided by Jesse Wu):http://mengnote.blogspot.com/2013/05/an-intuitive-explanation-of-pca.html
3.Textbook: the Data Science Handbook (資料科學家手冊)
4.Gram_Schmidt Process:https://en.wikipedia.org/wiki/Gram–Schmidt_process
5.sklearn.pca:http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
Q1: Is it possible for many different data points to end up with exactly the same principal-component values?
A: Yes. 1. When running PCA we may use only a subset of the variables, so after projection some observations happen to agree on the selected variables, while the variables on which they differ may have been dropped for having small variance. 2. PCA is an orthogonal linear transform, and an orthogonal transform is necessarily one-to-one, so if all the principal-component values are identical, the original data points were in fact identical.
Q2: Are the vectors produced by PCA guaranteed to be mutually orthogonal?
A: Yes. In the derivation of PCA, the eigenvectors are all put through the Gram-Schmidt process, so they are guaranteed to be mutually orthogonal.
Q3: Does random forest feature importance mean the same thing as variable selection via PCA?
A: Not quite. Random forest feature importance measures how important each variable is to the target y, which is a supervised notion; PCA instead looks for a coordinate system, takes the high-variance directions of the mapped data as the new axes, and never considers the relationship to any y, only the differences among the data points themselves.
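Q2's orthogonality claim is easy to verify numerically; a minimal sketch using NumPy's eigendecomposition of the covariance matrix (the random data and all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 3-D data: isotropic noise pushed through a triangular mixing matrix.
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.5, 0.1],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])

Xc = X - X.mean(axis=0)                 # center the data
cov = Xc.T @ Xc / (len(Xc) - 1)         # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric matrix -> orthonormal eigenvectors

# Columns of eigvecs are the principal directions; check pairwise orthogonality.
gram = eigvecs.T @ eigvecs
print(np.allclose(gram, np.eye(3)))     # True: the directions are orthonormal
```

The directions with the largest eigenvalues are the high-variance axes PCA keeps, which is exactly the unsupervised selection described in Q3's answer.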
- fran's review & demo code. Feel free to contact me with any questions or for further details.
- TBA
https://docs.google.com/spreadsheets/d/1wpOsiMSn2PTUX4KsdIOibhelGktq6vB5H1pypc9tYr8/edit#gid=0
- Topics mentioned: bases and orthogonal bases, eigenvalues and eigenvectors, vector space, etc.
- [Backpropagation](http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/BP.pdf)
- Deep_Learning: see two_layer_net and the dataset in ch04. Reference link here
- fran's review_TwoLayerNN_demo. Feel free to contact me with any questions or for further details.
- Intro
- A plain-language introduction to Hash Tables
- fran's demo about Hash Tables. Feel free to contact me with any questions or for further details.
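As a companion to the plain-language intro, a minimal separate-chaining hash table sketch in pure Python (all names are illustrative):

```python
class HashTable:
    """A tiny hash table that resolves collisions by separate chaining:
    each bucket holds a list of (key, value) pairs."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # hash() maps the key to an integer; modulo picks one bucket.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:              # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

t = HashTable()
t.put("apple", 3)
t.put("banana", 5)
t.put("apple", 7)      # overwrites the earlier value
print(t.get("apple"))  # 7
```

With a good hash function and short chains, put/get stay O(1) on average, which is the property the intro article is explaining.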
- Categorical Feature Support - Optimal Split for Categorical Features. Official documentation here
- Compared with label encoding, LightGBM does not simply assign each category an arbitrary number; it handles categories with something closer to an embedding concept
- Compared with CatBoost: at fit time, LightGBM defaults to categorical_feature='auto' and automatically handles features whose dtype is category, whereas CatBoost requires the indices of the categorical variables to be specified explicitly
- Compared with CatBoost, LightGBM allows missing values in categorical variables; CatBoost does not.
- feature_histogram (the basis on which categorical features are labeled with positive integers) - LightGBM sorts the histogram (for a categorical feature) according to its accumulated values (sum_gradient / sum_hessian) and then finds the best split on the sorted histogram. Official code here, Wikipedia on the Hessian matrix
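The quoted sorting rule (order categories by sum_gradient / sum_hessian, then search for the best split on the sorted order) can be sketched in pure Python; the statistic values and variable names below are made up for illustration and are not LightGBM internals:

```python
# Per-category accumulated gradient statistics (illustrative values only).
stats = {
    "A": {"sum_gradient": -4.0, "sum_hessian": 2.0},   # ratio -2.0
    "B": {"sum_gradient":  3.0, "sum_hessian": 3.0},   # ratio  1.0
    "C": {"sum_gradient": -1.0, "sum_hessian": 4.0},   # ratio -0.25
}

# Sort categories by sum_gradient / sum_hessian, as the docs describe.
order = sorted(stats, key=lambda c: stats[c]["sum_gradient"] / stats[c]["sum_hessian"])
print(order)  # ['A', 'C', 'B']

# On the sorted order, each prefix is a candidate "left" group for a split,
# so only len(order) - 1 splits need evaluating instead of all 2^k subsets.
candidate_splits = [(order[:i], order[i:]) for i in range(1, len(order))]
print(candidate_splits)
```

This reduction from exponentially many subsets to a linear scan over the sorted histogram is what makes the "optimal split for categorical features" tractable.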
- Fran's demo code; highlights as follows: Sampling Version, Threshold Version
- Includes code for making a single cell print multiple outputs
- Uses import seaborn as sns to plot the distribution of fraud over time
- Code that, after predicting probabilities, converts them into a DataFrame and then applies a threshold. By default, an ordinary predict assigns an observation to a class when its probability is >= 0.5. (I verified this with Logistic Regression.)
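The default 0.5 cutoff versus a custom threshold can be sketched in a few lines of Python (the probability values are made up for illustration):

```python
# Predicted positive-class probabilities from some model (illustrative values).
probas = [0.10, 0.45, 0.50, 0.73, 0.98]

def to_labels(probas, threshold=0.5):
    """Default predict() behaviour: probability >= 0.5 -> class 1.
    Raising the threshold trades recall for precision."""
    return [1 if p >= threshold else 0 for p in probas]

print(to_labels(probas))       # [0, 0, 1, 1, 1]  default 0.5 cutoff
print(to_labels(probas, 0.8))  # [0, 0, 0, 0, 1]  stricter threshold
```

For imbalanced problems like fraud detection, tuning this threshold on predicted probabilities usually matters more than the raw predict output.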
- Two kinds of feature importance: 'gain', 'split'. Feel free to contact me with any questions or for further details.
- TBA
- TBA
- fran's review & demo code_CNN_MNIST. Feel free to contact me with any questions or for further details.
- folder of fran's review. Feel free to contact me with any questions or for further details.
- Link to demo code
- EDA highlights summary
- References:
- Other materials