Decision tree is popular and there are many free and open implementation of tree-based algorithms:
- scikit-garden,
- Treelite : model compiler for decision tree ensembles,
- bonsai-dt
- xgboost,
- catBoost,
- LightGBM,
- Ytk-learn
- thunderGBM.
The discussion Who invented the decision tree? gives some clues of the history of decision tree. Like deep learning, it has a long history.
Decision tree holds some attractive properties:
- hierarchical representation for us to understand its hidden structure;
- inherent adaptive computation;
- logical expression implementation.
Observe that we train and prune a decision tree according to a specific training data set while we representation decision tree as a collection of logical expression.
CART in statistics is a historical review on decision tree.
- Recursive Partitioning and Tree-based Methods
- https://medicine.yale.edu/profile/heping_zhang/
- https://www.stat.berkeley.edu/~breiman/RandomForests/
Now consider how to determine the structure of the decision tree.
Even for a fixed number of nodes in the tree, the problem of determining the optimal structure
(including choice of input variable for each split as well as the corresponding thresholds)
to minimize the sum-of-squares error is usually computationally infeasible
due to the combinatorially large number of possible solutions.
Instead, a greedy optimization is generally done by starting with a single root node,
corresponding to the whole input space, and then growing the tree by adding nodes one at a time.
At each step there will be some number of candidate regions in input space that can be split,
corresponding to the addition of a pair of leaf nodes to the existing tree.
For each of these, there is a choice of which of the
Note that
Note that
The decision tree is really a discriminant model
Note that the tree structure is a special type of directed acyclic graph structure. Here we focus on the connection between the decision trees and probabilistic graphical models.
Regression trees and soft decision trees are extensions of the decision tree induction technique, predicting a numerical output, rather than a discrete class.
For simplicity, we suppose that all variables are numerical and the decision tree can be expressed in a polynomials form:
The product of step functions is the indicator functions.
For example,
For categorical variables equalsto(==)
test, i.e., if the variable
In MARS, the recursive partitioning regression, as binary regression tree, is viewed in a more conventional light as a stepwise regression procedure:
Delta Function is the derivative of step function, an example of generalized function;
the unit step function is the derivative of the Ramp Function defined by
Kronecker Delta is defined as
$$\delta_{ij}=\begin{cases}1, \text{if
The indicator functions is defined as $$\mathbb{1}_{ {condition} }=\begin{cases}1, &\text{if condition is hold }\ 0, &\text{otherwise}.\end{cases}$$
Given a subset characteristic function
A simple function is a finite sum
The fundamental limitation of decision tree includes:
- its lack of
continuity
, - lack
smooth [decision boundary]
or axe-aligned splits, - inability to provide good approximations to certain classes of simple often-occurring functions,
- highly instability with respect to minor perturbations in the training data.
The decision tree is simple function essentially.
Based on above observation, decision tree is considered to project the raw data into distinct decision region in an adaptive approach
and select the mode of within-node samples ground truth as their label
.
We can replace the unit step function with truncated power series as in MARS to tackle the discontinuity to invent continuos models with continuous derivatives.
MARS is an extension of recursive partitioning regression:
Note that
There is no transformation of raw input. The decision tree is just to guide the input to a proper terminal nodes associated with some specific tests.
It is difficult to imply some properties of the input according to the test results.
To classify a new instance, one starts at the top node and applies sequentially the dichotomous tests encountered to select the
appropriate successor.
Finally, a unique path
is followed, a unique terminal
node is reached and the output estimation stored there is assigned to this instance.
A natural modification is to make the step function continuous such as MARS. However, MARS is still of axis-aligned type basis expansion.
Competitive Neural Trees for Pattern Classification contains m-ary nodes and grows during learning by using inheritance to initialize new nodes.
The fuzzy decision tree introduce membership function to replace the unit step function.
In another word, the basis function is
In soft decision trees,
the unit step function
Probabilistic Boosting-Tree is the probabilistic boosting-tree that automatically constructs a tree in which each node combines a number of weak classifiers (evidence, knowledge) into a strong classifier (a conditional posterior probability).
To find smooth decision boundary of decision tree inspired methods, another technique is to transform the data at each non-terminal node.
Adaptive Neural Trees gain the complementary benefits of neural networks and decision trees.
Many multiple adaptive regression methods are specializations of a general multivariate
regression algorithm that builds hierarchical models
using a set of basis
functions and stepwise selection
:
Let us compare hierarchical models including decision tree, multiple adaptive regression spline and Recursive partitioning regression.
Decision Tree | MARS | Multivariate adaptive polynomial synthesis (MAPS) |
---|---|---|
Discontinuity | Smoothness | ---- |
Unit step function | ReLU |
The MARS overcomes the (1) and (3); The fuzzy decision tree overcomes the last one. The above methods are only for regression tasks. From now we turn to the classification tasks.
As shown in decision boundary,
In general, a pattern classifier carves up (or tesselates or partitions) the feature space into volumes called
decision regions
. All feature vectors in a decision region are assigned to the same category. The decision regions are often simply connected, but they can be multiply connected as well, consisting of two or more non-touching regions.
The decision regions are separated by surfaces called the
decision boundaries
. These separating surfaces represent points where there are ties between two or more categories.
The Manifold Hypothesis
states that real-world high-dimensional data lie on low-dimensional manifolds embedded within the high-dimensional space, namely natural high dimensional data concentrates close to a low-dimensional manifold.
Cluster hypothesis in information retrieval is to say that documents in the same cluster behave similarly with respect to relevance to information needs.
The key idea behind the unsupervised learning
of disentangled representations is that real-world data is generated by a few explanatory factors of variation which can be recovered by unsupervised learning algorithms.
- The Cluster Hypothesis: A Visual/Statistical Analysis
- https://ie.technion.ac.il/~kurland/clustHypothesisTutorial.pdf
- https://muse.jhu.edu/article/20058/pdf
- http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
- Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations
The success of decision trees is a strong evidence that both hypotheses are true.
The goal of deep learning is to learn the manifold structure in data and the probability distribution associated with the manifold.
From this geometric perspective, the unsupervised learning is to find certain properties of the manifold structure in data,
such as the Intrinsic Dimension.
What is the goal of regression tree from this geometric perspective?
Is the goal of classification tree to probability distribution
associated with the manifold from this geometric perspective?
- http://www.mit.edu/people/mitter/publications/C50_sample_complexity.pdf
- http://www.mit.edu/~mitter/publications/121_Testing_Manifold.pdf
- Geometric Understanding of Deep Learning
- http://web.mit.edu/cocosci/Papers/man_nips.pdf
- Topology and Data
- UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
- https://www.krishnaswamylab.org/
- http://www.vision.jhu.edu/assets/ElhamifarNIPS11.pdf
- http://www.cs.virginia.edu/~jdl/bib/manifolds/souvenir05.pdf
- SemiBoost: Boosting for Semi-supervised Learning
- https://alexgkendall.com/media/papers/alex_kendall_phd_thesis_compressed.pdf
In some sense, a good classifier can describe the decision boundary well. And classification is connected with the manifold approximation. In practice, the problem is that the attributes are of diverse types. Although categorical attributes can be embedded into real space, it is always different to deal with categorical values and numerical values.
The decision boundary of decision trees is axis-aligned
and therefor not smooth.
In fact, the product
The axis-aligned decision boundary restricts the expressivity of decision tree , which can be solved by data preprocessing and feature transformation such as kernel technique. Or we can modify the unit step function.
Template Matching is a natural approach to pattern classification. This can be done in a couple of equivalent ways:
- Count the number of agreements. Pick the class that has the maximum number of agreements. This is a maximum correlation approach.
- Count the number of disagreements. Pick the class with the minimum number of disagreements. This is a minimum error approach.
Template matching works well when the variations within a class are due to "additive noise." Clearly, it works for this example because there are no other distortions of the characters -- translation, rotation, shearing, warping, expansion, contraction or occlusion. It will not work on all problems, but when it is appropriate it is very effective. It can also be generalized in useful ways.
Template matching can easily be expressed mathematically.
Let minimum-distance classifier
.
Clearly, a template matching system is a minimum-distance classifier.
If a simple minimum-distance classifier is satisfactory, there is no reason to use anything more complicated. However, it frequently happens that such a classifier makes too many errors. There are several possible reasons for this:
- The features may be inadequate to distinguish the different classes
- The features may be highly correlated
- The decision boundary may have to be curved
- There may be distinct subclasses in the data
- The feature space may simply be too complex
Linear Discriminant Analysis (LDA)
is most commonly used as dimensionality reduction technique in the pre-processing step for pattern-classification and machine learning applications.
The goal is to project a dataset onto a lower-dimensional space with good class-separability in order avoid overfitting (“curse of dimensionality”) and also reduce computational costs.
LDA
assumes that the observations within each class are drawn from a multivariate Gaussian distribution
and the covariance of the predictor variables are common across all
Like LDA, the QDA
classifier assumes that the observations from each class of Y are drawn from a Gaussian distribution. However, unlike LDA, QDA assumes that each class has its own covariance matrix.
- https://sebastianraschka.com/Articles/2014_python_lda.html
- http://washstat.org/presentations/20150604/loh_slides.pdf
- http://uc-r.github.io/discriminant_analysis
- https://pubmed.ncbi.nlm.nih.gov/12365038/
The boundaries can often be approximated by linear ones connected
by a low-dimensional nonlinear manifold.
while it is difficult to determine the number of linear approximations and their `convergence region'.
For example, we can use the multiple linear combination of features to substitute the univariate function:
We will generalize the decision boundary into decision manifold.
As decision tree consists of several leaves where k-nearest neighbor classifier perform the actual classification task by a voting scheme, model selection is related to tree technique.
The goal of decision manifold is to approximate the decision boundaries rather than the data points using similar techniques like local approximation as manifold learning.
This algorithm is based upon Self-Organizing Maps and linear classification.
Mathematically, a decision surface is a hyper-surface (i.e. of
dimension
The Decision Manifold
consists of
The representatives
- The data samples are assigned to the closest local classifier representative
as the Template Matching. And compute data centroid
$n_j$ of the partition$j$ . - Once all the samples have been assigned, it is to test if samples of both classes are present at certain partition.
-
- If true, a linear classifier is trained for the partition
$j$ to obtain a separating hyperplane$\tilde{w}_j$ .
- If true, a linear classifier is trained for the partition
-
- compute a new preliminary representative and store the information for classification in the normalized vector:
$$c_j^{\prime}=\pi_{\tilde{w}_j}(n_J), v_j^{\prime}=\frac{\tilde{w}_j}{|\tilde{w}j|}$$
where $\pi{\tilde{w}_j}(\cdot)$ denotes orthogonal projection onto the hyperplane specified by
$\tilde{w}_j$
- compute a new preliminary representative and store the information for classification in the normalized vector:
$$c_j^{\prime}=\pi_{\tilde{w}_j}(n_J), v_j^{\prime}=\frac{\tilde{w}_j}{|\tilde{w}j|}$$
where $\pi{\tilde{w}_j}(\cdot)$ denotes orthogonal projection onto the hyperplane specified by
-
- Updated local classifiers subjected to the neighborhood kernel smoothing and weighted according to topological distance.
In its most simple form, classification is performed by assigning a data sample to its closest linear classifier in the same way as during training, and
then classifying it according to its position relative to the hyperplane, and is defined as
we recapitulate the properties of this method:
- The algorithm is a stochastic supervised learning method for two-class problems.
- Computation is very efficient.
- The topology induced by adjacency matrix
$A$ defines the ordering and alignment of the local classifiers; it is also exploited for optimizing classification accuracy by a weighted voting scheme. - As the topology of the decision hyper-surface is unknown, we apply a heuristic model selection that trains several classifiers with different topologies.
- The classifier performs well in case of multiple noncontiguous decision surfaces and non-linear classification problems such as XOR.
It is not a recursive partitioning method because the number of local classifiers are set as a hyperparameter determined before the training.
- http://www.ifs.tuwien.ac.at/~andi/publications/pdf/dit_ijcnn05.pdf
- http://www.ifs.tuwien.ac.at/~poelzlbauer/publications/Poe05IJCNN.pdf
- https://www.sba-research.org/team/andreas-rauber/
Suppose that we have
Here we follow the framework in Recursive Partitioning and Applications. Random forest is generated in the following way:
- Draw a bootstrap sample. Namely, sample
$n$ observations with replacement from the original sample. - Apply recursive partitioning to the bootstrap sample. At each node,
randomly select
$q$ of the$p$ predictors and restrict the splits based on the random subset of the$q$ variables. Here,$q$ should be much smaller than$p$ . - Let the recursive partitioning run to the end and generate a tree.
- Repeat Steps 1 to 3 to form a forest. The forest-based classification
is made by
majority vote
from all trees.
If Step 2 is skipped, the above algorithm is called bagging (bootstraping and aggregating).
Different from random forest and bagging, boosting is aimed to find the better model via combining weaker models.
- Apply recursive partitioning to the training set and and generate a tree., wholely or partially.
- Depending on the training set errors of this predictor, change the weights and grow the next predictor.
- Repeat Steps 1 to 2 to form a forest. The forest-based classification
is made by
majority vote
from all trees.
In forest construction, there are some practical questions including the number of trees, the number of predictors, the way to train a tree.
- https://www.stat.berkeley.edu/~breiman/wald2002-1.pdf
- Interpreting random forest classification models using a feature contribution method
- iterative Random Forests to discover predictive and stable high-order interactions
- Building more accurate decision trees with the additive tree
- https://jowel.gitlab.io/welbl/
- https://erwanscornet.github.io/
- https://phys.org/news/2018-03-science-machine-method-forests-trees.html
- https://zero-lab-pku.github.io/publication/
- https://kogalur.github.io/randomForestSRC/theory.html
- https://github.com/MingchaoZhu/InterpretableMLBook
- http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
Decision trees apply the ancient wisdom "因地制宜""因材施教" to learn from data. The problem is that we should know the the training data set well and we should evaluate how well our model fit the training data set correctly. Usually the statistical prediction methods are discriminant or generative. Although the decision trees are unified in conditional framework such as Torsten Hothorn, Kurt Hornik & Achim Zeileis, it is also possible to apply tree related methods to estimate the density of the data.
- Naive Bayes, Discriminant Analysis and Generative Methods
- Comparison of Discriminant Analysis and Decision Trees for the Detection of Subclinical Keratoconus
It is natural to extend the decision boundary to piece-wise polynomial.
B-spline and Bézier Curve can be used to describe the decision boundary.
Observe that
- http://blog.pluskid.org/?p=533
- Explaining The Success Of Nearest Neighbor Methods In Prediction
- Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds
The ideal model is to learn the manifold structure of data automatically.
- https://eeecon.uibk.ac.at/~zeileis/blog/
- https://arxiv.org/pdf/1909.00900.pdf
- The power of unbiased recursive partitioning
- Unbiased Recursive Partitioning I: A Non-parametric Conditional Inference Framework
- http://www.saedsayad.com/docs/gbm2.pdf
- http://www.saedsayad.com/author.htm
- https://epub.ub.uni-muenchen.de/
We carry the integration of parametric models into trees one step further and provide a rigorous theoretical foundation by introducing a new unified framework that embeds recursive partitioning into statistical model estimation and variable selection.
Generalized linear mixed-effects model tree (GLMM tree) algorithm allows for the detection of treatment-subgroup interactions while accounting for the clustered structure of a dataset:
where
- Model-based Recursive Partitioning
- MPT trees published in BRM
- Spatial lag model trees
- Generalised Linear Model Trees with Global Additive Effects
- Partially additive (generalized) linear model trees
- party: A Laboratory for Recursive Partytioning
- https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1906005
- Distributional regression forests on arXiv
- Network model trees
- Network Model Trees project
- Generalized structured additive regression based on Bayesian P-splines
- Glm tree
- https://eeecon.uibk.ac.at/~zeileis/news/bamlss/
In representation learning it is often assumed that real-world observations
The key idea behind this model is that the high-dimensional data
An instance starts a path from the root node all the way down to a leaf node according to its real feature value in a single decision tree. All the instances in the training data will fall into several nodes and different nodes have quite different label distributions of the instances in them in the additive tree models. Every step after passing a node, the probability of being the positive class changes with the label distributions. All the features along the path contribute to the final prediction of a single tree.
There is a trend to find a balance between the accuracy and explanation by combining decision trees and deep neural networks.
The main difference from the soft decision trees are that they are designed to learn the representation of the input samples.
Instead of combining neural networks and decision trees explicitly, several works borrow ideas from decision trees for neural networks and vice versa. These models are so-called neural decision tree in this context.
Generally speaking, tree means that the instances would take different processing methods according to their feature values.
One drawback of decision tree is the lack of feature transformation
.
Neural-Backed Decision Trees
(NBDT) are the first to combine interpretability of a decision tree with accuracy of a neural network.
Training an NBDT occurs in 2 phases: First, construct the hierarchy for the decision tree. Second, train the neural network with a special loss term. To run inference, pass the sample through the neural network backbone. Finally, run the final fully-connected layer as a sequence of decision rules.
- Construct a hierarchy for the decision tree, called the Induced Hierarchy.
- This hierarchy yields a particular loss function, which we call the Tree Supervision Loss.
- Start inference by passing the sample through the neural network backbone. The backbone is all neural network layers before the final fully-connected layer.
- Finish inference by running the final fully-connected layer as a sequence of decision rules, which we call
Embedded Decision Rules
. These decisions culminate in the final prediction.
- http://nbdt.alvinwan.com/
- https://arxiv.org/pdf/2004.00221.pdf
- Making decision trees competitive with neural networks on CIFAR10, CIFAR100, TinyImagenet200, Imagenet
- Making Decision Trees Accurate Again: Explaining What Explainable AI Did Not
Deep Neural Decision Trees (DNDT) are tree models which are realised by neural networks. A DNDT is intrinsically interpretable, as it is a tree. Yet as it is also a neural network (NN), it can be easily implemented in NN toolkits, and trained with gradient descent rather than greedy splitting.
A soft binning function is used to make the split decisions in DNDT. Given our binning function, the key idea is to construct the decision tree via Kronecker product.
- https://github.com/wOOL/DNDT
- https://arxiv.org/abs/1806.06988
- https://github.com/Nicholasli1995/VisualizingNDF
- https://sites.google.com/view/whi2018/home
- http://www.deeplearningpatterns.com/doku.php?id=deep_neural_decision_tree
Deep neural networks and decision trees operate on largely separate paradigms; typically, the former performs representation learning with pre-specified architectures, while the latter is characterised by learning hierarchies over pre-specified features with data-driven architectures. We unite the two via adaptive neural trees (ANTs), a model that incorporates representation learning into edges, routing functions and leaf nodes of a decision tree, along with a backpropagation-based training algorithm that adaptively grows the architecture from primitive modules (eg, convolutional layers).
- https://topos-theory.github.io/deep-neural-decision-forests/
- http://www.robots.ox.ac.uk/~tvg/publications/talks/deepNeuralDecisionForests.pdf
- https://www.microsoft.com/en-us/research/publication/adaptive-neural-trees/
- Deep Neural Decision Forests
- https://github.com/Microsoft/EdgeML/wiki/Bonsai
- https://www.microsoft.com/en-us/research/publication/resource-efficient-machine-learning-2-kb-ram-internet-things/
DeepGBM
integrates the advantages of the both NN and GBDT by using two corresponding NN components:
(1) CatNN, focusing on handling sparse categorical features.
(2) GBDT2NN, focusing on dense numerical features with distilled knowledge from GBDT.
- Implementation for the paper "DeepGBM: A Deep Learning Framework Distilled by GBDT for Online Prediction Tasks"
- https://tracholar.github.io/wiki/machine-learning/deep-gbm.html
- http://www.arvinzyy.cn/2019/08/13/kdd-2019-paper-notes/
- DeepGBM: A Deep Learning Framework Distilled by GBDT for Online Prediction Tasks
- https://www.msra.cn/zh-cn/news/features/kdd-2019
- Unpack Local Model Interpretation for GBDT
- Optimal Action Extraction for Random Forests and Boosted Trees
- https://github.com/benedekrozemberczki/awesome-gradient-boosting-papers
Deep neural network models usually make an effort to learn a new feature space and employ a multi-label classifier on the top.
Deep Forest
is an alternative of deep learning.
gcForest
is the first deep learning model
which is NOT based on NNs
and which does NOT rely on BP.
- http://www.lamda.nju.edu.cn/code_gcForest.ashx
- Talk: Deep Forest Towards An Alternative to Deep Neural Networks
Different multi-label forests are ensembled in each layer of MLDF.
From layer
The critical idea of measure-aware feature reuse is to partially reuse the better representation in the previous layer on the current layer if the confidence on the current layer is lower than a threshold determined in training. Therefore, the challenge lies in defining the confidence of specific multi-label measures on demand.
- https://explainai.net/
- Deep Residual Decision Making
- Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations
- https://www.compstat.statistik.uni-muenchen.de/publications/
- Deep Embedding Forest: Forest-based Serving with Deep Embedding Features
- Multi-Label Learning with Deep Forest
- Talk: An exploration to non-NN deep models based on non-differentiable modules
Recursive partitioning is a very simple idea for clustering. It is the inverse of hierarchical clustering. This article provides an introduction to recursive partitioning techniques; that is, methods that predict the value of a response variable by forming subgroups of subjects within which the response is relatively homogeneous on the basis of the values of a set of predictor variables. Recursive partitioning are essentially fairly simple nonparametric techniques for prediction and classification. When used in the standard way, they provide trees which display the succession of rules that need to be followed to derive a predicted value or class. This simple visual display explains the appeal to decision makers from many disciplines. Recursive partitioning methods have become popular and widely used tools for non-parametric regression and classification in many scientific fields.
It works by building a forest of
In each tree, the set of training points is recursively partitioned into smaller and smaller subsets until a leaf node of at most M points is reached. Each partition is based on the cosine of the angle the points make with a randomly drawn hyperplane: points whose angle is smaller than the median angle fall in the left partition, and the remaining points fall in the right partition.
The resulting tree has predictable leaf size (no larger than M) and is approximately balanced because of median splits, leading to consistent tree traversal times.
Querying the model is accomplished by traversing each tree to the query point's leaf node to retrieve ANN candidates from that tree, then merging them and sorting by distance to the query point.
We present such a method for multi-dimensional image compression called Compression via Adaptive Recursive Partitioning (CARP). In Nearest Neighbor Search, recursive partitioning is referred as an approximate methods related with k-d tree as shown in Nearest neighbors and vector models – part 2 – algorithms and data structures.
Nice! We end up with a binary tree that partitions the space.
The nice thing is that points that are close to each other in the space are more likely to be close to each other in the tree
.
In other words, if two points are close to each other in the space,
it's unlikely that any hyperplane will cut them apart.
To search for any point in this space, we can traverse the binary tree from the root. Every intermediate node (the small squares in the tree above) defines a hyperplane, so we can figure out what side of the hyperplane we need to go on and that defines if we go down to the left or right child node. Searching for a point can be done in logarithmic time since that is the height of the tree.
- k-d Tree Nearest Neighbor Search
- 14.2 - Recursive Partitioning
- Recursive Partitioning
- http://meng.rice.edu/research/
- https://core.ac.uk/display/6786050
- Metric Indexes based on Recursive Voronoi Partitioning
Spatial Trees are a recursive space partitioning data structure that can help organize high-dimensional data.
They can assist in analyzing the underlying data density
,
perform fast nearest-neighbor searches
,
and do high quality vector-quantization
.
There are several instantiations of spatial trees.
The most popular include KD trees, RP trees (random projection trees), PD trees (principal direction trees).
- https://github.com/rpav/spatial-trees
- https://www.cliki.net/spatial-trees
- http://www.sai.msu.su/~megera/
- Random projection trees and low dimensional manifolds
- Random Projections of Smooth Manifolds
- https://github.com/vioshyvo/mrpt
- https://github.com/nmslib/nmslib
- Fast Nearest Neighbor Search through Sparse Random Projections and Voting
- https://github.com/lmcinnes/pynndescent
- https://github.com/yahoojapan/NGT
- http://ann-benchmarks.com/
- https://zilliz.com/cn/
- https://www.milvus.io/
With the availability of large databases and recent improvements in deep learning methodology, the performance of AI systems is reaching, or even exceeding, the human level on an increasing number of complex tasks.
Impressive examples of this development can be found in domains such as image classification
, sentiment analysis
, speech understanding
or strategic game playing
.
However, because of their nested non-linear structure
, these highly successful machine learning and artificial intelligence models are usually applied in a black-box manner, i.e. no information is provided about what exactly makes them arrive at their predictions.
Since this lack of transparency can be a major drawback, e.g. in medical applications, the development of methods for visualizing, explaining and interpreting deep learning
models has recently attracted increasing
attention.
The decision boundary of decision trees are clear and parameterized. The visualization, explaining and interpreting decision tree models seems muck easier than deep models.
Like geometry, there are diverse forms of decision trees:
- Graphical form;
- Logical form;
- Analytic form.
And analytic form of decision trees are close to neural network. Neural decision trees,
- https://github.com/wOOL/DNDT
- Interpretable Decision Sets: A Joint Framework for Description and Prediction
- http://yosinski.com/deepvis
- https://sites.google.com/corp/view/whi2018
- http://airesearch.com/
- https://www.itu.int/en/journal/001/Documents/itu2017-5.pdf
- Interpretable Machine Learning
- https://christophm.github.io/interpretable-ml-book/
- https://beenkim.github.io/
- https://zhuanlan.zhihu.com/p/78822770
- https://zhuanlan.zhihu.com/p/74542589
- https://www.jiqizhixin.com/articles/0211https://www.jiqizhixin.com/articles/0211https://www.jiqizhixin.com/articles/0211
- https://www.unite.ai/what-is-a-decision-tree/
- https://github.com/conan7882/CNN-Visualization
The recommdener sysytem is different from search because there is no explicit query in the recommder system. The recommdener learn the users' preference or interest in some items under some specific context. In an abstract way, the recommder system is to map the information of the user into the sorted order of the items.
We have shown that the essence of the decision tree is template matching. Tree-based techniques can speed up the search of items such as the vector serach model.
- https://www.infoq.cn/article/y95dtkfr2_lhBus40Ksi
- https://blog.csdn.net/b0Q8cpra539haFS7/article/details/79722374
- https://arxiv.org/pdf/1902.07565.pdf
- https://github.com/imsheridan/DeepRec
- https://arxiv.org/abs/1801.02294
- https://github.com/LRegan666/Tree_Deep_Model
The sum-product network is the general form in Tractable Deep Learning.
Essentially all tractable graphical models can be cast as SPNs, but SPNs are also strictly more general. We then propose learning algorithms for SPNs, based on backpropagation and EM.
The compactness of graphical models can often be greatly
increased by postulating the existence of hidden variables
The SPN is defined as following
A SPN is rooted DAG whose leaves are
$x_1, \cdots , x_n$ and $\bar{x}_1, \cdots, \bar{x}n$ with internal sum and product nodes, where each edge $(i, j)$ emanating from sum node $i$ has a weight $w{i} \geq 0$.
Advantages of SPNs:
- Unlike graphical models, SPNs are tractable over high treewidth models
- SPNs are a deep architecture with full probabilistic semantics
- SPNs can incorporate features into an expressive model without requiring approximate inference.
Decision tree is flexible with categorical and numerical variable. Besides the dummy variables, another generalization of decision tree is the statistical relational learning based on sum-product network.
Conventional machine learning techniques have two primary assumptions that limit their application in relational domains.
First, algorithms for propositional data assume that data instances are recorded in homogeneous structures
(i.e., a fixed number of attributes for each entity) but relational data instances are usually more varied and complex (e.g., molecules have different numbers of atoms and bonds).
Second, the algorithms assume that data instances are independent but relational data often violate this assumption---dependencies
may occur either as a result of direct relations or through chaining multiple relations together.
For example, scientific papers have dependencies through both citations (direct) and authors (indirect).
- Introduction to Statistical Relational Learning
- https://data-science-blog.com/blog/2016/08/17/statistical-relational-learning/
- https://martint.blog/
- Robert Peharz publication
- Talk: Tractable Models and Deep Learning: A Love Marriage (Robert Peharz)
Pairwise Learning refers to learning tasks with the associated loss functions depending on pairs of examples.
Circle Loss provides a pair similarity optimization viewpoint on deep feature learning, aiming to maximize the within-class similarity
Optimizing unified loss
function is defined as
$$\mathcal{L}{uni}(x)=\log[1+\exp(\sum{i}\sum_{j}\gamma((s_n^j − s_p^i+m)))]$$
in which
We consider to enhance the optimization flexibility by allowing each similarity score to learn at its own pace, depending on its current optimization status.
The unified loss function is transferred into the proposed Circle loss
by
$$\mathcal{L}{circle}(x)=\log[1+\exp(\sum{i}\sum_{j}\gamma((\alpha_p^i s_n^j − \alpha_p^i s_p^i)))]$$
in which
- Learning pairwise image similarities for multi-classification using Kernel Regression Trees
- https://arxiv.org/pdf/1612.02295.pdf
- https://arxiv.org/pdf/1703.07464.pdf
- https://www.salford-systems.com/products/mars
- MARS
- Discussion
- Tree-classifier via Nearest Neighbor Graph Partitioning
- Recursive Partitioning for Personalization using Observational Data
- https://statmodeling.stat.columbia.edu/2019/11/26/machine-learning-under-a-modern-optimization-lens-under-a-bayesian-lens/
- https://arxiv.org/abs/1908.01755
- https://arxiv.org/abs/1904.12847
- http://akyrillidis.github.io/
————————
- https://github.com/jwasham/coding-interview-university
- https://github.com/jackfrued/Python-100-Days
- https://tlverse.org/tmle3/index.html
Contrastive methods, as the name implies, learn representations by contrasting positive and negative examples. Although not a new paradigm, they have led to great empirical success in computer vision tasks with unsupervised contrastive pre-training.
Clusters of points belonging to the same class are pulled together in embedding space, while simultaneously pushing apart clusters of samples from different classes.
More formally, for any data point
- here
$x^+$ is data point similar or congruent to$x$ , referred to as a positive sample. -
$x^-$ is a data point dissimilar to$x$ , referred to as a negative sample. - the
$\textrm{score}$ function is a metric that measures the similarity between two features.
Here the
- https://paperswithcode.com/
- https://paperswithcode.com/sota
- https://github.com/HobbitLong/SupContrast
- Supervised Contrastive Learning
- https://ankeshanand.com/blog/2020/01/26/contrative-self-supervised-learning.html
Super Learner is motivated by this use of cross validation as a weighted combination of many candidate learners.
- https://tlverse.org/tmle3/index.html
- https://tlverse.org/sl3/
- https://pubmed.ncbi.nlm.nih.gov/17910531/
Tree is a special kind of Directed Acyclic Graph.
- https://en.wikipedia.org/wiki/Network_calculus
- https://www.nps.edu/web/math/network-science
- https://www.siam.org/conferences/cm/conference/ns18
There are four key network properties to characterize a graph: degree distribution
, path length
, clustering coefficient
, and connected components
.
The connectivity of a graph measures the size
of the largest connected component.
The largest connected component is the largest set where any two vertices can be joined by a path.
The clustering coefficient (for undirected graphs) measures what proportion of node
- https://github.com/GraphBLAS/graphblas-pointers
- http://graphblas.org/index.php?title=Graph_BLAS_Forum
- Graph Algorithms in the Language of Linear Algebra
- https://www.cs.yale.edu/homes/spielman/
- https://github.com/alibaba/euler
- https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/gelly/
- http://tinkerpop.apache.org/
The oridnary database stores the tabular data, where each record makes up a row like a vector. And Structured Query Language (SQL) is the standard language to query the vector-like data. And many search engines are based on the vector search.
- https://www.cnblogs.com/myitroad/p/7727570.html
- https://blog.csdn.net/javeme/article/details/82631834
- http://tinkerpop.apache.org/
A computational graph is defined as
a directed graph where the nodes
correspond to mathematical operations. Computational graphs are a way of expressing and evaluating a mathematical expression.
[To create a computational graph, we make each of these operations, along with the input variables, into nodes. When one node’s value is the input to another node, an arrow goes from one to another.](- http://colah.github.io/posts/2015-08-Backprop/)
- https://deepnotes.io/tensorflow
- https://sites.google.com/a/uclab.re.kr/lhparkdeep/feeding-habits
- https://kharshit.github.io/blog/
- http://rll.berkeley.edu/cgt/
- http://www.charuaggarwal.net/
- https://www.cl.cam.ac.uk/~ey204/teaching/ACS/R244_2017_2018/presentation/S2/Nat_TF.pdf
- DEEP LEARNING WITH DYNAMIC COMPUTATION GRAPHS
One application of computational graph is to explain the automatic differentiation
or auto-diff
.
- http://www.autodiff.org/
- https://github.com/apache/incubator-tvm
- https://tvm.apache.org/
- https://github.com/google/jax
- https://github.com/TimelyDataflow/differential-dataflow
- http://www.optimization-online.org/DB_FILE/2020/02/7640.pdf
The Dataflow Model is to deal with the distributed machine learning which focus on the data rather than the operation.
Map-Reduce | Iterative Map-reduce | Parameter sever | Data flow |
---|---|---|---|
Hadoop | Twister | Flink | Tensorflow |
- http://iterativemapreduce.org/
- https://github.com/dmlc/ps-lite
- https://ps-lite.readthedocs.io/en/latest/
The dataflow model use the Directed Acyclic Graph (DAG) to describe the programs.
- The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
- http://cs.brown.edu/~ugur/vldbj03.pdf
- https://www.cl.cam.ac.uk/~ey204/teaching/ACS/R244_2017_2018/presentation/S5/Thomas_DFM.pdf
- https://www.cl.cam.ac.uk/~ey204/teaching/ACS/R244_2017_2018/
- https://id2221kth.github.io/
- https://yq.aliyun.com/articles/64911
- https://cloud.google.com/dataflow/
- https://nifi.apache.org/index.html
- https://my.oschina.net/taogang/blog/1819665
- Awesome Flow-Based Programming (FBP) Resources
- Naiad: A Timely Dataflow System
Airflow is a platform to programmatically author, schedule and monitor workflows.
Use Airflow to author workflows as Directed Acyclic Graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
Airflow is not a data streaming solution. Tasks do not move data from one to the other (though tasks can exchange metadata!).
- http://oozie.apache.org/
- http://airflow.apache.org/index.html
- https://azkaban.github.io/
- https://www.xuxueli.com/xxl-job/
- https://open-data-automation.readthedocs.io/en/latest/fme/job-scheduling.html
- https://wexflow.github.io/
Graph Learning is to learn from the graph data: Simply said, it’s the application of machine learning techniques on graph-like data. Deep learning techniques (neural networks) can, in particular, be applied and yield new opportunities which classic algorithms cannot deliver.
- https://snap-stanford.github.io/cs224w-notes/machine-learning-with-networks/graph-neural-networks
- https://graphsandnetworks.com/blog/
- http://web.stanford.edu/class/cs224w/
- https://networkdatascience.ceu.edu/
- http://tensorlab.cms.caltech.edu/users/anima/
- https://graphvite.io/
- https://github.com/alibaba/graph-learn
- https://www.mindspore.cn/install
- http://www.norbertwiener.umd.edu/
- http://david-wild-knzl.squarespace.com/
- https://github.com/stanford-ppl/Green-Marl
- https://arxiv.org/abs/2001.00426
- https://sigport.org/
- https://sigport.org/sites/default/files/graphSP_prob.pdf
- http://biron.usc.edu/wiki/index.php/EE_599_Graph_Signal_Processing
- http://gspworkshop.org/
- https://lts2.epfl.ch/
- https://pygsp.readthedocs.io/en/stable/https://pygsp.readthedocs.io/en/stable/
- https://epfl-lts2.github.io/gspbox-html/
- http://2019.ieeeglobalsip.org/pages/symposium-graph-signal-processing
- http://spark.apache.org/graphx/
- http://giraph.apache.org/
- https://www.graphengine.io/
- http://web.media.mit.edu/~xdong/resource.html
- https://epfl-lts2.github.io/gspbox-html/
- https://www.usenix.org/conference/fast17/technical-sessions/presentation/liu
- https://github.com/twitter/cassovary
- https://info.sice.indiana.edu/~dingying/Teaching/S604/LODBook.pdf
- http://cheminfov.informatics.indiana.edu:8080/c2b2r/
- Linked Data Glossary
- http://events.linkeddata.org/ldow2018/
- http://schlegel.github.io/balloon/
- https://www.w3.org/standards/semanticweb/data
- https://wiki.lyrasis.org/
- http://www.iaria.org/conferences2018/ALLDATA18.html
- https://hunglvosu.github.io/res/lect10-social-net.pdf
- https://socialnetwork.readthedocs.io/en/latest/index.html
- https://www.cs.purdue.edu/homes/neville/
- https://www.knime.com/network-mining
- https://www.knime.com/knime-labs
- http://gautambhat.github.io/gautambhat.github.io/
- http://aimlab.cs.uoregon.edu/smash/
- http://monajalal.github.io/
Graph neural networks are aimed to process the graph data with the deep learning models.
Generally, deep neural networks are aimed at the tensor inputs.
- https://grlplus.github.io/
- http://tkipf.github.io/
- Deep learning with graph-structured representations
- Natural Language Processing and Text Mining with Graph-structured Representations
- The resurgence of structure in deep neural networks
- https://nlp.stanford.edu/~socherr/thesis.pdf
- https://www.cl.cam.ac.uk/~pl219/
- https://cqf.io/papers/Dynamic_Hierarchical_Mimicking_CVPR2020.pdf
Transformers are Graph Neural Networks.
At a high level, all neural network architectures build representations of input data as vectors/embeddings, which encode useful statistical and semantic information about the data. These latent or hidden representations can then be used for performing something useful, such as classifying an image or translating a sentence.
We update the hidden feature of the $j$th word in a sentence from layer
$$\text{where} \ w_{ij} = \text{softmax}j \left( Q^{\ell} h{i}^{\ell} \cdot K^{\ell} h_{j}^{\ell} \right),$$
where