Support Vector Machine (SVM)

Overview

  • Support Vector Machines (SVM)

    • never a good bet for Artificial Intelligence tasks that need good representations
    • SVM: just a clever reincarnation of the Perceptron with a kernel function
    • viewpoint 1:
      • expanding the input to a (very large) layer of non-linear non-adaptive features; like perceptrons w/ big layers of features
      • only one layer of adaptive weights, the weights from the features to the decision unit
      • a very efficient way of fitting the weights that controls overfitting via a maximum-margin hyperplane in a high-dimensional space
    • viewpoint 2:
      • using each input vector in the training set to define a non-adaptive "feature"
      • global match btw a test input and that training input, i.e., how similar the test input is to a particular training case
      • a clever way of simultaneously doing feature selection and finding weights on the remaining features
    • Limitation:
      • only for non-adaptive features and one layer of adaptive weights
      • unable to learn multiple layers of representation
  • kernel function

    • kernel function:

      $$K(\vec{v}, \vec{w}) \quad \text{with} \quad K: \Bbb{R}^N \times \Bbb{R}^N \to \Bbb{R}$$

    • the function computes a dot product btw $\vec{v}$ and $\vec{w}$, i.e., a measure of 'similarity' btw $\vec{v}$ and $\vec{w}$ (a minimal sketch appears at the end of this section)

  • Binary vs. multiclass classification

    • binary classifier: simple
    • multiclass classification problem
      • more than 2 possible outcomes
      • example: train a face identification system to determine the identity of a photograph from a pool of N people (where N > 2)
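
A minimal sketch of the kernel-as-similarity idea above, using the plain dot product as the kernel; the function name `linear_kernel` is illustrative, not from any particular library.

```python
import numpy as np

def linear_kernel(v, w):
    """K(v, w) = <v, w>: a dot product, i.e., a 'similarity' score between v and w."""
    return np.dot(v, w)

v = np.array([1.0, 2.0, 0.5])
w = np.array([0.9, 2.1, 0.4])
print(linear_kernel(v, w))  # larger when v and w point in similar directions
```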

Modeling SVM

  • hinge loss functions $\text{cost}_1(z)$ and $\text{cost}_0(z)$ (a minimal code sketch appears at the end of this section)

  • Support vector machine: Objective

    $$\min_\theta \; C \cdot \sum_{i=1}^m \left[ y^{(i)} \, \text{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)}) \, \text{cost}_0(\theta^T x^{(i)}) \right] + \dfrac{1}{2} \sum_{j=1}^n \theta_j^2$$

  • cost functions

    • If $y=1$, we want $\theta^T x \geq 1$ (not just $\geq 0$)
    • If $y=0$, we want $\theta^T x \leq -1$ (not just $< 0$)
  • Logistic regression vs SVMs

    • logistic regression vs. SVM: $n =$ number of features ($x \in \mathbb{R}^{n+1}$), $m =$ number of training examples
      • if $n$ is large (relative to $m$): use logistic regression, or SVM without a kernel ("linear kernel")
      • if $n$ is small and $m$ is intermediate (e.g., n = 1~1,000, m = 10~10,000): use SVM with a Gaussian kernel
      • if $n$ is small and $m$ is large (e.g., n = 1~1,000, m = 50,000+): create/add more features, then use logistic regression or SVM without a kernel

    • A neural network is likely to work well in most of these settings, but may be slower to train

  • Multi-class classification

    • classes: $y \in \{1, 2, 3, \ldots, K\}$
    • Many SVM packages already have built-in multi-class classification functionality
    • Otherwise, use the one-vs-all method: train $K$ SVMs, one to distinguish $y=i$ from the rest, for $i=1, 2, \ldots, K$, obtaining $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(K)}$. Pick the class $i$ with the largest $(\theta^{(i)})^T x$
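
A minimal NumPy sketch of the hinge costs and the SVM objective written above; `cost1`, `cost0`, and `svm_objective` are illustrative names, and `X` is assumed to carry a leading column of ones for the intercept term $\theta_0$.

```python
import numpy as np

def cost1(z):
    """Hinge-style cost used when y = 1: zero once z >= 1, grows linearly otherwise."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """Hinge-style cost used when y = 0: zero once z <= -1, grows linearly otherwise."""
    return np.maximum(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    """C * sum_i [y_i cost1(theta^T x_i) + (1 - y_i) cost0(theta^T x_i)] + 0.5 * sum_{j>=1} theta_j^2."""
    z = X @ theta                              # X includes a leading column of ones (x_0 = 1)
    data_term = np.sum(y * cost1(z) + (1 - y) * cost0(z))
    reg_term = 0.5 * np.sum(theta[1:] ** 2)    # theta_0 is conventionally not regularized
    return C * data_term + reg_term
```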

Decision Boundary

  • Simplification for Decision Boundary

    • Objective:

      $$\min_\theta \; C \underbrace{\sum_{i=1}^m \left[ y^{(i)} \, \text{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)}) \, \text{cost}_0(\theta^T x^{(i)}) \right]}_{(A)} + \dfrac{1}{2} \sum_{j=1}^n \theta_j^2$$

    • When $C \gg 0$, minimizing the cost function drives $(A)$ to $0$

    • Whenever $y^{(i)} = 1$: $\theta^T x^{(i)} \geq 1$
      Whenever $y^{(i)} = 0$: $\theta^T x^{(i)} \leq -1$

      $$\begin{array}{rl} \min_\theta & C \cdot 0 + \dfrac{1}{2} \displaystyle\sum_{j=1}^n \theta^2_j \\ \text{s.t.} & \theta^T x^{(i)} \geq 1 \quad \text{if } y^{(i)} = 1 \\ & \theta^T x^{(i)} \leq -1 \quad \text{if } y^{(i)} = 0 \end{array}$$

  • SVM decision boundary

    • Objective

      $$\begin{array}{rl} \displaystyle \min_\theta & \dfrac{1}{2} \displaystyle\sum_{j=1}^n \theta^2_j \\ \text{s.t.} & \theta^T x^{(i)} \geq 1 \quad \text{if } y^{(i)} = 1 \\ & \theta^T x^{(i)} \leq -1 \quad \text{if } y^{(i)} = 0 \end{array}$$

    • Projections and hypothesis

      $$\begin{array}{rl} \displaystyle \min_\theta & \dfrac{1}{2} \displaystyle\sum_{j=1}^n \theta^2_j = \dfrac{1}{2} \parallel \theta \parallel^2 \\ \text{s.t.} & p^{(i)} \cdot \parallel \theta \parallel \geq 1 \quad \text{if } y^{(i)} = 1 \\ & p^{(i)} \cdot \parallel \theta \parallel \leq -1 \quad \text{if } y^{(i)} = 0 \end{array}$$

      where $p^{(i)}$ is the projection of $x^{(i)}$ onto the vector $\theta$.

      • Simplification: $\theta_0 = 0$; with $\theta_0 = 0$, the decision boundary passes through the origin
      • $\theta$ is always at $90^\circ$ (perpendicular) to the decision boundary (see the numeric sketch below)
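
A small numeric check of the projection view above (with $\theta_0 = 0$): $p^{(i)} \cdot \parallel \theta \parallel$ equals $\theta^T x^{(i)}$, so requiring $\theta^T x^{(i)} \geq 1$ while keeping $\parallel \theta \parallel$ small forces the projections $p^{(i)}$ (the margins) to be large. The vectors below are illustrative.

```python
import numpy as np

theta = np.array([2.0, 1.0])                      # parameter vector (theta_0 = 0)
x_i = np.array([1.5, 1.0])                        # a training example

p_i = np.dot(x_i, theta) / np.linalg.norm(theta)  # projection of x^(i) onto theta
lhs = p_i * np.linalg.norm(theta)                 # p^(i) * ||theta||

print(lhs, np.dot(theta, x_i))                    # the two values coincide: theta^T x^(i)
```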

Kernels

  • Gaussian kernel

    • Given $x$, compute new feature depending on proximity to landmarks $l^{(1)}, l^{(2)}, l^{(3)}, \ldots$

      $$\begin{array}{rcl} f_1 &=& similarity(x, l^{(1)}) = \exp\left( -\dfrac{\parallel x - l^{(1)} \parallel^2}{2\sigma^2} \right) \\ f_2 &=& similarity(x, l^{(2)}) = \exp\left( -\dfrac{\parallel x - l^{(2)} \parallel^2}{2\sigma^2} \right) \\ f_3 &=& similarity(x, l^{(3)}) = \exp\left( -\dfrac{\parallel x - l^{(3)} \parallel^2}{2\sigma^2} \right) \\ & \cdots & \end{array}$$

    • manually pick 3 landmarks

    • given an example $x$, define the features as a measure of similarity between $x$ and the landmarks (see the code sketch at the end of this section)

      $$\begin{array}{rcl} f_1 &=& similarity(x, l^{(1)}) \\ f_2 &=& similarity(x, l^{(2)}) \\ f_3 &=& similarity(x, l^{(3)}) \end{array}$$

    • kernel: $k(x, l^{(i)}) = similarity(x, l^{(i)})$

    • The similarity functions are Gaussian kernels, $\exp\left( - \dfrac{\parallel x - l^{(i)} \parallel^2}{2\sigma^2} \right)$.

  • Kernels and Similarity

    $$f_1 = similarity(x, l^{(1)}) = \exp\left(-\dfrac{\parallel x - l^{(1)} \parallel^2}{2\sigma^2}\right) = \exp\left(-\dfrac{\sum_{j=1}^n (x_j - l_j^{(1)})^2}{2\sigma^2}\right)$$

    • If $x \approx l^{(1)}: f_1 \approx \exp \left( -\dfrac{0^2}{2\sigma^2} \right) \approx 1$
    • If $x$ is far from $l^{(1)}: f_1 = \exp \left( - \dfrac{(\text{large number})^2}{2\sigma^2} \right) \approx 0$
  • SVM with kernels

    • Given $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})$, choose $l^{(1)} = x^{(1)}, l^{(2)} = x^{(2)}, \ldots, l^{(m)} = x^{(m)}$

    • Given example $x$:

      $$\begin{array}{l} f_0 = 1 \\ f_1 = similarity(x, l^{(1)}) \\ f_2 = similarity(x, l^{(2)}) \\ \cdots \end{array} \implies f = \begin{bmatrix} f_0 \\ f_1 \\ \vdots \\ f_m \end{bmatrix}$$

    • For a training example $(x^{(i)}, y^{(i)})$:

      $$x^{(i)} \quad\implies\quad \begin{array}{rcl} f_0^{(i)} &=& 1 \\ f_1^{(i)} &=& sim(x^{(i)}, l^{(1)}) \\ f_2^{(i)} &=& sim(x^{(i)}, l^{(2)}) \\ &\cdots& \\ f_i^{(i)} &=& sim(x^{(i)}, l^{(i)}) = \exp\left( -\dfrac{0}{2\sigma^2} \right) = 1 \\ &\cdots& \\ f_m^{(i)} &=& sim(x^{(i)}, l^{(m)}) \end{array} \implies f^{(i)} = \begin{bmatrix} f_0^{(i)} \\ f_1^{(i)} \\ \vdots \\ f_m^{(i)} \end{bmatrix}$$

    • Hypothesis: Given $x$, compute features $f \in \mathbb{R}^{m+1}$

      Predict "y=1" if $\theta^Tf = \theta_0 f_0 + \theta_1 f_1 + \ldots + \theta_m f_m \geq 0, \quad \theta \in \mathbb{R}^{m+1}$

    • Training

      $$\min_\theta \; C \cdot \sum_{i=1}^m \left[ y^{(i)} \, \text{cost}_1(\theta^T f^{(i)}) + (1 - y^{(i)}) \, \text{cost}_0(\theta^T f^{(i)}) \right] + \dfrac{1}{2} \sum_{j=1}^{n\,(=m)} \theta_j^2$$

      $$\begin{array}{rcl} \displaystyle\sum_{j=1}^m \theta_j^2 &=& \theta^T \theta = \begin{bmatrix} \theta_1 & \theta_2 & \cdots & \theta_m \end{bmatrix} \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_m \end{bmatrix} = \parallel \theta \parallel^2 \\ &=& \theta^T M \theta = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_m \end{bmatrix} \begin{bmatrix} 0 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_m \end{bmatrix} \end{array}$$

    • applying kernel's idea to other algorithms

      • you can do so by applying the kernel idea and defining the features using landmarks
      • however, the computational tricks that make kernels efficient for SVMs do not generalize to other algorithms
  • SVM parameters

    • $C (= 1/\lambda)$
      • Large C (small $\lambda$): lower bias, higher variance
      • Small C (large $\lambda$): higher bias, lower variance
    • $\sigma^2$
      • Large $\sigma^2$: features $f_i$ vary more smoothly $\implies$ higher bias, lower variance
      • Small $\sigma^2$: features $f_i$ vary less smoothly $\implies$ lower bias, higher variance
  • Other choice of kernel

    • Note: not all similarity functions $similarity(x, l)$ make valid kernels. (Need to satisfy technical condition called "Mercer's Theorem" to make sure SVM packages' optimizations run correctly, and do not diverge).
    • Many off-the-shelf kernels available
      • Polynomial kernel: $k(x, l) = (x^T l + \text{constant})^{\text{degree}}$, e.g., $(x^T l)^2, (x^T l)^3, (x^T l + 1)^3, (x^T l)^4, \ldots$
      • More esoteric: String kernel, chi-square kernel, histogram intersection kernel, ...
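
A minimal sketch of the Gaussian-kernel features defined in this section, with the landmarks chosen as the training examples themselves; the toy data and the `gaussian_similarity` name are illustrative.

```python
import numpy as np

def gaussian_similarity(x, l, sigma):
    """f = exp(-||x - l||^2 / (2 sigma^2)): close to 1 when x is near l, close to 0 when far."""
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

# Landmarks l^(i) = x^(i), as in "SVM with kernels" above.
X_train = np.array([[1.0, 2.0], [3.0, 0.5], [0.0, 1.0]])
sigma = 1.0

x = np.array([1.1, 1.9])                                  # a query example
f = np.array([1.0] + [gaussian_similarity(x, l, sigma) for l in X_train])  # f_0 = 1
print(f)   # f_1 is close to 1 because x is near the first landmark
```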

Binary Classification

  • Linear SVM, Binary classification
    • $\exists$ a two-class dataset $D$
    • training a classifier $C$ to predict the class labels of future data points
    • binary classification problem: yes/no question
      • medicine: given a patient's vital data, does the patient have a cold?
      • computer vision: does the image contain a person?

Multiclass Classification

  • Main approaches to the multiclass problem
    • directly add a multiclass extension to a binary classifier
      • pros: a principled way of solving the multiclass problem
      • cons: much more complicated $\to$ significantly longer training and test procedures
    • combine multiple binary classifiers to create a 'mega' multiclass classifier
      • pros: simple idea, easy to implement, faster than multiclass extensions
      • cons: an ad-hoc method for solving the multiclass problem $\to$ there probably exist datasets
        • on which OVO/OVA performs poorly
        • but on which a general multiclass classifier performs well
    • recommendation: use OVO (One-vs-One) / OVA (One-vs-All) rather than more complicated generalized multiclass classifiers (a minimal OVO/OVA sketch follows this list)
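
A minimal OVO/OVA sketch with scikit-learn's explicit wrappers; the iris data is just an illustrative 3-class example (note that sklearn's SVC already applies a one-vs-one scheme internally).

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                          # 3 classes

ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # K binary SVMs (one-vs-all)
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # K(K-1)/2 binary SVMs (one-vs-one)

print(ova.predict(X[:5]), ovo.predict(X[:5]))
```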

Linear SVM

  • Linear SVM

    • find a hyperplane $\vec{w}$ that best separates the data points in the training set by class label; $\vec{w}$ $\implies$ the decision boundary
    • classify a point $x_i \in X$ (the dataset) by simply checking which 'side' of $\vec{w}$ the point lies on (Fig. 1)
    • the hyperplane $\vec{w}$ (a line in $\Bbb{R}^2$) separating the space into two halves (Fig. 2)
      • the Decision Boundary (solid line) of a Linear SVM on a linearly-separable dataset
      • SVM trained on 75% of the dataset and evaluated on the remaining 25% (circled data points are from the test set); a minimal version of this setup is sketched below
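
A minimal sketch of the train/evaluate setup described above (75% training, 25% test); the synthetic two-class dataset is illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)   # two roughly separable blobs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)                     # 75% / 25% split

clf = SVC(kernel="linear").fit(X_train, y_train)              # linear SVM decision boundary
print("test accuracy:", clf.score(X_test, y_test))
```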

Nonlinear SVM

  • Nonlinear dataset

    • no single line in $\Bbb{R}^2$ separates the classes (Fig. 3)
    • both a random classifier and a linear SVM perform poorly (Fig. 4)
  • Dealing w/ non-separable data

    • assumption: separating hyperplane $\vec{w}$ as the decision boundary
    • either generalize the constraint that the decision boundary be a line in the original feature space (here $\Bbb{R}^2$),
    • or explicitly discover decision boundaries w/ arbitrary shape
  • Separable in higher-dimension

    • view the 2D dataset as a projection of a true dataset that lives in $\Bbb{R}^3$
    • the $\Bbb{R}^3$ dataset is easily linearly separable by a hyperplane
    • train a linear SVM classifier that successfully finds a good decision boundary
    • given the dataset in $\Bbb{R}^2$, find a transformation $T: \Bbb{R}^2 \to \Bbb{R}^3$ s.t. transformed dataset linearly separable in $\Bbb{R}^3$
    • assume a transformation $\phi$, the new classification pipeline
      1. transform the training set $X$ to $X'$ w/ $\phi$
      2. train a linear SVM on $X'$ to get classifier $f_{svm}$
      3. test: map a new sample $\vec{x}$ to $\vec{x'} = \phi(\vec{x})$ $\to$ the output class label is determined by $f_{svm}(\vec{x'})$
    • the hyperplane learned in $\Bbb{R}^3$ is nonlinear when projected back to $\Bbb{R}^2$
    • improving the expressiveness of the linear SVM classifier by working in a higher-dimensional space (a minimal sketch of this pipeline appears at the end of this section)
  • Procedures for non-linear SVM

    • a dataset $D$ in $\Bbb{R}^N$ that is not linearly separable
    • assume $\exists$ a transformation $\phi$ that lifts the dataset $D$ to a higher-dimensional space $\Bbb{R}^M$ ($M > N$), giving $D^\prime$
    • train a linear SVM on $D^\prime$ to find a decision boundary $\vec{w}$ that separates the classes in $D^\prime$
    • project the decision boundary $\vec{w}$ found in $\Bbb{R}^M$ back to the original space $\Bbb{R}^N$
  • Caveat: impractical for large dimensions

    • consider the computational consequences of increasing the dimensionality from $\Bbb{R}^N$ to $\Bbb{R}^M$ ($M > N$)
    • $M$ grows very quickly w.r.t. $N$ (e.g., $M \in \mathcal{O}(2^N)$) $\implies$ learning SVMs via explicit dataset transformations will incur serious computational and memory problems
    • in general, a $d$-dimensional polynomial kernel maps from $\Bbb{R}^N$ to an $\binom{N+d}{d}$-dimensional space
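
A minimal sketch of the explicit-lift pipeline from "Separable in higher-dimension": a hand-picked $\phi: \Bbb{R}^2 \to \Bbb{R}^3$ (appending $x_1^2 + x_2^2$, an illustrative choice suited to ring-shaped data) makes the classes linearly separable, so a linear SVM trained on $\phi(X)$ succeeds.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

def phi(X):
    """Lift each 2D point (x1, x2) to (x1, x2, x1^2 + x2^2)."""
    return np.column_stack([X, (X ** 2).sum(axis=1)])

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)  # not separable in R^2
clf = LinearSVC().fit(phi(X), y)                                            # linear SVM in R^3
print("training accuracy:", clf.score(phi(X), y))                           # ~1.0 after the lift
```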

Kernel trick

  • Dot products

    • the SVM has no need to explicitly work in the higher-dimensional space at training or testing time
    • during training, the optimization problem only uses the training samples to compute pair-wise dot products $\langle \vec{x_i}, \vec{x_j} \rangle$, where $\vec{x_i}, \vec{x_j} \in \Bbb{R}^N$
  • Kernel trick

    • $\exists$ kernel functions $K(\vec{v}, \vec{w})$, $\vec{v}, \vec{w} \in \Bbb{R}^N$, that compute the dot product btw $\vec{v}$ and $\vec{w}$ in a higher-dimensional space $\Bbb{R}^M$ w/o explicitly transforming $\vec{v}$ and $\vec{w}$ to $\Bbb{R}^M$
    • implication: by using a kernel $K(\vec{x_i}, \vec{x_j})$, we implicitly transform the dataset to a higher-dimensional $\Bbb{R}^M$ w/o using extra memory and w/ a minimal effect on computation time (a small numeric check appears at the end of this section)
  • Kernel functions

    • kernel function $K(\vec{v}, \vec{w})$: $K: \Bbb{R}^N \times \Bbb{R}^N \to \Bbb{R}$
    • a kernel $K$ effectively computes dot products in a high-dimensional space $\Bbb{R}^M$ while remaining in $\Bbb{R}^N$
    • $\forall \, \vec{x_i}, \vec{x_j} \in \Bbb{R}^N$: $K(\vec{x_i}, \vec{x_j}) = \left(\phi(\vec{x_i}), \phi(\vec{x_j})\right)_M$, where $(\cdot, \cdot)_M$ is the inner product of $\Bbb{R}^M$ ($M > N$) and $\phi: \Bbb{R}^N \to \Bbb{R}^M$ transforms $\vec{x}$ to $\Bbb{R}^M$
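
A small numeric check of the kernel trick for the degree-2 polynomial kernel $K(\vec{v}, \vec{w}) = \langle \vec{v}, \vec{w} \rangle^2$ on $\Bbb{R}^2$, whose explicit lift is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$; the kernel value, computed entirely in $\Bbb{R}^2$, equals the dot product of the lifted vectors in $\Bbb{R}^3$.

```python
import numpy as np

def K(v, w):
    """Degree-2 polynomial kernel, evaluated without leaving R^2."""
    return np.dot(v, w) ** 2

def phi(x):
    """Explicit map R^2 -> R^3 corresponding to K."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

v = np.array([1.0, 2.0])
w = np.array([3.0, -1.0])
print(K(v, w), np.dot(phi(v), phi(w)))   # both print 1.0
```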

Popular Kernel Functions

  • Popular kernels & sklearn library
    • popular kernels: polynomial, radial basis function, and sigmoid kernels
    • sklearn's SVM implementation svm.SVC: kernel parameter - linear, poly, rbf, or sigmoid
    • let $\vec{x_i}, \vec{x_j} \in \Bbb{R}^N$ be rows from dataset $X$
      1. polynomial kernel: $(\gamma \cdot \langle \vec{x_i} , \vec{x_j} \rangle + r)^d$
      2. Radial Basis Function (RBF) Kernel: $\exp\left(-\gamma \cdot \parallel \vec{x_i} - \vec{x_j} \parallel^2\right)$, where $\gamma > 0$
      3. Sigmoid Kernel: $\tanh(\langle \vec{x_i}, \vec{x_j} \rangle + r)$
    • sklearn's svm.SVC uses both the gamma and coef0 parameters for kernel = 'sigmoid' (i.e., $\tanh(\gamma \langle \vec{x_i}, \vec{x_j} \rangle + r)$), even though the definition above only includes $r$
    • choosing the 'correct' kernel is a nontrivial task, and may depend on the specific task at hand
    • tune the kernel parameters to get good performance from the classifier
    • popular parameter-tuning techniques include K-fold cross-validation (see the sketch below)
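
A minimal sketch of tuning the kernel and its parameters with K-fold cross-validation via scikit-learn; the parameter grid is illustrative, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01, 0.1],
}
search = GridSearchCV(SVC(), param_grid, cv=5)   # 5-fold cross-validation over the grid
search.fit(X, y)
print(search.best_params_, search.best_score_)
```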