15

I am trying to understand the intuition behind kernel SVMs. I understand how a linear SVM works: a decision line is chosen that splits the data as well as possible. I also understand the principle behind mapping the data to a higher-dimensional space, and how that can make it easier to find a linear decision boundary in the new space. What I do not understand is how a kernel is used to project the data into this new space.

What I know about a kernel is that it effectively represents the "similarity" between two data points. But how does that relate to the projection?

machine-learning svm kernel-trick

— Karnivaurus

3

If you go into a high enough dimensional space, all the training data points can be perfectly separated by a plane. That does not mean it will have any predictive power whatsoever. I think going into a very high dimensional space is the moral equivalent of (a form of) overfitting.

— Mark L. Stone,

@Mark L. Stone: that is correct (+1), but it could still be a good question to ask how a kernel can map into an infinite-dimensional space. How does that work? I gave it a try, see my answer.

I would be careful about calling the feature mapping a "projection". The feature mapping is generally a nonlinear transformation.

— Paul,

A very useful post on the kernel trick visualizes the kernel's inner product space and describes how high-dimensional feature vectors are used to achieve this; hopefully it answers the question concisely: eric-kim.net/eric-kim-net/posts/1/kernel_trick.html

— JStrahl

6

Let $h(x)$ be the projection to a high-dimensional space $\mathcal{F}$. Basically, the kernel function is $K(x_1,x_2)=\langle h(x_1),h(x_2)\rangle$, which is the inner product. So it is not used to project data points, but is rather an outcome of the projection. It can be considered a measure of similarity, but in an SVM it is more than that.

The optimization for finding the best separating hyperplane in $\mathcal{F}$ involves $h(x)$ only through the inner product form. That is to say, if you know $K(\cdot,\cdot)$, you do not need to know the exact form of $h(x)$, which makes the optimization easier.

Each kernel $K(\cdot,\cdot)$ also has a corresponding $h(x)$. So if you are using an SVM with that kernel, you are implicitly finding the linear decision boundary in the space that $h(x)$ maps into.
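As a small illustration of the identity $K(x_1,x_2)=\langle h(x_1),h(x_2)\rangle$, here is a minimal sketch using the homogeneous quadratic kernel $K(x,z) = (x \cdot z)^2$ on 2-D inputs. The kernel choice, the particular map `h`, and the sample points are assumptions picked for illustration; they do not come from the answer above. The sketch also shows that negating `h` leaves the kernel unchanged, so the map is not unique.

```python
import math

def poly_kernel(x, z):
    """Homogeneous quadratic kernel K(x, z) = (x . z)^2 for 2-D inputs."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def h(x):
    """One explicit feature map for this kernel (an illustrative choice)."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def neg_h(x):
    """The negated map -h, which induces exactly the same kernel."""
    return tuple(-c for c in h(x))

def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical sample points.
x1, x2 = (1.0, 2.0), (3.0, -1.0)

k_direct = poly_kernel(x1, x2)            # computed in the 2-D input space
k_mapped = dot(h(x1), h(x2))              # computed in the 3-D feature space
k_negated = dot(neg_h(x1), neg_h(x2))     # same value using -h

print(k_direct, k_mapped, k_negated)
```

The kernel evaluation never builds the 3-D vectors, which is the point: an optimizer that only needs inner products can work with `poly_kernel` alone.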

Chapter 12 of Elements of Statistical Learning gives a brief introduction to SVMs. It goes into more detail on the connection between kernels and feature mappings: http://statweb.stanford.edu/~tibs/ElemStatLearn/

— Lii

Do you mean that for a kernel $K(x,y)$ there exists a unique underlying $h(x)$?

2

@fcoppens No; for a trivial example, consider $h$ and $-h$. However, there does exist a unique reproducing kernel Hilbert space corresponding to that kernel.

— Dougal,

@Dougal: Then I can agree with you, but the answer above said 'a corresponding $h$', so I wanted to be sure. For the RKHS I see, but do you think it is possible to explain in an 'intuitive' way what this transformation $h$ looks like for a kernel $K(x,y)$?

@fcoppens In general, no; it is hard to find explicit representations of these maps. For some kernels, though, it is either not too hard or has been done before.

— Dougal,

1

@fcoppens you are right, $h(x)$ is not unique. You can easily modify $h(x)$ while keeping the same inner product $\langle h(x), h(x') \rangle$. However, you can think of them as basis functions, and the space they span (i.e. the RKHS) is unique.

— Lii,

4

The useful properties of kernel SVM are not universal; they depend on the choice of kernel. To get intuition it helps to look at one of the most commonly used kernels, the Gaussian kernel. Remarkably, this kernel turns the SVM into something very much like a k-nearest neighbor classifier.

This answer explains the following:

Why perfect separation of positive and negative training data is always possible with a Gaussian kernel of sufficiently small bandwidth (at the cost of overfitting).
How this separation may be interpreted as linear in a feature space.
How the kernel is used to construct the mapping from data space to feature space. Spoiler: the feature space is a mathematically abstract object, with an unusual abstract inner product based on the kernel.

1. Achieving perfect separation

Perfect separation is always possible with a Gaussian kernel because of the kernel's locality properties, which lead to an arbitrarily flexible decision boundary. For a sufficiently small kernel bandwidth, the decision boundary will look as if you just drew little circles around the points wherever they are needed to separate the positive and negative examples:

(Credit: Andrew Ng's online machine learning course.)

So, why does this happen from a mathematical perspective?

Consider the standard setup: you have a Gaussian kernel $K(\mathbf{x},\mathbf{z}) = \exp(- ||\mathbf{x}-\mathbf{z}||^2 / \sigma^2)$ and training data $(\mathbf{x}^{(1)},y^{(1)}), (\mathbf{x}^{(2)},y^{(2)}), \ldots, (\mathbf{x}^{(n)},y^{(n)})$ where the $y^{(i)}$ values are $\pm 1$. We want to learn a classifier function

$\hat{y}(\mathbf{x}) = \sum_i w_i y^{(i)} K(\mathbf{x}^{(i)},\mathbf{x})$

Now how will we ever assign the weights $w_i$? Do we need infinite-dimensional spaces and a quadratic programming algorithm? No, because I just want to show that I can separate the points perfectly. So I make $\sigma$ a billion times smaller than the smallest separation $||\mathbf{x}^{(i)} - \mathbf{x}^{(j)}||$ between any two training examples, and I just set $w_i = 1$. This means that all the training points are a billion sigmas apart as far as the kernel is concerned, and each point completely controls the sign of $\hat{y}$ in its neighborhood. Formally, we have

$\hat{y}(\mathbf{x}^{(k)}) = \sum_{i=1}^n y^{(i)} K(\mathbf{x}^{(i)},\mathbf{x}^{(k)}) = y^{(k)} K(\mathbf{x}^{(k)},\mathbf{x}^{(k)}) + \sum_{i \neq k} y^{(i)} K(\mathbf{x}^{(i)},\mathbf{x}^{(k)}) = y^{(k)} + \epsilon$

where $\epsilon$ is some arbitrarily tiny value. We know $\epsilon$ is tiny because $\mathbf{x}^{(k)}$ is a billion sigmas away from any other point, so for all $i \neq k$ we have

$K(\mathbf{x}^{(i)},\mathbf{x}^{(k)}) = \exp(- ||\mathbf{x}^{(i)} - \mathbf{x}^{(k)}||^2 / \sigma^2) \approx 0.$

Since $\epsilon$ is so small, $\hat{y}(\mathbf{x}^{(k)})$ definitely has the same sign as $y^{(k)}$ , and the classifier achieves perfect accuracy on the training data. In practice this would be terribly overfitting but it shows the tremendous flexibility of the Gaussian kernel SVM, and how it can act very similar to a nearest neighbor classifier.
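The argument of part 1 can be checked numerically. The following is a minimal sketch, assuming a small hypothetical 2-D training set and a bandwidth ten times smaller than the smallest pairwise distance (the text uses a billion; ten is already enough for this toy data). All weights are set to $w_i = 1$, so no optimization is involved.

```python
import math

def gaussian_kernel(x, z, sigma):
    """K(x, z) = exp(-||x - z||^2 / sigma^2), as defined in the answer."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / sigma ** 2)

def y_hat(x, X, y, sigma):
    """Classifier function with every weight w_i fixed to 1."""
    return sum(yi * gaussian_kernel(xi, x, sigma) for xi, yi in zip(X, y))

# Hypothetical training set: XOR-like corners plus a centre point,
# which no single line in the plane can separate.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
y = [1, -1, -1, 1, -1]

# Smallest distance between any two distinct training points.
min_dist = min(math.dist(a, b) for i, a in enumerate(X) for b in X[i + 1:])
sigma = min_dist / 10.0

scores = [y_hat(xk, X, y, sigma) for xk in X]
train_correct = all(s * yk > 0 for s, yk in zip(scores, y))

print(scores)
print(train_correct)
```

Each score lands within a negligible $\epsilon$ of its own label, because every cross term $K(\mathbf{x}^{(i)},\mathbf{x}^{(k)})$ with $i \neq k$ is vanishingly small at this bandwidth.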

2. Kernel SVM learning as linear separation

The fact that this can be interpreted as "perfect linear separation in an infinite dimensional feature space" comes from the kernel trick, which allows you to interpret the kernel as an abstract inner product in some new feature space:

$K(\mathbf{x}^{(i)},\mathbf{x}^{(j)}) = \langle\Phi(\mathbf{x}^{(i)}),\Phi(\mathbf{x}^{(j)})\rangle$

where $\Phi(\mathbf{x})$ is the mapping from the data space into the feature space. It follows immediately that the $\hat{y}(\mathbf{x})$ function can be written as a linear function in the feature space:

$\hat{y}(\mathbf{x}) = \sum_i w_i y^{(i)} \langle\Phi(\mathbf{x}^{(i)}),\Phi(\mathbf{x})\rangle = L(\Phi(\mathbf{x}))$

where the linear function $L(\mathbf{v})$ is defined on feature space vectors $\mathbf{v}$ as

$L(\mathbf{v}) = \sum_i w_i y^{(i)} \langle\Phi(\mathbf{x}^{(i)}),\mathbf{v}\rangle$

This function is linear in $\mathbf{v}$ because it's just a linear combination of inner products with fixed vectors. In the feature space, the decision boundary $\hat{y}(\mathbf{x}) = 0$ is just $L(\mathbf{v}) = 0$ , the level set of a linear function. This is the very definition of a hyperplane in the feature space.

3. How the kernel is used to construct the feature space

Kernel methods never actually "find" or "compute" the feature space or the mapping $\Phi$ explicitly. Kernel learning methods such as SVM do not need them to work; they only need the kernel function $K$ . It is possible to write down a formula for $\Phi$ but the feature space it maps to is quite abstract and is only really used for proving theoretical results about SVM. If you're still interested, here's how it works.

Basically we define an abstract vector space $V$ where each vector is a function from $\mathcal{X}$ to $\mathbb{R}$ . A vector $f$ in $V$ is a function formed from a finite linear combination of kernel slices:

$f(\mathbf{x}) = \sum_{i=1}^n \alpha_i K(\mathbf{x}^{(i)},\mathbf{x})$

(Here the $\mathbf{x}^{(i)}$ are just an arbitrary set of points and need not be the same as the training set.) It is convenient to write $f$ more compactly as

$f = \sum_{i=1}^n \alpha_i K_{\mathbf{x}^{(i)}}$

where $K_\mathbf{x}(\mathbf{y}) = K(\mathbf{x},\mathbf{y})$ is a function giving a "slice" of the kernel at $\mathbf{x}$.

The inner product on the space is not the ordinary dot product, but an abstract inner product based on the kernel:

$\langle \sum_{i=1}^n \alpha_i K_{\mathbf{x}^{(i)}}, \sum_{j=1}^n \beta_j K_{\mathbf{x}^{(j)}} \rangle = \sum_{i,j} \alpha_i \beta_j K(\mathbf{x}^{(i)},\mathbf{x}^{(j)})$

This definition is very deliberate: its construction ensures the identity we need for linear separation, $\langle \Phi(\mathbf{x}), \Phi(\mathbf{y}) \rangle = K(\mathbf{x},\mathbf{y})$ .

With the feature space defined in this way, $\Phi$ is a mapping $\mathcal{X} \rightarrow V$ , taking each point $\mathbf{x}$ to the "kernel slice" at that point:

$\Phi(\mathbf{x}) = K_\mathbf{x}, \quad \text{where} \quad K_\mathbf{x}(\mathbf{y}) = K(\mathbf{x},\mathbf{y}).$

You can prove that $V$ is an inner product space when $K$ is a positive definite kernel. See this paper for details.
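The construction in part 3 can be made concrete with a few lines of code. This is a minimal sketch, assuming scalar inputs and a Gaussian kernel with $\sigma = 1$; the representation of a vector of $V$ as a list of `(coefficient, point)` pairs and the names `evaluate`, `inner` and `Phi` are illustrative choices, not part of the answer. It checks the identity $\langle \Phi(\mathbf{x}), \Phi(\mathbf{y}) \rangle = K(\mathbf{x},\mathbf{y})$ and, as a direct consequence of the same definition, that taking the inner product of $f$ with a kernel slice evaluates $f$ at that point.

```python
import math

def K(x, z, sigma=1.0):
    """Gaussian kernel on scalar inputs (sigma = 1 is an assumption)."""
    return math.exp(-(x - z) ** 2 / sigma ** 2)

def evaluate(f, x):
    """Evaluate f = sum_i alpha_i K_{x_i} at the point x."""
    return sum(a * K(p, x) for a, p in f)

def inner(f, g):
    """Kernel-based inner product: sum_{i,j} alpha_i beta_j K(x_i, x_j)."""
    return sum(a * b * K(p, q) for a, p in f for b, q in g)

def Phi(x):
    """Feature map: send x to its kernel slice K_x (coefficient 1 at x)."""
    return [(1.0, x)]

# A hypothetical vector of V built from three kernel slices.
f = [(0.5, -1.0), (-2.0, 0.3), (1.5, 2.0)]
x, z = 0.7, -0.4

kernel_identity_gap = abs(inner(Phi(x), Phi(z)) - K(x, z))
evaluation_gap = abs(inner(Phi(x), f) - evaluate(f, x))
squared_norm = inner(f, f)

print(kernel_identity_gap, evaluation_gap, squared_norm)
```

Nothing here stores an infinite-dimensional vector: a function in $V$ is fully described by its finite list of coefficients and centre points, and every inner product reduces to kernel evaluations.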

— Paul

Great explanation, but I think you have missed a minus in the definition of the Gaussian kernel: $K(\mathbf{x},\mathbf{z})=\exp(-||\mathbf{x}-\mathbf{z}||^2/\sigma^2)$. As it is written, it does not make sense with the $\epsilon$ found in part (1).

— hqxortn

1

For the background and the notations I refer to How to calculate decision boundary from support vectors?.

So the features in the 'original' space are the vectors $x_i$ , the binary outcome $y_i \in \{-1, +1\}$ and the Lagrange multipliers are $\alpha_i$ .

As said by @Lii (+1), the kernel can be written as $K(x,y)=h(x) \cdot h(y)$ (where '$\cdot$' represents the inner product).

I will try to give some 'intuitive' explanation of what this $h$ looks like, so this answer is no formal proof, it just wants to give some feeling of how I think that this works. Do not hesitate to correct me if I am wrong.

I have to 'transform' my feature space (so my $x_i$ ) into some 'new' feature space in which the linear separation will be solved.

For each observation $x_i$, I define functions $\phi_i(x)=K(x_i,x)$, so I have a function $\phi_i$ for each element of my training sample. These functions $\phi_i$ span a vector space; denote it $V=span(\phi_{i, i=1,2,\dots N})$.

I will try to argue that $V$ is the vector space in which linear separation will be possible. By definition of the span, each vector in the vector space $V$ can be written as a linear combination of the $\phi_i$, i.e. $\sum_{i=1}^N \gamma_i \phi_i$, where the $\gamma_i$ are real numbers.

$N$ is the size of the training sample, and therefore the dimension of the vector space $V$ can go up to $N$, depending on whether the $\phi_i$ are linearly independent. As $\phi_i(x)=K(x_i,x)$ (see above, we defined $\phi_i$ in this way), this means that the dimension of $V$ depends on the kernel used and can go up to the size of the training sample.

The transformation, that maps my original feature space to $V$ is defined as

$\Phi: x_i \to \phi_i(x)=K(x_i, x)$ .

This map $\Phi$ maps my original feature space onto a vector space that can have a dimension that goes up to the size of my training sample.

Obviously, this transformation (a) depends on the kernel, (b) depends on the values $x_i$ in the training sample, (c) can, depending on my kernel, have a dimension that goes up to the size of my training sample, and (d) the vectors of $V$ look like $\sum_{i=1}^N \gamma_i \phi_i$, where the $\gamma_i$ are real numbers.

Looking at the function $f(x)$ in How to calculate decision boundary from support vectors? it can be seen that $f(x)=\sum_i y_i \alpha_i \phi_i(x)+b$ .

In other words, $f(x)$ is a linear combination of the $\phi_i$ and this is a linear separator in the V-space : it is a particular choice of the $\gamma_i$ namely $\gamma_i=\alpha_i y_i$ !

The $y_i$ are known from our observations, and the $\alpha_i$ are the Lagrange multipliers that the SVM has found. In other words, the SVM finds, through the use of a kernel and by solving a quadratic programming problem, a linear separation in the $V$-space.

This is my intuitive understanding of how the 'kernel trick' allows one to 'implicitly' transform the original feature space into a new feature space $V$ , with a different dimension. This dimension depends on the kernel you use and for the RBF kernel this dimension can go up to the size of the training sample.
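To make the role of the $\gamma_i$ tangible, here is a minimal sketch that takes a slightly different but related view: it treats the values $(\phi_1(x), \ldots, \phi_N(x))$ as $N$ coordinates of a point $x$, so that $f(x)=\sum_i \gamma_i \phi_i(x) + b$ is an ordinary linear function of those coordinates. The 1-D data set, the RBF bandwidth, and the multipliers `alpha` are hypothetical values chosen by hand; a real SVM would obtain the $\alpha_i$ and $b$ by solving its quadratic program.

```python
import math

def rbf(x, z, sigma=0.5):
    """RBF kernel on scalars; sigma = 0.5 is an assumed bandwidth."""
    return math.exp(-(x - z) ** 2 / sigma ** 2)

# Hypothetical 1-D sample: the outer points are +1, the inner points -1,
# so no single threshold on x separates the classes.
X_train = [-2.0, -1.0, 1.0, 2.0]
y_train = [1, -1, -1, 1]

def kernel_coordinates(x):
    """Coordinates (phi_1(x), ..., phi_N(x)) with phi_i(x) = K(x_i, x)."""
    return [rbf(xi, x) for xi in X_train]

# Hand-picked stand-ins for the Lagrange multipliers (not from a solver).
alpha = [1.0, 1.0, 1.0, 1.0]
gamma = [a * yi for a, yi in zip(alpha, y_train)]   # gamma_i = alpha_i * y_i
b = 0.0

def f(x):
    """Linear function of the kernel coordinates: sum_i gamma_i phi_i(x) + b."""
    return sum(g * p for g, p in zip(gamma, kernel_coordinates(x))) + b

predictions = [1 if f(x) > 0 else -1 for x in X_train]
print(predictions)
```

With four training points there are four coordinates, which matches the remark that the dimension can go up to the size of the training sample.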

So kernels are a technique that allows the SVM to transform your feature space; see also What makes the Gaussian kernel so magical for PCA, and also in general?

— Community

"for each element of my training sample" -- is element here referring to a row or column (i.e. feature )

— user1761806

what is x and x_i? If my X is an input of 5 columns, and 100 rows, what would x and x_i be?

— user1761806

@user1761806 an element is a row. The notation is explained in the link at the beginning of the answer

1

It transforms the predictors (input data) into a high-dimensional feature space. It is enough to specify the kernel for this step, and the data is never explicitly transformed into the feature space. This process is commonly known as the kernel trick.

Let me explain. The kernel trick is the key here. Consider the case of a radial basis function (RBF) kernel. It transforms the input into an infinite-dimensional space. The transformation of the input $x$ to $\phi(x)$ can be represented as shown in http://www.csie.ntu.edu.tw/~cjlin/talks/kuleuven_svm.pdf

The input space is finite-dimensional, but the transformed space is infinite-dimensional. Transforming the input into an infinite-dimensional space is something that happens as a result of the kernel trick. Here $x$ is the input and $\phi$ is the transformed input. But $\phi$ is not computed as such; instead the product $\phi(x_i)^T\phi(x)$ is computed, which is just the exponential of the negative scaled squared distance between $x_i$ and $x$.
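For scalar inputs the infinite-dimensional map can be written down and truncated. The sketch below assumes the parameterization $K(x,z)=\exp(-\gamma (x-z)^2)$ and uses the Taylor expansion of $\exp(2\gamma x z)$, which gives components $\phi_k(x) = e^{-\gamma x^2}\sqrt{(2\gamma)^k/k!}\;x^k$ for $k = 0, 1, 2, \ldots$ The function names, the value $\gamma = 1$, and the sample points are illustrative assumptions.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel on scalars, K(x, z) = exp(-gamma * (x - z)^2)."""
    return math.exp(-gamma * (x - z) ** 2)

def truncated_phi(x, n_terms, gamma=1.0):
    """First n_terms components of the infinite RBF feature map."""
    scale = math.exp(-gamma * x * x)
    return [
        scale * math.sqrt((2 * gamma) ** k / math.factorial(k)) * x ** k
        for k in range(n_terms)
    ]

def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical sample points.
x, z = 0.8, -0.5
exact = rbf_kernel(x, z)

# Inner products of truncated feature vectors approach the kernel value.
approx = {n: dot(truncated_phi(x, n), truncated_phi(z, n)) for n in (2, 5, 20)}
errors = {n: abs(value - exact) for n, value in approx.items()}

print(exact)
print(errors)
```

The error shrinks as more components are kept, yet the kernel itself delivers the limiting value in a single evaluation, which is why $\phi$ never needs to be computed.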

There is a related question Feature map for the Gaussian kernel to which there is a nice answer /stats//a/69767/86202.

The output or decision function is a function of the kernel matrix $K(x_i,x)=\phi(x_i)^T\phi(x)$ and not of the input $x$ or transformed input $\phi$ directly.

— prashanth

0

Mapping to a higher dimension is merely a trick for solving a problem that is defined in the original dimension; so concerns such as overfitting your data by going into a dimension with too many degrees of freedom are not a byproduct of the mapping process, but are inherent in your problem definition.

Basically, all that mapping does is convert conditional classification in the original dimension into a plane definition in the higher dimension, and because there is a 1 to 1 relationship between the plane in the higher dimension and your conditions in the lower dimension, you can always move between the two.

Taking the problem of overfitting, clearly, you can overfit any set of observations by defining enough conditions to isolate each observation into its own class, which is equivalent to mapping your data to (n-1)D where n is the number of your observations.

Taking the simplest problem, where your observations are [[1, -1], [0,0], [1,1]] [[feature, value]], by moving into a 2D dimension and separating your data with a line, you are simply turning the conditional classification of feature < 1 && feature > -1 : 0 into defining a line that passes through (-1 + epsilon, 1 - epsilon). If you had more data points and needed more conditions, you would just need to add one more degree of freedom to your higher dimension for each new condition that you define.

You can replace the process of mapping to a higher dimension with any process that provides you with a 1 to 1 relationship between the conditions and the degrees of freedom of your new problem. Kernel tricks simply do that.

— Hou

1

As a different example, take the problem where the phenomenon results in observations of the form of [x, floor(sin(x))]. Mapping your problem into a 2D dimension is not helpful here at all; in fact, mapping to any plane will not be helpful here, which is because defining the problem as a set of x < a && x > b : z is not helpful in this case. The simplest mapping in this case is mapping into a polar coordinate, or into the imaginary plane.

— Hou

Kernel SVM: I want an intuitive understanding of mapping to a higher-dimensional feature space, and how this makes linear separation possible
