Chapter | 2 Statistical Machine Learning for HCI
13
shows how both expected and empirical risks behave for a fixed
number of training examples n, when varying the capacity h
F
. Hence,
selecting the optimal set of function is a trade-off between having an
expressive enough set of functions (large h
F
) such that it obtains a
small empirical risk, and a constrained enough set of functions (small
h
F
) such that it obtains a small expected risk.
In practice, most if not all machine learning models proposed in
the literature have at least one knob that can be used to control the
capacity, without explicitly computing it. For instance, the capacity
of multi-layer perceptrons can be controlled by the number of hidden
units and hidden layers; for SVMs, the capacity is controlled by the
choice of the kernel; in HMMs, it is controlled by the number of states
and the capacity of each emission distribution model; etc.
2.3 SUPPORT VECTOR MACHINES FOR BINARY
CLASSIFICATION
The most well-known machine learning approach is the Support Vec-
tor Machine (SVM) [1]. It has been widely used in the last decade in
various applications. Although its main successes are for classifica-
tion problems [2, 3], extensions of the main approach to regression,
density estimation, ranking and many others have been proposed.
We describe in the following the simplest two-class classification
framework.
Let us assume we are given a training set of n examples T
train
=
{(x
i
, y
i
)}
n
i=1
where x
i
R
d
is a d-dimensional input vector and y
i
{−1, 1}is the target class. The simplest binary classifier one can think
of is the linear classifier, where we are looking for parameters (w
R
d
, b R) such that
ˆy(x) =sign(w ·x +b). (2.3)
When the training set is said to be linearly separable, there is poten-
tially an infinite number of solutions (w
R
d
, b R) that satisfy (2.3).
Hence, the SVM approach looks for the one that maximises the mar-
gin between the two classes, where the margin can be defined as the
sum of the smallest distances between the separating hyper-plane and
points of each class. This concept is illustrated in Figure 2.3.
14
PART|I Signal Processing, Modelling and Related Mathematical Tools
MarginMargin
FIGURE 2.3 Illustration of the notion of margin.
It can be shown that maximising the margin is equivalent to mini-
mising the norm of w, hence it can be expressed by the following
optimisation problem:
min
w,b
1
2
w
2
(2.4)
subject to iy
i
(w ·x
i
+b) 1
where the constraints express the classification problem to solve.
To solve this problem, we usually start by writing the Lagrangian
of (2.4), which introduces new variables α
i
for each of the constraints
in (2.4). Then instead of solving directly the problem, we solve its
Chapter | 2 Statistical Machine Learning for HCI
15
dual formulation, that computationally is more efficient:
max
α
n
i=1
α
i
1
2
n
i=1
n
j=1
y
i
y
j
α
i
α
j
x
i
·x
j
(2.5)
subject to
i α
i
0
n
i=1
α
i
y
i
=0.
One problem with this formulation is that if the problem is not linearly
separable, there might be no solution to it. Hence, one can relax the
constraints by allowing errors with an additional hyper-parameter
C that controls the trade-off between maximising the margin and
minimising the number of training errors [4] as follows:
min
w,b
1
2
w
2
+C
i
ξ
i
(2.6)
subject to
iy
i
(w ·x
i
+b) 1ξ
i
i ξ
i
0
which dual becomes
max
α
n
i=1
α
i
1
2
n
i=1
n
j=1
y
i
y
j
α
i
α
j
x
i
·x
j
(2.7)
subject to
i 0 α
i
C
n
i=1
α
i
y
i
=0.
To look for non-linear solutions, one can easily replace x by some non-
linear function φ(x). It is interesting to note that x only appears in dot
products in (2.7). It has thus been proposed to replace all occurrences
of φ(x
i
) ·φ(x
j
) by some kernel function k(x
i
, x
j
). As long as k(·, ·)
lives in a reproducing kernel Hilbert space (RKHS), one can guarantee
that there exists some function φ(·) such that
k(x
i
, x
j
) =φ(x
i
) ·φ(x
j
).
Thus, even if φ(x) projects x in a very high (possibly infinite) dimen-
sional space, k(x
i
, x
j
) can still be efficiently computed. Kernelised

Get Multi-Modal Signal Processing now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.