shows how both expected and empirical risks behave for a fixed number of training examples $n$ when varying the capacity $h_{\mathcal{F}}$. Hence, selecting the optimal set of functions is a trade-off between having an expressive enough set of functions (large $h_{\mathcal{F}}$), such that it obtains a small empirical risk, and a constrained enough set of functions (small $h_{\mathcal{F}}$), such that it obtains a small expected risk.

In practice, most if not all machine learning models proposed in the literature have at least one knob that can be used to control the capacity, without explicitly computing it. For instance, the capacity of multi-layer perceptrons can be controlled by the number of hidden units and hidden layers; for SVMs, the capacity is controlled by the choice of the kernel; in HMMs, it is controlled by the number of states and the capacity of each emission distribution model; etc.
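To make this concrete, the following minimal sketch (assuming the scikit-learn library, which is not discussed in this chapter; the specific estimators and parameter values are illustrative only) shows how such capacity knobs typically appear as hyper-parameters of off-the-shelf implementations:

```python
# A minimal sketch, assuming scikit-learn is available; the parameter
# values below are illustrative, not prescribed by the chapter.
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Capacity of a multi-layer perceptron: number and size of hidden layers.
small_mlp = MLPClassifier(hidden_layer_sizes=(5,))        # low capacity
large_mlp = MLPClassifier(hidden_layer_sizes=(100, 100))  # high capacity

# Capacity of an SVM: choice of kernel (and its parameters).
linear_svm = SVC(kernel="linear")        # more constrained
rbf_svm = SVC(kernel="rbf", gamma=10.0)  # more expressive
```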

2.3 SUPPORT VECTOR MACHINES FOR BINARY CLASSIFICATION

The most well-known machine learning approach is the Support Vector Machine (SVM) [1]. It has been widely used in the last decade in various applications. Although its main successes are in classification problems [2, 3], extensions of the main approach to regression, density estimation, ranking and many other tasks have been proposed. In the following, we describe the simplest two-class classification framework.

Let us assume we are given a training set of $n$ examples $\mathcal{T}_{\mathrm{train}} = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ is a $d$-dimensional input vector and $y_i \in \{-1, 1\}$ is the target class. The simplest binary classifier one can think of is the linear classifier, where we are looking for parameters $(w \in \mathbb{R}^d, b \in \mathbb{R})$ such that

\[
\hat{y}(x) = \operatorname{sign}(w \cdot x + b). \tag{2.3}
\]
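As a concrete illustration (a minimal sketch; the parameter values below are assumptions for the example, not taken from the text), the decision rule (2.3) is a single dot product followed by a sign:

```python
# A minimal sketch of the linear decision rule (2.3); the parameter
# values are illustrative only.
import numpy as np

w = np.array([1.0, -2.0])   # weight vector in R^d (here d = 2)
b = 0.5                     # bias term

def predict(x):
    """Return the predicted class in {-1, +1} for input vector x."""
    return 1 if np.dot(w, x) + b >= 0 else -1

print(predict(np.array([3.0, 0.0])))   # +1
print(predict(np.array([0.0, 3.0])))   # -1
```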

When the training set is linearly separable, there is potentially an infinite number of solutions $(w \in \mathbb{R}^d, b \in \mathbb{R})$ that satisfy (2.3). Hence, the SVM approach looks for the one that maximises the margin between the two classes, where the margin can be defined as the sum of the smallest distances between the separating hyper-plane and the points of each class. This concept is illustrated in Figure 2.3.

FIGURE 2.3 Illustration of the notion of margin.

It can be shown that maximising the margin is equivalent to minimising the norm of $w$; hence it can be expressed by the following optimisation problem:

\[
\min_{w,b} \; \frac{1}{2} \|w\|^2 \tag{2.4}
\]
\[
\text{subject to} \quad \forall i \;\; y_i (w \cdot x_i + b) \ge 1
\]

where the constraints express the classification problem to solve.
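To see why (a short derivation sketch, not spelled out in the original text): the distance of any point $x_i$ to the hyper-plane $w \cdot x + b = 0$ is $|w \cdot x_i + b| / \|w\|$, and under the constraints of (2.4) the closest points of each class satisfy $y_i(w \cdot x_i + b) = 1$, so the margin is

\[
\frac{1}{\|w\|} + \frac{1}{\|w\|} = \frac{2}{\|w\|},
\]

and maximising it is equivalent to minimising $\frac{1}{2}\|w\|^2$.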

To solve this problem, we usually start by writing the Lagrangian of (2.4), which introduces a new variable $\alpha_i$ for each of the constraints in (2.4). Then, instead of solving the problem directly, we solve its dual formulation, which is computationally more efficient:

\[
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \alpha_i \alpha_j \, x_i \cdot x_j \tag{2.5}
\]
\[
\text{subject to} \quad
\begin{cases}
\forall i \;\; \alpha_i \ge 0 \\[4pt]
\displaystyle\sum_{i=1}^{n} \alpha_i y_i = 0.
\end{cases}
\]
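As a sketch of how (2.5) is obtained (the intermediate steps are not given in the original text, but follow the standard derivation): the Lagrangian of (2.4) is

\[
L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \bigl[ y_i (w \cdot x_i + b) - 1 \bigr], \qquad \alpha_i \ge 0,
\]

and setting its derivatives with respect to $w$ and $b$ to zero yields $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$; substituting these back into $L$ gives the dual objective (2.5).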

One problem with this formulation is that if the problem is not linearly separable, there might be no solution to it. Hence, one can relax the constraints by allowing errors, with an additional hyper-parameter $C$ that controls the trade-off between maximising the margin and minimising the number of training errors [4], as follows:

\[
\min_{w,b,\xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i} \xi_i \tag{2.6}
\]
\[
\text{subject to} \quad
\begin{cases}
\forall i \;\; y_i (w \cdot x_i + b) \ge 1 - \xi_i \\
\forall i \;\; \xi_i \ge 0
\end{cases}
\]
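Equivalently (a reformulation sketch that the original text does not state explicitly): at the optimum each slack variable takes the value $\xi_i = \max(0,\, 1 - y_i(w \cdot x_i + b))$, so (2.6) can also be read as minimising a regularised hinge loss,

\[
\min_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_{i} \max\bigl(0,\, 1 - y_i (w \cdot x_i + b)\bigr).
\]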

The dual of (2.6) then becomes

\[
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \alpha_i \alpha_j \, x_i \cdot x_j \tag{2.7}
\]
\[
\text{subject to} \quad
\begin{cases}
\forall i \;\; 0 \le \alpha_i \le C \\[4pt]
\displaystyle\sum_{i=1}^{n} \alpha_i y_i = 0.
\end{cases}
\]
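To make the formulation concrete, here is a minimal numerical sketch (assuming NumPy and SciPy are available; this generic solver is for illustration only and is far less efficient than dedicated SVM solvers) that maximises (2.7) directly and recovers $w$ and $b$:

```python
# A minimal sketch: solve the dual (2.7) with a generic optimiser.
# Illustrative only; dedicated SVM solvers are far more efficient.
import numpy as np
from scipy.optimize import minimize

def svm_dual_fit(X, y, C=1.0):
    n = X.shape[0]
    K = (X @ X.T) * np.outer(y, y)      # K[i, j] = y_i y_j (x_i . x_j)

    def neg_dual(alpha):                # negated because SciPy minimises
        return -(alpha.sum() - 0.5 * alpha @ K @ alpha)

    res = minimize(
        neg_dual,
        np.zeros(n),
        method="SLSQP",
        bounds=[(0.0, C)] * n,                               # 0 <= alpha_i <= C
        constraints={"type": "eq", "fun": lambda a: a @ y},  # sum_i alpha_i y_i = 0
    )
    alpha = res.x
    w = X.T @ (alpha * y)                       # w = sum_i alpha_i y_i x_i
    sv = (alpha > 1e-6) & (alpha < C - 1e-6)    # on-margin support vectors
    b = np.mean(y[sv] - X[sv] @ w)              # y_i (w . x_i + b) = 1 on the margin
    return w, b, alpha
```

For a small linearly separable dataset and a large enough $C$, the recovered $(w, b)$ can then be plugged back into the decision rule (2.3).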

To look for non-linear solutions, one can easily replace $x$ by some non-linear function $\phi(x)$. It is interesting to note that $x$ only appears in dot products in (2.7). It has thus been proposed to replace all occurrences of $\phi(x_i) \cdot \phi(x_j)$ by some kernel function $k(x_i, x_j)$. As long as $k(\cdot, \cdot)$ corresponds to a reproducing kernel Hilbert space (RKHS), one can guarantee that there exists some function $\phi(\cdot)$ such that

\[
k(x_i, x_j) = \phi(x_i) \cdot \phi(x_j).
\]
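For instance (an illustrative sketch; this particular kernel and feature map are standard textbook examples rather than part of the original text), the homogeneous polynomial kernel of degree 2 on $\mathbb{R}^2$, $k(u, v) = (u \cdot v)^2$, corresponds to the explicit feature map $\phi(u) = (u_1^2,\ \sqrt{2}\, u_1 u_2,\ u_2^2)$:

```python
# A small numerical check of k(u, v) = phi(u) . phi(v) for the
# degree-2 polynomial kernel on R^2 (illustrative example only).
import numpy as np

def k(u, v):
    return (u @ v) ** 2                # kernel computed directly in R^2

def phi(u):
    return np.array([u[0] ** 2,        # explicit feature map into R^3
                     np.sqrt(2) * u[0] * u[1],
                     u[1] ** 2])

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(k(u, v), phi(u) @ phi(v))        # both print 1.0
```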

Thus, even if $\phi(x)$ projects $x$ into a very high (possibly infinite) dimensional space, $k(x_i, x_j)$ can still be efficiently computed. Kernelised
