﻿Kernel Methods for Deep Learning
Youngmin Cho and Lawrence K. Saul
Department of Computer Science and Engineering
University of California, San Diego
9500 Gilman Drive, Mail Code 0404
La Jolla, CA 92093-0404
{yoc002,saul}@cs.ucsd.edu
Abstract
We introduce a new family of positive-definite kernel functions that mimic the
computation in large, multilayer neural nets. These kernel functions can be used
in shallow architectures, such as support vector machines (SVMs), or in deep
kernel-based architectures that we call multilayer kernel machines (MKMs). We
evaluate SVMs and MKMs with these kernel functions on problems designed to
illustrate the advantages of deep architectures. On several problems, we obtain
better results than previous, leading benchmarks from both SVMs with Gaussian
kernels as well as deep belief nets.
1 Introduction
Recent work in machine learning has highlighted the circumstances that appear to favor deep architectures,
such as multilayer neural nets, over shallow architectures, such as support vector machines
(SVMs) [1]. Deep architectures learn complex mappings by transforming their inputs through multiple
layers of nonlinear processing [2]. Researchers have advanced several motivations for deep
architectures: the wide range of functions that can be parameterized by composing weakly nonlinear
transformations, the appeal of hierarchical distributed representations, and the potential for
combining unsupervised and supervised methods. Experiments have also shown the benefits of
deep learning in several interesting applications [3, 4, 5].
Many issues surround the ongoing debate over deep versus shallow architectures [1, 6]. Deep architectures
are generally more difficult to train than shallow ones. They involve difficult nonlinear
optimizations and many heuristics. The challenges of deep learning explain the early and continued
appeal of SVMs, which learn nonlinear classifiers via the “kernel trick”. Unlike deep architectures,
SVMs are trained by solving a simple problem in quadratic programming. However, SVMs cannot
seemingly benefit from the advantages of deep learning.
Like many, we are intrigued by the successes of deep architectures yet drawn to the elegance of kernel
methods. In this paper, we explore the possibility of deep learning in kernel machines. Though
we share a similar motivation as previous authors [7], our approach is very different. Our paper
makes two main contributions. First, we develop a new family of kernel functions that mimic the
computation in large neural nets. Second, using these kernel functions, we show how to train multilayer
kernel machines (MKMs) that benefit from many advantages of deep learning.
The organization of this paper is as follows. In section 2, we describe a new family of kernel
functions and experiment with their use in SVMs. Our results on SVMs are interesting in their own
right; they also foreshadow certain trends that we observe (and certain choices that we make) for the
MKMs introduced in section 3. In this section, we describe a kernel-based architecture with multiple
layers of nonlinear transformation. The different layers are trained using a simple combination of
supervised and unsupervised methods. Finally, we conclude in section 4 by evaluating the strengths
and weaknesses of our approach.
12 Arc-cosine kernels
In this section, we develop a new family of kernel functions for computing the similarity of vector
inputs x, y ∈ ℜd . As shorthand, let Θ(z) = 1
2
(1 + sign(z)) denote the Heaviside step function. We
define the nth order arc-cosine kernel function via the integral representation:
∫
‖w‖2
e− 2
kn(x, y) = 2 dw
(2π) d/2 Θ(w · x) Θ(w · y) (w · x)n (w · y) n
The integral representation makes it straightforward to show that these kernel functions are positivesemidefinite.
The kernel function in eq. (1) has interesting connections to neural computation [8]
that we explore further in sections 2.2–2.3. However, we begin by elucidating its basic properties.
2.1 Basic properties
We show how to evaluate the integral in eq. (1) analytically in the appendix. The final result is most
easily expressed in terms of the angle θ between the inputs:
θ = cos −1
( )
x · y
. (2)
‖x‖‖y‖
The integral in eq. (1) has a simple, trivial dependence on the magnitudes of the inputs x and y, but
a complex, interesting dependence on the angle between them. In particular, we can write:
kn(x, y) = 1
π ‖x‖n ‖y‖ n Jn(θ) (3)
where all the angular dependence is captured by the family of functions Jn(θ). Evaluating the
integral in the appendix, we show that this angular dependence is given by:
Jn(θ) = (−1) n (sin θ) 2n+1
( ) n ( )
1 ∂ π − θ
. (4)
sin θ ∂θ sin θ
For n = 0, this expression reduces to the supplement of the angle between the inputs. However, for
n>0, the angular dependence is more complicated. The first few expressions are:
J0(θ) = π − θ (5)
J1(θ) = sin θ + (π − θ) cos θ (6)
J2(θ) = 3 sin θ cos θ + (π − θ)(1 + 2 cos 2 θ) (7)
We describe eq. (3) as an arc-cosine kernel because for n = 0, it takes the simple form
k0(x, y) = 1− 1
π cos−1
x·y
‖x‖‖y‖
. In fact, the zeroth and first order kernels in this family are strongly
motivated by previous work in neural computation. We explore these connections in the next section.
Arc-cosine kernels have other intriguing properties. From the magnitude dependence in eq. (3),
we observe the following: (i) the n = 0 arc-cosine kernel maps inputs x to the unit hypersphere
in feature space, with k0(x, x) = 1; (ii) the n = 1 arc-cosine kernel preserves the norm of inputs,
with k1(x, x) = ‖x‖ 2 ; (iii) higher order (n>1) arc-cosine kernels expand the dynamic range of the
inputs, with kn(x, x) ∼ ‖x‖ 2n . Properties (i)–(iii) are shared respectively by radial basis function
(RBF), linear, and polynomial kernels. Interestingly, though, the n = 1 arc-cosine kernel is highly
nonlinear, also satisfying k1(x, −x) = 0 for all inputs x. As a practical matter, we note that arccosine
kernels do not have any continuous tuning parameters (such as the kernel width in RBF
kernels), which can be laborious to set by cross-validation.
2.2 Computation in single-layer threshold networks
Consider the single-layer network shown in Fig. 1 (left) whose weights Wij connect the jth input
unit to the ith output unit. The network maps inputs x to outputs f(x) by applying an elementwise
nonlinearity to the matrix-vector product of the inputs and the weight matrix: f(x) = g(Wx). The
nonlinearity is described by the network’s so-called activation function. Here we consider the family
of one-sided polynomial activation functions gn(z) = Θ(z)z n illustrated in the right panel of Fig. 1.
(1)
2f 1
. . .
f 2 f 3 f i
. . . fm Step (n=0)
Ramp (n=1)
Quarter−pipe (n=2)
W
1
0.5
1
0.5
1
0.5
. . .
x 1 x 2 x j
. . . xd 0
−1 0 1
0
−1 0 1
0
−1 0 1
Figure 1: Single layer network and activation functions
For n = 0, the activation function is a step function, and the network is an array of perceptrons. For
n = 1, the activation function is a ramp function (or rectification nonlinearity [9]), and the mapping
f(x) is piecewise linear. More generally, the nonlinear (non-polynomial) behavior of these networks
is induced by thresholding on weighted sums. We refer to networks with these activation functions
as single-layer threshold networks of degree n.
Computation in these networks is closely connected to computation with the arc-cosine kernel function
in eq. (1). To see the connection, consider how inner products are transformed by the mapping
in single-layer threshold networks. As notation, let the vector wi denote ith row of the weight
matrix W. Then we can express the inner product between different outputs of the network as:
f(x) · f(y) =
m∑
Θ(wi · x)Θ(wi · y)(wi · x) n (wi · y) n , (8)
i=1
where m is the number of output units. The connection with the arc-cosine kernel function emerges
in the limit of very large networks [10, 8]. Imagine that the network has an infinite number of
output units, and that the weights Wij are Gaussian distributed with zero mean and unit variance.
In this limit, we see that eq. (8) reduces to eq. (1) up to a trivial multiplicative factor:
limm→∞ 2
mf(x) · f(y) = kn(x, y). Thus the arc-cosine kernel function in eq. (1) can be viewed
as the inner product between feature vectors derived from the mapping of an infinite single-layer
threshold network [8].
Many researchers have noted the general connection between kernel machines and neural networks
with one layer of hidden units [1]. The n = 0 arc-cosine kernel in eq. (1) can also be derived from
an earlier result obtained in the context of Gaussian processes [8]. However, we are unaware of any
previous theoretical or empirical work on the general family of these kernels for degrees n≥0.
Arc-cosine kernels differ from polynomial and RBF kernels in one especially interesting respect.
As highlighted by the integral representation in eq. (1), arc-cosine kernels induce feature spaces
that mimic the sparse, nonnegative, distributed representations of single-layer threshold networks.
Polynomial and RBF kernels do not encode their inputs in this way. In particular, the feature vector
induced by polynomial kernels is neither sparse nor nonnegative, while the feature vector induced
by RBF kernels resembles the localized output of a soft vector quantizer. Further implications of
this difference are explored in the next section.
2.3 Computation in multilayer threshold networks
A kernel function can be viewed as inducing a nonlinear mapping from inputs x to feature
vectors Φ(x). The kernel computes the inner product in the induced feature space:
k(x, y) = Φ(x)·Φ(y). In this section, we consider how to compose the nonlinear mappings induced
by kernel functions. Specifically, we show how to derive new kernel functions
k (ℓ) (x, y) = Φ(Φ(...Φ(x))) · Φ(Φ(...Φ(y))) (9)
} {{ } } {{ }
ℓ times
ℓ times
which compute the inner product after ℓ successive applications of the nonlinear mapping Φ(·). Our
motivation is the following: intuitively, if the base kernel function k(x, y) = Φ(x) · Φ(y) mimics
the computation in a single-layer network, then the iterated mapping in eq. (9) should mimic the
computation in a multilayer network.
3test set. SVMs with arc cosine kernels have error rates from 22.36–25.64%. Results are s
kernels of varying degree (n) and levels of recursion (ℓ). The best previous results are 24
SVMs with RBF kernels and 22.50% for deep belief nets [2]. See text for details.
26
24
22
Test error rate (%)
21
20
19
18
17
1 2 3 4 5 6
Step (n=0)
Test error rate (%)
1 2 3 4 5 6
SVM−RBF Ramp (n=1)
1 2 3 4 5 6
Quarter!pipe (n=2)
Figure 3: Left: examples from the convex data set. Right: classification error rates on th
DBN−3
SVMs with arc cosine kernels have error rates from 17.15–20.51%. Results are shown fo
1 2 3 4 5 of 6 varying 1 2 degree 3 4 (n) 5 6and levels 1 2 of3 recursion 4 5 6 (ℓ). The best previous results are 19.13% f
Step (n=0) with RBF kernels Ramp (n=1) and 18.63% Quarter−pipe for deep belief (n=2) nets [2]. See text for details.
Figure 2: Left: examples from the rectangles-image data set. Right: classification error rates on the
test set. SVMs with arc-cosine kernels have error rates from 22.36–25.64%. Results are shown for
kernels of varying degree (n) and levels of 2000 recursion training (ℓ). examples The best asprevious a validation results set to arechoose 24.04% thefor
margin penalty parameter; after
SVMs with RBF kernels and 22.50% for deep thisbelief parameter nets by [11]. cross-validation, See text for details. we then retrained each SVM using all the training exam
reference, we also report the best results obtained previously from three layer deep belief ne
3) and SVMs with RBF kernels (SVM-RBF). These references are representative of th
state-of-the-art for deep and shallow architectures on these data sets.
We first examine the results of this procedure for widely used kernels. Here we find that the iterated
mapping in eq. (9) does not yield particularly The interesting right panels results. of figures Consider 2 andthe 3 show two-fold the test composition set error rates from arc cosine kernels o
that maps x to Φ(Φ(x)). For linear kernels degree k(x, y) (n)= and x · levels y, theofcomposition recursion (ℓ). isWe trivial: experimented we obtainwith kernels of degree n = 0
the identity map Φ(Φ(x)) = Φ(x) = x. For corresponding homogeneous to single polynomial layer threshold kernels k(x, networks y) = with (x ·“step”, y) “ramp”, and “quarter-pipe”
functions. We also experimented with the multilayer kernels described in section 2.3, c
from one to six levels of recursion. Overall, the figures show that on these two data se
different arc cosine kernels outperform the best results previously reported for SVMs w
kernels and deep belief nets. We give more details on these experiments below. At a h
though, we note that SVMs with arc cosine kernels are very straightforward to train; unli
with RBF kernels, they do not require tuning a kernel width parameter, and unlike deep b
they do not require solving a difficult nonlinear optimization or searching over possible arch
In our experiments, we quickly discovered that the multilayer kernels only performed w
n = 1 kernels were used at higher (ℓ > 1) levels in the recursion. Figs. 2 and 3 therefore s
these sets of results; in particular, each group of bars shows the test error rates when a
kernel (of degree n = 0, 1, 2) was used at the first layer of nonlinearity, while the n = 1 k
used at successive layers. We do not have a formal explanation for this effect. However, r
only the n = 1 arc cosine kernel preserves the norm of its inputs: the n = 0 kernel maps
onto a unit hypersphere in feature space, while higher-order (n > 1) kernels may induc
spaces with severely distorted dynamic ranges. Therefore, we hypothesize that only n=1 a
kernels preserve sufficient information about the magnitude of their inputs to work effe
composition with other kernels.
Finally, the results on both data sets reveal an interesting trend: the multilayer arc cosin
often perform better than their single layer counterparts. Though SVMs are shallow arch
5
d ,
the composition yields:
Φ(Φ(x)) · Φ(Φ(y)) = (Φ(x) · Φ(y)) d = ((x · y) d ) d = (x · y) d2
. (10)
The above result is not especially interesting: the kernel implied by this composition is also polynomial,
just of higher degree (d2 versus d) than the one from which it was constructed. Likewise, for
RBF kernels k(x, y) = e−λ‖x−y‖2, the composition yields:
Φ(Φ(x)) · Φ(Φ(y)) = e −λ‖Φ(x)−Φ(y)‖2
= e −2λ(1−k(x,y)) . (11)
Though non-trivial, eq. (11) does not represent a particularly interesting computation. Recall that
RBF kernels mimic the computation of soft vector quantizers, with k(x, y) ≪ 1 when ‖x−y‖ is
large compared to the kernel width. It is hard to see how the iterated mapping Φ(Φ(x)) would
generate a qualitatively different representation than the original mapping Φ(x).
Next we consider the ℓ-fold composition in eq. (9) for arc-cosine kernel functions. We state the
result in the form of a recursion. The base case is given by eq. (3) for kernels of depth ℓ = 1 and
degree n. The inductive step is given by:
k (l+1)
n (x, y) = 1
[
k
π
(l)
n (x, x) k (l)
] n/2 (
n (y, y) Jn θ (ℓ)
)
n , (12)
where θ (ℓ)
n is the angle between the images of x and y in the feature space induced by the ℓ-fold
composition. In particular, we can write:
θ (ℓ)
n = cos −1
(
k (ℓ)
[
n (x, y) k (ℓ)
n (x, x) k (ℓ)
] )
−1/2
n (y, y) . (13)
The recursion in eq. (12) is simple to compute in practice. The resulting kernels mimic the computations
in large multilayer threshold networks. Above, for simplicity, we have assumed that the
arc-cosine kernels have the same degree n at every level (or layer) ℓ of the recursion. We can also
use kernels of different degrees at different layers. In the next section, we experiment with SVMs
whose kernel functions are constructed in this way.
2.4 Experiments on binary classification
We evaluated SVMs with arc-cosine kernels on two challenging data sets of 28 × 28 grayscale pixel
images. These data sets were specifically constructed to compare deep architectures and kernel
machines [11]. In the first data set, known as rectangles-image, each image contains an occluding
rectangle, and the task is to determine whether the width of the rectangle exceeds its height; examples
are shown in Fig. 2 (left). In the second data set, known as convex, each image contains a
white region, and the task is to determine whether the white region is convex; examples are shown
4test set. SVMs with arc cosine kernels have error rates from 22.36–25.64%. Results are s
kernels of varying degree (n) and levels of recursion (ℓ). The best previous results are 24
SVMs with RBF kernels and 22.50% for deep belief nets [2]. See text for details.
21
20
19
18
17
Test error rate (%)
21
20
19
18
17
1 2 3 4 5 6
Step (n=0)
Test error rate (%)
SVM−RBF 1 2 3 4 5 6
Ramp (n=1)
DBN−3
1 2 3 4 5 6
Quarter!pipe (n=2)
Figure 3: Left: examples from the convex data set. Right: classification error rates on th
SVMs with arc cosine kernels have error rates from 17.15–20.51%. Results are shown fo
1 2 3 4 5 of 6 varying 1 2 degree 3 4 (n) 5 6and levels 1 2 of3 recursion 4 5 6 (ℓ). The best previous results are 19.13% f
Step (n=0) with RBF kernels Ramp (n=1) and 18.63% Quarter−pipe for deep belief (n=2) nets [2]. See text for details.
Figure 3: Left: examples from the convex data set. Right: classification error rates on the test set.
SVMs with arc-cosine kernels have error rates from 17.15–20.51%. Results are shown for kernels
of varying degree (n) and levels of recursion 2000 (ℓ). training The best examples previous as aresults validation are set 19.13% to choose for SVMs the margin penalty parameter; after
with RBF kernels and 18.63% for deep belief thisnets parameter [11]. See by cross-validation, text for details. we then retrained each SVM using all the training exam
reference, we also report the best results obtained previously from three layer deep belief ne
3) and SVMs with RBF kernels (SVM-RBF). These references are representative of th
state-of-the-art for deep and shallow architectures on these data sets.
in Fig. 3 (left). The rectangles-image data set has 12000 training examples, while the convex data
set has 8000 training examples; both dataThe setsright havepanels 50000 of test figures examples. 2 and 3 show These thedata test setserror have rates from arc cosine kernels o
been extensively benchmarked by previousdegree authors (n)[11]. and levels Our experiments of recursion in (ℓ). binary We experimented classification with kernels of degree n = 0
focused on these data sets because in previously corresponding reported to benchmarks, single layer threshold they exhibited networksthe with biggest “step”, “ramp”, and “quarter-pipe” a
performance gap between deep architectures functions. (e.g., deep Webelief also experimented nets) and traditional with theSVMs.
multilayer kernels described in section 2.3, c
from one to six levels of recursion. Overall, the figures show that on these two data se
We followed the same experimental methodology different as previous arc cosineauthors kernels[11]. outperform SVMs were the best trained results using previously reported for SVMs w
libSVM (version 2.88) [12], a publicly available kernels software and deep package. belief nets. For each We give SVM, more wedetails used the on these last experiments below. At a h
2000 training examples as a validation set to though, choose wethe notemargin that SVMs penalty withparameter; arc cosine kernels after choosing are very straightforward to train; unli
this parameter by cross-validation, we thenwith retrained RBF kernels, each SVM they dousing not require all thetuning training a kernel examples. width parameter, and unlike deep be
For reference, we also report the best results
they
obtained
do not require
previously
solving
from
a difficult
three-layer
nonlinear
deep
optimization
belief nets
or searching over possible arch
(DBN-3) and SVMs with RBF kernels (SVM-RBF). In our experiments, These references we quickly appear discovered to be representative that the multilayer of kernels only performed w
the current state-of-the-art for deep and shallow n = 1architectures kernels were used on these at higher data (ℓ sets. > 1) levels in the recursion. Figs. 2 and 3 therefore s
these sets of results; in particular, each group of bars shows the test error rates when a
Figures 2 and 3 show the test set error rates kernel from arc-cosine (of degree kernels n = 0, 1, of 2) varying was useddegree at the (n) firstand layer levels of nonlinearity, while the n = 1 k
of recursion (ℓ). We experimented with kernels used at ofsuccessive degree n = layers. 0, 1 We anddo 2, not corresponding have a formal toexplanation thresh- for this effect. However, r
old networks with “step”, “ramp”, and “quarter-pipe” only the n = activation 1 arc cosine functions. kernel preserves We also theexperimented
norm of its inputs: the n = 0 kernel maps
with the multilayer kernels described in section onto a2.3, unit composed hypersphere from in feature one to six space, levels while of higher-order recursion. (n > 1) kernels may induc
Overall, the figures show that many SVMsspaces with with arc-cosine severelykernels distorted outperform dynamic ranges. traditional Therefore, SVMs, we hypothesize that only n=1 a
and a certain number also outperform deepkernels belief preserve nets. In sufficient addition to information their solidabout performance, the magnitude we of their inputs to work effe
note that SVMs with arc-cosine kernels arecomposition very straightforward with other kernels. to train; unlike SVMs with RBF
kernels, they do not require tuning a kernel width Finally, parameter, the resultsand on both unlike data deep setsbelief reveal nets, an interesting they do not trend: the multilayer arc cosin
require solving a difficult nonlinear optimization often perform or searching betterover thanpossible their single architectures. layer counterparts. Though SVMs are shallow arch
Our experiments with multilayer kernels revealed that these SVMs only performed well when arccosine
kernels of degree n = 1 were used at higher (ℓ > 1) levels in the recursion. Figs. 5 2 and
3 therefore show only these sets of results; in particular, each group of bars shows the test error
rates when a particular kernel (of degree n = 0, 1, 2) was used at the first layer of nonlinearity,
while the n = 1 kernel was used at successive layers. We hypothesize that only n = 1 arc-cosine
kernels preserve sufficient information about the magnitude of their inputs to work effectively in
composition with other kernels. Recall that only the n = 1 arc-cosine kernel preserves the norm of
its inputs: the n = 0 kernel maps all inputs onto a unit hypersphere in feature space, while higherorder
(n>1) kernels induce feature spaces with different dynamic ranges.
Finally, the results on both data sets reveal an interesting trend: the multilayer arc-cosine kernels
often perform better than their single-layer counterparts. Though SVMs are (inherently) shallow
architectures, this trend suggests that for these problems in binary classification, arc-cosine kernels
may be yielding some of the advantages typically associated with deep architectures.
S
D
3 Deep learning
In this section, we explore how to use kernel methods in deep architectures [7]. We show how to train
deep kernel-based architectures by a simple combination of supervised and unsupervised methods.
Using the arc-cosine kernels in the previous section, these multilayer kernel machines (MKMs)
perform very competitively on multiclass data sets designed to foil shallow architectures [11].
53.1 Multilayer kernel machines
We explored how to train MKMs in stages that involve kernel PCA [13] and feature selection [14] at
intermediate hidden layers and large-margin nearest neighbor classification [15] at the final output
layer. Specifically, for ℓ-layer MKMs, we considered the following training procedure:
1. Prune uninformative features from the input space.
2. Repeat ℓ times:
(a) Compute principal components in the feature space induced by a nonlinear kernel.
(b) Prune uninformative components from the feature space.
3. Learn a Mahalanobis distance metric for nearest neighbor classification.
The individual steps in this procedure are well-established methods; only their combination is new.
While many other approaches are worth investigating, our positive results from the above procedure
provide a first proof-of-concept. We discuss each of these steps in greater detail below.
Kernel PCA. Deep learning in MKMs is achieved by iterative applications of kernel PCA [13]. This
use of kernel PCA was suggested over a decade ago [16] and more recently inspired by the pretraining
of deep belief nets by unsupervised methods. In MKMs, the outputs (or features) from
kernel PCA at one layer are the inputs to kernel PCA at the next layer. However, we do not strictly
transmit each layer’s top principal components to the next layer; some components are discarded if
they are deemed uninformative. While any nonlinear kernel can be used for the layerwise PCA in
MKMs, arc-cosine kernels are natural choices to mimic the computations in large neural nets.
Feature selection. The layers in MKMs are trained by interleaving a supervised method for feature
selection with the unsupervised method of kernel PCA. The feature selection is used to prune away
uninformative features at each layer in the MKM (including the zeroth layer which stores the raw
inputs). Intuitively, this feature selection helps to focus the unsupervised learning in MKMs on
statistics of the inputs that actually contain information about the class labels. We prune features
at each layer by a simple two-step procedure that first ranks them by estimates of their mutual
information, then truncates them using cross-validation. More specifically, in the first step, we
discretize each real-valued feature and construct class-conditional and marginal histograms of its
discretized values; then, using these histograms, we estimate each feature’s mutual information with
the class label and sort the features in order of these estimates [14]. In the second step, considering
only the first w features in this ordering, we compute the error rates of a basic kNN classifier using
Euclidean distances in feature space. We compute these error rates on a held-out set of validation
examples for many values of k and w and record the optimal values for each layer. The optimal w
determines the number of informative features passed onto the next layer; this is essentially the
width of the layer. In practice, we varied k from 1 to 15 and w from 10 to 300; though exhaustive,
this cross-validation can be done quickly and efficiently by careful bookkeeping. Note that this
procedure determines the architecture of the network in a greedy, layer-by-layer fashion.
Distance metric learning. Test examples in MKMs are classified by a variant of kNN classification
on the outputs of the final layer. Specifically, we use large margin nearest neighbor (LMNN) classification
[15] to learn a Mahalanobis distance metric for these outputs, though other methods are
equally viable [17]. The use of LMNN is inspired by the supervised fine-tuning of weights in the
training of deep architectures [18]. In MKMs, however, this supervised training only occurs at the
final layer (which underscores the importance of feature selection in earlier layers). LMNN learns a
distance metric by solving a problem in semidefinite programming; one advantage of LMNN is that
the required optimization is convex. Test examples are classified by the energy-based decision rule
for LMNN [15], which was itself inspired by earlier work on multilayer neural nets [19].
3.2 Experiments on multiway classification
We evaluated MKMs on the two multiclass data sets from previous benchmarks [11] that exhibited
the largest performance gap between deep and shallow architectures. The data sets were created from
the MNIST data set [20] of 28 × 28 grayscale handwritten digits. The mnist-back-rand data set was
generated by filling the image background by random pixel values, while the mnist-back-image data
set was generated by filling the image background with random image patches; examples are shown
in Figs. 4 and 5. Each data set contains 12000 and 50000 training and test examples, respectively.
6test set. SVMs with arc cosine kernels have error rates from 22.36–25.64%. Resul
kernels of varying degree (n) and levels of recursion (ℓ). The best previous results
SVMs with RBF kernels and 22.50% for deep belief nets [2]. See text for details.
Test error rate (%)
21
20
Test error rate (%)
26
Test Test error error rate rate (%) (%) 19
21
8
24
18
20
17
7
1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5
19
SVM!RBF
Step (n=0) DBN−3Ramp (n=1) Quarter!pipe (n=
22
DBN!3
6
1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6
18
Figure 3: Left: examples from the Step convex (n=0) data set. Ramp Right: (n=1) classification Quarter!pipe error(n=2)
rates
5 17
SVMs with arc cosine kernels have error rates from 17.15–20.51%. Results are sh
0 1 1 2 2 3 3 4 4 5
Figure
5of 6
2:
varying 1
Left:
2 degree examples
3 4 4 (n) 5 6
from the
and levels 1 2 rectangles-image
of3 recursion 4 1 5 62
data set. Right: classification error
(ℓ). The best previous results are 19
Step (n=0)
test set. SVMs
with RBF kernels Ramp with (n=1) arc cosine
and 18.63% Quarter−pipe kernels Quarter!pipe for deep (n=2) have
belief RBF (n=2) error rates from 22.36–25.64%. Results a
nets [2]. See text for details.
kernels of varying degree (n) and levels of recursion (ℓ). The best previous results are
Figure 4: Left: examples from the mnist-back-rand SVMs data withset. RBFRight: kernels classification and 22.50% for error deep rates belief onnets the [2]. See text for details.
test set for MKMs with different kernels and numbers of layers ℓ. MKMs with arc-cosine kernel
have error rates from 6.36–7.52%. The best previous 2000results trainingare examples 14.58% as fora SVMs validation withset RBF to choose kernelsthe margin penalty parameter
Test error rate (%)
and 6.73% for deep belief nets [11].
this parameter by cross-validation, 21 we then retrained each SVM using all the trainin
reference, we also report the best results obtained previously from three layer deep be
3) and SVMs with RBF20kernels (SVM-RBF). These references are representativ
state-of-the-art Test Test error error rate rate for (%) deep (%) 19 and shallow architectures on these data sets.
30 21
The right panels of figures 18 2 and 3 show the test set error rates from arc cosine ke
20
degree (n) and levels of recursion (ℓ). We experimented with kernels of degree n
25
17
corresponding to single layer
1
threshold
2 3 4 5
networks
6 1
with
2 3
“step”,
4 5 6“ramp”, 1 2
and
3
“quarter-
4 5 6
19
SVM!RBF
Step (n=0) SVM−RBF Ramp (n=1) Quarter!pipe (n=2)
20
functions. We also experimented with the multilayer DBN!3 kernels described in section
18
Figure from one 3: Left: to six examples levels of from recursion. the convex Overall, data set. theRight: figuresclassification show that on error these rates two on
SVMs different witharc arc cosine kernels outperform have error rates theDBN−3
15 17
from best results 17.15–20.51%. previously Results reported are shown for S
0 1 2 2 3 3 4 4 5 5of 6kernels varying 1 and 2 degree 3 deep 4 (n) 5
belief 6and nets. levels 1 1 2
We of3 recursion give 4 1 5 more 26
(ℓ). details The best on these previous experiments results arebelow.
19.13%
Step (n=0)
with though, RBFRamp we kernels (n=1)
note and that18.63% Quarter−pipe Quarter!pipe
SVMs with for(n=2)
deep arc cosine belief RBF (n=2) nets kernels [2]. See are very text for straightforward details. to trai
Figure 5: Left: examples from the mnist-back-image with data RBFset. kernels, Right: they classification do not require error tuning rates a kernel on thewidth parameter, and unlike d
test set for MKMs with different kernels and numbers
they do not
of
require
layers ℓ.
solving
MKMs
a difficult
with arc-cosine
nonlinear optimization
kernel
or searching over possib
have error rates from 18.43–29.79%. The best 2000 Inprevious our training experiments, results examples areweas 22.61% quickly a validation for discovered SVMs set to with choose that the RBF themultilayer margin penalty kernels parameter; only perfor af
kernels and 16.31% for deep belief nets [11]. this n = parameter 1 kernelsby were cross-validation, used at higher we (ℓ then > 1) retrained levels ineach the recursion. SVM usingFigs. all the 2 and training 3 ther ex
reference, these setswe ofalso results; report inthe particular, best results each obtained grouppreviously of bars shows fromthe three test layer error deep rates belief w
3) kernel and SVMs (of degree withn RBF = 0, kernels 1, 2) was (SVM-RBF). used at the These first layer references of nonlinearity, are representative while theof
n
state-of-the-art
used at successive
for deep
layers.
and
We
shallow
do not
architectures
have a formal
on these
explanation
data sets.
for this effect. How
We trained MKMs with arc-cosine kernels and
The only RBF
right the kernels
panels n = 1 in arc of
each
figures cosine layer.
2kernel and
For
3preserves show
each
the
data
test theset, norm set error
weof its rates inputs: from the arc cosine n = 0 kernel
initially withheld the last 2000 training examples degree onto as aavalidation (n) unit and hypersphere levels set. of Performance recursion in feature (ℓ). space, onWe thisexperimented while validation higher-order with kernels (n > 1) of kernels degree n may =
set was used to determine each MKM’s architecture, corresponding spacesas with described severely to single indistorted layer the previous threshold dynamic section, networks ranges. and with Therefore, also “step”, we “ramp”, hypothesize and “quarter-pip that only
to set the kernel width in RBF kernels, following functions. kernels the preserve same We also methodology sufficient experimented information as with earlier theabout studies multilayer the[11].
magnitude kernels described of their inputs in section to wo 2.3
Once these parameters were set by cross-validation, from composition one we re-inserted to six with levels other the ofkernels.
validation recursion. examples Overall, the into figures the show that on these two data
training set and used all 12000 training examples different for feature arc cosine selection kernels andoutperform distance metric the best learning. results previously reported for SVM
kernels
Finally,
and
the
deep
results
belief
on
nets.
both data
We give
sets
more
reveal
details
an interesting
on these
trend:
experiments
the multilayer
below. At
ar
For kernel PCA, we were limited by memory requirements
though, often perform
to
we notebetter processing
that SVMs than their
only
with single
6000
arc cosine layer
out
kernels counterparts.
of 12000
are very Though straightforward SVMs are to train; shallo
training examples. We chose these 6000 examples randomly, but repeated each experiment five
u
times to obtain a measure of average performance.
with RBF
The
kernels,
results
they
we
do
report
not require
for each
tuning
MKM
a kernel
are the
width parameter, and unlike deep
they do not require solving a difficult nonlinear optimization
5
or searching over possible a
average performance over these five runs.
In our experiments, we quickly discovered that the multilayer kernels only performe
The right panels of Figs. 4 and 5 show the testn set = 1error kernels rates were of used MKMs at higher with different (ℓ > 1) levels kernels in theand
recursion. Figs. 2 and 3 therefor
numbers of layers ℓ. For reference, we also show thesethe setsbest of results; previously in particular, reportedeach results group[11] of bars using shows the test error rates when
traditional SVMs (with RBF kernels) and deepkernel belief(of nets degree (withn three = 0, 1, layers). 2) was used MKMs at the perform first layer sigof nonlinearity, while the n = 1
nificantly better than shallow architectures suchused as SVMs at successive with RBF layers. kernels We doornot LMNN have awith formal feature explanation for this effect. Howeve
selection (reported as the case ℓ = 0). Compared only the to deep n = 1belief arc cosine nets, kernel the leading preserves MKMs the norm obtain of its inputs: the n = 0 kernel ma
slightly lower error rates on one data set and slightly onto ahigher unit hypersphere error ratesinonfeature another. space, while higher-order (n > 1) kernels may in
spaces with severely distorted dynamic ranges. Therefore, we hypothesize that only n=
We can describe the architecture of an MKM by kernels the number preserveof sufficient selectedinformation features atabout each the layer magnitude (in- of their inputs to work e
cluding the input layer). The number of features composition essentially with corresponds other kernels. to the number of units in
each layer of a neural net. For the mnist-back-rand Finally, datathe set, results the best on both MKMdata used setsanreveal n=1anarc-cosine
interesting trend: the multilayer arc co
kernel and 300-90-105-136-126-240 features at often eachperform layer. better For the than mnist-back-image their single layer data counterparts. set, the Though SVMs are shallow a
best MKM used an n=0 arc-cosine kernel and 300-50-130-240-160-150 features at each layer.
MKMs worked best with arc-cosine kernels of degree n = 0 and n = 1. The kernel of degree 5 n = 2
performed less well in MKMs, perhaps because multiple iterations of kernel PCA distorted the
dynamic range of the inputs (which in turn seemed to complicate the training for LMNN). MKMs
with RBF kernels were difficult to train due to the sensitive dependence on kernel width parameters.
It was extremely time-consuming to cross-validate the kernel width at each layer of the MKM. We
only obtained meaningful results for one and two-layer MKMs with RBF kernels.
7We briefly summarize many results that we lack space to report in full. We also experimented
on multiclass data sets using SVMs with single and multi-layer arc-cosine kernels, as described in
section 2. For multiclass problems, these SVMs compared poorly to deep architectures (both DBNs
and MKMs), presumably because they had no unsupervised training that shared information across
examples from all different classes. In further experiments on MKMs, we attempted to evaluate the
individual contributions to performance from feature selection and LMNN classification. Feature
selection helped significantly on the mnist-back-image data set, but only slightly on the mnist-backrandom
data set. Finally, LMNN classification in the output layer yielded consistent improvements
over basic kNN classification provided that we used the energy-based decision rule [15].
4 Discussion
In this paper, we have developed a new family of kernel functions that mimic the computation in
large, multilayer neural nets. On challenging data sets, we have obtained results that outperform previous
SVMs and compare favorably to deep belief nets. More significantly, our experiments validate
the basic intuitions behind deep learning in the altogether different context of kernel-based architectures.
A similar validation was provided by recent work on kernel methods for semi-supervised
embedding [7]. We hope that our results inspire more work on kernel methods for deep learning.
There are many possible directions for future work. For SVMs, we are currently experimenting with
arc-cosine kernel functions of fractional and (even negative) degree n. For MKMs, we are hoping
to explore better schemes for feature selection [21, 22] and kernel selection [23]. Also, it would be
desirable to incorporate prior knowledge, such as the invariances modeled by convolutional neural
nets [24, 4], though it is not obvious how to do so. These issues and others are left for future work.
A
Derivation of kernel function
In this appendix, we show how to evaluate the multidimensional integral in eq. (1) for the arc-cosine
kernel. Let θ denote the angle between the inputs x and y. Without loss of generality, we can take x
to lie along the w1 axis and y to lie in the w1w2-plane. Integrating out the orthogonal coordinates
of the weight vector w, we obtain the result in eq. (3) where Jn(θ) is the remaining integral:
∫
Jn(θ) =
1 −
dw1 dw2 e 2 (w2 1 +w2 2 ) Θ(w1) Θ(w1 cos θ + w2 sin θ) w n 1 (w1 cos θ + w2 sin θ) n . (14)
Changing variables to u=w1 and v = w1 cos θ+w2 sin θ, we simplify the domain of integration to
the first quadrant of the uv-plane:
Jn(θ) = 1
∫ ∞ ∫ ∞
du dv e
sin θ 0 0
−(u2 +v 2 −2uv cos θ)/(2 sin 2 θ) n n
u v . (15)
The prefactor of (sin θ) −1 in eq. (15) is due to the Jacobian. To simplify the integral further, we
adopt polar coordinates u = r cos( ψ π
ψ π
2
+
4
) and v = r sin(
2
+
4
). Then, integrating out the radius
coordinate r, we obtain:
∫ π
2
Jn(θ) = n! (sin θ) 2n+1
cos
dψ
0
n ψ
. (16)
(1 − cos θ cos ψ)
n+1
To evaluate eq. (16), we first consider the special case n=0. The following result can be derived by
contour integration in the complex plane [25]:
∫ π/2
dψ π − θ
= . (17)
0 1 − cos θ cos ψ sin θ
Substituting eq. (17) into our expression for the angular part of the kernel function in eq. (16), we
recover our earlier claim that J0(θ) = π −θ. Related integrals for the special case n = 0 can also be
found in earlier work [8].For the case n>0, the integral in eq. (16) can be performed by the method
of differentiating under the integral sign. In particular, we note that:
∫ π
2
∫ π/2
cos
dψ
0
n ψ
1 ∂
=
(1 − cos θ cos ψ)
n+1
n!
n
∂(cos θ) n
dψ
. (18)
0 1 − cos θ cos ψ
Substituting eq. (18) into eq. (16), then appealing to the previous result in eq. (17), we recover the
expression for Jn(θ) in eq. (4).
8References
[1] Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. MIT Press, 2007.
[2] G.E. Hinton, S. Osindero, and Y.W. Teh. A fast learning algorithm for deep belief nets. Neural Computation,
18(7):1527–1554, 2006.
[3] G.E. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science,
313(5786):504–507, July 2006.
[4] M.A. Ranzato, F.J. Huang, Y.L. Boureau, and Y. LeCun. Unsupervised learning of invariant feature
hierarchies with applications to object recognition. In Proceedings of the 2007 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR-07), pages 1–8, 2007.
[5] R. Collobert and J. Weston. A unified architecture for natural language processing: deep neural networks
with multitask learning. In Proceedings of the 25th International Conference on Machine Learning
(ICML-08), pages 160–167, 2008.
[6] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, to appear,
2009.
[7] J. Weston, F. Ratle, and R. Collobert. Deep learning via semi-supervised embedding. In Proceedings of
the 25th International Conference on Machine Learning (ICML-08), pages 1168–1175, 2008.
[8] C.K.I. Williams. Computation with infinite neural networks. Neural Computation, 10(5):1203–1216,
1998.
[9] R.H.R. Hahnloser, H.S. Seung, and J.J. Slotine. Permitted and forbidden sets in symmetric thresholdlinear
networks. Neural Computation, 15(3):621–638, 2003.
[10] R.M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag New York, Inc., 1996.
[11] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures
on problems with many factors of variation. In Proceedings of the 24th International Conference
on Machine Learning (ICML-07), pages 473–480, 2007.
[12] C.C. Chang and C.J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at
http://www.csie.ntu.edu.tw/˜cjlin/libsvm.
[13] B. Schölkopf, A. Smola, and K. Müller. Nonlinear component analysis as a kernel eigenvalue problem.
Neural Computation, 10(5):1299–1319, 1998.
[14] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning
Research, 3:1157–1182, 2003.
[15] K.Q. Weinberger and L.K. Saul. Distance metric learning for large margin nearest neighbor classification.
Journal of Machine Learning Research, 10:207–244, 2009.
[16] B. Schölkopf, A. J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue
problem. Technical Report 44, Max-Planck-Institut für biologische Kybernetik, 1996.
[17] J. Goldberger, S. Roweis, G.E. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In
L.K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages
513–520. MIT Press, 2005.
[18] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In
B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19,
pages 153–160. MIT Press, 2007.
[19] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to
face verification. In Proceedings of the 2005 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR-05), pages 539–546, 2005.
[20] Y. LeCun and C. Cortes. The MNIST database of handwritten digits. http://yann.lecun.com/
exdb/mnist/.
[21] M. Tipping. Sparse kernel principal component analysis. In Advances in Neural Information Processing
Systems 13. MIT Press, 2001.
[22] A.J. Smola, O.L. Mangasarian, and B. Schölkopf. Sparse kernel feature analysis. Technical Report 99-04,
University of Wisconsin, Data Mining Institute, Madison, 1999.
[23] G. Lanckriet, N. Cristianini, P. Bartlett, L.E. Ghaoui, and M.I. Jordan. Learning the kernel matrix with
semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
[24] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.D. Jackel. Backpropagation
applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[25] G.F. Carrier, M. Krook, and C.E. Pearson. Functions of a Complex Variable: Theory and Technique.
Society for Industrial and Applied Mathematics, 2005.
9