

Learning Intersections of Halfspaces with a Margin

Adam R. Klivans

\Lambda 

Divsion of Engineering and Applied Sciences

Harvard University Cambridge, MA 02138 klivans@eecs.harvard.edu

Rocco A. Servedio Department of Computer Science

Columbia University New York, NY 10027 rocco@cs.columbia.edu

Abstract We give a new algorithm for learning intersections of halfspaces with a margin, i.e. under the assumption that no example lies too close to any separating hyperplane. Our algorithm combines random projection techniques for dimensionality reduction, polynomial threshold function constructions, and kernel methods. The algorithm is fast and simple. It learns a broader class of functions and achieves an exponential runtime improvement compared with previous work on learning intersections of halfspaces with a margin.

\Lambda Supported by an NSF Mathematical Sciences Postdoctoral Research Fellowship.

1

Arriaga & Vempala [3] This Paper h1 ^ \Delta  \Delta  \Delta  ^ ht n \Delta  poly

i

log t

ae

j

+

i

log t

ae

j t log

t ae

ae2

n

i

t ae

jt log t log 1

ae

or n

i

log t

ae

j

q

1

ae log t

f (h1; : : : ; ht) -------------- n

i

t ae

jt2 log 1

ae

Table 1: Bounds on running time for learning intersections and arbitrary functions of t halfspaces with margin ae. Each hi is a halfspace over Rn; in the second line f denotes an arbitrary Boolean function (not known a priori to the learner) on t bits. In each case the target function is assumed to have margin ae:

1 Introduction The Perceptron algorithm and Perceptron Convergence Theorem are among the oldest and most famous results in machine learning. The Perceptron Convergence Theorem (see e.g. [10]) states that at most 4=ae2 iterations of the Perceptron update rule are required in order to correctly classify any set S of examples which are consistent with some halfspace which has margin ae on S. (Roughly speaking, this margin condition means that no example lies within distance ae of the separating hyperplane; we give precise definitions later.)

Since halfspace learning is so widely used in machine learning algorithms and applications, it is of great interest to develop efficient algorithms for learning intersections of halfspaces and other more complex functions of halfspaces. While this problem has been intensively studied, progress to date has been quite limited; we give a brief overview of relevant previous work on learning intersections of halfspaces at the end of this section.

1.1 Our results: toward Perceptron-like performance for learning intersections

of halfspaces.

In this paper we take a perspective similar to that of the original Perceptron Convergence Theorem by highlighting the role of the margin; our goal is to obtain results analogous to the Perceptron Convergence Theorem for learning intersections of halfspaces with margin ae. (Roughly speaking, an intersection of t halfspaces has margin ae relative to a data set if each of the defining halfspaces has margin ae on the data set; we give precise definitions later.) The margin is a natural parameter to consider; previous work by Arriaga and Vempala [3] on learning intersections of halfspaces has explicitly studied the dependence on this parameter. Since the Perceptron algorithm learns a single halfspace in time O(1=ae2), the ultimate goal in this framework would be an algorithm which can learn (say) an intersection of two halfspaces in time polynomial in 1=ae as well.

Table 1 summarizes our main results. For any constant t = O(1) number of halfspaces (in our opinion this is the most interesting case) our learning algorithm runs in (1=ae)O(log 1=ae) time, i.e. quasipolynomial in 1=ae: This is an exponential improvement over Arriaga and Vempala's previous result [3] which was an algorithm that runs in (1=ae)!(1=ae

2)

time. (Put another way, our algorithm

can learn the intersection of O(1) halfspaces with margin at least 1=2

p

log n in poly(n) time, whereas

Arriaga and Vempala require the margin to be at least !(1=

p

log n) to achieve poly(n) runtime.)

In fact, we can learn any Boolean function of t = O(1) halfspaces, not just an intersection of halfspaces, in (1=ae)O(log 1=ae) time.

One can instead consider the number of halfspaces t as the relevant asymptotic parameter and view ae as fixed at \Theta (1). For this case we give an algorithm which has a tO(log log t) dependence on

2

Algorithm A(EX(c; D)):

1. Let M be a n \Theta  k random projection matrix. 2. Draw m many examples from EX(c; D) and project them to Rk

using M .

3. Run the kernel Perceptron algorithm using the polynomial kernel

Kd(x; y) = (x \Delta  y + 1)d over the projected examples until a consistent hypothesis is obtained. Let h0 be the kernel Perceptron hypothesis (a mapping from Rk to f\Gamma 1; 1g).

4. Output h : Rn ! f\Gamma 1; 1g; h(x) = sign(h0(M T x)) as the final

hypothesis.

Figure 1: The algorithm is given access to a source EX(c; D) of random labelled examples, where the target concept c is an intersection of t halfspaces over Rn which has margin ae with respect to distribution D: The values of m; k and d are given in Section 6.

t; this algorithm can learn an intersection of t = n1= log log n many halfspaces in poly(n) time. In contrast, the previous algorithm of [3] has a t!(t) dependence on t and thus runs in poly(n) time only for t = o(

log n

log log n ) many halfspaces.

As described below all our results are achieved using simple iterative algorithms (in fact using

simple variants of the Perceptron algorithm!).

1.2 Our Approach Our algorithm for learning an intersection of t halfspaces in Rn with margin ae is given in Figure 1. The algorithm has three conceptual stages: (i) random projection, (ii) polynomial threshold function construction, and (iii) kernel methods used to learn polynomial threshold functions. We now give a brief overview of each of these stages.

Random Projection: Random projection for dimensionality reduction has emerged as a useful tool in many areas of CS theory. The key fact on which most of these applications are based is the Johnson-Lindenstrauss lemma [13] which shows that a random projection of a set of m points in Rn into Rk (with k ss O(

log m

ffl2 )) with high probability will not change pairwise distances by more than a (1 \Sigma  ffl) factor. Arriaga and Vempala [3] were the first to give learning algorithms based on

random projections. Their key insight was that since the geometry of a sample does not change much under random projection, one can run learning algorithms in the low dimensional space Rk rather than Rn and thus get a computational savings.

As described in Section 3, the first step of our algorithm is to perform a random projection of the sample from Rn into a lower dimensional space Rk where k has no dependence on n. After this projection, with high probability we have data points in Rk which are labelled according to some intersection of halfspaces with margin ae=2:

Polynomial Threshold Functions: Constructions of polynomial threshold functions (PTFs) have recently proved quite useful in computational learning theory; for example the DNF learning algorithm of [16] has at its heart the fact that any DNF formula can be expressed as a low degree thresholded polynomial sign(p(x)): The second conceptual step of our algorithm is to construct a polynomial threshold function for an intersection of halfspaces over Rk: We show in Section 4 that

3

any intersection of halfspaces with margin ae=2 over Rk can be expressed as a low-degree polynomial threshold function p over Rk: Moreover, unlike previous analyses (which only gave degree bounds) we show that this PTF p has nonnegligible PTF margin (we define PTF margin in Section 2.2). We can thus view our projected data in Rk as being labelled according to some degree-d PTF over Rk which has nonnegligible PTF margin. (We emphasize that this is only a conceptual rather than an algorithmic step - the learning algorithm itself does not have to do anything at this stage!)

Kernel Methods: The third step is to learn the low-degree polynomial threshold function over Rk. As shown in Section 5 we do this using the Perceptron algorithm with the standard polynomial kernel Kd(x; y) = (1 + x \Delta  y)d: The kernel Perceptron algorithm learns an implicit representation of a halfspace over an expanded feature space; here the expanded space has a feature for each monomial

of degree up to d; and thus each example in Rk corresponds to a point in R(

k+d

d ). We show that

since there is a polynomial threshold function which correctly classifies the data in Rk with some PTF margin, there must be a halfspace over R(

k+d

d ) which correctly classifies the expanded data

with a margin, and thus we can use kernel Perceptron to learn.

1.3 Comparison with Previous Work Many researchers have considered the problem of learning intersections of halfspaces. Efficient algorithms are known for learning intersections of halfspaces under the uniform distribution on the unit ball [7, 22] and on the Boolean cube [15], but less is known about learning under more general probability distributions. Baum [4] gave an algorithm which learns an intersection of two origin-centered halfspaces under any symmetric distribution D (which satisfies D(x) = D(\Gamma x) for all x 2 Rn), and Klivans et al. [15] gave a PTF-based algorithm which learns an intersection of O(1) many poly(n)-weight halfspaces over f0; 1gn in nO(log n) time under any distribution.

The most closely related previous work is that of Arriaga and Vempala [3] who gave an algorithm for learning an intersection of halfspaces with margin ae; see Table 1 for a comparison with their results. Their algorithm uses random projection to reduce dimensionality and then uses a bruteforce search over all (combinatorially distinct) halfspaces over the sample data. In contrast, our algorithm combines polynomial threshold functions and kernel methods with random projections, and is able to achieve an exponential runtime savings over [3].

2 Preliminaries 2.1 Concepts and Margins A concept is simply a Boolean function c : Rn ! f\Gamma 1; +1g: A halfspace over Rn is a Boolean function h : Rn ! f\Gamma 1; 1g defined by a vector w 2 Rn and a value ` 2 R; given an input x 2 Rn; the value of h(x) is sign(w \Delta  x \Gamma  `); i.e. h(x) = +1 if w \Delta  x * ` and h(x) = \Gamma 1 if w \Delta  x ! `: An intersection of t halfspaces h1; : : : ; ht is the Boolean AND of these halfspaces, i.e. the value is 1 if hi(x) = 1 for all i = 1; : : : ; t and is \Gamma 1 otherwise.

For two vectors x; y 2 Rn we write kx \Gamma  yk to denote the Euclidean distance between x and y and we write Sn\Gamma 1 for the unit ball in Rn: We have:

Definition 1 Given X ae Rn and a concept c over Rn, write kXk to denote maxz2X kzk: We say that c has (geometric) margin ae with respect to X if

ae = minfkz \Gamma  yk : z 2 X; y 2 R

n

; c(z) 6= c(y)g=kXk:

4

Our definition of the geometric margin is similar to the notion of robustness defined in Arriaga and Vempala [3]; the difference is that we normalize by dividing by the radius of the data set kXk. In the case where kXk = 1 these notions coincide and the condition is simply that for every z 2 X, every point within a ball of radius ae around z has the same label as z under c:

For D a probability distribution over Rn we write Supp(D) to denote the set fx 2 Rn : D(x) ? 0g. We say that c has margin ae with respect to distribution D if c has margin ae on Supp(D): Thus, for D a distribution where Supp(D) ae Sn\Gamma 1, an intersection of t halfspaces has margin ae with respect to D if every point in Supp(D) lies at least distance ae away from each of the t separating hyperplanes.

Throughout this paper we assume that: (i) All halfspaces in our intersection of halfspaces learning problem are origin-centered, i.e. of the form sign(w \Delta  x \Gamma  `) with ` = 0 - this can be achieved by adding an (n + 1)st coordinate to each example. (ii) All examples lie on the unit ball Sn\Gamma 1 - this can be achieved by adding a new coordinate so that all examples have the same norm and rescaling.

2.2 Polynomial Threshold Functions and PTF Margins Let f : Rn ! f\Gamma 1; 1g be a Boolean function and X be a subset of Rn: A real polynomial p in n variables is said to be a polynomial threshold function (PTF) for f over X if sign(p(x)) = f (x) for all x 2 X: The degree of a polynomial threshold function p is simply the degree of the polynomial p. Polynomial threshold functions are well studied in the case where X = f0; 1gn or f\Gamma 1; 1gn (see e.g. [5, 16, 18, 20]) but we will consider other more general subsets X:

For S ` fx1; : : : ; xng a multiset of variables, we write xS to denote the monomial

Q

i2S xi.

For p(x) =

P

S cSxS a polynomial, we write kpk to denote

q

P

S c

2 S, i.e. the L2 norm of the

vector of coefficients of p: Given a PTF p over X; we define the PTF margin of p over X to

be minfjp(z)j : z 2 Xg=kpk: Note that if p(x) = w \Delta  x is a degree-1 polynomial which has kpk =p

w21 + \Delta  \Delta  \Delta  + w2n = 1; then the PTF margin of p over X is equal to the geometric margin of sign(p(x)) over X (up to scaling by kXk). However in general for polynomials of degree greater than 1 these two notions are not equivalent.

2.3 The Perceptron Algorithm and Kernel Perceptron Perceptron is a simple iterative algorithm which finds a linear separator for a labelled data set X ae Rn if such a separator exists. The algorithm maintains a weight vector w 2 Rn and a bias ` 2 R and updates these parameters additively after each example; see e.g. Chapter 2 of [10] for details. The Perceptron Convergence Theorem bounds the number of updates in terms of the maximum margin of any halfspace (the following is adapted from Theorem 2.3 of [10]):

Theorem 2 Let X ae Rn be a set of labelled examples such that there is some halfspace h (which need not be origin-centered) which has margin ae over X: Then the Perceptron algorithm makes at most

4

ae2 mistakes on X:

Let OE : Rn ! RN be a function which we call a feature expansion. We refer to Rn as the original feature space and RN as the expanded feature space. The kernel corresponding to OE is the function K(x; y) = OE(x) \Delta  OE(y): The use of kernels in machine learning has received much research attention in recent years (see e.g. [10, 12] and references therein).

Given a data set X ae Rn; it is well known (see e.g. [11]) that the Perceptron algorithm can be simulated over OE(X) in the expanded feature space RN using the kernel function K(x; y) to

5

yield an implicit representation of a halfspace in RN : If evaluating K(x; y) takes time T and the Perceptron algorithm is simulated until M mistakes are made on a data set X with jXj = m; the time required is O(mT M 2) (see e.g. [12, 14]).

3 Random Projections We say that an n \Theta  k matrix M is a random projection matrix if each entry of M is chosen independently and uniformly from f\Gamma 1; 1g: We will use the following lemma from Arriaga and Vempala [3] (see Achlioptas [1] for similar results):

Lemma 3 [3] Fix ae ! 1, 0 ! c !

1

2 and w 2 R

n with kwk = 1: Let M be an n \Theta  k random

projection matrix. For any x 2 Rn we have

Pr[w \Delta  x \Gamma  2c ^ (M

T

w) \Delta  (M

T

x) ^ w \Delta  x + 2c] * 1 \Gamma  6e

\Gamma (c2\Gamma c3)k=4

* 1 \Gamma  6e

\Gamma c2k=8

:

With this lemma in hand we can establish the main theorem on random projection which we will use:

Theorem 4 Let X be a set of m points on Sn\Gamma 1 and let h = sign(w \Delta  x) be a halfspace which has margin ae on X: Let k *

2048

ae2 log(

18m

ffi ) and let M be a n \Theta  k random projection matrix. Let

M (X) ae Rk denote the projection of X under M and let h0 : Rk ! f\Gamma 1; +1g denote the function h0(y) = sign((M T w) \Delta  y). Then with probability 1 \Gamma  ffi, the halfspace h0 correctly classifies M (X) with margin at least

ae

2 and we have

1 2 ^ kM (X)k ^ 2.

Proof: We may assume that kwk = 1: After applying M to the points in X, we need to verify that Definition 1 is satisfied for h0 with respect to the points in M (X). Setting c =

ae

16 and

setting k as above, taking x = w in Lemma 3 we have that with probability at least 1 \Gamma 

ffi

3m ,

kM T wk2 ^ kwk2 +

ae

8 = 1 +

ae 8 , so kM

T wk ^ 1 + ae

16 .

Now for each point z 2 X; applying Lemma 3 with x = z, with probability at least 1 \Gamma  ffi3m

we have (w \Delta  z) \Gamma 

ae

8 ^ (M

T w) \Delta  (M T z) ^ (w \Delta  z) + ae

8 : Since j(w \Delta  z)j * ae, this gives j(M

T w) \Delta 

(M T z)j *

7ae

8 . Hence with probability at least 1 \Gamma 

ffi 2 we have minfkz

0 \Gamma  yk : z0 2 M (X); y 2

Rk; h0(z0) 6= h0(y)g * minz2X j(M T w) \Delta  (M T z)j=kM T wk *

7ae=8

1+ae=16 *

3ae

4 . Lemma 3 similarly implies

that 1 \Gamma 

ae

8 ^ kM (X)k ^ 1 +

ae 16 with probability at least 1 \Gamma 

ffi 2 : Thus with probability 1 \Gamma  ffi, h

0 has

margin at least

ae

2 on M (X) and

1 2 ^ kM (x)k ^ 2. 2

A union bound yields the following corollary:

Corollary 5 Let X be a set of m points on Sn\Gamma 1 and let H =

Vt

i=1 hi = sign(w

1\Delta x)^: : :^sign(wt\Delta x)

be an intersection of t halfspaces which has margin ae on X. Let k *

2048

ae2 \Delta  log(

18mt

ffi ) and let M be

a n \Theta  k random projection matrix. Let M (X) ae Rk denote the projection of X under M and let

H0 =

Vt

i=1 sign((M

T wi) \Delta  y). Then with probability 1 \Gamma  ffi, the intersection of halfspaces H0 correctly

classifies M (X) with margin at least

ae

2 and

1 2 ^ kM (X)k ^ 2:

Thus with high probability the projected set of examples in Rk is classified by an intersection of halfspaces with margin

ae

2 . It is easy to see that the corollary in fact holds for any Boolean function

(not just intersections) of t halfspaces.

6

4 Polynomial Threshold Functions for Intersections of Halfspaces

with a Margin

In this section we give several constructions of polynomial threshold functions for intersections of halfspaces with a margin. In each case we give a PTF and also a lower bound on the PTF margin of the polynomial threshold function which we construct. These PTF margin lower bounds will be useful when we analyze the performance of kernel methods for learning polynomial threshold functions.

In order to lower bound the PTF margin of a polynomial p we must upper bound kpk: Fact 6 (proved in Appendix A) helps obtain such upper bounds:

Fact 6 1. For i = 1; : : : ; ` let qi(x) =

P

S ci;SxS be a polynomial of degree at most d over x1; : : : ; xk with kqik2 ^ M: Then kq1(x) : : : q`(x)k2 ^ K`M ` where K =

\Gamma k+d

d

\Delta 

.

2. Let q1; : : : ; q` be polynomials with kqik2 ^ Mi: Then kq1 + \Delta  \Delta  \Delta  + q`k2 ^ `(M1 + \Delta  \Delta  \Delta  + M`):

4.1 Constructions based on Rational Functions Recall that a rational function is a quotient of two real polynomials, i.e. Q(x) = a(x)=b(x). The degree of Q is defined as deg(a) + deg(b): Building on earlier results of Newman [17] on rational functions which approximate the absolute value function jxj, in [6] Beigel et al. gave a construction of a low-degree rational function which closely approximates the function sgn(x). We will use the following lemma (Lemma 9 of [6]):

Lemma 7 [6] For all integers r; ` * 1 there is a univariate rational function P r` (x) =

a(x)

b(x) of

degree O(` log r) with the following properties: (i) P r` (x) 2 [1; 1 +

1

r ] for all x 2 [1; 2

`]; (ii) P r

` (x) 2

[\Gamma 1 \Gamma 

1

r ; \Gamma 1] for all x 2 [\Gamma 2

`; \Gamma 1]; and (iii) Each coefficient of a(x); b(x) has magnitude at most

2O(`

2 log r)

.

The following theorem generalizes Theorem 24 in [15], which addresses the special case of intersections of low-weight halfspaces over the space X = f0; 1gn:

Theorem 8 Let X be a subset of Rk with

1

2 ^ kXk ^ 2 and c : R

k ! f\Gamma 1; 1g be an intersection of t origin-centered halfspaces h1; : : : ; ht: If c has margin ae on X then there is a polynomial threshold function of degree d = O(t log t log

1

ae ) for c on X: If d ^ k then this PTF has PTF margin

(ae=k)O(t log t log 1=ae) on X:

Proof: We must exhibit a polynomial p(x) of the claimed degree such that for any z 2 X we have sign(p(z)) = c(z) and

jp(z)j

kpk * (k=ae)

O(t log t log 1=ae):

Let w1 \Delta  x = 0; : : : ; wt \Delta  x = 0 be the t hyperplanes which define halfspaces h1; : : : ; ht; we may assume without loss of generality that each kwik = 1: Now consider the sum of rational functions

Q(x) = P

2t

log 4=ae(2(w

1

\Delta  x)=ae) + \Delta  \Delta  \Delta  + P

2t

log 4=ae(2(w

t

\Delta  x)=ae) \Gamma  t + 1=2:

Fix any z 2 X: Since c has margin ae on X and

1

2 ^ kXk ^ 2; for each i = 1; : : : ; t we have ae 2 ^ aekXk ^ jw

i \Delta  zj ^ kwik \Delta  kXk ^ 2 and hence j2(wi \Delta  z)=aej 2 [1; 4

ae ]: Consequently P

2t log 4=ae(

2(wi\Delta z)

ae )

lies in [1; 1+

1

2t ] if hi(z) = 1 and lies in [\Gamma 1\Gamma 

1 2t ; \Gamma 1] if hi(z) = \Gamma 1: Thus if hi(z) = 1 for all i we have

Q(z) * t \Gamma  t +

1

2 =

1 2 , and if hi(z) = \Gamma 1 for some i we have Q(z) ! \Gamma 1 + (t \Gamma  1) +

(t\Gamma 1)

2t \Gamma  t +

1 2 ! \Gamma 

1 2 :

So sign(Q(z)) = c(z) for all z 2 X:

7

Since Q(x) is a sum of t rational functions of degree O(log t log

1

ae ); we can clear denominators

and re-express Q(x) as a single rational function A(x)=B(x) of degree O(t log t log

1

ae ): It follows that

the function p(x) = A(x)B(x), which is a polynomial of degree O(t log t log

1

ae ), has sign(p(z)) =

sign(Q(z)) as desired.

Now we must bound kpk: We have k

2wi\Delta x

ae k

2 = 4

ae2 so by part (1) of Fact 6 we have that

k(

2wi\Delta x

ae )

jk2 ^ ( 4k

ae2 )

j for all j: By Lemma 7 we have that P 2t

log 4=ae(x) =

a(x)

b(x) where a(x); b(x)

are polynomials of degree O(log t log

1

ae ) with coefficients of magnitude at most 2

O((log 1ae )2 log t)

=

(

1

ae )

O(log t log 1=ae): It follows from part (2) of Fact 6 that ka( 2wi\Delta x

ae )k

2 ^ ( k

ae )

O(log t log 1=ae) \Delta ( 1

ae )

O(log t log 1=ae)

which equals ( kae )O(log t log 1=ae), and the same holds for kb( 2w

i\Delta x

ae )k

2: Expressing Q(x) as a rational function A(x)=B(x); we have that B(x) =

Qt

i=1 b(

2wi\Delta x

ae ), so since d ^ k part (1) of Fact 6 implies that

kB(x)k2 ^ kO(t log t log 1=ae)(

k

ae )

O(t log t log 1=ae) = ( k

ae )

O(t log t log 1=ae): Simple calculations using part (1) of

Fact 6 show that kA(x)k2 and kp(x)k = kA(x)B(x)k are also (k=ae)O(t log t log 1=ae), and we are done. 2

By modifying this construction, we get a polynomial threshold function for any Boolean function of t halfspaces rather than just an intersection (at a relatively small cost in degree and PTF margin):

Theorem 9 Let f : f\Gamma 1; 1gt ! f\Gamma 1; 1g be any Boolean function on t bits. Let X be a subset of Rk with 12 ^ kXk ^ 2 and c : Rk ! f\Gamma 1; 1g be the function f (h1; : : : ; ht) where h1; : : : ; ht are origincentered halfspaces in Rk: If c has margin ae on X then there is a PTF of degree d = O(t2 log

1

ae ) for

c on X: If d ^ k then this PTF has PTF margin (ae=k)O(t

2 log 1=ae)

on X:

Proof: As before, we give a polynomial p(x) of the claimed degree such that for any z 2 X we have sign(p(z)) = c(z) and

jp(z)j

kpk * (k=ae)

O(t2 log 1=ae):

Again let w1 \Delta  x = 0; : : : ; wt \Delta  x = 0 be the hyperplanes for halfspaces h1; : : : ; ht; where each wi is a unit vector. For each i = 1; : : : ; t consider the rational function

Qi(x) = P

23t

log 4=ae

\Gamma 

2(w

i

\Delta  x)=ae

\Delta 

:

Fix any z 2 X: As before we have that j2(wi \Delta  z)=aej 2 [1;

4

ae ]; so by Lemma 7 the value of Qi(z)

differs from the \Sigma 1 value hi(z) = sign(wi \Delta  z) by at most

1

23t : Since f is a Boolean function on

t inputs, it is expressible as a multilinear polynomial ~f of degree t, with coefficients of the form

i=2t where i is an integer in [\Gamma 2t; 2t]. (The polynomial ~f is just the Fourier representation of f .) Multiply ~f by 2t, so now ~f : f+1; \Gamma 1gt ! f+2t; \Gamma 2tg, and ~f has integer coefficients which are at most 2t in absolute value.

Now we would like to argue that ~f (Q1(z); : : : ; Qt(z)) has the same sign as f (h1(z); : : : ; ht(z)). To do this we show that the "error" of each Qi(z) relative to the \Sigma 1 value hi(z) (which error is at most

1

23t ) does not cause

~f to have the wrong sign. The polynomial ~f has at most 2t terms, each of

which is the product of an integer coefficient of magnitude at most 2t and up to t of the Qi's. The product of the Qi's incurs error at most O(t2\Gamma 3t) relative to the corresponding product of the hi's, and thus the error of any given term (including the integer coefficient) is at most O(t2\Gamma 2t): Since we add up at most 2t terms, the overall error is at most O(t2\Gamma t) error, which is much less than what we could tolerate (we could tolerate error 2t; recall that ~f takes value \Sigma 2t on \Sigma 1 inputs). Thus ~f (Q1(z); : : : ; Qt(z)) has the same sign as f (h1(z); : : : ; ht(z)) for all z 2 X:

Now ~f is a multilinear polynomial of degree t, and each Qi is a rational function of degree O(t log w). We can bring ~f (Q1; : : : ; Qt); to a common denominator (which is the product of the

8

denominators of the Qi's) of degree O(t2 log w). Hence we have a single multivariate rational function A(x)=B(x) which takes the right sign on z; and we can convert this rational function to a polynomial threshold function p(x) = A(x)B(x) as in the proof of Theorem 8.

Now we must bound kpk: Let Qi(x) =

ai(x)

bi(x) : The analysis from the previous proof implies that

kai(x)k2 and kbi(x)k2 are both at most (

k

ae )

O(t log 1=ae): Now consider a monomial (in the "variables"

Q1(x); : : : ; Qt(x)) in the polynomial ~f (Q1(x); : : : ; Qt(x)): Since the numerator ff(x) of such a monomial is the product of at most t of the ai(x)'s, and each ai(x) has degree at most O(log t log 1ae ); the

fact that d ^ k and part (1) of Fact 6 together give kff(x)k2 ^ kO(t log t log 1=ae)( kae )O(t

2 log 1=ae)

which

equals (

k

ae )

O(t2 log 1=ae): The same holds for the denominator fi(x) of such a monomial. Since the

common denomiator for ~f (Q1; : : : ; Qt) is the product of the denominators of the Qi's, clearing all denominators we have that ~f (Q1; : : : ; Qt) = A(x)=B(x) with kA(x)k2 and kB(x)k2 both at most (

k

ae )

O(t2 log 1=ae): We thus have kp(x)k2 = kA(x)B(x)k2 = ( k

ae )

O(t2 log 1=ae) and the theorem is proved.

2

4.2 Constructions using Extremal Polynomials The bounds from the previous section are quite strong when t is relatively small. If t is large but ae is also quite large, then the following bounds based on Chebyshev polynomials are better.

The r-th Chebyshev polynomial of the first kind, Tr(x); is a univariate degree-r polynomial with the following properties [9]:

Lemma 10 The polynomial Tr(x) =

Pr

i=0 aix

i satisfies: (i) jT

r(x)j ^ 1 for jxj ^ 1 with Tr(1) = 1;

(ii) T 0r(x) * r2 for x ? 1 with T 0r(1) = r2; and (iii) For i = 0; : : : ; r each ai is an integer with

jaij ^ 2r:

The following theorem generalizes results in [16]: Theorem 11 Let X be a subset of Rk with

1

2 ^ kXk ^ 2 and let c : R

k ! f\Gamma 1; 1g be an

intersection of t origin-centered halfspaces h1; : : : ; ht: If c has margin ae on X then there is a PTF of degree d = O(

p

1=ae log t) for c on X: If d ^ k then this PTF has PTF margin 1=kO(

p

1=ae log t)

on X:

Proof: As in the previous proofs we must exhibit a polynomial p(x) such that for any z 2 X we have sign(p(z)) = c(z) and

jp(z)j

kpk * 1=k

O(

p

1=ae log t):

Let w1 \Delta  x = 0; : : : ; wt \Delta  x = 0 be the t hyperplanes for halfspaces h1; : : : ; ht where each kwik = 1: Let P be the univariate polynomial P (x) = Tr(1 \Gamma  x) where r = d

p

2=aee: The first part of Lemma

10 implies that jP (x)j ^ 1 for x 2 [0; 2], and the second part implies that P (x) * 2 for x ^

\Gamma ae

2 :

Now consider the polynomial threshold function sign(p(x)) where

p(x) = t +

1

2

\Gamma 

tX

i=1

(P (w

i

\Delta  x))

dlog 2te

:

Since P is a polynomial of degree r = d

p

2=aee and wi \Delta x is a polynomial of degree 1, this polynomial

threshold function has degree d = d

p

2=aee \Delta  dlog 2te: We now show that p(x) has the desired

properties described above.

We first show that for any z 2 X the polynomial p takes the right sign and has magnitude at least

1

2 : Fix any z 2 X: For each i = 1; : : : ; t we have

ae 2 ^ aekXk ^ jw

i \Delta  zj ^ kwik \Delta  kXk ^ 2:

9

ffl If c(z) = 1 then for each i we have

ae

2 ^ w

i \Delta  z ^ 2 and hence we have that P (wi \Delta  z)

(and also P (wi \Delta  z)dlog 2te lies in [\Gamma 1; 1]: Consequently we have that p(z) * t +

1

2 \Gamma  t *

1 2 so

sign(p(z)) = c(z) = 1.

ffl If c(z) = \Gamma 1 then for some i we have wi \Delta  z 2 [\Gamma 2; \Gamma 

ae

2 ], so consequently P (w

i \Delta  z) * 2 and

P (wi \Delta z)dlog 2te * 2t: Since P (wj \Delta z)dlog 2te * \Gamma 1 for all j, we have p(z) ^ t+

1

2 \Gamma 2t+(t\Gamma 1) = \Gamma 

1 2

so sign(p(z)) = c(z) = \Gamma 1:

To finish the proof it remains to bound kpk: Since kwi \Delta  xk2 = 1 for all i; by part 2 of Fact 6 we have k1 \Gamma  wi \Delta  xk2 ^ 4 so by part 1 of Fact 6 we have that k(1 \Gamma  wi \Delta  x)jk ^ (4k)j for j = 0; : : : ; r: Since (by Lemma 10) Tr(x) =

Pr

j=0 ajx

j where each ja

jj ^ 2

r; for each j = 0; : : : ; r we have

kaj(1 \Gamma  wi \Delta  x)jk2 ^ 22r(4k)r. By part 2 of Fact 6 we obtain kTr(1 \Gamma  wi \Delta  x)k2 ^ (r + 1)2(16k)r; and now part 1 implies that (P (wi \Delta  x))dlog 2te = kO(r log t). Using part 2 again we obtain that kpk ^ (t + 1)2kO(r log t) = kO(r log t), and the theorem is proved.

As Arriaga and Vempala observed in [3], DNF formulas can be viewed as unions of halfspaces. If we rescale the cube so that it is a subset of Sk\Gamma 1, it is easy to check that a Boolean function f : f\Gamma 1; 1gk ! f\Gamma 1; 1g has margin ae with respect to X ` f\Gamma 1; 1gk if for every z 2 X we have that

every Boolean string z0 which differs from z in at most a

ae2

4 fraction of bits has f (z

0) = f (z):

Since any DNF formula with t terms can be expressed as a union of t halfspaces, we have the following corollary of Theorem 11:

Corollary 12 Let X ae f\Gamma 1; 1gk and let c be a t-term DNF formula on k variables. If c has margin ae on X then there is a polynomial threshold function of degree O(

p

1=ae log t) for c on X which has

PTF margin 1=kO(

p

1=ae log t) on X: If d ^ k then this PTF has PTF margin (1=k)O(

p

1=ae log t) on X:

A similar corollary for DNF formulas also follows from Theorem 8 but we are most interested in DNFs with t =poly(n) terms so we focus on Theorem 11.

5 Kernel Perceptron for learning PTFs with PTF margin In this section we first define a new kernel, the Complete Symmetric Kernel, which arises naturally in the context of polynomial threshold functions. We give an efficient algorithm for computing this kernel (which may be of independent interest), and indeed all results of the paper could be proved using this new kernel. To make our overall algorithm simpler, however, we ultimately use the standard polynomial kernel which we discuss later in this section.

Let OEd : Rk ! R(

k+d

d ) be a feature expansion which maps (x1; : : : ; xk) to the vector (1; x1; : : : ; xk;

x21; x1x2; : : : ) containing all monomials of degree up to d. Let Kd(x; y) = OEd(x) \Delta  OEd(y) be the kernel corresponding to OEd. We refer to Kd(x; y) as the complete symmetric kernel since as explained in Appendix B the value Kd(x; y) equals the sum of certain complete symmetric polynomials.

For a data set X ae Rk we write OEd(X) to denote the expanded data set of points in R(

k+d

d ): The

following lemma gives a mistake bound for the Perceptron algorithm using the complete symmetric kernel:

Lemma 13 Let X ae Rk be a set of labelled examples such that there is some degree-d polynomial threshold function p(x) which correctly classifies X and has PTF margin ae over X: Then the Perceptron algorithm (run on OEd(X) using the complete symmetric kernel Kd) makes at most

4kOEd(X)k2

ae2 mistakes on X:

10

Proof: The vector W 2 R(

k+d

d ) whose coordinates are the coefficients of p has margin minz2X jW \Delta OEd(z)j

kW k\Delta kOEd(X)k

over OEd(X): Since W \Delta  OEd(z) = p(z) and kW k = kpk; the lemma follows by from the definition of

the PTF margin of p and the Perceptron Convergence Theorem (Theorem 2). 2

In Appendix B we give a polynomial time algorithm for computing Kd(x; y), but this algorithm is somewhat cumbersome. With the aim of obtaining a faster and simpler overall algorithm, we now describe an alternate approach based on the well known polynomial kernel.

As in [10], we define the degree-d polynomial kernel K0d : Rk \Theta Rk ! R as K0d(x; y) = (1+x\Delta y)d:

It is clear that K0d(x; y) can be computed efficiently. Let OE0d : Rk ! R(

k+d

d ) be the feature expansion

such that K0d(x; y) = OE0d(x) \Delta  OE0d(y); note that OE0d(x) differs from OEd(x) defined above because of the coefficients that arise in the expansion of (1 + x \Delta  y)d:

We have the following polynomial kernel analogue of Lemma 13:

Lemma 14 Let X ae Rk be a set of labelled examples such that there is some degree-d polynomial threshold function p(x) which correctly classifies X and has PTF margin ae over X: Then the

Perceptron algorithm (run on OE0d(X) using the polynomial kernel K0d) makes at most

4(1+kXk2)d

ae2 mistakes on X:

Proof: We view OE0d(x) as a vector (aSxS) of monomials with coefficients. By inspection of the coefficients of (1 + x \Delta  y)d it is clear that each aS * 1: Let W 0 be the vector in R(

k+d

d ) such that

W 0 \Delta  OE0d(x) = p(x) as a formal polynomial. For each monomial xS in p(x), the W 0S coordinate of W 0 equals WS=aS ^ WS where W is defined as in the proof of Lemma 13 so we have kW 0k ^ kW k:

The vector W 0 has margin

minz2X jW 0\Delta OE0d(z)j

kW 0k\Delta kOE0d(X)k =

minz2X jp(z)j kW 0k\Delta kOE0d(X)k *

minz2X jp(z)j kW k\Delta kOE0d(X)k over OE

0 d(X): It is easy

to verify that kOE0d(X)k ^ (1 + kXk2)d=2, so W 0 has margin at least

minz2X jp(z)j

kW k\Delta (1+kXk2)d=2

=

ae

(1+kXk2)d=2

.

The lemma now follows from the Perceptron Convergence Theorem. 2

The output hypothesis of this kernel Perceptron is an (implicit representation of a) halfspace over R(

k+d

d ) which can be viewed as a polynomial threshold function of degree d over Rk:

6 The Main Results In this section we give our main learning results by bounding the running time of algorithm A and proving that it outputs an accurate hypothesis.

Our first theorem gives a good bound for the case where t is relatively small:

Theorem 15 Algorithm A learns any ae-margin intersection of t halfspaces over Rn in at most

n

ffl \Delta  (

t ae log

1 ffiffl )

O(t log t log 1=ae) time steps.

Proof: Let c be an intersection of t origin-centered halfspaces over Rn which has margin ae with respect to distribution D where Supp(D) ae Sn\Gamma 1: Let m equal the number of examples our algorithm draws from EX(c; D); we defer specifying m until the end of the proof. Let k = O(

1

ae2 \Delta  log

mt

ffi ); and

d = O(t log t log 1ae ). Let X be the set of m examples in Rn, and let M (X) be the projected set of

m examples in Rk: Note that it takes nkm time steps to construct the set M (X):

By Corollary 5, with probability 1\Gamma ffi we have that

1

2 ^ kM (X)k ^ 2 and there is an intersection

of t origin-centered halfspaces in Rk which has margin at least

ae

2 on M (X): By Theorem 8 there

is a polynomial threshold function over Rk of degree d = O(t log t log

1

ae ) which has PTF margin

(

ae

k )

O(d) with respect to M (X): By Lemma 14 the degree-d polynomial kernel Perceptron algorithm

11

makes at most (

k

ae )

O(d) mistakes when run on M (X); and thus once M (X) is obtained the algorithm

runs for at most m \Delta  (

k

ae )

O(d) = ( k

ae )

O(d)=ffl time steps.

Now we show that with probability 1 \Gamma  ffi algorithm A outputs an ffl-accurate hypothesis for c relative to D: Since the output hypothesis h(x) = sign(p(M x)) is computed by first projecting x 2 Rn down to Rk via M and then evaluating the k-variable PTF p, it suffices to show that p is a good hypothesis under the distribution M (D) obtained by projecting D down to Rk via M . It is well known (see e.g. [2]) that the VC dimension of the class of degree-d PTFs over k real variables is

\Gamma k+d

d

\Delta 

. Thus by the VC theorem [8] in order to learn to accuracy ffl and confidence ffi it

suffices to take m = O(

kd

ffl log

1

ffl +

1

ffl log

1

ffi ): It is straightforward to verify that k = (

d ae log

1 ffiffl )

O(1);

m =

1

ffl \Delta  (

d ae log

1 ffiffl )

O(d) satisfy the above conditions on m and k. Since d = O(t log t log 1

ae ) we have

k = (

t

ae log

1 ffiffl )

O(1) and m = 1

ffl \Delta  (

t ae log

1 ffiffl )

O(t log t log 1=ae) which proves the theorem. 2

Note that for a constant t = O(1) number of halfspaces Algorithm A has a quasipolynomial ((

1

ae )

O(log 1=ae)) runtime dependence on the margin ae, in contrast with the exponential (( 1

ae )

O(log 1ae )=ae2)

)

dependence of [3].

The proof of Theorem 15 used the polynomial threshold function construction of Theorem 8. We can instead use the construction of Theorem 11 to obtain:

Theorem 16 Algorithm A learns any ae-margin intersection of t halfspaces over Rn in at most

n

ffl \Delta  (

log t

ae log

1 ffiffl )

O(

p

1=ae log t) time steps.

For a constant ae = \Theta (1) margin Algorithm A has an almost polynomial ((tO(log log t)) runtime dependence on t, in contrast with the exponential (t!(t)) dependence of [3]. By Corollary 12 the above bound holds for learning t-term DNF with margin ae as well.

Finally, we can use the construction of Theorem 9 to obtain:

Theorem 17 Algorithm A learns any Boolean function of t halfspaces with margin ae in at most

n

ffl \Delta  (

t ae log

1 ffiffl )

O(t2 log 1=ae) time steps.

7 Discussion 7.1 Is Random Projection Necessary? A natural question is whether our quantitative results could be achieved simply by using kernel Perceptron (or a Support Vector Machine) without first performing random projection. Given a data set X in Rn classified by an intersection of t = 2 halfspaces with margin ae, Theorem 8 implies the existence of a polynomial threshold function for X of degree d = O(log(1=ae)) with PTF margin (ae=n)O(log(1=ae)). Using either the degree-d polynomial kernel or the Complete Symmetric Kernel,

we obtain a halfspace over R(

n+d

d ) which classifies the expanded data set OE(X) with geometric

margin (ae=n)O(log(1=ae)).1 Thus it appears that without the initial projection step, the required sample complexity for either kernel Perceptron or an SVM will be (n=ae)\Omega (log(1=ae)), as opposed to the bounds in Section 6 which do not depend on n; so random projection does indeed seem to provide a gain in efficiency.

1In Arriaga and Vempala [3] it is claimed that if the geometric margin of a degree-d PTF p in Rn is ae then the

margin of the corresponding halfspace in R(

n+d

d ) is at least aed, but this claim is in error [23]; to bound the margin of

the halfspace in R(

n+d

d ) one must analyze the PTF margin of p rather than its geometric margin.

12

7.2 Lower Bounds on Polynomial Threshold Functions The main result of O'Donnell and Servedio in [19], if suitably interpreted, proves that there exists a set X ae R2 labelled according to the intersection of two halfspaces with margin ae for which any PTF correctly classifying X must have degree \Omega (

log(1=ae)

log log(1=ae) ). This lower bound implies that our choice of d in the proof of Theorem 15 is essentially optimal with respect to ae. For a discussion of

other lower bounds on PTF constructions see Klivans et al. [15].

7.3 Alternative Algorithms We note that after random projection, in Step 3 of Algorithm A there are several other algorithms that could be used instead of kernel Perceptron. For example, we could run a support vector machine over Rk with the same degree d polynomial kernel to find the maximum margin hyperplane

in R(

k+d

d ); alternatively we could even explicitly expand each projected example M (x) 2 Rk into

OE0d(M (x)) 2 R(

k+d

d ) and explicitly run Perceptron (or indeed any algorithm for solving linear

programs such as the Ellipsoid algorithm) to learn a single halfspace in R(

k+d

d ). It can be verified

that each of these approaches gives the same asymptotic runtime and sample complexity as our kernel Perceptron approach. We use kernel Perceptron both for its simplicity and for its ability to take advantage of the actual margin if it is better than the worst-case bounds presented here.

7.4 Future Work and Implications for Practice We feel that our results give some theoretical justification for the effectiveness of the polynomial kernel in practice, as kernel Perceptron takes direct advantage of the representational power of polynomial threshold functions. We are working on experimentally assessing the algorithm's performance.

8 Acknowledgements We thank Santosh Vempala for helpful discussions.

References

[1] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary

coins. Journal of Computer and System Sciences, 66(4):671-687, 2003.

[2] M. Anthony. Classification by polynomial surfaces. Discrete Applied Mathematics, 61:91-103,

1995.

[3] R. Arriaga and S. Vempala. An algorithmic theory of learning: Robust concepts and random

projection. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science (FOCS), pages 616-623, 1999.

[4] E. Baum. A polynomial time algorithm that learns two hidden unit nets. Neural Computation,

2:510-522, 1991.

[5] R. Beigel. When do extra majority gates help? polylog(n) majority gates are equivalent to

one. Computational Complexity, 4:314-324, 1994.

13

[6] R. Beigel, N. Reingold, and D. Spielman. PP is closed under intersection. Journal of Computer

and System Sciences, 50(2):191-202, 1995.

[7] A. Blum and R. Kannan. Learning an intersection of a constant number of halfspaces under

a uniform distribution. Journal of Computer and System Sciences, 54(2):371-380, 1997.

[8] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the VapnikChervonenkis dimension. Journal of the ACM, 36(4):929-965, 1989.

[9] E. Cheney. Introduction to Approximation Theory. McGraw-Hill, New York, New York, 1966. [10] N. Cristianini and J. Shawe-Taylor. An introduction to Support Vector Machines (and other

kernel-based learning methods). Cambridge University Press, 2000.

[11] Y. Freund and R. Schapire. Large margin classification using the Perceptron algorithm. In

Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 209-217, 1998.

[12] R. Herbrich. Learning Kernel Classifiers. MIT Press, 2002. [13] W. Johnson and J. Lindenstrauss. Extensions of Lipshitz mapping into Hilbert space. Contemporary Mathematics, 26:189-206, 1984.

[14] R. Khardon, D. Roth, and R. Servedio. Efficiency versus Convergence of Boolean Kernels for

On-Line Learning Algorithms. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.

[15] A. Klivans, R. O'Donnell, and R. Servedio. Learning intersections and thresholds of halfspaces.

In Proceedings of the Forty-Third Annual Symposium on Foundations of Computer Science, pages 177-186, 2002.

[16] A. Klivans and R. Servedio. Learning DNF in time 2

~O(n1=3)

. In Proceedings of the Thirty-Third

Annual Symposium on Theory of Computing, pages 258-265, 2001.

[17] D. J. Newman. Rational approximation to jxj. Michigan Mathematical Journal, 11:11-14,

1964.

[18] R. O'Donnell and R. Servedio. Extremal properties of polynomial threshold functions. In

Proceedings of the Eighteenth Annual Conference on Computational Complexity, pages 3-12, 2003.

[19] R. O'Donnell and R. Servedio. New degree bounds for polynomial threshold functions. In

Proceedings of the 35th ACM Symposium on Theory of Computing, pages 325-334, 2003.

[20] M. Saks. Slicing the hypercube, pages 211-257. London Mathematical Society Lecture Note

Series 187, 1993.

[21] A. Shpilka. Lower Bounds for Small Depth Arithmetic and Boolean Circuits. PhD thesis,

Hebrew University, 2001.

[22] S. Vempala. A random sampling based algorithm for learning the intersection of halfspaces.

In Proceedings of the 38th Annual Symposium on Foundations of Computer Science, pages 508-513, 1997.

14

[23] S. Vempala. Personal communication, 2004. [24] J. Zhou. Introduction to symmetric polynomials and symmetric functions. Lecture notes for

course at Tsinghua University, available at cms.zju.edu.cn/course/cn/SymmetricF.pdf, 2003.

A Proof of Fact 6 For (1), we have q1(x) : : : q`(x) =

P

S1;:::;S` c1;S1 : : : c`;S`xS1 : : : xS` from which it follows that kq1(x) : : : q`(x)k

2

is at most

0

@

X

S1;:::;S`

jc1;S1 : : : c`;S`j

1 A

2

^ K

`

X

S1;:::;S`

(cS1 : : : cS`)

2

= K

`

`Y

i=1

0 @

X

Si

c

2

Si

1 A ^ K`M `

where the second inequality follows from Cauchy-Schwarz using the fact that each qi(x) has at most K =

\Gamma k+d

d

\Delta 

monomials (so the first sum has at most K` summands). For (2), we have

qi(x) =

P

S ci;SxS so by Cauchy-Schwarz we have

kq1(x) + \Delta  \Delta  \Delta  + q`(x)k

2

=

X

S

(c1;S + \Delta  \Delta  \Delta  + c`;S)

2

^ `



X

S

c

2

1;S + \Delta  \Delta  \Delta  +

X

S

c

2

`;S

!

which is at most `(M1 + \Delta  \Delta  \Delta  + M`): B Computing the Complete Symmetric Kernel Recall that OEd : Rk ! R(

k+d

d ) maps (x1; : : : ; xk) to the vector (1; x1; : : : ; xk; x2

1; x1x2; : : : ) of all

monomials of degree up to d, and Kd(x; y) = OEd(x) \Delta  OEd(y):

Lemma 18 There is a poly(k; d) time algorithm for computing Kd(x; y): Proof: Writing zi for xiyi; it is easy to see that Kd(x; y) =

Pd

`=0 h`(z1; : : : ; zk) where h`(z1; : : : ; zk) =P

d1+\Delta \Delta \Delta +dk=` z

d1 1 \Delta  \Delta  \Delta  z

dk k is the `-th complete symmetric polynomial (the sum of all monomials of degree exactly `). Let e`(z1; : : : ; zk) denote the `-th elementary symmetric polynomial (the sum of

all multilinear monomials of degree exactly `). As shown in [24] (Equation (8), p. 15) the identity h` = det(E) is known to hold where E is the ` \Theta  ` matrix whose (i; j) entry is e1\Gamma i+j (interpreting er as 0 for r ! 0). Thus computing Kd(x; y) reduces to computing the polynomials e`; these polynomials can be computed efficiently via polynomial interpolation (see e.g. Section 2.5 of [21]). 2

15