
Adaptive constraint reduction for training support
vector machines.
A support vector machine (SVM) determines whether a given observed
pattern lies in a particular class. The decision is based on prior
training of the SVM on a set of patterns with known classification, and
training is achieved by solving a convex quadratic programming problem.
Since there are typically a large number of training patterns, this can
be expensive. In this work, we propose an adaptive constraint reduction
primal-dual interior-point method for training a linear SVM with
[l.sub.1] penalty (hinge loss) for misclassification. We reduce the
computational effort by assembling the normal equation matrix using only
a well-chosen subset of patterns. Starting with a large portion of the
patterns, our algorithm excludes more and more unnecessary patterns as
the iteration proceeds. We extend our approach to training nonlinear
SVMs through Gram matrix approximation methods. We demonstrate the
effectiveness of the algorithm on a variety of standard test problems.Key words. Constraint reduction, column generation, primal-dual
interior-point method, support vector machine.AMS subject classifications. 90C20, 90C51, 90C59, 68W01.
Article Type: Research (Machine learning; Convex functions; Quadratic programming)
Authors: Jin Hyuk Jung, Dianne P. O'Leary, Andre L. Tits
Publication: Electronic Transactions on Numerical Analysis, Institute of Computational Mathematics, Volume 31, April 2008
Copyright: 2008 Institute of Computational Mathematics
1. Introduction. Characteristics such as gill placement, coloring,
and habitat can predict whether or not a mushroom is edible. Pattern
recognition tasks such as this can be automated by use of a support
vector machine (SVM). Given a pattern (set of observed characteristics)
x in some domain set X, the SVM decides whether or not the pattern is in
a particular class, e.g., "edible". In the case of a linear
SVM, the machine makes the decision by testing whether the point in X
specified by the pattern is above or below a hyperplane $\{x : \langle w, x \rangle - \gamma = 0\}$. Here $\langle \cdot, \cdot \rangle$ denotes an inner product. Before the SVM can be put to use, a training process determines the parameters w and $\gamma$, based on a set of training patterns $x_i$, each having a predetermined classification label $y_i = \pm 1$, $i = 1, \ldots, m$, e.g., "edible" or "not edible". The goal is to set the parameters so that

$\mathrm{sign}(\langle w, x_i \rangle - \gamma) = y_i$, for $i = 1, \ldots, m$.

Thus, the machine is trained to correctly identify the patterns
with known classifications and is then used for classifying future
patterns. If no hyperplane separates the two classes, then a loss
function is included in the training to add a penalty for
misclassification. In either case, this training process can be
formulated as a convex quadratic program (CQP).
Often, the number of training patterns m is very much larger than
the dimension of x (and w), and it is well known that w and [gamma] are
determined by a small subset of the training patterns. In this paper, we
develop an efficient primal-dual interior-point method (PDIPM) for
solving this CQP. The novelty of our algorithm is the way that we
iteratively identify the small subset of critical training patterns and
ignore the rest.Other authors have also exploited the fact that the number of
critical patterns is small. Osuna et al. [25] proposed a decomposition
algorithm for the dual SVM formulation, solving a sequence of reduced
problems, and Joachims [21] proposed some improvements. Platt [26]
proposed a sequential minimal optimization (SMO) algorithm that allows only two variables to change at each iteration; see the four essays in [20] for
further discussion of the literature. Ferris and Munson [14] considered
training linear SVMs with [l.sub.1] and [l.sub.2] hinge loss functions.
They efficiently applied the Sherman-Morrison-Woodbury (SMW) formula to
solving the normal equations for training the SVM, the most expensive
operation in the PDIPM. Gertz and Griffin [16] proposed using either a
parallel direct solver or a preconditioned conjugate gradient solver
tailored for the PDIPM normal equations in training an SVM with
[l.sub.1] hinge loss.In this work the focus is again on the normal equations for the
[l.sub.1] hinge loss formulation. Like Osuna et al., we reduce
computational cost by omitting unnecessary constraints or patterns in
assembling the matrix for the normal equations. However, in contrast to
the decomposition based algorithms, we solve only one optimization
problem, using constraint selection only to determine the search
direction at each iteration of a PDIPM. Our algorithm is closely related
to a PDIPM proposed for solving a general CQP with many inequality
constraints [22].Reducing the computational cost of linear and convex programming by
using only a small number of the constraints has been actively studied.
Ye [38] pioneered a "build-down" scheme for linear programming
(LP), proposing a rule that can safely eliminate inequality constraints
that will not be active at the optimum. Dantzig and Ye [9] proposed a
"build-up" IPM of dual affine-scaling form. A potential
reduction algorithm proposed by Ye [39] allows column generation for
linear feasibility problems, and Luo and Sun [23] proposed a similar
scheme for convex quadratic feasibility problems, to which CQPs can be
transformed. Den Hertog et al. proposed "build-up" and
"build-down" IPM variants [11, 12]. They also proposed a
path-following cutting plane algorithm for convex programming, where
they used the "build-up and -down" IPM for LP to solve a
sequence of LP relaxations of the convex programming [10].In our algorithm the constraints at each step are chosen based only
on the current iterate, rather than by building up or down. Related
algorithms for LP were considered in [32] and [34].To present our algorithm, we begin in Section 2 by formulating the
SVM training problem as a CQP. In Section 3, we describe our adaptive
constraint reduction approach for a quadratic programming version of
Mehrotra's predictor-corrector (MPC) algorithm (see [24], [17], and
[16]). We extend our results to certain nonlinear SVMs in Section 4. In
Section 5, numerical results are presented. Finally, concluding remarks
are provided in Section 6.

Throughout this paper we denote matrices by upper-case bold letters and column vectors by lower-case bold letters. The entries of a vector x are $x_i$, while those of a matrix K are $k_{ij}$. The ith row of the matrix X is denoted by $x_i^T$. Given a vector y, the matrix Y is the diagonal matrix formed from the entries in the vector: Y = diag(y). The cardinality of a set S is denoted by |S|.

2. Support vector machines. An SVM is a binary classifier, answering 'yes' or 'no' to an input pattern, based on whether or not the pattern lies in a particular half space. In this section we review the formulation of SVMs and their training.

2.1. Data representation, classifier, and feature space. The training data patterns are defined as

$(x_1, y_1), \ldots, (x_m, y_m) \in \mathcal{X} \times \{-1, +1\},$  (2.1)

where m is the number of training patterns.

[FIGURE 2.1 OMITTED]

A linear classifier is a hyperplane $\{x : \langle w, x \rangle = \gamma\}$ in $\mathcal{X}$ that separates the "-" patterns ($y_i = -1$) from the "+" patterns ($y_i = +1$). For a pattern $x \in \mathcal{X}$, the decision or prediction y of the classifier is

$y = \mathrm{sign}(\langle w, x \rangle - \gamma).$  (2.2)

To find a good classifier, it may be necessary to use a feature map

$\Phi : \mathcal{X} \rightarrow \mathcal{H},$  (2.3)
$x \mapsto \Phi(x),$  (2.4)

to map the training patterns into a feature space $\mathcal{H}$ (possibly of higher dimension) endowed with an inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$. We define the length or norm of a vector $a \in \mathcal{H}$ to be $\|a\|_{\mathcal{H}} = \sqrt{\langle a, a \rangle_{\mathcal{H}}}$.

A linear classifier determined in the feature space may induce a nonlinear classifier in the original pattern space. For example, define a feature map from $\mathbb{R}^2$ to $\mathbb{R}^3$ as

$\Phi : \mathbb{R}^2 \rightarrow \mathbb{R}^3, \quad (x_1, x_2)^T \mapsto (x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2)^T.$  (2.5)

Then the inner product in the feature space $\mathbb{R}^3$ can be used to define the separator in $\mathbb{R}^2$ illustrated in Figure 2.1. Accordingly, we first limit our study to linear SVMs in $\mathbb{R}^n$, and then consider nonlinear SVMs in Sections 4 and 5.
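To make the example concrete, here is a minimal NumPy sketch of the map (2.5) and the decision rule (2.2); the particular weights and threshold are made-up values for illustration, not quantities from the paper.

```python
import numpy as np

def phi(x):
    """Degree-2 homogeneous feature map (2.5): R^2 -> R^3."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2.0) * x1 * x2])

# A linear classifier <w, phi(x)> = gamma in R^3 induces a quadratic
# (e.g., ellipsoidal) boundary in the original R^2 pattern space.
w = np.array([1.0, 2.0, 0.0])   # illustrative weights only
gamma = 0.5                     # illustrative threshold only

def predict(x):
    return np.sign(np.dot(w, phi(x)) - gamma)   # decision rule (2.2)

print(predict(np.array([0.9, 0.1])), predict(np.array([0.1, 0.1])))
```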
2.2. Maximizing the separation margin. We now focus on finding a linear classifier, where $x_i \in \mathbb{R}^n$ for $i = 1, \ldots, m$. We use the standard inner product $\langle x_1, x_2 \rangle = x_1^T x_2$.

If the training patterns are strictly separable, then there are
infinitely many hyperplanes that can correctly classify the patterns, as
illustrated in Figure 2.2. To choose a desirable separating hyperplane,
we seek one that maximizes the separation margin, defined to be the
minimal distance from the hyperplane to the "+" patterns
([y.sub.i] = 1) plus the minimal distance to the "-" patterns
($y_i = -1$).

[FIGURE 2.2 OMITTED]

How can we find this hyperplane? If the patterns are separable,
then there exists at least one "+" pattern and one
"-" pattern closest to any separating hyperplane. Define the
"+" and "-" class boundary planes to be the two
hyperplanes parallel to the separating hyperplane that contain these two
patterns. Then the distance between these class boundary planes is the
separation margin, and we define w and $\gamma$ so that $\{x : \langle w, x \rangle = \gamma\}$ is the plane halfway between them. Since the patterns are separable, there is no pattern in between the boundary planes, and we can scale w and $\gamma$ so that, for all $i \in \{1, \ldots, m\}$,

$\langle w, x_i \rangle - \gamma \ge y_i$, if $y_i = +1$,  (2.6)
$\langle w, x_i \rangle - \gamma \le y_i$, if $y_i = -1$,  (2.7)

or equivalently

$y_i(\langle w, x_i \rangle - \gamma) \ge 1.$

So, the boundaries of the half spaces defined by (2.6) and (2.7) are the "+" and "-" class boundary planes.

Since the distance between the boundary planes is $2/\|w\|$, where $\|w\|^2 = \langle w, w \rangle$, the problem can now be modeled as an optimization problem, called the hard-margin SVM:

$\min_{w, \gamma}\ \tfrac{1}{2}\langle w, w \rangle$  (2.8)
$\text{s.t.}\ Y(Xw - e\gamma) \ge e,$  (2.9)

where $X = [x_1, \ldots, x_m]^T \in \mathbb{R}^{m \times n}$ and $e = [1, \ldots, 1]^T$. Notice that this problem has one constraint per pattern. Typically $m \gg n$, and that is the case we consider here.

If the data are not separable, there is no solution to the
hard-margin optimization problem. To cope with this situation, we add a
misclassification penalty in the objective function (2.8). We introduce
a nonnegative relaxation variable [xi], in order to measure
misclassification, obtaining relaxed constraints

$y_i(\langle w, x_i \rangle - \gamma) \ge 1 - \xi_i.$  (2.10)

Adding an $l_1$ misclassification penalty to the objective function (2.8), we get the (primal) soft-margin SVM (with $l_1$ hinge loss) proposed by Cortes and Vapnik (1) [8]:

$\min_{w, \gamma, \xi}\ \tfrac{1}{2}\langle w, w \rangle + \tau^T \xi$  (2.11)
$\text{s.t.}\ Y(Xw - e\gamma) + \xi \ge e,$  (2.12)
$\xi \ge 0,$  (2.13)

where $\tau$ is an m-dimensional vector of positive penalty parameters for the trade-off between the separation margin maximization and the error minimization. This soft-margin formulation is often preferred to the hard-margin formulation even when the training patterns are strictly classifiable [4]. Notice this formulation is a CQP with m nontrivial constraints (2.12), m bound constraints (2.13), m relaxation variables $\xi$, and n variables w, where $m \gg n$.
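As a concrete (and intentionally naive) illustration of the optimization problem just stated, the following sketch sets up (2.11)-(2.13) with a generic convex-optimization modeling package on synthetic data. The cvxpy dependency and the toy data are assumptions for illustration only; the paper instead develops a specialized interior-point method below.

```python
import numpy as np
import cvxpy as cp   # assumed to be available; any QP solver would do

# synthetic, nearly separable toy data (not from the paper)
rng = np.random.default_rng(0)
m, n = 200, 2
X = rng.normal(size=(m, n))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=m))
tau = np.ones(m)                      # tau_i = 1, as in the experiments of Section 5

w = cp.Variable(n)
gamma = cp.Variable()
xi = cp.Variable(m)
# objective (2.11), subject to the constraints (2.12) and (2.13)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + tau @ xi)
constraints = [cp.multiply(y, X @ w - gamma) + xi >= 1, xi >= 0]
cp.Problem(objective, constraints).solve()
print("separation margin 2/||w|| =", 2 / np.linalg.norm(w.value))
```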
2.3. Dual formulation, Gram matrix, and support vectors. The dual of the CQP given by formulae (2.11)-(2.13) is

$\max_{\alpha}\ -\tfrac{1}{2}\alpha^T Y K Y \alpha + e^T \alpha$  (2.14)
$\text{s.t.}\ y^T \alpha = 0,$  (2.15)
$0 \le \alpha \le \tau,$  (2.16)

where the symmetric positive semidefinite Gram matrix $K \in \mathbb{R}^{m \times m}$ has entries $k_{ij} = \langle x_i, x_j \rangle$ (i.e., $K = XX^T$), and where $\alpha_i$ is the dual variable associated with the ith constraint in (2.12).

If $\alpha^*$ solves this dual problem, then we can compute the solution to the primal problem (2.11)-(2.13) from it:

$w^* = X^T Y \alpha^* = \sum_{i \in S} \alpha_i^* y_i x_i$, for $S = \{i : 0 < \alpha_i^*\}$,  (2.17)
$\gamma^* = \frac{1}{|S_{on}|} \sum_{i \in S_{on}} (\langle w^*, x_i \rangle - y_i)$, for $S_{on} = \{i : 0 < \alpha_i^* < \tau_i\}$,  (2.18)
$\xi_i^* = \max\{1 - y_i(\langle w^*, x_i \rangle - \gamma^*),\ 0\}$, for $i = 1, \ldots, m$.  (2.19)

Note that (2.18) is obtained from (2.12), noticing that $y_i(\langle w^*, x_i \rangle - \gamma^*) = 1$ for all i such that $0 < \alpha_i^* < \tau_i$. The subscript "on" will be explained in the next paragraph. While $\gamma^* = \langle w^*, x_i \rangle - y_i$ for every $i \in S_{on}$, the averaging in (2.18) provides somewhat better accuracy than using a single equation to determine $\gamma^*$. Equation (2.19) is obtained from (2.12) and (2.13). In view of (2.17), the Lagrange multiplier $\alpha_i$ can be interpreted as the weight of the ith pattern in defining the classifier.

Support vectors (SVs) are the patterns that contribute to defining
the classifier, i.e., those associated with positive weight $\alpha_i^*$. The on-boundary support vectors have weight strictly between the lower bound 0 and the upper bound $\tau_i$ and, geometrically, lie on their class boundary plane, i.e., both (2.12) and (2.13) are active. The off-boundary support vectors have the maximal allowable weight $\alpha_i^* = \tau_i$ and lie on the wrong side of the class boundary plane, i.e., (2.12) is active but (2.13) is inactive [28]. (2) We summarize this classification in Table 2.1.
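The recovery formulas (2.17)-(2.19) and the support-vector classification of Table 2.1 translate directly into a few lines of array code. The following is a minimal NumPy sketch, not the authors' implementation; the tolerance `tol` used to decide which $\alpha_i$ are numerically at their bounds is a hypothetical parameter.

```python
import numpy as np

def recover_primal(alpha, X, y, tau, tol=1e-8):
    """Recover (w*, gamma*, xi*) from a dual solution alpha via (2.17)-(2.19)."""
    w = X.T @ (y * alpha)                                  # (2.17)
    on = (alpha > tol) & (alpha < tau - tol)               # on-boundary SVs, S_on
    off = alpha >= tau - tol                               # off-boundary SVs
    gamma = np.mean(X[on] @ w - y[on])                     # (2.18), averaged over S_on
    xi = np.maximum(1.0 - y * (X @ w - gamma), 0.0)        # (2.19)
    return w, gamma, xi, on, off
```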
We now have formulated our optimization problem and turn our attention to solving it.

3. Adaptive constraint (pattern) reduction. In this section we present a standard primal-dual interior-point method for training our SVM, and then improve the efficiency of the method by adaptively ignoring some patterns. Since each pattern corresponds to a primal constraint, this is called constraint reduction.

3.1. Primal-dual interior-point method. Since the soft-margin formulation for the SVM (2.11)-(2.13) is a CQP, every solution to the associated Karush-Kuhn-Tucker (KKT) conditions is a global optimum. Therefore, training the machine is equivalent to finding a solution to the KKT conditions for the primal (2.11)-(2.13) and the dual (2.14)-(2.16) problems [16]:

$w - X^T Y \alpha = 0,$  (3.1)
$y^T \alpha = 0,$  (3.2)
$\tau - \alpha - u = 0,$  (3.3)
$YXw - \gamma y + \xi - e - s = 0,$  (3.4)
$S\alpha = 0,$  (3.5)
$U\xi = 0,$  (3.6)
$s, u, \alpha, \xi \ge 0,$

where s is a slack variable vector for the inequality constraints
(2.12), and u is a slack for the upper bound constraints (2.16) and a
vector of multipliers for the non-negativity constraints (2.13).
Conditions (3.1)-(3.3) relate the gradient of the objective function to
the constraints that are active at an optimal solution, while (3.4) is
the primal feasibility condition. Conditions (3.5) and (3.6) enforce
complementary slackness.

In order to find a solution, we use a PDIPM. We apply a Newton-like method to solving the KKT conditions, with perturbations added to the complementarity conditions (3.5) and (3.6). For the variant of the MPC algorithm discussed in [16] (analogous to that in [36]), the search direction is obtained by solving the linearization of (3.1)-(3.6), written out equation by equation as

$\Delta w - X^T Y \Delta\alpha = -r_w,$
$y^T \Delta\alpha = -r_\alpha,$
$-\Delta\alpha - \Delta u = -r_u,$
$YX\Delta w - y\Delta\gamma + \Delta\xi - \Delta s = -r_s,$
$S\Delta\alpha + \mathrm{diag}(\alpha)\Delta s = -r_{sv},$
$U\Delta\xi + \mathrm{diag}(\xi)\Delta u = -r_{\xi u},$

where $r_w$, $r_\alpha$, $r_u$, and $r_s$ denote the residuals of (3.1)-(3.4) at the current iterate. First, an affine scaling (predictor) direction $(\Delta w^{aff}, \Delta\gamma^{aff}, \Delta\xi^{aff}, \Delta s^{aff}, \Delta\alpha^{aff}, \Delta u^{aff})$ is computed by setting

$r_{sv} = S\alpha,$  (3.7)
$r_{\xi u} = U\xi.$  (3.8)

Then, the combined affine-scaling and corrector step is obtained by setting

$r_{sv} = S\alpha - \sigma\mu e + \Delta S^{aff}\Delta\alpha^{aff},$  (3.9)
$r_{\xi u} = \mathrm{diag}(\xi)u - \sigma\mu e + \Delta U^{aff}\Delta\xi^{aff},$  (3.10)

where the complementarity measure (3) is defined by

$\mu = \frac{s^T\alpha + \xi^T u}{2m},$

$\sigma > 0$ is a centering parameter defined later, and $\Delta S^{aff}$ and $\Delta U^{aff}$ are diagonal matrices formed from $\Delta s^{aff}$ and $\Delta u^{aff}$, respectively. These equations can be reduced to the normal equations

$M\Delta w = -\bar r_w - \frac{1}{y^T\Omega^{-1}y}\,\bar r_\alpha\,\bar y,$  (3.11)

where $\bar y = X^T Y \Omega^{-1} y$, $\bar r_w = r_w + X^T Y \Omega^{-1} r_\Omega$, and $\bar r_\alpha = r_\alpha - y^T \Omega^{-1} r_\Omega$. Once (3.11) has been solved for $\Delta w$, the increments $\Delta\gamma$, $\Delta\alpha$, $\Delta\xi$, $\Delta u$, and $\Delta s$ are obtained by solving

$\Delta\gamma = \frac{1}{y^T\Omega^{-1}y}(-\bar r_\alpha + \bar y^T \Delta w),$  (3.12)
$\Delta\alpha = -\Omega^{-1}(r_\Omega + YX\Delta w - y\Delta\gamma),$  (3.13)
$\Delta\xi = -U^{-1}\mathrm{diag}(\xi)(\bar r_u - \Delta\alpha),$  (3.14)
$\Delta u = -\mathrm{diag}(\xi)^{-1}(r_{\xi u} + U\Delta\xi),$  (3.15)
$\Delta s = -\mathrm{diag}(\alpha)^{-1}(r_{sv} + S\Delta\alpha),$  (3.16)

where $\bar r_u = r_u + \mathrm{diag}(\xi)^{-1}r_{\xi u}$ and $r_\Omega = r_s + \mathrm{diag}(\alpha)^{-1}r_{sv} - U^{-1}\mathrm{diag}(\xi)\bar r_u$; see [16] for a detailed derivation.
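To make the block elimination above concrete, the following NumPy sketch computes one search direction from (3.11)-(3.16). It is an illustration under stated assumptions rather than the authors' code: the residuals are taken to be the left-hand sides of (3.1)-(3.4) at the current iterate, and the diagonal matrix $\Omega = \mathrm{diag}(\alpha)^{-1}S + U^{-1}\mathrm{diag}(\xi)$ is the one implied by the elimination leading to (3.13)-(3.16).

```python
import numpy as np

def search_direction(X, y, tau, w, gamma, xi, s, alpha, u, r_sv, r_xiu):
    """Solve (3.11)-(3.16) for one Newton-like step (dense NumPy sketch).

    The caller supplies the complementarity right-hand sides r_sv, r_xiu as in
    (3.7)-(3.8) (predictor) or (3.9)-(3.10) (combined step).  Diagonal matrices
    are stored as 1-D arrays; Omega is assumed to equal diag(alpha)^-1 S + U^-1 diag(xi).
    """
    n = X.shape[1]
    # residuals of (3.1)-(3.4) at the current iterate
    r_w = w - X.T @ (y * alpha)
    r_a = y @ alpha
    r_u = tau - alpha - u
    r_s = y * (X @ w) - gamma * y + xi - 1.0 - s
    omega = s / alpha + xi / u
    rbar_u = r_u + r_xiu / xi
    r_om = r_s + r_sv / alpha - (xi / u) * rbar_u
    ybar = X.T @ (1.0 / omega)                      # X^T Y Omega^{-1} y
    rbar_w = r_w + X.T @ ((y / omega) * r_om)
    rbar_a = r_a - (y / omega) @ r_om
    d = np.sum(1.0 / omega)                         # y^T Omega^{-1} y
    M = np.eye(n) + X.T @ ((1.0 / omega)[:, None] * X) - np.outer(ybar, ybar) / d
    dw = np.linalg.solve(M, -rbar_w - (rbar_a / d) * ybar)     # (3.11)
    dgamma = (-rbar_a + ybar @ dw) / d                         # (3.12)
    dalpha = -(r_om + y * (X @ dw) - y * dgamma) / omega       # (3.13)
    dxi = -(xi / u) * (rbar_u - dalpha)                        # (3.14)
    du = -(r_xiu + u * dxi) / xi                               # (3.15)
    ds = -(r_sv + s * dalpha) / alpha                          # (3.16)
    return dw, dgamma, dxi, ds, dalpha, du
```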
Forming and solving the normal equations (3.11) is the most time consuming task in a step of the predictor-corrector algorithm, so we now
focus on how to speed up this process. Fine and Scheinberg [15] and
Ferris and Munson [14] use the dual formulation (2.14)-(2.16), so their
normal equations involve an m x m matrix, considerably larger than our n
x n matrix M. They use the Sherman-Morrison-Woodbury (SMW) formula to
reduce computational complexity from O([m.sup.3]) to O([mn.sup.2]). In
contrast, Gertz and Griffin [16] solve the normal equations (3.11)
iteratively, using a matrix we call M(q) as a preconditioner. The
approach of Woodsend and Gondzio [35], with [l.sub.1] hinge loss,
results in the same KKT system and search direction equations as those
of Gertz and Griffin. However, they apply a different sequence of block
eliminations, resulting in different normal equations. Chapelle
discusses relations between the primal and dual based approaches [6]. He
shows that when using the SMW formula, finding the search direction has
the same computational complexity for the primal as for the dual, but he
argues that the primal approach is superior, because it directly
attempts to maximize the separation margin. We choose to speed up the
computation by forming an approximation to the matrix M.

3.2. Constraint reduction. In [22] we developed and analyzed an
algorithm for solving CQPs by replacing the matrix M in the normal
equations (3.11) by an approximation to it. In this section we see how
this idea can be applied to the SVM problem.

Since $y_i = \pm 1$ and $\Omega$ is diagonal, we see that $Y\Omega^{-1}Y = \Omega^{-1}$ and $y^T\Omega^{-1}y = e^T\Omega^{-1}e$. Now, consider the matrix of the normal equations (3.11),

$M_{(Q)} = I + \sum_{i \in Q} \omega_i^{-1} x_i x_i^T - \frac{1}{\sum_{i \in Q} \omega_i^{-1}} \Big(\sum_{i \in Q} \omega_i^{-1} x_i\Big)\Big(\sum_{i \in Q} \omega_i^{-1} x_i\Big)^T.$  (3.17)

If $Q = \{1, \ldots, m\}$, then $M_{(Q)} = M$. If $Q \subset \{1, \ldots, m\}$, then $M_{(Q)}$ is an approximation, accurate if the neglected terms are small relative to those included. The approximation reduces the computational cost for the matrix assembly, which is the most expensive task in a PDIPM, from $O(mn^2)$ to $O(|Q|n^2)$.
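In code, assembling the reduced matrix amounts to summing outer products over the chosen indices only. The sketch below follows the reconstruction of (3.17) above (so the rank-one term is also restricted to Q); it is an assumption-laden illustration rather than the paper's implementation.

```python
import numpy as np

def assemble_MQ(X, omega, Q):
    """Assemble M_(Q) as in (3.17): only rows of X indexed by Q are touched,
    so the cost is O(|Q| n^2) rather than O(m n^2)."""
    XQ, wQ = X[Q], 1.0 / omega[Q]               # omega_i^{-1} for i in Q
    n = X.shape[1]
    v = XQ.T @ wQ                                # sum_{i in Q} omega_i^{-1} x_i
    return np.eye(n) + XQ.T @ (wQ[:, None] * XQ) - np.outer(v, v) / wQ.sum()
```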
How do we obtain a good approximation? Strict complementarity typically holds between the variables s and $\alpha$ and between u and
$\xi$, so we make use of this assumption. The last term in $M_{(Q)}$ is a rank-one matrix, so the bulk of the work is in handling the second term. Patterns associated with larger $\omega_i^{-1}$ make a larger contribution to this term. Since $\alpha_i + u_i = \tau_i$, the quantity $\omega_i^{-1}$ becomes very large exactly when both $s_i$ and $\xi_i$ are close to zero. Therefore, as seen in Table 2.1, the important terms in the first summation in (3.17) are associated with the on-boundary support vectors. Identifying on-boundary support vectors is not possible until we find the maximal margin classifier and its class boundaries. At each iteration, however, we have an approximation to the optimal value of $\omega_i$, so we find prospective on-boundary support vectors by choosing the patterns with small $\omega_i$. As described in [16], a growing $\omega_i^{-1}$ diverges to infinity at an $O(\mu^{-1})$ rate and a vanishing $\omega_i^{-1}$ converges to zero at an $O(\mu)$ rate. Therefore, as the intermediate classifier approaches the maximal margin classifier, it becomes clearer which patterns are more likely to be on-boundary support vectors. This enables us to adaptively reduce the index set size used in (3.17). To measure how close the intermediate classifier is to the optimal one, we can use the complementarity measure $\mu$, which converges to zero. At each iteration, we set the size q of our index set Q to be a value between two numbers $q_L$ and $q_U$:

$q = \max\{q_L,\ \min\{\lceil\rho m\rceil,\ q_U\}\},$  (3.18)

where the heuristic choice

$\rho = \mu^{1/\beta}$

tends to synchronize the decrease of $|Q|$ with
convergence of the optimality measure. Here [beta] > 0 is a parameter
for controlling the rate of decrease, and [q.sub.U] is also an algorithm
parameter, but [q.sub.L] changes from iteration to iteration. At the
first iteration we randomly choose [q.sub.U] indices, because we have no
information about the classifier. Fast clustering algorithms may improve
the initial selection [3].

We now show how to choose patterns and determine $q_L$ at each iteration. Based on our examination of $\omega_i^{-1}$, there are several reasonable choices. Define $Q(z, q)$, the set of all subsets of $M = \{1, \ldots, m\}$ that contain the indices of the q smallest components of $z \in \mathbb{R}^m$:

$Q(z, q) = \{Q \mid Q \subseteq M,\ |Q| = q,\ \text{and } z_i \le z_j\ \forall i \in Q,\ j \notin Q\}.$  (3.19)

If there are no ties for the qth smallest component, then $Q(z, q)$
contains a single index set. It is also possible to include additional
indices in these sets for heuristic purposes, as allowed by [32]. Then,
we have the following choices of patterns:

* Rule 1: $Q(YXw - \gamma y + \xi - e,\ q)$. This rule selects the patterns $x_i$ corresponding to indices in these sets with the smallest "one-sided" distance to their class boundary plane, meaning that we only consider patterns that are on the wrong side of their class boundary plane. Assuming primal feasibility, this measures the slacks of the primal constraints (2.12). This choice is most intuitive because support vectors contribute to defining the classifier, which is the underlying idea for most decomposition-based algorithms. Inspired by the rate of convergence or divergence of $\omega_i^{-1}$, as noted above, we define the lower bound $q_L$ on the index set size by

$q_L = |\{i : \alpha_i/s_i \ge \theta\sqrt{\mu}\ \text{ or }\ s_i \le \sqrt{\mu}\}|,$

where $\theta$ is a prescribed parameter, so that we include terms with small values of $s_i$.

* Rule 2: $Q(\Omega e,\ q)$. The patterns $x_i$ chosen by these sets have the smallest $\omega_i$, and thus the largest contributions to the second term in the definition of the matrix M. Again, considering the rate of convergence or divergence of $\omega_i^{-1}$, we define the lower bound $q_L$ on the index set size by counting the number of large $\omega_i^{-1}$:

$q_L = |\{i : \omega_i^{-1} \ge \theta\sqrt{\mu}\}|,$  (3.20)

where $\theta$ is a prescribed parameter. Under our strict complementarity assumption, the value of $q_L$ will eventually converge to the number of diverging $\omega_i^{-1}$, or, equivalently, the number of on-boundary support vectors (a code sketch of this selection rule follows the list).
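The sketch below implements Rule 2 together with the size rules (3.18) and (3.20); the default values beta = 4 and theta = 10^2 are the ones used later in Section 5, and the code is an illustrative reading of the rules rather than the authors' implementation.

```python
import numpy as np

def choose_Q(omega, mu, m, q_U, beta=4.0, theta=1.0e2):
    """Rule 2 index selection: keep the q patterns with smallest omega_i."""
    q_L = int(np.sum(1.0 / omega >= theta * np.sqrt(mu)))          # (3.20)
    q = max(q_L, min(int(np.ceil(mu ** (1.0 / beta) * m)), q_U))   # (3.18), rho = mu^(1/beta)
    return np.argsort(omega)[:q]
```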
Either of these choices, however, may have an imbalance between the number of patterns selected from the "+" and "-"
classes. In the worst case, we might select patterns from only one of
the classes, whereas on-boundary support vectors are typically found in
both classes. To avoid this unfavorable situation, we might want to use
a balanced choice of patterns, specifying the number [q.sup.+] and
$q^-$ chosen from each class as

$q^+ = \max\big(q_L^+,\ \min(\lceil \min(\lceil\rho m\rceil,\ q_U)/2 \rceil,\ m^+)\big),$  (3.21)
$q^- = \max\big(q_L^-,\ \min(\lceil \min(\lceil\rho m\rceil,\ q_U)/2 \rceil,\ m^-)\big),$  (3.22)

where $m^+$ and $m^-$ are the number of "+" and "-" patterns, respectively. The values $q_L^+$ and $q_L^-$ are lower bounds defined separately for each class; for example, instead of (3.20) we use

$q_L^+ = |\{i : \omega_i^{-1} \ge \theta\sqrt{\mu}\ \text{ and }\ y_i = +1\}|,$
$q_L^- = |\{i : \omega_i^{-1} \ge \theta\sqrt{\mu}\ \text{ and }\ y_i = -1\}|.$

We adjust either $q^+$ or $q^-$ so that $q^+ + q^- = q$, the desired number of patterns. Given $q^+$ and $q^-$, we define the sets $Q^+(z, q^+)$ and $Q^-(z, q^-)$ in analogy with (3.19), restricting the candidate indices to the "+" and "-" patterns, respectively. The set Q is the union of one set in $Q^+(z, q^+)$ and another in $Q^-(z, q^-)$:

$Q \in Q(z, q) = \{Q \mid Q = Q^+ \cup Q^-,\ Q^+ \in Q^+(z, q^+)\ \text{and}\ Q^- \in Q^-(z, q^-)\}.$

In the notation, we have suppressed the dependence of $Q(z, q)$ on
[q.sup.+] and [q.sup.-] , to be consistent with notation for the
non-balanced choice (3.19). Having determined Q, we construct the
reduced normal equation for one step of our interior-point method by
assembling the matrix for the normal equation using a subset of the
patterns, and solving

$M_{(Q)}\Delta w = -\bar r_w - \frac{1}{y^T\Omega^{-1}y}\,\bar r_\alpha\,\bar y.$  (3.23)

Then we solve (3.12)-(3.16) for $\Delta\gamma$, $\Delta\alpha$, $\Delta\xi$, $\Delta u$, and $\Delta s$. Before we proceed, we have to ensure that the reduced matrix $M_{(Q)}$ is positive definite.

PROPOSITION 3.1. The matrix $M_{(Q)}$ is symmetric and positive definite.

Proof. See Proposition 1 in [16].

The following proposition explains the asymptotic convergence of the reduced matrix to the unreduced one.

PROPOSITION 3.2. For q defined in (3.18) and for all $Q \in Q(\Omega e, q)$, there exists a positive constant $C_M$ satisfying $\|M - M_{(Q)}\|_2 \le C_M\sqrt{\mu}$.

Proof. See Proposition 5 in [16].

Winternitz et al. presented convergence results for an MPC
constraint reduction algorithm for LP [34]. In recent work [22], we
provided a convergence proof for a constraint-reduced affine-scaling
primal-dual interior-point method for convex quadratic programming.
Typically, MPC algorithms require less work than affine scaling,
especially when the initial point is not near the central path.
Therefore, we state in Algorithm 1 a variant of a Mehrotra-type
predictor-corrector algorithm with a constraint reduction mechanism to
determine $M_{(Q)}$.

ALGORITHM 1 (Constraint reduced SVM (CRSVM)). [The pseudocode listing is not reproducible in this transcription; an illustrative sketch follows.]
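Since the listing itself is lost, the following NumPy sketch gives one plausible reading of the CRSVM loop: an MPC iteration whose normal-equations matrix is built by `assemble_MQ` from the index set returned by `choose_Q` (both sketched earlier). The centering rule $\sigma = (\mu_{aff}/\mu)^3$, the single common step length, the 0.995 step-length factor, and the simple termination test are standard MPC choices assumed here, not details taken from the paper.

```python
import numpy as np

def max_step(v, dv):
    """Largest step in [0, 1] keeping v + a*dv nonnegative."""
    neg = dv < 0
    return min(1.0, float(np.min(-v[neg] / dv[neg]))) if np.any(neg) else 1.0

def crsvm(X, y, tau, q_U=None, beta=4.0, theta=1.0e2, tol=1e-8, max_iter=200):
    """Simplified constraint-reduced MPC loop in the spirit of Algorithm 1."""
    m, n = X.shape
    q_U = m if q_U is None else q_U
    e = np.ones(m)
    w, gamma = np.zeros(n), 0.0
    xi, s, alpha, u = 2 * e, 2 * e, 2 * e, 2 * e       # starting point of Section 5
    for _ in range(max_iter):
        mu = (s @ alpha + xi @ u) / (2 * m)
        r_w = w - X.T @ (y * alpha)                    # residuals of (3.1)-(3.4)
        r_a = y @ alpha
        r_u = tau - alpha - u
        r_s = y * (X @ w) - gamma * y + xi - e - s
        if mu < tol and max(np.linalg.norm(r_w), abs(r_a),
                            np.linalg.norm(r_u), np.linalg.norm(r_s)) < tol:
            break
        omega = s / alpha + xi / u
        Q = choose_Q(omega, mu, m, q_U, beta, theta)   # adaptive index set
        MQ = assemble_MQ(X, omega, Q)                  # O(|Q| n^2) assembly

        def solve(r_sv, r_xiu):
            # right-hand sides of (3.11)-(3.16), built from all m constraints
            rbar_u = r_u + r_xiu / xi
            r_om = r_s + r_sv / alpha - (xi / u) * rbar_u
            ybar = X.T @ (1.0 / omega)
            rbar_w = r_w + X.T @ ((y / omega) * r_om)
            rbar_a = r_a - (y / omega) @ r_om
            d = np.sum(1.0 / omega)
            dw = np.linalg.solve(MQ, -rbar_w - (rbar_a / d) * ybar)
            dgamma = (-rbar_a + ybar @ dw) / d
            dalpha = -(r_om + y * (X @ dw) - y * dgamma) / omega
            dxi = -(xi / u) * (rbar_u - dalpha)
            du = -(r_xiu + u * dxi) / xi
            ds = -(r_sv + s * dalpha) / alpha
            return dw, dgamma, dxi, ds, dalpha, du

        # predictor (affine) direction, right-hand sides (3.7)-(3.8)
        dw_a, dg_a, dxi_a, ds_a, da_a, du_a = solve(s * alpha, xi * u)
        a_aff = min(max_step(s, ds_a), max_step(alpha, da_a),
                    max_step(xi, dxi_a), max_step(u, du_a))
        mu_aff = ((s + a_aff * ds_a) @ (alpha + a_aff * da_a) +
                  (xi + a_aff * dxi_a) @ (u + a_aff * du_a)) / (2 * m)
        sigma = (mu_aff / mu) ** 3                     # usual MPC centering heuristic
        # combined step, right-hand sides (3.9)-(3.10)
        dw, dg, dxi, ds, da, du = solve(s * alpha - sigma * mu * e + ds_a * da_a,
                                        xi * u - sigma * mu * e + du_a * dxi_a)
        a = 0.995 * min(max_step(s, ds), max_step(alpha, da),
                        max_step(xi, dxi), max_step(u, du))
        w, gamma = w + a * dw, gamma + a * dg
        xi, s, alpha, u = xi + a * dxi, s + a * ds, alpha + a * da, u + a * du
    return w, gamma, alpha
```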
When the matrix X is sparse, the first two terms in $M_{(Q)}$ are probably also sparse, but adding the third term makes the matrix dense.
Therefore, in this case we solve the normal equations (3.23) by
computing a sparse Cholesky factor (with full pivoting) for the sum of
the first two terms, and then applying the SMW formula to incorporate
the rank 1 matrix. When X is dense, we fully assemble [M.sub.(Q)] and
obtain its dense Cholesky factor.
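The following dense-Cholesky sketch illustrates the SMW idea just described: factor the sum of the first two terms of (3.17) and fold the rank-one term into each solve. It assumes SciPy's `cho_factor`/`cho_solve` and a dense factorization; the paper's implementation instead uses a sparse Cholesky factorization with full pivoting when X is sparse.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def solve_reduced(X, omega, Q, rhs):
    """Solve M_(Q) dw = rhs via Sherman-Morrison on the rank-one term of (3.17)."""
    XQ, wQ = X[Q], 1.0 / omega[Q]
    n = X.shape[1]
    B = np.eye(n) + XQ.T @ (wQ[:, None] * XQ)   # I + sum_{i in Q} omega_i^{-1} x_i x_i^T
    v = XQ.T @ wQ                               # sum_{i in Q} omega_i^{-1} x_i
    c = cho_factor(B)
    Bi_rhs, Bi_v = cho_solve(c, rhs), cho_solve(c, v)
    # (B - v v^T / sum(wQ))^{-1} rhs, by the Sherman-Morrison correction
    return Bi_rhs + Bi_v * (v @ Bi_rhs) / (wQ.sum() - v @ Bi_v)
```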
4. Kernelization. So far we have considered the case in which the Gram matrix K is defined as $k_{ij} = x_i^T x_j$, corresponding to the standard inner product $\langle x_i, x_j \rangle = x_i^T x_j$. More generally, we might define K using a symmetric and positive semidefinite kernel: (4)

$k : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R},\quad (x, \bar x) \mapsto k(x, \bar x),$

setting $k_{ij} = k(x_i, x_j)$. (5) We assume that
the dimension of the space X is l, reserving n for the rank of the
approximation to K used in the optimization problem derived below.If a data set has an enormous number of training patterns, building
K may not be feasible. For example, if the kernel is Gaussian, K is
usually dense even when the matrix X of training patterns is sparse.
Even worse, the matrix M is now m x m, and our constraint reduction is
not effective. The reason for this is that if K has full rank, our KKT
conditions become[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], (4.1)which is difficult to approximate. Several researchers [6, 15, 30]
have proposed to replace the Gram matrix with a low rank approximation K
[approximately equal to] [VG.sup.2] [V.sup.T], where G is an n x n
symmetric and positive definite matrix and V is an m x n matrix, for m
[much greater than] n. If V has full rank, then K has rank n. Such
approximations have been computed using the truncated eigenvalue
decomposition [7, 18], low rank Cholesky factorization with symmetric
pivoting [2, 15], the Nystrom method [13, 33], and kernel PCA map [19,
31]. For some kernels, the fast multipole method [27, 37] can be
employed to compute the truncated eigenvalue decomposition.Substituting a low rank approximation for K allows constraint
reduction to be effective in training nonlinear SVMs, if we are careful
in our problem formulation. Consider an approximate dual CQP with K in
(2.14)-(2.16) replaced by [VG.sup.2][V.sup.T]. What primal problem would
induce this dual? It is the problem obtained by substituting VG for X in
the primal (2.11)-(2.13). Therefore, to take advantage of constraint
reduction in training nonlinear SVMs, we use the data VG in place of X
and apply Algorithm 1.If G is only readily available in its squared inverse form
If G is only readily available in its squared inverse form $G^{-2}$, we could think of letting $\bar w = Gw \in \mathbb{R}^n$, which leads to the following problem:

$\min_{\bar w, \gamma, \xi}\ \tfrac{1}{2}\bar w^T G^{-2}\bar w + \tau^T\xi \quad \text{s.t.}\ Y(X\bar w - e\gamma) + \xi \ge e,\ \xi \ge 0,$

where X = V. This formulation would be useful if computing G is not desirable. For instance, $G^{-2}$ is a submatrix of K when the empirical kernel map [29, 31] is employed. Applying the constraint reduction to this formulation is straightforward. A simple change of $M_{(Q)}$ from (3.17) to

$M_{(Q)} = G^{-2} + \sum_{i \in Q} \omega_i^{-1} x_i x_i^T - \frac{1}{\sum_{i \in Q} \omega_i^{-1}} \Big(\sum_{i \in Q} \omega_i^{-1} x_i\Big)\Big(\sum_{i \in Q} \omega_i^{-1} x_i\Big)^T.$  (4.2)

5. Numerical results. We tested Algorithm 1 using MATLAB version
R2007a on a machine running Windows XP SP2 with an Intel Pentium IV
2.8GHz processor with 16 KB L1 cache, 1 MB L2 cache, 2 x 1 GB
DDR2-400MHz configured as dual channel, with Hyper Threading enabled.Both [tol.sub.r] and [tol.sub.[mu]] were set to [10.sup.-8]. The
iteration limit was set to 200. We set the parameters as [beta] = 4 to
control the rate of decrease of q, [theta] = [10.sup.2] to determine
[q.sub.L], and [[tau].sub.i] = 1 for i = 1, ..., m to penalize
misclassification. In our experiments we vary [q.sub.U]. The initial
starting point was set as in [16]:w = 0, [gamma] = 0, [xi] = s = [alpha] = u = 2e.We compared Algorithm 1 CRSVM to LIBSVM [5], and [SVM.sup.light]
[21]. We set their termination tolerance parameters as their default
value $10^{-3}$.

5.1. Linear SVM examples. We tested our implementation on problems
MUSHROOM, ISOLET, WAVEFORM, and LETTER, all taken from [16]. Except for
the ISOLET problem, all inputs were mapped to higher dimensional feature
space via the mapping associated with the second order polynomial kernel
k(x, [bar.x]) = [([x.sup.T] [bar.x] + 1).sup.2], as in [16]. The mapping
$\Phi$ is defined (up to the ordering of its components) as

$\Phi(x) = (1,\ \sqrt{2}x_1, \ldots, \sqrt{2}x_l,\ x_1^2, \ldots, x_l^2,\ \sqrt{2}x_1x_2, \ldots, \sqrt{2}x_{l-1}x_l)^T,$  (5.1)

so that $\Phi(x)^T\Phi(\bar x) = (x^T\bar x + 1)^2$, where l is the input space dimension. The ith row of X is set to $\Phi(x_i)^T$, where $x_i^T$ is the ith training input. We also normalized the resulting matrix using

$x_{ij} = x_{ij} / \max_{kl} |x_{kl}|,$  (5.2)

as directed in [16]. Properties of the problems are summarized in Table 5.1.
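For concreteness, one implementation of the second-order polynomial feature map and the normalization (5.2) might look as follows; the component ordering is the one chosen in the reconstruction of (5.1) above and is an assumption.

```python
import numpy as np

def poly2_features(x):
    """Degree-2 inhomogeneous feature map: phi(x).phi(z) == (x.z + 1)**2."""
    l = x.shape[0]
    cross = [np.sqrt(2.0) * x[i] * x[j] for i in range(l) for j in range(i + 1, l)]
    return np.concatenate(([1.0], np.sqrt(2.0) * x, x ** 2, cross))

def normalize(X):
    """Scale all entries by the largest magnitude, as in (5.2)."""
    return X / np.max(np.abs(X))

# X_feat = normalize(np.array([poly2_features(x) for x in X_raw]))
```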
In our experiment, we compared our algorithms to the unreduced MPC algorithm, which uses all the constraints for every iteration. We experimented with several variants of our algorithms:

* nonadaptive balanced constraint reduction, which uses fixed $q^+$ and $q^-$ throughout the iterations;

* adaptive non-balanced constraint reduction, which determines q as in (3.18);

* adaptive balanced constraint reduction, which determines $q^+$ and $q^-$ as in (3.21) and (3.22).

We chose constraints using either one-sided distance (YXw -
[gamma]y + [xi] - e) or [OMEGA]e to form the set Q, as explained in
Section 3.2, resulting in six algorithm variants.In Figure 5.1a and 5.1b, the time and iteration count of the
algorithm with the two constraint choices and the balanced selection
scheme are compared to those of the unreduced MPC. We set [q.sub.U] = m,
Bal = true, and CC = 'one-sided dist' or CC =
'omega'. Bar graphs are grouped by problem. All algorithms
produced residuals with similar accuracy, and Figure 5.1b shows that the
number of iterations is insensitive to algorithm choice. As a result,
all of the constraint reduction algorithms are faster than the unreduced
MPC algorithm, as seen in Figure 5.1a. In solving hard problems
(mushroom and waveform for instance, which have many support vectors),
it is observed that the constraint choice based on [OMEGA]e shows better
performance than the other. This is because the number of on-boundary
support vectors is nevertheless small; see Table 5.1.

Figure 5.2 compares the balanced and nonbalanced adaptive reduction
algorithms over a range of [q.sub.U]. In solving well balanced problems,
the two algorithms show little difference, as seen in Figure 5.2a. On
the other hand, for problems such as ISOLET, having many more patterns
in one class than the other, balanced selection gives more consistent
performance, especially for small values of [q.sub.U], as seen in Figure
5.2b. In problems with more than two classification labels, a
one-class-versus-the-rest approach is frequently employed [30, Chap. 7],
so the number of "-" patterns is often much larger than the
number of "+" patterns.In Figure 5.3, we compare the adaptive and nonadaptive balanced
reduction algorithms over a range of values of the upper bound on the
index size [q.sub.U]. Observe that there is little difference in
iteration counts between the two variants over a large range of
[q.sub.U] values. The time taken to solve a problem decreases very
slowly or remains almost invariant with the adaptive algorithm, as
[q.sub.U] decreases over this range, whereas the nonadaptive algorithm
is more expensive for large values of [q.sub.U].[FIGURE 5.1 OMITTED][FIGURE 5.2 OMITTED]In Figure 5.4, the two constraint choices based on one-sided
distance and [OMEGA]e are applied to the adaptive balanced reduction
algorithm and are compared over a range of [q.sub.U] values. In solving
easy problems having almost all support vectors on the boundary planes,
it is hard to say which constraint choice is better than the other. For
hard problems, the [OMEGA]e based constraint choice is capable of
filtering out more patterns at later iterations and shows better
performance.In Figure 5.5, we compare the number of patterns used and the
complementarity measurement [mu] at every iteration for various choices
of [q.sub.U]. When [q.sub.U] is large, the graphs of [mu] are quite
close to each other. From these graphs we see that the search direction
of the adaptive reduction algorithm is not as good as that of the
unreduced algorithm at early iterations. At later iterations, however,
the search direction of the adaptive reduction algorithm is as good as
that of the unreduced MPC algorithm, and sometimes better.[FIGURE 5.3 OMITTED][FIGURE 5.4 OMITTED]We compared our algorithm CRSVM to LIBSVM [5] and [SVM.sup.light]
[21] on the adult problem of the UCI repository [1]. We obtained a
formatted problem from the LIBSVM web page [5]. The problem consists of
9 sparse training sets with different numbers of sample patterns. Each
training set has a corresponding testing set. For this comparison, we
used the linear SVM, giving the algorithms X in a sparse format with no
modification except the normalization (5.2). We used adaptive balanced
constraint reduction, choosing patterns based on [OMEGA]e. Figure 5.6
shows the timing results. Observe that the timing curve of our algorithm
is close to linear, while those of LIBSVM and [SVM.sup.light] are
between linear and cubic [26].

[FIGURE 5.5 OMITTED]

[FIGURE 5.6 OMITTED]

5.2. Nonlinear SVM examples. We compared our algorithm to LIBSVM
and [SVM.sup.light]. We used adaptive balanced constraint reduction,
choosing patterns based on [OMEGA]e. We tested the algorithms on the
adult problem of the UCI repository. For this comparison, we used a
Gaussian kernel k(x, [bar.x]) = exp (-[[parallel]x -
[bar.x][parallel].sup.2]/(2[[sigma].sup.2])) with [sigma] = [square root
of (l/2)], where l is 123, the dimension of the input patterns. (6)We computed a low-rank approximation to K using both MATLAB's
EIGS function and our implementation in MATLAB of a low rank Cholesky
factorization with symmetric pivoting [15]. The CHOL algorithm returns a
rank n Cholesky factor L and a permutation matrix P such that [P.sup.T]
[LL.sup.T] P [approximately equal to] K.[FIGURE 5.7 OMITTED]Figure 5.7 shows results for these approximations on the ADULT data
set. (7) Figure 5.7a and 5.7b show relative error of the low rank
approximation to K measured by $\|K - V\Lambda V^T\|_\infty / \|K\|_\infty$ and $\|K - P^T LL^T P\|_\infty / \|K\|_\infty$.

As illustrated in Figure 5.7a, for a given rank EIGS approximates K
much better than CHOL. However, EIGS uses a fast multipole algorithm,
IFGT, to approximate matrix-vector products involving K, and there is a
limit to how large the rank can be before IFGT becomes impractically
slow, since its tolerance must be tightened as the rank is increased. In
our experiments we set the IFGT tolerance to be min(0.5,4/ [square root
of (rank)]). Figure 5.7b shows that, when the rank is fixed, errors in
the Gram matrix approximation by CHOL are more sensitive to the number
of training patterns. Figure 5.7c and 5.7d show the time to approximate
K. In Figure 5.7a and 5.7c, EIGS and CHOL were tested on the set of 6414
training patterns. In Figure 5.7b and 5.7d, we requested a rank 64
approximation from EIGS and a rank 300 approximation from CHOL.[FIGURE 5.8 OMITTED]Figure 5.8a compares CRSVM with the other methods. Notice both
LIBSVM and [SVM.sup.light] are implemented in the C language. We expect
we can improve CRSVM and CHOL by implementing them in C. We requested 64
eigenvalues and eigenvectors from EIGS to form a rank 64 approximation
to K. We set CHOL to form a rank 300 approximation. Figure 5.8b shows
times for both the approximation and the training. Table 5.2 shows
accuracy of the classifier generated by each algorithm, measured by the
percentage of correctly classified testing patterns. The classifiers
were tested on data sets associated with the training set. Notice that,
with a proper approximation, it is possible to get a classifier
performing as well as the one trained with the exact matrix.

5.3. Visualizing how the algorithm works. To illustrate how our
algorithm achieves efficiency, we constructed a two dimensional toy
problem. We generated 2000 uniformly distributed random points in [-1,1]
x [-1,1]. Then, we set an intentional ellipsoidal separation gap and
deleted patterns inside the gap, resulting in 1727 remaining patterns.
Figure 5.9 shows snapshots of several iterations of the adaptive
balanced reduction algorithm (with [q.sub.U] = m) in solving the
problem. Patterns are chosen based on [OMEGA]e. To find an ellipsoidal
classifier, the mapping (2.5) (associated with a second order
homogeneous polynomial kernel) is used to map the 2-dimensional input
space of the problem to a 3-dimensional feature space. The dashed
ellipsoids are the boundary curves, corresponding to boundary planes in
the feature space. As the iteration count increases, the number of
selected patterns decreases, and only the on-boundary support vectors
are chosen at the end, leading to significant time savings.

[FIGURE 5.9 OMITTED]

6. Conclusion. We presented an algorithm for training SVMs using a
constraint reduced IPM with a direct solver for the normal equations.
Significant time saving is reported for all problems, since the
algorithm acts as an adaptive filter for excluding unnecessary patterns.
For problems in which iterative solvers are appropriate, constraint
reduction would reduce the cost of matrix-vector products.Balanced constraint selection is more robust than unbalanced. The
[OMEGA]e constraint choice proved to be more effective than the
one-sided distance, especially for hard problems that have many
off-boundary support vectors. Other constraint choice heuristics can be
used provided that they include constraints that seem to be most active
at the current point. Blending different constraint choices is also
allowable.We experimentally compared our algorithms with other popular
algorithms, including LIBSVM and [SVM.sup.light]. We showed the
potential of our algorithms on training linear SVMs. In training
nonlinear SVMs, we showed how our algorithm can be applied when using a
low-rank approximation to the Gram matrix.The algorithm has substantial potential parallelism, but load
balancing could be challenging, since constraints are dynamically
chosen.

Acknowledgment. We would like to thank Joshua D. Griffin for helpful conversations and for providing the SVM training samples.

REFERENCES

[1] A. ASUNCION AND D. NEWMAN, UCI machine learning repository,
2007. Available at: http://archive.ics.uci.edu/ml.
[2] F. R. BACH AND M. I. JORDAN, Predictive low-rank decomposition for kernel methods, in ICML '05: Proc. of the 22nd International Conf. on Machine Learning (Bonn, Germany, August 07-11, 2005), ICML '05, vol. 119, ACM, New York, NY, pp. 33-40.
[3] D. BOLEY AND D. CAO, Training support vector machines using adaptive clustering, in Proceedings of the Fourth SIAM International Conference on Data Mining, M. W. Berry, U. Dayal, C. Kamath, and D. Skillicorn, eds., SIAM Press, Philadelphia, 2004, pp. 126-137.
[4] C. BURGES, A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Discov., 2 (1998), pp. 121-167.
[5] C.-C. CHANG AND C.-J. LIN, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[6] O. CHAPELLE, Training a support vector machine in the primal, Neural Comput., 19 (2007), pp. .
[7] T.-J. CHIN, K. SCHINDLER, AND D. SUTER, Incremental kernel SVD for face recognition with image sets, in FGR '06: Proc. of the 7th International Conf. on Automatic Face and Gesture Recognition (FGR06), Washington, 2006, IEEE Computer Society, pp. 461-466.
[8] C. CORTES AND V. VAPNIK, Support-vector networks, Mach. Learn., 20 (1995), pp. 273-297.
[9] G. B. DANTZIG AND Y. YE, A build-up interior point method for linear programming: Affine scaling form, Tech. Report, University of Iowa, Iowa City, July 1991.
[10] D. DEN HERTOG, J. KALISKI, C. ROOS, AND T. TERLAKY, A logarithmic barrier cutting plane method for convex programming, Ann. Oper. Res., 58 (1995), pp. 67-98.
[11] D. DEN HERTOG, C. ROOS, AND T. TERLAKY, A build-up variant of the path-following method for LP, Tech. Report DUT-TWI-91-47, Delft University of Technology, Delft, 1991.
[12] --, Adding and deleting constraints in the path-following method for linear programming, in Advances in Optimization and Approximation (Nonconvex Optimization and Its Applications), vol. 1, Kluwer Academic Publishers, Dordrecht, 1994, pp. 166-185.
[13] P. DRINEAS AND M. MAHONEY, On the Nystrom method for approximating a Gram matrix for improved kernel-based learning, J. Mach. Learn. Res., 6 (2005), pp. .
[14] M. C. FERRIS AND T. S. MUNSON, Interior-point methods for massive support vector machines, SIAM J. Optim., 13 (2002), pp. 783-804.
[15] S. FINE AND K. SCHEINBERG, Efficient SVM training using low-rank kernel representations, J. Mach. Learn. Res., 2 (2002), pp. 243-264.
[16] E. M. GERTZ AND J. D. GRIFFIN, Support vector machine classifiers for large data sets, Tech. Report ANL/MCS-TM-289, Argonne National Laboratories, Argonne, 2006. Available at: http://www.osti.gov/energycitations/product.biblio.jsp?osti_id=881587.
[17] E. M. GERTZ AND S. J. WRIGHT, Object-oriented software for quadratic programming, ACM Trans. Math. Software, 29 (2003), pp. 58-81.
[18] G. H. GOLUB AND C. F. VAN LOAN, Matrix Computations, The Johns Hopkins University Press, 1996.
[19] S. HARMELING, A. ZIEHE, M. KAWANABE, AND K.-R. MULLER, Kernel-based nonlinear blind source separation, Neural Comput., 15 (2003), pp. .
[20] M. A. HEARST, Trends and controversies: Support vector machines, IEEE Intell. Syst., 13 (1998), pp. 18-28.
[21] T. JOACHIMS, Making large-scale SVM learning practical, in Advances in Kernel Methods - Support Vector Learning, B. Scholkopf, C. Burges, and A. Smola, eds., MIT Press, 1998, pp. 169-185.
[22] J. H. JUNG, D. P. O'LEARY, AND A. L. TITS, Adaptive constraint reduction for convex quadratic programming, University of Maryland, 2007. Available at: http://www.optimization-online.org/DB_HTML/2007/10/1820.html.
[23] Z.-Q. LUO AND J. SUN, An analytic center based column generation algorithm for convex quadratic feasibility problems, SIAM J. Optim., 9 (1998), pp. 217-235.
[24] S. MEHROTRA, On the implementation of a primal-dual interior point method, SIAM J. Optim., 2 (1992), pp. 575-601.
[25] E. OSUNA, R. FREUND, AND F. GIROSI, An improved training algorithm for support vector machines, Proc. IEEE Signal Processing Society Workshop on Neural Networks and Signal Processing VII, J. Principe, L. Gile, N. Morgan, and E. Wilson, eds., IEEE Press, New York (1997), pp. 276-285. Available at: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=622408
[26] J. C. PLATT, Fast training of support vector machines using sequential minimal optimization, in Advances in Kernel Methods: Support Vector Learning, B. Scholkopf, C. Burges, and A. Smola, eds., MIT Press, Cambridge, 1999, pp. 185-208.
[27] V. C. RAYKAR, C. YANG, R. DURAISWAMI, AND N. GUMEROV, Fast computation of sums of Gaussians in high dimensions, Tech. Report CS-TR-4767, Department of Computer Science, University of Maryland, College Park, 2005. Available at: http://www.umiacs.umd.edu/~vikas/Software/IFGT/IFGT_code.htm.
[28] B. SCHOLKOPF, C. BURGES, AND A. SMOLA, Introduction to support vector learning, in Advances in Kernel Methods--Support Vector Learning, B. Scholkopf, C. Burges, and A. Smola, eds., MIT Press, 1998, pp. 1-22.
[29] B. SCHOLKOPF, S. MIKA, C. J. C. BURGES, P. KNIRSCH, K.-R. MULLER, G. RATSCH, AND A. J. SMOLA, Input space versus feature space in kernel-based methods, IEEE Trans. Neural Networks, 5 (1999), pp. .
[30] B. SCHOLKOPF AND A. J. SMOLA, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, 2001.
[31] A. J. SMOLA AND B. SCHOLKOPF, Sparse greedy matrix approximation for machine learning, in Proc. 17th International Conf. on Machine Learning, P. Langley, ed., Morgan Kaufmann, San Francisco, CA, 2000, pp. 911-918.
[32] A. L. TITS, P.-A. ABSIL, AND W. P. WOESSNER, Constraint reduction for linear programs with many inequality constraints, SIAM J. Optim., 17 (2006), pp. 119-146.
[33] C. K. I. WILLIAMS AND M. SEEGER, Using the Nystrom method to speed up kernel machines, in Neural Information Processing Systems 13, T. Leen, T. Dietterich, and V. Tresp, eds., MIT Press, 2001, pp. 682-688.
[34] L. B. WINTERNITZ, S. O. NICHOLLS, A. L. TITS, AND D. P. O'LEARY, A constraint-reduced variant of Mehrotra's predictor-corrector algorithm. Submitted for publication, 2007. Available at: http://www.optimization-online.org/DB_HTML/4.html.
[35] K. WOODSEND AND J. GONDZIO, Exploiting separability in large-scale support vector machine training, Tech. Report MS-07-002, School of Mathematics, University of Edinburgh, 2007. Available at: http://www.optimization-online.org/DB_HTML/0.html.
[36] S. J. WRIGHT, Primal-Dual Interior-Point Methods, SIAM, Philadelphia, 1997.
[37] C. YANG, R. DURAISWAMI, AND L. DAVIS, Efficient kernel machines using the improved fast Gauss transform, in Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou, eds., MIT Press, Cambridge, 2005, pp. .
[38] Y. YE, A "build-down" scheme for linear programming, Math. Program., 46 (1990), pp. 61-72.
[39] --, A potential reduction algorithm allowing column generation, SIAM J. Optim., 2 (1992), pp. 7-20.

JIN HYUK JUNG ([dagger]), DIANNE P. O'LEARY ([double dagger]), AND ANDRE L. TITS ([section])

* Received November 29, 2007. Accepted September 26, 2008. Published online on February 23, 2009. Recommended by Martin H. Gutknecht. This work was supported by the US Department of Energy under Grant DEFG.

([dagger]) Department of Computer Science, University of Maryland, College Park, MD 20742 (jjung@cs.umd.edu).
([double dagger]) Department of Computer Science and Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742 (oleary@cs.umd.edu).
([section]) Department of Electrical and Computer Engineering and the Institute for Systems Research, University of Maryland, College
Park, MD 20742 (andre@umd.edu).

(1) They use a scalar $\tau$.

(2) In many articles discussing the decomposition method, the terms "in-bound" and "bound support vectors" are used to denote on-boundary and off-boundary support vectors, respectively. The former terms are based on the bound constraints (2.16), whereas the latter terms are based on geometry.

(3) It is sometimes called the duality measure.

(4) Researchers in the machine learning field often use "positive definite" to denote "positive semidefinite" and "strictly positive definite" for "positive definite". Here we use the more common definitions from mathematics.

(5) The kernel is symmetric if $k(x, \bar x) = k(\bar x, x)$ for every $x, \bar x \in \mathcal{X}$. It is positive semidefinite if the Gram matrix is positive semidefinite for every finite collection of pattern vectors. In this case, there is an associated reproducing kernel map that allows us to define K in the way described above; see [4] and [30] for more details.

(6) This is the default setting of the Gaussian kernel in LIBSVM.

(7) IFGT supports dense input only.

Table 2.1: Classification of support vectors and nonsupport vectors. Here $s_i^*$ is the optimal slack variable, defined as $s_i^* := y_i(\langle w^*, x_i \rangle - \gamma^*) + \xi_i^* - 1$, associated with the ith constraint in (2.12).

Pattern type                  | $\alpha_i^*$    | $s_i^*$        | $\xi_i^*$
Off-boundary support vector   | $\tau_i$        | 0              | $(0, \infty)$
On-boundary support vector    | $(0, \tau_i)$   | 0              | 0
Nonsupport vector             | 0               | $(0, \infty)$  | 0
Table 5.1: Properties of the problems. ISD: Input space dimension. FSD: Feature space dimension using the map (5.1). SVs: support vectors. On-SVs: on-boundary support vectors. [Most of the table entries are not reproducible in this transcription; the surviving column of pattern counts reads 52 (31/21), 186 (74/112), 186 (74/112), 228 (110/118), 543 (266/277), 40 (10/30).]

Table 5.2: Accuracy shown as percent of testing patterns correctly classified, for CRSVM(+EIGS) and CRSVM(+CHOL). [The accuracy values are not reproducible in this transcription.]
