
Friday, February 04, 2011

POTD: Reproducing Kernel Banach Spaces with the ℓ1 Norm

Reproducing Kernel Banach Spaces with the ℓ1 Norm
Guohui Song, Haizhang Zhang, and Fred J. Hickernell

Abstract:
Targeting at sparse learning, we construct Banach spaces B of functions on an input space X with the properties that (1) B possesses an l1 norm in the sense that it is isometrically isomorphic to the Banach space of integrable functions on X with respect to the counting measure; (2) point evaluations are continuous linear functionals on B and are representable through a bilinear form with a kernel function; (3) regularized learning schemes on B satisfy the linear representer theorem. Examples of kernel functions admissible for the construction of such spaces are given. 

Notes:
This one probably requires some explanation, for the non-ML folks. Reproducing Kernel Hilbert spaces are the coin of the realm in machine learning and for good reason. They allow much of ML to be "ported" from linear classifiers to non-linear classifiers: the kernel mapping essentially linearizes (via lifting) the nonlinear classifiers so you can get the benefit of the nonlinearity while operating algorithmically in a linear world. Even though the induced Hilbert space is typically a function space and is therefore infinite-dimensional, the representer theorem allows us in most cases to operate in a finite dimensional space (where the dimension is bounded by the number of samples). From a metric embedding perspective, kernels completely characterize the class of metrics isometrically embeddable in Euclidean space.
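To make the representer theorem concrete, here is a minimal kernel ridge regression sketch (my own toy illustration, not from the paper; the data, kernel width, and regularization parameter are all made up). Although the Gaussian kernel's RKHS is infinite-dimensional, the learned function is a finite combination of kernel evaluations at the samples, so everything reduces to an n-by-n linear system.

    import numpy as np

    # Toy data, made up purely for illustration.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 2))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)

    def gaussian_kernel(A, B, gamma=1.0):
        # K[i, j] = exp(-gamma * ||A_i - B_j||^2)
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    # Kernel ridge regression: by the representer theorem the minimizer has the
    # form f(x) = sum_i alpha_i K(x_i, x), so we only solve an n x n system.
    lam = 0.1
    K = gaussian_kernel(X, X)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

    # Predicting at a new point uses only kernel evaluations against the samples.
    x_new = np.array([[0.5, -0.3]])
    f_new = gaussian_kernel(x_new, X) @ alpha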

So RKHSs are great. So what's the deal with this paper? What it tries to do is combine the power of RKHSs with the regularity and sparsity properties guaranteed by $\ell_1$ norms. Even though your typical Banach space doesn't admit an inner product (which is what you need for the kernel mapping), the authors show that you can construct special Banach spaces in which kernels can be defined as before and the representer theorem still holds, while the nice $\ell_1$ structure yields sparse representations of solutions.

I promised SHORT summaries, so I'm not going to go further. But the takeaway message here is the ability to extend the nice properties of RKHSs to Banach spaces. For completeness I should mention that there are other approaches that have tried to do this, but using different mathematical constructs that are in some way less well suited.

Saturday, August 08, 2009

Negative-type distances and kernels

This is a story of two constructions that actually end up being essentially the same thing, as I recently discovered.

1. Kernels

The story of kernels in machine learning goes somewhat like this:

Take a positive definite function $$K(x,y)$$ that captures some notion of similarity between objects (a standard example of such a kernel is the Gaussian $$K(x,y) = \exp(-\|x - y\|^2)$$). A positive definite function, btw, is the generalization of a p.d. matrix: the integral $$\int f(x)f(y)K(x,y)\,dx\, dy \ge 0$$ for every (suitably integrable) function $$f$$.
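In the discrete setting this says that $$\sum_{i,j} b_i b_j K(x_i, x_j) \ge 0$$ for every choice of points and coefficients, i.e. every Gram matrix is positive semidefinite. A quick numerical sanity check for the Gaussian kernel (my own sketch, with arbitrary made-up points):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(30, 3))   # arbitrary points, made up for illustration

    # Gram matrix of the Gaussian kernel K(x, y) = exp(-||x - y||^2).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq)

    # Discrete positive definiteness: b^T K b >= 0 for every b, equivalently
    # all eigenvalues of the (symmetric) Gram matrix are nonnegative.
    assert np.linalg.eigvalsh(K).min() > -1e-9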

You can then construct a Hilbert space that captures the structure of the kernel. Specifically, for a fixed set of points S, construct a vector space from the basis $$\{k_x | x \in S\}$$, where $$k_x(y) = K(x,y)$$, and then define an inner product of two vectors in this space as follows: if $$v_a = \sum_x a_x k_x$$ and $$v_b = \sum_y b_y k_y$$, then $$v_a \cdot v_b = \sum_{x,y} a_x b_y K(x,y)$$.
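In matrix form (my own sketch of the same construction over a finite set S, continuing the Gaussian example): if $$a$$ and $$b$$ are the coefficient vectors of $$v_a$$ and $$v_b$$, the inner product is just the quadratic form $$a^\top K b$$ over the Gram matrix.

    import numpy as np

    # Gram matrix K[i, j] = K(x_i, x_j) over a finite set S (made-up points).
    rng = np.random.default_rng(2)
    X = rng.normal(size=(10, 2))
    K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

    # v_a = sum_i a_i k_{x_i} and v_b = sum_j b_j k_{x_j}; their inner product
    # is the quadratic form a^T K b.
    a = rng.normal(size=10)
    b = rng.normal(size=10)
    inner = a @ K @ b

    # Taking a and b to be standard basis vectors e_i and e_j recovers
    # <k_{x_i}, k_{x_j}> = K(x_i, x_j), i.e. the reproducing property below.
    i, j = 3, 7
    assert np.isclose(np.eye(10)[i] @ K @ np.eye(10)[j], K[i, j])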

You get nice properties from this construction: the so-called reproducing property is one of them, which tells you that $$K(x,y) = k_x \cdot k_y$$. In other words, we can capture the similarity function K(x,y) by a "standard" inner product in a Hilbert space.

What's even neater is that by invoking Mercer's theorem, you can construct an orthonormal basis, so that the Hilbert space looks just like a (possibly infinite-dimensional) Euclidean space. The squared Euclidean distance in this space can be written in kernel form, as
\[d_K(x,y) = K(x,x) + K(y,y) - 2K(x,y)\]
which is what you'd expect when treating K(.) as an inner product.
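As a quick sketch (my own, using the Gaussian kernel), this squared distance is just $$\|\phi(x) - \phi(y)\|^2$$ computed with kernel evaluations alone:

    import numpy as np

    def gaussian(x, y):
        return np.exp(-np.sum((x - y) ** 2))

    def d_K(x, y, K=gaussian):
        # Squared distance in the induced space: ||phi(x) - phi(y)||^2,
        # expanded via inner products as K(x,x) + K(y,y) - 2 K(x,y).
        return K(x, x) + K(y, y) - 2 * K(x, y)

    x, y = np.array([0.0, 0.0]), np.array([1.0, 2.0])
    print(d_K(x, y))   # equals 2 - 2*exp(-5) for these (made-up) points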

2. Negative-type distance.
The second story comes from Deza and Laurent. When can a distance function be embedded in Euclidean space? It turns out that it's more convenient to talk about the square of the distance function, which we'll call D.

There's an elegant characterization of when D can be embedded isometrically into $$\ell^2_2$$: this can be done if and only if D satisfies the negative-type inequality:
\[ \sum_{i,j} b_i b_j D(x_i, x_j) \le 0 \quad \text{whenever} \quad \sum_i b_i = 0 \]
for all finite sets of points $$x_i$$ and all such assignments of the $$b_i$$.
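A minimal numerical sketch of the condition (mine, not from Deza and Laurent), using the matrix of squared Euclidean distances of some made-up points as D, which is of negative type by construction:

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(15, 4))   # made-up points
    # D = matrix of *squared* Euclidean distances, which is of negative type.
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

    # Direct check on random zero-sum vectors b: sum_{i,j} b_i b_j D_ij <= 0.
    for _ in range(100):
        b = rng.normal(size=15)
        b -= b.mean()                  # enforce sum(b) = 0
        assert b @ D @ b <= 1e-9

    # Equivalent matrix form: with the centering matrix J = I - (1/n) 1 1^T,
    # D is of negative type iff -J D J / 2 is positive semidefinite.
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n
    assert np.linalg.eigvalsh(-J @ D @ J / 2).min() > -1e-9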

The proof works via a construction called the covariance mapping, which (for a fixed base point $$x_0$$) takes $$D$$ to the function $$k_D$$ defined as:
\[ k_D(x,y) = \frac{1}{2}\left(D(x, x_0) + D(y, x_0) - D(x,y)\right) \]
which differential geometry folks will recognize as the Gromov product.
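In matrix form, fixing the base point $$x_0$$ to be one of the points themselves, this is a one-liner (my own small helper, with a made-up name):

    import numpy as np

    def covariance_mapping(D, base=0):
        # k_D(x, y) = (1/2) * (D(x, x_0) + D(y, x_0) - D(x, y)),
        # with the base point x_0 taken to be the point at index `base`.
        return 0.5 * (D[:, [base]] + D[[base], :] - D)

    # Example with the squared-Euclidean D of some made-up points:
    rng = np.random.default_rng(3)
    X = rng.normal(size=(15, 4))
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    k = covariance_mapping(D)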

The proof completes by showing that the negative-type condition on D implies positive definiteness of $$k_D$$, and this in turn means that $$k_D$$ can be expressed as an R-covariance:
\[ k_D(x,y) = \int f_x(\omega)f_y(\omega) d\mu(\omega) \]
for some measure $$\mu$$ and functions $$f_x$$.

Note that the RHS of the equation is an infinite-dimensional inner product.
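In the finite setting both steps can be seen numerically (my sketch, reusing the covariance mapping above with a squared-Euclidean D): the Gromov-product matrix is positive semidefinite, and its eigendecomposition gives explicit functions $$f_x$$ realizing $$k_D$$ as a discrete R-covariance.

    import numpy as np

    # A negative-type D (squared Euclidean distances of made-up points) and its
    # Gromov-product kernel k_D, with the first point as base point x_0.
    rng = np.random.default_rng(4)
    X = rng.normal(size=(12, 3))
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    k = 0.5 * (D[:, [0]] + D[[0], :] - D)

    # Negative type of D implies positive semidefiniteness of k_D ...
    eigvals, eigvecs = np.linalg.eigh(k)
    assert eigvals.min() > -1e-9

    # ... and the eigendecomposition gives f_x(w) = sqrt(lambda_w) * v_w(x), so
    # k_D(x, y) = sum_w f_x(w) f_y(w): a discrete stand-in for the integral
    # against the measure mu.
    F = eigvecs * np.sqrt(np.clip(eigvals, 0, None))   # row i is the "function" f_{x_i}
    assert np.allclose(F @ F.T, k)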

3. Where the two stories come together
The mapping that takes a kernel to a distance is the inverse of the covariance mapping used to map a distance to a kernel. In other words, if we take a kernel K, compute $$d_K$$, and then use the distance-to-kernel mapping to compute $$k_{d_K}$$, we get back K. Further, since we can show that the negative-type condition on D implies a positive-definite condition on $$k_D$$, we can start off with either "D satisfies negative-type inequalities" or "K is a positive definite kernel" and arrive at the same conclusion about $$\ell_2^2$$ embeddability.
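A small numerical check of the round trip (my sketch; I use a linear kernel and take the base point $$x_0$$ to be the origin, so that $$K(\cdot, x_0) = 0$$ and the recovery is exact):

    import numpy as np

    rng = np.random.default_rng(5)
    # Made-up points, with the base point x_0 = origin included at index 0.
    X = np.vstack([np.zeros((1, 3)), rng.normal(size=(8, 3))])

    # A linear kernel, so K(x, x_0) = 0 for the chosen base point.
    K = X @ X.T

    # Kernel -> distance: d_K(x, y) = K(x, x) + K(y, y) - 2 K(x, y).
    d = np.diag(K)[:, None] + np.diag(K)[None, :] - 2 * K

    # Distance -> kernel via the covariance mapping with base point x_0 (index 0).
    K_back = 0.5 * (d[:, [0]] + d[[0], :] - d)

    assert np.allclose(K_back, K)   # the round trip recovers K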

What's interesting is that the two ways of thinking about this don't seem to have been connected together explicitly before.

p.s. This is not a fully rigorous correspondence except possibly in the finite-dimensional case, but I find that it's often convenient to replace the negative-type argument with a kernel positive-definiteness argument in my head.
