16-824: Learning-based Methods in Vision (F'13): reading for 11/12

Thursday, November 7, 2013

reading for 11/12

R. Fergus, Y. Weiss, and A. Torralba Semi-supervised Learning in Gigantic Image Collections. NIPS 2009

and optionally:

A. Shrivastava, S. Singh and A. Gupta. Constrained Semi-Supervised Learning Using Attributes and Comparative Attributes. In ECCV 2012.

Guillaumin, M., Verbeek, J., Schmid, C.: Multimodal semi-supervised learning for image classification. In: CVPR. (2010)

40 comments:

UnknownNovember 11, 2013 at 12:02 AM
The goal of this work is to develop techniques for image search that can be applied to the billions of images on the Internet. The authors use the convergence of the eigenvectors of the normalized graph Laplacian to eigenfunctions of weighted Laplace-Beltrami operators to obtain highly efficient approximations for semi-supervised learning that are linear in the number of images. This method starts with building a graph Laplacian where each image is a vertex in a graph and the weight of the edge between the vertices is given by an affinity using Gaussian kernel. So for n data points, the affinity matrix W is of size n-by-n, from which the normalized graph Laplacian L can be computed.
In semi-supervised learning, a label function f over the data points is to be solved. The graph Laplacian measures the smoothness of the label function f while the second term constrains f to agree with the labels, using a weighting lambda according the reliability of the label. These lambda can be rewritten in matrix form and then the optimal f is given by the solution to an n-by-n linear system. The objective is to minimize a combination of the smoothness and the training loss which has a closed form solution. However, it requires of solving an n-by-n linear system and when n is large, there could be numerical difficulties.
Instead of directly solving the n-by-n linear system, we can instead model f as a linear combination of the smallest few eigenvectors of the Laplacian. These smallest eigenvectors are smooth with respect to the data density. The smallest is just a DC term, but the 2nd smallest splits the data horizontally and the 3rd splits it vertically. Therefore, when using the k smallest eigenvectors as a basis, k being a value we select, typically being 100 or so, then we can rewrite the linear system we had previously. This now means that instead of solving an n-by-n system, we just need to solve a k-by-k system, from which we can compute the label function f.
However, eigenvectors still need to be found in the first place which requires inverting a huge n-by-n matrix. The authors assume that the data samples are from a distribution p(x) and analyze the eigenfunctions of the smoothness operator defined by p(x). The density defines a weighted smoothness operator on any function F(x) defined on Rd. The eigenvalues of the eigenfunctions is similar to the eigenvalue of the discrete generalized eigenvector. The following convergence can be observed: when n goes to infinity, the eigenfunctions can be seen as the limit of the eigenvectors.
ReplyDelete
Replies
AnonymousNovember 11, 2013 at 12:38 PM
In this paper, the authors present an efficient, approximate approach for doing semi-supervised learning through the use of the affinity graph Laplacian. Rather than compute the eigenfunctions of the Laplacian directly, they instead approximate them using weighted hitograms on the PCA. They claim that the technique is linear in the number of examples, unlike direct approaches which require polynomial time.

In their example, they take millions of images from the Internet, and apply a single Gist descriptor to the image as a feature. Some of the images are labeled by humans, others are not, and still others have "noisy" labels. They show favorable precision/recall results versus other methods.

Honestly, I don't know enough about spectral clustering, or the graph Laplacian to comment directly on the mathematical details, but their experimental setup seems a bit odd. Why a single Gist descriptor? Also, it seems to overfit a bit in the qualatative results. Notice how (most) all the airbuses are flying to the right (and "autos" are apparently mostly cars which drive diagonally to the left)? This seems to be a problem with unsupervised learning in general.

-- Matt K
ReplyDelete
Replies
Mike McCannNovember 11, 2013 at 1:21 PM
This comment has been removed by the author.
ReplyDelete
Replies
Mike McCannNovember 11, 2013 at 1:23 PM
(Critique)

Felix has summed up the method well so I won’t retread that part of the paper. Some thoughts on the algorithm:

- From the brief description of [11], it seems that the only difference is that [11] summarizes p(x) using some exemplars while the current work uses binning; this seems like a really subtle difference and it’s not clear which makes more sense on real data.
- No attempt is made to suggest that either of the main assumptions of the approximation (that the number of samples is large enough to well-approximate p(x) and that p(x) is separable) are met in practice.

And to summarize the experiments:
The authors perform two experiments to show the promise of the method. The first uses a subset of the images in the Tiny Images Dataset. These images comprise 126 classes for which there are at 200 positive and 300 negative labels, giving 63,000 images in total. For each class, 100 positive and 200 negative examples are selected for testing and t positive/negative pairs are selected for training. The results show that the proposed method outperforms supervised baselines suchs as SVM and nearest neighbor, as well as other approximate semi-supervised approaches when the number of training examples is small.

- When the number of training examples is still fairly small (around 2,000), the performance of SVM is as good or better than the proposed method, so what are we really gaining?
- In all cases performance is really terrible: 65% recall and 15%. Do modest gains in this regime even matter? This is definitely an “incremental”-type paper, rather than a paradigm changer.

The second experiment is on the whole Tiny Images Dataset (~80 million images, so justifying the word “gigantic” in the title). They compare compare qualitatively to nearest neighbor for 3 categories.

- Why not compare to other approximate semi-supervised methods such as [18]?
- This problem will bedevil and semi-supervised work, but: can we go beyond qualitative performance evaluation on this large dataset? If not, how can we possibly say one method is better than another?

Some notes on presentation:
- The term “weighted Laplace-Beltrami operators” appears only in the abstract and is not described anywhere in the paper.
- The abbreviations “+ve” and “-ve” are unnecessary and tough to read; they are also not used consistently.
ReplyDelete
Replies
M AravindhNovember 11, 2013 at 8:37 PM
If a point is labeled as class A and 10 other points are clustered with it, then these 10 points are also from class A. So, clustering is expected to do most of the work and is guided by existing labels. But clustering in high dimensions is difficult unless we have a lot of data.
This paper provides an elegant solution to allow the clustering algorithm to use a lot of data - thus making it work better.

I am, however, concerned about the separability assumption. The data comes from multiple complex interleaved manifolds in high dimension. These feel more like figure 3(right) than figure 3(left) in that most of the space is unoccupied and that things are around each other and not as uncorrelated as figure 3(left). Any thoughts?
ReplyDelete
Replies
Humphrey HuNovember 11, 2013 at 10:31 PM
I haven't read any of their reference papers, but I have a small concern that hopefully someone more familiar with these graphical methods can address:

The authors seem to suggest that they have recovered the eigenfunctions for the LP operator, but in reality equation 2 only gives the eigenfunction evaluated at discrete points. Since a histogram is also used to approximate the data distribution, is there a possibility of artifacts arising from poor sampling choices?
ReplyDelete
Replies
IshanNovember 11, 2013 at 10:42 PM
Overall, I like this paper for showing a principled approach to SSL. I was slightly disappointed with the experimental section, but I guess for a ML conference it is ok.
I was really happy to see the Eigen Function approximation scheme. It has become a general trend in the ML community to use Eigen functions to approximate Kernel Matrices (like FastFood). I believe these methods derive their power by exploiting smoothness in the data similarity metrics.

The RBF-kernel may not be the best baseline. It is known to overfit with small number of training data. A different choice of kernel would have been nice.

The number of dimensions being used is dropped from 64 to 32 without explanation.

As a minor side note, the eigen function re-ranking may not be the best idea for a search engine's output. It has lots of redundancy. You expect a search engine to give you slightly diverse results. I guess the authors have a more "matching" use-case in mind.
ReplyDelete
Replies
ArunNovember 11, 2013 at 11:02 PM
The authors in the paper present a viable approach to handling ssl for large data. The math is generally explained well, and gives at least a decent understanding even though I am unfamiliar with the material and didn't read further in the referenced material. It was cool that they got it working on a "face pc" in under 1ms!
ReplyDelete
Replies
UnknownNovember 11, 2013 at 11:47 PM
This paper demonstrated a novel approach for leveraging lots of noisy data for image matching retrieval. I was a bit confused as to why they don't talk more about the SVM performance relative to their own in the case of data saturation. Isn't the point to leverage lots of data, and so they showed that their performance was comparable to simple SVM with no unlabeled training data?

Ishan made an interesting point about the "diversity" problem. If you're producing many results, it's probably not optimal to produce things with high similarity to previous matches. It seems like it should be a list prediction problem. I suppose the entirety of the Tiny Images dataset experiments confused me. Their method clusters data, which for the tiny images seemed to be a classification problem --- 1 class per image. What is the "re-ranking" that they keep mentioning? Isn't it just a lookup, once things have been 'clustered'?
ReplyDelete
Replies
Divya HariharanNovember 12, 2013 at 1:25 AM
This paper proposes a novel semi-supervised learning technique for a huge number of images, which may or may not be labeled (including noisy labels). Being completely new to unsupervised and semi-supervised learning techniques, I'm extremely satisfied with the performance of the algorithm. With a dataset of 80 million images random images from the internet and limited annotations, I think the results are incredible, especially the fact that the algorithm is linear in the number of images. Having said that, I'm really not sure what works in their algorithm. They have GIST descriptors, PCA and eigenfunctions (which I don't understand completely) majorly. I have similar concerns as mentioned by a lot of people above. They compare eigenfunctions to other methods, which is good. Reducing dimensions is important for high-dimensional data with large number of samples. But there is no justification of the number of dimensions chosen or why they had to use only PCA. PCA, being a generative model might not be the best thing to do. Also, there is no comparison to how the existing semi-supervised learning algorithms work on their dataset and how this method is better.
ReplyDelete
Replies
UnknownNovember 12, 2013 at 2:05 AM
This paper proposes a state-of-art semi-supervised learning method in huge dataset which contains millions of billions of images. The method mainly solve the problem that calculate eigenvectors of n*n matrix where n is a really large number. In this paper, the author uses the limits of the eigenfunctions to estimate the eigenvectors. In fact, in some cases, the eigenfunctions could be calculated analytically and it saves much more computations. This method could change the calculation of a high dimension matrix of the data points to a low dimension matrix of the parameters of the density functions. The idea in this paper is really interesting. Unlike the methods like PCA which reduce the dimensions of the data points directly, the idea in this paper is to use other low dimension parameter to estimate the origin problem.
ReplyDelete
Replies
GauravNovember 12, 2013 at 6:41 AM
A key assumption for this paper is density of data being in product form aka no strong dependencies between dimensions after PCA. The authors discuss this in section 3.1 features but I am concerned they use one dataset and show how mutual information between dimensions is low. It seems authors are happy to find a dataset where the assumption holds.

I find it interesting that rbf kernel SVM performance catches up with the eigen function approach after just 64 labeled examples. This makes me doubt whether I want to go through the trouble of eigen functions when I can label a few more images.

I like how noisy labels can be used in the eigen function formulation by giving them a positive label with small weight. This for me is the best thing to come out of this paper.

On a side note-
I really like how the authors explained the intuition behind their semi supervised approach and use of eigen vectors and eigen functions in figure 1 and 2.

ReplyDelete
Replies
UnknownNovember 12, 2013 at 8:52 AM
I like the idea of approximating large SVD problem by estimating the pdf of data and thus compute the eigenfunctions. Basically, the method interpolates infinite number of points by the empirical distribution of observed data points and they show that one does not need to solve SVD problem with infinite number of dimensions, instead, a functional minimization problem could achieve that as long as we can solve it easily.
For the concern raised by Aravindh, I agree that the assumption of data can be well approximiated using a Gaussian-like clique (Figure 3 left) may not be true, however, in the situation where we do not know what does the true distribution look like (prior), a Gaussian-like approximation is the best in terms of asymptotic performance.
ReplyDelete
Replies
UnknownNovember 12, 2013 at 8:56 AM
It seems to be very hard to tackle large amount of unlabeled data using this type of SSL techniques (seems not much paper published about this topic recently, correct me if I'm wrong), trying to propagate labels to all unlabeled points may just be too hard. One possible alternative thing to do is to try to grab unlabeled data that you are really confident and propagate labels to them, and then use them to strengthen your classifier or something and do this process iteratively.
ReplyDelete
Replies
UnknownNovember 12, 2013 at 10:20 AM
After reading all the supervised learning papers, which (to me) felt like they weren't easily generalizable due to prohibitive human labeling time, it's nice to read a paper directly addressing this problem - "While impressive, these manual efforts have no hope of scaling to the many billions of images on the Internet."

I'm unfamiliar with the math and don't have an intuitive sense of what is happening, but the results seem quite promising. It's interesting that each label tends to have images of the object in the same orientation. I wonder if additional orientations are simply missing, or if the algorithm is classifying them as something else.
ReplyDelete
Replies
Priya DeoNovember 12, 2013 at 10:23 AM
One concern that I had was that this paper really focused on scaling to millions and billions of images but the datasets that they tested on only had a tiny fraction of the dataset they had in mind when they created the algorithm. I think that one of the reasons that their performance is limited is the "tiny" size of their datasets especially since they make some performance assumptions as n approaches infinity. I think their algorithm has the potential to shine if they did manage to get millions of billions of images. Though from what we've talked about before, it would be interesting to see their performance against nearest-neighbors for millions of images.
ReplyDelete
Replies
UnknownNovember 12, 2013 at 10:24 AM
After reading the paper I get several intuitions from the paper. They are generalized here:

1. They are proposing a spectral clustering like method biased by the available labels. (It would be interesting to compare their method with other methods, such as "biased normalized cuts" (CVPR 11) and "Grouping with Bias" (NIPS 2001))

2. Their final method emphasize more on density than traditional spectral methods by incorporating density estimates into their generalized eigendecomposition problem. The authors think density continuity is a strong indicator of co-class correlation.

The proposed method is reasonable and make a lot of sense to me. However, I was frustrated to see in the experiment, their method failed on a very simple toy example. From the toy experiments I could infer that their method prefers datasets with sample point clouds rather than manifold structures. This indicates that the method is laying somewhere between mean-shift-like clustering methods and spectral-methods.
ReplyDelete
Replies

Add comment