R. Fergus, Y. Weiss, and A. Torralba Semi-supervised Learning in Gigantic Image Collections. NIPS 2009

and optionally:

A. Shrivastava, S. Singh and A. Gupta. Constrained Semi-Supervised Learning Using Attributes and Comparative Attributes. In ECCV 2012.

Guillaumin, M., Verbeek, J., Schmid, C.: Multimodal semi-supervised learning for image classification. In: CVPR. (2010)

The goal of this work is to develop techniques for image search that can be applied to the billions of images on the Internet. The authors use the convergence of the eigenvectors of the normalized graph Laplacian to eigenfunctions of weighted Laplace-Beltrami operators to obtain highly efficient approximations for semi-supervised learning that are linear in the number of images. This method starts with building a graph Laplacian where each image is a vertex in a graph and the weight of the edge between the vertices is given by an affinity using Gaussian kernel. So for n data points, the affinity matrix W is of size n-by-n, from which the normalized graph Laplacian L can be computed.

ReplyDeleteIn semi-supervised learning, a label function f over the data points is to be solved. The graph Laplacian measures the smoothness of the label function f while the second term constrains f to agree with the labels, using a weighting lambda according the reliability of the label. These lambda can be rewritten in matrix form and then the optimal f is given by the solution to an n-by-n linear system. The objective is to minimize a combination of the smoothness and the training loss which has a closed form solution. However, it requires of solving an n-by-n linear system and when n is large, there could be numerical difficulties.

Instead of directly solving the n-by-n linear system, we can instead model f as a linear combination of the smallest few eigenvectors of the Laplacian. These smallest eigenvectors are smooth with respect to the data density. The smallest is just a DC term, but the 2nd smallest splits the data horizontally and the 3rd splits it vertically. Therefore, when using the k smallest eigenvectors as a basis, k being a value we select, typically being 100 or so, then we can rewrite the linear system we had previously. This now means that instead of solving an n-by-n system, we just need to solve a k-by-k system, from which we can compute the label function f.

However, eigenvectors still need to be found in the first place which requires inverting a huge n-by-n matrix. The authors assume that the data samples are from a distribution p(x) and analyze the eigenfunctions of the smoothness operator defined by p(x). The density defines a weighted smoothness operator on any function F(x) defined on Rd. The eigenvalues of the eigenfunctions is similar to the eigenvalue of the discrete generalized eigenvector. The following convergence can be observed: when n goes to infinity, the eigenfunctions can be seen as the limit of the eigenvectors.

The algorithm is as follows:

Delete1. Rotate data to maximize separability using PCA

2. For each of the d input dimensions

- Construct 1D histogram

- Solve numerically for eigenfunction/eigenvalues

- Order eigenfunctions from all dimensions by increasing eigenvalue and take the first k

4. Interpolate data into k eigenfunctions

- Yield approximate eigenvectors of Laplacian

Each image is represented with a single global descriptor: the Gist descriptor of Oliva & Torralba which captures the energy of Gabor filters at different scales and orientations over the image. Then, PCA is utilized to reduce the dimensionality to 64. The PCA step can also de-correlate the strong dependencies, measured by the mutual information score, between dimensions in the Gist space. They have evaluated on the CIFAR data set and the Tiny Image data set and shown improved results on both data sets.

Advantages:

1. In Nystrom method, the number of data points is reduced down to set a set of landmarks. By contrast in this approach, the authors consider the limit as the number of points goes to infinity and they have a continuous density. A key points is that this approach is linear in the number of examples.

2. The approximation using eigenfunctions can make large-scale semi-supervised learning practical.

Concerns:

1.In this approach, PCA is used to de-correlate the dimension dependencies. It is better that the authors can show the comparison of using PCA versus ICA in doing the same task.

2. A big assumption in this method is that the input distribution is separable. So for the toy 2D data, the joint density is modeled as a product of the two marginal distributions p(x1) and p(x2). And the eigenfunctions of these marginals are eigenfunctions of the joint density. It is better that the authors can show some analysis when the input distribution is not separable.

3. When n is not so large, the quality of the approximation can be further discussed.

4. How will the class separability impact the performance on real datasets?

5. The performance is shown as precision at 15% recall, how about precisions at even lower recall values? Would this method still shine?

Well I think the most important contribution of the paper is the algorithm is able to cope with a gigantic collection of data, while other sub-sampling methods still cannot scale well. Take the inverse of a graph Laplacian O(n^3) or compute the smallest eigenvectors O(n^2) is not possible but with the independence assumption it makes the underlying distribution approachable.

DeleteThis comment has been removed by the author.

DeleteIt appears that a major assumption is that the function f is smooth over the graph. Is this really a safe assumption we can make for entities as complex as whole images even even when we do things like gist/pca?

DeleteI think the assumption (or key idea) that function f is smooth over the graph is the backbone of any graph-based SSL technique (and not just this paper). If there is no such f that is smooth, we have no hope of labeling the unlabeled data by just relying on few labeled samples (in general?).

DeleteTo make the approach feasible, i.e.,scaling possible in gigantic data collections, it has to introduce several additional constraints, such as smooth functions and separable distributions. Although in theory, when these constraints are not satisfied, the derivation of the approach doesn't hold. However, justifying these constraints in practice is still hard. So it will be interesting to see how the method works or to what bad extent it performs when the constraints are not satisfied or partially satisfied.

DeleteIn this paper, the authors present an efficient, approximate approach for doing semi-supervised learning through the use of the affinity graph Laplacian. Rather than compute the eigenfunctions of the Laplacian directly, they instead approximate them using weighted hitograms on the PCA. They claim that the technique is linear in the number of examples, unlike direct approaches which require polynomial time.

ReplyDeleteIn their example, they take millions of images from the Internet, and apply a single Gist descriptor to the image as a feature. Some of the images are labeled by humans, others are not, and still others have "noisy" labels. They show favorable precision/recall results versus other methods.

Honestly, I don't know enough about spectral clustering, or the graph Laplacian to comment directly on the mathematical details, but their experimental setup seems a bit odd. Why a single Gist descriptor? Also, it seems to overfit a bit in the qualatative results. Notice how (most) all the airbuses are flying to the right (and "autos" are apparently mostly cars which drive diagonally to the left)? This seems to be a problem with unsupervised learning in general.

-- Matt K

I'm not sure how much of the overfitting is just due to dataset biases. As we saw before, most images of cups on the internet have handles on the right side. Perhaps most images are of cars driving diagonally to the left? This may even be some psychological thing that makes cars look more appealing or a pose that showcases the car's features best. In my opinion the search engine rankings and nearest neighbor rankings have the same bias. They just have more visual variety in the images so it is less obvious.

DeleteI agree with Priya about the overfitting point. And it goes back to our question of whether the goal of our algorithms is to overfit the real world or not. Also, overfitting is not just a problem in unsupervised learning. In fact, it is a bigger problem in supervised learning, which is why people started moving towards semi-supervised and unsupervised methods. And clearly, the latter is better when you don't have enough annotations for your data.

DeleteI think the GIST descriptor provides some notion of similarity between two images, considering the appearance and spatial layout of things in the image as a whole. The reason for most cars and airbuses moving in the same direction is probably due to bias in the way pictures are taken in general. What really puzzled me initially is the ostrich example where the algorithm seems to cluster two different poses together - one with the ostrich standing sideways and another where it is looking straight into the camera. These two are quite different visually and that is what I would expect the GIST descriptor to see. Clearly, the few labeled examples they started out with must have contained both poses and the algorithm managed to cluster these two into a single class. I think this is a good indicator for this algorithm's performance. If the dataset had indeed included mirror versions of the cars and airbuses pointing the other way and some of them were included in the initial set of labeled examples, I am fairly sure the algorithm would have found them.

DeleteThis comment has been removed by the author.

ReplyDelete(Critique)

ReplyDeleteFelix has summed up the method well so I won’t retread that part of the paper. Some thoughts on the algorithm:

- From the brief description of [11], it seems that the only difference is that [11] summarizes p(x) using some exemplars while the current work uses binning; this seems like a really subtle difference and it’s not clear which makes more sense on real data.

- No attempt is made to suggest that either of the main assumptions of the approximation (that the number of samples is large enough to well-approximate p(x) and that p(x) is separable) are met in practice.

And to summarize the experiments:

The authors perform two experiments to show the promise of the method. The first uses a subset of the images in the Tiny Images Dataset. These images comprise 126 classes for which there are at 200 positive and 300 negative labels, giving 63,000 images in total. For each class, 100 positive and 200 negative examples are selected for testing and t positive/negative pairs are selected for training. The results show that the proposed method outperforms supervised baselines suchs as SVM and nearest neighbor, as well as other approximate semi-supervised approaches when the number of training examples is small.

- When the number of training examples is still fairly small (around 2,000), the performance of SVM is as good or better than the proposed method, so what are we really gaining?

- In all cases performance is really terrible: 65% recall and 15%. Do modest gains in this regime even matter? This is definitely an “incremental”-type paper, rather than a paradigm changer.

The second experiment is on the whole Tiny Images Dataset (~80 million images, so justifying the word “gigantic” in the title). They compare compare qualitatively to nearest neighbor for 3 categories.

- Why not compare to other approximate semi-supervised methods such as [18]?

- This problem will bedevil and semi-supervised work, but: can we go beyond qualitative performance evaluation on this large dataset? If not, how can we possibly say one method is better than another?

Some notes on presentation:

- The term “weighted Laplace-Beltrami operators” appears only in the abstract and is not described anywhere in the paper.

- The abbreviations “+ve” and “-ve” are unnecessary and tough to read; they are also not used consistently.

I think that the results will be good if we use a lot of unlabeled data with the available labeled data. Am I right in saying that the asymptotic solution for clustering is better than the SVM and NN baselines (asymptotic as unlabeled N goes to infinity but labeled N is constant).

DeleteI think it is hard to extrapolate what the performance would be as N goes to infinity, but the main explanation the authors offer for SVM's poor performance with a small number of examples is overfitting. Once they get 2000 examples as Mike mentions, SVM seems to surpass their method.

DeleteIf a point is labeled as class A and 10 other points are clustered with it, then these 10 points are also from class A. So, clustering is expected to do most of the work and is guided by existing labels. But clustering in high dimensions is difficult unless we have a lot of data.

ReplyDeleteThis paper provides an elegant solution to allow the clustering algorithm to use a lot of data - thus making it work better.

I am, however, concerned about the separability assumption. The data comes from multiple complex interleaved manifolds in high dimension. These feel more like figure 3(right) than figure 3(left) in that most of the space is unoccupied and that things are around each other and not as uncorrelated as figure 3(left). Any thoughts?

I was also puzzled by that assumption for a bit. It seems reasonable to me now. Considering high dimensional spaces, it is known that most of the volume is away from the origin. If you consider a hypersphere then most of the data is in it's crust. It is also known that in high dimensions you tend to have "pockets", or some points that are in the NN of a lot of points. If you visualize these as spikes (http://www.northforktrails.com/RussellTowle/Mathematica/star_polytope.jpg), you can probably visualize that the axis maybe aligned with these spikes. In this case, the separability along dimensions may make some sense.

DeleteTo add to what Ishan said, their representation is PCA'd GIST, so by construction, their dimensions are more-or-less independent (more-or-less because of their binning step, which is still puzzling me..).

DeleteI wonder how their algorithm behaves for not-so-high-dim not-so-separable data (10-20d) with no explicit step added for separability..

As Abhinav said, dimensions are independent since they are output of PCA on GIST. It is interesting, because it means that PCA has an important role in the algorithm. I think that fact can point to a domain where their method performs reasonably well

DeleteI haven't read any of their reference papers, but I have a small concern that hopefully someone more familiar with these graphical methods can address:

ReplyDeleteThe authors seem to suggest that they have recovered the eigenfunctions for the LP operator, but in reality equation 2 only gives the eigenfunction evaluated at discrete points. Since a histogram is also used to approximate the data distribution, is there a possibility of artifacts arising from poor sampling choices?

Overall, I like this paper for showing a principled approach to SSL. I was slightly disappointed with the experimental section, but I guess for a ML conference it is ok.

ReplyDeleteI was really happy to see the Eigen Function approximation scheme. It has become a general trend in the ML community to use Eigen functions to approximate Kernel Matrices (like FastFood). I believe these methods derive their power by exploiting smoothness in the data similarity metrics.

The RBF-kernel may not be the best baseline. It is known to overfit with small number of training data. A different choice of kernel would have been nice.

The number of dimensions being used is dropped from 64 to 32 without explanation.

As a minor side note, the eigen function re-ranking may not be the best idea for a search engine's output. It has lots of redundancy. You expect a search engine to give you slightly diverse results. I guess the authors have a more "matching" use-case in mind.

I agree with RBF SVM point. Using an RBF and then mentioning overfitting with few training examples is going to happen due to the projection to infinite dim feature vector. That would have been more convincing with linear kernel. It also makes me wonder if that because the SVM beats their system for 'many' labeled training examples (>2000), if it's worth doing their approach compared to the simple svm.

DeleteI'm not sure about the linear SVM argument. Isn't it possible that even a linear SVM can overfit when the number of labeled examples is very small? The synthetic example used by the authors seems to show this (or maybe this is a problem only in low dimensional spaces?)

DeleteThe SVM does beat SSL for the case of 2000 training examples and above, but if the data is clustered reasonably well, the point is that their algorithm can work even with very little training data which does not necessarily have to be well-spread over the cluster of similar images. In applications like image-based internet search for uncommon categories, this should be useful.

SSL based methods with less annotated data can perform as good as fully supervised methods will much more annotated data. This is what the results section conveys. They have done a good job of taking the graph so far as to show how much annotation is required to match their method - that's why the SVM meets their meets their method at the end.

DeleteIn my opinion if they used 10 times more unlabeled data, it will take the SVM even more annotated data to reach them. Unfortunately, this 10 times more data is currently unavailable to perform this quantitative experiment and comparison.

The authors in the paper present a viable approach to handling ssl for large data. The math is generally explained well, and gives at least a decent understanding even though I am unfamiliar with the material and didn't read further in the referenced material. It was cool that they got it working on a "face pc" in under 1ms!

ReplyDeleteAgree, I wasn't familiar with the material either, but it is explained better than many textbooks would explain it

DeleteThis paper demonstrated a novel approach for leveraging lots of noisy data for image matching retrieval. I was a bit confused as to why they don't talk more about the SVM performance relative to their own in the case of data saturation. Isn't the point to leverage lots of data, and so they showed that their performance was comparable to simple SVM with no unlabeled training data?

ReplyDeleteIshan made an interesting point about the "diversity" problem. If you're producing many results, it's probably not optimal to produce things with high similarity to previous matches. It seems like it should be a list prediction problem. I suppose the entirety of the Tiny Images dataset experiments confused me. Their method clusters data, which for the tiny images seemed to be a classification problem --- 1 class per image. What is the "re-ranking" that they keep mentioning? Isn't it just a lookup, once things have been 'clustered'?

With lots of labeled data, I think this turns into a supervised classification problem and therefore an SVM or maybe even some other methods can perform well. I guess the point is that with lots of unlabeled data and only very few labeled instances, their semi-supervised method uses the assumption of smoothness in data density to cluster images in the same class. In this sense, the algorithm leverages structure in big data to learn from very few examples and as the number of examples provided increases, the returns are diminishing.

DeleteSure, but the whole argument is based on the assumption that there isn't enough labeled data, yet they didn't show improvement over letting the best supervised method have all of its labeled data. Then what's the point? Although it's cool that they can do well in cases with small amounts of unlabeled data.

DeleteTypo in the second to last word. unlabeled -> labeled

DeleteThe justification for SSL is diluted when the authors show 64 labeled examples and SVM does as well as their method.

DeleteThis paper proposes a novel semi-supervised learning technique for a huge number of images, which may or may not be labeled (including noisy labels). Being completely new to unsupervised and semi-supervised learning techniques, I'm extremely satisfied with the performance of the algorithm. With a dataset of 80 million images random images from the internet and limited annotations, I think the results are incredible, especially the fact that the algorithm is linear in the number of images. Having said that, I'm really not sure what works in their algorithm. They have GIST descriptors, PCA and eigenfunctions (which I don't understand completely) majorly. I have similar concerns as mentioned by a lot of people above. They compare eigenfunctions to other methods, which is good. Reducing dimensions is important for high-dimensional data with large number of samples. But there is no justification of the number of dimensions chosen or why they had to use only PCA. PCA, being a generative model might not be the best thing to do. Also, there is no comparison to how the existing semi-supervised learning algorithms work on their dataset and how this method is better.

ReplyDeleteI think the PCA will also cost a lot of computation which is the main problem this paper wants to solve. Another problem is that will PCA be suitable for such problem?

DeleteThis paper proposes a state-of-art semi-supervised learning method in huge dataset which contains millions of billions of images. The method mainly solve the problem that calculate eigenvectors of n*n matrix where n is a really large number. In this paper, the author uses the limits of the eigenfunctions to estimate the eigenvectors. In fact, in some cases, the eigenfunctions could be calculated analytically and it saves much more computations. This method could change the calculation of a high dimension matrix of the data points to a low dimension matrix of the parameters of the density functions. The idea in this paper is really interesting. Unlike the methods like PCA which reduce the dimensions of the data points directly, the idea in this paper is to use other low dimension parameter to estimate the origin problem.

ReplyDeleteA key assumption for this paper is density of data being in product form aka no strong dependencies between dimensions after PCA. The authors discuss this in section 3.1 features but I am concerned they use one dataset and show how mutual information between dimensions is low. It seems authors are happy to find a dataset where the assumption holds.

ReplyDeleteI find it interesting that rbf kernel SVM performance catches up with the eigen function approach after just 64 labeled examples. This makes me doubt whether I want to go through the trouble of eigen functions when I can label a few more images.

I like how noisy labels can be used in the eigen function formulation by giving them a positive label with small weight. This for me is the best thing to come out of this paper.

On a side note-

I really like how the authors explained the intuition behind their semi supervised approach and use of eigen vectors and eigen functions in figure 1 and 2.

I like the idea of approximating large SVD problem by estimating the pdf of data and thus compute the eigenfunctions. Basically, the method interpolates infinite number of points by the empirical distribution of observed data points and they show that one does not need to solve SVD problem with infinite number of dimensions, instead, a functional minimization problem could achieve that as long as we can solve it easily.

ReplyDeleteFor the concern raised by Aravindh, I agree that the assumption of data can be well approximiated using a Gaussian-like clique (Figure 3 left) may not be true, however, in the situation where we do not know what does the true distribution look like (prior), a Gaussian-like approximation is the best in terms of asymptotic performance.

It seems to be very hard to tackle large amount of unlabeled data using this type of SSL techniques (seems not much paper published about this topic recently, correct me if I'm wrong), trying to propagate labels to all unlabeled points may just be too hard. One possible alternative thing to do is to try to grab unlabeled data that you are really confident and propagate labels to them, and then use them to strengthen your classifier or something and do this process iteratively.

ReplyDeleteAfter reading all the supervised learning papers, which (to me) felt like they weren't easily generalizable due to prohibitive human labeling time, it's nice to read a paper directly addressing this problem - "While impressive, these manual efforts have no hope of scaling to the many billions of images on the Internet."

ReplyDeleteI'm unfamiliar with the math and don't have an intuitive sense of what is happening, but the results seem quite promising. It's interesting that each label tends to have images of the object in the same orientation. I wonder if additional orientations are simply missing, or if the algorithm is classifying them as something else.

One concern that I had was that this paper really focused on scaling to millions and billions of images but the datasets that they tested on only had a tiny fraction of the dataset they had in mind when they created the algorithm. I think that one of the reasons that their performance is limited is the "tiny" size of their datasets especially since they make some performance assumptions as n approaches infinity. I think their algorithm has the potential to shine if they did manage to get millions of billions of images. Though from what we've talked about before, it would be interesting to see their performance against nearest-neighbors for millions of images.

ReplyDeleteAfter reading the paper I get several intuitions from the paper. They are generalized here:

ReplyDelete1. They are proposing a spectral clustering like method biased by the available labels. (It would be interesting to compare their method with other methods, such as "biased normalized cuts" (CVPR 11) and "Grouping with Bias" (NIPS 2001))

2. Their final method emphasize more on density than traditional spectral methods by incorporating density estimates into their generalized eigendecomposition problem. The authors think density continuity is a strong indicator of co-class correlation.

The proposed method is reasonable and make a lot of sense to me. However, I was frustrated to see in the experiment, their method failed on a very simple toy example. From the toy experiments I could infer that their method prefers datasets with sample point clouds rather than manifold structures. This indicates that the method is laying somewhere between mean-shift-like clustering methods and spectral-methods.