Tuesday, November 12, 2013

Reading for 11/14

And optionally:

J. Donahue and K. Grauman. Annotator Rationales for Visual Recognition. ICCV 2011.

30 comments:

  1. In this paper, the authors present a framework for actively learning object models for detection and recognition, so as to maximize the usefulness of human labelers. Their approach relies on training a simple part-based model, a linear SVM over SIFT-based features for several pre-defined parts plus "context" from surrounding image patches. They use jumping-window prediction to find candidate bounding boxes for objects in unlabeled images. They predict, over all unlabeled windows, which are likely to be the most informative given their current linear hyperplane, and actively select those to be labeled by humans. They take the consensus of the human labelings to create newly annotated ground truth. This process iterates until a desired number of labeled windows is reached.
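
    To make the selection step concrete, here is a minimal sketch of a margin-based active selection loop as I understand it (this is not the authors' code; the linear SVM setup, the candidate-window pool, and the query_labels helper are stand-ins I made up for illustration):

```python
# Hypothetical sketch of margin-based active selection with a linear SVM.
# Feature extraction, jumping-window proposals, and the crowd-sourced
# labeling step are stubbed out; only the selection logic is shown.
import numpy as np
from sklearn.svm import LinearSVC

def active_learning_loop(X_labeled, y_labeled, X_pool, query_labels,
                         rounds=10, batch_size=100):
    """X_pool holds features of unlabeled candidate windows; query_labels
    plays the role of the human annotation step and returns 0/1 labels."""
    clf = LinearSVC(C=1.0)
    for _ in range(rounds):
        clf.fit(X_labeled, y_labeled)

        # Uncertainty = distance to the current hyperplane; the most
        # ambiguous windows are those closest to the decision boundary.
        margins = np.abs(clf.decision_function(X_pool))
        query_idx = np.argsort(margins)[:batch_size]

        # Send the selected windows to annotators and fold the answers in.
        new_labels = query_labels(query_idx)
        X_labeled = np.vstack([X_labeled, X_pool[query_idx]])
        y_labeled = np.concatenate([y_labeled, new_labels])
        X_pool = np.delete(X_pool, query_idx, axis=0)
    return clf
```

    The paper avoids exhaustively scoring the entire pool at each round by hashing the hyperplane, which is what makes a loop like this feasible at web scale.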

    Active learning is quite an interesting topic which asks the question: what is the next best thing to know, and how should I learn it? Using active learning, we're able to get the "best of both worlds" in terms of supervised and unsupervised approaches. The assumption that humans will be in the loop as "oracles" also adds a dimension of human-computer interaction to the problem, which is pretty interesting.

    My main complaint with this paper is that, while the system itself seems novel and interesting, the underlying techniques (an SVM over simple part-based models) seem a bit weak. What would be cool is to see a combination of this overall system with some of the semi-supervised learning approaches we discussed last class. Your MTurkers can't give you enough labeled data to make your supervised classifiers work well.

    -- Matt K

  2. This paper combines active learning with crowd-sourcing for object detection.
    First, the authors develop a fast, parts-based SVM detector based on SIFT features. They further speed up their approach by using the jumping-window method to select candidate windows. Using a hash-based method, the authors find uncertain instances in the data (the active learning part) and then send these to be annotated by Mechanical Turkers (the crowd-sourcing part). In order to compensate for bad human annotations, the authors cluster the annotated bounding boxes. The authors show results comparable to the state of the art on the PASCAL VOC 2007 dataset even without any active learning. They test their active learning framework on the PASCAL dataset as well as a dataset obtained from Flickr. The authors then argue that the active learning improves performance on some difficult categories such as bird and dog.
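
    A toy version of what that consensus step might look like (my own sketch, not the authors' implementation: group the workers' boxes by mutual overlap, keep only clusters that enough annotators agree on, then average them):

```python
# Toy consensus over several workers' bounding boxes for one image.
# Boxes are (x1, y1, x2, y2) tuples; a real system would be more careful.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def consensus_boxes(worker_boxes, iou_thresh=0.5, min_votes=3):
    clusters = []  # each cluster is a list of mutually overlapping boxes
    for box in worker_boxes:
        for cluster in clusters:
            if iou(box, cluster[0]) >= iou_thresh:
                cluster.append(box)
                break
        else:
            clusters.append([box])
    # Average the boxes in clusters that enough workers agree on.
    return [tuple(sum(b[i] for b in c) / len(c) for i in range(4))
            for c in clusters if len(c) >= min_votes]
```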

    1. Personally I think the strongest contribution isn't necessarily the active learning or the crowd-sourcing but the efficient detection method they develop. Training time is faster by a few orders of magnitude compared to state-of-the-art methods, and the mean average precision (mAP) is similar.

    2. Because of their fast method, they show active-learning + crowd-sourcing is plausible on a large scale.

    3. The authors suggest that (a) active learning results in a steeper learning curve (less data is needed to learn good models), and (b) active learning can help with hard categories such as animals.

    4. That being said, when one looks at Table 1, within each category there is a lot of variability as to which method is the top performer, and even with the Flickr dataset in Figure 8 the method seems to lose to the baseline for chairs and dogs. Is there a statistically rigorous way to measure when a method is truly beating the state of the art versus random chance? (One possibility is sketched below.)
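
    One option I can think of (just a sketch of an idea, not something the paper does): if per-image scores such as per-image AP were available for both methods on the same test set, a paired permutation test on the score differences would at least separate "wins by chance" from "wins consistently across images", though it says nothing about dataset bias.

```python
# Paired permutation (sign-flip) test on per-image scores of two detectors.
# scores_a / scores_b are hypothetical per-image APs on the same test images.
import numpy as np

def paired_permutation_test(scores_a, scores_b, n_perm=10000, seed=0):
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    observed = abs(diffs.mean())
    count = 0
    for _ in range(n_perm):
        # Under the null hypothesis the sign of each paired difference
        # is arbitrary, so flip signs at random and recompute the mean.
        signs = rng.choice([-1.0, 1.0], size=diffs.shape)
        if abs((signs * diffs).mean()) >= observed:
            count += 1
    return count / n_perm  # two-sided p-value
```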

    Replies
    1. The 'steeper learning curve' only seems significant for the 'dog' and 'boat' classes. The rest in Fig. 6 seem pretty comparable to the passive learner, which is surprising. This suggests that, as you have said, active learning helps for 'hard classes'.

      I think it is also worth pointing out that the axes in Fig. 6 differ in scale from plot to plot. It looks like the method consistently meets or beats the baselines, but in fact it does quite a bit worse on bicycles and bottles.

    2. With respect to point 4, I think for any computer vision problem, it is hard to find such a measure. In fact, even "state-of-the-art" is kind of a vague term because of factors like dataset bias, what the dataset was actually created for, what the proposed algorithm is trying to solve, etc.

    3. For point 4, I agree with Divya's comment.
      Also, there was a really interesting article in Nature about how statistical tests may be flawed. In fact, there is a (rather disturbingly strong) claim that 17-25% of scientific findings that are validated by a statistical "significance test" may actually be wrong.

    4. Here is the Nature article - http://www.nature.com/news/weak-statistical-standards-implicated-in-scientific-irreproducibility-1.14131
      Here is the paper - http://www.pnas.org/content/early/2013/10/28/1313476110.full.pdf

    5. Agree with Jacob that the fast detection method they developed is a big contribution of the paper. But I think this is a natural result of their goal of building a real system that helps Turkers do their jobs better. What I want to say is that I really like their attempt to tackle the active annotation problem in a "real" scenario, which makes the whole paper interesting and useful.

    6. I think the active learning part has some merit to it. Divya's point about defining what state-of-the-art really means is definitely valid. The true measure of an algorithm's performance has to include things like training time and scalability to new categories. In my opinion an algorithm doesn't have to necessarily improve upon state-of-the-art detectors (as they are currently defined) for every (or even any) category, as long as it can be used for new categories with minimal intervention from the researcher.

    7. For point 4, there is a theorem called "no free lunch" by Wolpert, according to which no learner can beat random guessing when averaged over all possible functions to be learned (i.e., no algorithm can beat the alternatives over the whole problem domain). That is to say, their performance integrated over all problems is the same constant. Hence, similarly here, the "state-of-the-art" description is only applicable to detection for certain categories.
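
      For reference, the statement I have in mind (my rough paraphrase of Wolpert's result, with err_OTS the off-training-set error, D the training data, and A_1, A_2 any two learners) is that, averaged uniformly over all target functions f, every learner does equally well:

```latex
\sum_{f} \mathbb{E}\!\left[\mathrm{err}_{\mathrm{OTS}} \mid f, D, \mathcal{A}_1\right]
  \;=\;
\sum_{f} \mathbb{E}\!\left[\mathrm{err}_{\mathrm{OTS}} \mid f, D, \mathcal{A}_2\right]
```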

    8. I have heard of the "no free lunch" theorem. Even if it's hard to be statistically rigorous, I would still like it if they tried these approaches on a variety of datasets to get a sense of what works when and where.

  3. This comment has been removed by the author.

  4. I think the active learning and crowd sourcing aspects of the work are interesting, but as Jacob and Matt have mentioned, their detector pipeline is also a bit of a departure from the baselines. As usual, we are left looking at mixed results with no idea as to what about this method is working and what is not. In fact, I would say the results in Figs. 6 and 8, where the active learner is compared to MTurking random images, are worrying, since the random method performs comparably for most classes and "dramatically" better for one.

    Replies
    1. I agree. It would have been pretty interesting to see what exactly is working in this method and how much advantage it gives us. The authors have mentioned in a lot of places things like "we try to take advantage of the positives in both the methods", which makes intuitive sense. But a more quantitative argument would have shown the true value of the proposed methods.

    2. I agree with the comment regarding the supposed usefulness of the active learner. The results with their 'strong baselines' are indeed quite strong (and, as you pointed out, do way better on PASCAL chair). The passive selection also seems to do nearly as well once more annotations are added. However, they mention steeper curves, which gives us hope that we can pay Turkers less and get better results on a limited budget.

    3. @Arun re: steeper curves -- Steeper curves in the realm of small amounts of data. Do we really care about this realm? Isn't the point of leveraging lots of data via active learning or unsupervised learning to outperform the methods that use all available data, not to show performance in the case of data sparsity? That's exactly what the experiments in Figure 6 are - showing that their method does well when data is sparse. Why should we care about this realm at all? Surely I can get 6000 annotations for every conceivable object once and then never have to deal with such data sparsity ever again...

    4. Actually, I think active vs. random selection is a key problem for the entire active learning field. From my experience using active selection in a past project, the active learner only helps slightly in some cases, while most of the time random selection simply yields comparable results. This is somewhat counterintuitive, since the active learner should have a better chance of increasing the margin of benefit at each selection. I suspect that, in that case, the data representation was simply not good enough, and more careful design would have been required to realize the benefit.

    5. I have the same question about what the main factor leading to the better performance is. I think they need to compare with other part-based algorithms.

  5. The paper describes a lot of interesting ideas for active learning for training object detectors. The authors are trying to solve multiple hard problems and they succeed in getting reasonable results.

    Things that I liked:
    1. The major idea of combining active learning and crowdsourcing - I think it is an amazing way to generate good training data.
    2. The part-based SMP (sparse max pooling) model for object detection - We know it works reasonably well. Simple, and it takes advantage of existing techniques.
    3. Hyperplane-hashing for selecting active windows - Definitely contributes to the reduction in computation time.
    4. In the Results - The comparison of computation time. As Jacob mentioned, I think this is the most significant contribution of the paper - making live active learning feasible on large-scale data. The authors try to minimize the time complexity of each step of their algorithm, and from the given results it definitely looks like a tremendous improvement.

    Things that I would have liked to see:
    1. All the algorithms that they have mentioned make a lot of intuitive sense. But I would have definitely preferred to see more numbers to get a quantitative sense of how much improvement they get (like the computation time comparison).
    2. Since so many existing methods are involved, it would have been useful to see what exactly works, as opposed to just comparing the proposed algorithm with the overall existing techniques.
    3. Addressing the above two points would have explained better why this algorithm beats the "state-of-the-art" in some cases and why it doesn't in others.

    Replies
    1. Also, I think there are two points that could be improved: the first is the active sampling scheme, which could be made more accurate than the simple margin heuristic while still scaling well to detection problems; the second is the search for good candidate windows, which needs to find relevant windows more efficiently and scale better. It would also be good for the authors to show some analysis of why they chose the jumping-window approach over the sliding-window approach. (My rough mental model of jumping windows is sketched below.)
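
      For what it's worth, my mental model of the jumping-window idea is roughly the following (heavily simplified, and definitely not the authors' code): each visual word "remembers" where the object box sat relative to it in the training images, and at test time every detected word votes for candidate boxes.

```python
# Toy jumping-window proposals: visual words vote for candidate boxes
# using the box offsets they were seen with during training.
from collections import defaultdict

def build_jump_table(training_features, training_boxes):
    """training_features: per positive image, a list of (word_id, x, y);
    training_boxes: the matching ground-truth boxes (x1, y1, x2, y2)."""
    table = defaultdict(list)
    for feats, (x1, y1, x2, y2) in zip(training_features, training_boxes):
        for word_id, x, y in feats:
            # Store the box corners relative to the feature location.
            table[word_id].append((x1 - x, y1 - y, x2 - x, y2 - y))
    return table

def propose_windows(test_features, table, max_windows=500):
    votes = defaultdict(int)
    for word_id, x, y in test_features:
        for dx1, dy1, dx2, dy2 in table.get(word_id, []):
            box = (round(x + dx1), round(y + dy1),
                   round(x + dx2), round(y + dy2))
            votes[box] += 1
    # Keep only the most-voted-for boxes instead of sliding over everything.
    return sorted(votes, key=votes.get, reverse=True)[:max_windows]
```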

  6. The paper was structured well and was overall pretty easy to read. Their use of sparse coding and max pooling for detection was interesting. Max pooling over the sparse selection of visual words is interesting compared to just doing max pooling over feature dimensions; this sparse coding + pooling probably captures more salient information.
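
    To spell out the difference from a plain bag-of-words histogram, here is roughly how I picture the encoding (a crude stand-in for the sparse-coding step; the paper's exact formulation may differ):

```python
# Rough sketch of a sparse-coding + max-pooling window encoding.
import numpy as np

def encode_window(descriptors, vocabulary, n_nonzero=5):
    """descriptors: (n, d) local descriptors inside a window;
    vocabulary: (k, d) visual words. Each descriptor gets a sparse code
    over the k words; the window keeps the per-word max over descriptors."""
    n, _ = descriptors.shape
    k = vocabulary.shape[0]
    codes = np.zeros((n, k))
    for i, desc in enumerate(descriptors):
        # Cheap stand-in for sparse coding: keep only the few closest
        # words, weighted by similarity, and zero out the rest.
        dists = np.linalg.norm(vocabulary - desc, axis=1)
        nearest = np.argsort(dists)[:n_nonzero]
        weights = 1.0 / (1.0 + dists[nearest])
        codes[i, nearest] = weights / weights.sum()
    # Max pooling: each visual word keeps its strongest response anywhere
    # in the window, rather than a histogram of counts.
    return codes.max(axis=0)
```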

    Their hashing method for selecting an example for the active learner was also interesting, though it doesn't seem to be the novel part of this paper (they cite their own NIPS 2010 paper [http://www.cs.utexas.edu/~grauman/papers/jain_NIPS2010.pdf], which is also in the domain of active learning). The novel step here seems to be the application of it to a real problem in a non-'sandbox' setting. Their NIPS paper focuses on datasets instead of on an unlabeled pool of Flickr images.
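
    My loose reading of the hyperplane-hashing trick from that NIPS paper (simplified; in practice they use several short hash tables rather than one long code): database points and the query hyperplane are hashed with paired random directions so that points lying close to the hyperplane, i.e. the most uncertain windows, tend to land in the query's bucket.

```python
# Loose sketch of hyperplane hashing for margin-based selection.
import numpy as np

class HyperplaneHash:
    def __init__(self, dim, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        # Each hash bit uses a pair of random directions (u, v).
        self.U = rng.standard_normal((n_bits, dim))
        self.V = rng.standard_normal((n_bits, dim))

    def hash_point(self, x):
        # Database points: sign(u.x) paired with sign(v.x).
        bits = np.concatenate([np.sign(self.U @ x), np.sign(self.V @ x)])
        return tuple(bits.astype(int))

    def hash_query(self, w):
        # Query hyperplane with normal w: the flipped sign on the second
        # half is what makes near-boundary points likely to collide.
        bits = np.concatenate([np.sign(self.U @ w), np.sign(-self.V @ w)])
        return tuple(bits.astype(int))
```

    At query time you would then only score the windows that fall into the query hyperplane's bucket(s), instead of evaluating |w.x| for every candidate window in the pool.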

  7. I like this paper because it actually builds a system that is able to use active learning with real non-expert humans. Most active learning papers I have read do not have this component, and hence don't care about computational time. They motivate their active learning from a practical viewpoint, but completely neglect it during the evaluations.

    I am slightly worried by the fact (as Humphrey noted) that for bottle and bicycle their method is much worse than the baselines. I am not happy with the fact that the authors did not analyze this. Was it because they were using a linear SVM without a "part deformation penalty"? Was it just that the data added was too diverse?

  8. I like the paper because it aims at exploring a very important problem, i.e. active labeling for object detection, and proposes an elegant "solution" to it. Technically, there are several points that I think are quite inspiring.
    1. The combination of the SP-based method and the deformable part template method, which potentially combines the advantages of both lines of methods, i.e. robustness to internal part deformation and linear runtime complexity.

    2. Hash coding for generating root windows and top-window selection. A similar idea was adopted by the later prize-winning paper:
    http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/40814.
    Also, they have a full-blown implementation of a real system, which is very impressive.
    Some random thoughts: there seems to be a lack of analysis of the effects of continuously increasing the pool of images for learning and annotation, in terms of both performance and computation. Dean et al. proved the feasibility of using hash functions to do very large scale object detection, which may be useful here as the system grows to a very large scale.

  9. The main focus of this paper seems to be scalability, which is something I really appreciate. Creating super-clean labeled datasets for different applications requires an unjustifiable amount of effort and patience on the part of the researcher. I think anyone who has ever had to spend hours cleaning up annotations from MTurkers for their projects would agree with me. Limiting the role of annotators to answering only specific queries would make the learning quicker and less expensive. Another good thing about this paper is that a lot of the techniques it employs, such as hyperplane-hashing, the use of linear kernels, and the jumping-window selection, are geared towards reducing computational cost. I also liked the sparse max pooling, which seems to be a smarter scheme for encoding features using visual words than just taking histograms.

    One thing I didn't understand about their Keyword + Random Image baseline (which seems to be performing really well) is whether annotators had to select bounding boxes by themselves. Getting binary labels for pre-computed bounding boxes is a lot easier, and though they might have needed to use more images, they could have gotten away with paying less in this case. This would still have made their active selection more viable than the random image selection.

  10. The authors of this paper develop and deploy a system for "live learning" in which they do not constrain their system to "sandbox" datasets for obtaining novel examples. They demonstrate reasonable performance gains on a few class categories over a "passive" baseline.

    A few things to point out:
    1. Their approach depends on having semantic categories (something that can be named and queried).
    2. I want to echo Matt's point that the learning scheme they employ doesn't seem likely to scale with additional data, and therefore is not likely to yield performance gains in the realm of having LOTS of data. This is the same issue I had with the paper from Tuesday. The point of these systems is, in my understanding, to leverage massive quantities of data, not simply to show performance gains with much less supervised data than other methods. These methods for crawling and obtaining new data are only relevant when the number of examples is orders of magnitude greater than the amount of supervised data used in "state-of-the-art" methods.

    Replies
    1. Is the second problem more general?

      A classifier that can deal with noisy annotation (no pruning after Mturk annotation) is probably one that is searching a smaller hypothesis space. Such a classifier is less likely to take advantage of lots of data.

    2. I agree with the first point. Why not combine discriminativeness with active learning? This reminds me of a recent paper by Fei-Fei Li: Fine-Grained Crowdsourcing for Fine-Grained Recognition. That paper actually incorporates a lot of discriminative elements.

  11. Generally speaking, I really like this paper because it identifies several crucial limitations of current active learning and crowd-sourced labeling methods and addresses them.

    This paper builds an active learning system which requires little human intervention and is suitable for large-scale data. The idea of the part-based detector is interesting: it is amenable to linear classifiers, and hash functions enable sub-linear time lookups that map a query hyperplane to its nearest points. The part-based model increases detection performance, which ensures that even with a linear kernel the performance is still comparable with other state-of-the-art algorithms.

  12. This comment echoes the general concern that active learning isn't going very far beyond random selection. This, in my opinion, is more of a computer vision (object detection) problem. When using internet images, each image is very different in both appearance space and low-level feature space (sparse coding + pooling in this case). Thus having a label for one image gives us very little information about other images/data points. This is perhaps also why we need a lot of mixture components for good detection accuracy.

    That said, some of the problems are alleviated by using strong parametric assumptions (like a linear decision boundary). But these unfortunately have a large approximation error (the true decision boundary isn't linear, I think), and we can see scores saturating very soon as the number of data points increases.

  13. After reading this paper, I get a sense that it is not a paper with a lot of fancy ideas and equations. Rather, it is a paper with a lot of engineering which makes large scale active learning feasible. The overall method seems to be a combination of many existing methods. I'm not against this type of paper, as long as it works well. The paper is easy to read and the methods are well-described.

    I agree that the contribution of this paper is its own detector with fast training. This makes the large-scale active learning feasible. However, I'm also concerned about the learning curve of its active learning part; it doesn't improve the results that much. I remember at CVPR 2011 someone asked Grauman the same question. (Sorry, I can't remember much of her answer; that was too long ago and I wasn't paying much attention at the time.)

    And I also think their detector has problems too. Their detector performs very badly on specific objects such as bottle and chair. From their active learning experiments, it seems the limitations of their active learning may have something to do with their detector as well (e.g., the chair category).

  14. What I like about methods for collecting labeled data in general is that they have developed into a separate field (as mentioned in the paper), and that this field has to solve many more problems of a computational nature because of huge datasets and limited time and budget. Hence, more pure computer science tricks get applied, for example the hashing method in this paper.
