Tuesday, November 12, 2013

Reading for 11/14

And optionally:

J. Donahue and K. Grauman. Annotator Rationales for Visual Recognition. ICCV 2011.

30 comments:

  1. In this paper, the authors present a framework for actively learning object models for detection and recognition, so as to maximize the usefulness of human labelers. Their approach relies on training a simple part-based model, a linear SVM over SIFT-based features for several pre-defined parts plus "context" from surrounding image patches. They use jumping-window prediction to find candidate bounding boxes for objects in unlabeled images. They predict, over all unlabeled windows, which are likely to be the most informative given their current linear hyperplane, and actively select those to be labeled by humans. They take the consensus of the human labelings to create newly annotated ground truth. This process iterates until a desired number of labeled windows is reached.
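
    To make the selection step concrete, here is a minimal sketch of a margin-based active selection loop as I understand it (this is not the authors' code; the linear SVM setup, the candidate-window pool, and the query_labels helper are stand-ins I made up for illustration):

```python
# Hypothetical sketch of margin-based active selection with a linear SVM.
# Feature extraction, jumping-window proposals, and the crowd-sourced
# labeling step are stubbed out; only the selection logic is shown.
import numpy as np
from sklearn.svm import LinearSVC

def active_learning_loop(X_labeled, y_labeled, X_pool, query_labels,
                         rounds=10, batch_size=100):
    """X_pool holds features of unlabeled candidate windows; query_labels
    plays the role of the human annotation step and returns 0/1 labels."""
    clf = LinearSVC(C=1.0)
    for _ in range(rounds):
        clf.fit(X_labeled, y_labeled)

        # Uncertainty = distance to the current hyperplane; the most
        # ambiguous windows are those closest to the decision boundary.
        margins = np.abs(clf.decision_function(X_pool))
        query_idx = np.argsort(margins)[:batch_size]

        # Send the selected windows to annotators and fold the answers in.
        new_labels = query_labels(query_idx)
        X_labeled = np.vstack([X_labeled, X_pool[query_idx]])
        y_labeled = np.concatenate([y_labeled, new_labels])
        X_pool = np.delete(X_pool, query_idx, axis=0)
    return clf
```

    The paper avoids exhaustively scoring the entire pool at each round by hashing the hyperplane, which is what makes a loop like this feasible at web scale.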

    Active learning is quite an interesting topic which asks the question: what is the next best thing to know, and how should I learn it? Using active learning, we're able to get the "best of both worlds" in terms of supervised and unsupervised approaches. The assumption that humans will be in the loop as "oracles" also adds a dimension of human-computer interaction to the problem, which is pretty interesting.

    My main complaint with this paper is that, while the system itself seems novel and interesting, the underlying techniques (an SVM over simple part-based models) seem a bit weak. What would be cool is to see a combination of this overall system with some of the semi-supervised learning approaches we discussed last class. Your MTurkers can't give you enough labeled data to make your supervised classifiers work well.

    -- Matt K

  2. This paper combines active learning with crowd-sourcing for object detection.
    First, the authors develop a fast, parts-based SVM detector based on SIFT features. They further speed up their approach by using the jumping-window method to select candidate windows. Using a hash-based method, the authors find uncertain instances in the data (the active learning part) and then send these to be annotated by Mechanical Turkers (the crowd-sourcing part). In order to compensate for bad human annotations, the authors cluster the annotated bounding boxes. The authors show results comparable to the state of the art on the PASCAL VOC 2007 dataset even without any active learning. They test their active learning framework on the PASCAL dataset as well as a dataset obtained from Flickr. The authors then argue that the active learning improves performance on some difficult categories such as bird and dog.
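
    A toy version of what that consensus step might look like (my own sketch, not the authors' implementation: group the workers' boxes by mutual overlap, keep only clusters that enough annotators agree on, then average them):

```python
# Toy consensus over several workers' bounding boxes for one image.
# Boxes are (x1, y1, x2, y2) tuples; a real system would be more careful.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def consensus_boxes(worker_boxes, iou_thresh=0.5, min_votes=3):
    clusters = []  # each cluster is a list of mutually overlapping boxes
    for box in worker_boxes:
        for cluster in clusters:
            if iou(box, cluster[0]) >= iou_thresh:
                cluster.append(box)
                break
        else:
            clusters.append([box])
    # Average the boxes in clusters that enough workers agree on.
    return [tuple(sum(b[i] for b in c) / len(c) for i in range(4))
            for c in clusters if len(c) >= min_votes]
```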

    1. Personally I think the strongest contribution isn't necessarily the active learning or the crowd-sourcing but the efficient detection method they develop. Training time is faster by a few orders of magnitude compared to state-of-the-art methods, and the mean average precision (mAP) is similar.

    2. Because of their fast method, they show active-learning + crowd-sourcing is plausible on a large scale.

    3. The authors suggest that (a) active learning results in a steeper learning curve (less data is needed to learn good models), and (b) active learning can help with hard categories such as animals.

    4. That being said, when one looks at Table 1, within each category there is a lot of variability as to which method is the top performer, and even with the Flickr dataset in Figure 8 the method seems to lose to the baseline for chairs and dogs. Is there a statistically rigorous way to measure when a method is truly beating the state of the art versus random chance? (One possibility is sketched below.)
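
    One option I can think of (just a sketch of an idea, not something the paper does): if per-image scores such as per-image AP were available for both methods on the same test set, a paired permutation test on the score differences would at least separate "wins by chance" from "wins consistently across images", though it says nothing about dataset bias.

```python
# Paired permutation (sign-flip) test on per-image scores of two detectors.
# scores_a / scores_b are hypothetical per-image APs on the same test images.
import numpy as np

def paired_permutation_test(scores_a, scores_b, n_perm=10000, seed=0):
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    observed = abs(diffs.mean())
    count = 0
    for _ in range(n_perm):
        # Under the null hypothesis the sign of each paired difference
        # is arbitrary, so flip signs at random and recompute the mean.
        signs = rng.choice([-1.0, 1.0], size=diffs.shape)
        if abs((signs * diffs).mean()) >= observed:
            count += 1
    return count / n_perm  # two-sided p-value
```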

    Replies
    1. The 'steeper learning curve' only seems significant for the 'dog' and 'boat' classes. The rest in Fig. 6 seem pretty comparable to the passive learner, which is surprising. This suggests that, as you have said, active learning helps for 'hard classes'.

      I think it is also worth pointing out that the axes in Fig. 6 differ in scale from plot to plot. It looks like the method consistently meets or beats the baselines, but in fact it does quite a bit worse on bicycles and bottles.

    2. With respect to point 4, I think for any computer vision problem, it is hard to find such a measure. In fact, even "state-of-the-art" is kind of a vague term because of factors like dataset bias, what the dataset was actually created for, what the proposed algorithm is trying to solve, etc.

    3. For point 4, I agree with Divya's comment.
      Also, there was a really interesting article in Nature about how statistical tests may be flawed. In fact, there is a (rather disturbingly strong) claim that 17-25% of scientific findings that are validated by a statistical "significance test" may actually be wrong.

    4. Here is the Nature article - http://www.nature.com/news/weak-statistical-standards-implicated-in-scientific-irreproducibility-1.14131
      Here is the paper - http://www.pnas.org/content/early/2013/10/28/1313476110.full.pdf

    5. Agree with Jacob that the fast detection method they developed is a big contribution of the paper. But I think this is a natural result of their goal of building a real system that helps Turkers do their jobs better. What I want to say is that I really like their attempt to tackle the active annotation problem in a "real" scenario, which makes the whole paper interesting and useful.

    6. I think the active learning part has some merit to it. Divya's point about defining what state-of-the-art really means is definitely valid. The true measure of an algorithm's performance has to include things like training time and scalability to new categories. In my opinion an algorithm doesn't have to necessarily improve upon state-of-the-art detectors (as they are currently defined) for every (or even any) category, as long as it can be used for new categories with minimal intervention from the researcher.

    7. For point 4, there is a theorem called "no free lunch" by Wolpert, according to which no learner can beat random guessing when averaged over all possible functions to be learned (i.e., no algorithm can beat the alternatives over the whole problem domain). That is to say, their performance integrated over all problems is the same constant. Hence, similarly here, the "state-of-the-art" description is only applicable to detection for certain categories.
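
      For reference, the statement I have in mind (my rough paraphrase of Wolpert's result, with err_OTS the off-training-set error, D the training data, and A_1, A_2 any two learners) is that, averaged uniformly over all target functions f, every learner does equally well:

```latex
\sum_{f} \mathbb{E}\!\left[\mathrm{err}_{\mathrm{OTS}} \mid f, D, \mathcal{A}_1\right]
  \;=\;
\sum_{f} \mathbb{E}\!\left[\mathrm{err}_{\mathrm{OTS}} \mid f, D, \mathcal{A}_2\right]
```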

    8. I have heard of the "no free lunch" theorem. Even if it's hard to be statistically rigorous, I would still like it if they tried these approaches on a variety of datasets to get a sense of what works when and where.

  3. This comment has been removed by the author.

  4. I think the active learning and crowd sourcing aspects of the work are interesting, but as Jacob and Matt have mentioned, their detector pipeline is also a bit of a departure from the baselines. As usual, we are left looking at mixed results with no idea as to what about this method is working and what is not. In fact, I would say the results in Figs. 6 and 8, where the active learner is compared to MTurking random images, are worrying, since the random method performs comparably for most classes and "dramatically" better for one.

    Replies
    1. I agree. It would have been pretty interesting to see what exactly is working in this method and how much advantage it gives us. The authors have mentioned in a lot of places things like "we try to take advantage of the positives in both the methods", which makes intuitive sense. But a more quantitative argument would have shown the true value of the proposed methods.

    2. I agree with the comment regarding the supposed usefulness of the active learner. The results with their 'strong baselines' are indeed quite strong (and, as you pointed out, do way better on PASCAL chair). The passive selection also seems to do nearly as well once more annotations are added. However, they mention steeper curves, which gives us hope that we can pay Turkers less and get better results on a limited budget.

    3. @Arun re: steeper curves -- Steeper curves in the realm of small amounts of data. Do we really care about this realm? Isn't the point of leveraging lots of data via active learning or unsupervised learning to outperform the methods that use all available data, not to show performance in the case of data sparsity? That's exactly what the experiments in Figure 6 are - showing that their method does well when data is sparse. Why should we care about this realm at all? Surely I can get 6000 annotations for every conceivable object once and then never have to deal with such data sparsity ever again...

    4. Actually, I think active vs. random selection is a key problem for the entire active learning field. From my experience using active selection in a past project, the active learner only helps slightly in some cases, while most of the time random selection simply yields comparable results. This is somewhat counterintuitive, since the active learner should have a better chance of increasing the margin of benefit at each selection. I suspect that, in that case, the data representation was simply not good enough, and more careful design would have been required to realize the benefit.

    5. I have the same question about what the main factor leading to the better performance is. I think they need to compare with other part-based algorithms.

  5. The paper describes a lot of interesting ideas for active learning for training object detectors. The authors are trying to solve multiple hard problems and they succeed in getting reasonable results.

    Things that I liked:
    1. The major idea of combining active learning and crowdsourcing - I think it is an amazing way to generate good training data.
    2. The part-based SMP (sparse max pooling) model for object detection - We know it works reasonably well. Simple, and it takes advantage of existing techniques.
    3. Hyperplane-hashing for selecting active windows - Definitely contributes to the reduction in computation time.
    4. In the Results - The comparison of computation time. As Jacob mentioned, I think this is the most significant contribution of the paper - making live active learning feasible on large-scale data. The authors try to minimize the time complexity of each step of their algorithm, and from the given results it definitely looks like a tremendous improvement.

    Things that I would have liked to see:
    1. All the algorithms that they have mentioned make a lot of intuitive sense. But I would have definitely preferred to see more numbers to get a quantitative sense of how much improvement they get (like the computation time comparison).
    2. Since so many existing methods are involved, it would have been useful to see what exactly works, as opposed to just comparing the proposed algorithm with the overall existing techniques.
    3. Addressing the above two points would have explained better why this algorithm beats the "state-of-the-art" in some cases and why it doesn't in others.

    Replies
    1. Also, I think there are two points that could be improved: the first is the active sampling scheme, which could be made more accurate than the simple margin heuristic while still scaling well to detection problems; the second is the search for good candidate windows, which needs to find relevant windows more efficiently and scale better. It would also be good for the authors to show some analysis of why they chose the jumping-window approach over the sliding-window approach. (My rough mental model of jumping windows is sketched below.)
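
      For what it's worth, my mental model of the jumping-window idea is roughly the following (heavily simplified, and definitely not the authors' code): each visual word "remembers" where the object box sat relative to it in the training images, and at test time every detected word votes for candidate boxes.

```python
# Toy jumping-window proposals: visual words vote for candidate boxes
# using the box offsets they were seen with during training.
from collections import defaultdict

def build_jump_table(training_features, training_boxes):
    """training_features: per positive image, a list of (word_id, x, y);
    training_boxes: the matching ground-truth boxes (x1, y1, x2, y2)."""
    table = defaultdict(list)
    for feats, (x1, y1, x2, y2) in zip(training_features, training_boxes):
        for word_id, x, y in feats:
            # Store the box corners relative to the feature location.
            table[word_id].append((x1 - x, y1 - y, x2 - x, y2 - y))
    return table

def propose_windows(test_features, table, max_windows=500):
    votes = defaultdict(int)
    for word_id, x, y in test_features:
        for dx1, dy1, dx2, dy2 in table.get(word_id, []):
            box = (round(x + dx1), round(y + dy1),
                   round(x + dx2), round(y + dy2))
            votes[box] += 1
    # Keep only the most-voted-for boxes instead of sliding over everything.
    return sorted(votes, key=votes.get, reverse=True)[:max_windows]
```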

  6. The paper was structured well and was overall pretty easy to read. Their use of sparse coding and max pooling for detection was interesting. Max pooling over the sparse selection of visual words is interesting compared to just doing max pooling over feature dimensions; this sparse coding + pooling probably captures more salient information.
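
    To spell out the difference from a plain bag-of-words histogram, here is roughly how I picture the encoding (a crude stand-in for the sparse-coding step; the paper's exact formulation may differ):

```python
# Rough sketch of a sparse-coding + max-pooling window encoding.
import numpy as np

def encode_window(descriptors, vocabulary, n_nonzero=5):
    """descriptors: (n, d) local descriptors inside a window;
    vocabulary: (k, d) visual words. Each descriptor gets a sparse code
    over the k words; the window keeps the per-word max over descriptors."""
    n, _ = descriptors.shape
    k = vocabulary.shape[0]
    codes = np.zeros((n, k))
    for i, desc in enumerate(descriptors):
        # Cheap stand-in for sparse coding: keep only the few closest
        # words, weighted by similarity, and zero out the rest.
        dists = np.linalg.norm(vocabulary - desc, axis=1)
        nearest = np.argsort(dists)[:n_nonzero]
        weights = 1.0 / (1.0 + dists[nearest])
        codes[i, nearest] = weights / weights.sum()
    # Max pooling: each visual word keeps its strongest response anywhere
    # in the window, rather than a histogram of counts.
    return codes.max(axis=0)
```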

    Their hashing method for selecting an example for the active learner was also interesting, though it doesn't seem to be the novel part of this paper (they cite their own NIPS 2010 paper [http://www.cs.utexas.edu/~grauman/papers/jain_NIPS2010.pdf], which is also in the domain of active learning). The novel step here seems to be the application of it to a real problem in a non-'sandbox' setting. Their NIPS paper focuses on datasets instead of on an unlabeled pool of Flickr images.
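
    My loose reading of the hyperplane-hashing trick from that NIPS paper (simplified; in practice they use several short hash tables rather than one long code): database points and the query hyperplane are hashed with paired random directions so that points lying close to the hyperplane, i.e. the most uncertain windows, tend to land in the query's bucket.

```python
# Loose sketch of hyperplane hashing for margin-based selection.
import numpy as np

class HyperplaneHash:
    def __init__(self, dim, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        # Each hash bit uses a pair of random directions (u, v).
        self.U = rng.standard_normal((n_bits, dim))
        self.V = rng.standard_normal((n_bits, dim))

    def hash_point(self, x):
        # Database points: sign(u.x) paired with sign(v.x).
        bits = np.concatenate([np.sign(self.U @ x), np.sign(self.V @ x)])
        return tuple(bits.astype(int))

    def hash_query(self, w):
        # Query hyperplane with normal w: the flipped sign on the second
        # half is what makes near-boundary points likely to collide.
        bits = np.concatenate([np.sign(self.U @ w), np.sign(-self.V @ w)])
        return tuple(bits.astype(int))
```

    At query time you would then only score the windows that fall into the query hyperplane's bucket(s), instead of evaluating |w.x| for every candidate window in the pool.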

  7. I like this paper because it actually builds a system that is able to use active learning with real non-expert humans. Most active learning papers I have read do not have this component, and hence don't care about computational time. They motivate their active learning from a practical viewpoint, but completely neglect it during the evaluations.

    I am slightly worried by the fact (as Humphrey noted) that for bottle and bicycle their method is much worse than the baselines. I am not happy with the fact that the authors did not analyze this. Was it because they were using a linear SVM without a "part deformation penalty"? Was it just that the data added was too diverse?

  8. I like the paper because it aims at exploring a very important problem, i.e. active labeling for object detection, and proposes an elegant "solution" to it. Technically, there are several points that I think are quite inspiring.
    1. The combination of the SP-based method and the deformable part template method, which potentially combines the advantages of both lines of methods, i.e. robustness to internal part deformation and linear runtime complexity.

    2. Hash coding for generating root windows and top-window selection. A similar idea was adopted by the later prize-winning paper:
    http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/40814.
    Also, they have a full-blown implementation of a real system, which is very impressive.
    Some random thoughts: there seems to be a lack of analysis of the effects of continuously increasing the pool of images for learning and annotation, in terms of both performance and computation. Dean et al. proved the feasibility of using hash functions to do very large scale object detection, which may be useful here as the system grows to a very large scale.

  9. The main focus of this paper seems to be scalability, which is something I really appreciate. Creating super-clean labeled datasets for different applications requires an unjustifiable amount of effort and patience on the part of the researcher. I think anyone who has ever had to spend hours cleaning up annotations from MTurkers for their projects would agree with me. Limiting the role of annotators to answering only specific queries would make the learning quicker and less expensive. Another good thing about this paper is that a lot of the techniques it employs, such as hyperplane-hashing, the use of linear kernels, and the jumping-window selection, are geared towards reducing computational cost. I also liked the sparse max pooling, which seems to be a smarter scheme for encoding features using visual words than just taking histograms.

    One thing I didn't understand about their Keyword + Random Image baseline (which seems to be performing really well) is whether annotators had to select bounding boxes by themselves. Getting binary labels for pre-computed bounding boxes is a lot easier, and though they might have needed to use more images, they could have gotten away with paying less in this case. This would still have made their active selection more viable than the random image selection.

  10. The authors of this paper develop and deploy a system for "live learning" in which they do not constrain their system to "sandbox" datasets for obtaining novel examples. They demonstrate reasonable performance gains on a few class categories over a "passive" baseline.

    A few things to point out:
    1. Their approach depends on having semantic categories (something that can be named and queried).
    2. I want to echo Matt's point that the learning scheme they employ doesn't seem likely to scale with additional data, and therefore is not likely to yield performance gains in the realm of having LOTS of data. This is the same issue I had with the paper from Tuesday. The point of these systems is, in my understanding, to leverage massive quantities of data, not simply to show performance gains with much less supervised data than other methods. These methods for crawling and obtaining new data are only relevant when the number of examples is orders of magnitude greater than the amount of supervised data used in "state-of-the-art" methods.

    Replies
    1. Is the second problem more general?

      A classifier that can deal with noisy annotation (no pruning after Mturk annotation) is probably one that is searching a smaller hypothesis space. Such a classifier is less likely to take advantage of lots of data.

    2. I agree with the first point. Why not combine discriminativeness with active learning? This reminds me of a recent paper by Fei-Fei Li: Fine-Grained Crowdsourcing for Fine-Grained Recognition. That paper actually incorporates a lot of discriminative elements.

  11. Generally speaking, I really like this paper because it identifies several crucial limitations of current active learning and crowd-sourced labeling methods and addresses them.

    This paper builds an active learning system which requires little human intervention and is suitable for large-scale data. The idea of the part-based detector is interesting: it is amenable to linear classifiers, and hash functions enable sub-linear time lookups that map a query hyperplane to its nearest points. The part-based model increases detection performance, which ensures that even with a linear kernel the performance is still comparable with other state-of-the-art algorithms.

  12. This comment echoes the general concern that active learning isn't going very far beyond random selection. This, in my opinion, is more of a computer vision (object detection) problem. When using internet images, each image is very different in both appearance space and low-level feature space (sparse coding + pooling in this case). Thus having a label for one image gives us very little information about other images/data points. This is perhaps also why we need a lot of mixture components for good detection accuracy.

    That said, some of the problems are alleviated by using strong parametric assumptions (like a linear decision boundary). But these unfortunately have a large approximation error (the true decision boundary isn't linear, I think), and we can see scores saturating very soon as the number of data points increases.

  13. After reading this paper, I get a sense that it is not a paper with a lot of fancy ideas and equations. Rather, it is a paper with a lot of engineering which makes large scale active learning feasible. The overall method seems to be a combination of many existing methods. I'm not against this type of paper, as long as it works well. The paper is easy to read and the methods are well-described.

    I agree that the contribution of this paper is its own detector with fast training. This makes the large-scale active learning feasible. However, I'm also concerned about the learning curve of its active learning part; it doesn't improve the results that much. I remember at CVPR 2011 someone asked Grauman the same question. (Sorry, I can't remember much of her answer; that was too long ago and I wasn't paying much attention at the time.)

    And I also think their detector has problems too. Their detector performs very badly on specific objects such as bottle and chair. From their active learning experiments, it seems the limitations of their active learning may have something to do with their detector as well (e.g., the chair category).

  14. What I like about methods for collecting labeled data in general is that they have developed into a separate field (as mentioned in the paper), and that this field has to solve many more problems of a computational nature because of huge datasets and limited time and budget. Hence, more pure computer science tricks get applied, for example the hashing method in this paper.
