Tuesday, November 5, 2013

Reading for 11/7

Y. J. Lee and K. Grauman.  Object-Graphs for Context-Aware Category Discovery. CVPR 2010

And optionally:

Bryan Russell, Alexei A. Efros, Josef Sivic, Bill Freeman, Andrew Zisserman. Using Multiple Segmentations to Discover Objects and their Extent in Image Collections. CVPR 2006

36 comments:

  1. ================
    ### Summary ###
    ================
    This paper tries to address the problem of unsupervised object discovery by using context. By context, they mean the probability of each known category occurring in nearby regions, and the paper proposes a new descriptor, the object-graph, to encode this information. The novel part is that they use known categories to discover unknown ones, which is effectively using objects as features (like the idea of Object Bank) on top of purely appearance-based models. They have done quite convincing experiments to demonstrate the improvements in object discovery over other methods.
    ====================
    ### Contribution ###
    =====================
    As stated in the paper, they have two contributions:
    * A method to determine whether regions are known or unknown, based on the entropy of the probabilistic output over the known classes. Intuitively, a region is familiar if it is similar to only a few known classes, and unfamiliar if it is not clearly similar to any of them. This is, of course, a closed-world setting; for example, the hierarchical taxonomies of ImageNet or even Caltech-256 cannot be fit into it. (A small sketch of this entropy test appears right after this list.)
    * An object-graph descriptor to encode object-level context. As stated above, this is very similar to the idea of using object classifier outputs as features, and it ignores other geometric information.
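    As a rough illustration of the first contribution, here is a minimal sketch of the entropy test as I understand it from the thread: compute the entropy of a region's posterior over the known classes and call the region "unknown" when it exceeds half of the maximum possible entropy. The class count and the example posteriors below are made up for illustration.

    ```python
    import numpy as np

    def is_unknown(posteriors, frac=0.5):
        """Flag a region as 'unknown' when the entropy of its class posterior
        exceeds a fraction of the maximum possible entropy (uniform case)."""
        p = np.clip(np.asarray(posteriors, dtype=float), 1e-12, 1.0)
        entropy = -np.sum(p * np.log(p))
        max_entropy = np.log(len(p))              # entropy of the uniform distribution
        return entropy > frac * max_entropy

    # A region that looks equally like all 8 known classes is flagged "unknown";
    # one that clearly resembles a single class is not.
    print(is_unknown(np.full(8, 1 / 8)))          # True
    print(is_unknown([0.9] + [0.1 / 7] * 7))      # False
    ```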
    ====================================
    ### Points of concern / interest ###
    ====================================
    * According to some people who have worked on object discovery, all of these methods cannot go beyond a few, or at most a few tens of, object categories. Beyond that, intra-class and inter-class variation cannot be distinguished (e.g., two different views of a car can be more dissimilar than a car and a truck seen from similar viewpoints). This can be resolved by changing the notion from object "category" discovery to "subcategory" discovery, where each subcategory represents a cohesive, tight cluster inside the object category.

    * Another issue that keeps this kind of approach from working well is that they simply do not try hard enough. The original segments are generated from NCuts with multiple parameter settings, and from each image the algorithm produces only tens or hundreds of regions, which is clearly biased and not enough for further processing. I would compare many more regions in the discovery setting.

    * The object-graph descriptor works in a similar spirit to Object Bank, or even any hierarchical model (deep learning, the inference machine), and I think they should include experiments that show how performance changes with R, which can also be seen as a parameter controlling the amount of spatial information the descriptor carries.

    Replies
    1. Just talking about the concerns above:
      (1) This paper proposes that by adding object-graph descriptors they are able to keep the intra-class variance smaller than the inter-class variance, and it seems to work well. We do need some way of clustering subcategories into categories to obtain objects, and maybe this method can be applied there too.

      (2) I think one would love to produce many more segments for further processing, but that means a large number of unknown segments and therefore a giant affinity matrix for spectral clustering. It's very hard to do spectral clustering on giant matrices.

    2. The authors do not seem to explain why supervised training (SVMs) is preferred over unsupervised training. Unsupervised spectral clustering would build more tolerance into the training process, which would help the unsupervised discovery of unknown categories.

    3. @Felix: The authors motivate in paragraph 2 of the intro why completely unsupervised discovery might be too daunting a task, and why they chose to go weakly supervised.

  2. The selected paper presents an intuitively appealing way to leverage information about known object classes to discover unknown categories by modeling their combined spatial layouts. The proposed method works by finding how likely each segment is to contain a known/unknown object using entropy computed from the scores of multiple known-object classifiers. Their simple object-graph descriptor is then used to encode context in terms of relative positions of known-objects using predicted probabilities. One thing I really liked about this paper is how the authors have included a lot of detail in terms of implementation. The results section shows that their context+appearance approach performs consistently better than appearance-only discovery of unknown categories. It would be interesting to study which known classes would provide maximum information about the other unknown classes, so that we can have a principled way to select the initial classes for labeling by hand. Also I wonder if this would tell us something about some kind of bias in the dataset - whether a particular dataset always captures certain objects in certain relative locations with respect to other objects.

  3. The object-graph descriptor has two parts: H_0 and the rest.

    H_0 describes how similar the unknown segment itself is to the known objects, while the remaining entries describe how similar its neighboring regions are to the known objects. The experiments do not separate the contributions of these two parts. In my opinion, H_0 is doing most of the work, as it is a semantic representation of the unknown region; this by itself complements the appearance information well and should give a good boost to the results. Is context equally important?
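    To make the two parts concrete, here is a rough, hypothetical sketch of how an object-graph-style descriptor could be assembled from class posteriors: the region's own posterior (the H_0 part) followed by accumulated posteriors of the R nearest regions above and below. The accumulation and normalization details are my guesses, not the authors' exact construction.

    ```python
    import numpy as np

    def object_graph_descriptor(region_post, above_posts, below_posts, R=20):
        """Hypothetical object-graph-style descriptor (not the authors' exact code).

        region_post : (K,) class posteriors of the unknown region itself (the H_0 part).
        above_posts : list of (K,) posteriors for regions above it, nearest first.
        below_posts : list of (K,) posteriors for regions below it, nearest first.
        R           : number of neighbors to include on each side (the paper uses 20).
        """
        K = len(region_post)
        parts = [np.asarray(region_post, dtype=float)]
        for r in range(R):
            # Accumulate class evidence from the r+1 nearest regions on each side;
            # if a side has fewer regions, whatever is available is used.
            above = np.sum(above_posts[: r + 1], axis=0) if above_posts else np.zeros(K)
            below = np.sum(below_posts[: r + 1], axis=0) if below_posts else np.zeros(K)
            parts.extend([above, below])
        desc = np.concatenate(parts)
        return desc / max(desc.sum(), 1e-12)   # normalize so distances are comparable
    ```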

    Replies
    1. H_0 is itself obtained by running classifiers on appearance-based features of the unknown region. How does it add anything new to the appearance-based features?

    2. @Aravindh: If H_0 was very similar to any particular class, it would have been selected as "known" in the first place (I think).
      @Srivatsan: I think you meant "known" objects, and not "unknown objects".

    3. I think Aravindh has a point. They assume that more context is better and fix the number of above and below neighbors in the descriptor at 20. However, they never examine or report how well the algorithm does (specifically the spectral clustering step at the end) if we don't use any context versus if we do. I suppose that using R=20 is better than R=0, but how much better?

  4. Datasets like PASCAL VOC 2008 contain many object categories beyond the 20 picked for the challenge. I am confused as to how they are able to evaluate the discovered regions - is an algorithm that picked the category "grass" considered better than one which did not (grass is not one of the 20 PASCAL categories)?

    Replies
    1. I agree. Autonomous discovery of new categories by itself sounds really hard. Categories can be arbitrarily defined. As Xinlei noted, if we try to find too many new categories we could wind up defining excessively fine ones, such as fronts-of-cars vs. cars. I can see how spatial correlations could be useful prior knowledge, but not all object categories are going to have a strong spatial relationship with other known objects (basketballs flying through the air, balloons that could lie on a table or float around, etc.).

    2. The authors acknowledge that the classes with the greatest in-class variance get the biggest performance boost. I suspect the boost is also related to consistency in spatial relations, as cows are likely on grass, bikes are likely in an urban environment, and airplanes are found on runways.

    3. This is a well-known issue of category fragmentation in object discovery. I will cover some aspects of it tomorrow.

    4. I think I must be misunderstanding this question. I thought the authors just tried to get each cluster to be as pure as possible. Certainly, with more clusters, you'll get each one to be more pure, but you may have many excessively fine ones, as Jacob mentioned. However, I don't think this would be a big problem once the authors extend this to an active learning framework. If we label both fronts-of-cars and cars as just "cars," then the two clusters will be merged under one label - no problem.

      I didn't see anything about comparing algorithms and seeing whether one was able to discover a cluster with a particular semantic label. As Priya mentioned, this algorithm is sort of cool because it allows the machine to decide at what level to cluster things. The only criterion is that the clusters be as pure as possible.

    5. Don't images from these databases have lots of clutter? I mean, the "useful" objects take up just a portion of an image, and all the rest can easily look similar, but it's not labeled. I'm wondering how this system is supposed to aid humans in labeling, if humans don't want to label everything.

  5. Summary

    This paper proposes a novel unsupervised object discovery method that uses prior knowledge of the interaction between familiar and unfamiliar categories. There are three steps: first, detect known and unknown regions in the image; second, model the surrounding contextual information of the unknown regions using the novel object-graph descriptors; third, cluster the unfamiliar regions based on these cues.

    Contributions:
    This paper learns the relationship between the known and unknown regions in an image, unlike previous works that focus on information only within the known or the unknown parts.

    The object-graph descriptor captures the layout of the known regions around the unknown parts.

    Interest:
    The object-graph only describes the layout of the known parts surrounding the unknown parts; however, the co-occurrence of different unknown parts (for example, the same pair of unknown parts always appearing together in the same layout) would also be a good cue.

  6. I would have liked to see more about the classification entropy. Why is half the maximum entropy a good cutoff? Also, I am concerned that a bimodal posterior could pass this cutoff if sufficiently peaky.
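    A quick numeric check of the bimodal concern, assuming 8 known classes and the half-of-max-entropy cutoff: a posterior split evenly between two classes has entropy ln 2, which is below 0.5·ln 8, so such a region would be labeled "known" even though the classifiers disagree about which class it is.

    ```python
    import numpy as np

    K = 8                                   # assumed number of known classes
    bimodal = np.zeros(K)
    bimodal[:2] = 0.5                       # posterior split evenly between two classes
    nz = bimodal[bimodal > 0]
    entropy = -np.sum(nz * np.log(nz))      # ln 2 ~ 0.69
    cutoff = 0.5 * np.log(K)                # 0.5 * ln 8 ~ 1.04
    print(entropy < cutoff)                 # True -> the region counts as "known"
    ```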

    Replies
    1. I think it does not matter that much whether it is bimodal or not, since what we want, according to the authors, is for the confidence to be as concentrated as possible.

    2. I don't think a bimodal or trimodal posterior distribution would be that bad for them. It would mean that 2 (or 3) of the classifiers were pretty confident in the score. This could be something like cow vs. horse, which is hard for a classifier, especially since these are just superpixels (I believe without context). In such a case you would see a bimodal distribution, but you know it's a known category; you just aren't sure which of the known categories.

    3. I agree. It's also not clear to me that the entropy of classifier responses is even a correct proxy for uncertainty, both because using a classifier to obtain a certainty value is suspect and because there is no particular reason that an unseen example should not (by accident) give high certainty. They are asking a lot of their classifiers.

  7. The authors propose a new representation -- object-graphs -- and use it to capture context for new category discovery in an unsupervised manner. It is unsupervised in the sense that only a few categories are known in advance, and the context descriptor is built from classifier outputs for those known categories.

    I think this was a neat idea to capture context for discovery, and object-graph-like representations can be extended to other applications, where context might be important. I really liked the clarity of presentation and attention to details (algorithmic and implementation).

    Replies
    1. It does feel like they've told you enough to allow you to go implement most of it!

  8. Contributions:

    I really like the idea of unsupervised methods, compared to supervised methods, for object classification because of the lack of clarity of object boundaries. Does a shirt count as a separate object, or is it just part of the person? I think unsupervised methods can generalize beyond this ambiguity by letting the algorithm decide the cutoffs for what qualifies as an entity/object based on intra-class similarity.

    I think the proposed method for adding context is very simple and elegant. The paper models context as just a descriptor that gets factored in when computing the similarity metric. I think this is a much better way to go than trying to represent context as a graphical model with several parameters that you need to optimize over.

    Concerns:

    I'm surprised their similarity metric for comparing two object-graph descriptors does not boost the similarity values for the neighbors that are closer to the unknown object. Intuitively, context near the object is more cohesive than context that is farther away.
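    For reference, a chi-square distance between two object-graph descriptors might look like the sketch below, along with a hypothetical variant that down-weights descriptor blocks coming from farther neighbors, in the spirit of the suggestion above. The block layout (H_0 first, then alternating above/below blocks per ring, matching the hypothetical sketch under comment 3) and the decay factor are assumptions, not the paper's formulation.

    ```python
    import numpy as np

    def chi_square_dist(x, y, eps=1e-12):
        """Plain chi-square distance between two nonnegative, normalized descriptors."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        return 0.5 * np.sum((x - y) ** 2 / (x + y + eps))

    def distance_weighted_chi_square(x, y, block, R, decay=0.9, eps=1e-12):
        """Hypothetical variant that down-weights blocks from farther neighbors.
        Assumes the layout [H_0 | ring-1 above | ring-1 below | ... | ring-R below],
        where each block has `block` entries; this weighting is only a suggestion,
        not part of the paper.
        """
        x, y = np.asarray(x, float), np.asarray(y, float)
        per_bin = 0.5 * (x - y) ** 2 / (x + y + eps)
        weights = np.ones_like(per_bin)
        for r in range(R):
            # the two blocks (above and below) belonging to ring r
            weights[block * (2 * r + 1): block * (2 * r + 3)] = decay ** r
        return np.sum(weights * per_bin)
    ```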

  9. The object-graph introduced in this paper is very similar to the way object context is used in the hierarchical inference machine paper. It seems this way of exploiting context information really helps across different tasks.
    As mentioned by the authors, the object-graph by itself performs almost as well as the full model, so we can think of this paper as clustering unseen objects mainly by their relationships with known objects. I think one possible reason their appearance feature does not work that well is, as shown in the example of clustering HOG, the lack of a good metric. So I would imagine they could achieve even better results if they used appearance affinity in a proper way.

    Replies
    1. I agree that the usage of appearance affinity seems a little vague to me. As Priya also mentioned, the similarity metric does not seem to take into account the locations of the neighbors with respect to the unknown object, which is probably important when you are trying to predict a new object category.

    2. I think the performance boost in this paper comes from exploiting exactly the factors that limited previous approaches. Approaches with only appearance models fail because of substantial clutter and scenes with multiple objects; however, that clutter actually indicates strong interaction between different objects. This paper instead uses and effectively encodes this kind of information.

    3. Agreed. It seems it's the context that really does the work.

  10. The authors' novel object-graph descriptor seemed to be a very smart way to represent the relationships between many neighbors at various distances from a segment, all in a single feature vector. It makes it easy to later compute affinities (distances) between these features and then use spectral clustering. I would have liked to learn more about the spectral clustering, but they cite the NIPS 2001 paper by Andrew Ng et al., so I can't complain too much.
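    For anyone curious what the clustering stage could look like in practice, here is a hedged sketch using scikit-learn's spectral clustering on an affinity matrix built from pairwise chi-square distances between descriptors. The sigma parameter and the exponential mapping from distance to affinity are illustrative choices, not the authors' exact setup.

    ```python
    import numpy as np
    from sklearn.cluster import SpectralClustering

    def cluster_unknown_regions(descriptors, n_clusters, sigma=0.5):
        """Sketch: chi-square distances between object-graph descriptors turned
        into an affinity matrix, then scikit-learn's spectral clustering."""
        D = np.asarray(descriptors, dtype=float)         # (n_regions, dim), nonnegative
        num = (D[:, None, :] - D[None, :, :]) ** 2
        den = D[:, None, :] + D[None, :, :] + 1e-12
        dist = 0.5 * np.sum(num / den, axis=-1)          # pairwise chi-square distances
        affinity = np.exp(-dist / sigma)                 # distances -> similarities
        return SpectralClustering(n_clusters=n_clusters,
                                  affinity="precomputed").fit_predict(affinity)
    ```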

  11. The paper describes a new method, the object-graph, for discovering new object categories in images given some known categories. This seems like a reasonable step towards unsupervised object discovery, and looking at the results, the method performs reasonably well compared to existing ones. However, some concerns I had with this paper are:
    1. The similarity metric (which people have mentioned above) - have the authors used it in the best way, and is this the right metric?
    2. Why spectral clustering? What advantage does it give? As Aravindh pointed out above, if there are too many unknown objects, there will be a huge affinity matrix, which probably cannot be clustered properly!

  12. I really like the idea of this paper. It is well written and explains the idea clearly. The idea of unsupervised object discovery is intuitive and reasonable. I also had the concern about the scalability of spectral clustering, but nowadays there are subsampling methods, such as the Nystrom approximation, which can greatly reduce the computational complexity of spectral clustering.
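    A rough sketch of the Nystrom idea mentioned above, applied to spectral clustering: approximate the eigenvectors of the full affinity matrix from a random subset of landmark columns, then embed the points and run k-means. This is a generic textbook-style recipe (it skips the Laplacian normalization a careful implementation would include), not something from the paper, and the helper names are made up.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    def nystrom_spectral_cluster(X, affinity_fn, n_clusters, n_landmarks=200, seed=0):
        """Generic Nystrom-style approximation for spectral clustering.

        affinity_fn(A, B) -> affinity matrix between the rows of A and the rows of B.
        """
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X), size=min(n_landmarks, len(X)), replace=False)
        C = affinity_fn(X, X[idx])                  # (n, m) affinities to the landmarks
        W = C[idx]                                  # (m, m) landmark-landmark block
        s, U = np.linalg.eigh(W)                    # eigenvalues in ascending order
        keep = s > 1e-8
        U_full = C @ U[:, keep] / s[keep]           # Nystrom extension to all points
        emb = U_full[:, -n_clusters:]               # top approximate eigenvectors
        emb /= np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12
        return KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed).fit_predict(emb)
    ```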

  13. I also like the idea of this paper - that you can start with a few labeled classes and then cluster unknown segments based on the types of classes that tend to be nearby, in this way "learning" new classes. What I find most exciting is the idea that this can then be extended to an active learning framework, where the algorithm asks for class labels for some of the clusters and repeats the process. This algorithm seems very promising for learning new classes and classifying them with minimal human input. I agree with Priya that it's nice for the algorithm to decide what level of segmentation is appropriate - circumventing the whole semantic segmentation issue that humans have difficulty agreeing on.

    In general, this paper was well written and easy to understand, which was a relief!

  14. I am bothered that the actual segmentation step is such an afterthought, when my feeling is that the quality of the initial segmentation(s) will have a large impact on the result. There is some discussion of the properties they need from the segmentation (want coherent objects inside each segment, want 'superpixels'), but I would like more discussion of it.

    Replies
    1. Right, I feel like the algorithm might be able to *improve* its segmentation by leveraging the results from class discovery and existing objects, or try multiple initial segmentations.

      -- Matt K

  15. This paper presents a very interesting approach to a notion that is probably always at the back of my mind - forcing every element in the visual world to be one of < N classes is fundamentally flawed. Their approach seems scalable enough, and feedback from annotators seems like a reasonable way to improve the system. However, this human-reliant feedback still seems tedious.

    Modeling the context between known and unknown objects is useful, but is it enough? Could they have gone further? Perhaps this is where functional attribute labeling would be useful.

  16. This paper is about modeling the relationship between known object classes and unknown object classes to inform the discovery of new classes. Their approach relies on classifying superpixels as either "known" or "unknown" based on the sparsity of SVM classifier confidences. Then they build a descriptor from nearby superpixels both "above" and "below" the region in question. Using this descriptor, they are able to add "context" to their description of an object in addition to appearance. For instance, cars have both a certain appearance *and* are likely to appear above roads but below trees in the image. Now, with newly discovered objects in the scene, they are able to mine new object classes in an unsupervised way by exploiting context.

    The authors admit that it is difficult to quantify their results (as is usually the case with unsupervised clustering methods), but they provide compelling examples of discovered object classes. It reminds me of work by Alvaro Colletti on unsupervised object discovery in 3D. The paper also reminds me of the hierarchical clustering segmentation paper we looked at quite a long time ago.

    -- Matt Klingensmith

  17. I think the paper chose the chi-square distance because it generally works well in computer vision. I agree that with more learning on the distance-metric side the performance would be better, but the major boost comes from the second-layer context component.
