Tuesday, November 5, 2013

Reading for 11/7

Y. J. Lee and K. Grauman.  Object-Graphs for Context-Aware Category Discovery. CVPR 2010

And optionally:

Bryan Russell, Alexei A. Efros, Josef Sivic, Bill Freeman, Andrew Zisserman. Using Multiple Segmentations to Discover Objects and their Extent in Image Collections. CVPR 2006

36 comments:

  1. ================
    ### Summary ###
    ================
    This paper tries to address the problem of unsupervised object discovery by using context. By context, they mean the probability of each known category occurring in nearby regions, and the paper proposes a new descriptor, the object-graph, to encode this information. The novel part is that they use known categories to discover unknown ones, which is effectively using objects as features (like the idea of Object Bank) on top of purely appearance-based models. They have done quite convincing experiments to demonstrate the improvements in object discovery over other methods.
    ====================
    ### Contribution ###
    =====================
    As stated in the paper, they have two contributions:
    * A method to determine whether regions are known or unknown, based on the entropy of the probabilistic output over the known classes. Intuitively, a region is familiar if it is similar to only a few known classes, and unfamiliar if it is not clearly similar to any of them. This is, of course, a closed-world setting; for example, the hierarchical taxonomies of ImageNet or even Caltech-256 cannot be fit into it. (A small sketch of this entropy test appears right after this list.)
    * An object-graph descriptor to encode object-level context. As stated above, this is very similar to the idea of using object classifier outputs as features, and it ignores other geometric information.
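    As a rough illustration of the first contribution, here is a minimal sketch of the entropy test as I understand it from the thread: compute the entropy of a region's posterior over the known classes and call the region "unknown" when it exceeds half of the maximum possible entropy. The class count and the example posteriors below are made up for illustration.

    ```python
    import numpy as np

    def is_unknown(posteriors, frac=0.5):
        """Flag a region as 'unknown' when the entropy of its class posterior
        exceeds a fraction of the maximum possible entropy (uniform case)."""
        p = np.clip(np.asarray(posteriors, dtype=float), 1e-12, 1.0)
        entropy = -np.sum(p * np.log(p))
        max_entropy = np.log(len(p))              # entropy of the uniform distribution
        return entropy > frac * max_entropy

    # A region that looks equally like all 8 known classes is flagged "unknown";
    # one that clearly resembles a single class is not.
    print(is_unknown(np.full(8, 1 / 8)))          # True
    print(is_unknown([0.9] + [0.1 / 7] * 7))      # False
    ```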
    ====================================
    ### Points of concern / interest ###
    ====================================
    * According to some people who have worked on object discovery, all of these methods cannot go beyond a few, or at most a few tens of, object categories. Beyond that, intra-class and inter-class variation cannot be distinguished (e.g., two different views of a car can be more dissimilar than a car and a truck seen from similar viewpoints). This can be resolved by changing the notion from object "category" discovery to "subcategory" discovery, where each subcategory represents a cohesive, tight cluster inside the object category.

    * Another issue that keeps this kind of approach from working well is that they simply do not try hard enough. The original segments are generated from NCuts with multiple parameter settings, and from each image the algorithm produces only tens or hundreds of regions, which is clearly biased and not enough for further processing. I would compare many more regions in the discovery setting.

    * The object-graph descriptor works in a similar spirit to Object Bank, or even any hierarchical model (deep learning, the inference machine), and I think they should include experiments that show how performance changes with R, which can also be seen as a parameter controlling the amount of spatial information the descriptor carries.

    Replies
    1. Just talking about the concerns above:
      (1) This paper proposes that by adding object-graph descriptors they are able to keep the intra-class variance smaller than the inter-class variance, and it seems to work well. We do need some way of clustering subcategories into categories to obtain objects, and maybe this method can be applied there too.

      (2) I think one would love to produce many more segments for further processing, but that means a large number of unknown segments and therefore a giant affinity matrix for spectral clustering. It's very hard to do spectral clustering on giant matrices.

    2. The authors do not seem to explain why supervised training (SVMs) is preferred over unsupervised training. Unsupervised spectral clustering would build more tolerance into the training process, which would help the unsupervised discovery of unknown categories.

    3. @Felix: The authors motivate in paragraph 2 of the intro why completely unsupervised discovery might be too daunting a task, and why they chose to go weakly supervised.

  2. The selected paper presents an intuitively appealing way to leverage information about known object classes to discover unknown categories by modeling their combined spatial layouts. The proposed method works by finding how likely each segment is to contain a known/unknown object using entropy computed from the scores of multiple known-object classifiers. Their simple object-graph descriptor is then used to encode context in terms of relative positions of known-objects using predicted probabilities. One thing I really liked about this paper is how the authors have included a lot of detail in terms of implementation. The results section shows that their context+appearance approach performs consistently better than appearance-only discovery of unknown categories. It would be interesting to study which known classes would provide maximum information about the other unknown classes, so that we can have a principled way to select the initial classes for labeling by hand. Also I wonder if this would tell us something about some kind of bias in the dataset - whether a particular dataset always captures certain objects in certain relative locations with respect to other objects.

  3. The object-graph descriptor has two parts: H_0 and the rest.

    H_0 describes how similar the unknown segment itself is to the known objects, while the remaining entries describe how similar its neighboring regions are to the known objects. The experiments do not separate the contributions of these two parts. In my opinion, H_0 is doing most of the work, as it is a semantic representation of the unknown region; this by itself complements the appearance information well and should give a good boost to the results. Is context equally important?
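    To make the two parts concrete, here is a rough, hypothetical sketch of how an object-graph-style descriptor could be assembled from class posteriors: the region's own posterior (the H_0 part) followed by accumulated posteriors of the R nearest regions above and below. The accumulation and normalization details are my guesses, not the authors' exact construction.

    ```python
    import numpy as np

    def object_graph_descriptor(region_post, above_posts, below_posts, R=20):
        """Hypothetical object-graph-style descriptor (not the authors' exact code).

        region_post : (K,) class posteriors of the unknown region itself (the H_0 part).
        above_posts : list of (K,) posteriors for regions above it, nearest first.
        below_posts : list of (K,) posteriors for regions below it, nearest first.
        R           : number of neighbors to include on each side (the paper uses 20).
        """
        K = len(region_post)
        parts = [np.asarray(region_post, dtype=float)]
        for r in range(R):
            # Accumulate class evidence from the r+1 nearest regions on each side;
            # if a side has fewer regions, whatever is available is used.
            above = np.sum(above_posts[: r + 1], axis=0) if above_posts else np.zeros(K)
            below = np.sum(below_posts[: r + 1], axis=0) if below_posts else np.zeros(K)
            parts.extend([above, below])
        desc = np.concatenate(parts)
        return desc / max(desc.sum(), 1e-12)   # normalize so distances are comparable
    ```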

    Replies
    1. H_0 is itself obtained by running classifiers on appearance-based features of the unknown region. How does it add anything new to the appearance-based features?

    2. @Aravindh: If H_0 was very similar to any particular class, it would have been selected as "known" in the first place (I think).
      @Srivatsan: I think you meant "known" objects, and not "unknown objects".

    3. I think Aravindh has a point. They assume that more context is better and fix the number of above and below neighbors in the descriptor at 20. However, they never examine or report how well the algorithm does (specifically the spectral clustering step at the end) if we don't use any context versus if we do. I suppose that using R=20 is better than R=0, but how much better?

  4. Datasets like PASCAL VOC 2008 contain many object categories beyond the 20 picked for the challenge. I am confused as to how they are able to evaluate the discovered regions - is an algorithm that picked the category "grass" considered better than one which did not (grass is not one of the 20 PASCAL categories)?

    Replies
    1. I agree. Autonomous discovery of new categories by itself sounds really hard. Categories can be arbitrarily defined. As Xinlei noted, if we try to find too many new categories we could wind up defining excessively fine ones, such as fronts-of-cars vs. cars. I can see how spatial correlations could be useful prior knowledge, but not all object categories are going to have a strong spatial relationship with other known objects (basketballs flying through the air, balloons that could lie on a table or float around, etc.).

    2. The authors acknowledge that the classes with the greatest in-class variance get the biggest performance boost. I suspect the boost is also related to consistency in spatial relations, as cows are likely on grass, bikes are likely in an urban environment, and airplanes are found on runways.

    3. This is a well-known issue of category fragmentation in object discovery. I will cover some aspects of it tomorrow.

    4. I think I must be misunderstanding this question. I thought the authors just tried to get each cluster to be as pure as possible. Certainly, with more clusters, you'll get each one to be more pure, but you may have many excessively fine ones, as Jacob mentioned. However, I don't think this would be a big problem once the authors extend this to an active learning framework. If we label both fronts-of-cars and cars as just "cars," then the two clusters will be merged under one label - no problem.

      I didn't see anything about comparing algorithms and seeing whether one was able to discover a cluster with a particular semantic label. As Priya mentioned, this algorithm is sort of cool because it allows the machine to decide at what level to cluster things. The only criterion is that the clusters be as pure as possible.

    5. Don't images from these databases have lots of clutter? I mean, the "useful" objects take up just a portion of an image, and all the rest can easily look similar, but it's not labeled. I'm wondering how this system is supposed to aid humans in labeling, if humans don't want to label everything.

  5. Summary

    This paper proposes a novel unsupervised object discovery method that uses prior knowledge of the interaction between familiar and unfamiliar categories. There are three steps: first, detect known and unknown regions in the image; second, model the surrounding contextual information of the unknown regions using the novel object-graph descriptors; third, cluster the unfamiliar regions based on these cues.

    Contributions:
    This paper learns the relationship between the known and unknown regions in an image, unlike previous works that focus on information only within the known or the unknown parts.

    The object-graph descriptor captures the layout of the known regions around the unknown parts.

    Interest:
    The object-graph only describes the layout of the known parts surrounding the unknown parts; however, the co-occurrence of different unknown parts (for example, the same pair of unknown parts always appearing together in the same layout) would also be a good cue.

  6. I would have liked to see more about the classification entropy. Why is half the maximum entropy a good cutoff? Also, I am concerned that a bimodal posterior could pass this cutoff if sufficiently peaky.
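    A quick numeric check of the bimodal concern, assuming 8 known classes and the half-of-max-entropy cutoff: a posterior split evenly between two classes has entropy ln 2, which is below 0.5·ln 8, so such a region would be labeled "known" even though the classifiers disagree about which class it is.

    ```python
    import numpy as np

    K = 8                                   # assumed number of known classes
    bimodal = np.zeros(K)
    bimodal[:2] = 0.5                       # posterior split evenly between two classes
    nz = bimodal[bimodal > 0]
    entropy = -np.sum(nz * np.log(nz))      # ln 2 ~ 0.69
    cutoff = 0.5 * np.log(K)                # 0.5 * ln 8 ~ 1.04
    print(entropy < cutoff)                 # True -> the region counts as "known"
    ```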

    Replies
    1. I think it does not matter that much whether it is bimodal or not, since what we want, according to the authors, is for the confidence to be as concentrated as possible.

    2. I don't think a bimodal or trimodal posterior distribution would be that bad for them. It would mean that 2 (or 3) of the classifiers were pretty confident in the score. This could be something like cow vs. horse, which is hard for a classifier, especially since these are just superpixels (I believe without context). In such a case you would see a bimodal distribution, but you know it's a known category; you just aren't sure which of the known categories.

    3. I agree. It's also not clear to me that the entropy of classifier responses is even a correct proxy for uncertainty, both because using a classifier to obtain a certainty value is suspect and because there is no particular reason that an unseen example should not (by accident) give high certainty. They are asking a lot of their classifiers.

  7. The authors propose a new representation -- object-graphs -- and use it to capture context for new category discovery in an unsupervised manner. It is unsupervised in the sense that only a few categories are known in advance, and the context descriptor is built from classifier outputs for those known categories.

    I think this was a neat idea to capture context for discovery, and object-graph-like representations can be extended to other applications, where context might be important. I really liked the clarity of presentation and attention to details (algorithmic and implementation).

    Replies
    1. It does feel like they've told you enough to allow you to go implement most of it!

  8. Contributions:

    I really like the idea of unsupervised methods, compared to supervised methods, for object classification because of the lack of clarity of object boundaries. Does a shirt count as a separate object, or is it just part of the person? I think unsupervised methods can generalize beyond this ambiguity by letting the algorithm decide the cutoffs for what qualifies as an entity/object based on intra-class similarity.

    I think the proposed method for adding context is very simple and elegant. The paper models context as just a descriptor that gets factored in when computing the similarity metric. I think this is a much better way to go than trying to represent context as a graphical model with several parameters that you need to optimize over.

    Concerns:

    I'm surprised their similarity metric for comparing two object-graph descriptors does not boost the similarity values for the neighbors that are closer to the unknown object. Intuitively, context near the object is more cohesive than context that is farther away.
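    For reference, a chi-square distance between two object-graph descriptors might look like the sketch below, along with a hypothetical variant that down-weights descriptor blocks coming from farther neighbors, in the spirit of the suggestion above. The block layout (H_0 first, then alternating above/below blocks per ring, matching the hypothetical sketch under comment 3) and the decay factor are assumptions, not the paper's formulation.

    ```python
    import numpy as np

    def chi_square_dist(x, y, eps=1e-12):
        """Plain chi-square distance between two nonnegative, normalized descriptors."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        return 0.5 * np.sum((x - y) ** 2 / (x + y + eps))

    def distance_weighted_chi_square(x, y, block, R, decay=0.9, eps=1e-12):
        """Hypothetical variant that down-weights blocks from farther neighbors.
        Assumes the layout [H_0 | ring-1 above | ring-1 below | ... | ring-R below],
        where each block has `block` entries; this weighting is only a suggestion,
        not part of the paper.
        """
        x, y = np.asarray(x, float), np.asarray(y, float)
        per_bin = 0.5 * (x - y) ** 2 / (x + y + eps)
        weights = np.ones_like(per_bin)
        for r in range(R):
            # the two blocks (above and below) belonging to ring r
            weights[block * (2 * r + 1): block * (2 * r + 3)] = decay ** r
        return np.sum(weights * per_bin)
    ```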

  9. The object-graph introduced in this paper is very similar to the way object context is used in the hierarchical inference machine paper. It seems this way of exploiting context information really helps across different tasks.
    As mentioned by the authors, the object-graph by itself performs almost as well as the full model, so we can think of this paper as clustering unseen objects mainly by their relationships with known objects. I think one possible reason their appearance feature does not work that well is, as shown in the example of clustering HOG, the lack of a good metric. So I would imagine they could achieve even better results if they used appearance affinity in a proper way.

    Replies
    1. I agree that the usage of appearance affinity seems a little vague to me. As Priya also mentioned, the similarity metric does not seem to take into account the locations of the neighbors with respect to the unknown object, which is probably important when you are trying to predict a new object category.

    2. I think the performance boost in this paper comes from exploiting exactly the factors that limited previous approaches. Approaches with only appearance models fail because of substantial clutter and scenes with multiple objects; however, that clutter actually indicates strong interaction between different objects. This paper instead uses and effectively encodes this kind of information.

    3. Agreed. It seems it's the context that really does the work.

  10. The authors' novel object-graph descriptor seemed to be a very smart way to represent the relationships between many neighbors at various distances from a segment, all in a single feature vector. It makes it easy to later compute affinities (distances) between these features and then use spectral clustering. I would have liked to learn more about the spectral clustering, but they cite the NIPS 2001 paper by Andrew Ng et al., so I can't complain too much.
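    For anyone curious what the clustering stage could look like in practice, here is a hedged sketch using scikit-learn's spectral clustering on an affinity matrix built from pairwise chi-square distances between descriptors. The sigma parameter and the exponential mapping from distance to affinity are illustrative choices, not the authors' exact setup.

    ```python
    import numpy as np
    from sklearn.cluster import SpectralClustering

    def cluster_unknown_regions(descriptors, n_clusters, sigma=0.5):
        """Sketch: chi-square distances between object-graph descriptors turned
        into an affinity matrix, then scikit-learn's spectral clustering."""
        D = np.asarray(descriptors, dtype=float)         # (n_regions, dim), nonnegative
        num = (D[:, None, :] - D[None, :, :]) ** 2
        den = D[:, None, :] + D[None, :, :] + 1e-12
        dist = 0.5 * np.sum(num / den, axis=-1)          # pairwise chi-square distances
        affinity = np.exp(-dist / sigma)                 # distances -> similarities
        return SpectralClustering(n_clusters=n_clusters,
                                  affinity="precomputed").fit_predict(affinity)
    ```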

  11. The paper describes a new method, the object-graph, for discovering new object categories in images given some known categories. This seems like a reasonable step towards unsupervised object discovery, and looking at the results, the method performs reasonably well compared to existing ones. However, some concerns I had with this paper are:
    1. The similarity metric (which people have mentioned above) - have the authors used it in the best way, and is this the right metric?
    2. Why spectral clustering? What advantage does it give? As Aravindh pointed out above, if there are too many unknown objects, there will be a huge affinity matrix, which probably cannot be clustered properly!

  12. I really like the idea of this paper. It is well written and explains the idea clearly. The idea of unsupervised object discovery is intuitive and reasonable. I also had the concern about the scalability of spectral clustering, but nowadays there are subsampling methods, such as the Nystrom approximation, which can greatly reduce the computational complexity of spectral clustering.
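    A rough sketch of the Nystrom idea mentioned above, applied to spectral clustering: approximate the eigenvectors of the full affinity matrix from a random subset of landmark columns, then embed the points and run k-means. This is a generic textbook-style recipe (it skips the Laplacian normalization a careful implementation would include), not something from the paper, and the helper names are made up.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    def nystrom_spectral_cluster(X, affinity_fn, n_clusters, n_landmarks=200, seed=0):
        """Generic Nystrom-style approximation for spectral clustering.

        affinity_fn(A, B) -> affinity matrix between the rows of A and the rows of B.
        """
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X), size=min(n_landmarks, len(X)), replace=False)
        C = affinity_fn(X, X[idx])                  # (n, m) affinities to the landmarks
        W = C[idx]                                  # (m, m) landmark-landmark block
        s, U = np.linalg.eigh(W)                    # eigenvalues in ascending order
        keep = s > 1e-8
        U_full = C @ U[:, keep] / s[keep]           # Nystrom extension to all points
        emb = U_full[:, -n_clusters:]               # top approximate eigenvectors
        emb /= np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12
        return KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed).fit_predict(emb)
    ```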

  13. I also like the idea of this paper - that you can start with a few labeled classes and then cluster unknown segments based on the types of classes that tend to be nearby, in this way "learning" new classes. What I find most exciting is the idea that this can then be extended to an active learning framework, where the algorithm asks for class labels for some of the clusters and repeats the process. This algorithm seems very promising for learning new classes and classifying them with minimal human input. I agree with Priya that it's nice for the algorithm to decide what level of segmentation is appropriate - circumventing the whole semantic segmentation issue that humans have difficulty agreeing on.

    In general, this paper was well written and easy to understand, which was a relief!

  14. I am bothered that the actual segmentation step is such an afterthought, when my feeling is that the quality of the initial segmentation(s) will have a large impact on the result. There is some discussion of the properties they need from the segmentation (want coherent objects inside each segment, want 'superpixels'), but I would like more discussion of it.

    Replies
    1. Right, I feel like the algorithm might be able to *improve* its segmentation by leveraging the results from class discovery and existing objects, or try multiple initial segmentations.

      -- Matt K

  15. This paper presents a very interesting approach to a notion that is probably always at the back of my mind - forcing every element in the visual world to be one of < N classes is fundamentally flawed. Their approach seems scalable enough, and feedback from annotators seems like a reasonable way to improve the system. However, this human-reliant feedback still seems tedious.

    Modeling the context between known and unknown objects is useful, but is it enough? Could they have gone further? Perhaps this is where functional attribute labeling would be useful.

  16. This paper is about modeling the relationship between known object classes and unknown object classes to inform the discovery of new classes. Their approach relies on classifying superpixels as either "known" or "unknown" based on the sparsity of SVM classifier confidences. Then they build a descriptor from nearby superpixels both "above" and "below" the region in question. Using this descriptor, they are able to add "context" to their description of an object in addition to appearance. For instance, cars have both a certain appearance *and* are likely to appear above roads but below trees in the image. Now, with newly discovered objects in the scene, they are able to mine new object classes in an unsupervised way by exploiting context.

    The authors admit that it is difficult to quantify their results (as is usually the case with unsupervised clustering methods), but they provide compelling examples of discovered object classes. It reminds me of work by Alvaro Colletti on unsupervised object discovery in 3D. The paper also reminds me of the hierarchical clustering segmentation paper we looked at quite a long time ago.

    -- Matt Klingensmith

  17. I think the paper chose the chi-square distance because it generally works well in computer vision. I agree that with more learning on the distance-metric side the performance would be better, but the major boost comes from the second-layer context component.
