Comments on 16-824: Learning-based Methods in Vision (F'13): Reading for 11/7 (blog by Carl Doersch)

Anonymous (2013-11-07 10:36):
Don't images from databases have lots of clutter? I mean the "useful" objects take up just a portion of an image, and all the rest can easily look similar, but it isn't labeled. I'm wondering how this system is supposed to aid humans in labeling, if humans don't want to label everything.

Anonymous (2013-11-07 10:22):
I think the paper chose chi-square distance because it generally works fine in computer vision. I agree that with more learning on the distance-metric side, the performance would be better, but the major boost comes from the second-layer context component.

Anonymous (2013-11-07 10:15):
Agree.
It seems it's the context that really works.

Anonymous (2013-11-07 08:28):
Right, I feel like the algorithm might be able to *improve* its segmentation by leveraging the results from class discovery and existing objects, or try multiple initial segmentations.
-- Matt K

Anonymous (2013-11-07 08:25):
This paper is about modeling the relationship between known and unknown object classes to inform the discovery of new classes. Their approach relies on classifying superpixels as either "known" or "unknown" based on the sparsity of SVM classifier confidences. Then, they learn a descriptor based on nearby superpixels both "above" and "below" the learned class. Using this descriptor, they are able to add "context" to their description of an object in addition to appearance. For instance, cars have a certain appearance, *and* are likely to appear above roads but below trees in the image. With newly discovered objects in the scene, they can then mine new object classes in an unsupervised way by exploiting context.

The authors admit that it is difficult to quantify their results (as is usually the case with unsupervised clustering methods), but they provide compelling examples of discovered object classes. It reminds me of work by Alvaro Colletti on unsupervised object discovery in 3D.
The paper also reminds me of the hierarchical clustering segmentation paper we looked at quite a while ago.
-- Matt Klingensmith

Anonymous (2013-11-07 06:02):
This paper presents a very interesting approach to a notion that is probably always at the back of my mind: forcing every element in the visual world to be one of < N classes is fundamentally flawed. Their approach seems scalable enough, and feedback from annotators seems like a reasonable way to improve the system. However, this human-reliant feedback still seems tedious.

Modeling the context between known and unknown objects is useful, but is it enough? Could they have gone further? Perhaps this is where functional attribute labeling would be useful.

Mike McCann (2013-11-07 05:58):
I agree. It's also not clear to me that the entropy of classifier responses is even a correct proxy for uncertainty, both because using a classifier to obtain a certainty value is suspect and because there is no particular reason that an unseen example should not (by accident) give high certainty. They are asking a lot of their classifiers.

Mike McCann (2013-11-07 05:53):
I am bothered that the actual segmentation step is such an afterthought, when my feeling is that the quality of the initial segmentation(s) will have a large impact on the result.
There is some discussion of the properties they need from the segmentation (coherent objects inside each segment, 'superpixels'), but I would like more discussion of it.

Yuxiong Wang (2013-11-07 05:48):
I think the performance boost in this paper comes from exploiting the limiting factors of previous approaches. Approaches based only on appearance models fail in the presence of substantial clutter and scenes with multiple objects. However, the clutter in some sense actually indicates strong interaction between different objects. This paper instead uses and effectively encodes this kind of information.

Anonymous (2013-11-07 04:52):
I think I must be misunderstanding this question. I thought the authors just tried to get each cluster to be as pure as possible. Certainly, with more clusters you'll get each one to be purer, but you may have many excessively fine ones, as Jacob mentioned. However, I don't think this would be a big problem once the authors extended this to an active learning framework. If we label both fronts-of-cars and cars as just "cars," then the two clusters will be merged under one label - no problem.

I didn't see anything about comparing algorithms and seeing whether one was able to discover a cluster with a particular semantic label. As Priya mentioned, this algorithm is sort of cool because it allows the machine to decide what level to cluster things at.
The only criterion is that the clusters be as pure as possible.

Anonymous (2013-11-07 04:47):
I also like the idea of this paper - that you can start with a few labeled classes and then cluster unknown segments based on the types of classes that tend to be nearby, in this way "learning" new classes. What I find most exciting is the idea that this can then be extended to an active learning framework, where the algorithm asks for class labels for some of the clusters and repeats the process. This algorithm seems very promising for learning new classes and classifying them with minimal human input. I agree with Priya that it's nice for the algorithm to decide what level of segmentation is appropriate - circumventing the whole semantic segmentation issue that humans have difficulty agreeing on.

In general, this paper was well written and easy to understand, which was a relief!

Anonymous (2013-11-07 02:13):
I really like the idea of this paper. It is well written and explains the idea clearly. The idea of using unsupervised object detection is intuitive and reasonable. I also had the concern about scalability for spectral clustering.
But nowadays there are subsampling methods, such as the Nyström approximation, which can greatly reduce the computational complexity of spectral clustering.

Divya Hariharan (2013-11-07 00:57):
The paper describes a new method called object-graphs to predict new categories of objects in images with some known categories. This seems like a reasonable step towards unsupervised object discovery, and looking at the results, the method performs reasonably well compared to existing ones. However, some concerns that I had with this paper are:
1. The similarity metric (which people mentioned above) - have the authors used it in the best way, and is this the right metric?
2. Why spectral clustering? What advantage does it give? As Aravindh pointed out above, if there are too many unknown objects, there will be a huge affinity matrix, which probably cannot be clustered properly!

Divya Hariharan (2013-11-07 00:49):
I agree that the usage of appearance affinity does seem a little vague to me.
As Priya also mentioned, the similarity metric does not seem to take into account anything about the location of the neighbors with respect to the unknown object, which is probably important when you are trying to predict a new object category.

Arun (2013-11-07 00:19):
I don't know that a bimodal or trimodal posterior distribution would be that bad for them. It would mean that 2 (or 3) of the classifiers were pretty confident in the score. This could be something like cow vs. horse, which is hard for a classifier, especially since these are just superpixels (I believe without context). In such a case you would see a bimodal distribution, but you know it's a known category; you just aren't sure which of the known categories.

Anonymous (2013-11-07 00:18):
I think it does not matter that much whether it is bimodal or not, since what we want, according to the authors, is confidence that is as concentrated as possible.
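The concentration-of-confidence point in the last two comments can be made concrete with a small sketch (all posteriors and the threshold here are invented, not from the paper): a peaked posterior has low entropy, a uniform one maximal entropy, and a bimodal cow-vs-horse posterior sits in between, so a plain entropy test would call it "unknown" even though it is a known category - exactly Arun's worry.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a discrete posterior over known classes."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())

# Hypothetical posteriors over 4 known classes for one superpixel.
peaked  = [0.94, 0.02, 0.02, 0.02]   # confidently "known"
bimodal = [0.48, 0.48, 0.02, 0.02]   # cow vs. horse: known, but ambiguous *which*
uniform = [0.25, 0.25, 0.25, 0.25]   # genuinely novel / "unknown"

for name, p in [("peaked", peaked), ("bimodal", bimodal), ("uniform", uniform)]:
    print(name, round(entropy(p), 3))

# A made-up threshold lumps "bimodal" in with "uniform" as unknown, even
# though a bimodal posterior really signals a hard *known* case.
THRESH = 1.0  # assumption, not a value from the paper
print([name for name, p in [("bimodal", bimodal), ("uniform", uniform)]
       if entropy(p) > THRESH])
```

Whether that failure mode matters in practice depends on how often known classes are genuinely confusable at the superpixel level, which the thread above debates.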
Arun (2013-11-07 00:17):
It does feel like they've told you enough to let you go implement most of it!

Arun (2013-11-07 00:16):
The authors' novel object-graph descriptor seemed a very smart way to represent the relationships between many neighbors at various distances from a segment, all in a single feature vector. It made it easy to later compute affinities (distances) between these features and then use spectral clustering. I would have liked to learn more about the spectral clustering, but they cite Andrew Ng et al., NIPS 2001, so I can't complain too much.

Anonymous (2013-11-07 00:08):
The object-graph introduced in this paper is very similar to the way object context is used in the hierarchical inference machine paper. It seems this way of exploiting context information really helps across different tasks.
As mentioned by the authors, the object graph itself performs almost as well as the full model. So we can think of this paper as clustering unseen objects mainly by their relationships with known objects. I think one possible reason their appearance feature does not work that well is, as shown in the HOG clustering example, the lack of a good metric.
So, I would imagine they could achieve even better results if they used appearance affinity in a proper way.

Priya Deo (2013-11-06 23:32):
Contributions:

I really like the idea of unsupervised methods over supervised methods for object classification because of the lack of clarity of object boundaries. Does a shirt count as a separate object, or is it just part of the person? I think unsupervised methods can generalize beyond this ambiguity by letting the algorithm decide the cutoffs for what qualifies as an entity/object based on intra-class similarity.

I think the proposed method for adding context is very simple and elegant. The paper models context as just a descriptor that gets factored into the similarity metric. I think this is a much better way to go than trying to represent context as a graphical model with several parameters that you need to optimize over.

Concerns:

I'm surprised their similarity metric for comparing two object-graph descriptors does not boost the similarity values for neighbors that are closer to the unknown object. Intuitively, context near the object is more cohesive than context that is farther away.

Priya Deo (2013-11-06 23:02):
I think Aravindh has a point. They assume that more context is better, and fix the number of above & below neighbors added to the descriptor at 20. However, they never examine/report how well the algorithm does (specifically for the spectral clustering step at the end) without any context versus with it. I suppose that using R=20 is better than using R=0, but how much better?
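Priya's weighting concern can be illustrated with a toy sketch (this is not the authors' code; the posteriors, R, and the decay rate are all invented): a descriptor that stacks the class posteriors of the R nearest superpixels above and below a segment, a plain chi-square comparison, and a variant that down-weights far neighbors so nearby context counts more.

```python
import numpy as np

def object_graph_descriptor(above, below, R=3):
    """Stack class posteriors of the R nearest superpixels above and below
    a segment into one vector (a toy version of the object-graph idea)."""
    pad = lambda rows: (list(rows) + [np.zeros_like(rows[0])] * R)[:R]
    return np.concatenate([np.concatenate(pad(above)), np.concatenate(pad(below))])

def chi2(x, y, eps=1e-10):
    """Plain chi-square distance between two descriptors."""
    return 0.5 * np.sum((x - y) ** 2 / (x + y + eps))

def weighted_chi2(x, y, R, n_classes, decay=0.5, eps=1e-10):
    """Variant in the spirit of Priya's suggestion: down-weight far neighbors."""
    w_rank = decay ** np.arange(R)                 # weight per neighbor rank
    w = np.tile(np.repeat(w_rank, n_classes), 2)   # "above" block + "below" block
    return 0.5 * np.sum(w * (x - y) ** 2 / (x + y + eps))

# Toy one-hot posteriors over 3 known classes (sky, road, tree) - invented.
sky, road, tree = np.eye(3)
d1 = object_graph_descriptor(above=[sky, sky, tree], below=[road, road, road])
d2 = object_graph_descriptor(above=[sky, tree, tree], below=[road, road, sky])
print(chi2(d1, d2), weighted_chi2(d1, d2, R=3, n_classes=3))
```

Here d1 and d2 disagree at one nearby "above" neighbor and one distant "below" neighbor; the weighted distance penalizes the nearby disagreement more, which is the behavior Priya expected and the paper's unweighted metric lacks.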
Abhinav Shrivastava (2013-11-06 22:19):
The authors propose a new representation -- object-graphs -- and use it to capture context for new-category discovery in an unsupervised manner. It's unsupervised in the sense that they already have a few known categories, and the context descriptor is built from classifier outputs for those known categories.

I think this was a neat idea for capturing context for discovery, and object-graph-like representations can be extended to other applications where context might be important. I really liked the clarity of presentation and the attention to detail (algorithmic and implementation).

Abhinav Shrivastava (2013-11-06 22:10):
@Aravindh: If H_0 were very similar to any particular class, it would have been selected as "known" in the first place (I think).
@Srivatsan: I think you meant "known" objects, not "unknown" objects.

Abhinav Shrivastava (2013-11-06 22:08):
@Felix: The authors motivate in para 2 of the intro why completely unsupervised discovery might be too daunting a task, and why they chose to go weakly supervised.

Srivatsan Varadharajan (2013-11-06 21:57):
H_0 is itself
obtained by running classifiers on appearance-based features of the unknown object. How does it add anything new to the appearance-based features?
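Several comments above ask about the spectral clustering stage (the Ng et al. NIPS 2001 recipe, the size of the affinity matrix, and the Nyström approximation that would subsample it). A minimal numpy-only sketch of that stage, with invented toy descriptors and an invented affinity bandwidth, and a 2-way spectral bisection standing in for the k-way clustering the paper actually uses:

```python
import numpy as np

def chi2(x, y, eps=1e-10):
    """Chi-square distance between two descriptors."""
    return 0.5 * np.sum((x - y) ** 2 / (x + y + eps))

def spectral_bisect(X, sigma=1.0):
    """Split rows of X into 2 clusters: chi-square affinity -> normalized
    Laplacian -> sign of the second-smallest eigenvector (spectral bisection).
    A 2-way stand-in for the k-way spectral clustering cited in the thread."""
    n = len(X)
    A = np.array([[np.exp(-chi2(X[i], X[j]) / sigma) for j in range(n)]
                  for i in range(n)])
    d = A.sum(axis=1)
    L = np.eye(n) - A / np.sqrt(np.outer(d, d))   # normalized Laplacian
    _, vecs = np.linalg.eigh(L)                   # eigenvalues ascending
    fiedler = vecs[:, 1]
    return (fiedler > 0).astype(int)

# Toy "unknown segment" descriptors: two obvious context groups (invented).
X = np.array([
    [0.90, 0.10, 0.00], [0.80, 0.20, 0.00], [0.85, 0.10, 0.05],  # road-ish
    [0.00, 0.10, 0.90], [0.10, 0.00, 0.90], [0.05, 0.10, 0.85],  # tree-ish
])
labels = spectral_bisect(X, sigma=0.2)
print(labels)  # first three share one label, last three the other
```

The O(n^2) affinity matrix built here is exactly the scalability worry raised above; a Nyström approximation would compute only a subsampled block of A and extrapolate the eigenvectors from it.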