Tuesday, September 24, 2013

Reading for 9/26

Daniel Munoz, Drew Bagnell and Martial Hebert, Stacked Hierarchical Labeling, ECCV 2010.

And optionally:


Nonparametric Scene Parsing via Label Transfer, C. Liu, J. Yuen and A. Torralba, IEEE TPAMI, May, 2011.

52 comments:

  1. This paper presents a hierarchical approach for labeling images. In contrast with work using probabilistic graphical models, which are difficult to learn, this approach decomposes the labeling problem into a chain of subproblems.

    The success of the hierarchical model lies in analyzing the image at increasingly fine resolutions while maintaining global context. The authors train a probabilistic classifier for each level and pass the estimated label probabilities to the next level. They improve context by including not only the parent region's probabilities, but also the neighboring regions' probabilities and the entire previous level's probabilities. Classifiers are trained sequentially using the predictions from the previous level. To avoid overfitting, the authors employ stacking so that each classifier is trained on predictions made for held-out data. They also train a secondary classifier at each level to refine the first-round predictions. The algorithm performs comparably to others on the MSRC-21 dataset.
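
    To make the pipeline concrete, here is a toy sketch of the level-by-level pass as I understand it (all names and numbers are made up; this is not the authors' code, and it omits the neighbor-context term and the stacking/held-out training):

    import numpy as np
    from sklearn.linear_model import LogisticRegression  # stand-in for the paper's max-ent classifier

    n_labels, feat_dim = 5, 16
    levels = [8, 32, 128]                    # number of regions per level, coarse to fine
    rng = np.random.default_rng(0)

    prev_probs = None
    for n_regions in levels:
        feats = rng.normal(size=(n_regions, feat_dim))               # per-region appearance features (fake)
        if prev_probs is not None:
            parents = rng.integers(len(prev_probs), size=n_regions)  # toy parent assignment
            context = np.hstack([prev_probs[parents],                # parent region's label distribution
                                 np.tile(prev_probs.mean(0), (n_regions, 1))])  # whole previous level
            feats = np.hstack([feats, context])
        y = np.arange(n_regions) % n_labels                          # fake training labels for the toy
        clf = LogisticRegression(max_iter=500).fit(feats, y)         # one classifier per level
        prev_probs = clf.predict_proba(feats)                        # soft labels passed down, never hard decisions
    print(prev_probs.shape)                                          # (128, 5) at the finest level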

    Strengths:

    Use of stacking to avoid overfitting.
    The algorithm can recover from initially incorrect labelings.
    It maintains a probability distribution over labels.

    Weaknesses:

    Why stop at 8 levels in the labeling hierarchy? We can see that the classification improvement starts to level off after 8 levels on the MSRC-21 dataset, but it seems that adding more levels would still improve accuracy on the Stanford Background dataset.

    The authors propose using a two-stage classification at each level, but never quantify how much value the second stage contributes.

    I think it would be interesting to see, in the cases where the algorithm was "uncertain" of the label, how often the second-highest choice was correct. Was the model close or completely off?

    Replies
    1. I have looked at the distributions for each superpixel label to see if there is anything to learn from them. For example, can we measure the entropy to determine how certain the classifier was, or take the ratio of the 1st and 2nd most probable labels, etc.? Unfortunately I have not found this distribution to be very informative. When it's wrong, it really doesn't warn you. If you look at the confusion matrices, it's also hard to argue that it confuses one class with another for some rational or consistent reason.
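
      For reference, the two measures I tried look roughly like this (made-up numbers, just to illustrate):

      import numpy as np

      p = np.array([0.45, 0.40, 0.10, 0.03, 0.02])   # hypothetical label distribution for one superpixel
      entropy = -np.sum(p * np.log(p))               # high entropy = classifier is unsure
      top2_ratio = np.sort(p)[-1] / np.sort(p)[-2]   # close to 1 = top two labels nearly tied
      print(entropy, top2_ratio)                     # neither turned out to predict when it is wrong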

    2. I think the reason the authors stopped at level 8 might be computational cost.
      Also, the paper already contains a lot of information; an ECCV paper is too short to hold many more details.

    3. The depth of the network is always a practical yet unsolved problem, and it is dataset dependent. Stopping at a certain level, eight here, is, I think, an attempt to balance effectiveness and efficiency. One possible improvement would be to find some criterion, e.g., entropy, to stop the tree from growing, or to select the number of levels adaptively.

  2. This paper presents a neat idea for semantic segmentation. It exploits the fact that the segmentation task may be difficult to solve at one scale of the image but easier at another, and that solutions at different scales help each other.

    The approach starts with a given segmentation tree. Ideally, one would like all the levels (and regions) of the tree to "talk to each other". However, the graphical model becomes very dense, and since inference on such graphs is NP-hard, this becomes intractable. The authors use two steps to make this practical:
    1. They restrict the number of connections so that only adjacent levels are connected.
    2. Instead of approximating the inference, they make use of a structured model at each stage.

    At each level, they train a bunch of 1-vs-all Random Forest classifiers for each region, combining them using a maximum entropy framework. They use standard context features (e.g., labels of parent regions) so that classifiers at each level have some dependence.

    Things I liked
    - I like the simple approach of hierarchically tackling the segmentation problem. Also, since the approach takes a segmentation tree as input, it can be easily combined with existing approaches to yield multiple segmentations for a scene.
    - Soft decision making: The approach gives a probabilistic output/confidence scores for labels for regions.
    - Stacking: It is a great ensemble learning method and has regained popularity in recent years (e.g. Netflix challenge).

    Concerns
    - The model has way too many parameters to tune (I lost count of how many).
    - Pixel CRF is consistently better for "smoother/bigger" categories like sky/road/tree. Why is that?
    - The experiments should have included some analysis of which regions are confused the most, and why. The experimental section is just pure numbers and provides little insight into the technique. I would have liked to see more results on how the levels interact and how stacking helps. If the main contribution is indeed the new graphical model formulation/inference procedure, then the authors should provide more insight into it.

    Replies
    1. I think in certain places the paper could have been written more clearly, but maybe that's just me! Also, adding to what Ishan said, I would have liked to see more analysis of the algorithm and its various components in the experiments.

    2. It would indeed have been nice to see the authors' insight into why this method does better on some categories and not others. I don't have any intuition for why it does worse on some categories than others. We also see that it performs worse than [20], and that is attributed to their use of more discriminative features. I wonder how this paper's approach would do with those features; it might be a fairer comparison then.

  3. I have used Daniel’s code (as I know a number of people in the class have) and it works quite well. In fact, we have applied it to a number of scenarios with different cameras, little to no tweaking, and have had good results. In this regard, I would call the algorithm robust.

    However, one thing that has bothered me is the stability of the predictions. That is, given a prediction at time t, even if one changes viewpoint slightly, the prediction at time t+1 should be substantially similar. What I have seen is that small changes can influence the prediction quite dramatically. Daniel addresses this in a later paper, "Efficient Temporal Consistency for Streaming Video Scene Analysis," but I don't think that was an online process, and I have always felt that the difference between individual classifications should be smaller. Looking at the frame-to-frame data, I see this problem starting with the superpixel segmentation that is derived from the Arbelaez et al. method we saw on Tuesday. Even with a completely still camera viewing a scene with no moving objects, the segmentation can change dramatically due to sensor noise.

    This is an over-segmentation of the image, and life is good if the individual segments can combine into the proper regions to assign a single label. However, if the over-segmentation is wrong, then there is no way to win. Shadows and other lighting effects can cause this, but it often seems to happen without reason. One thing I have always wanted to try is replacing the segmentation algorithm with something more stable and then checking whether the instability comes from the segmentation or from something in the classifier.

    Replies
    1. As for your time problem, I think in general what's missing is the contextual link *between* successive frames. A time dimension ought to be added to each feature, so that regions near each other in subsequent frames are more likely to share the same label. We could imagine using a similar graphical model approach to link subsequent frames together.

      -- Matt Klingensmith

    2. In response to Arne's comment regarding the segmentation-tree input: that's exactly something that I would have liked to see -- trying out a few different approaches for the segmentation tree and showing which one works best and why, or showing that this approach is robust to the choice of segmentation-tree method (which would have been a strong point!).

    3. The problem of 'flickering' labels caused by slight changes in illumination and perspective is quite ubiquitous. In fact, even in other domains, wherever a classifier makes predictions on some kind of time-varying data, this problem can be seen. I would say that this is one of the big reasons why results from video segmentation algorithms may fail to impress people outside the field. An algorithm that predicts an object appearing out of nowhere and promptly vanishing in the next frame looks extremely silly in demo videos.

      The "Efficient Temporal Consistency ..." paper referred to by Arne does, in fact, address this issue in an online setting. The authors highlight in the introduction how their work compares favourably with video interpretation systems that only work in batch mode. Since their algorithm is aimed towards implementation on mobile robots, it uses efficient and simple techniques such as optical flow, temporal smoothing and computing pixel appearance similarities using the Mahalanobis distance with a precomputed covariance matrix learnt from data. It doesn't rely on any kind of semantic context or even geometrical reasoning to enforce temporal consistency. I think the same algorithm can even be applied to fix temporal consistency issues at the segmentation level.

    4. I'm wondering the same thing w.r.t. the initial tree. How robust is this method to changes in the tree? Didn't we argue last time that we don't like 'generic segmentation' methods? And yet this paper relies on such work existing.

  4. Munoz, Bagnell and Hebert present a scene segmentation approach based on a new kind of graphical model: rather than representing the image as a sparsely connected graph of pixels or superpixels, they represent it as a graph where larger parent nodes are connected to smaller child nodes. Label data is propagated from parent nodes to child nodes, and a complex graphical representation of the system allows inference that takes into account both local information from particular nodes and more global "context" information from parent nodes.

    In a way, this reminds me of using octrees and quadtrees for occupancy mapping. It allows a sort of "multi-resolution" labeling of the image that can take hierarchical information into account to make faster inferences over larger volumes. You could imagine a variant of this where levels are only "split" if certain amounts of variation are detected, for instance.

    Another thing I really like about this algorithm is its ability to assign probabilistic labelings, and confidences for those labelings. In my opinion, it's always important to propagate as much probabilistic information as possible into the later stages of any pipeline that takes in sensor data, because other algorithms that use the information from this pipeline ought to use probabilistic information as well, and it shouldn't be thrown out by earlier stages.

    Replies
    1. Btw this was Matt Klingensmith again (6ccc9cd...)

    2. I agree with your idea that levels are "split" if certain amounts of variation are detected. I think this would help in having an adaptive approach to how many levels deep the algorithm goes.

      Playing the devil's advocate, however, I could imagine an image with a few small but brightly colored regions that get segmented out in the first few segmentation layers, which might trigger the stopping condition prematurely, whereas finer segmentation levels would actually parse out the correct details. You might have to ensure that the algorithm reaches some base level of resolution first.

      Why do you think the probabilistic information is so important? Would it be enough to pass the top K labels at each level? Passing down all the information will become more and more difficult as this algorithm is extended to predict more labels. If we had to predict 100 labels, I would assume that passing the top 10 label probabilities would be more helpful than passing all 100. It might even be better to reduce the number of labels passed at each successive level as the algorithm becomes more sure of its predictions (i.e., pass 50 at level 1, pass 10 at level 5).
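
      Concretely, by "passing the top K" I mean something like this (toy sketch):

      import numpy as np

      def top_k(probs, k):
          out = np.zeros_like(probs)
          idx = np.argsort(probs)[-k:]       # indices of the k most probable labels
          out[idx] = probs[idx]
          return out / out.sum()             # renormalize so it is still a distribution

      p = np.array([0.02, 0.40, 0.05, 0.35, 0.18])
      print(top_k(p, 2))                     # only the two strongest labels survive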

    3. It would certainly be an interesting experiment to see what happens when only the top K labels are passed down at each level. Ideally, the algorithm should give the same (or at least very similar) performance as it does now.

    4. I'll bring some arguments against this… :)
      To me, one of the best things in the algorithm is the soft decision making, where each region-label pair is given some probability at the output. By passing only the most probable labels to the next level, and then only the most probable of those to the level after, I think you have a very good chance of killing the soft decision making. Besides, in Fig. 5 the authors give an example where the algorithm recovers over the levels from an almost-certain mislabeling of a person as a building.

  5. This paper takes the position in the introduction that using graphical models for semantic segmentation is not a great way of doing things: though graphical models form intuitive representations of the problem, the interconnections between different levels are highly complex and render exact inference intractable.
    Even approximate inference seems to be a bad thing to do, though I am not entirely clear why this is, and the paper does not go into much detail regarding the degree to which it is bad. Does it not work in general, or is it just not suited to the semantic segmentation problem? It would be good if someone could shed more light on this.
    The algorithm in this paper trains a set of classifiers, one per level of the hierarchy, to predict the probability distribution over labels in each region. Modeling label proportions over regions seems to be a neat and robust thing to do, rather than training classifiers to predict single labels, especially considering their use as features at each lower level in the hierarchy. The paper considers and handles issues such as partial labelings in a sensible manner, which is important for datasets labeled using crowdsourcing, as the images often tend to be only partially labeled. Features encoding the context of each region at every level are obtained from the parent region's predicted labels, a weighted average of the neighboring regions' probability distributions, and the weighted overall distribution from the parent level. At this point the paper describes the hierarchical stacking procedure, which is employed to overcome two problems that result in cascading errors. Overall the algorithm seems to be quite robust and the results look good.
    One thing that struck me as odd was the way in which features from neighboring regions are weighted when computing the context features. The area of intersection of the dilated mask of the current region with each neighboring region is used as the weight for that neighbor's features. The way in which the extent of dilation is determined is not mentioned anywhere, and I feel that the level of dilation could affect the weights considerably - at the smallest dilations it will be linear in the number of common boundary pixels with each neighbor, but with heavier dilations it could quickly become non-linear. In addition, though it may not make much of a difference, I feel it intuitively makes more sense to dilate the neighboring regions and compute the weights based on how much each of them intersects the current region.
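
    For what it's worth, my reading of that weighting scheme is roughly the following (a toy reconstruction, not the authors' code; n_dilate is exactly the unspecified knob I mean):

    import numpy as np
    from scipy.ndimage import binary_dilation

    labels = np.array([[0, 0, 1, 1],
                       [0, 0, 1, 1],
                       [2, 2, 1, 1],
                       [2, 2, 2, 2]])                   # toy segmentation: region id per pixel

    region, n_dilate = 0, 1                             # current region and the dilation radius
    mask = labels == region
    ring = binary_dilation(mask, iterations=n_dilate) & ~mask   # band of pixels around the region

    weights = {r: int(np.sum(ring & (labels == r)))     # overlap of the band with each neighbor
               for r in np.unique(labels) if r != region}
    print(weights)                                      # increasing n_dilate changes these weights
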
    Finally, the results on the MSRC-21 dataset give rise to some interesting observations which the paper again doesn't seem to mention. The leaf-level classifier works better (significantly in some cases) than the hierarchical classifier for the classes building, grass, tree, sky and road. The pattern is interesting because these are the individual classes that usually take up the largest number of pixels in an image. Probably for the same reason, the overall accuracy of the hierarchical classifier is only 4% higher than that of the leaf-level classifier. Even on the Stanford Background Dataset, the performance on these classes is not all that different between the leaf-level and the hierarchical classifiers. This seems to indicate that hierarchical classification may not be the preferred method for predicting such classes.

    Replies
    1. Srivatsan raises a good point here which I (somewhat selfishly) hope people can discuss: that approximate inference techniques with well-established theory and good statistical properties have, in general, performed worse than dumb greedy inference techniques in vision problems. I see two reasons for this. First, approximate inference generally aims to be convex, and convexity is a bad idea; remember the ambiguous images I showed in my lecture. In general there may be multiple "good" solutions, and the best approach is to pick exactly one. Convex relaxations, on the other hand, smooth away local minima, which tends to put the minimum of the relaxation somewhere between the good solutions, as a sort of average. The other issue is that the way the relaxations measure how "close" the approximate distribution is to the true distribution isn't designed with any knowledge of the problem. They tend to put bounds on the distance between their approximate distribution and the true model distribution. I also mentioned in my presentation that the distances used in vision tend to be nonsense in many cases, and it's likely true here. If you learn a CRF model and then run it in generative mode, I expect that it would happily produce a good deal of nonsense; it certainly won't produce a realistic distribution of image segmentations. However, variational approximations minimize the KL-divergence between the approximate distribution and the model's ENTIRE distribution, even the parts of the distribution which haven't been sampled well in the data. Hence, there's no reason they wouldn't model idiosyncrasies of the learned distribution rather than fitting to the actual data.

      Not that I've used variational inference very much; this is just my impression from reading Jordan/Wainwright's 2008 textbook beast. I'd be curious what people think of my guesses here.

    2. I haven't studied theoretical properties of the approximate inference methods. But from the two posts above, I have the following opinion:-

      The KL divergence is heavily dependent on the tail of the distribution. It's probably not the right distance function for this problem, because the tail can contain very strange artifacts.
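
      To make that concrete with toy numbers: KL(q || p) is dominated by any mass q places where p has a thin tail.

      import numpy as np

      def kl(q, p):
          return float(np.sum(q * np.log(q / p)))

      p  = np.array([0.70, 0.25, 0.05])    # "true" distribution with a thin tail
      q1 = np.array([0.70, 0.29, 0.01])    # under-covers the tail
      q2 = np.array([0.60, 0.25, 0.15])    # puts extra mass on the tail
      print(kl(q1, p), kl(q2, p))          # the tail term dominates the second value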

      As for loopy belief propagation, I don't think it guarantees a very good solution. I might be terribly wrong ... please correct me if that's the case.

      Whether convexity is good or bad is hard to say. An RBF-kernel SVM learner is solving a convex optimization problem, but it can fit arbitrary decision boundaries (there are some technical conditions which I'm skipping here). So trying to convexify the problem is not necessarily a bad design choice.

      The idea of making a tree is very good, as that avoids loops and ensures exact inference is possible. The catch is that the tree's architecture is going to vary with the image, and this makes no sense as far as graphical model learning is concerned. The very neat trick is to tie together the parametric function and its parameters used for estimating the probability values for the nodes at the same level. I'm not a graphical models expert, but could someone try writing the message passing algorithm for such a design? Does it give the same algorithm? Do the two iterations, forward pass and backward pass, turn out to be similar to those described in Section 3.4?

      I can't pull up the citation here, but people have tried this tree structure with a fixed-size architecture in the past. They used a grid-based, hard-coded segmentation to fix the architecture.

    3. @Carl, modeling the probability of likely solutions may be more informative than just picking one of the possible solutions outright. For instance, probability information combined with scene classification information can help correctly distinguish whether a red colored ball is a cricket ball or a snooker ball. --Gaurav

    4. The convex relaxation strategy simplifies the solution to the problem, but it actually changes the original problem and yields a globally optimal yet approximate solution to the original one. It is also really hard to evaluate how close the approximation is to the exact solution: the further away it is, the less meaningful the relaxation is. So I think the formulation is more important than the solution method. That is the basic motivation of this paper: to formulate semantic segmentation according to its own characteristics rather than directly applying a general graphical model.

  6. I am reminded of my own attempt to write a tree-shaped graphical model for semantic segmentation. The most intuitive choice of architecture is the one returned by an unsupervised segmentation algorithm like gPb-owt-ucm. The problem with this intuitive choice is that the architecture varies from image to image. The authors solve this by using the same phi function across all the nodes at level l. This choice of shared phi function is not deliberated upon, but I think it is critical.

    Replies
    1. I would agree with you on this point. The regions at level l for a well-lit image are going to look completely different than those for a poorly-lit image. Similarly, regions in cluttered and uncluttered images will have different distributions of features. I don't see how using the same phi function across all images at a given level would intuitively make sense, unless there were some guarantee that images would be segmented at each level to the same degree of separation regardless of the number of segments needed.

  7. With respect to Carl's comment on greedy algorithms vs. approximate inference, I want to mention one paper here:

    Philipp Krähenbühl and Vladlen Koltun, Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials, NIPS 2011 Best Student Paper.

    In this paper, I saw state-of-the-art performance in semantic segmentation using a CRF with approximate inference. They do inference over a fully connected CRF, at the pixel level, yet their speed is unbelievably fast: 0.2 s per image. What they show are labeling results with image-matting-level detail, not the bulky labels generated by traditional graphical models.

    And here are their quantitative results on MSRC-21:
    86.0% Overall (MSRC Ground Truth)
    88.2% Overall (Really Accurate Ground Truth)

    I'm kind of curious why this paper is not discussed.

    Replies
    1. There's another paper like this

      Blending Learning and Inference in Structured Prediction, Tamir Hazan, Alexander Schwing, David McAllester, Raquel Urtasun, http://arxiv.org/pdf/1210.2346.pdf

      They get 86.8% overall on the MSRC-21 dataset.

      They propose a solution to the problem of learning parameters of structured predictors in general graphical models. They test this solution on stereo estimation, semantic segmentation, shape reconstruction, and indoor scene understanding and get state of the art results.

    2. Ah, that paper! Yes, we very nearly had the class read that one, but I argued to Abhinav that the methodology is too misleading. If you read it quickly you'll get the impression that the convolution is 2D, i.e., that it happens in image space. However, if you read carefully you realize that the convolution is in *feature* space. In their case it's 5D; if you wanted to use real texture features, it would be dozens of dimensions, and even the hacks they used to make the convolution work in 5D wouldn't be practical. As it stands, it's actually quite difficult to analyze the approximations made in the convolution, although in practice it seems to work quite well. I can't remember now how well it compares to baselines; if I recall correctly, they don't actually compare to any state-of-the-art baselines in their paper.

    3. I would have liked to see this paper. I agree with Carl that the way it is written is a bit misleading, but it still uses handy tricks. The approximations for the convolutions used in this paper are more or less covered in [2, 3]. Though the paper makes a strong assumption that the pairwise potentials need to be a linear combination of Gaussians, I still love the fine-grained segmentation that they get. Following on Carl's earlier point that a lot of algorithms make approximations (which usually don't make sense), I think that some approximations are more reasonable than others. And in this case, given the speed-up and the power of a fully connected model, I happily accept the approximation.

      I would recommend looking at the video of the paper [1].

      [1] http://videolectures.net/nips2011_kraehenbuehl_potentials/
      [2] S. Paris and F. Durand. A fast approximation of the bilateral filter using a signal processing approach, IJCV, 2009
      [3] A. Adams, J. Baek, and M. A. Davis. Fast high-dimensional filtering using the permutohedral lattice. Computer Graphics Forum, 2010.

    4. I had a quick look at this fully-connected CRF paper, and I'm really curious about why their approximate inference works. Setting aside the accuracy numbers, as Abhinav said, the qualitative results of such fine-grained segmentation look very nice to me.

    5. Good question: we were not sure which would be the better paper for the class to read. I believe the stacking and series-of-predictions idea is definitely a neat approach to solving graphical models; it should be applicable beyond image labeling. TextonBoost is the initial and most-cited paper, and SIFT Flow is a completely different approach and take on this.

    6. Yeah, I know that NIPS paper as well; I ran their algorithm for the computer vision homework and it seems to work pretty well.

  8. I'm slightly confused about the details of stacked hierarchical labeling. But the big picture looks very similar to the dropout technique used to prevent overfitting in deep neural networks. It's similar to saying that all weights inside the same layer are shared, but I deactivate a random subset of these neurons and train using the data flowing into the rest of them.

    The question asked earlier about the depth of this network also reminds me of similar questions asked about deep nets. Why do they stop at 4 layers? Why not 5? The deep learning papers (at least back in 2006) tried different numbers of layers and observed that it did not help much to add more layers beyond those used in their experiments.

    Replies
    1. I think Figure 8 kind of shows that accuracy starts to flatten as the number of levels increases; in the case of both the MSRC-21 and Stanford Background datasets, once they get to ~L8, the gains slow down. Of course, this might not be true of all datasets, and choosing the right number is not totally clear, but one nice thing about their method is that it seems to be relatively quick, so maybe stopping around L8 or so is a good tradeoff between accuracy and speed.

    2. Why is it similar to the dropout technique? I am quite confused... It seems like it uses all the nodes during each phase of training...

      I have actually thought about this for a while: do all these things, attributes and hierarchical labeling, lead us to a deeper architecture for vision, and for AI?

  9. Context is important and the authors have addressed that in the paper, but semantic segmentation of the scene seems incomplete to me without some idea of the 3-D structure of the world. Humans use depth information to segment out objects before finding ways to label them. I'm not sure if this is addressed in any paper, but it would be interesting to see semantic segmentation results for outdoor scenes using depth estimates obtained from just monocular camera data. --Gaurav

  10. I think this paper introduces a good method for segmentation + labeling. In particular, I liked the fact that their algorithm 1. never assigns discrete labelings to a region (always soft decisions / probabilities), and 2. uses a multi-level approach (some regions and some levels may be easier to segment / label). What I would have liked to see more of, however, was different types of evaluation and reasoning about why certain parts of the method worked or didn't work (besides just presenting the usual performance comparison against other labeling algorithms). Some things that might have been interesting to see (some of which have been mentioned by others) are 1. the initial segmentation, 2. results at each level (it could be interesting if certain things segmented better at consistently different levels), and 3. an explanation for the large variance in performance on some categories (e.g., in Table 1 the leaf classifier on boat, dog, chair and bird is very low; also, in some cases the leaf classifier is better than the hierarchy).

  11. One of the biggest issues bothering the authors, I think, might be the cascading-error issue. The authors use a "stacking" approach to alleviate this problem. In the following section, they also introduce a second round of "stacking" to deal with having too many small regions at the next level. If I understand correctly, can we view this "stacking" approach as an extended or more elaborate version of cross-validation, of course with a smarter combining strategy? If so, would trying various partitions of the training/validation set be another source of over-fitting or data abuse?
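
    To pin down the analogy, the generic stacking recipe I have in mind looks roughly like this (a sketch with scikit-learn, not the paper's exact procedure):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                               n_classes=3, random_state=0)

    stage1 = LogisticRegression(max_iter=500)
    # Cross-validated probabilities: each sample's prediction comes from a model that never
    # saw it, which is what keeps the second stage from overfitting to stage-1 outputs.
    held_out_probs = cross_val_predict(stage1, X, y, cv=5, method="predict_proba")

    stage2 = LogisticRegression(max_iter=500)
    stage2.fit(np.hstack([X, held_out_probs]), y)   # second stage sees features + stage-1 beliefs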

    Replies
    1. That is an interesting way to look at the second round of stacking. I would not completely agree with you on the cross-validation part. I always view stacking as a nice combination strategy, sort of like calibrating the outputs of various classifiers: the "stacked" classifier decides which of the first-level classifiers are reliable, etc. The held-out data comes in so that they can get an inkling of its test-time behavior.

    2. > sort of like calibrating the outputs of various classifiers.
      seems like cross-validation to me even here...

  12. Points I like about this paper:
    1. The idea of breaking an intractable inference problem into smaller subproblems. This provides good insight into avoiding intractability from a computational perspective.
    2. The way they do hierarchical stacking is interesting, since they use a leave-one-out policy to simulate the distribution of the predictions at test time. This approach is also applied in many other settings to avoid overfitting and thus make the model generalizable, e.g., discriminative patch discovery [1].

    Things not clear to me:
    Due to the page limit, the authors defer to other references for details about many design choices (e.g., why choose a max-ent classifier as the building block? What about other classification methods?). Another concern is the lack of results validating the key contribution of the paper, which is to decompose a big intractable inference problem into smaller subproblems without performance loss. It would be great if there were some comparisons that let us see directly how well this simplification works.

    Replies
    1. I think the ability to break the inference problem into smaller subproblems is tied to the hierarchical model they use in this paper. It is more like a layered, directed graphical model, so we can solve the problem in each layer separately. An interesting question is whether we could do the same thing with other graphical models, or reach some conclusion about which types of graphical models can be solved this way.

    2. Definitely a very interesting question... I think I know of more papers from Drew Bagnell that have broken other graphical models into a series of predictions as well.

    3. I agree that many of the design choices seem arbitrary. In comparison to this method, the gPb-owt-ucm method seems very straightforward to me. I wonder if this is just because I'm more comfortable with signal processing jargon, or if the current paper is really more complicated.

    4. Sorry, I forgot the reference...
      [1] Unsupervised Discovery of Mid-Level Discriminative Patches, Saurabh Singh, Abhinav Gupta, Alexei A. Efros

  13. The key point in this paper is the ability to solve the segmentation problem by solving each level in a hierarchy of segmentation problems and passing predictions between the levels. I wonder if this could also work for object detection at multiple scales. Instead of feeding multi-scale features into a single classifier, we could train a classifier at each scale (the hierarchy) and pass the predicted detection scores back and forth. Seems like something that has probably been done, but it would be interesting to see.

    Replies
    1. Again, this is why I think that segmentation (at least at the semantic level) and object detection are almost the same problem approached from different perspectives. When we recognize the existence of a visual entity, do we want to fill the projection on our retina with some color or put a box around it?

    2. I view object detection slightly differently. Here in this paper, we are essentially talking about a "scene", or better, an image consisting of multiple objects. And from my understanding, the labels provided in this case are only at the object level, meaning that the relations between different parts of the image are object-to-object relationships.
      On the other hand, multi-scale object detection in such a framework wouldn't work, because we don't really partition an object into a set of sub-objects. Yes, conceptually it should be doable with meticulous labeling of each sub-part of the object, but that is not feasible, as we are already facing a lack of labels even at the whole-object level. And with limited labeling, I also doubt this would perform better.

    3. Well, I think segmentation produces a more fine-grained shape for the semantic-labeling problem. If the segmentation algorithm works perfectly and we have a good representation for it, it should outperform bounding-box based methods, because it gets rid of the unnecessary background clutter automatically and does not need extra data to let the machine determine which region it should look at (training to get the weights). Rectangular regions work fine if we have enough data, yet to take an unbiased look we'd better work with segments, which is of course not possible at this time, since humans are not able to label images in such fine detail.

  14. Good idea:
    The idea in this paper of breaking the intractable inference problem into subproblems is really interesting. The layered hierarchical graphical model can be decomposed into a subproblem for each level.

    Question:
    My question is that the way they obtain the different levels of image segmentation is not clear. It seems to me this model will only work if, within the same level, each region or superpixel has a similar size or degree of completeness; in other words, the coarseness of the segments at each level should be comparable. However, it is not clear to me how we could enforce this, and it seems hard to maintain when segmenting both complicated and simple images.

  15. Ehh... I am curious whether there is a way to compare results from this paper with those from the Spatial Pyramid Matching paper from a week ago. They use different databases and maybe different image features, so I guess not really. I just mean that Spatial Pyramid Matching looked so robust and straightforward compared to this one.

    Replies
    1. I'm not totally clear on exactly what the levels in this paper are, but I'm fairly certain they're not on a grid. It seems, though, that there are many examples of this coarse-to-fine idea and that they all seem to perform quite well.

      Also, the Spatial Pyramid Matching paper was trying to match scenes. This one, I think, was trying to get better segmentations at the lower levels?

  16. I found the methodology of this paper pretty difficult to understand. I think I understand the broad idea - performing segmentation at coarse-to-fine levels, passing probabilities between levels, not making any hard decisions. However, I understand almost none of the details. How do you separate the image into different regions? What, exactly, do the different levels mean? What, exactly, is passed between levels? If each level simply predicts a proportion of labels, is that what is passed down? Do we hope that at the finest level, each region has a huge majority of one label, and that becomes the classification?

    In general, I really like the idea of not making a hard decision and passing probabilities along (even if I don't quite understand how they're doing this in this paper). It seems like at the top level, the algorithm gets a general idea of the content of the image, and at the lower levels, figures out the finer details of the contours. It makes sense that getting a general idea and passing this information downward for figuring out details works well over trying to do everything at once.
