Tuesday, September 24, 2013

Reading for 9/26

Daniel Munoz, Drew Bagnell and Martial Hebert, Stacked Hierarchical Labeling, ECCV 2010.

And optionally:


Nonparametric Scene Parsing via Label Transfer, C. Liu, J. Yuen and A. Torralba, IEEE TPAMI, May, 2011.

52 comments:

  1. This paper presents a hierarchical approach for labeling images. In contrast with work using probabilistic graphical models, which are difficult to learn, this approach decomposes the labeling problem into a chain of subproblems.

    The success of the hierarchical model lies in analyzing the image at increasingly fine resolutions while maintaining global context. The authors train a probabilistic classifier for each level and pass the estimated label probabilities to the next level. They improve context by including not only the parent region's probabilities, but also the neighboring regions' probabilities and the entire previous level's probabilities. Classifiers are trained sequentially using the predictions from the previous level. To avoid overfitting, the authors employ stacking so that each classifier is trained on predictions made for held-out data. They also train a secondary classifier at each level to refine the first-round predictions. The algorithm performs comparably to others on the MSRC-21 dataset.
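
    To make the pipeline concrete, here is a toy sketch of the level-by-level pass as I understand it (all names and numbers are made up; this is not the authors' code, and it omits the neighbor-context term and the stacking/held-out training):

    import numpy as np
    from sklearn.linear_model import LogisticRegression  # stand-in for the paper's max-ent classifier

    n_labels, feat_dim = 5, 16
    levels = [8, 32, 128]                    # number of regions per level, coarse to fine
    rng = np.random.default_rng(0)

    prev_probs = None
    for n_regions in levels:
        feats = rng.normal(size=(n_regions, feat_dim))               # per-region appearance features (fake)
        if prev_probs is not None:
            parents = rng.integers(len(prev_probs), size=n_regions)  # toy parent assignment
            context = np.hstack([prev_probs[parents],                # parent region's label distribution
                                 np.tile(prev_probs.mean(0), (n_regions, 1))])  # whole previous level
            feats = np.hstack([feats, context])
        y = np.arange(n_regions) % n_labels                          # fake training labels for the toy
        clf = LogisticRegression(max_iter=500).fit(feats, y)         # one classifier per level
        prev_probs = clf.predict_proba(feats)                        # soft labels passed down, never hard decisions
    print(prev_probs.shape)                                          # (128, 5) at the finest level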

    Strengths:

    Use of stacking to avoid overfitting.
    The algorithm can recover from initially incorrect labelings.
    It maintains a probability distribution over labels.

    Weaknesses:

    Why stop at 8 levels in the labeling hierarchy? We can see that the classification improvement starts to level off after 8 levels on the MSRC-21 dataset, but it seems that adding more levels would still improve accuracy on the Stanford Background dataset.

    The authors propose using a two-stage classification at each level, but never quantify how much value the second stage contributes.

    I think it would be interesting to see, in the cases where the algorithm was "uncertain" of the label, how often the second-highest choice was correct. Was the model close or completely off?

    Replies
    1. I have looked at the distributions for each superpixel label to see if there is anything to learn from them. For example, can we measure the entropy to determine how certain the classifier was, or take the ratio of the 1st and 2nd most probable labels, etc.? Unfortunately I have not found this distribution to be very informative. When it's wrong, it really doesn't warn you. If you look at the confusion matrices, it's also hard to argue that it confuses one class with another for some rational or consistent reason.
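
      For reference, the two measures I tried look roughly like this (made-up numbers, just to illustrate):

      import numpy as np

      p = np.array([0.45, 0.40, 0.10, 0.03, 0.02])   # hypothetical label distribution for one superpixel
      entropy = -np.sum(p * np.log(p))               # high entropy = classifier is unsure
      top2_ratio = np.sort(p)[-1] / np.sort(p)[-2]   # close to 1 = top two labels nearly tied
      print(entropy, top2_ratio)                     # neither turned out to predict when it is wrong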

    2. I think the reason the authors stopped at level 8 might be computational cost.
      Also, the paper already contains a lot of information; an ECCV paper is too short to hold many more details.

    3. The depth of the network is always a practical yet unsolved problem, and it is dataset dependent. Stopping at a certain level, eight here, is, I think, an attempt to balance effectiveness and efficiency. One possible improvement would be to find some criterion, e.g., entropy, to stop the tree from growing, or to select the number of levels adaptively.

  2. This paper presents a neat idea for semantic segmentation. It exploits the fact that the segmentation task may be difficult to solve at one scale of the image but easier at another, and that solutions at different scales help each other.

    The approach starts with a given segmentation tree. Ideally, one would like all the levels (and regions) of the tree to "talk to each other". However, the graphical model becomes very dense, and since inference on such graphs is NP-hard, this becomes intractable. The authors use two steps to make this practical:
    1. They restrict the number of connections so that only adjacent levels are connected.
    2. Instead of approximating the inference, they make use of a structured model at each stage.

    At each level, they train a bunch of 1-vs-all Random Forest classifiers for each region, combining them using a maximum entropy framework. They use standard context features (e.g., labels of parent regions) so that classifiers at each level have some dependence.

    Things I liked
    - I like the simple approach of hierarchically tackling the segmentation problem. Also, since the approach takes a segmentation tree as input, it can be easily combined with existing approaches to yield multiple segmentations for a scene.
    - Soft decision making: The approach gives a probabilistic output/confidence scores for labels for regions.
    - Stacking: It is a great ensemble learning method and has regained popularity in recent years (e.g. Netflix challenge).

    Concerns
    - The model has way too many parameters to tune (I lost count of how many).
    - Pixel CRF is consistently better for "smoother/bigger" categories like sky/road/tree. Why is that?
    - The experiments should have included some analysis of which regions are confused the most, and why. The experimental section is just pure numbers and provides little insight into the technique. I would have liked to see more results on how the levels interact and how stacking helps. If the main contribution is indeed the new graphical model formulation/inference procedure, then the authors should provide more insight into it.

    Replies
    1. I think in certain places the paper could have been written more clearly, but maybe that's just me! Also, adding to what Ishan said, I would have liked to see more analysis of the algorithm and its various components in the experiments.

    2. It would indeed have been nice to see the authors' insight into why this method does better on some categories and not others. I don't have any intuition for why it does worse on some categories than others. We also see that it performs worse than [20], and that is attributed to their use of more discriminative features. I wonder how this paper's approach would do with those features; it might be a fairer comparison then.

  3. I have used Daniel’s code (as I know a number of people in the class have) and it works quite well. In fact, we have applied it to a number of scenarios with different cameras, little to no tweaking, and have had good results. In this regard, I would call the algorithm robust.

    However, one thing that has bothered me is the stability of the predictions. That is, given a prediction at time t, even if one changes viewpoint slightly, the prediction at time t+1 should be substantially similar. What I have seen is that small changes can influence the prediction quite dramatically. Daniel addresses this in a later paper, "Efficient Temporal Consistency for Streaming Video Scene Analysis," but I don't think that was an online process, and I have always felt that the difference between individual classifications should be smaller. Looking at the frame-to-frame data, I see this problem starting with the superpixel segmentation that is derived from the Arbelaez et al. method we saw on Tuesday. Even with a completely still camera viewing a scene with no moving objects, the segmentation can change dramatically due to sensor noise.

    This is an over-segmentation of the image, and life is good if the individual segments can combine into the proper regions to assign a single label. However, if the over-segmentation is wrong, then there is no way to win. Shadows and other lighting effects can cause this, but it often seems to happen without reason. One thing I have always wanted to try is replacing the segmentation algorithm with something more stable and then checking whether the instability comes from the segmentation or from something in the classifier.

    Replies
    1. As for your time problem, I think in general what's missing is the contextual link *between* successive frames. A time dimension ought to be added to each feature, so that regions near each other in subsequent frames are more likely to share the same label. We could imagine using a similar graphical model approach to link subsequent frames together.

      -- Matt Klingensmith

    2. In response to Arne's comment regarding the segmentation-tree input: that's exactly something that I would have liked to see -- trying out a few different approaches for the segmentation tree and showing which one works best and why, or showing that this approach is robust to the choice of segmentation-tree method (which would have been a strong point!).

    3. The problem of 'flickering' labels caused by slight changes in illumination and perspective is quite ubiquitous. In fact, even in other domains, wherever a classifier makes predictions on some kind of time-varying data, this problem can be seen. I would say that this is one of the big reasons why results from video segmentation algorithms may fail to impress people outside the field. An algorithm that predicts an object appearing out of nowhere and promptly vanishing in the next frame looks extremely silly in demo videos.

      The "Efficient Temporal Consistency ..." paper referred to by Arne does, in fact, address this issue in an online setting. The authors highlight in the introduction how their work compares favourably with video interpretation systems that only work in batch mode. Since their algorithm is aimed towards implementation on mobile robots, it uses efficient and simple techniques such as optical flow, temporal smoothing and computing pixel appearance similarities using the Mahalanobis distance with a precomputed covariance matrix learnt from data. It doesn't rely on any kind of semantic context or even geometrical reasoning to enforce temporal consistency. I think the same algorithm can even be applied to fix temporal consistency issues at the segmentation level.

    4. I'm wondering the same thing w.r.t. the initial tree. How robust is this method to changes in the tree? Didn't we argue last time that we don't like 'generic segmentation' methods? And yet this paper relies on such work existing.

  4. Munoz, Bagnell and Hebert present a scene segmentation approach based on a new kind of graphical model: rather than representing the image as a sparsely connected graph of pixels or superpixels, they represent it as a graph where larger parent nodes are connected to smaller child nodes. Label data is propagated from parent nodes to child nodes, and a complex graphical representation of the system allows inference that takes into account both local information from particular nodes and more global "context" information from parent nodes.

    In a way, this reminds me of using octrees and quadtrees for occupancy mapping. It allows a sort of "multi-resolution" labeling of the image that can take hierarchical information into account to make faster inferences over larger volumes. You could imagine a variant of this where levels are only "split" if certain amounts of variation are detected, for instance.

    Another thing I really like about this algorithm is its ability to assign probabilistic labelings, and confidences for those labelings. In my opinion, it's always important to propagate as much probabilistic information as possible into the later stages of any pipeline that takes in sensor data, because other algorithms that use the information from this pipeline ought to use probabilistic information as well, and it shouldn't be thrown out by earlier stages.

    Replies
    1. Btw this was Matt Klingensmith again (6ccc9cd...)

    2. I agree with your idea that levels are "split" if certain amounts of variation are detected. I think this would help in having an adaptive approach to how many levels deep the algorithm goes.

      Playing the devil's advocate, however, I could imagine an image with a few small but brightly colored regions that get segmented out in the first few segmentation layers, which might trigger the stopping condition prematurely, whereas finer segmentation levels would actually parse out the correct details. You might have to ensure that the algorithm reaches some base level of resolution first.

      Why do you think the probabilistic information is so important? Would it be enough to pass the top K labels at each level? Passing down all the information will become more and more difficult as this algorithm is extended to predict more labels. If we had to predict 100 labels, I would assume that passing the top 10 label probabilities would be more helpful than passing all 100. It might even be better to reduce the number of labels passed at each successive level as the algorithm becomes more sure of its predictions (i.e., pass 50 at level 1, pass 10 at level 5).
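
      Concretely, by "passing the top K" I mean something like this (toy sketch):

      import numpy as np

      def top_k(probs, k):
          out = np.zeros_like(probs)
          idx = np.argsort(probs)[-k:]       # indices of the k most probable labels
          out[idx] = probs[idx]
          return out / out.sum()             # renormalize so it is still a distribution

      p = np.array([0.02, 0.40, 0.05, 0.35, 0.18])
      print(top_k(p, 2))                     # only the two strongest labels survive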

    3. It would certainly be an interesting experiment to see what happens when only the top K labels are passed down at each level. Ideally, the algorithm should give the same (or at least very similar) performance as it does now.

    4. I'll bring some arguments against this… :)
      To me, one of the best things in the algorithm is the soft decision making, where each region-label pair is given some probability at the output. By passing only the most probable labels to the next level, and then only the most probable of those to the level after, I think you have a very good chance of killing the soft decision making. Besides, in Fig. 5 the authors give an example where the algorithm recovers over the levels from an almost-certain mislabeling of a person as a building.

  5. This paper takes the position in the introduction that using graphical models for semantic segmentation is not a great way of doing things: though graphical models form intuitive representations of the problem, the interconnections between different levels are highly complex and render exact inference intractable.
    Even approximate inference seems to be a bad thing to do, though I am not entirely clear why this is, and the paper does not go into much detail regarding the degree to which it is bad. Does it not work in general, or is it just not suited to the semantic segmentation problem? It would be good if someone could shed more light on this.
    The algorithm in this paper trains a set of classifiers, one per level of the hierarchy, to predict the probability distribution over labels in each region. Modeling label proportions over regions seems to be a neat and robust thing to do, rather than training classifiers to predict single labels, especially considering their use as features at each lower level in the hierarchy. The paper considers and handles issues such as partial labelings in a sensible manner, which is important for datasets labeled using crowdsourcing, as the images often tend to be only partially labeled. Features encoding the context of each region at every level are obtained from the parent region's predicted labels, a weighted average of the neighboring regions' probability distributions, and the weighted overall distribution from the parent level. At this point the paper describes the hierarchical stacking procedure, which is employed to overcome two problems that result in cascading errors. Overall the algorithm seems to be quite robust and the results look good.
    One thing that struck me as odd was the way in which features from neighboring regions are weighted when computing the context features. The area of intersection of the dilated mask of the current region with each neighboring region is used as the weight for that neighbor's features. The way in which the extent of dilation is determined is not mentioned anywhere, and I feel that the level of dilation could affect the weights considerably - at the smallest dilations it will be linear in the number of common boundary pixels with each neighbor, but with heavier dilations it could quickly become non-linear. In addition, though it may not make much of a difference, I feel it intuitively makes more sense to dilate the neighboring regions and compute the weights based on how much each of them intersects the current region.
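
    For what it's worth, my reading of that weighting scheme is roughly the following (a toy reconstruction, not the authors' code; n_dilate is exactly the unspecified knob I mean):

    import numpy as np
    from scipy.ndimage import binary_dilation

    labels = np.array([[0, 0, 1, 1],
                       [0, 0, 1, 1],
                       [2, 2, 1, 1],
                       [2, 2, 2, 2]])                   # toy segmentation: region id per pixel

    region, n_dilate = 0, 1                             # current region and the dilation radius
    mask = labels == region
    ring = binary_dilation(mask, iterations=n_dilate) & ~mask   # band of pixels around the region

    weights = {r: int(np.sum(ring & (labels == r)))     # overlap of the band with each neighbor
               for r in np.unique(labels) if r != region}
    print(weights)                                      # increasing n_dilate changes these weights
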
    Finally, the results on the MSRC-21 dataset give rise to some interesting observations which the paper again doesn't seem to mention. The leaf-level classifier works better (significantly in some cases) than the hierarchical classifier for the classes building, grass, tree, sky and road. The pattern is interesting because these are the individual classes that usually take up the largest number of pixels in an image. Probably for the same reason, the overall accuracy of the hierarchical classifier is only 4% higher than that of the leaf-level classifier. Even on the Stanford Background Dataset, the performance on these classes is not all that different between the leaf-level and the hierarchical classifiers. This seems to indicate that hierarchical classification may not be the preferred method for predicting such classes.

    Replies
    1. Srivatsan raises a good point here which I (somewhat selfishly) hope people can discuss: that approximate inference techniques with well-established theory and good statistical properties have, in general, performed worse than dumb greedy inference techniques in vision problems. I see two reasons for this. First, approximate inference generally aims to be convex, and convexity is a bad idea; remember the ambiguous images I showed in my lecture. In general there may be multiple "good" solutions, and the best approach is to pick exactly one. Convex relaxations, on the other hand, smooth away local minima, which tends to put the minimum of the relaxation somewhere between the good solutions, as a sort of average. The other issue is that the way the relaxations measure how "close" the approximate distribution is to the true distribution isn't designed with any knowledge of the problem. They tend to put bounds on the distance between their approximate distribution and the true model distribution. I also mentioned in my presentation that the distances used in vision tend to be nonsense in many cases, and it's likely true here. If you learn a CRF model and then run it in generative mode, I expect that it would happily produce a good deal of nonsense; it certainly won't produce a realistic distribution of image segmentations. However, variational approximations minimize the KL-divergence between the approximate distribution and the model's ENTIRE distribution, even the parts of the distribution which haven't been sampled well in the data. Hence, there's no reason they wouldn't model idiosyncrasies of the learned distribution rather than fitting to the actual data.

      Not that I've used variational inference very much; this is just my impression from reading Jordan/Wainwright's 2008 textbook beast. I'd be curious what people think of my guesses here.

    2. I haven't studied theoretical properties of the approximate inference methods. But from the two posts above, I have the following opinion:-

      The KL divergence is heavily dependent on the tail of the distribution. It's probably not the right distance function for this problem, because the tail can contain very strange artifacts.
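
      To make that concrete with toy numbers: KL(q || p) is dominated by any mass q places where p has a thin tail.

      import numpy as np

      def kl(q, p):
          return float(np.sum(q * np.log(q / p)))

      p  = np.array([0.70, 0.25, 0.05])    # "true" distribution with a thin tail
      q1 = np.array([0.70, 0.29, 0.01])    # under-covers the tail
      q2 = np.array([0.60, 0.25, 0.15])    # puts extra mass on the tail
      print(kl(q1, p), kl(q2, p))          # the tail term dominates the second value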

      As for loopy belief propagation, I don't think it guarantees a very good solution. I might be terribly wrong ... please correct me if that's the case.

      Whether convexity is good or bad is hard to say. An RBF-kernel SVM learner is solving a convex optimization problem, but it can fit arbitrary decision boundaries (there are some technical conditions which I'm skipping here). So trying to convexify the problem is not necessarily a bad design choice.

      The idea of making a tree is very good, as that avoids loops and ensures exact inference is possible. The catch is that the tree's architecture is going to vary with the image, and this makes no sense as far as graphical model learning is concerned. The very neat trick is to tie together the parametric function and its parameters used for estimating the probability values for the nodes at the same level. I'm not a graphical models expert, but could someone try writing the message passing algorithm for such a design? Does it give the same algorithm? Do the two iterations, forward pass and backward pass, turn out to be similar to those described in Section 3.4?

      I can't pull up the citation here, but people have tried this tree structure with a fixed-size architecture in the past. They used a grid-based, hard-coded segmentation to fix the architecture.

    3. @Carl, modeling the probability of likely solutions may be more informative than just picking one of the possible solutions outright. For instance, probability information combined with scene classification information can help correctly distinguish whether a red colored ball is a cricket ball or a snooker ball. --Gaurav

    4. The convex relaxation strategy simplifies the solution to the problem, but it actually changes the original problem and yields a globally optimal yet approximate solution to the original one. It is also really hard to evaluate how close the approximation is to the exact solution: the further away it is, the less meaningful the relaxation is. So I think the formulation is more important than the solution method. That is the basic motivation of this paper: to formulate semantic segmentation according to its own characteristics rather than directly applying a general graphical model.

  6. I am reminded of my own attempt to write a tree-shaped graphical model for semantic segmentation. The most intuitive choice of architecture is the one returned by an unsupervised segmentation algorithm like gPb-owt-ucm. The problem with this intuitive choice is that the architecture varies from image to image. The authors solve this by using the same phi function across all the nodes at level l. This choice of shared phi function is not deliberated upon, but I think it is critical.

    Replies
    1. I would agree with you on this point. The regions at level l for a well-lit image are going to look completely different than those for a poorly-lit image. Similarly, regions in cluttered and uncluttered images will have different distributions of features. I don't see how using the same phi function across all images at a given level would intuitively make sense, unless there were some guarantee that images would be segmented at each level to the same degree of separation regardless of the number of segments needed.

  7. With respect to Carl's comment on greedy algorithms vs. approximate inference, I want to mention one paper here:

    Philipp Krähenbühl and Vladlen Koltun, Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials, NIPS 2011 Best Student Paper.

    In this paper, I saw state-of-the-art performance in semantic segmentation using a CRF with approximate inference. They do inference over a fully connected CRF, at the pixel level, yet their speed is unbelievably fast: 0.2 s per image. What they show are labeling results with image-matting-level detail, not the bulky labels generated by traditional graphical models.

    And here are their quantitative results on MSRC-21:
    86.0% Overall (MSRC Ground Truth)
    88.2% Overall (Really Accurate Ground Truth)

    I'm kind of curious why this paper is not discussed.

    Replies
    1. There's another paper like this

      Blending Learning and Inference in Structured Prediction, Tamir Hazan, Alexander Schwing, David McAllester, Raquel Urtasun, http://arxiv.org/pdf/1210.2346.pdf

      They get 86.8% overall on the MSRC-21 dataset.

      They propose a solution to the problem of learning parameters of structured predictors in general graphical models. They test this solution on stereo estimation, semantic segmentation, shape reconstruction, and indoor scene understanding and get state of the art results.

    2. Ah, that paper! Yes, we very nearly had the class read that one, but I argued to Abhinav that the methodology is too misleading. If you read it quickly you'll get the impression that the convolution is 2D, i.e., that it happens in image space. However, if you read carefully you realize that the convolution is in *feature* space. In their case it's 5D; if you wanted to use real texture features, it would be dozens of dimensions, and even the hacks they used to make the convolution work in 5D wouldn't be practical. As it stands, it's actually quite difficult to analyze the approximations made in the convolution, although in practice it seems to work quite well. I can't remember now how well it compares to baselines; if I recall correctly, they don't actually compare to any state-of-the-art baselines in their paper.

    3. I would have liked to see this paper. I agree with Carl that the way it is written is a bit misleading, but it still uses handy tricks. The approximations for the convolutions used in this paper are more or less covered in [2, 3]. Though the paper makes a strong assumption that the pairwise potentials need to be a linear combination of Gaussians, I still love the fine-grained segmentation that they get. Following on Carl's earlier point that a lot of algorithms make approximations (which usually don't make sense), I think that some approximations are more reasonable than others. And in this case, given the speed-up and the power of a fully connected model, I happily accept the approximation.

      I would recommend looking at the video of the paper [1].

      [1] http://videolectures.net/nips2011_kraehenbuehl_potentials/
      [2] S. Paris and F. Durand. A fast approximation of the bilateral filter using a signal processing approach, IJCV, 2009
      [3] A. Adams, J. Baek, and M. A. Davis. Fast high-dimensional filtering using the permutohedral lattice. Computer Graphics Forum, 2010.

    4. I had a quick look at this fully-connected CRF paper, and I'm really curious about why their approximate inference works. Setting aside the accuracy numbers, as Abhinav said, the qualitative results of such fine-grained segmentation look very nice to me.

    5. Good question: we were not sure which would be the better paper for the class to read. I believe the stacking and series-of-predictions idea is definitely a neat approach to solving graphical models; it should be applicable beyond image labeling. TextonBoost is the initial and most-cited paper, and SIFT Flow is a completely different approach and take on this.

    6. Yeah, I know that NIPS paper as well; I ran their algorithm for the computer vision homework and it seems to work pretty well.

  8. I'm slightly confused about the details of stacked hierarchical labeling. But the big picture looks very similar to the dropout technique used to prevent overfitting in deep neural networks. It's similar to saying that all weights inside the same layer are shared, but I deactivate a random subset of these neurons and train using the data flowing into the rest of them.

    The question asked earlier about the depth of this network also reminds me of similar questions asked about deep nets. Why do they stop at 4 layers? Why not 5? The deep learning papers (at least back in 2006) tried different numbers of layers and observed that it did not help much to add more layers beyond those used in their experiments.

    Replies
    1. I think Figure 8 kind of shows that accuracy starts to flatten as the number of levels increases; in the case of both the MSRC-21 and Stanford Background datasets, once they get to ~L8, the gains slow down. Of course, this might not be true of all datasets, and choosing the right number is not totally clear, but one nice thing about their method is that it seems to be relatively quick, so maybe stopping around L8 or so is a good tradeoff between accuracy and speed.

    2. Why is it similar to the dropout technique? I am quite confused... It seems like it uses all the nodes during each phase of training...

      I have actually thought about this for a while: do all these things, attributes and hierarchical labeling, lead us to a deeper architecture for vision, and for AI?

  9. Context is important and the authors have addressed that in the paper, but semantic segmentation of the scene seems incomplete to me without some idea of the 3-D structure of the world. Humans use depth information to segment out objects before finding ways to label them. I'm not sure if this is addressed in any paper, but it would be interesting to see semantic segmentation results for outdoor scenes using depth estimates obtained from just monocular camera data. --Gaurav

  10. I think this paper introduces a good method for segmentation + labeling. In particular, I liked the fact that their algorithm 1. never assigns discrete labelings to a region (always soft decisions / probabilities), and 2. uses a multi-level approach (some regions and some levels may be easier to segment / label). What I would have liked to see more of, however, was different types of evaluation and reasoning about why certain parts of the method worked or didn't work (besides just presenting the usual performance comparison against other labeling algorithms). Some things that might have been interesting to see (some of which have been mentioned by others) are 1. the initial segmentation, 2. results at each level (it could be interesting if certain things segmented better at consistently different levels), and 3. an explanation for the large variance in performance on some categories (e.g., in Table 1 the leaf classifier on boat, dog, chair and bird is very low; also, in some cases the leaf classifier is better than the hierarchy).

  11. One of the biggest issues bothering the authors, I think, might be the cascading-error issue. The authors use a "stacking" approach to alleviate this problem. In the following section, they also introduce a second round of "stacking" to deal with having too many small regions at the next level. If I understand correctly, can we view this "stacking" approach as an extended or more elaborate version of cross-validation, of course with a smarter combining strategy? If so, would trying various partitions of the training/validation set be another source of over-fitting or data abuse?
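
    To pin down the analogy, the generic stacking recipe I have in mind looks roughly like this (a sketch with scikit-learn, not the paper's exact procedure):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                               n_classes=3, random_state=0)

    stage1 = LogisticRegression(max_iter=500)
    # Cross-validated probabilities: each sample's prediction comes from a model that never
    # saw it, which is what keeps the second stage from overfitting to stage-1 outputs.
    held_out_probs = cross_val_predict(stage1, X, y, cv=5, method="predict_proba")

    stage2 = LogisticRegression(max_iter=500)
    stage2.fit(np.hstack([X, held_out_probs]), y)   # second stage sees features + stage-1 beliefs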

    Replies
    1. That is an interesting way to look at the second round of stacking. I would not completely agree with you on the cross-validation part. I always view stacking as a nice combination strategy, sort of like calibrating the outputs of various classifiers: the "stacked" classifier decides which of the first-level classifiers are reliable, etc. The held-out data comes in so that they can get an inkling of its test-time behavior.

    2. > sort of like calibrating the outputs of various classifiers.
      seems like cross-validation to me even here...

  12. Points I like about this paper:
    1. The idea of breaking an intractable inference problem into smaller subproblems. This provides good insight into avoiding intractability from a computational perspective.
    2. The way they do hierarchical stacking is interesting, since they use a leave-one-out policy to simulate the distribution of the predictions at test time. This approach is also applied in many other settings to avoid overfitting and thus make the model generalizable, e.g., discriminative patch discovery [1].

    Things not clear to me:
    Due to the page limit, the authors defer to other references for details about many design choices (e.g., why choose a max-ent classifier as the building block? What about other classification methods?). Another concern is the lack of results validating the key contribution of the paper, which is to decompose a big intractable inference problem into smaller subproblems without performance loss. It would be great if there were some comparisons that let us see directly how well this simplification works.

    Replies
    1. I think the ability to break the inference problem into smaller subproblems is tied to the hierarchical model they use in this paper. It is more like a layered, directed graphical model, so we can solve the problem in each layer separately. An interesting question is whether we could do the same thing with other graphical models, or reach some conclusion about which types of graphical models can be solved this way.

    2. Definitely a very interesting question... I think I know of more papers from Drew Bagnell that have broken other graphical models into a series of predictions as well.

    3. I agree that many of the design choices seem arbitrary. In comparison to this method, the gPb-owt-ucm method seems very straightforward to me. I wonder if this is just because I'm more comfortable with signal processing jargon, or if the current paper is really more complicated.

    4. Sorry, I forgot the reference...
      [1] Unsupervised Discovery of Mid-Level Discriminative Patches, Saurabh Singh, Abhinav Gupta, Alexei A. Efros

  13. The key point in this paper is the ability to solve the segmentation problem by solving each level in a hierarchy of segmentation problems and passing predictions between the levels. I wonder if this could also work for object detection at multiple scales. Instead of feeding multi-scale features into a single classifier, we could train a classifier at each scale (the hierarchy) and pass the predicted detection scores back and forth. Seems like something that has probably been done, but it would be interesting to see.

    Replies
    1. Again, this is why I think that segmentation (at least at the semantic level) and object detection are almost the same problem approached from different perspectives. When we recognize the existence of a visual entity, do we want to fill the projection on our retina with some color or put a box around it?

    2. I view object detection slightly differently. Here in this paper, we are essentially talking about a "scene", or better, an image consisting of multiple objects. And from my understanding, the labels provided in this case are only at the object level, meaning that the relations between different parts of the image are object-to-object relationships.
      On the other hand, multi-scale object detection in such a framework wouldn't work, because we don't really partition an object into a set of sub-objects. Yes, conceptually it should be doable with meticulous labeling of each sub-part of the object, but that is not feasible, as we are already facing a lack of labels even at the whole-object level. And with limited labeling, I also doubt this would perform better.

    3. Well, I think segmentation produces a more fine-grained shape for the semantic-labeling problem. If the segmentation algorithm works perfectly and we have a good representation for it, it should outperform bounding-box based methods, because it gets rid of the unnecessary background clutter automatically and does not need extra data to let the machine determine which region it should look at (training to get the weights). Rectangular regions work fine if we have enough data, yet to take an unbiased look we'd better work with segments, which is of course not possible at this time, since humans are not able to label images in such fine detail.

  14. Good idea:
    The idea in this paper of breaking the intractable inference problem into subproblems is really interesting. The layered hierarchical graphical model can be decomposed into a subproblem for each level.

    Question:
    My question is that the way they obtain the different levels of image segmentation is not clear. It seems to me this model will only work if, within the same level, each region or superpixel has a similar size or degree of completeness; in other words, the coarseness of the segments at each level should be comparable. However, it is not clear to me how we could enforce this, and it seems hard to maintain when segmenting both complicated and simple images.

  15. Ehh... I am curious whether there is a way to compare results from this paper with those from the Spatial Pyramid Matching paper from a week ago. They use different databases and maybe different image features, so I guess not really. I just mean that Spatial Pyramid Matching looked so robust and straightforward compared to this one.

    Replies
    1. I'm not totally clear on exactly what the levels in this paper are, but I'm fairly certain they're not on a grid. It seems, though, that there are many examples of this coarse-to-fine idea and that they all seem to perform quite well.

      Also, the Spatial Pyramid Matching paper was trying to match scenes. This one, I think, was trying to get better segmentations at the lower levels?

  16. I found the methodology of this paper pretty difficult to understand. I think I understand the broad idea - performing segmentation at coarse-to-fine levels, passing probabilities between levels, not making any hard decisions. However, I understand almost none of the details. How do you separate the image into different regions? What, exactly, do the different levels mean? What, exactly, is passed between levels? If each level simply predicts a proportion of labels, is that what is passed down? Do we hope that at the finest level, each region has a huge majority of one label, and that becomes the classification?

    In general, I really like the idea of not making a hard decision and passing probabilities along (even if I don't quite understand how they're doing this in this paper). It seems like at the top level, the algorithm gets a general idea of the content of the image, and at the lower levels, figures out the finer details of the contours. It makes sense that getting a general idea and passing this information downward for figuring out details works well over trying to do everything at once.
