Sunday, September 22, 2013

Reading for 9/24

P. Arbelaez, M. Maire, C. Fowlkes and J. Malik, Contour Detection and Hierarchical Image Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence. May 2011.

And optionally:

X. Ren and J. Malik, Learning a Classification Model for Segmentation, ICCV, 2003

Joao Carreira, Cristian Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. CVPR 2010.

70 comments:

  1. Summary:
    In this paper, the authors provide a novel way of looking at contour detection and using it for image segmentation. The proposed method, called globalized probability of boundary (gPb), uses brightness, color and texture cues and combines them to obtain the probability that a pixel is a boundary pixel at a particular orientation. The original Pb paper by Martin et al. in PAMI 2004 and the subsequent improvements by Maire, Arbelaez et al. in CVPR 2008 and 2009 were the motivation behind this paper. The multi-scale extension to Pb by Ren in ECCV 2008 was a significant contribution to the field. The authors of this paper introduce their own multi-scale version of the local Pb detector, called "mPb". In addition to the multi-scale version, the authors propose a method for introducing global information into their framework using spectral clustering techniques, called "sPb". Once mPb and sPb are computed, gPb is just a linear combination of the two.
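    To make that combination step concrete, here is a minimal Python sketch of the idea as I understand it (the function names and default weights are mine, not the paper's; the actual weights are learned on BSDS training images, and the signals are assumed to be numpy-style arrays of shape (H, W, n_orientations)):

        def mpb(cue_gradients, alphas):
            # cue_gradients: one oriented gradient signal per (channel, scale) pair;
            # alphas: the matching learned weights. mPb is simply their weighted sum.
            return sum(a * g for a, g in zip(alphas, cue_gradients))

        def gpb(mpb_signal, spb_signal, w_local=1.0, w_spectral=1.0):
            # gPb: a linear combination of the local (mPb) and spectral (sPb) signals.
            return w_local * mpb_signal + w_spectral * spb_signal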

    Using the gPb information, the authors then propose the Oriented Watershed Transform – Ultrametric Contour Map (OWT-UCM) method for image segmentation. They compare the results of this segmentation on the Berkeley Segmentation Dataset with standard segmentation algorithms like Mean Shift, NCuts, etc. The results clearly show that gPb-owt-ucm outperforms all the other previous methods both for boundary detection and segmentation.

    Strengths:
    1. The intuitive reasoning behind the proposed method for contour detection is very simple and effective.
    2. The mPb is a fairly straightforward technique for introducing the multi-scale factor into the algorithm.
    3. sPb is a really cool idea. The authors have used the fact that eigenvectors carry contour information and they use this information to obtain the spectral component of the boundary detector.
    4. I like the idea of using a modified version of the watershed transform and the way they enforce consistency of the boundary strength to remove artifacts. Using the OWT weights as priors for building the region tree with the UCM is interesting.
    5. The Interactive Segmentation part, where the method combines human inputs and their algorithm is a nice way of getting better results.
    6. The authors have tested their method on other datasets and explained why it performs the way it does.

    Weaknesses:
    1. The paper uses gradient ascent to learn the parameters of mPb and gPb. Since gradient ascent finds only a local maximum, what would happen if some other method, like boosting, were used to learn the weights? Was this explored?
    2. Though the results of the hierarchical image segmentation look reasonable, they are not the best that we can get. CPMC adds Gestalt features to the gPb method and clearly it performs better.

    gPb is definitely a huge breakthrough for the problem of contour detection and segmentation. There have been a lot of papers that use gPb and have significantly changed the way segmentation problems are approached. I wonder if there can be a better baseline than gPb for this set of problems!

    ReplyDelete
    Replies
    1. I'm the other critique-er and am posting here just so it's on top.

      This paper explores the idea of using contours to inform image segmentation. They begin by extending the work of Martin et al., who defined a Probability of Boundary function, Pb. The authors create:
      (1) a multiscale version of the Pb detector (mPb), which computes the Pb signal over three scales and then combines them linearly;
      (2) a spectral boundary detector (sPb), which performs spectral clustering on an affinity matrix computed from mPb; and
      (3) a globalized probability of boundary (gPb), which is a linear combination of mPb and sPb.

      The authors then use the boundary strength metric to create a segmentation using the watershed technique. Segmentation boundaries (watershed arcs) are where separate basins meet. The strength of these arcs is determined by the boundary-ness of the pixels along the arc. Standard watershed algorithms consider only the magnitude of boundary-ness, which causes artifacts (e.g., horizontal watershed arcs weighted highly only because they are in proximity to pixels that have strong vertical edges). The authors therefore take orientation into account to remove these artifacts.
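      Roughly, the fix looks like this (my own sketch, assuming the oriented signal E of shape (H, W, n_orient) and the watershed arc pixels are already available):

          import numpy as np

          def reweight_arc(E, arc_pixels, arc_theta, thetas):
              # Instead of taking the max of E over all orientations (which lets a
              # nearby strong edge inflate a perpendicular arc), sample E only at
              # the orientation closest to the arc's own direction, then average
              # along the arc. (Angle wrap-around is ignored for simplicity.)
              o = int(np.argmin(np.abs(np.asarray(thetas) - arc_theta)))
              return float(np.mean([E[r, c, o] for (r, c) in arc_pixels]))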

      The authors then use a sort of segmentation tree, called an Ultrametric Contour Map, to segment the images. This map merges regions by considering the average strength of their common boundary. Leaves are the separate regions defined by the watershed method. The root is the whole image.
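      A minimal sketch of that merging loop (my simplification: the real algorithm re-averages the boundary strength between a newly merged region and its neighbours after every merge, which this single-linkage version skips):

          def build_ucm(arc_strength):
              # arc_strength: dict mapping frozenset({r1, r2}) -> average OWT strength
              # of the common arc between adjacent watershed regions r1 and r2.
              parent = {}

              def find(r):  # current super-region containing r
                  while r in parent:
                      r = parent[r]
                  return r

              merges = []
              for pair, strength in sorted(arc_strength.items(), key=lambda kv: kv[1]):
                  a, b = sorted(find(r) for r in pair)
                  if a != b:
                      parent[b] = a
                      merges.append((a, b, strength))
              # Thresholding the merge sequence at a value k yields one segmentation;
              # sweeping k gives the whole hierarchy (leaves = watershed regions,
              # root = the full image).
              return merges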

      The authors briefly explore the idea of using their tree in human assisted segmentation.

      Results of the authors' algorithm are compared across a few datasets, and, as (almost) always, the authors outperform existing algorithms.

      Strengths:
      - multiscale extension of Pb algorithm
      - computing gradients on eigenvectors from spectral clustering
      - oriented watershed transform extension of traditional watershed method
      These contributions by the authors are all simple and effective. Reading the paper, they seem like such obvious solutions to the problems that the contribution seems almost trivial, as if anyone could think up these solutions. However, if at the time, others had not considered such solutions, then I would say that these are very strong contributions of this paper.

      I also like the idea of creating nested segmentations using the Ultrametric Contour Map. It seems like a great way to order multiple segmentations.

      Weaknesses:
      - Is segmentation from contours really the best method? It seems like finding contours is difficult in itself. Should we try to solve a difficult problem using the solution of another difficult problem? Are more direct approaches to segmentation better? The paper seemed to jump straight into segmenting using contours, but I didn't find a convincing argument as to why this was the way to go.

      - The authors had problems with the weights on their watershed arcs. The cause of this, I think, is that their boundaries were not well localized. I think this is due to the use of histograms - being slightly off of the boundary will still probably produce large differences in the histograms. I'm no expert in segmentation, but I was wondering if this histogram method is really better than using gradients. It seems that gradients would be better localized on the boundary, but I could be totally wrong here.
      - The authors mention that it would be nice to represent uncertainty about a segmentation. They then proceed to describe their UCM algorithm. I'm not clear on exactly how this represents uncertainty about a segmentation. Lower in the tree seems to represent sections that should be merged with high confidence. The leaves of the tree are clearly oversegmented, and the root is undersegmented. However, this doesn't provide any estimate for how good a particular segmentation is for a given image.

      Delete
    2. In reply to Divya's comment:
      Weakness 1: Using gradient methods for optimizing weights.
      They provide a detailed analysis of different methods for combining these local cues in Martin et al. [2] (http://www.cs.berkeley.edu/~malik/papers/MFM-boundaries.pdf). I think they do a very good analysis of various ways of combining cues, but in the journal version (which we are supposed to read) they drop all that and just mention "In contrast to [2] and [28], which use a logistic regression classifier to combine cues, we learn the weights \alpha_{i,s} by gradient ascent on F-measure..."
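      (For reference, the F-measure being maximized is presumably the standard BSDS boundary measure, F = 2PR/(P+R), the harmonic mean of boundary precision P and recall R against the human ground truth.)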
      This was a let-down, especially because they didn't justify this change.
      On a similar note, they removed many of the details from the original papers in this journal version, which I describe in my own comment.

      Delete
    3. @Ada, in the WT the mean value calculation of E(x,y) is independent of the orientation of the arc, so pixels on a "weak" arc may be close to other "strong" arcs that cause the pixel's E(x,y) response to be strong (E(x,y) is maximal over all orientations). This is non-intuitive, so the authors then take the orientation of the arcs into account and calculate E(x,y) only along the orientation of the arc being weighted, not over all orientations. I think even gradients would suffer from the fuzzy boundary problem that gPb suffers from.

      Delete
  2. I've long been a follower of the Berkeley Segmentation Dataset and a reader of its papers. The Berkeley group has produced a lot of canonical papers on traditional unsupervised segmentation algorithms, and has been pushing their performance to the limit.

    Yet unsupervised segmentations, after all, are unsupervised. They have a lot of limits. They helped a lot in our understanding of "perceptual grouping", but they are definitely not going to give a final solution to the segmentation problem. This is my deepest feeling after years of following their work and researching the same topic. Just like spatial pyramid matching, they are milestones in the vision community.

    ReplyDelete
    Replies
    1. There are several different things that people mean when they say "segmentation problems." If we want to do "person" vs "background" then top-down information is crucial. But even the lower-level problem of splitting an image into consistent regions is tough and may require more than a bottom-up approach.

      Delete
    2. I agree. The topics are more related to the problem of "perceptual grouping". The Gestalt laws are good investigations into perceptual grouping, yet many more things need to be researched to define what "perceptually similar" really means.

      Delete
    3. I agree with both of you. However, the way these datasets are constructed, the users have their biases in labeling semantic regions as segments. The only way I see around this bias is to give a user only small patches (or something similar) and ask them to mark the boundaries. This way we still keep "perceptual grouping" at the low/mid level but avoid semantic biases. This problem, though mentioned in previous papers, hasn't been explored.

      Delete
    4. I particularly agree with the idea of incorporating higher-level semantic information into the segmentation task. One thing that really needs to be clarified is: what do we want from image segmentation?
      To me, the answer is that it could serve as an intermediate step to help us do higher-level vision tasks, i.e., to really understand the image the way a human does. However, this is clearly a chicken-and-egg problem, since being able to categorize and retrieve "objects" or "stuff" from the image could help us segment the image in a semantically meaningful way. I know this is out of the scope of this paper. But since we know what we want from segmentation, and also know what we need in order to do "useful" segmentation, why don't we model them jointly? I particularly like the concept of the following paper, which I think Ishan has also mentioned:
      http://www.cs.unc.edu/~jtighe/Papers/CVPR13/jtighe-cvpr13.pdf

      Delete
    5. I agree with Fanyi. To me, unsupervised segmentation is just an intermediate step of semantic learning. Those segments only have meaning after they are assigned semantic labels. It will be interesting if we can combine segmentation and semantic learning.

      Delete
  3. Regarding the paper specifically, I want to bring about an additional discussion point: Contour Finding based Methods vs. Clustering based Methods.

    I'm not talking about active contour models here, like Snakes, level sets and many other PDE-based methods. They seem to be out of the scope of the discussion here and are used more in medical image processing.

    There are several other "contour finding based methods". "Efficient graph-based image segmentation" by Pedro Felzenszwalb can be counted as one, and so can the "Ultrametric Contour Map". If you look at all these algorithms, the biggest commonality is that they try to model the inter-region (or inter-cluster) dissimilarity and use that as a criterion for deciding whether to split or merge any two regions. But for tractability they typically start by merging small regions, which is bottom-up.
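    For concreteness, the merge criterion in Felzenszwalb-Huttenlocher (quoting from memory) joins two components C1, C2 whenever the minimum edge weight between them, Dif(C1, C2), is at most min(Int(C1) + k/|C1|, Int(C2) + k/|C2|), where Int(C) is the largest edge weight in the component's minimum spanning tree and k trades off a preference for larger components - a purely local, bottom-up decision of exactly the kind described above.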

    The good thing about formulating problems this way is that they are often tolerant to intra-region variations which are locally smooth but globally considerably large. Another good thing is that they often generate a hierarchy of regions from coarse to fine, which is quite useful. But the most annoying thing about these algorithms is "region leak" caused by weak boundaries. Because they typically start from small regions, the weak boundary portions are merged in the early stages already, and weak boundaries are ubiquitous. I once tried the gpb-owt-ucm code. It is state of the art and definitely nice on many images, but it generates annoying over-merged results on many images too. And I believe that for subsequent operations such as detection/recognition, over-merging (region leak) is more disastrous than over-segmentation.

    ReplyDelete
    Replies
    1. I agree with your opinion that "contour map"-like methods can work pretty well on strong boundaries, but the early merging of weak boundaries is indeed an issue. I once had experience segmenting the iris region from ocular images using Chan-Vese's active contour method. Here is a link to a related paper segmenting iris images using active contours: http://www.cse.msu.edu/~rossarun/pubs/ShahRossGACIris_TIFS2009.pdf. The outer boundary of the iris appears as a weak boundary under NIR illumination. Using an energy/PDE-based method like active contours seems to work pretty well, especially when there are occlusions and reflections. It seems that energy-based methods can alleviate the early merging of weak boundaries while growing the snake.

      Delete
    2. I have always wondered why, while contour detection is so popular in computer vision, it cannot be applied back to machine learning as well. A contour can be regarded as the boundary between two clusters; in general, can we define a boundary classifier for general machine learning tasks?

      Another thought is about why contours work for vision tasks. From the course I learned that humans tend to look at edges inside an image to recognize objects, and the most successful feature for detection (HOG) is all about edges. There are ways to incorporate color, yet they are quite incremental. It is quite intuitive to check the continuity of an edge and the contrast across it; this might be one reason. A downside of region-based clustering methods might be that it is very hard to identify a whole segment, and the contours separating the regions have different importance.

      Delete
    3. In machine learning, I think a similar topic for unsupervised learning is discriminative unsupervised learning. Today most unsupervised learning methods are generative; there are few discriminative ones. But for supervised methods, both families exist.
      You may be interested in the following works on maximum margin clustering:
      1. L. Xu, et al., Maximum margin clustering. NIPS, 2005.
      2. Kai Zhang, Ivor W. Tsang, James Kwok, Maximum Margin Clustering Made Practical, ICML 2007.

      Delete
  4. On the other hand, clustering-based segmentation aims at finding regions through elaborate design of intra-region similarity metrics. If you look into the family of clustering based methods there are also many canonical algorithms, such as "The Shift Family" (mean shift, quick shift, really quick shift, dynamic mean shift, medoid shift, median shift, convex shift...), "The Cuts Family" (Min-Cut, Average Cut, Normalized Cuts, Ratio Cut, Graph Cut), spectral clustering. Even energy minimization methods and graph cut belong to this category because their color models and likelihood terms are essentially modelling intra-cluster similarity.

    Compared with contour finding, segmentation by clustering tends to suffer less from over-merging, but possibly at the cost of over-segmentation due to the limited flexibility of the cluster model.

    Cuts in some sense look like boundary estimation, but they are not in a bottom-up form. They are actually top-down and often result in NP-hard/complete problems. Normalized Cuts bypassed this by relaxing the problem into the continuous domain and finding the eigenvectors of a generalized Rayleigh quotient. I seldom use NCuts cuz they are slow and generate weird over-segmentations even in very smooth regions. I think this is due to the relaxation, which generates smooth eigenvectors that need to be discretized in the end.
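    (As a reminder of how that relaxation works: minimizing Ncut over discrete indicator vectors is NP-complete, but allowing real-valued v turns it into minimizing the generalized Rayleigh quotient v'(D-W)v / v'Dv, which is solved by the generalized eigenvector of (D-W)v = \lambda Dv with the second smallest eigenvalue, and that continuous solution then has to be thresholded/discretized.)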

    Mean shift is old, but believe it or not people are still using it, simply because it is robust and doesn't have the over-segmentation problem. In fact it is kind of unfair to compare gpb-owt-ucm to mean shift, cuz mean shift uses very naive features, while gpb-owt-ucm did a lot of feature engineering. Features and similarity metrics matter a lot in segmentation and directly influence the segmentation quality. In terms of just the segmentation machinery, gpb-owt-ucm acts just like UCM and many other contour-based methods; it has its weaknesses. If one uses better features and improves the spatial smoothness constraint, the shift algorithms can also generate comparable results.

    ReplyDelete
    Replies
    1. doesn't have the over-segmentation problem -> doesn't have the over-merging problem.

      Delete
  5. Finally, I like the CPMC idea most, in the sense that it represents one of the possible and reasonable future directions: combining segmentation with top-down understanding.

    Humans may be doing it in a similar way: perform low-level perceptual grouping and generate multiple segmentation hypotheses first, based on perceptual similarity. Then reject many of the unreasonable ones using higher-level information.

    I also like graph cut and other interactive segmentation methods very much. In fact they are among the few segmentation methods that have real commercial applications. They are fast, and they produce really accurate segmentations that can already be used for real editing/selection purposes. You can look into "Quick Select" in Photoshop, which performs graph cut over superpixelized images. This also shows how important high-level information is in segmentation.

    ReplyDelete
  6. A fundamental problem I have with the UCM methodology is that fine-scale segmentations are always refinements of coarse-scale segmentations. But in an image like this one http://www.theinquirer.es/savedfiles/070613_crowd_above_02.jpg, a fine-scale segmentation should outline each person and a coarse-scale segmentation should outline the crowd as one unit.

    ReplyDelete
    Replies
    1. Can you justify why the crowd should be considered as one unit at coarser scales without using semantic, cultural knowledge?

      -- Matt Klingensmith

      Delete
    2. Agree with Matt's comment. Without semantics and cultural knowledge, it seems pretty difficult for the algorithm to do this. However, I think it might count towards "perceptual" grouping (using the Gestalt laws), but I don't know...

      Delete
    3. The crowd can be thought of as a texture pattern and not really a group of people. This can be a reason to group them into one region at level 2 of the tree. Cultural info is probably used, but not necessarily.

      Delete
  7. Wow, this got deleted last time since I don't have any of the identities the blog wants (I'm really opposed to the blog, btw. Can we go back to handing things in please?)

    Anyway, I'll try to write it this time someplace where I can back it up:

    ====

    In this enormously complicated paper, the authors attempt to solve both contour detection and image segmentation at once by pipelining the output of a multi-scale contour detector into a "hierarchical region tree," which is then used to store spatial information about 2D components of the image. In the first stage of the pipeline, contours are detected by computing multi-scale local gradient features using histograms on oriented, bisected discs. These local cues are then combined with a spectral-clustering component in a process they call "globalization," yielding the "gPb" detector. The "global" contours are then fed into the "Oriented Watershed Transform," which finds the catchment basins of the boundary signal and weights the arcs between them by oriented boundary strength. Agglomerating these regions produces what they call an "Ultrametric Contour Map," a hierarchical tree of such regions.

    Each stage of the pipeline represents a synthesis of existing state-of-the-art methods.

    Human guidance can be inserted into their pipeline during the contour phases, the Oriented Watershed Transform, or during the Ultrametric Contour Map to inform human-guided segmentation and contouring algorithms. They compare their segmentation method to several others using ground truth data obtained by asking humans to segment images. Their method performs favorably under these circumstances.

    I unfortunately can't comment directly on the technical details of each phase of the pipeline, since I'm utterly unfamiliar with the field of image segmentation (and especially its state of the art forefront) but I will make a few vague observations:

    1. Their use of human-in-the-loop algorithms at each stage of their pipeline opens up exciting new applications. Imagine in a Photoshop-like program, wanting to segment out an object from a cluttered background. Currently, this is a time-consuming task requiring the human to carefully trace (with some assistance), the boundary of the object. What if, instead, the program presented the user with a number of regions which they connect merely by clicking and dragging? Then, the image could be automatically segmented much faster.

    2. I'm not convinced that the "ground truth" data is meaningful. Humans clearly segment images in totally different ways. This suggests that, depending on the task at hand, there are probably multiple "right" answers. Just looking at the human-segmented data, we can see that most of the time, people seem to want to segment the scene into the projections of semantic 3D objects onto the image plane (we'll call that the 3D-first approach), whereas some of the humans seem to segment the image into regions of 2D similarity (which I will call the 2D-first approach). Clearly, any method which relies on 2D contours is not going to capture some of the 3D segmentation information that humans "want" from an image. In the human-guided examples, you can see that the humans select arbitrary regions to belong together, some of which are occluded. Even some "unimportant" 3D objects are segmented into homogeneous "background" elements, suggesting that some semantic knowledge is also at play there.

    So I think we have to ask ourselves, what is a good segmentation? What do we want out of a segmented image? Do we want only to consider image properties, or are there 3D and semantic properties which need to be considered as well?


    -- Matt Klingensmith.

    ReplyDelete
    Replies
    1. The CPMC paper talks about "what makes a good segment". The method basically generates multiple segmentations for an image and assigns a rank to each of them. The performance is then evaluated using the Segment Covering benchmark. Though I'm slightly skeptical about how well their segment ranking scheme works, the discussion of good segmentation is definitely worth exploring.

      Delete
    2. Matt, you're still welcome to submit a printed summary if you don't want to make a second blog post (and note that you don't have to summarize the paper in either one of your posts). However, I think that one technical glitch is not a very good reason to oppose the entire blog on principle, especially since most people I know already make a habit of composing long web form submissions in external editors. While I took this class, I remember finding others' comments very helpful in my own understanding, plus reading comments does help presenters prepare. If you still feel this way, though, we'd be happy to discuss further.

      Delete
    3. I don't think that humans select 'arbitrary' regions as belonging together - it's just that faced with the task of labeling a group of objects adjacent to each other, at some point the boundaries get too complicated for labeling and they start dumping things together in the same segment. This is quite obvious in the mushroom example on page 16 of the paper. Though the ferns and the rocks are quite different in appearance and 3D depths, the person doing the labeling has grouped everything but the mushroom into a single 'background' segment.
      And again, the depth of detail each person delves into while labeling an image will differ from person to person. An old paper from Malik's group discusses how we can take the different levels of granularity in labeling into account while comparing two segmentations [1].
      It may be true that humans, most of the time, attempt to perform '3D-first' (as Matt puts it) segmentation by inferring 3D and semantic meaning from the image before proceeding to segment it. While it might seem that this algorithm and a lot of others rely solely on 2D contours and therefore neglect 3D information, it can also be argued that in most cases segments of 3D objects are often a subset of the 2D segments. The goal of these algorithms, as I understand, is usually to group pixels together based on appearance and location so that the pixels in each segment very likely belong to the same object, though the actual 3D object might be divided across multiple segments sometimes. Then, such segmentations using 2D cues would be a first step towards obtaining segments that humans 'want'.


      1. Martin, David, et al. "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics." Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on. Vol. 2. IEEE, 2001.

      Delete
    4. I agree with Matt's statement that we should consider what makes a segmentation a good segmentation. Maybe segmentations that computers find helpful are not those that humans would pick.

      Along that line, maybe we should also consider the importance of segmentation itself.

      It seems to me that segmentation is often thought of as an intermediate step to make computer vision easier. For example, segmenting out objects to then do object classification on the segments. Or maybe, in the medical world, segmenting out cells to do cell counting/tracking.

      What is the end goal of computer vision? I'll take a stab at answering this question: to allow robots to interact intelligently with the world. (By "intelligently," I mean not running into objects/walls, capable of reasoning about their visual world enough to manipulate objects.) I realize this is still not well defined, but my point is that for this task, maybe segmentation is unnecessary for achieving the end goal.

      Humans may use some sort of segmentation, but this may not be necessary for robots particularly since it is clear that humans don't agree on what defines "correct" segmentation. Furthermore, what does it even mean to segment something with soft edges like a wispy cloud, and do we even need to figure out how to do this? I personally would hate it if someone put a picture of a cloud in front of me and asked me to segment it out.

      What do you all think? Why is segmentation a necessary problem to solve?

      Delete
    5. Wow. What a dense paper! One thing that really stands out to me, conceptually, is the idea of representing contours as a signal instead of a simple existence/non-existence feature. There has been published research suggesting that neurons in deeper levels in the visual cortex will respond to specific combinations of contours; there may be a connection between soft contour detection and object recognition:

      http://jn.physiology.org/content/86/5/2505

      Delete
    6. I agree with Ada that segmentation by itself is an intermediate step towards solving the vision problem, and I do think it is very much essential to solve the segmentation problem. I can't see how else a robot could be trained to understand what each "region" represents in an image of the real world. Also, I think the goal of segmentation is to define the shape of objects better rather than just using a bounding box around them, which should ideally be a better input for object recognition, as Jacob mentions. By "objects", I mean things that have a defined shape and size, as Arne mentioned in the previous class. Only when you need information about the entire image (say, if we want to decide whether the given image is a meadow) does the importance of things like grass, sky and clouds come into the picture, and that then becomes the problem of scene classification. Segmentation may not be too important for that.

      Delete
    7. I agree with Divya. I think segmentation is a good problem to solve especially if you think
      1. there is a distinction between "objects" and "stuff".
      2. "stuff" can help tasks like detection, navigation etc. by providing context.

      Delete
    8. Also, I think for a problem like segmentation there are multiple correct answers. I am more inclined towards methods that give multiple segmentations, rather than a single one (e.g. Derek's paper - Recovering Surface Layout from an Image).

      Delete
    9. I don't think that segmentation is a particularly good problem, unless some application specifically requires it, but that's just me :)

      Delete
    10. I prefer to think that segmentation is necessary to solve vision, so that we can reason about objects that have a well-defined shape rather than a bounding box. If we had perfect segmentation, we could reason about each region in isolation and aggregate those results. So I would say that solving segmentation would definitely help solve other vision problems. But we could potentially also solve object detection first, for example, and then use those results to get closed contours of the objects, giving us a valid segmentation. And for a robot, I would say that a segmentation consisting of the various objects the robot must be aware of, with everything else as a "background" region, would be enough to let it interact "intelligently".
      As for what qualifies as a good segmentation, I would guess it would be the average segmentation provided by a large number of people. Most people will segment out similar features in the image, with differences in small details and in the resolution of the segmentation, but I think that if we took an average over a large number of people, we could easily see which contours are the important ones (those occurring in a large majority of human segmentations) and which are finer-grained details. A segmentation algorithm's job would then be to mark all the important contours and none that do not occur (i.e., not make any mistakes).

      Delete
    11. @Priya: See this "Learning to Localize Detected Objects" (http://www.cs.uiuc.edu/~dhoiem/publications/cvpr2012_objectsegmentation_qieyun.pdf)

      Delete
    12. I agree with this post and Abhinav's comment... segmentation is not a particularly well-posed problem. Considering the fact that humans themselves often cannot agree on what a proper segmentation is, there is only so far one can push the "state-of-the-art" in this problem without needing to rely on more information, e.g. semantic and contextual information. Segmentation is useful as input to top-down and semantic approaches, but pursuing "the best segmentation" is difficult and I think not particularly interesting.

      Delete
    13. I think segmentation is not a well-posed problem, but it is currently an essential stage in the whole pipeline of a vision system, since it explicitly gives the notion of objects of interest. There are several approaches proposed for segmentation evaluation, and most of them focus on segmentation itself. I don't think the ground truth for segmentation is very meaningful, since different human individuals also focus on different scales or levels of fineness of segmentation. However, I would hope to address the intermediate segmentation phase and the higher level of recognition and matching as one joint optimization problem rather than as separate stages, which might be more effective. The correctness of a segmentation algorithm, or of one hypothesis from a set of potential segmentations, should be evaluated in the context of the whole pipeline. This seems to become a chicken-and-egg, or coupled, problem again.

      Delete
    14. This comment has been removed by the author.

      Delete
    15. I find this question of how relevant the segmentation task is to solving the vision problem interesting. The ground truth data from the humans is the result of a very deliberate exercise in perception, human intelligence, and even drawing. (I have seen ground truth labels from people with no artistic skill.) It is difficult to claim that this ground truth is completely separate from human reasoning about the world. This data is then used to benchmark an algorithm with comparatively little knowledge about the world. I am impressed that their algorithm does so well, but that only increases my concern about why.

      I think that metrics are another way in which results can appear better or worse than they are. For example, a single edge, added or deleted, can have a large implication semantically, but it's hard to think of a way to capture this in the metrics or ground truth.

      Delete
  8. This comment has been removed by the author.

    ReplyDelete
  9. Things I liked about this paper
    - Bottom-up dendrogram approach to segmentation.
    - A nice spin on the N-cuts formulation, and a nice interpretation of the eigenvectors of the graph Laplacian matrix. I knew that they represented connected components, but this visualization was nice.
    - interactive segmentation application
    - The speedup trick using integral images and separable 1D filters in the appendix.

    Things I would have liked to see
    - How much does each cue help in the gradient computation step? Particularly, how much of a boost (and I expect a lot) does the filter bank provide?
    - A nicer way to handle scale invariance for contour detection. As they mention, an adaptive scale, depending on what region of the image you are looking at, would help. I think this is crucial to addressing the segmentation problem. There are lots of "large-smooth-homogeneous" regions where slight errors in segmentation are okay, e.g. you may not care exactly where the sky meets the ground; your robot may be off by 10m, but it's still fine. But there are other "small-very-missable" regions that may be important: your robot-monster-truck shouldn't miss a person/child.
    - A more easy-to-understand writing style e.g. "is composed of 591 natural images" => "has 591 natural images".

    ReplyDelete
  10. I agree with most of the comments above on the things they like about the paper. Here are some more points from my perspective:

    Pros:
    - The paper is written with a nice flow. They discuss what they do, what the problem with that approach is, and how they fix the problem. Sometimes it appears that most things are tricks to deal with some limitation of the proposed method, but that's what most works are like, in general.
    - I like all the different cues that they incorporate for contour detection.
    - Their interpretation and use of eigenvectors.

    Cons:
    - My first concern is that they exclude all the details of the contour detection from this journal version. I understand that these details are already covered in another PAMI paper, "Learning to Detect Natural Image Boundaries Using Local Brightness, Color, and Texture Cues", but I seriously missed the description and intuition from that paper.
    - This work even omits details from their previous works in both contour detection and image segmentation. Frankly, I would have liked to see 2-3 more pages added to this already 20 page paper!
    - “In contrast to [2] and [28] which use a logistic regression classifier to combine cues, we learn the weights \alpha_{i,s} by gradient ascent on the F-measure using the training images and corresponding ground-truth of the BSDS.” No justification is provided for why they did this. Especially given that they have a very detailed analysis of why they used logistic regression in [2], this was a letdown.

    MAJOR Concerns for me (Cons):
    BSDS dataset and evaluation: This is a generic comment on the dataset rather than on this paper. 1) The authors also point out the photographic bias in Section 6, so I won't repeat it. 2) The dataset was constructed by asking users to label boundaries. The users have their inherent biases in labeling regions or boundaries. Specifically, users have semantic biases, which lead them to mark boundaries for objects/stuff rather than marking perceptual/low-level boundaries. For example, see the middle column of Figure 3: there is a clear boundary at the reflection of the boat in the water, but all users mark the entire water as one region. Maybe we want the computer to do the same thing, but then it should have access to semantic supervision. The only way I see of getting around this bias is to give the user a small patch (or something similar) and ask them to mark the boundaries. This way we still keep "perceptual grouping" at the low/mid level but avoid semantic biases.

    Top-down information for objects: Frankly, this section was the biggest letdown for me. I expected them to use the ground-truth segmentation of objects from the MSRC or PASCAL datasets, and to use this top-down semantic supervision to guide gPb. They could have learned a gPb specific to different objects, one gPb classifier that corresponds to just objects, etc. All they do in this work is use bounding box information to select the right segmentation threshold.

    ReplyDelete
    Replies
    1. Another practical drawback: it is *very* slow to use.

      Delete
    2. To echo the sentiments of other readers, I believe the segmentation quality is tied to the final task. If this task involves human semantics, then wouldn't a semantic bias be ideal?

      Also, it seems to me that using top-down information from the labeled datasets would do exactly the opposite of your suggestion to keep to low/mid-level perceptual grouping.

      Delete
    3. "Inconsistently correct" vs. "Consistently wrong". I want to add on your comment about the human labeling of ground truth data. I had experience outsourcing some labeling tasks on m-turk. The task is simple, given a face image, label where the eyes are, where the nose is, etc. There of course are automated ways to do that (e.g. active shape model, active appearance model), but their accuracy drops significantly when non-ideal faces are present (low-res, off-angle, occlusion, illumination variations, etc). Humans, however, can (or at least expected to) perform much better in these cases. What we found out from the m-turk labeling is that human labelings are so inconsistent. A single clicker can produce highly variant labels for a particular fiducial point. Whereas, automated algorithms are always slightly off, but they are very consistent.

      Delete
    4. I think that in your example of learning a specific gPb for each object, you could probably do better by using a linear combination of a set of basis segmentations. This would give you a good coarse segmentation, though it would be lacking in finer details. Also, in a big-data approach, you could find the k-nearest neighbors (in the training set) within the object class and take the average segmentation of those.

      Delete
    5. To comment on Felix's point:
      Is there a way to fix human inconsistencies by getting redundant human annotations and using the one that occurs most often, similar to the reCAPTCHA concept?

      Delete
    6. @Humphrey: Yes, these both are separate comments.

      One concerns the fact that the problem is posed as contour detection but aims to do what humans do (which might have biases). If the task was posed as incorporating these inductive biases and performing similar to what humans think, then what they do is correct.

      The top-down information comment was independent of the dataset. It was more about what they actually do in that section. I would have liked to see how they could incorporate semantic labels for boundaries (or at the boundary level) in their approach, which is currently just unsupervised.

      Delete
    7. Big-data transfer approach: look at the recent "scene parsing" paper by Tighe et al. in CVPR 2013.
      The idea of learning gPb for objects vs. stuff reminds me of Ferrari's "objectness" measure, which I find particularly useful.

      Delete
    8. @Felix: The authors did mention in their previous works (and to some extent in this work) that human labels are not consistent. There are a lot of ways these labels may vary, and I agree with what you said as well. My point was more along the lines of human labels having a semantic bias. If they are measuring against that, then in most cases they want gPb to come up with these boundaries/groupings as opposed to what the pixels might suggest (reflections, for example: should a reflection be a true boundary or not?)

      Delete
    9. There have been multiple studies on trying to "tame MTurkers". One useful measure is to have them annotate a "gold standard" set of images, and if these annotations are somewhat consistent, let those MTurkers go ahead. This helps prune away people who are going to give bad annotations.
      The averaging idea sounds nice.

      Delete
    10. Damn, Ishan stole my reply! :)
      @Priya: Yes, I agree that there are a lot of different (and probably easy) things you can do in the big-data regime. But I didn't like how they use the bounding box for selecting the object boundary as opposed to learning boundaries for semantic categories. Now it could be a boundary that preserves object contours, one that preserves a particular object, or one that occurs between two different objects..

      Delete
    11. Adding a bit to the semantic bias vs. no semantic bias part of this thread:
      If gPb as an unsupervised segmentation technique is going to be used as a semantic segmentation candidate generator (one use case), then we want it to contain all the ground-truth semantic segments (entire cars, etc.), and a semantic bias is probably good.

      If gPb is generating segments which will then be classified as, say, potholes on a road vs. good road patches, the bias is very new ... most Turk workers wouldn't have looked at it this way. Maybe we don't want a semantic bias here, in the sense of the road being one entire object.

      Delete
    12. The "inconsistently correct" vs. "consistently wrong" issue of human labeling may also be related to the bias-variance tradeoff.

      Delete
  11. I found it interesting that the BSDS500 results were generated using training results from their BSDS300 evaluation. According to the dataset description, the BSDS300 has 200 training images with segmentations from up to 30 people. In the best case, this is 6000 image examples, which seems a bit on the small side to learn 24 parameters from (2 x 4 channels x 3 scales). At the same time, it looks like all the images in the dataset are composed to be interesting to humans, which could explain why the method extended to BSDS500 well.

    ReplyDelete
    Replies
    1. They also mention the dataset bias: "We did not see any performance benefit on the BSDS by using additional scales... it is a statement about the nature of the BSDS."

      It's a good example showing that even with a larger dataset, if the images all come from a similar distribution, it may not actually help with your algorithm's tuning.

      Delete
  12. The k-means step typically done after the eigendecomposition in spectral clustering seems to break the image at smooth transitions. I think this might be something wrong with the way k-means was used (multiple restarts, normalization, etc.). I do not see any intuition for why this clustering problem is fundamentally different so as to require a very custom-designed post-processing step.

    ReplyDelete
    Replies
    1. I think the more inherent problem is the selection of K (the major problem with k-means anyway). It creates a breakup into K segments, but you don't know what the optimal K is. I think it's a neat idea that the eigenvectors themselves look like they encode boundaries, and that the authors tackle it as a contour detection problem (taking gradients of the eigenvector images).
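      If I remember the paper right, the spectral signal is something like sPb(x, y, \theta) = \sum_k (1/\sqrt{\lambda_k}) \nabla_\theta v_k(x, y), i.e., oriented gradients of the eigenvector images, with the smoother (small-eigenvalue) eigenvectors weighted more heavily - so no k-means and no choice of K is needed at all.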

      Delete
  13. The related work section mentions a large number of heavy machine learning/optimization algorithms - CRFs, Markov processes, AdaBoost, variational optimization. From this perspective the paper is not dense - logistic regression, fitting a parabola. It looks like they have successfully applied computer vision tools to make the feature space so clean that simple machine learning algorithms beat everything else.

    ReplyDelete
  14. I think human-in-the-loop segmentation is totally useless, unless you're using the segmentation algorithm to label ground truth for a dataset, or using the human feedback to learn how to segment better. It's ludicrously slow, I'm sure. What would be a good framework for using human feedback to substantively improve segmentation performance? Given that human segmentations differ between humans, the segmentation task is subjective. This suggests to me that there should be multiple outputs from a segmentation algorithm (even at each scale), and that perhaps a very simple form of useful feedback would be one in which the human ranks the segmentations in order of preference.

    ReplyDelete
    Replies
    1. I agree that there is a large bias towards what a human "thinks" is the right segmentation. One argument against this is: it's great that we may be able to find contours similar to humans, but so what? For computer vision applications this is probably not the best representation (although it may be for graphics, or for vision tasks where humans need to interpret the results). As you mentioned, humans segment things differently, which also makes it hard (I think) to justify human-based contours as a good benchmark. Personally, for vision applications, I think task-based benchmarks not tied to human vision might be more useful.

      Delete
  15. One thing that the paper didn't talk about (maybe it was mentioned in previous papers) is the contribution of each stage of their contour detector. A major downside of gPb (in my opinion) is its speed and computational efficiency. They essentially combine a lot of different features/stages at the cost of efficiency - it would be useful to know which ones are the most important, both from a computational standpoint and from a human vision standpoint.

    ReplyDelete
  16. This comment has been removed by the author.

    ReplyDelete
  17. # Really vague and subjective observations:
    ## Things I liked
    - Every stage of the pipeline in this paper gives the impression of having been carefully crafted; while reading almost every section it seemed to me that there couldn't possibly have been a better way to design the part of the pipeline that the section describes (this could also be because I haven't read a lot of the literature on segmentation, so I don't have much to compare with).
    - I found the spectral clustering part and the way it is used particularly fascinating. I don't remember seeing eigenvectors used this way before. The eigenvector images computed from the affinity matrix, especially for the examples on page 7 appear very 'clean' - which isn't something I've seen very often in computer vision.
    - Also, I liked that the OWT-UCM algorithm, the way it is formulated in this paper, can be used with different sources of contours.
    - What I thought would be interesting is to investigate other ways of constructing the hierarchical segmentation tree from the output of the Oriented Watershed Transform rather than the way it is done in Section 4.2. The paper does discuss cutting the tree at different thresholds based on criteria like ODS and OIS, but the tree itself could be constructed using a different heuristic for deciding which pair of regions to merge at every iteration, instead of just the minimum pairwise distance. I realize that this is a somewhat vague observation, but it really appears to me that redefining the similarity metric for combining segments based on some application-specific criteria could give significantly different (probably more useful) results.
    ## Things I didn't like
    - I couldn't think of anything that was even close to obviously wrong - the only part of the paper that made me flinch a little was the reweighting of the oriented watershed transform output. While the rest of the paper uses very elegant techniques, this section alone felt too 'involved'.

    ReplyDelete
  18. In general, this paper is written in a very clean way; you can see clearly what has been done in every step. The results stand as the state of the art in the sense of the metrics described in this paper. However, I'm skeptical about these metrics. Perhaps I'm missing something, but to me the performance of unsupervised segmentation should be measured by its usefulness to higher-level understanding of the image (semantics).
    There is indeed a lot of clever engineering in this paper, e.g., half-disc histogramming, eigen-images, OWT, etc. I'm wondering: is this really the way we should go, making things more and more complicated until we finally solve the problem? Or perhaps nature never had that many mechanisms, and some more principled ways are still waiting to be discovered?

    ReplyDelete
    Replies
    1. I also had the impression that every step in the paper is well thought through, with a good understanding of what choosing a different step would bring. It was also interesting for me to see that they had a considerable performance boost both in contour extraction and in segmentation.
      As for the complicated engineering on contours, like the half-disc histograms and the eigenvectors, I have the impression that maybe there is a way that extracts pixel neighbourhood information in the best way. I want to mention SIFT here - after so many years, it is probably still the best descriptor; maybe it uses the pixel info in the best way...

      Delete
  19. This paper contains a lot of hand-crafted engineering work. Although some of the techniques are clever, redesigning a similar system is complicated and it is too hard to tune a system like this. It would be interesting to use machine learning algorithms to replace some of the hand-crafted work. It would be helpful for future work if we could use machine learning algorithms to realize some of the useful hand-crafted components, because we need to understand the principles behind the engineering techniques. Only by understanding the principles can we confidently use these methods and redesign more powerful systems.

    ReplyDelete
    Replies
    1. There is quite a bit of engineering work, but we see this even when using machine learning algorithms. A few parameters (like the scales \sigma, or the radius r for the affinity matrix) are hand-tuned, but then parameters like the linear combination weights for mPb are learned using gradient ascent. Similarly, the gPb weights are trained using gradient ascent. Even machine learning algorithms end up having tuned parameters (bandwidths for kernels, etc.). I agree they could have tried something more generic (e.g. cross-validation to pick the scales).

      Delete
  20. I want to add one thing on the "spectral clustering" part of this work. Compared to the ECCV 1998 work from the same group (https://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/papers/lm_eccv98.pdf), the idea of solving the generalized Rayleigh quotient v'(D-W)v/v'Dv is exactly the same, but the way the affinity matrix W(i,j) is generated sees one big change/improvement. In the ECCV '98 work, W(i,j) is only a linear combination of local cues (edge, intensity, hue and saturation), but in this work it also incorporates the "half-disc histogramming" for gradient information. I admit this is a smart way of getting gradient information, and it is probably one of the reasons for the better performance, but sometimes it gives the feeling that the half-disc method is overly "well-designed" and "well-crafted".
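    For reference, my understanding is that the affinity here is the intervening-contour one: W(i,j) = exp(-max mPb(p) / \rho), where the max is taken over pixels p on the line segment between i and j (for i, j within a small radius of each other) - so the half-disc machinery enters the spectral stage only through mPb.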

    ReplyDelete
  21. Since edges are so successful in detection, I wonder why bottom-up methods for boundary detection like gPb cannot be used for detection as well. There are faster versions nowadays (like Sketch Tokens, http://research.microsoft.com/en-us/um/people/larryz/CVPR13SketchTokens.pdf) that can produce even better results, while for detection, edges are still computed as locally normalized gradient histograms in HOG. If we can propose more powerful ways (in the sense that the edge/boundary detectors are learned from data as well), can we engineer better detectors?

    ReplyDelete
  22. One of the paper's strengths is looking at the problem of segmentation holistically, and it is a wonderful choice of paper to read on this topic. Some other good contributions:
    1. I like the elegant way in which contour detection was globalized using the generalized eigenvectors, and how contour detection was fully framed in a probabilistic way using local and spectral signal information.
    2. The Oriented Watershed Transform is a very intuitive extension of the watershed transform. The example given wonderfully illustrates the problem of incorrect weighting and its solution.
    3. The optimization of the oriented gradient calculation using rectangular approximations and integral images shows how approximations make life easier in computer vision, and even though this is in the appendix it deserves more attention!

    Critical points:
    1. Segmentation ultimately is semantic, and this "signal processing" job is just the start of a greater problem. Hand-tuning of segmentation levels to select the best-looking one is a bit like "playing to the gallery". Ultimately, without semantic information the correct scale for OWT-UCM is unknown, and this segmentation is a starting point for automatic semantic segmentation of the image.

    ReplyDelete
  23. As a journal manuscript (at 20 pages), the paper had enough space to describe many things in good detail. In addition, the paper had good images that contributed to the understanding. With that said, there was still quite a lot of algorithmic material and it was at times difficult to fully comprehend. One part that I felt was lacking in explanation was the use of the generalized eigenvectors (i.e. (D-W)v = \lambda Dv). What is the intuition behind the eigenvectors of this system? D-W seems to encode some normalization of W (how much does W_{ij} contribute?). The contributions with the oriented watershed transform + ultrametric contour map seem to add value to their contour detector by allowing them to easily get a multiscale segmentation.
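    (My best guess at the intuition: D-W is the graph Laplacian, so (D-W)v = \lambda Dv is exactly the relaxed normalized-cuts problem - the generalized eigenvectors minimize v'(D-W)v / v'Dv, i.e., the affinity a soft partition cuts relative to its total degree, which is presumably why the gradients of the eigenvector images trace out region boundaries.)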

    I'm probably repeating things mentioned above, but this was my two cents.

    ReplyDelete