P. Arbelaez, M. Maire, C. Fowlkes and J. Malik, Contour Detection and Hierarchical Image Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence. May 2011.
And optionally:
X. Ren and J. Malik, Learning a Classification Model for Segmentation, ICCV, 2003
Summary:
In this paper, the authors provide a novel way of looking at contour detection and using it for image segmentation. The proposed method, called global Probability of boundary (gPb), uses brightness, color and texture cues and combines them to obtain the probability that a pixel is a boundary pixel at a particular orientation. The original Pb paper by Martin et al. in PAMI 2004 and the subsequent improvements made by Maire and Arbelaez et al. in CVPR 2008 and 2009 were the motivation behind this paper. The multi-scale extension to Pb by Ren in ECCV 2008 was a significant contribution to the field. The authors of this paper introduce their own multi-scale version of Pb called “mPb”. In addition to the multi-scale version, the authors propose a method to introduce some global information into their framework using spectral clustering techniques, which gives a spectral component, sPb. Once mPb and sPb are computed, gPb is just a linear combination of these.
Using the gPb information, the authors then propose the Oriented Watershed Transform – Ultrametric Contour Map (OWT-UCM) method for image segmentation. They compare the results of this segmentation on the Berkeley Segmentation Dataset with standard segmentation algorithms like Mean Shift, NCuts, etc. The reported results show gPb-owt-ucm outperforming the previous methods it is compared against, both for boundary detection and for segmentation.
Strengths:
1. The intuitive reasoning behind the proposed method for contour detection is very simple and effective.
2. The mPb is a fairly straightforward technique for introducing the multi-scale factor into the algorithm.
3. sPb is a really cool idea. The authors have used the fact that eigenvectors carry contour information and they use this information to obtain the spectral component of the boundary detector.
4. I like the idea of using a modified version of the watershed transform and the way they insist on the consistency of the boundary strength to remove artifacts. Using the OWT weights as priors for building the region tree using UCM is interesting.
5. The Interactive Segmentation part, where the method combines human inputs and their algorithm, is a nice way of getting better results.
6. The authors have tested their method on other datasets and pointed out why their method performs as mentioned.
Weakness:
1. The paper uses gradient ascent to learn the parameters of mPb and gPb. Since gradient ascent only finds a local maximum, what would happen if some other method, like boosting, were used to learn the weights? Was this explored?
2. Though the results of the hierarchical image segmentation look reasonable, they are not the best that we can get. CPMC adds Gestalt features to the gPb method and clearly it performs better.
gPb is definitely a huge breakthrough for the problem of contour detection and segmentation. A lot of papers now use gPb, and it has significantly changed the way segmentation problems are approached. I wonder if there can be a better baseline than gPb for this set of problems!
I'm the other critique-er and am posting here just so it's on top.
This paper explores the idea of using contours to inform image segmentation. They begin by extending the work of Martin et al., who defined a Probability of Boundary function, Pb. The authors create:
(1) a multiscale version of the Pb detector (mPb), which computes the Pb function over three scales, and then combines them linearly,
(2) a spectral boundary detector (sPb), which performs spectral clustering on an affinity matrix that is computed using mPb,
(3) a globalized probability of boundary (gPb), which is a linear combination of mPb and sPb.
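For reference, the three detectors can be written out roughly as below. This is a sketch from my reading of the paper, so treat the notation as an assumption: G_{i,\sigma(i,s)} is the oriented gradient of cue channel i (brightness, color, texture) at scale s, v_k and \lambda_k are the eigenvectors and eigenvalues from the spectral step, and \alpha, \beta, \gamma are the learned weights.

```latex
\begin{align}
mPb(x,y,\theta) &= \sum_{s}\sum_{i} \alpha_{i,s}\, G_{i,\sigma(i,s)}(x,y,\theta) \\
sPb(x,y,\theta) &= \sum_{k=1}^{n} \frac{1}{\sqrt{\lambda_k}}\, \nabla_\theta \mathbf{v}_k(x,y) \\
gPb(x,y,\theta) &= \sum_{s}\sum_{i} \beta_{i,s}\, G_{i,\sigma(i,s)}(x,y,\theta) + \gamma \cdot sPb(x,y,\theta)
\end{align}
```

Written this way, gPb is linear in the same local cues as mPb plus the spectral term, with the local weights apparently re-learned (\beta rather than \alpha), so "a linear combination of mPb and sPb" holds in spirit.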
The authors then use the boundary strength metric to create a segmentation using the watershed technique. Segmentation boundaries (watershed arcs) are where separate basins meet. The strength of these arcs is determined by the boundary-ness of the pixels along the arc. Standard watershed algorithms consider only the magnitude of boundary-ness, which causes artifacts (horizontal watershed arcs are weighted highly only because they are in proximity of pixels that have strong vertical edges). The authors therefore consider orientation to remove these artifacts.
The authors then use a sort of segmentation tree, called an Ultrametric Contour Map, to segment the images. This map merges regions by considering the average strength of their common boundary. Leaves are the separate regions defined by the watershed method. The root is the whole image.
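The merging step can be sketched as a greedy loop over region pairs. This is only my toy reading of "merge by average strength of the common boundary", not the authors' code: the function name `build_region_tree`, the integer region ids, and the per-pixel strength lists are all made up for illustration.

```python
def build_region_tree(boundaries):
    """Greedy merging by average boundary strength (toy sketch).

    boundaries: dict mapping frozenset({a, b}) of integer region ids to the
    list of boundary-pixel strengths along the arc shared by a and b.
    Returns a list of merges (region_a, region_b, new_region, height).
    """
    # Keep (sum, count) per shared boundary so averages can be pooled later.
    edges = {pair: (sum(v), len(v)) for pair, v in boundaries.items()}
    next_id = max(r for pair in edges for r in pair) + 1
    merges = []
    while edges:
        # Pick the pair of regions separated by the weakest average boundary.
        pair = min(edges, key=lambda p: edges[p][0] / edges[p][1])
        total, count = edges.pop(pair)
        a, b = sorted(pair)
        merged, next_id = next_id, next_id + 1
        merges.append((a, b, merged, total / count))
        # Re-attach the remaining boundaries of a and b to the new region,
        # pooling the pixels when a neighbour touched both a and b.
        rerouted = {}
        for other in list(edges):
            if other & pair:
                s, c = edges.pop(other)
                (neighbour,) = other - pair
                key = frozenset({merged, neighbour})
                s0, c0 = rerouted.get(key, (0.0, 0))
                rerouted[key] = (s0 + s, c0 + c)
        edges.update(rerouted)
    return merges


# Four watershed regions (leaves) with hypothetical boundary strengths.
example = {
    frozenset({0, 1}): [0.1, 0.1],
    frozenset({1, 2}): [0.8, 0.6],
    frozenset({2, 3}): [0.2],
    frozenset({0, 2}): [0.9],
}
for a, b, new, height in build_region_tree(example):
    print(f"merge {a} + {b} -> {new} at height {height:.3f}")
```

The leaves are the input regions and the last merge is the root (the whole image). Thresholding the merge heights at different levels gives the nested segmentations.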
The authors briefly explore the idea of using their tree in human assisted segmentation.
Results of the authors' algorithm are compared across a few datasets, and, as (almost) always, the authors outperform existing algorithms.
Strengths:
- multiscale extension of Pb algorithm
- computing gradients on eigenvectors from spectral clustering
- oriented watershed transform extension of traditional watershed method
These contributions by the authors are all simple and effective. Reading the paper, they seem like such obvious solutions to the problems that the contribution seems almost trivial, as if anyone could think up these solutions. However, if at the time, others had not considered such solutions, then I would say that these are very strong contributions of this paper.
I also like the idea of creating nested segmentations using the Ultrametric Contour Map. It seems like a great way to order multiple segmentations.
Weaknesses:
- Is segmentation from contours really the best method? It seems like finding contours is difficult in itself. Should we try to solve a difficult problem using the solution of another difficult problem? Are more direct approaches to segmentation better? The paper seemed to jump straight into segmenting using contours, but I didn't find a convincing argument as to why this was the way to go.
- The authors had problems with the weights on their watershed arcs. The cause of this, I think, is that their boundaries were not well localized. I think this is due to the use of histograms - being slightly off of the boundary will still probably produce large differences in the histograms. I'm no expert in segmentation, but I was wondering if this histogram method is really better than using gradients. It seems that gradients would be better localized on the boundary, but I could be totally wrong here.
- The authors mention that it would be nice to represent uncertainty about a segmentation. They then proceed to describe their UCM algorithm. I'm not clear on exactly how this represents uncertainty about a segmentation. Lower in the tree seems to represent sections that should be merged with high confidence. The leaves of the tree are clearly oversegmented, and the root is undersegmented. However, this doesn't provide any estimate for how good a particular segmentation is for a given image.
In reply to Divya's comment:
Weakness 1: Using gradient methods for optimizing weights.
They have a detailed analysis on using different methods for combining these local cues in Martin et al. [2] (http://www.cs.berkeley.edu/~malik/papers/MFM-boundaries.pdf). I think they do a very good analysis of various ways of combining cues, but for the journal version (which we are supposed to read), they drop all that and just mention "In contrast to [2] and [28], which use a logistic regression classifier to combine cues, we learn the weights \alpha_{i,s} by gradient ascent on F-measure..."
This was a let-down... especially because they didn't justify this change.
On a similar note, they removed all the details from the original papers in this journal version, which I describe in my comment.
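To make "gradient ascent on F-measure" a bit more concrete, here is a toy sketch. Big assumptions: I use a soft F-measure (scores clipped to [0, 1] instead of a hard threshold) so that there is a gradient to follow, a finite-difference gradient, and made-up cue data. The paper does not spell out its procedure at this level, so none of this should be read as the authors' method.

```python
def soft_f_measure(weights, cues, labels):
    """Soft F-measure of a linear cue combination.

    cues: list of per-pixel cue vectors; labels: 1 for boundary, 0 otherwise.
    """
    scores = [min(1.0, max(0.0, sum(w * c for w, c in zip(weights, x))))
              for x in cues]
    tp = sum(s * y for s, y in zip(scores, labels))
    # 2PR / (P + R) simplifies to 2*TP / (sum(scores) + sum(labels)).
    denom = sum(scores) + sum(labels)
    return 2.0 * tp / denom if denom > 0 else 0.0


def ascend(cues, labels, weights, step=0.05, iters=200, eps=1e-4):
    """Finite-difference gradient ascent; returns the best weights seen."""
    best_w = list(weights)
    best_f = soft_f_measure(best_w, cues, labels)
    w = list(best_w)
    for _ in range(iters):
        base = soft_f_measure(w, cues, labels)
        grad = []
        for i in range(len(w)):
            bumped = list(w)
            bumped[i] += eps
            grad.append((soft_f_measure(bumped, cues, labels) - base) / eps)
        w = [wi + step * g for wi, g in zip(w, grad)]
        f = soft_f_measure(w, cues, labels)
        if f > best_f:
            best_w, best_f = list(w), f
    return best_w, best_f


# Hypothetical data: the first cue is informative, the second is mostly noise.
CUES = [(0.9, 0.2), (0.8, 0.7), (0.1, 0.9), (0.2, 0.1), (0.7, 0.3), (0.1, 0.6)]
LABELS = [1, 1, 0, 0, 1, 0]

start = [0.5, 0.5]
print("F before:", soft_f_measure(start, CUES, LABELS))
print("best weights, F after:", ascend(CUES, LABELS, start))
```

The result depends on the starting weights and step size, which is exactly the local-maximum concern raised in Weakness 1.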
@Ada, in the standard WT the mean value of E(x,y) along an arc is computed independently of the orientation of the arc, so pixels on a "weak" arc that lie close to other "strong" arcs can still get a strong E(x,y) response, because E(x,y) is the maximum over all orientations. This is non-intuitive, so the authors take the orientation of the arc into account and evaluate E only at the orientation of the arc being weighted rather than over all orientations. I think even gradients will suffer from the fuzzy boundary problem that gPb suffers from.
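A tiny example of the difference, assuming a single orientation per arc and a made-up response table `E[(x, y)][orientation]` (in the paper, as I understand it, arcs are approximated by line segments and each pixel takes the orientation of its segment):

```python
def arc_weight_max(E, arc_pixels):
    """Standard weighting: mean over the arc of max over orientations of E."""
    return sum(max(E[p].values()) for p in arc_pixels) / len(arc_pixels)


def arc_weight_oriented(E, arc_pixels, arc_orientation):
    """Oriented weighting: mean over the arc of E at the arc's own orientation."""
    return sum(E[p][arc_orientation] for p in arc_pixels) / len(arc_pixels)


# A horizontal arc whose pixels sit next to a strong vertical edge
# (hypothetical numbers).
E = {
    (0, 0): {"horizontal": 0.1, "vertical": 0.9},
    (1, 0): {"horizontal": 0.2, "vertical": 0.8},
}
arc = [(0, 0), (1, 0)]

print("max over orientations:", arc_weight_max(E, arc))
print("at arc orientation:   ", arc_weight_oriented(E, arc, "horizontal"))
```

The first weight comes out high (around 0.85) only because of the neighbouring vertical edge, while the oriented weight stays low (around 0.15), which is the artifact the OWT is meant to remove.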
I've long been a follower and paper reader of the Berkeley Segmentation Dataset. The Berkeley group has really produced a lot of canonical papers regarding traditional unsupervised segmentation algorithms, and has been pushing their performance to the limit.
Yet, unsupervised segmentations after all are unsupervised. They have a lot of limits. They helped a lot in our understanding of "perceptual grouping", but they are definitely not going to give a final solution to the segmentation problems. This is my deepest feeling after years of following their work and researching the same topic. Just like spatial pyramid matching, they are certain milestones in the vision community.
There are several different things that people mean when they say "segmentation problems." If we want to do "person" vs "background" then top-down information is crucial. But even the lower-level problem of splitting an image into consistent regions is tough and may require more than a bottom-up approach.
I agree. The topics are kind of more related to the problem of "Perceptual Grouping". Gestalt laws are good investigations in perceptual grouping. Yet there are many more things that need to be researched to define what really is "perceptually similar".
I agree with both of you. However, the way these datasets are constructed, the users have their biases in labeling semantic regions for segments. The only way I see around this bias is giving a user only small patches (or something) and asking them to mark the boundaries. This way we still keep "perceptual grouping" at low/mid-level but avoid semantic biases. This problem, though mentioned in previous papers, hasn't been explored.
I particularly agree with the idea of incorporating higher level semantic information into the segmentation task. One thing that really needs to be clarified is: what do we need from image segmentation?
To me, the answer is that this could serve as an intermediate step to help us do higher level vision tasks, which is to really understand the image just like a human. However, clearly this is a chicken-and-egg problem, since being able to categorize and retrieve "objects" or "stuff" from the image could help us segment this image in a semantically meaningful way. I know this is out of the scope of this paper. But, since we know what we want from segmentation and also know what we need to do "useful" segmentation, why don't we directly model them in a joint manner? I particularly like the concept of the following paper, which I think Ishan has also mentioned:
http://www.cs.unc.edu/~jtighe/Papers/CVPR13/jtighe-cvpr13.pdf
I agree with Fanyi. To me, unsupervised segmentation is just an intermediate step of semantic learning. Those segments only have meanings after they are assigned semantic labels. It will be interesting if we can combine segmentation with semantic learning.
Regarding the paper specifically, I want to bring up an additional discussion point: Contour Finding based Methods vs. Clustering based Methods.
I'm not talking about active contour models here, like Snakes, level sets and many other PDE based methods. They seem to be out of the scope of the discussion here and are used more in medical image processing.
There are several other "Contour Finding based Methods". "Efficient graph-based image segmentation" by Pedro Felzenszwalb can be counted as another one. And so is "Ultrametric Contour Map". If you look at all these algorithms, the biggest commonality is that they try to model the inter-region (or inter-cluster) dissimilarity and use that as a criterion for deciding whether to split or merge any two regions. But for the tractability of the problem they typically start from merging small regions, which is bottom-up.
The good thing about formulating problems in this way is that they are often tolerant to intra-region variations which are locally smooth but globally considerably large. Another good thing is that they often generate a hierarchy of regions from coarse to fine, which is quite useful. But the most annoying thing about these algorithms is "Region Leak" caused by "Weak boundaries". This is because they typically start from small regions and the weak boundary portions are merged already in early stages. Weak boundaries are ubiquitous. I once tried the gpb-owt-ucm code. It is state of the art. It is definitely nice on many images. But it generates annoying over-merged results on many images too. And I believe for subsequent operations such as detection/recognition, over-mergings (region leaks) are more disastrous than over-segmentation.
I agree with your opinion that "contour map"-like methods can work pretty well on strong boundaries, but the early merge of weak boundaries is indeed an issue. I once had experience segmenting the iris region from ocular images using Chan-Vese's active contour method. Here is a link to a related paper segmenting iris images using active contours: http://www.cse.msu.edu/~rossarun/pubs/ShahRossGACIris_TIFS2009.pdf. The outer boundary of the iris would appear to be a weak boundary under NIR illumination. Using an energy/PDE-based method like active contours seems to work pretty well, especially when there are occlusions and reflections. It seems that the energy-based methods can alleviate the early merge of weak boundaries while growing the snake.
I am always wondering, while contour detection is so popular in computer vision, why it cannot be applied back to machine learning as well. The contour can be regarded as a boundary between two clusters, and in general, can we define a boundary classifier for general machine learning tasks?
Another thought is about why contours would work for vision tasks. From the course I learned that humans tend to look at edges inside an image to recognize objects, and the most successful feature for detection (HoG) is all about edges. There are ways to incorporate color, yet they are just quite incremental. It is quite intuitive to check the continuity of an edge and the contrast across it. This might be one reason. A downside of the region based clustering methods might be that it is very hard to identify a whole segment, and that the contours separating the regions have different importance?
In machine learning, I think a similar topic for unsupervised learning is: Discriminative Unsupervised Learning. Today most unsupervised learning methods are generative. There are few discriminative ones. But for supervised methods, both families are well established.
You may be interested in the following works on Maximum Margin Clustering. See:
1. L. Xu, et al., Maximum margin clustering. NIPS, 2005.
2. Kai Zhang, Ivor W. Tsang, James Kwok, Maximum Margin Clustering Made Practical, ICML 2007.
On the other hand, clustering-based segmentation aims at finding regions through elaborate design of intra-region similarity metrics. If you look into the family of clustering based methods there are also many canonical algorithms, such as "The Shift Family" (mean shift, quick shift, really quick shift, dynamic mean shift, medoid shift, median shift, convex shift...), "The Cuts Family" (Min-Cut, Average Cut, Normalized Cuts, Ratio Cut, Graph Cut), spectral clustering. Even energy minimization methods and graph cut belong to this category because their color models and likelihood terms are essentially modelling intra-cluster similarity.
Compared with contour-finding, segmentation by clustering tends to suffer less from over-merging, but possibly at the cost of generating over-segmentation due to the lack of flexibility of the cluster model.
Cuts in some sense look like boundary estimation but they are not in a bottom-up form. They are actually top-down and often result in NP-hard / NP-complete problems. Normalized Cuts bypassed this by relaxing the problem into the continuous domain and finding the eigenvectors of a generalized Rayleigh quotient. I seldom use Ncuts cuz they are slow and generate weird over-segmentations even in very smooth regions. I think this is due to the relaxation, which generates smooth eigenvectors that need to be discretized at the end.
Mean shift is old but believe it or not people are still using it. Simply because it is robust and doesn't have the over-segmentation problem. In fact it is kind of not fair to compare gpb-owt-ucm to mean shift, cuz mean shift uses very naive features, while gpb-owt-ucm did a lot of feature engineering. Features and similarity metrics matter a lot in segmentation, and directly influence the segmentation quality. In terms of just the segmentation, gpb-owt-ucm acts just like ucm and many other contour based methods. It has its weaknesses. If one uses better features and improves the spatial smoothness constraint, the shift algorithms can also generate comparable results.
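For anyone who hasn't seen it, the relaxation I mean is roughly the following (standard Shi-Malik notation as I remember it: W is the pixel affinity matrix, D the diagonal degree matrix with D_ii = \sum_j W_ij, and y the relaxed, real-valued partition indicator):

```latex
\begin{align}
\min_{\mathbf{y}} \; \frac{\mathbf{y}^\top (D - W)\, \mathbf{y}}{\mathbf{y}^\top D\, \mathbf{y}}
\quad \text{s.t.} \quad \mathbf{y}^\top D\, \mathbf{1} = 0
\qquad \Longrightarrow \qquad (D - W)\, \mathbf{y} = \lambda\, D\, \mathbf{y}
\end{align}
```

The real-valued solution is the generalized eigenvector with the second smallest eigenvalue. It varies smoothly over the image, so discretizing it can cut through uniform regions, which I suspect is where the weird over-segmentations come from.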
Correction to the above: "doesn't have the over-segmentation problem" should read "doesn't have the over-merging problem".
Finally, I kind of like the CPMC idea most, in the sense that it represents one of the possible and reasonable prospective directions of combining segmentation with top-down understanding.
Humans may be doing it in a similar way: perform low level perceptual grouping and generate multiple hypotheses of segmentations first, based on perceptual similarity. Then the human rejects many of the unreasonable ones with higher level information.
I also like graph cut and other interactive segmentation methods very much. In fact they are among the few segmentation methods that have real commercial applications. They are fast, and they produce really accurate segmentations that can already be used for real editing/selection purposes. You can look into "Quick Select" in Photoshop. It performs graph cut over superpixelized images. This also shows how important high level information is in segmentation.
A fundamental problem I have with the UCM methodology is that fine-scale segmentations are always refinements of coarse-scale segmentations. But in an image like this one http://www.theinquirer.es/savedfiles/070613_crowd_above_02.jpg, a fine-scale segmentation should outline each person and a coarse-scale segmentation should outline the crowd as one unit.
Can you justify why the crowd should be considered as one unit at coarser scales without using semantic, cultural knowledge?
-- Matt Klingensmith
Agree with Matt's comment. Without semantics and cultural knowledge, it seems pretty difficult for the algorithm to do this. However, I think it might count towards "perceptual" grouping (using the gestaltist laws, but I don't know...)
The crowd can be thought of as a texture pattern and not really a group of people. This can be a reason to group them into one region in level 2 of the tree. Cultural info is probably used but not certainly so.
Wow, this got deleted last time since I don't have any of the identities the blog wants (I'm really opposed to the blog, btw. Can we go back to handing things in please?)
Anyway, I'll try to write it this time someplace where I can back it up:
====
In this enormously complicated paper, the authors attempt to solve both contour detection and image segmentation at once by pipelining the output of a multi-scale contour detector into a "hierarchical region tree," which is then used to store spatial information about 2D components of the image. In the first stage of the pipeline, contours are detected first by computing multi-scale local gradient features using an algorithm called "gPb," which uses histograms on oriented, bisected disks. They are then combined using spectral clustering in a process they call "globalization." These "global" contours are then fed into an algorithm called the "Oriented Watershed Transform," which is an iterative procedure that finds local minima of expected distance to boundaries to quantify the centers of regions. This produces what they call an "Ultrametric Contour Map," which is a hierarchical tree of such regions.
Each stage of the pipeline represents a synthesis of existing state-of-the-art methods.
Human guidance can be inserted into their pipeline during the contour phases, the Oriented Watershed Transform, or during the Ultrametric Contour Map to inform human-guided segmentation and contouring algorithms. They compare their segmentation method to several others using ground truth data obtained by asking humans to segment images. Their method performs favorably under these circumstances.
I unfortunately can't comment directly on the technical details of each phase of the pipeline, since I'm utterly unfamiliar with the field of image segmentation (and especially its state of the art forefront) but I will make a few vague observations:
1. Their use of human-in-the-loop algorithms at each stage of their pipeline opens up exciting new applications. Imagine in a Photoshop-like program, wanting to segment out an object from a cluttered background. Currently, this is a time-consuming task requiring the human to carefully trace (with some assistance), the boundary of the object. What if, instead, the program presented the user with a number of regions which they connect merely by clicking and dragging? Then, the image could be automatically segmented much faster.
2. I'm not convinced that the "ground truth" data is meaningful. Humans clearly segment images in totally different ways. This suggests that, depending on the task at hand, there are probably multiple "right" answers. Just looking at the human-segmented data, we can see that most of the time, people seem to want to segment the scene into the projections of semantic 3D objects onto the image plane (we'll call that the 3D-first approach), whereas some of the humans seem to segment the image into regions of 2D similarity (which I will call the 2D-first approach). Clearly, any method which relies on 2D contours is not going to capture some of the 3D segmentation information that humans "want" from an image. In the human-guided examples, you can see that the humans select arbitrary regions to belong together, some of which are occluded. Even some "unimportant" 3D objects are segmented into homogeneous "background" elements, suggesting that some semantic knowledge is also at play there.
So I think we have to ask ourselves, what is a good segmentation? What do we want out of a segmented image? Do we want only to consider image properties, or are there 3D and semantic properties which need to be considered as well?
-- Matt Klingensmith.
The CPMC paper talks about "what makes a good segment". The method basically generates multiple segmentations for an image and assigns a rank to each of them. The performance is then evaluated using the Segment Covering benchmark. Though I'm slightly skeptical about how well their segment ranking scheme works, the discussion of good segmentation is definitely worth exploring.
Matt, you're still welcome to submit a printed summary if you don't want to make a second blog post (and note that you don't have to summarize the paper in either one of your posts). However, I think that one technical glitch is not a very good reason to oppose the entire blog on principle, especially since most people I know already make a habit of composing long web form submissions in external editors. While I took this class, I remember finding others' comments very helpful in my own understanding, plus reading comments does help presenters prepare. If you still feel this way, though, we'd be happy to discuss further.
I don't think that humans select 'arbitrary' regions as belonging together - it's just that faced with the task of labeling a group of objects adjacent to each other, at some point the boundaries get too complicated for labeling and they start dumping things together in the same segment. This is quite obvious in the mushroom example on page 16 of the paper. Though the ferns and the rocks are quite different in appearance and 3D depths, the person doing the labeling has grouped everything but the mushroom into a single 'background' segment.
And again, the depth of detail to which each person delves while labeling an image will be different among different people. An old paper of Malik's group discusses how we can take into account the different levels of granularity in labeling while comparing two segmentations [1].
It may be true that humans, most of the time, attempt to perform '3D-first' (as Matt puts it) segmentation by inferring 3D and semantic meaning from the image before proceeding to segment it. While it might seem that this algorithm and a lot of others rely solely on 2D contours and therefore neglect 3D information, it can also be argued that in most cases segments of 3D objects are often a subset of the 2D segments. The goal of these algorithms, as I understand, is usually to group pixels together based on appearance and location so that the pixels in each segment very likely belong to the same object, though the actual 3D object might be divided across multiple segments sometimes. Then, such segmentations using 2D cues would be a first step towards obtaining segments that humans 'want'.
1. Martin, David, et al. "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics." Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on. Vol. 2. IEEE, 2001.
I agree with Matt's statement that we should consider what makes a segmentation a good segmentation. Maybe segmentations that computers find helpful are not those that humans would pick.
Along that line, maybe we should also consider the importance of segmentation itself.
It seems to me that segmentation is often thought of as an intermediate step to make computer vision easier. For example, segmenting out objects to then do object classification on the segments. Or maybe, in the medical world, segmenting out cells to do cell counting/tracking.
What is the end goal of computer vision? I'll take a stab at answering this question: to allow robots to interact intelligently with the world. (By "intelligently," I mean not running into objects/walls, capable of reasoning about their visual world enough to manipulate objects.) I realize this is still not well defined, but my point is that for this task, maybe segmentation is unnecessary for achieving the end goal.
Humans may use some sort of segmentation, but this may not be necessary for robots particularly since it is clear that humans don't agree on what defines "correct" segmentation. Furthermore, what does it even mean to segment something with soft edges like a wispy cloud, and do we even need to figure out how to do this? I personally would hate it if someone put a picture of a cloud in front of me and asked me to segment it out.
What do you all think? Why is segmentation a necessary problem to solve?
Wow. What a dense paper! One thing that really stands out to me, conceptually, is the idea of representing contours as a signal instead of a simple existence/non-existence feature. There has been published research suggesting that neurons in deeper levels in the visual cortex will respond to specific combinations of contours; there may be a connection between soft contour detection and object recognition:
http://jn.physiology.org/content/86/5/2505
I agree with Ada that segmentation by itself is an intermediate step towards solving the vision problem and I do think it is very much essential to solve the segmentation problem. I can't see how else a robot could be trained to understand what each "region" represents in an image of the real world. Also, I think the goal of segmentation is to define the shape of objects better rather than just using a bounding box around them, which should ideally be a better input for object recognition as Jacob mentions. By "objects", I mean things that have a defined shape and size as Arne mentioned in the previous class. Only when you need information about the entire image (if we want to say the given image is a meadow or something), the importance of things like grass, sky, clouds come into picture, which now becomes the problem of scene classification. And segmentation may not be too important for this.
I agree with Divya. I think segmentation is a good problem to solve especially if you think
1. there is a distinction between "objects" and "stuff".
2. "stuff" can help tasks like detection, navigation etc. by providing context.
Also, I think for a problem like segmentation there are multiple correct answers. I am more inclined towards methods that give multiple segmentations, rather than a single one (e.g. Derek's paper - Recovering Surface Layout from an Image).
I don't think that segmentation is a particularly good problem, unless some application specifically requires it, but that's just me :)
I prefer to think that segmentation is necessary to solve vision, so that we can reason about objects that have a well defined shape rather than a bounding box. If we could have perfect segmentation then we could reason on each region in isolation and aggregate those results. So I would say that solving segmentation would definitely help solve other vision problems. But we could potentially also solve object detection first, for example, and then use those results to get closed contours of the objects, giving us a valid segmentation. And for a robot, I would say that a segmentation consisting of the various objects the robot must be aware of, with everything else as a "background" region, would be enough to let it interact "intelligently".
As for what qualifies as a good segmentation, I would guess that a good segmentation would be the average segmentation provided by a large number of people. Most people will segment out similar features in the image and there may be differences in small details and the resolution of the segmentation, but I think that if we were to take an average over a large number of people, we could easily see which are the important contours (those occurring in a large majority of human segmentations) and which are more fine-grained details. Then a segmentation algorithm's job would be to mark all the important contours that occur and no contours which do not occur (i.e. not make any mistakes).
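As a rough sketch of what I mean by averaging, assuming each annotator's segmentation has already been turned into a binary boundary map of the same size (the function name, the toy maps and the 0.5 majority threshold are all made up for illustration):

```python
def consensus_boundaries(boundary_maps, majority=0.5):
    """Per-pixel agreement across annotators, plus a majority-vote mask.

    boundary_maps: list of equally sized 2D lists of 0/1 boundary marks,
    one map per annotator.
    """
    n = len(boundary_maps)
    rows, cols = len(boundary_maps[0]), len(boundary_maps[0][0])
    agreement = [[sum(m[r][c] for m in boundary_maps) / n for c in range(cols)]
                 for r in range(rows)]
    important = [[a > majority for a in row] for row in agreement]
    return agreement, important


# Three hypothetical annotators marking a 1x4 strip of pixels.
maps = [
    [[1, 1, 0, 0]],
    [[1, 0, 0, 0]],
    [[1, 1, 0, 1]],
]
agreement, important = consensus_boundaries(maps)
print(agreement)   # fraction of annotators marking each pixel
print(important)   # True where a majority agrees
```

This ignores the fact that two people rarely draw the same contour on exactly the same pixels, so a real version would need some localization tolerance when matching boundaries.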
@Priya: See this "Learning to Localize Detected Objects" (http://www.cs.uiuc.edu/~dhoiem/publications/cvpr2012_objectsegmentation_qieyun.pdf)
I agree with this post and Abhinav's comment... segmentation is not a particularly well-posed problem. Considering the fact that humans themselves often cannot agree on what a proper segmentation is, there is only so far one can push the "state-of-the-art" in this problem without needing to rely on more information, e.g. semantic and contextual information. Segmentation is useful as input to top-down and semantic approaches, but pursuing "the best segmentation" is difficult and I think not particularly interesting.
I think segmentation is not a well posed problem, but it is currently an essential stage for the whole pipeline of a vision system since it explicitly gives the notion of objects of interest. There are several approaches proposed for segmentation evaluation. Most of them focus on segmentation itself. I don't think the ground truth of segmentation is quite meaningful, since different human individuals also focus on different scales or fineness of segmentation. However, I hope to address the intermediate segmentation phase and the higher level of recognition and matching as one joint optimization problem rather than as separate stages, which might be more effective. The rightness of a segmentation algorithm, or of one hypothesis from a set of candidate segmentations, is supposed to be evaluated in the scenario of the whole pipeline. This seems to become a chicken-and-egg or coupled problem again.
I find this question of how relevant the segmentation task is to solving the vision problem interesting. The groundtruth data from the human is a result of a very deliberate exercise in perception, human intelligence, and even drawing. (I have seen ground truth labels from people with no artistic skills.) It is difficult to claim that this ground truth is completely separate from human reasoning about the world. This data is then used to benchmark an algorithm with comparatively little knowledge about the world. I am impressed that their algorithm does so well, but that increases my concern about why.
I think that metrics are another way in which results can appear better/worse than they are. For example, a single edge, added or deleted, can have a large implication semantically, but it's hard to think of a way to capture this in the metrics or groundtruth.
Things I liked about this paper
- Bottom-up dendrogram approach to segmentation.
- A nice spin on the N-cuts formulation, and a nice interpretation of the eigenvectors of the graph Laplacian matrix. I knew that they represented connected components, but this visualization was nice.
- interactive segmentation application
- The speedup trick using integral images, separable 1D filters in the appendix.
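For readers who have not seen the integral image trick mentioned in the last bullet, here is a minimal sketch of the general idea: after one pass to build a summed-area table, the sum over any axis-aligned rectangle costs four lookups. This is the generic technique, not the paper's exact histogram implementation, and the helper names `integral_image` and `box_sum` are made up for illustration.

```python
import numpy as np


def integral_image(img):
    """Summed-area table with a zero row and column padded at the top-left."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii


def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] in O(1) using four table lookups."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]


img = np.arange(12, dtype=np.float64).reshape(3, 4)
ii = integral_image(img)
total = box_sum(ii, 1, 1, 3, 3)  # same region as img[1:3, 1:3]
```

The cost of `box_sum` does not depend on the rectangle size, which is why the trick pays off when the same sums are needed at several scales.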
Things I would have liked to see
- how much does each cue help in the gradient computation step. Particularly, how much (and I expect a lot) boost does the filter bank provide?
- A nice way to handle scale invariance for contour detection. As they mention, an adaptive scale invariance depending on what region of the image you are looking at will help. I think this is crucial to address the segmentation problem. There are lots of "large-smooth-homogeneous" regions, where slight errors in segmentation are okay, e.g. you may not care exactly where the sky meets the ground. Your robot may be off by 10m, but it's still fine. But there are other "small-very-missable" regions that may be important. Your robot-monster-truck shouldn't miss a person/child.
- A more easy-to-understand writing style e.g. "is composed of 591 natural images" => "has 591 natural images".
I agree with most of the comments above in the things they like about the paper. Here are some more points from my perspective:
Pros:
- The paper is written with a nice flow. They discuss what they do, what the problem with that approach is, and how they fix the problem. Sometimes it appears that most things are tricks that deal with some limitation of the proposed method. But that's what most works are like, in general.
- I like all the different cues that they incorporate for contour detection.
- Their interpretation and use of eigen-vectors.
Cons:
- My first concern is that they exclude all the details for Contour Detection from this journal version. I understand that these details are already covered in another PAMI paper "Learning to Detect Natural Image Boundaries Using Local Brightness, Color, and Texture Cues", but I seriously missed the description and intuition from that paper.
- This work even omits details from their previous works in both contour detection and image segmentation. Frankly, I would have liked to see 2-3 more pages added to this already 20 page paper!
- “In contrast to [2] and [28] which use a logistic regression classifier to combine cues, we learn the weights \alpha_{i,s} by gradient ascent on the F-measure using the training images and corresponding ground-truth of the BSDS.” – No justification is provided why they did that. Especially given that they have very detailed analysis of why they used logistic regression in [2], this was a letdown.
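For reference, the F-measure in that quote is the harmonic mean of precision and recall. A minimal sketch, assuming predicted boundary pixels have already been matched to ground truth and counted as true positives, false positives and false negatives; the benchmark's actual matching procedure is not reproduced here, and the counts below are made-up numbers.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall from match counts."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: precision and recall are both 0.8 here.
score = f_measure(true_pos=80, false_pos=20, false_neg=20)
```

Optimizing this score directly targets the benchmark metric, whereas logistic regression optimizes a per-pixel likelihood, which may be part of the motivation even though the paper does not say so.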
MAJOR Concerns for me (Cons):
BSDS dataset and evaluation: This is a generic comment on the dataset rather than this paper. 1) The authors also point out the photographic bias in section 6, so I am not rewriting it here. 2) The dataset was constructed by asking the users to label boundaries. The users have their inherent biases in labeling regions or boundaries. Specifically, users have their semantic biases, which lead them to mark boundaries for objects/stuff rather than marking perceptual/low-level boundaries. For example, see the middle column of Figure 3: there is a clear boundary at the reflection of the boat in the water, but all users mark the entire water as one region. Maybe we want the computer to do the same thing, but then it should have access to semantic supervision. The only way I see of getting around this bias is by giving the user a small patch (or something) and asking them to mark the boundaries. This way we still keep "perceptual grouping" at low/mid-level but avoid semantic biases.
Top-down information for objects: Frankly, this section was the biggest letdown for me. I expected them to use ground-truth segmentation of objects from MSRC or PASCAL dataset, and use this top-down semantic supervision to guide the gPb. They could have learned gPb specific to different objects, one gPb classifier which corresponds to just objects etc. All they do in this work is use bounding box information to select the right segmentation threshold.
Another practical drawback: it is *very* slow to use.
To echo the sentiments of other readers, I believe the segmentation quality is related to the final task. If this task involves human semantics, then wouldn't a semantic bias be ideal?
Also, it seems to me that using top-down information from the labeled datasets would do exactly the opposite of your suggestion to keep to low/mid-level perceptual grouping.
"Inconsistently correct" vs. "Consistently wrong". I want to add on your comment about the human labeling of ground truth data. I had experience outsourcing some labeling tasks on m-turk. The task is simple: given a face image, label where the eyes are, where the nose is, etc. There of course are automated ways to do that (e.g. active shape model, active appearance model), but their accuracy drops significantly when non-ideal faces are present (low-res, off-angle, occlusion, illumination variations, etc). Humans, however, can (or are at least expected to) perform much better in these cases. What we found out from the m-turk labeling is that human labelings are very inconsistent. A single clicker can produce highly variant labels for a particular fiducial point. Automated algorithms, on the other hand, are always slightly off, but they are very consistent.
I think that in your example of learning a specific gPb for each object, you could probably do better by using a linear combination of a set of basis segmentations. This would give you a good coarse segmentation though, and would be lacking in finer details. Also, in a big data approach, you could find the k-nearest neighbors (in the training set) in the object class and take the average segmentation of those.
To comment on Felix's point:
Is there a way to fix human inconsistencies by getting redundant human annotations and using the one that occurs most often, similar to the reCAPTCHA concept?
@Humphrey: Yes, these are both separate comments.
One concerns the fact that the problem is posed as contour detection but aims to do what humans do (which might have biases). If the task was posed as incorporating these inductive biases and performing similar to what humans think, then what they do is correct.
The top-down information comment was independent of the dataset. It was more aligned to what they exactly do in that section. I would have liked to see how they incorporate semantic labels for boundaries (or at boundary level) in their approach, which is currently just unsupervised.
Big data transferring approach: Look at the recent "scene parsing" paper by Tighe et al. in CVPR, 2013
The learning gPb for objects vs stuff reminds me of Ferrari's "objectness" measure, which I particularly find useful.
@Felix: The authors did mention in their previous works (and to some extent in this work) that human labels are not consistent. There are a lot of ways these labels may vary. And I agree with what you said as well. My point was more along the lines of human labels having the semantic bias. If they are measuring against that, then in most cases they want gPb to come up with these boundaries/groupings as opposed to what the pixels might suggest (reflection for example: should reflection be a true boundary or not?)
There have been multiple studies for trying to "tame mturkers". One useful measure is to let them annotate a "gold standard" set of images, and if these annotations are somewhat consistent, let these mturkers go ahead. This helps prune away people who are going to give bad annotations.
The averaging idea sounds nice.
Damn, Ishan stole my reply! :)
@Priya: Yes, I agree that there are a lot of different (and probably easy) things you can do in the Big Data regime. But I didn't like how they use the bounding box for selecting the object boundary as opposed to learning boundaries for semantic categories. Now it could be a boundary that preserves object contours, that preserves a particular object, or that occurs between two different objects.
Adding some to the semantic bias vs no semantic bias part of this thread:
gPb as an unsupervised segmentation technique is going to be used as semantic segmentation candidate generator (one use case). In this setting we want it to contain all the ground truth semantic segments (entire cars etc.) and a semantic bias is probably good.
gPb is generating segments which will then be classified as potholes on a road vs good road patches. In such a situation the bias is very very new ... most turk workers wouldn't have looked at it this way. Maybe we don't want it to have semantic bias here in the sense of the road being an entire object.
The inconsistently correct and consistently wrong issue of human labeling may be also related to the bias-variance tradeoff.
I found it interesting that the BSDS500 results were generated using training results from their BSDS300 evaluation. According to the dataset description, the BSDS300 has 200 training images with segmentations from up to 30 people. In the best case, this is 6000 image examples, which seems a bit on the small side to learn 24 parameters from (2 x 4 channels x 3 scales). At the same time, it looks like all the images in the dataset are composed to be interesting to humans, which could explain why the method extended to BSDS500 well.
They also mention the dataset bias. "We did not see any performance benefit on the BSDS by using additional scales...it is a statement about the nature of the BSDS."
It's good to see an example that even with a larger dataset, if the images all come from a similar distribution, it may not actually help with your algorithm's tuning.
The k-means typically done after the eigendecomposition step of spectral clustering seems to break the image at smooth transitions. I think this might be something wrong with the way k-means was used (multiple restarts, normalization, etc.). I do not see any intuition why this clustering problem is fundamentally different so as to require a very custom designed post-processing step.
I think it has a more inherent problem in the selection of K (the major problem with k-means anyway). It breaks the image up into K segments, but you don't know what the optimal K is. I think it's a neat idea that the eigenvectors themselves look like they encode boundaries and to tackle it as a contour detection problem (taking gradients of the eigenvector images).
The related work section mentions a large number of heavy machine learning/optimization algorithms: CRFs, Markov processes, AdaBoost, variational optimization. From this perspective the paper itself is not dense: logistic regression, fitting a parabola. It looks like they have successfully applied computer vision tools to make the feature space so clean that simple machine learning algorithms are beating everything else.
I think human-in-the-loop segmentation is totally useless, unless you're using the segmentation algorithm to label ground truth for a dataset, or using the human feedback to learn how to segment better. It's ludicrously slow, I'm sure. What would be a good framework for using human feedback to substantively improve segmentation performance? Given the fact that human segmentation differs between humans, the segmentation task is subjective. This suggests to me that there should be multiple outputs from a segmentation algorithm (even at each scale), and that perhaps a very simple form of useful feedback would be one in which the human ranks the segmentations in order of preference.
I agree that there is a large bias towards what a human "thinks" is the right segmentation. One argument against this is that it's great that we may be able to find contours similar to humans, but so what? For computer vision applications this is probably not the best representation (although it may be for graphics or vision tasks where humans need to interpret the results). As you mentioned, humans segment things differently, which also makes it hard (I think) to justify human-based contours as a good benchmark. Personally, for vision applications, I think task-based benchmarks not tied to human vision might be more useful.
One thing that the paper didn't talk about (maybe it was mentioned in previous papers) is the contribution of each stage of their contour detector. A major downside of gPb (in my opinion) is its speed and computational efficiency. They essentially combine a lot of different features/stages at the cost of efficiency, so it would be useful to know which ones are the most important, both from a computation standpoint and from a human vision standpoint.
# Really vague and subjective observations:
## Things I liked
- Every stage of the pipeline in this paper gives the impression of having been carefully crafted; it seemed to me while reading almost every section that there possibly couldn't have been a better way to design the part of the pipeline that the section describes (this could also be because I haven't read a lot of literature on segmentation and so I don't have much to compare with).
- I found the spectral clustering part and the way it is used particularly fascinating. I don't remember seeing eigenvectors used this way before. The eigenvector images computed from the affinity matrix, especially for the examples on page 7 appear very 'clean' - which isn't something I've seen very often in computer vision.
- Also, I liked that the OWT-UCM algorithm, the way it is formulated in this paper, can be used with different sources of contours.
- What I thought would be interesting is to investigate other ways of constructing the hierarchical segmentation tree from the output of the Oriented Watershed Transform rather than the way it is done in section 4.2. The paper does discuss segmenting the tree at different thresholds based on criteria like the ODS and OIS. But the tree itself could be constructed using a different heuristic to decide which pair of regions are to be merged at every iteration instead of just the minimum pairwise distance. I realize that this is a somewhat vague observation, but it really appears to me that redefining the similarity metric for combining segments based on some application-specific criteria could give significantly different (probably more useful) results.
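To make the last point a bit less vague, here is a rough sketch of greedy region merging with a pluggable rule for combining boundary strengths after a merge. It is not the paper's UCM construction: the `linkage` hook, the dictionary representation and the toy numbers are all assumptions made for illustration.

```python
def merge_hierarchy(num_regions, boundary_strength, linkage=min):
    """Greedy agglomerative merging over a region adjacency graph.

    boundary_strength: dict mapping frozenset({a, b}) of adjacent region ids
    to a dissimilarity. `linkage` combines the strengths of two edges that
    collapse into one after a merge (hypothetical hook for other criteria).
    Returns a list of (merged_a, merged_b, new_id, strength) tuples.
    """
    edges = dict(boundary_strength)
    next_id = num_regions
    merges = []
    while edges:
        pair = min(edges, key=edges.get)      # weakest boundary first
        strength = edges.pop(pair)
        a, b = sorted(pair)
        updated = {}
        for other_pair, s in edges.items():
            if other_pair & pair:             # edge touches a merged region
                (neighbor,) = other_pair - pair
                key = frozenset({next_id, neighbor})
                updated[key] = linkage(updated[key], s) if key in updated else s
            else:
                updated[other_pair] = s
        edges = updated
        merges.append((a, b, next_id, strength))
        next_id += 1
    return merges


# Four hypothetical regions with made-up boundary strengths.
strengths = {
    frozenset({0, 1}): 0.1,
    frozenset({1, 2}): 0.5,
    frozenset({2, 3}): 0.2,
    frozenset({0, 2}): 0.7,
}
tree_min = merge_hierarchy(4, strengths, linkage=min)
tree_max = merge_hierarchy(4, strengths, linkage=max)
```

Swapping `linkage` (or the initial strengths) for an application-specific criterion changes the height at which regions join, which is the kind of variation the bullet above is asking about.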
## Things I didn't like
- I couldn't think of anything even nearly obviously wrong - the only part of the paper that made me flinch a little was reweighting the oriented watershed transform output. While the rest of the paper seemed to use very elegant techniques, this section alone felt too 'involved'.
In general, this paper is written in a very clean way; you can see clearly what has been done in every step. The results stand as the state of the art in the sense of the metric described in this paper. However, I'm skeptical about these metrics. Perhaps I am missing something, but to me the performance of unsupervised segmentation should be measured by its usefulness to higher-level understanding of the image (semantics).
There is indeed a lot of clever engineering in this paper, e.g., half-disc histogramming, eigen images, OWT, etc. I wonder whether this is really the path we should follow: making things more and more complicated until we finally solve the problem. Or perhaps nature does not have that many mechanisms, and some more principled approaches are still waiting to be discovered?
I also had the impression that every step in the paper is well thought through, with a good understanding of what choosing a different step would bring. Also, it was interesting for me to see that they had a considerable performance boost both in contour extraction and in segmentation.
As for the complicated engineering on contours, like the half-disc histogram and the eigen images, I have the impression that there may be a way to extract the information in a pixel neighbourhood in the best way. I want to mention SIFT here: after so many years it is probably still the best descriptor, so maybe it uses the pixel information in the best way...
This paper contains a lot of hand-crafted engineering work. Although some of the techniques are clever, redesigning a similar system is complicated, and a system like this is hard to tune. It would be interesting to use machine learning algorithms to replace some of the hand-crafted work. Realizing the useful hand-crafted components with machine learning would help future work, because we need to understand the principles behind the engineering techniques. Only by understanding the principles can we confidently use these methods and design more powerful systems.
ReplyDeleteThere is quite a bit of engineering work, but we see this even with using machine learning algorithms. A few parameters (like the scales, \sigma, or the radius for the affinity matrix, r) are hand-tuned, but then parameters like the linear combination to do the mPb are learned using gradient ascent. Similarly, the gPb weights are trained using gradient ascent. Even machine learning algorithms end up having tuned parameters (bandwidths for kernels, etc). I agree they could have tried something more generic (e.g. cross-validation to pick the scales, etc.)
DeleteI want to add one thing on the "spectral clustering" part of this work. Compared to the ECCV 1998 work from the same group: https://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/papers/lm_eccv98.pdf, the idea of solving the generalized Rayleigh quotient v'(D-W)v/v'Dv is exactly the same, but the way that the affinity matrix W(i,j) is generated sees one big change/improvement. In eccv'98 work, Wij is only a linear combination of local cues (edge, intensity, hue and saturation), but in this work, it also incorporates the "half-disc-histogramming" for gradient information. I admit this is a smart way of getting gradient information, and is probably one of the reasons for better performance, but sometimes it gives the feeling that the half-disc method is so "well-designed", and so "well-crafted".
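For anyone who has not seen the half-disc idea spelled out, here is a rough single-pixel sketch: split a disc around the pixel along a diameter at angle theta, histogram each half, and compare the two histograms with the chi-squared distance. It assumes a single-channel image with values in [0, 1] and a centre far enough from the border. The function name, bin count and toy image are illustrative choices, and the smoothing and multi-channel details of the real detector are left out.

```python
import numpy as np


def half_disc_gradient(img, row, col, radius, theta, bins=8):
    """Chi-squared distance between histograms of the two halves of a disc.

    img: 2D array with values in [0, 1] (assumed). The disc centred at
    (row, col) is split by a diameter at angle `theta` (radians).
    """
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    in_disc = ys ** 2 + xs ** 2 <= radius ** 2
    # Signed projection onto the normal of the dividing diameter.
    side = xs * -np.sin(theta) + ys * np.cos(theta)
    patch = img[row - radius:row + radius + 1, col - radius:col + radius + 1]
    g, _ = np.histogram(patch[in_disc & (side > 0)], bins=bins, range=(0, 1))
    h, _ = np.histogram(patch[in_disc & (side < 0)], bins=bins, range=(0, 1))
    g = g / max(g.sum(), 1)
    h = h / max(h.sum(), 1)
    denom = g + h
    nz = denom > 0
    return 0.5 * np.sum((g[nz] - h[nz]) ** 2 / denom[nz])


# Toy image with a vertical edge: dark left half, bright right half.
img = np.full((21, 21), 0.1)
img[:, 10:] = 0.9
across = half_disc_gradient(img, 10, 10, 5, np.pi / 2)  # halves lie left/right of the edge
along = half_disc_gradient(img, 10, 10, 5, 0.0)         # halves lie above/below
```

The response is large when the dividing diameter lines up with the edge and vanishes when it is perpendicular to it, which is what makes the measure orientation-specific.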
Since edges are so successful in detection, I wonder why bottom-up methods for boundary detection like gPb cannot be used for detection as well. There are faster versions nowadays (like Sketch Tokens, http://research.microsoft.com/en-us/um/people/larryz/CVPR13SketchTokens.pdf) that can produce even better results, while for detection the edge is still computed as the locally normalized histograms in HoG. If we can propose more powerful ways (in the sense that the edge/boundary detectors are learned from data as well), are we able to engineer better detectors?
One of the paper's strengths is that it looks at the problem of segmentation holistically, which makes it a wonderful choice of paper to read on this topic. Some other good contributions:
1. I like the elegant way in which contour detection was globalized using the generalized eigenvectors and how the contour detection was fully framed in a probabilistic way using local and spectral signal information.
2. The Oriented Watershed Transform is a very intuitive extension to the Watershed transform. The example given wonderfully illustrates the problem of incorrect weighting and its solution.
3. The optimization of oriented gradient calculation using rectangular scripts shows how approximations make life easier in computer vision and even though this is in the appendix it deserves more attention!
Critical points:
1. Segmentation ultimately is semantic and this "signal processing" job is just the start of a greater problem. Hand tuning of segmentation levels to select the best looking one is a bit like "playing to the gallery". Ultimately, without semantic information the correct scale for OWT-UCM is unknown, and this segmentation is a starting point for automatic semantic segmentations of the image.
As a journal manuscript (at 20 pages), the paper had enough space to be able to describe many things in good detail. In addition, the paper had good images that contributed to the understanding. With that said, there was still quite a lot of algorithmic material, and at times it was difficult to fully comprehend. One part that I felt was lacking some explanation was the use of the generalized eigenvectors (i.e. (D-W)v = lambda * Dv). What is the intuition behind the eigenvectors of this system? D-W seems to encode some normalization on W (how much does W_{ij} contribute). The contributions with the oriented-watershed transform + ultrametric contour map seem to add value to their contour detector by allowing them to easily get multiscale segmentation.
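One standard way to read the generalized eigenproblem, offered as a sketch rather than the paper's own exposition, assuming W is symmetric and D is diagonal with D_{ii} = \sum_j W_{ij}:

```latex
\[
v^\top (D - W)\, v \;=\; \tfrac{1}{2} \sum_{i,j} W_{ij}\,(v_i - v_j)^2 ,
\qquad
\frac{v^\top (D - W)\, v}{v^\top D\, v}
\;=\;
\frac{\tfrac{1}{2}\sum_{i,j} W_{ij}\,(v_i - v_j)^2}{\sum_i D_{ii}\, v_i^2} .
\]
```

The stationary points of this quotient satisfy (D-W)v = lambda * Dv. A vector with small lambda therefore changes little between pixels with high affinity, so whatever variation it has is pushed onto places where W_{ij} is small, which is one way to see why the eigenvectors look like they carry contour information.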
I'm probably repeating things mentioned above, but this was my two cents.