Tuesday, October 29, 2013

Reading for 11/5

Note: No class on 8/31


Lubomir Bourdev, Jitendra Malik. Poselets: Body Part Detectors Trained Using 3D Human Pose Annotations. In ICCV 2009.

and optionally:

Saurabh Singh, Abhinav Gupta, Alexei A. Efros. Unsupervised Discovery of Mid-Level Discriminative Patches. In ECCV 2012.

Lubomir Bourdev, Subhransu Maji, Thomas Brox, Jitendra Malik. Detecting People Using Mutually Consistent Poselet Activations. In ECCV 2010.

38 comments:

  1. ================
    ### Summary ###
    ================
    This paper addresses the problem of detecting humans and estimating their pose in images using a part-based method. The authors define poselets, which are parts of a pose (for instance, right arm crossing the torso, right profile and shoulder, etc.) and do not necessarily correspond to an anatomical part of a human such as the left forearm or the right leg. Discovering poselets requires annotating 2D images with 3D position and visibility information, which is reminiscent of the earlier paper by Deva Ramanan's group that we read: "Analyzing 3D Objects in Cluttered Images". As described in Section 2, annotating images this way allows them to easily perform some interesting queries, such as estimating probable locations of one part conditioned on the location of another, computing statistics of camera views, and extracting images of people in specific rough poses such as sitting. Their H3D dataset containing these annotations has been made publicly available.
    The major stages of their algorithm are:
    * Clustering to discover poselets: Poselets are discovered by performing clustering with a squared error metric in the 3D configuration space, followed by pruning to reduce the number of clusters.
    * Detecting poselets and humans: For each poselet, a classifier is trained on its associated rectangular window. Objects such as human torsos, as well as keypoints on the body, are detected by combining the scores obtained from running each poselet classifier in a multiscale scanning window over the image; a minimal sketch of this step follows below. Results are shown comparing their algorithm with the Dalal and Triggs pedestrian detector as well as a DPM method, on their own H3D dataset and on the VOC 2007 dataset.
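
    To make that step concrete, here is a minimal sketch in Python. This is my illustration, not the authors' code: poselet_svms (a list of trained classifiers) and the HOG parameters are assumptions.

        import numpy as np
        from skimage.feature import hog
        from skimage.transform import rescale

        def detect_poselets(image, poselet_svms, window=(96, 64),
                            scales=(1.0, 0.75, 0.5), stride=16, thresh=0.0):
            # Run every poselet classifier over every window at several
            # scales; keep activations whose SVM score clears a threshold.
            activations = []
            for s in scales:
                im = rescale(image, s)  # grayscale image assumed
                H, W = im.shape[:2]
                for y in range(0, H - window[0], stride):
                    for x in range(0, W - window[1], stride):
                        patch = im[y:y + window[0], x:x + window[1]]
                        feat = hog(patch, pixels_per_cell=(8, 8),
                                   cells_per_block=(2, 2))
                        for pid, svm in enumerate(poselet_svms):
                            score = svm.decision_function(feat[None, :])[0]
                            if score > thresh:
                                # store location in original image coordinates
                                activations.append((pid, y / s, x / s, score))
            return activations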

    ====================
    ### Contribution ###
    ====================
    * The main contribution of this paper is to address the question "how do we decide what good parts are?" in the context of recognizing humans, by introducing poselets. The data-driven approach of detecting parts of a pose, as opposed to using a natural definition of parts from human anatomy, is a good idea. The authors also compare this to DPM-like modeling of parts as latent variables that facilitate detection, and reason about the cases in which their method performs better or worse.

    ====================================
    ### Points of concern / interest ###
    ====================================
    * The hard negatives used for training the classifiers for each poselet are derived from pictures without humans in them. This seems to suggest that there is not much merit in discriminating between poselets. However, in the results section the authors mention that poselet classifiers for front-facing people might fire on back-facing people, leading to wrong detection of shoulders. It seems possible that discriminatively training front- and back-facing poselets against each other could have helped here.

    * The discovered poselets are pruned from an initial count of 120K down to just 300, using various criteria. One of these is to remove poselets that do not train well. It would have been good to see how many examples in the dataset correspond to the poselets that do not train well. If commonly occurring poselets did not train well, that might indicate a problem with their features or classifier.

    * The authors apply an asymmetric distance metric to describe the distance in configuration space from one example to another. I am not sure how significant this is, but the effect of the asymmetry is not discussed explicitly; one guess at where it comes from is sketched below.
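
    The guess (mine, not the paper's stated formula): if the residual is computed only over the keypoints visible in the reference example, then swapping reference and candidate changes which keypoints enter the sum, making the distance asymmetric.

        import numpy as np

        def config_distance(ref_kps, cand_kps, ref_vis):
            # Illustrative asymmetric distance: only keypoints visible in
            # the *reference* contribute, so in general
            # config_distance(a, b, vis_a) != config_distance(b, a, vis_b).
            vis = ref_vis.astype(bool)
            if not vis.any():
                return np.inf
            diffs = ref_kps[vis] - cand_kps[vis]
            return float(np.mean(np.sum(diffs ** 2, axis=1)))  # squared error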

  2. I like this paper because it introduces parts as mid-level cues for the recognition problem. Here, 2D information is not enough to discover good clusters, so the authors switched to 3D; the second paper proposes a way to discover good patches purely from 2D images (although not in a very principled manner). The lesson from this paper is that we need extra constraints to study the visual world.

  3. On the topic of poselets that don't train well: the matching in configuration space is invariant to scale and rotation because a similarity transform is applied to the keypoints (see the sketch below); however, in the appearance space where the SVMs are trained, the features don't have this invariance. I think the authors are getting at this point in footnote 5. I think the framework should allow a many-to-one mapping from poselets in appearance space to a poselet in configuration space.
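
    That invariance presumably comes from a standard least-squares similarity (Procrustes/Umeyama) alignment of keypoints; a minimal 2D sketch (my illustration, not the authors' code):

        import numpy as np

        def similarity_align(src, dst):
            # Least-squares similarity transform (Umeyama): scale, rotation,
            # and translation mapping src -> dst (both Nx2 keypoint arrays).
            mu_s, mu_d = src.mean(0), dst.mean(0)
            s, d = src - mu_s, dst - mu_d
            U, S, Vt = np.linalg.svd(d.T @ s)        # 2x2 cross-covariance
            sign = np.ones(2)
            if np.linalg.det(U @ Vt) < 0:            # avoid a reflection
                sign[-1] = -1
            R = U @ np.diag(sign) @ Vt
            scale = (S * sign).sum() / (s ** 2).sum()
            return lambda p: scale * (p - mu_s) @ R.T + mu_d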

    Replies
    1. What I think is missing here is the fact that multiple poselets existing in the image appearance space can constrain a pose in configuration space. By considering the poselets to be independent, we lose the key constraint that allows us to make this mapping.

      -- Matt K

    2. Another way of looking at Mike's comment (also hinted at by Srivatsan) is to say that the images defining some poselets vary so widely in appearance that a linear SVM cannot separate the data. The authors might have experimented with different features or nonlinear classifiers, but found that simply rejecting these poselets already gives some improvement over the baseline.

    3. Hm, isn't the different appearance of potential poselets exactly the reason to filter out the majority of them?

    4. Why should the SVM perform that much worse in this application versus the Dalal and Triggs pedestrian detector? At least in this case, the HOG features come from a semi-consistent subset of human poses (roughly the setup sketched below). I agree that the seeming independence of poselets is troubling. It would be interesting to see whether the false positives contain a large number of improbable configurations.
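
      For reference, training a single poselet classifier in the Dalal and Triggs style would be roughly this sketch (scikit-learn; pos_patches/neg_patches are hypothetical lists of equally sized grayscale patches, with negatives taken from person-free images as in the paper):

          import numpy as np
          from skimage.feature import hog
          from sklearn.svm import LinearSVC

          def train_poselet_svm(pos_patches, neg_patches):
              # One linear SVM per poselet on HOG features of its window.
              feats = [hog(p, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
                       for p in pos_patches + neg_patches]
              labels = [1] * len(pos_patches) + [0] * len(neg_patches)
              return LinearSVC(C=0.01).fit(np.array(feats), labels)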

    5. As I understood it, the whole point of poselets (as opposed to DPM-style parts) is that they capture semantic information rather than visual appearance information. Therefore, the visually dissimilar characteristic of potential poselets is not necessarily bad, especially if it captures unique or important semantic meaning.

  4. In this paper, the authors present an approach for simultaneous human detection and pose recognition which maps 3D positions of joints, links, and keypoints to arbitrary 2D "poselets," which are similar to the "parts" in a DPM model. They require an annotated 3D database, and leverage this database to produce distributions on keypoint locations given poses, and vice versa. Poselets that maximally discriminate human configurations are learned from the data. They discovered that the most informative poselets involve various zoom levels of human faces and shoulders -- an expected result given the performance of other human detectors.

    While the method the authors present doesn't always do as well as simpler DPM methods at finding humans and measuring the 2D positions of their parts, it is obviously a much more powerful and complete method for detecting humans and their poses. Humans are made up of 3D objects connected by 3D joints. Their appearance is generated by the position and orientation of their center of mass, combined with the configuration of each joint. It only makes sense to leverage this generative model to produce *correct* distributions on 2D appearances in the image, just from a theoretical standpoint. 2D discriminative methods, in contrast, assume faulty priors in the image space such as Gaussian, linear, or uniform distributions, which don't capture the underlying dynamics of the system. In this way, a 2D parts model is generated *automatically* from the 3D human model and data.

    Further, this work comes closer to giving us "what we want" in terms of the "parts" of a human. Really, what's needed in most applications which consider human pose is the real, 3D, continuous configuration of the human, not a 2D bounding box approximation of parts.

    I think that further work in this direction for detecting poses is necessary. The data collection problem is very difficult, and I don't believe their interface works well enough to be useful. Further, their model of the human is not, in my opinion, complex enough to give us the best representation. A human is not a collection of "balls and sticks", but rather has more or less rigid links of certain lengths, connected by joints with more or less known and constrained properties. Each joint has not only a position, but also an orientation (in 3D), as well as a degree of actuation. The joints also have limits. The human spine is another unmodeled aspect, and it is extremely high dimensional. I think that ultimately, all of these degrees of freedom should be modeled.

    -- Matt Klingensmith

    Replies
    1. I would argue that things like joint limits are implicitly modeled by the poselet generation step. If there are no configurations with improper joint angles, no clusters should appear there.

    2. I agree with Humphrey. Poselets are a good way of doing pose estimation by classification (each pose is a class). They decompose the problem by making pose classes out of two or three parts together rather than the entire human body.

    3. They describe detection of individual poselets at test time, but how do we know that the global configuration of detected poselets makes sense? They use 3D constraints at training time, but are we sure that all the individually detected poselets on a given human are sensible when we put them all together?

    4. I agree with Jacob. Torso detection seems like a relatively safe criterion, since many conflicting poselets may vote for the same torso location (roughly as in the sketch below). While the authors do show results on some keypoint detections, I would have liked to see more qualitative results, especially since using 3D poselets for detection should allow for a somewhat detailed pose estimate (as far as limbs are concerned).
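
      The voting intuition could be implemented Hough-style, roughly like this (my sketch; offsets, holding each poselet's mean window-to-torso displacement, is an assumption, and activations are (poselet id, y, x, score) tuples):

          import numpy as np

          def vote_torso(activations, offsets, grid_shape, cell=16):
              # Each activation casts its score into the accumulator cell
              # of the torso center it predicts.
              acc = np.zeros(grid_shape)
              for pid, y, x, score in activations:
                  dy, dx = offsets[pid]
                  gy, gx = int((y + dy) // cell), int((x + dx) // cell)
                  if 0 <= gy < grid_shape[0] and 0 <= gx < grid_shape[1]:
                      acc[gy, gx] += score
              return np.unravel_index(acc.argmax(), acc.shape)  # best cell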

    5. So it seems that if we want reliable detection at test time, it is better to model everything from 2D images at training time as well? I believe that with enough data we could do so (without any 3D information), but 3D information should provide guidance and remove some false positives during training when the dataset is quite small. That is why all these geometry-based methods work better with smaller training sets.

  5. This comment has been removed by the author.

  6. The paper presents an interesting and different approach to the problem of human pose detection. It effectively says that the datasets we train on are somewhat wrong, since they separate the 3D and the 2D at training time. This claim seems obvious from the viewpoint of replicating human understanding of 2D images: we spend our lives interacting with a 3D world to get information, yet we are able to properly detect and classify objects, people, poses, etc. in a 2D image. This is seen even in the way they construct the dataset, by using human labelers to reverse-generate an approximate 3D configuration. It is impressive that they were able to train on their relatively small H3D dataset and still achieve state-of-the-art performance on PASCAL. Hopefully, with Kinect-like sensors, so much effort won't be needed to create annotated 3D & 2D datasets in the future.

    I am also curious how it would have worked to just do L1-regularized training instead of the greedy search for picking the top candidates; a sketch of what that could look like is below.
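
    Something like this, say, where poselets whose weights the L1 penalty drives to zero get dropped (X and y are hypothetical -- per-image max poselet scores and person/no-person labels -- not anything from the paper):

        import numpy as np
        from sklearn.svm import LinearSVC

        def select_poselets_l1(X, y, C=0.1):
            # X[i, j] = max activation score of poselet j on image i.
            # The L1 penalty zeroes out redundant poselets; keep the rest.
            clf = LinearSVC(penalty='l1', dual=False, C=C).fit(X, y)
            return np.flatnonzero(np.abs(clf.coef_.ravel()) > 1e-6)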

  7. It seems like the power behind this method comes from the intricate dataset. How will this extend to other classes? How large a Mechanical Turk army will we need? Perhaps if we had a way to go from other imaging modalities, such as 3D point clouds or 2.5D depth images, to the configuration-space data they need, this could extend nicely. On that note, wouldn't skeleton data from the Kinect be able to produce useful (albeit biased) data?

    Replies
    1. It seems so, though the problem with current Kinect-style sensors is that they only work indoors. This could make it hard to get a wide variety of lighting, and could bias the hard-negative mining, since backgrounds would only come from indoor settings.

    2. I suspect that even poselets learned from quite biased datasets would still be useful. I think there is definitely merit to coming up with a more automated way to create a dataset like the one presented in this paper.

    3. The Kinect might give us a really biased dataset because of constraints such as being indoors only, or the poses having to be fairly simple.

    4. I think that if we used the Kinect only to get pose information, we might be able to automatically correlate it with a 2D image dataset. I imagine this being a system where the Kinect (which lacks visual variety) enumerates different common poses for humans, and a 2D image dataset (which lacks 3D data) is correlated with the 3D poses to get a larger dataset.

    5. The paper specifically mentioned that there are 15 regions of a person ("face", "upper clothes", etc.) that were annotated manually. These seem like data that would be non-trivial to obtain from Kinect data.

      While I was reading this paper, I thought it would be super cool if all you needed was the skeletal data, and if that could be obtained from the Kinect. If so, this kind of method would be much more feasible. As it stands, we have 5 min/annotation * 2000 annotations, roughly 167 hours, which seems like a very long time.

      I have no idea how necessary it is to have an accurate segmentation of these region types. Maybe some approximation of it via the Kinect skeleton would suffice. If so, it seems to me that this kind of data is relatively easy to get (excluding biases introduced by Kinect limitations). This makes this paper particularly exciting for me, because it seems like something for which generalization is actually feasible.

      Thoughts on why the region segmentations were required? I know they needed some sort of image information, but do people think these segmentations have to be really accurate?

  8. The authors find 200K poselets and prune them down because some of them don't have enough data points or a linear SVM cannot be made to work on them. I think they are losing out on modeling the long, fat tail of the dataset by discarding this rich information.

    Based on our estimates of the number of classes required to solve pose estimation as well as humans do, we probably need a large number of poselets too (not as many as classes, but probably many more than 300).

    If they hung on to everything till the end and used their cross-validation to discard only those poselets which hurt, then they might be able to capture odd poses better (thus improving performance a tiny bit).

    Replies
    1. I don't follow this. If an SVM doesn't train well (which seems to be the case given their description), how would it be selected by cross-validation?
      Unless you are suggesting something closer to "exemplar-poselets"? With few images that vary too much in appearance, an SVM doesn't stand a chance.

    2. Exactly. The pruning/selection step bothers me. Was the cross-validation task pose estimation or person detection? Was it to find poselets that are similar in appearance, evaluated in configuration space?

    3. I agree. The pruning through cross-validation was very vague. Since they mention it is a greedy approach, I wonder if they are doing something like OMP, where you first choose the poselet with the maximum score and then combine poselets sequentially to get the best subset; a sketch of that procedure is below. But wouldn't that be a very expensive operation?
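
      For concreteness, the OMP-style procedure being suggested might look like this (a guess at the flavor of the greedy selection, not the paper's actual step; scores is a hypothetical images-by-poselets matrix and y the +/-1 person labels):

          import numpy as np

          def greedy_poselet_selection(scores, y, k=300):
              # Classic OMP loop: pick the poselet most correlated with the
              # current residual, refit on the chosen set, update residual.
              residual, chosen = y.astype(float), []
              for _ in range(k):
                  corr = np.abs(scores.T @ residual)
                  corr[chosen] = -np.inf          # never pick twice
                  chosen.append(int(corr.argmax()))
                  S = scores[:, chosen]
                  coef, *_ = np.linalg.lstsq(S, y, rcond=None)
                  residual = y - S @ coef
              return chosen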

    4. I don't think they're doing OMP, but it seems like it would be a very natural thing to do... Why would it be expensive?

      Actually, there's a paper similar to this one in that they're also clustering patches, but they cluster at many different scales: http://homes.cs.washington.edu/~lfb/paper/cvpr13.pdf

    5. My bad! I didn't fully write what I was thinking. Just choosing one poselet with the maximum score and adding other poselets onto it might not be too expensive. But how confident are the authors about the distance metric? If the distance metric is not suitable, one run of an OMP-like method might not work. Trying different starting poselets would be required, and that would be an expensive operation.

    6. They introduce a mid-level representation, i.e., poselets, to connect configuration space and appearance space. It seems that in some cases the match between these two spaces via poselets is not so good, so they have to prune. Doing something like exemplar-SVMs for detection might alleviate the influence of large intra-class variation combined with small training samples.

  9. I like this paper for its use of parts which have some grounding (in configuration space). I believe using a grounding apart from appearance (like in DPMs) is a nice way to enrich these part-based models. A2's paper in ICCV finds such a grounding in geometry space.

    I liked the task-specific learning of weights. I think we often fail to exploit domain- or task-specific information and hence make a problem harder than it should be. The authors show a nice way to overcome this.

  10. The authors present a new mid-level representation -- poselets -- which is another attempt to define "parts" for human detection and/or pose estimation. They introduce a new dataset, H3D, which has 3D keypoint annotations for humans, which they use to discover poselets. The key properties they are looking for in these poselets are that they should be easily detectable and should help in estimating the 3D pose/configuration of a person.

    I liked the paper for its merit, its ease of reading, and for describing everything intuitively. However, here are a few concerns (points of discussion):
    1. Going from 120k -> 2k -> 300/256 poselets. I was around when Singh et al. were working on their ECCV 2012 mid-level patches paper, and I know how difficult it was to define a good metric to select good patches, or even to induce a good ordering. Apart from the 120k -> 2k jump, I'm very unclear about how the rest of the pruning via cross-validation was done or engineered. Anyway, I would have liked to see some more details :)
    2. Annotation. The authors try to address this point on page 3 c2 l2, but I was not convinced. I have seen images in PASCAL and they do seem to be pretty arbitrary and complex. Why couldn't we just do similar 3D annotations on those images?
    3. Not related to the paper but in general: it is easy to provide 3D annotations for humans, cars, etc., which usually follow a very standard skeleton/structure. But I doubt such techniques can be extended to most other objects, which is where I think making use of 2.5D or Kinect-like data might come in handy. This point is also raised by a few other students above.

  11. It seems to me that this method could be even more useful if it is known that the camera will see objects only in a subspace of configurations. For example, cameras in public places will always see people from some pitch angle, and car dash cameras will see people and cars from a height of about 1 meter.
    I guess what one needs to do in this case is just to choose the relevant poselets from the poselet database.

  12. This paper specifically focuses on the usage of parts as applied to humans. It seems the implicit argument is that humans are the most complex sets of parts, and so solving the problem for them will render the problem solved for simpler objects.

    They constrain themselves to rectangular patches for simplicity's sake. This approach seems fine as a pre-processing step, but using straight unsegmented patches to put things in configuration space seems non-optimal. Can we create mid-level patches and then use only the regions of interest (and not the background) as input?

    It was interesting to see their results on something like "metadata transfer" of dataset segmentations onto an image with predicted poselets, and I would've liked to see more of such results.

  13. I really like the concept of poselets for doing pose estimation, and the way the authors compared this method with DPMs and showed how the performance varies. They have also created a new dataset, H3D, with annotations that can be used in different ways, which I'm sure is a major contribution. But I had issues with some parts of their paper, which people have mentioned above.

    1. The pruning through cross-validation - It is not clear how this works and what exactly the authors do: why a greedy approach, how confident they are about its end results, how significant the distance metric is, why this distance metric is better than others, etc.
    2. Choosing negative examples without humans in them - This definitely disturbed me a lot. These are not hard negatives if there are no humans in them. I wonder whether they tried "proper" hard negative examples.

  14. I like this paper because it addresses the question of what makes a good part. Rather than using visual constraints only in 2D, this paper uses 3D information about joint positions. It is a good way to combine 2D and 3D information. And it may be possible to apply this algorithm using a depth sensor such as the Kinect; at the least, that would make capturing the intricate dataset easier.

  15. I really liked how the authors tried to make a more comprehensive dataset, but I feel like their dataset, and more specifically the 3D annotated poses, was really the key behind their model. The intuition here is that with the 3D annotated poses, you can get much more detailed information about the part you're tracking, like its orientation and scale. It also takes joint limitations into account. I think this is why they do much better than DPM on their own dataset, but perform similarly on PASCAL VOC.

    I didn't quite like their method for poselet selection. They had to keep using different techniques to prune their poselet candidates, and even in the final set, a lot of the poselets were still redundant. I would have thought a much more natural way of finding poselets would be some clustering approach. Also, I would guess that humans take many more than 300 poses, and given that several of these poselets were redundant, perhaps their poselets were not quite comprehensive or exhaustive. I think this again shows that the 3D annotated data is really what is driving the algorithm here.

    Replies
    1. I dunno, I think 300 might be adequate for estimating poses of humans in street scenes. It seems like the images don't really include funky poses like those in sports images. Additionally, one poselet can be used for many poses, since poses are composed of combinations of poselets.

      In class we talked about the number of human poses, and I think there was a paper somewhere that estimated it at 10^5? I can see how 300 poselets could cover this space pretty well, since there are many combinations of poselets available. What I mean is that, yes, 300 << 10^5, but one poselet doesn't correspond to one pose; see the quick arithmetic below.
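
      A quick back-of-the-envelope check (illustrative arithmetic only):

          import math

          # unordered pairs and triples of 300 poselets
          print(math.comb(300, 2))   # 44850
          print(math.comb(300, 3))   # 4455100 -- already far beyond ~1e5 poses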

  16. I like this mid-level representation using poselets a lot. It is interesting to see the connection between this mid-level feature (poselets) and attribute-based pose estimation. The way the authors have defined poselets makes each body part an attribute already. For example, this subject has an arm crossing the torso at 45 degrees, or this particular arm pose looks like arm pose No. 3 from the reference. That way, simile classifiers could also be harnessed, which may further improve pose estimation tasks, as in the facial attributes paper we discussed.
