Tuesday, October 22, 2013

Reading for 10/24

Guest lecture from Leonid Sigal--but you still have to read/summarize/post to the blog as usual.

Book Chapter on Human Pose Estimation: http://www.ics.uci.edu/~dramanan/papers/people_chapter.pdf 
And optionally:

Y. Yang, D. Ramanan. Articulated Human Detection with Flexible Mixtures of Parts. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI).
P. Felzenszwalb, D. Huttenlocher. Efficient Matching of Pictorial Structures. CVPR 2000.

33 comments:

  1. This reading differs from others in that it is a book chapter rather than an academic paper. As such, it presents much more of an *overview* of existing person-tracking techniques, rather than going into depth about any particular technique. The author provides a nice overview and introduction of various part-based techniques, from the standard DPM and its variants to pictorial models, tree-based and star-based models, and mixtures of Gaussians and trees. We also see a basic overview of descriptors and model-learning techniques.

    The author provides several driving reasons for performing part-based human pose estimation -- detection, regression, tracking, and inference -- and reviews common techniques for each. He discusses the strengths and weaknesses of local descriptors based on color and texture, and of different part models.

    From this book chapter, we might conclude that the state-of-the-art techniques for human pose estimation and tracking are variations on a theme: 2D pictorial structures consisting of linked-together 2D detectors. Learning consists of discovering a model for each part, and then estimating the parameters of the model as a whole. In the application of tracking, the model's *configuration* is estimated from prior knowledge and temporal constraints, usually with a particle filter.
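    To make the "linked 2D detectors" idea concrete, here is a toy sketch of exact inference in a tree-structured pictorial structure. Everything here is made up for illustration: the part names, candidate locations, detector scores, ideal offsets, and spring constant are hypothetical, not from the chapter.

```python
# Toy tree-structured pictorial structure: a root "torso" with two child
# parts, "head" and "arm". All candidate locations, detector scores,
# offsets, and the spring constant are made-up illustrative numbers.
CANDIDATES = [(0, 0), (0, 2), (2, 2)]           # possible (x, y) per part
UNARY = {                                       # hypothetical detector scores
    "torso": {(0, 0): 1.0, (0, 2): 0.2, (2, 2): 0.1},
    "head":  {(0, 0): 0.1, (0, 2): 0.9, (2, 2): 0.3},
    "arm":   {(0, 0): 0.2, (0, 2): 0.1, (2, 2): 0.8},
}
CHILDREN = {"torso": ["head", "arm"]}
OFFSET = {"head": (0, 2), "arm": (2, 2)}        # ideal child offset from root

def spring_cost(parent_loc, child_loc, offset, k=0.25):
    """Quadratic 'spring' penalty for deviating from the ideal offset."""
    dx = child_loc[0] - parent_loc[0] - offset[0]
    dy = child_loc[1] - parent_loc[1] - offset[1]
    return k * (dx * dx + dy * dy)

def best_configuration():
    """Exact inference on the tree: for each root location, each child can
    be optimized independently; then pick the best root."""
    best = None
    for root_loc in CANDIDATES:
        total, config = UNARY["torso"][root_loc], {"torso": root_loc}
        for child in CHILDREN["torso"]:
            s, loc = max((UNARY[child][c] - spring_cost(root_loc, c, OFFSET[child]), c)
                         for c in CANDIDATES)
            total += s
            config[child] = loc
        if best is None or total > best[0]:
            best = (total, config)
    return best

score, config = best_configuration()
print(config)   # each child snaps near its ideal offset from the root
```

    The tree structure is what makes this cheap: each child's best placement depends only on the root, so children never have to be searched jointly.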

    Though state-of-the-art results are impressive, I feel that they lack some important information. Each model is a vast simplification of the human form, and as such they have to rely on sketchy post-processing to deal with the model's limitations. Humans are not simple 2D pictorial tree structures linked by springs. Rather, they are 3D kinematic chains, composed of deformable 3D links and hundreds of 3D joints. Other models, which assume 3D data and use simpler inference algorithms, seem to do better than the best state-of-the-art 2D human detectors and trackers (see "Efficient Human Pose Estimation from Single Depth Images," Jamie Shotton et al.). This seems to suggest that something fundamental is missing from such 2D models.

    -- Matt Klingensmith

    Replies
    1. I agree that having 3D models would be the ideal thing to do. But the problem of getting depth information from a single image is the hard part. In the paper you mentioned, they consider depth images, which means they already know some prior 3D information which makes the problem "slightly" easier. With just 2D information, I think the state-of-the-art results are very impressive and this book chapter has provided insight into a lot of existing algorithms for human pose estimation.

    2. I agree that getting depth information from just the image would be impossible, but there could be an approach that tries to incorporate a 3D CAD model of a person. For example, we saw the 3D DPM model earlier used for car detection. A similar approach for people might help bridge the gap between the 2D images that we have and the 3D reality.

    3. I think that starting with 2D person detection and pose estimation makes sense. Firstly, most of the data we had (or have) is still 2D images, and 2D-to-3D conversion techniques are not as reliable for every type of scene. Secondly, it helps us understand the strengths and weaknesses of doing things in 2D, which will eventually help (as it did) to incorporate 3D information/constraints, predict 3D from 2D, and do stuff in 3D...

      So yes, humans are not 2D pictorial tree structures linked by springs, but that's how they appear in our 2D images. So I think everything from dealing with humans in 2D to incorporating and/or inferring 3D information is a valid problem. I agree that most of the work has been in just 2D, but since the advent of consumer devices like the Kinect, we are seeing an increase in 3D techniques as well :)

      On the other hand, one good example of going from 2D key-points to 3D landmarks is Varun's paper (http://www.cs.cmu.edu/~vramakri/academic/ECCV2012.html). I think it might make you happy :)

    4. I agree with Priya's idea of using a 3D model. I understand that 3D from 2D is unreliable and hard, but why not incorporate 3D into the part model and project the model into 2D - sort of like the paper we saw with the car and the...landmarks? (Headlights, wheels, etc.). This way, different viewpoints don't need to be represented with completely different models - less stuff to train, no? Is there any argument for why representing different viewpoints with different models is a good way to go?

      And oops, after reading Priya's post again more carefully, I realize that I said just about the exact same thing. My bad...

    5. I think we need to combine 3D and 2D information together. If we only use 3D points, we may have some problems when a human is holding something.

    6. I still believe that 2D data is enough if we have a lot of it, organized in a structured way. 3D information may help for sure, and it has its practical uses. However, taking advantage of it is not that useful for understanding the latent variables that the human brain is able to produce from the 2D world (and we dub it 3D; maybe dogs cannot have language, but they still behave properly and adapt to the 3D world, and it is believed that ants only have a sense of the 2D world, yet they still live pretty well in terms of sensing).

    7. I agree with Priya in that incorporating 3D seems the right way to go. Looking at Fig. 4 gives me the idea that one could run a 2D DPM and then fit a 3D pose model on top of it, possibly then refining the 2D model. Or at least the spring energies of the 2D DPM could be derived from a 3D model, rather than just being Gaussians.
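      On the "just being Gaussians" point: the spring energy in a pictorial structure is the negative log of a Gaussian over the relative displacement between two parts, which works out to a quadratic cost. A tiny sketch, with a made-up mean offset and standard deviations (these numbers are purely illustrative):

```python
# A diagonal-Gaussian "spring" on the relative displacement d between two
# parts: the negative log of N(d; mu, sigma^2), with the constant term
# dropped, is exactly a weighted quadratic deformation cost.
MU = (0.0, 30.0)        # hypothetical ideal offset of head relative to torso
SIGMA = (5.0, 8.0)      # allowed slack in x and y (also made up)

def gaussian_spring_cost(d):
    """Negative log-likelihood of an axis-aligned Gaussian (constant dropped)."""
    return sum(((di - mi) / si) ** 2 / 2.0
               for di, mi, si in zip(d, MU, SIGMA))

print(gaussian_spring_cost((0.0, 30.0)))   # at the mean offset: 0.0
print(gaussian_spring_cost((5.0, 38.0)))   # one sigma off in x and in y: 1.0
```

      Deriving MU and SIGMA from a 3D model, as suggested above, would just mean replacing these hand-set numbers with statistics projected from the 3D pose prior.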

    8. I think incorporating 3D information, or combining 2D and 3D, is absolutely important, and it is the future direction for human detection and pose estimation. However, at the current point, focusing purely on 2D will let us fully understand the problem, i.e. the overall pipeline, how far we can go in 2D, its limitations, and in what sense 3D will break through the bottleneck of 2D. This is better than somehow directly imposing preliminary 3D constraints just to enhance detection and estimation performance.

  2. This comment has been removed by the author.

  3. I really like the appearance/color modeling idea presented in this chapter. I feel like this approach is opposite to most of vision, where we tend to focus more on image intensity and gradients. While I think that modeling the color symmetry between arms and legs is a good idea, I'm not sure if it will work in practice. For example, if we try to match head color with arm color, then what would happen if the person were wearing sunglasses and a hat, and we got some very weird head appearance? I think that beards would be another thing that would throw off such an algorithm. I would guess that an approach that tries to model color relationships between parts would just end up overfitting to the norm.

    Would this be a good way to proceed for future algorithms?

    Replies
    1. The color/appearance model idea has been proposed in the "tracking by model building" section too. But here, if we have a high enough frame rate then the appearance in frame t and frame t+1 might be similar enough for this idea to work across frames.

      But yes, within a single frame it seems that a pairwise local symmetry is not expressive enough to capture the color patterns in natural data.
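      The cross-frame version of the idea is simple to sketch: compare a part's color histogram in frame t against frame t+1. The histograms below are hypothetical 4-bin hue distributions, and histogram intersection is just one of several standard similarity measures one could use here:

```python
# Toy check of frame-to-frame appearance consistency: compare per-part
# color histograms between frame t and frame t+1. The 4-bin histograms
# are made-up hue distributions.
def hist_similarity(h1, h2):
    """Histogram intersection: 1.0 for identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))

arm_t  = [0.7, 0.2, 0.1, 0.0]   # mostly "red" sleeve in frame t
arm_t1 = [0.6, 0.3, 0.1, 0.0]   # nearly the same sleeve in frame t+1
head_t = [0.1, 0.1, 0.2, 0.6]   # a very different part

print(hist_similarity(arm_t, arm_t1))   # high: same part across frames
print(hist_similarity(arm_t, head_t))   # low: different parts
```

      At a high frame rate the first similarity stays high, which is what makes the cross-frame constraint more reliable than within-frame symmetry.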

  4. The chapter for this lecture's reading gives a comprehensive account of using part-based models for human pose estimation, building on the DPM algorithm. One thing that I found rather depressing was the detection-based tracking paradigm (page 20) that the chapter reports as performing surprisingly well. This is a typical case of the template-update problem, where the update happens with every frame and the contribution of the tracking component is zero, except maybe for posterior smoothing.
    Performing an independent detection every frame sounds extremely wasteful, though it might work well in certain cases. I think it points to our lack of understanding of kinematic reasoning for tracking humans in 2D images, which shouldn't be a reason to abandon it altogether. For doing things like human action/activity/event forecasting, this might be essential.

    Replies
    1. I remember Martial saying that the Kinect uses frame-by-frame detection instead of tracking (though the situation is different here since the picture has depth). There was some research on the pros and cons of constant re-detection vs. tracking.
      I had some experience with tracking a driver's head and eyes in a course project. I can say that detection (in several implementations) took way more time than tracking (also in several implementations). It wasn't possible to do only constant re-detection in real-time on a laptop. So tracking may not be so good, but it's still useful :)

    2. Tracking is useful if you can constrain your domain and leverage domain-specific knowledge. For example, it is reasonably successful in surveillance because you can build a nice background model (by background subtraction).
      For the kinect example, I highly recommend Fitzgibbon's lecture. He clearly shows that even if tracking were "close to perfect", a "tracking based kinect" would be a disaster.

    3. One of the things I found missing in this chapter was an explanation as to what model works best in what situation. Usually, one gets a feel for things by looking at the dataset presented with the paper. In this particular case, I'd like to know more about why tracking is not so useful.

      I agree that it seems wasteful to refit a model on each frame; it could be that the only solutions that work for complex environments are things like particle filters, which add so much overhead at the model-fitting level to propagate each particle that you may as well refit at each frame and instead treat tracking as an independent, high-level process.

    4. From a state estimation perspective, detection-based tracking could be justified because it better approximates the independent observation assumption made for various filtering techniques, like the Kalman filter mentioned in that section.
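      A minimal sketch of that filtering view: a 1-D constant-velocity Kalman filter that treats each frame's detector output as an independent observation. The process/measurement noise values (q, r) and the detection sequence are made up for illustration.

```python
# Minimal 1-D constant-velocity Kalman filter over per-frame detections,
# treating each detection as an independent observation of position.
# q (process noise) and r (measurement noise) are made-up values.
def kalman_track(detections, q=1.0, r=4.0):
    x, v = detections[0], 0.0      # state: position and velocity
    p = [[1.0, 0.0], [0.0, 1.0]]   # state covariance
    out = [x]
    for z in detections[1:]:
        # predict with constant velocity (dt = 1)
        x, v = x + v, v
        p = [[p[0][0] + 2 * p[0][1] + p[1][1] + q, p[0][1] + p[1][1]],
             [p[1][0] + p[1][1],                   p[1][1] + q]]
        # update with this frame's detection z (we observe position only)
        s = p[0][0] + r
        k0, k1 = p[0][0] / s, p[1][0] / s
        y = z - x
        x, v = x + k0 * y, v + k1 * y
        p = [[(1 - k0) * p[0][0],        (1 - k0) * p[0][1]],
             [p[1][0] - k1 * p[0][0],    p[1][1] - k1 * p[0][1]]]
        out.append(x)
    return out

# Smooths a noisy, roughly linear track of per-frame detections.
smoothed = kalman_track([0.0, 1.2, 1.8, 3.1, 3.9, 5.2])
```

      The independence assumption on the observations is exactly what detection-based tracking satisfies better than a tracker whose output at frame t depends on its own output at frame t-1.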

  5. I agree with everyone's summary of the chapter: it is a well-written, comprehensive account of the work done in this field. It nicely connects different methods, explaining their intuition, similarities, and differences. Even the math has been simplified so that a normal student (with an ML101 background) can understand most of it at a high level and, if needed, can go to the individual papers for details.

    However, I think the "Discussions and Open Questions" section should have discussed more things.. :)

  6. I like this book chapter because of its clear flow and presentation style.


    I agree with Srivatsan that it is depressing to see that removing temporal links (and just literally "tracking by detection") performs well. One would assume that temporal consistency would help. I think the main problem is that current motion models fail in cluttered backgrounds. The motion models (like KLT/LDOF) do not have the notion of an object, while your "tracking by detection" has no notion of time. It is difficult to get these two to talk to each other.
    An even more depressing result is from Deva's recent paper (http://www.ics.uci.edu/~jsupanci/SPLTT.pdf), which shows that temporal consistency only improves the F1 score from 89% to 91%.
    Summary: the better your detector, the less important the motion model is.

    Replies
    1. If detection were perfect, then temporal consistency would come automatically. But in the presence of noise there is no perfect detection, and therefore the importance of temporal consistency is non-zero. Another reason to motivate temporal consistency is that detection in every frame is computationally formidable, and temporal consistency might help prune the search space. Furthermore, independent detection results from multiple frames cannot be merged easily in crowded scenarios (e.g., groups of marching soldiers). Sometimes objects are entirely occluded, and temporal information along with motion models helps us know that they exist in the image but are invisible.

    2. I agree with Aravindh. The data association problem for identical objects is impossible without some temporal reasoning.

  7. I thought the chapter was generally well-written although I was disappointed and distracted by all the typos. I expect book chapters to be a little more carefully edited.

    I feel like I was able to get the general gist of how everything worked. However, I've forgotten a lot of details regarding graphical models, which made it difficult to understand all the message passing and inference stuff.

    It seems like almost all the part-based methods described in this chapter used limbs as parts. Is this true, or did the authors just name them after limbs as examples? I like the idea of using lots of parts that are not necessarily tied to specific limbs, as in the algorithm we read on Tuesday, where the algorithm learned the parts itself. I feel like we can use human intuition to get hints for what might be useful to machines, but that in the end, the machines should figure it out, or we should at least keep in mind that what works for humans may not be best for machines.

    I just keep thinking about sewing machines and how people tried to make the needle go over and under for a really long time before they realized that it's easier to make a machine where a needle goes up and down and a thread is put through the loop on the other side - a totally different type of stitch than what a human would ever use. (I'm mostly sure I'm not making this story up.)

    Sorry, I digress...

    I wish the authors had motivated parts a little more. It seems like they just jumped into "parts are commonly used for pose estimation." I wish this had been compared to and contrasted with other extensions to simple templates. I know we've talked about a lot of this in class already, but I'm starting to have a hard time keeping everything straight.

    Replies
    1. This is really interesting. I can't - with my human bias - imagine trying to model a tree-based structure with the lower arm as a root. Modeling the head as a root (in tree-shaped models) and the rest of the body as parts has a good non-human justification though - heads are easier to detect. They are hardly ever self-occluded, and there is less variability in appearance (heads are usually not clothed). When we look at examples where pose estimation has failed, the placement of the head is usually still correct. But you're right. It'll be fun to look at which components of a human body should be parts and how they should be connected to each other in a way that minimizes human input.

  8. I really like this book chapter because it gives detailed, intuitive accounts of the state-of-the-art algorithms for human pose estimation. However, I had concerns with the color modeling part. As the authors mentioned in the Discussions section, in the real world it is very hard to use color to model humans because people might be wearing differently colored clothing. I think something like a color-free shape representation would be the ideal thing to do. Maybe texture is important in cases when we have similar-looking objects and we want to classify between them. But I feel color should not be the main component of a model. With reference to Priya's point about beards, there are algorithms that can detect facial hair and do funky stuff (http://repository.cmu.edu/cgi/viewcontent.cgi?article=1140&context=robotics). This works because once we know there is a face, we know where to look for a beard/mustache and what colors to look for (in general). But this is not the case with clothes. With such a huge number of varieties and colors of clothes, it just becomes impossible to get a model that can handle everything (if the model is based on colors for the most part).

  9. I really like this book chapter because it is really comprehensive and contains detailed aspects of recent research on human pose estimation. The outline of this chapter is really clear, and I like the way it gives a road map of this research area. The chapter compares different algorithms and gives the improvements and constraints of each. Generally speaking, this chapter is really well organized.

    However, I'm curious about whether those algorithms would be suitable for the state-of-the-art depth sensor, the Kinect. Or, in other words, would those algorithms benefit from the depth information? Another question is how well those algorithms would perform if a human is holding something like a bat. When I tested the 3D joint positions given by the Kinect SDK, I found that it recognized the end of the bat as the human hand.

  10. I really want to see some connection between human pose estimation and action recognition. It seems, according to the discussion from the last lecture, that 1. the definition of action itself is very blurred, and 2. the most effective part of the pipeline for action recognition still lies in the non-motion part (so t is not that special). Then there is the whole field of human pose estimation, which is able to estimate poses from still images (in this case, \delta t goes to zero), and, as pointed out, these methods work pretty well. These two phenomena might indicate that 1. some "key frames" in a sequence of images can stand out as discriminative features for action recognition, and 2. part-based models are really good at capturing the underlying motion system that humans have.

  11. The book chapter discusses part-based methods. The underlying theme is that parts placed at neighboring (x_i, y_i, \theta_i, s_i) can use the same template, and a mixture of parts can be used to deal with dissimilar part poses. This representation has the benefit of fewer training parameters at the expense of computational complexity (more mixture components to convolve with the image), but it is only approximate - the parts do change a bit with small changes in pose. While more elaborate deformation models would capture these variations, why don't they give state-of-the-art results? Are they unable to work on real images?
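    A toy sketch of the template-sharing/mixture idea: slide each mixture component's template over the image and keep the best response over all placements and components. The "limb" templates and the binary "image" below are made up for illustration; real systems convolve HOG templates, not raw pixels.

```python
# Toy mixture-of-parts scoring: two mixture components for one limb
# (upright and rotated), scored at every placement in a tiny binary image.
VERTICAL   = [[1], [1], [1]]          # limb template, upright orientation
HORIZONTAL = [[1, 1, 1]]              # same limb, rotated mixture component
IMAGE = [
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
]

def xcorr_at(image, template, top, left):
    """Cross-correlation response of one template placement."""
    return sum(template[i][j] * image[top + i][left + j]
               for i in range(len(template))
               for j in range(len(template[0])))

def best_part_score(image, templates):
    """Max response over all placements of all mixture components."""
    best = None
    for t in templates:
        h, w = len(t), len(t[0])
        for top in range(len(image) - h + 1):
            for left in range(len(image[0]) - w + 1):
                r = xcorr_at(image, t, top, left)
                if best is None or r > best[0]:
                    best = (r, top, left, t)
    return best

score, top, left, t = best_part_score(IMAGE, [VERTICAL, HORIZONTAL])
print(score, top, left)   # the vertical component wins on the left edge
```

    The cost trade-off in the comment is visible even here: one shared template per orientation bin, but every added mixture component means another full sweep over the image.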

  12. This chapter focuses exclusively on part-based models for person detection and pose estimation. An overview of this specific area of techniques was useful.

    Humans are a very special type of deformable object in that we all share the same parts. Part-based models break down when there is intra-class variation of parts. Maybe my computer has an extended keyboard, and most others don't. Perhaps my house has a chimney or some other artifact that other houses don't. In the wild, arbitrary classes may have arbitrary additions or subtractions of parts. Perhaps a very blunt example is that this part-based modeling would surely fail on an amputee: an instance who is definitely a member of the person class. Training separate 'person' and 'amputee' detectors seems wrong on many levels. Is it possible to relax part constraints in the same manner as the occlusion modeling? How do you model the fact that some instances have additional parts that you never saw in the training data?

    Replies
    1. Intuitively, amputees seem to be the same problem as occlusions. From the front, a man without a right arm looks very similar to a man with his right arm placed behind his back. As for additional parts, do you mean "extra" parts (a woman holding a purse) or "new" parts in place of others (a hook in place of a hand)?

    2. I don't think that a person without an arm is really the same problem as a person with an occluded arm. Though they may be confusable, if the model is good enough it would be able to say what is occluding the arm. This points at the fact that it may not be possible to do pose estimation without also understanding what else is in the scene.

  13. About the discussion of 2D/3D:
    The current situation is that we have done lots of things in 2D and built many beautiful models that work reasonably well (as in people detection and some general object detection algorithms). One key common point of these models is that they are designed to be somewhat robust to noise, given the tons of data (images) we see in real life. In contrast, for 3D models, we've done very little about modelling uncertainty and rely too much on precise Euclidean geometry, which makes our models quite brittle to noise and outliers "in the wild". I think introducing more concepts from statistical learning into the 3D world might be a good idea that moves us forward.

  14. I tend to think that whether it is a 3D model or a part model, the intuition behind it is to find the best parametric model, one which preserves the most invariant information while handling deformation in an organized way.

    I saw a lot of discussion on incorporating 3D models. Actually, I'm very curious about whether 3D information and part models are really necessary. 3D models seem too explicit; they are more likely to be used when a human is elaborately creating/imagining what a pedestrian looks like, such as during drawing, sculpture, or some other CREATION PROCESS. In fact, when you look at something you see frequently, such as a pedestrian, you seldom need to painstakingly think about what a 3D person looks like and try to match it. Instead, what you frequently do is subconsciously match it with scene context and sub-category information (to handle view change and occlusion).

    For part models, the essence is to find something discriminative. The good things about parts are: 1. they have discriminative features (strong HOG contrast between parts and background); 2. they are in a sense invariant (all human parts look alike). Again, this problem goes back to finding discriminative features. Possibly there is no need to adopt a part model either, because when you naturally look at something you know well, you seldom try to parse its parts; you just match with discriminative features. It is only when you don't recognize something that you start interpreting it explicitly with high-level information. This is when a part model may play a more important role.

  15. I was surprised that the limitations of star- and tree-structured graphical models were discussed, but clique trees (junction trees) were mentioned only for the pairwise case.
