Guest lecture from Leonid Sigal--but you still have to read/summarize/post to the blog as usual.
And optionally:
Y. Yang, D. Ramanan. Articulated Human Detection with Flexible Mixtures of Parts, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI).
M. Andriluka, S. Roth, B. Schiele, Pictorial Structures Revisited: People Detection and Articulated Pose Estimation, CVPR, 2009.
This reading differs from the others in that it is a book chapter rather than an academic paper. As such, it presents much more of an *overview* of existing person-tracking techniques rather than going into depth about any particular one. The author provides a nice introduction to various part-based techniques, from the standard DPM and its variants to pictorial models, tree-based and star-based models, and mixtures of Gaussians and trees. We also see a basic overview of descriptors and model-learning techniques.
The author gives several driving reasons for performing part-based human pose estimation (detection, regression, tracking, and inference) and reviews common techniques for each. He discusses the strengths and weaknesses of local descriptors based on color and texture, and of different part models.
From this book chapter, we might conclude that the state-of-the-art techniques for human pose estimation and tracking are variations on a theme: 2D pictorial structures consisting of linked-together 2D detectors. Learning consists of discovering a model for each part, and then estimating the parameters of the model as a whole. In the application of tracking, the model's *configuration* is estimated from prior knowledge and time constraints, usually with a particle filter.
Though state-of-the-art results are impressive, I feel that they lack some important information. Each model is a vast simplification of the human form, and as such has to rely on sketchy post-processing to deal with the model's limitations. Humans are not simple 2D pictorial tree structures linked by springs. Rather, they are 3D kinematic chains composed of deformable 3D links and hundreds of 3D joints. Other models, which assume 3D data and use simpler inference algorithms, seem to do better than the best state-of-the-art 2D human detectors and trackers (see "Efficient Human Pose Estimation from Single Depth Images," Jamie Shotton et al.). This seems to suggest that something fundamental is missing from such 2D models.
-- Matt Klingensmith
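As a rough illustration of the "parts linked by springs" picture in the comment above, here is a minimal sketch of how a tree-structured 2D pictorial structure might score one configuration: the sum of per-part appearance scores minus a quadratic spring penalty on each parent-child edge. All names (`pictorial_structure_score`, `rest_offsets`, `stiffness`) are hypothetical and chosen for illustration; real models typically also handle orientation and scale and learn the deformation parameters per edge.

```python
def pictorial_structure_score(positions, unary, edges, rest_offsets, stiffness=1.0):
    """Score one configuration of a tree-structured 2D part model.

    positions    : dict part -> (x, y) placement of each part
    unary        : dict part -> appearance score of that part at its placement
    edges        : list of (parent, child) pairs forming a tree
    rest_offsets : dict (parent, child) -> (dx, dy) preferred child offset
    stiffness    : assumed spring constant, shared by all edges for simplicity
    """
    # Appearance term: how well each part's detector fires at its placement.
    score = sum(unary[part] for part in positions)

    # Deformation term: quadratic "spring" penalty on each tree edge.
    for parent, child in edges:
        px, py = positions[parent]
        cx, cy = positions[child]
        rx, ry = rest_offsets[(parent, child)]
        dx = (cx - px) - rx
        dy = (cy - py) - ry
        score -= stiffness * (dx * dx + dy * dy)
    return score


# Toy example: a head with a torso that prefers to sit 10 pixels below it.
edges = [("head", "torso")]
rest_offsets = {("head", "torso"): (0, 10)}
unary = {"head": 2.0, "torso": 1.0}

at_rest = pictorial_structure_score(
    {"head": (0, 0), "torso": (0, 10)}, unary, edges, rest_offsets)
stretched = pictorial_structure_score(
    {"head": (0, 0), "torso": (1, 10)}, unary, edges, rest_offsets)
```

In this toy setup the configuration at the preferred offset scores higher than the stretched one, which is the sense in which the springs encode a prior over pose.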
I agree that having 3D models would be the ideal thing to do. But the problem of getting depth information from a single image is the hard part. In the paper you mentioned, they consider depth images, which means they already know some prior 3D information which makes the problem "slightly" easier. With just 2D information, I think the state-of-the-art results are very impressive and this book chapter has provided insight into a lot of existing algorithms for human pose estimation.
I agree that getting depth information from just the image would be impossible, but there could be an approach that tries to incorporate a 3D CAD model of a person. For example, we saw the 3D DPM model earlier used for car detection. A similar approach for people might help bridge the gap between the 2D images that we have and the 3D reality.
I think that starting with 2D person detection and pose estimation makes sense. Firstly, most of the data we had (or have) is still 2D images, and 2D to 3D conversion techniques are not as reliable for every type of scene. Secondly, it helps us understand the strengths and weaknesses of doing things in 2D, which will eventually help (as it did) to incorporate 3D information/constraints, predict 3D from 2D and do stuff in 3D...
So yes, humans are not 2D pictorial tree structures linked by springs, but that's how they appear in our 2D images. So I think everything from dealing with humans in 2D to incorporating and/or inferring 3D information is a valid problem. I agree that most of the work has been in just 2D, but since the advent of consumer devices like the Kinect, we are seeing an increase in 3D techniques as well :)
On the other hand, one good example of going from 2D key-points to 3D landmarks is Varun's paper (http://www.cs.cmu.edu/~vramakri/academic/ECCV2012.html). I think it might make you happy :)
I agree with Priya's idea of using a 3D model. I understand that 3D from 2D is unreliable and hard, but why not incorporate 3D into the part model and project the model into 2D - sort of like the paper we saw with the car and the...landmarks? (Headlights, wheels, etc.). This way, different viewpoints don't need to be represented with completely different models - less stuff to train, no? Is there any argument for why representing different viewpoints with different models is a good way to go?
And oops, after reading Priya's post again more carefully, I realize that I said just about the exact same thing. My bad...
I think we need to combine 3D and 2D information. If we only use 3D points, we may have problems when a person is holding something.
I still believe that 2D data is enough if we have a lot of it organized in a structured way. 3D information may help for sure, and it has its practical uses. However, taking advantage of it is not that useful for understanding the latent variables that the human brain is able to produce from the 2D world (which we then call 3D). Dogs may not have language, but they can still behave properly and adapt to the 3D world; and it is believed that ants only have a sense of a 2D world, yet they still do pretty well in terms of sensing.
I agree with Priya in that incorporating 3D seems the right way to go. Looking at Fig. 4 gives me the idea that one could run a 2D DPM and then fit a 3D pose model on top of it, possibly then refining the 2D model. Or at least the spring energies of the 2D DPM could be derived from a 3D model, rather than just being Gaussians.
I think incorporating 3D information, or combining 2D and 3D, is absolutely important and is the future direction for human detection and pose estimation. However, at the current point, focusing purely on 2D will help us fully understand the problem: the overall pipeline, how far we can go in 2D, its limitations, and in what sense 3D will break through the bottleneck of 2D. This is better than directly imposing preliminary 3D constraints simply to enhance detection and estimation performance.
I really like the appearance/color modeling idea presented in this chapter. I feel like this approach is the opposite of most of vision, where we tend to focus more on image intensity and gradients. While I think that modeling the color symmetry between arms and legs is a good idea, I'm not sure it will work in practice. For example, if we try to match head color with arm color, what would happen if the person was wearing sunglasses and a hat, and we got some very weird head appearance? I think that beards would be another thing that would throw off such an algorithm. I would guess that an approach that tries to model color relationships between parts would just end up overfitting to the norm.
Would this be a good way to proceed for future algorithms?
The color/appearance model idea has been proposed in the "tracking by model building" section too. But here, if we have a high enough frame rate then the appearance in frame t and frame t+1 might be similar enough for this idea to work across frames.
But yes, within a single frame it seems that a pairwise local symmetry is not expressive enough to capture the color patterns in natural data.
The chapter for this lecture's reading gives a comprehensive account of using part-based models for human activity recognition, building on the DPM algorithm. One thing that I found rather depressing was the detection-based tracking paradigm (page 20) that the chapter reports as performing surprisingly well. This is a typical case of the template-update problem where the update happens with every frame and the contribution of the tracking component is zero, except maybe for posterior smoothing.
Performing an independent detection every frame sounds extremely wasteful, though it might work well in certain cases. I think it points to our lack of understanding of kinematic reasoning for tracking humans in 2D images, which shouldn't be a reason to abandon it altogether. For things like human action/activity/event forecasting, it might be essential.
I remember Martial saying that Kinect uses frame-by-frame detection instead of tracking (though the situation is different here since the picture has depth). There was some research on pros and cons of constant re-detection vs tracking.
I had some experience with tracking a driver's head and eyes in a course project. I can say that detection (in several implementations) took way more time than tracking (also in several implementations). It wasn't possible to do only constant re-detection in real time on a laptop. So tracking may not be so good but it's still useful :)
Tracking is useful if you can constrain your domain and leverage domain-specific knowledge. For example, it is reasonably successful in surveillance because you can build a nice background model (by background subtraction).
For the Kinect example, I highly recommend Fitzgibbon's lecture. He clearly shows that even if tracking were "close to perfect", a "tracking based Kinect" would be a disaster.
One of the things I found missing in this chapter was an explanation as to what model works best in what situation. Usually, one gets a feel for things by looking at the dataset presented with the paper. In this particular case, I'd like to know more about why tracking is not so useful.
I agree that it seems wasteful to refit a model on each frame, but it could be that the only solutions that work for complex environments are things like particle filters, which add so much overhead at the model-fitting level to propagate each particle that you may as well refit at each frame and instead treat tracking as an independent, high-level process.
From a state estimation perspective, detection-based tracking could be justified because it better approximates the independent observation assumption made for various filtering techniques, like the Kalman filter mentioned in that section.
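To make the filtering view concrete, here is a minimal sketch of detection-based tracking in one dimension: each frame's detection is treated as an independent noisy observation and fused with a scalar Kalman filter. The random-walk motion model, the function name `kalman_filter_detections`, and the variance values are illustrative assumptions, not anything taken from the chapter.

```python
def kalman_filter_detections(detections, process_var=1.0, measurement_var=4.0):
    """Fuse 1D per-frame detections with a scalar random-walk Kalman filter.

    detections      : list of floats, one detected position per frame
    process_var     : assumed variance of frame-to-frame motion
    measurement_var : assumed variance of the detector's position error
    """
    # Initialise from the first detection with the detector's uncertainty.
    estimate = float(detections[0])
    variance = measurement_var
    filtered = [estimate]

    for z in detections[1:]:
        # Predict: under a random walk the mean is unchanged
        # and the uncertainty grows by the process variance.
        variance += process_var

        # Update: the new detection is treated as an independent observation.
        gain = variance / (variance + measurement_var)
        estimate += gain * (z - estimate)
        variance *= (1.0 - gain)
        filtered.append(estimate)
    return filtered


steady = kalman_filter_detections([5.0, 5.0, 5.0])
jump = kalman_filter_detections([0.0, 10.0])
```

The filter only smooths the detector's output; if the detections were instead produced by a tracker that reuses the previous estimate, their errors would be correlated across frames and the independence assumption in the update step would be violated.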
I agree with everyone's summary that the chapter is a well-written, comprehensive account of the work done in this field. It nicely connects different methods, explaining their intuition, similarities and differences. Even the math has been simplified so that a typical student (ML101 background) can understand most of it at a high level and, if needed, go to the individual papers for details.
However, I think the "Discussions and Open Questions" section should have discussed more things. :)
I like this book chapter because of its clear flow and presentation style.
I agree with Srivatsan that it is depressing to see that removing temporal links (and just literally "tracking by detection") performs well. One would assume that temporal consistency would help. I think the main problem is that current motion models fail in cluttered backgrounds. The motion models (like KL/LDOF) do not have the notion of an object, while "tracking by detection" has no notion of time. It is difficult to get these two to talk to each other.
An even more depressing result is from Deva's recent paper (http://www.ics.uci.edu/~jsupanci/SPLTT.pdf), which shows that temporal consistency improves the F1 score only from 89% to 91%.
Summary: the better your detector, the less important the motion model is.
If detection is perfect then temporal consistency would come automatically. But in the presence of noise there is no perfect detection, and therefore the importance of temporal consistency is nonzero. Another reason to motivate temporal consistency is that detection in every frame is computationally expensive, and temporal consistency might help prune the search space. Furthermore, independent detection results from multiple frames cannot be merged easily in crowded scenarios such as groups of marching soldiers. Sometimes objects are entirely occluded, and temporal information along with motion models helps us know that they are still in the scene even though they are not visible.
I agree with Aravindh. The data association problem for identical objects is impossible without some temporal reasoning.
I thought the chapter was generally well-written although I was disappointed and distracted by all the typos. I expect book chapters to be a little more carefully edited.
I feel like I was able to get the general gist of how everything worked. However, I've forgotten a lot of details regarding graphical models, which made it difficult to understand all the message passing and inference stuff.
It seems like almost all the part-based methods described by this chapter used limbs as parts. Is this true, or did the authors just name them after limbs as examples? I like the idea of using lots of parts that were not necessarily tied to specific limbs, and the method of the algorithm we read on Tuesday where the algorithm learned the parts itself. I feel like we can use human intuition to get hints for what might be useful to machines, but that in the end, the machines should figure it out, or we should at least keep in mind that what works for humans may not be best for machines.
I just keep thinking about sewing machines and how people tried to make the needle go over and under for a really long time before they realized that it's easier to make a machine where a needle goes up and down and a thread is put through the loop on the other side - a totally different type of stitch than what a human would ever use. (I'm mostly sure I'm not making this story up.)
Sorry, I digress...
I wish the authors had motivated parts a little more. It seems like they just jumped into "parts are commonly used for pose estimation." I wish this had been compared to and contrasted with other extensions to simple templates. I know we've talked about a lot of this in class already, but I'm starting to have a hard time keeping everything straight.
This is really interesting. I can't - with my human bias - imagine trying to model a tree-based structure with the lower arm as a root. Modeling the head as a root (in tree-shaped models) and the rest of the body as parts has a good non-human justification though: heads are easier to detect. They are hardly ever self-occluded, and there is less variability in appearance (heads are usually not clothed). When we look at examples where pose estimation has failed, the placement of heads is usually still correct. But, you're right. It'll be fun to look at what components of a human body should be parts and how they should be connected to each other in a way that minimizes human input.
I really like this book chapter because it gives detailed intuition for the state-of-the-art algorithms for human pose estimation. However, I had concerns with the color modeling part. As the authors mention in the Discussions section, in the real world it is very hard to use color to model humans because people might be wearing different colored clothing. I think something like a color-free shape representation would be the ideal thing to do. Maybe texture is important in cases when we have similar looking objects and we want to classify between them. But I feel color should not be the main component of a model. With reference to Priya's point about beards, there are algorithms that can detect facial hair and do funky stuff (http://repository.cmu.edu/cgi/viewcontent.cgi?article=1140&context=robotics). This works because once we know there is a face, we know where to look for a beard/mustache and what colors to look for (in general). But this is not the case with clothes. With such a huge number of varieties and colors of clothes, it just becomes impossible to get a model that can handle everything (if the model is based on colors for the most part).
I really like this book chapter because it is really comprehensive and covers recent research on human pose estimation in detail. The outline of this chapter is really clear and I like the way it gives a road map of this research area. This chapter compares different algorithms and describes the improvements and limitations of each. Generally speaking, this chapter is really well organized.
However, I'm curious about whether those algorithms would be suitable for a state-of-the-art depth sensor such as the Kinect. In other words, will those algorithms benefit from depth information? Another question is how well those algorithms will perform if the person is holding something like a bat. When I tested the 3D joint positions given by the Kinect SDK, I found that it recognizes the end of the bat as the human hand.
I really want to see some connection between human pose estimation and action recognition. According to the discussion from last lecture, it seems that 1. the definition of action itself is very blurred, and 2. the part of the action recognition pipeline that works best still lies in the non-motion part (so t is not that special). Then there is the whole field of human pose estimation, which is able to estimate actions from still images (in this case, \delta t goes to zero), and as pointed out, these methods work pretty well. These two phenomena might indicate that 1. some "key frames" in a sequence of images can stand out as discriminative features for action recognition, and 2. part-based models are really good at capturing the underlying motion system that humans have.
The book chapter discusses part-based methods. The underlying theme is that parts placed at neighboring (x_i, y_i, \theta_i, s_i) can use the same template, and a mixture of parts can be used to deal with dissimilar part poses. This representation has the benefit of having fewer training parameters at the expense of computational complexity (more mixture components to convolve with the image), but is only approximate - the parts do change a bit with small changes in pose. While more elaborate deformation models would capture these variations, why don't they give state-of-the-art results? Are they unable to work on real images?
This chapter focuses exclusively on part-based models for person detection and pose estimation. An overview of this specific area of techniques was useful.
Humans are a very special type of non-deformable object in that we all share the same parts. Parts-based models break down when there is intra-class variation of parts. Maybe my computer has an extended keyboard, and most others don't. Perhaps my house has a chimney or some other artifact that other houses don't. In the wild, arbitrary classes may have arbitrary additions or subtractions of parts. Perhaps a very blunt example is that this parts-based modeling would surely fail on an amputee: an instance who is definitely a member of the person class. Training a separate 'person' and 'amputee' detector seems wrong on many levels. Is it possible to relax part constraints in the same manner as the occlusion modeling? How do you model the fact that some instances have additional parts that you never saw during training?
Intuitively, amputees seem to be the same problem as occlusions. From the front, a man without a right arm looks very similar to a man with a right arm placed behind his back. As for additional parts, do you mean "extra" parts (a woman holding a purse) or "new" parts in place of others (a hook in place of a hand)?
I don't think that a person without an arm is really the same problem as a person with an occluded arm. Though they may be confusable, if the model is good enough it would be able to say what is occluding the arm. This points at the fact that it may not be possible to do pose estimation without also understanding what else is in the scene.
About the discussion of 2D/3D:
The current situation is this: we have done lots of things in 2D and built many beautiful models that work reasonably well (like in people detection and some general object detection algos). One key common point for these models is that they are designed to be somewhat robust to noise, given the tons of data (images) we see in real life. In contrast, for 3D models, we've done very little about modelling uncertainty and rely too much on precise Euclidean geometry, which makes our models quite brittle to noise and outliers "in the wild". I think introducing more concepts from statistical learning into the 3D world might be a good idea that moves us forward.
I tend to think that, whether we use 3D or parts models, the underlying intuition is to mine the best parametric model, one that preserves the most invariant information while handling deformation in an organized way.
I saw a lot of discussion on incorporating 3D models. Actually I'm very curious about whether 3D information and parts models are really necessary. 3D models seem to be too explicit; they are more likely to be used when a human is elaborately creating/imagining what a pedestrian looks like, such as during drawing, sculpture, or some other CREATION PROCESS. In fact, when you look at something that you see frequently, such as a pedestrian, you seldom need to painstakingly think about what a 3D person looks like and try to match it. Instead, what you frequently do is subconsciously match it with scene context and sub-category information (to handle view change, occlusion).
For parts models, the essence is to find something that is discriminative. The good things about parts are: 1. They have discriminative features (strong HOG between parts and background). 2. They are in a sense invariant (all human parts look alike). Again this problem goes back to finding discriminative features. Possibly, there is no need to adopt parts models either, because when you naturally look at something you know well, you seldom try to parse its parts. You just do matching with discriminative features. It is only when you don't recognize something that you start interpreting it explicitly with high-level information. This is when parts models may play a more important role.
I was surprised that the limitations of star and tree structured graphical models were discussed, but clique trees (junction trees) were mentioned only for the pairwise case.