Bangpeng Yao and Li Fei-Fei, Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities, In CVPR 2010
And optionally:
Abhinav Gupta and Larry S. Davis, Objects in Action: An Approach for Combining Action Understanding and Object Perception, In CVPR 2007
Abhinav Gupta, Scott Satkin, Alexei A. Efros and M. Hebert, From 3D Scene Geometry to Human Workspace. In CVPR 2011.
-- Summary --
This paper, like last week’s, looks again at human pose estimation. However, instead of treating human pose estimation as an independent problem, the authors exploit scenarios of human-object interaction activities to provide context that helps both the tasks of human pose estimation and object detection. As the authors say, “using context is clearly a good [idea]”. Specifically, they use a graphical model structure to find relations between the Activity, Object, Human pose, and human body Parts (denoted A, O, H, P respectively). They use hill-climbing to determine the connectivity of A,O,H,P and then a max-margin learning algorithm to optimize the discriminative power of their model. The authors show positive results and considerable improvement on a sports dataset with this approach.
-- Contributions --
* As of CVPR 2010 (the paper’s publication), the authors claim novelty in utilizing context information for human pose estimation. However, they acknowledge that there has been prior interest in using context in computer vision, as we have seen in some of the other papers we have read in class.
* The max-margin learning approach they propose seems reasonable and useful. It allows some freedom for the learner to map a feature to any subclass as long as the overall class is correct. The learning problem is now easier as it doesn’t try to map many features for different poses into one larger category describing the activity.
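The "any subclass is fine as long as the overall class is correct" idea can be sketched as a hinge loss over sub-class scores. This is only an illustration under my own assumptions (the function name, the sub-class-to-class mapping, and the margin of 1 are all made up here, and at least one sub-class of another activity must exist); the paper's actual objective may well differ.

```python
import numpy as np

def subclass_hinge_loss(scores, subclass_to_class, true_class, margin=1.0):
    """Hinge loss that only penalizes sub-classes of the wrong activity.

    scores: one score per sub-class (e.g. hypothetical linear scores w_k . x).
    subclass_to_class: activity label of each sub-class (assumed mapping).
    """
    scores = np.asarray(scores, dtype=float)
    subclass_to_class = np.asarray(subclass_to_class)
    correct = subclass_to_class == true_class
    # The learner may use whichever sub-class of the true activity scores best.
    best_correct = scores[correct].max()
    # Only sub-classes of other activities can violate the margin.
    best_wrong = scores[~correct].max()
    return max(0.0, margin + best_wrong - best_correct)

# Two activities with two pose sub-classes each (toy numbers).
scores = [2.0, 0.5, 1.0, -1.0]
mapping = [0, 0, 1, 1]
print(subclass_hinge_loss(scores, mapping, true_class=0))
print(subclass_hinge_loss(scores, mapping, true_class=1))
```

Note that the low-scoring second sub-class of activity 0 costs nothing when activity 0 is the truth, which is the extra freedom described above.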
-- A Few Concerns --
* Dataset (size, type) –
The dataset they use is tiny (50 images per activity). There are some natural biases (they keep mentioning that the right arm correlates with the object, but what about left-handed people?!). Their dataset always seems to have the object in interaction with the person. Can their system still gracefully detect a volleyball player or the activity when the player isn’t near the ball? The work they compare to for activity recognition by Gupta et al. likely won’t have this problem. This is as much dataset bias as they claim that the other approach has.
* Hill climbing initialization – The authors note that of the three initializations to their hill-climbing algorithm, they seed one with a manually labeled (human-specified) connectivity (see Appendix/Section A). How often was the best result coming from this seed as opposed to the random seeds? This somewhat leads to my next point…
* Why binary connections between nodes?
It isn’t obvious why the connections should be binary instead of some measure of affinity. Possibly this yields better computational time, but I don’t think that’s the reasoning. It would have been good to see some comparisons to show that the hill-climbing to prune connections bought them anything.
* Parameter estimation -
There is a parameter estimation for the discriminative weights. Why is this not used in the cost function for the hill climbing? My guess here is that hill climbing is better for human pose estimation, whereas the discriminative learner is trying more to detect activities.
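To make the questions about seeding and binary connectivity concrete, here is a minimal sketch of hill climbing over binary edge sets, run from several named seeds so that the winning seed can be reported. Everything in it is an assumption for illustration: the function names, the four-node toy graph, and especially the made-up score, which is not the paper's objective.

```python
import random

def hill_climb(init_edges, candidate_edges, score):
    """Greedy search over binary edge sets: toggle one edge at a time
    and keep the change whenever the score strictly improves."""
    current = frozenset(init_edges)
    current_score = score(current)
    improved = True
    while improved:
        improved = False
        for edge in candidate_edges:
            neighbor = current ^ {edge}  # add the edge if absent, drop it if present
            s = score(neighbor)
            if s > current_score:
                current, current_score, improved = neighbor, s, True
    return current, current_score

def best_of_seeds(seeds, candidate_edges, score):
    """Run hill climbing from every named seed and report which one wins."""
    results = {name: hill_climb(init, candidate_edges, score)
               for name, init in seeds.items()}
    winner = max(results, key=lambda name: results[name][1])
    return winner, results

# Toy setup (hypothetical): four nodes and a made-up score that rewards
# agreement with a hidden "true" structure.
nodes = ["A", "O", "H", "P"]
candidates = [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]]
target = frozenset({("A", "O"), ("A", "H"), ("O", "H"), ("H", "P")})

def toy_score(edges):
    return -len(frozenset(edges) ^ target)

rng = random.Random(0)
seeds = {
    "manual": [("A", "H"), ("H", "P")],
    "random_1": [e for e in candidates if rng.random() < 0.5],
    "random_2": [e for e in candidates if rng.random() < 0.5],
}
winner, results = best_of_seeds(seeds, candidates, toy_score)
for name, (edges, value) in results.items():
    print(name, sorted(edges), value)
```

With this separable toy score every seed reaches the same optimum, so the seed does not matter. On a real score with local optima, the per-seed table printed at the end is exactly the comparison I would have liked to see in the paper.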
-- Overall Thoughts --
* The paper is well written and well thought out. The approach is clearly explained and the motivation is obvious. The results presented are good.
* It would be nicer to see more detail of what actually worked. Was the hill climbing really necessary? How much improvement did it bring? I liked that they compared 1 human pose per class to having many. That was a good way to show that having multiple poses doubled the improvement amount.
The manually specified model is indeed suspicious, especially since it probably works very well for their sports dataset. Its very existence seems to suggest that simply using three random models was not producing very good results...
On the other hand, I think the binary connections between parts are sensible. It's equivalent to enforcing sparsity on the graphical model.
I'm a little upset that all this manual stuff was hidden in the appendix. The paper proper states that they "randomly initialized the structure for [sic] three times.." Having two randomly initialized structures, and one manual one does *not* mean that it was randomly initialized three times. I feel like the use of manual initialization was purposefully hidden and I really object to authors misrepresenting their work.
In this paper, the authors learn a graphical model which takes into account activity, human pose, and the presence of objects to jointly detect human activities, poses, and objects in the scene. Assuming that the human is performing a task with an object, and that the human is touching the object, we can make very strong priors on object location and human pose. The authors use sports as their primary driving example. They provide very compelling performance improvements on a sports database in both detecting the objects of interest, and the pose of the human. Clusters of human poses are detected in either an unsupervised manner, or with human labeling and the most likely class is chosen as context. The pose models are generated via hill climbing.
Pros
I think it's definitely true that "context" can help in detecting human and object poses, and that their definition of "context" is much closer to what we intuitively experience as context than the previous literature we looked at. On their specific examples of a human in a canonical pose touching a single object, their method seems to work well.
Cons
It seems like a "just so story." By restricting themselves artificially to images of humans engaged in sports, holding a single object, centered on the frame, they make the problem almost too easy. Then they compare themselves to a more general solution, and (surprise!) they do better on their specific subset of images. I feel like their approach is more generally applicable, but it's hard to take them seriously when they use so few images, and such a small set of potential activities.
-- Matt Klingensmith
The authors want to know whether context helps if context is present. They don't want to discuss the problem of whether context helps if context is absent or how often context is present at all. To make sure that they are getting an answer to their question they need to control the data. I think that sports images are a very good choice as they offer a large collection of human poses and a lot of self-occlusion while satisfying their requirement of having context (in this case via object interaction).
I think that by working with natural images we are unable to answer such scientific questions correctly. In particular, if our claim is "if A then B" and we work with natural images for which we don't know whether A is true or false, we essentially don't get anywhere.
I agree that this method might not work for natural images because it is less structured than the scenario that the authors have considered. However, it comes back to the point of discussion that we had at the end of the last lecture (before the guest lecture). Is our world structured enough to be modeled this way? If the AND-OR graph kind of method can be modeled to any real world scenario, then this method should also be feasible because they are very similar in the basic assumptions that "if A happens, then either B happens or C happens" or something like that. I guess the real question to be answered is how much can we put on paper about the structure of the world.
I think that the difference is that though the world may be structured, how much of that structure is actually captured in images? How much of it is captured regularly enough that we can make a generic algorithm relying on it?
I would guess that images in the wild are so grossly different that there is no regular context structure for most of them. But we can't know for sure until we construct the "perfect" dataset.
The authors don't seem to explicitly discuss failure cases, but there are some funky results in Figure 9, such as the golfer with an arm detected as a leg, or the bowler with strange leg positions for the occluded leg. Though occlusion is mentioned as a challenge in HOI detection, it doesn't seem like their model covers that case, so these results always give "hallucinated" limbs..
The algorithm learns multiple graphical model structures to handle the wide variation in human pose and human object interaction. This is really cool as now there is no need to make strong statements about conditional independence for the entire dataset but only for subparts of it (can I still call this graphical models?). Even cooler is that they are able to learn these from the data using a structure learning method (though some people are complaining about the initialization for hill climbing, but I think it's learnable).
Table 1 shows that learning multiple poses per class helps. Nice to see that complicated algorithms do indeed work in practice and outperform everything else.
I would not go so far.. Yes, the complicated algorithm did work on the dataset they tried it on and outperformed everything else by a margin.. But there are some concerns regarding the dataset itself..
Just to be clear on my opinion (that I'm not totally against it), of course it is possible that it works equally well on a larger and more challenging dataset; and I agree that they tried on the standard dataset at that time.. :)
When I was reading the paper, I had a feeling that the problem is to recognize a part-based model of 11 interconnected parts (10 human parts + 1 object). Plus, the model has global mixtures (for different human poses) and local mixtures (different object). I'm wondering, if it has ever been formulated that way?..
I love that idea! Objects can be just another part in their or any other baseline method.. I would like to see that.. That would actually inform us whether it's the object context that is helping or it's their new learning/inference method which can capture slightly more complicated articulations..
This reminds me of the connection between this and the work on "visual phrases". It would also be enlightening to see a comparison with learning a template for the whole phrase (human parts + object) with a much simpler learning paradigm.
This paper gives interesting results for the task of detecting single objects, human pose, and sport activity recognition. They utilize the fact that humans doing sporting activities are often interacting with objects, and that incorporating shared information between the two tasks improves them both. Their experiments seem geared to proving that context is quite useful when incorporated correctly.
Their displayed results looked good, but I would've liked to have seen many more. What concerns me about these types of approaches is that they ignore all other context in the scene. Show me an image with a hardwood floor and tell me to classify it as one of many sport scenes, and I'm pretty likely to pick basketball or volleyball. It seems like there is a wealth of other useful contextual information that is essentially ignored in the approach to many highly specific tasks. I would like to see more 'interdisciplinary' computer vision approaches...
The authors mention this in section 6.4, where they say that not using the background context allows them to generalize better. Their point is that for this particular type of pose and object recognition (sports) most datasets would be heavily biased because all recorded events of sports take place in highly distinguishable courts/fields and the algorithm might learn that instead of the actual activity.
I think their statement is true to a certain extent, after all, it is considered quite normal to play cricket in the middle of the street in India (though it would look horribly out of context anywhere else in the world).
- The paper proposed a new model for dealing with joint object detection and human-pose estimation. Overall, (to me) the paper was quite intuitively explained. They explain their complicated model in a nice way, postponing the details to the appendix for interested readers. I also like their way of dealing with articulated poses (more on this later).
- I don't agree with the authors that object detection and human pose estimation are unrelated problems (but that's just me).
- After reading the paper, I couldn't decide whether it's the context (object-HOI) helping, or whether it's the new way of learning the articulated model. Esp. see Fig.4 (2,2), cricket bowling second example. The ball is never in hands in that particular pose. It's a pose after throwing the ball, but still the authors detect ball in hand. I think it's not the ball context that's helping here at all. It's just an articulated model, with an additional part (as Evgeny pointed out) and a new learning model.. Given this doubt, one thing to try might be treating the object as an extension of the human (i.e., another human part), and just learning the articulated model with n+1 parts. All the baselines get n+1 parts as well. This might change the learning framework slightly for this paper, but I think it would give another compelling way to learn these complicated articulated models..... And if it works (as I doubt it would), we have another way of training DPM/FMP models, which can capture more complicated structures because of their framework...
Overall, I liked the paper. If they had released their code, I would have liked to try a few variants of this model and figure out what's making it work.. Whether it's the special treatment of objects (for context) or whether it's the new structured learning framework..
Re pt 2: They aren't unrelated, but they're usually phrased differently. Each human body part is like an object detector, but the pose estimation is a higher-level idea than just detection. In that sense, I think they are different problems.
However, the point that the object is effectively an extended body part, I agree with. If you look at the graphical model (and at their hill climbing initialization), it is effectively like tacking on a new body part.
I think it's an interesting idea to solve two "hard" problems simultaneously, which makes solving both problems a little bit easier. We saw this idea previously with the paper that correlated object locations and scene geometry. Maybe this is the way that vision needs to proceed. For a given image, we try to extract as much information as we can (detect possible objects, recognize action, estimate scene label, estimate planar geometries, etc). Even if we could only do each task with okay-ish accuracy, we might be able to aggregate and refine the labels based on the combined results. I think this is really what this paper is trying to get at. We know the person is playing cricket (scene context). From his pose, we estimate he is a bowler. We know from prior knowledge that there is often a ball in the bowler’s hand (even though we cannot see it/it is already thrown), and it’s more likely there is a ball since he is making a throwing motion.
With all the talk in the introduction about how other context methods only get 4-5% improvement, I was really expecting to see a much bigger improvement overall. I feel like they really oversold their method in the beginning.
I agree with the competitive results shown on the sports dataset. However, I think the human figures in the sports dataset are relatively less occluded, and the background is much simpler, compared to, let's say, an office setting where a human may be only partially visible while sitting behind a desk in a much more complex environment. Also, the object in the sports dataset is singular, compared to the indoor affordance problem where the subject is usually interacting with multiple objects, with much more severe occlusion and so forth. I would like to see this "mutually beneficial" algorithm work in more complicated environments.
I think solving two problems simultaneously is a recurring trick in computer vision. This is exactly the chicken-and-egg scenario. The same thing shows up in optimization, where it corresponds to block coordinate descent. In general, the two different worlds regularize each other and produce much better results.
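For anyone who hasn't seen the optimization analogy spelled out, here is a tiny sketch of alternating minimization over two blocks. The quadratic objective is a toy I made up; think of x as standing in for "pose" and y for "object", each updated while the other is held fixed. It is not the paper's inference procedure.

```python
def block_coordinate_descent(x=0.0, y=0.0, iters=50):
    """Minimize f(x, y) = (x - 1)^2 + (y + 2)^2 + x*y by alternating
    exact minimization over x (y fixed) and over y (x fixed)."""
    for _ in range(iters):
        x = 1.0 - y / 2.0   # solves df/dx = 2(x - 1) + y = 0
        y = -2.0 - x / 2.0  # solves df/dy = 2(y + 2) + x = 0
    return x, y

x_star, y_star = block_coordinate_descent()
print(x_star, y_star)
```

Setting both partial derivatives to zero gives the joint minimizer x = 8/3, y = -10/3, and the alternating updates converge to it because this toy objective is strictly convex. The coupling term x*y is what lets each block inform the other.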
I like this paper because of the ideas it uses (HOI+context) and the way it captures them.
I particularly like the high-order co-occurrence statistics being used. I like the overall flow of the learning framework. Let's try to solve a hard problem. Oh wait, we can't. Let's just keep splitting it into smaller subproblems till we can.
My concern with this is that the authors fail to go into more details. Firstly, the stopping criterion (3 images) seems a bit too extreme for me in the light of the fact that the dataset is tiny.
Also, as Arun mentions, the hill-climbing with multiple restarts makes me uncomfortable. I am all for simplicity but this falls into the category of problems which you formulate very well and elegantly, but use a hacky way to solve. Not an expert on this, and in fact I have known ICM with restarts to work as well as TRW on graphical models in practice.
Overall, I think this is a great paper if one looks at the ideas. Getting into the details dampens my excitement.
You generally do random restarts with hill climbing. As they mention, it is only looking for local optima. Generally, you do random restarts to get around this. The worrisome part is the heuristic restart where they seed it with a pose that they determined. Again, this isn't so bad from an engineering point of view if you have a strong belief in your heuristic. But yes, it is a hack.
I agree with most of the points that people brought up above. It is indeed a pretty cool way to think about how the problems of object detection and human pose estimation can add information to each other and hence perform better on both these problems. However, I do have big concerns with one of their statements that this method is "less data set dependent" when compared to Gupta et al. I do agree the authors focus on the core human-object interaction in this paper, whereas the other paper used more of the background context. But as Arun had mentioned earlier, the first concern is that the object is always in contact with the human. The second concern is that most of the images that they have shown (at least in the paper) look like they are all centered in the scene and take up most of the image without a lot of background information (ideal images). Just cutting off most of the background and claiming that the method is less data set dependent does not make any sense, at least to me. Both background context and human-object interaction should be addressed together and that will probably give us much better results in a general scenario.
I agree, my immediate thought when looking at figure 4 was "there isn't much variation in this dataset." As much as I agree with the general philosophy of the paper, they haven't shown it on anything more than 'toy' data.
For me, I don't like the context thing in general because it can actually be modelled by blurring the boundary of objects. The ball should occur with some players who are in the kicking pose, and some poses can actually be distinguished by the objects that are being manipulated. They are born together, no context is needed because they are the context.
I think context only helps in some specific situations. If the scenario is quite clean without much interaction, pure object detectors or pure pose estimators work quite well. In this case there is no need for context. However, in some difficult scenarios, for example where blur is present, we need the context information to help inference. That's why they restrict the problem domain here to human-object interaction activities with the relevant object small or only partially visible and the human body parts self-occluded. The key issue is how to tell the two situations apart in advance. An alternative is to choose a two-phase processing pipeline.
This paper attempts to tie human pose estimation with object detection by using one as a context for the other, which is interesting. However, it does seem a little odd that detecting a small object (like a ball) can be used as "context" to find the pose of a human in complete detail. I think it is more likely that each object would improve detection for a particular (set of) human part(s), and it would have been nice to see which human part detections are improved in each of these activities that correspond to interaction with a different object in each case. Is it just that the detection of the ball in the bowling pose directly affects localization of the hand and torso, which in turn allows the legs to be found a little more reliably? From the results section, we can see that the forearm detection accuracy shows a 200% increase over the pictorial structures method - which makes sense considering that most of these objects interact heavily with the forearm, the detection of which seems to be the most difficult, in general.
The results in this paper look impressive, though one concern as others mentioned previously is the size of the dataset and the relatively small number of activities.
The background section mentions that others have not even tried using pose for HOI activity classification, so one wonders how much of their improvement over the state of the art comes from the poses rather than the full model.
Agree, also I am wondering how the human pose and object detection are estimated simultaneously. It looks to me more like a chicken-and-egg problem, as they showed that human pose helped to detect the object while the object also helped to detect human pose. Where do they start? Did they simply randomly hypothesize the objects of interest?
I think this paper presented an interesting idea - as Priya mentioned, solving two "hard" problems simultaneously, making them each a little easier. One concern I have is the generalizability of this algorithm. The dataset seems quite particular, and the training images require hand labeled objects and body parts. How will this possibly generalize to humans "in the wild" interacting with a huge variety of objects - coffee mugs, computers, etc.? I just don't see how this work can be extended further, besides just adding in more object/pose pairs.
I like the idea proposed in this paper to exploit the mutual relationship between objects and human poses. The performance seems to be promising (even though there might be some problems with the data they use, as mentioned by some other people) on sports data.
However, to be honest, I'm a little bit suspicious about structure learning, especially inferring a non-tree structure from data. There is no analysis of how good the local optima are, or of the variance of the structures learned from different initial points. The philosophy behind this is: the graphical model is a tool for humans to impose prior knowledge instead of "inferring priors" from our data.
My opinion about this paper agrees with most of the points brought up here. At first glance, I think the philosophy behind this paper is reasonable and intuitive. This is one of the pros of this paper.
The second thing I like in this paper specifically lies in the way they treat the weakest model. They cluster the samples to split them and individually learn a connectivity pattern for each sub-category. This makes sense here. In a way they are actually trying to find discriminative patterns. I think this is one of the important things that has boosted their performance.
In addition, the paper's description and presentation are clear to read.
But on second thought, I immediately realized that the dataset is highly biased. I tend to think it is the "context of sports" that is also helping their method a lot. Generalization ability of this method is a big problem. In the real world, the context between human and object is far less structured than in the sports scenario. Moreover, there would be so many different configurations that their method may run super slow or directly fail.