Thursday, October 24, 2013

Reading for 10/29

Bangpeng Yao and Li Fei-Fei, Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities. In CVPR 2010.

And optionally:

Abhinav Gupta, Scott Satkin, Alexei A. Efros and M. Hebert, From 3D Scene Geometry to Human Workspace. In CVPR 2011.

32 comments:

  1. -- Summary --
    This paper, like last week’s, looks again at human pose estimation. However, instead of treating human pose estimation as an independent problem, the authors exploit human-object interaction activities to provide context that helps both human pose estimation and object detection. As the authors say, “using context is clearly a good [idea]”. Specifically, they use a graphical model to capture relations between the Activity, Object, Human pose, and human body Parts (denoted A, O, H, P respectively). They use hill climbing to determine the connectivity of A, O, H, and P, and then a max-margin learning algorithm to optimize the discriminative power of their model. The authors show positive results and considerable improvement on a sports dataset with this approach.
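    For concreteness, here is a minimal toy sketch (my own Python, not the authors' code, with P omitted and random numbers standing in for real detector scores) of what the joint score over A, O, and H looks like once the connectivity is fixed:

    ```python
    import numpy as np

    # Toy stand-in for the A/O/H scoring idea: sum unary scores plus
    # pairwise compatibilities over whichever edges structure learning
    # keeps, then pick the best joint assignment by enumeration.

    def score(assign, unary, pairwise, edges):
        s = sum(unary[n][assign[n]] for n in assign)
        s += sum(pairwise[e][assign[e[0]], assign[e[1]]] for e in edges)
        return s

    rng = np.random.default_rng(0)
    labels = {'A': 2, 'O': 2, 'H': 3}              # toy label counts
    unary = {n: rng.normal(size=k) for n, k in labels.items()}
    edges = [('A', 'O'), ('A', 'H'), ('O', 'H')]   # learned connectivity
    pairwise = {e: rng.normal(size=(labels[e[0]], labels[e[1]]))
                for e in edges}

    best = max(({'A': a, 'O': o, 'H': h}
                for a in range(2) for o in range(2) for h in range(3)),
               key=lambda x: score(x, unary, pairwise, edges))
    print(best, score(best, unary, pairwise, edges))
    ```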

    -- Contributions --
    * As of CVPR 2010 (the paper’s publication), the authors claim novelty in utilizing context information for human pose estimation. However, they acknowledge that there has been prior interest in using context in computer vision, as we have seen in some of the other papers we have read in class.
    * The max-margin learning approach they propose seems reasonable and useful: it gives the learner the freedom to map an example to any subclass, as long as the overall class is correct. This makes the learning problem easier, since it no longer has to map features of many different poses into one monolithic category describing the activity (a toy version of this freedom is sketched below).
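    As a concrete (hypothetical) illustration, here is a perceptron-style stand-in for their max-margin solver; every name and number below is mine, not the paper's:

    ```python
    import numpy as np

    # Each class owns several subclass templates; prediction takes a max
    # over them, and training only penalises getting the *class* wrong,
    # so an example is free to match any subclass of its true class.

    def predict(W, x):
        # W: class label -> (n_subclasses, dim) array of templates
        return max(W, key=lambda c: float((W[c] @ x).max()))

    def latent_perceptron_step(W, x, y, lr=1.0):
        y_hat = predict(W, x)
        if y_hat != y:
            W[y][int(np.argmax(W[y] @ x))] += lr * x          # boost true class
            W[y_hat][int(np.argmax(W[y_hat] @ x))] -= lr * x  # demote winner

    rng = np.random.default_rng(1)
    W = {c: 0.01 * rng.normal(size=(3, 4)) for c in ('tennis', 'croquet')}
    x, y = rng.normal(size=4), 'tennis'
    for _ in range(5):
        latent_perceptron_step(W, x, y)
    print(predict(W, x))   # 'tennis' after a few updates
    ```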

    -- A Few Concerns --
    * Dataset (size, type) –
    The dataset they use is tiny (50 images per activity), and it has some natural biases: they keep mentioning that the right arm correlates with the object, but what about left-handed people?! Their dataset also always seems to have the object in interaction with the person. Can their system still gracefully detect a volleyball player, or the activity, when the player isn’t near the ball? The work they compare to for activity recognition by Gupta et al. likely won’t have this problem. This is as much dataset bias as they claim the other approach has.
    * Hill climbing initialization – The authors note that of the three initializations to their hill-climbing algorithm, they seed one with a manually labeled connectivity (see Appendix/Section A). How often did the best result come from this seed as opposed to the random seeds? This somewhat leads to my next point…
    * Why binary connections between nodes?
    It isn’t obvious why the connections should be binary instead of some measure of affinity. Possibly this reduces computation time, but I don’t think that’s the reasoning. It would have been good to see some comparisons showing that using hill climbing to prune connections bought them anything.
    * Parameter estimation -
    The discriminative weights are estimated separately. Why is that objective not used as the cost function for the hill climbing? My guess is that hill climbing is better suited to human pose estimation, whereas the discriminative learner is aimed more at detecting activities.

    -- Overall Thoughts --
    * The paper is well written and thought out. The approach is clearly explained and the motivation is obvious. The results presented are good.
    * It would be nice to see more detail on what actually worked. Was the hill climbing really necessary? How much improvement did it bring? I liked that they compared one human pose per class to having many; that was a good way to show that having multiple poses doubled the improvement.

    Replies
    1. The manually specified model is indeed suspicious, especially since it probably works very well for their sports dataset. Its very existence seems to suggest that simply using three random models was not producing very good results...

      On the other hand, I think the binary connections between parts is sensible. It's equivalent to enforcing sparsity on the graphical model.

    2. I'm a little upset that all this manual stuff was hidden in the appendix. The paper proper states that they "randomly initialized the structure for [sic] three times.." Having two randomly initialized structures and one manual one does *not* mean that it was randomly initialized three times. I feel like the use of manual initialization was purposefully hidden, and I really object to authors misrepresenting their work.

  2. In this paper, the authors learn a graphical model that takes into account activity, human pose, and the presence of objects to jointly detect human activities, poses, and objects in the scene. Assuming that the human is performing a task with an object, and that the human is touching the object, we can place very strong priors on object location and human pose. The authors use sports as their primary driving example. They provide very compelling performance improvements on a sports database in both detecting the objects of interest and the pose of the human. Clusters of human poses are found either in an unsupervised manner or with human labeling, and the most likely cluster is chosen as context. The pose models are generated via hill climbing.


    Pros
    I think it's definitely true that "context" can help in detecting human and object poses, and that their definition of "context" is much closer to what we intuitively experience as context than the previous literature we looked at. On their specific examples of a human in a canonical pose touching a single object, their method seems to work well.

    Cons
    It seems like a "just-so story." By restricting themselves artificially to images of humans engaged in sports, holding a single object, centered in the frame, they make the problem almost too easy. Then they compare themselves to a more general solution, and (surprise!) they do better on their specific subset of images. I feel like their approach is more generally applicable, but it's hard to take them seriously when they use so few images and such a small set of potential activities.

    -- Matt Klingensmith

    Replies
    1. The authors want to know whether context helps if context is present. They don't want to discuss whether context helps if context is absent, or how often context is present at all. To make sure that they are getting an answer to their question, they need to control the data. I think that sports images are a very good choice, as they offer a large collection of human poses and a lot of self-occlusion while satisfying their requirement of having context (in this case via object interaction).

      I think that by working with natural images we are unable to answer such scientific questions correctly. In particular, if our claim is "if A then B" and we work with natural images for which we don't know whether A is true or false, we essentially get nowhere.

    2. I agree that this method might not work for natural images, because they are less structured than the scenario the authors have considered. However, it comes back to the point of discussion we had at the end of the last lecture (before the guest lecture): is our world structured enough to be modeled this way? If the AND-OR graph kind of method can be applied to any real-world scenario, then this method should also be feasible, because they are very similar in the basic assumption that "if A happens, then either B happens or C happens", or something like that. I guess the real question to be answered is how much we can put on paper about the structure of the world.

    3. I think that the difference is that though the world may be structured, how much of that structure is actually captured in images? How much of it is regularly captured, so that we can make a generic algorithm relying on it?

      I would guess that images in the wild are so grossly different that there is no regular context structure for most of them. But we can't know for sure until we construct the "perfect" dataset.

  3. The authors don't seem to explicitly discuss failure cases, but there are some funky results in Figure 9, such as the golfer with an arm detected as a leg, or the bowler with strange leg positions for the occluded leg. Though occlusion is mentioned as a challenge in HOI detection, it doesn't seem like their model covers that case, so these results always give "hallucinated" limbs.

  4. The algorithm learns multiple graphical model structures to handle the wide variation in human pose and human-object interaction. This is really cool, as there is now no need to make strong statements about conditional independence for the entire dataset, only for subparts of it (can I still call these graphical models?). Even cooler is that they are able to learn these structures from the data using a structure-learning method (though some people are complaining about the initialization for hill climbing, I think it's learnable).

    Table 1 shows that learning multiple poses per class helps. Nice to see that complicated algorithms do indeed work in practice and outperform everything else.

    Replies
    1. I would not go so far. Yes, the complicated algorithm did work on the dataset they tried it on, and outperformed everything else by a margin. But there are some concerns regarding the dataset itself.

      Just to be clear on my opinion (I'm not totally against it): of course it is possible that it works equally well on a larger and more challenging dataset, and I agree that they tried it on the standard dataset at the time. :)

    2. When I was reading the paper, I had a feeling that the problem is to recognize a part-based model of 11 interconnected parts (10 human parts + 1 object), where the model has global mixtures (for different human poses) and local mixtures (for different objects). I'm wondering if it has ever been formulated that way.

    3. I love that idea! The object can be just another part in their method or in any other baseline. I would like to see that; it would actually tell us whether it's the object context that is helping, or their new learning/inference method that can capture slightly more complicated articulations.

    4. This reminds me of the connection between this work and "visual phrases". It would also be enlightening to see a comparison with learning a template for the whole phrase (human parts + object) using a much simpler learning paradigm.

  5. This paper gives interesting results for the tasks of detecting single objects, human pose estimation, and sport activity recognition. The authors exploit the fact that humans doing sporting activities are often interacting with objects, and show that incorporating shared information between the two tasks improves them both. Their experiments seem geared toward proving that context is quite useful when incorporated correctly.

    Their displayed results look good, but I would've liked to have seen many more. What concerns me about these types of approaches is that they ignore all other context in the scene. Show me an image with a hardwood floor and tell me to classify it as one of many sport scenes, and I'm pretty likely to pick basketball or volleyball. It seems like there is a wealth of other useful contextual information that is essentially ignored in approaches to many highly specific tasks. I would like to see more 'interdisciplinary' computer vision approaches...

    Replies
    1. The authors mention this in section 6.4, where they say that not using the background context allows them to generalize better. Their point is that for this particular type of pose and object recognition (sports), most datasets would be heavily biased, because recorded sports events all take place in highly distinguishable courts/fields, and the algorithm might learn that instead of the actual activity.
      I think their statement is true to a certain extent, after all, it is considered quite normal to play cricket in the middle of the street in India (though it would look horribly out of context anywhere else in the world).

  6. - The paper proposes a new model for joint object detection and human-pose estimation. Overall, (to me) the paper was quite intuitively explained. They explain their complicated model in a nice way, postponing the details to the appendix for interested readers. I also like their way of dealing with articulated poses (more on this later).

    - I don't agree with the authors that object detection and human pose estimation are unrelated problems (but that's just me).

    - After reading the paper, I couldn't decide whether it's the context (object-HOI) that is helping, or the new way of learning the articulated model. In particular, see Fig. 4 (2,2), the second cricket bowling example: the ball is never in hand in that particular pose. It's a pose after throwing the ball, but the authors still detect a ball in hand. I think it's not the ball context that's helping here at all; it's just an articulated model with an additional part (as Evgeny pointed out) and a new learning model. Given this doubt, one thing to try might be treating the object as an extension of the human (i.e., another human part), and just learning the articulated model with n+1 parts, with all the baselines getting n+1 parts as well. This might change the learning framework slightly for this paper, but I think it would give another compelling way to learn these complicated articulated models (a toy version of what I mean is sketched below). And if it works (though I doubt it would), we would have another way of training DPM/FMP models that can capture more complicated structures because of their framework...
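    Here is that toy version (entirely made up, not from the paper): a star-shaped pictorial structure rooted at the torso, where the ball is scored exactly like any limb.

    ```python
    import numpy as np

    def star_score(config, appearance, offsets, root='torso', prec=0.1):
        """config: part -> (x, y); appearance: part -> score function;
        offsets: part -> ideal displacement of that part from the root."""
        root_xy = np.asarray(config[root], float)
        s = 0.0
        for p, xy in config.items():
            xy = np.asarray(xy, float)
            s += appearance[p](xy)             # unary detector score
            if p != root:
                d = xy - root_xy - offsets[p]
                s -= prec * float(d @ d)       # quadratic deformation cost
        return s

    parts = ['torso', 'head', 'r_arm', 'ball']  # the ball is just a part
    appearance = {p: (lambda xy: -0.01 * float(xy @ xy)) for p in parts}
    offsets = {'head': np.array([0., -3.]), 'r_arm': np.array([2., 0.]),
               'ball': np.array([3., -1.])}
    config = {'torso': (0., 0.), 'head': (0., -3.),
              'r_arm': (2., 0.), 'ball': (3., -1.)}
    print(star_score(config, appearance, offsets))
    ```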

    Overall, I liked the paper. If they had released their code, I would have liked to try a few variants of this model and figure out what's making it work: whether it's the special treatment of objects (for context) or the new structured learning framework.

    Replies
    1. Re pt 2: They aren't unrelated, but they're usually phrased differently. Each human body part is like an object detector, but pose estimation is a higher-level task than just detection. In that sense, I think they are different problems.

      However, the point that the object is effectively an extended body part, I agree with. If you look at the graphical model (and at their hill climbing initialization), it is effectively like tacking on a new body part.

  7. I think it's an interesting idea to solve two "hard" problems simultaneously, which makes solving both a little bit easier. We saw this idea previously with the paper that correlated object locations and scene geometry. Maybe this is the way vision needs to proceed: for a given image, we try to extract as much information as we can (detect possible objects, recognize the action, estimate the scene label, estimate planar geometries, etc.). Even if we could only do each task with okay-ish accuracy, we might be able to aggregate and refine the labels based on the combined results. I think this is really what this paper is trying to get at. We know the person is playing cricket (scene context). From his pose, we estimate he is a bowler. We know from prior knowledge that there is often a ball in the bowler’s hand (even though we cannot see it / it has already been thrown), and a ball is more likely since he is making a throwing motion.

    With all the talk in the introduction about how other context methods only get 4-5% improvement, I was really expecting to see a much bigger improvement overall. I feel like they really oversold their method in the beginning.

    Replies
    1. I agree that the results shown on the sports dataset are competitive. However, I think the human figures in the sports dataset are relatively unoccluded and the background is much simpler, compared to, let's say, an office setting where the human may be only partially visible, sitting behind a desk in a much more complex environment. Also, the object in the sports dataset is singular, compared to the indoor affordance problem where the subject is usually interacting with multiple objects, with much more severe occlusion, and so forth. I would like to see this "mutually beneficial" algorithm work in more complicated environments.

    2. I think solving two problems simultaneously is a classic trick in computer vision. This is exactly a chicken-and-egg scenario. The same is true in optimization, where it corresponds to block coordinate descent (a tiny illustration follows). In general, the two different worlds regularize each other and produce much better results.
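      As a two-block caricature of that point, alternately minimizing a made-up objective f(x, y) = (x - y)^2 + 0.1(x - 3)^2 + 0.1(y + 1)^2 over one variable at a time can only decrease f at every step, which is the "mutual regularization" at work:

      ```python
      # Block coordinate descent: hold one block fixed, solve exactly
      # for the other, and alternate until the estimates settle.

      def block_coordinate_descent(steps=100):
          x = y = 0.0
          for _ in range(steps):
              x = (y + 0.3) / 1.1    # argmin_x f(x, y) with y fixed
              y = (x - 0.1) / 1.1    # argmin_y f(x, y) with x fixed
          return x, y

      print(block_coordinate_descent())
      ```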

  8. I like this paper because of the ideas it uses (HOI+context) and the way it captures them.
    I particularly like the high-order co-occurrence statistics being used, and the overall flow of the learning framework: let's try to solve a hard problem. Oh wait, we can't. Let's just keep splitting it into smaller subproblems till we can.
    My concern is that the authors fail to go into more detail. Firstly, the stopping criterion (3 images) seems a bit too extreme to me, in light of the fact that the dataset is tiny.
    Also, as Arun mentions, the hill climbing with multiple restarts makes me uncomfortable. I am all for simplicity, but this falls into the category of problems that you formulate very well and elegantly, but then solve in a hacky way. I'm not an expert on this, and in fact I have known ICM with restarts to work as well as TRW on graphical models in practice.

    Overall, I think this is a great paper if one looks at the ideas. Getting into the details dampens my excitement.

    Replies
    1. You generally do random restarts with hill climbing: as the authors mention, it only finds local optima, and random restarts are the usual way to get around this (a generic sketch follows). The worrisome part is the heuristic restart, where they seed it with a structure that they determined by hand. Again, this isn't so bad from an engineering point of view if you have a strong belief in your heuristic. But yes, it is a hack.
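      A generic sketch of the procedure being discussed (my own toy, not the authors' exact code); the `seeds` argument mimics mixing a hand-picked start in with the random ones:

      ```python
      import random

      def hill_climb(score, random_init, neighbors, restarts=3, seeds=()):
          best, best_s = None, float('-inf')
          for r in range(restarts):
              x = seeds[r] if r < len(seeds) else random_init()
              improved = True
              while improved:            # greedy ascent to a local optimum
                  improved = False
                  for x2 in neighbors(x):
                      if score(x2) > score(x):
                          x, improved = x2, True
                          break
              if score(x) > best_s:
                  best, best_s = x, score(x)
          return best

      # Toy problem: recover a target edge set on 4 nodes by toggling edges.
      edges = [(i, j) for i in range(4) for j in range(i + 1, 4)]
      target = {(0, 1), (1, 2)}
      score = lambda E: -len(E ^ target)       # fewer differing edges = better
      random_init = lambda: frozenset(e for e in edges if random.random() < 0.5)
      neighbors = lambda E: (E ^ {e} for e in edges)   # flip one edge at a time
      print(sorted(hill_climb(score, random_init, neighbors)))
      ```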

  9. I agree with most of the points that people brought up above. It is indeed a pretty cool way to think about how the problems of object detection and human pose estimation can add information to each other and hence perform better on both. However, I do have a big concern with their statement that this method is "less data set dependent" compared to Gupta et al. I agree the authors focus on the core human-object interaction in this paper, whereas the other paper used more of the background context. But as Arun mentioned earlier, the first concern is that the object is always in contact with the human. The second concern is that most of the images they show (at least in the paper) look like they are all centered in the scene and take up most of the image without a lot of background information (ideal images). Simply cutting off most of the background and then claiming that the method is less data set dependent does not make any sense, at least to me. Background context and human-object interaction should be addressed together, and that will probably give us much better results in a general scenario.

    Replies
    1. I agree; my immediate thought when looking at Figure 4 was "there isn't much variation in this dataset." As much as I agree with the general philosophy of the paper, they haven't shown it on anything more than 'toy' data.

  10. For me, I don't like the context thing in general, because it can actually be modeled by blurring the boundary of objects. The ball should occur with players who are in a kicking pose, and some poses can actually be distinguished by the objects being manipulated. They are born together; no context is needed because they are the context.

    Replies
    1. I think context only helps in some specific situations. If the scenario is quite clean without many interactions, pure object detectors or pure pose estimators work quite well, and there is no need for context. However, in difficult scenarios where this blurring exists, we need context information to help inference. That's why they refine the problem domain here to human-object interaction activities in which the relevant object is small or only partially visible and the human body parts are self-occluded. The key issue is how to judge between the two situations in advance. An alternative is to choose a two-phase processing pipeline.

  11. This paper attempts to tie human pose estimation to object detection by using one as context for the other, which is interesting. However, it does seem a little odd that detecting a small object (like a ball) can be used as "context" to find the pose of a human in complete detail. I think it is more likely that each object improves detection for a particular (set of) human part(s), and it would have been nice to see which human part detections are improved in each of these activities, each of which corresponds to interaction with a different object. Is it just that detection of the ball in the bowling pose directly affects localization of the hand and torso, which in turn allows finding the legs a little more reliably? From the results section, we can see that forearm detection accuracy shows a 200% increase over the pictorial structures method, which makes sense considering that most of these objects interact heavily with the forearm, whose detection seems to be the most difficult in general.
    The results in this paper look impressive, though one concern, as others mentioned previously, is the size of the dataset and the relatively small number of activities.

  12. The background section mentions that others have not even tried using pose for HOI activity classification, so one wonders how much of their improvement over the state of the art comes from the poses rather than the full model.

    Replies
    1. Agreed. I am also wondering how the human pose and object detection are estimated simultaneously. It looks to me like a chicken-and-egg problem, since they show that human pose helps detect the object while the object also helps detect the human pose. Where do they start? Do they simply hypothesize the objects of interest at random?

  13. I think this paper presents an interesting idea: as Priya mentioned, solving two "hard" problems simultaneously makes each a little easier. One concern I have is the generalizability of this algorithm. The dataset seems quite particular, and the training images require hand-labeled objects and body parts. How will this possibly generalize to humans "in the wild" interacting with a huge variety of objects (coffee mugs, computers, etc.)? I just don't see how this work can be extended further, besides adding more object/pose pairs.

  14. I like the idea proposed in this paper to exploit the mutual relationship between objects and human poses. The performance seems to be promising (even though there might be some problems with the data they use, as mentioned by some other people) on sports data.
    However, to be honest, I'm a little suspicious about the structure learning, especially inferring a non-tree structure from data. There is no analysis of how good the local optima are, or of the variance of the structures learned from different initial points. The philosophy behind this concern is that a graphical model is a tool for humans to impose prior knowledge, not for "inferring priors" from our data.

  15. My opinion about this paper agrees with most of the points brought up here. At first glance, the philosophy behind this paper is reasonable and intuitive. This is one of its pros.

    The second thing I like in this paper is the way they treat the weakest models: they cluster the samples to split them and individually learn a connectivity pattern for each sub-category. This makes sense; in a way, they are actually trying to find discriminative patterns, and I think this is one of the important things that boosted their performance.

    In addition, the paper's description and presentation are clear and easy to read.

    But on second thought, I immediately realized that the dataset is highly biased. I tend to think it is the "context of sports" that is also helping their method a lot. The generalization ability of this method is a big problem: in the real world, the context between human and object is far less structured than in the sports scenario. Moreover, there would be so many different configurations that their method might run very slowly or fail outright.
