Thursday, October 17, 2013

Reading for 10/22

Juan Carlos Niebles and Li Fei-Fei. A Hierarchical Model of Shape and Appearance for Human Action Classification. IEEE Computer Vision and Pattern Recognition (CVPR), Minneapolis, 2007.
And optionally:

H. Wang, M. M. Ullah, A. Kläser, I. Laptev and C. Schmid, Evaluation of local spatio-temporal features for action recognition. BMVC, 2009.
Abhinav Gupta, Praveen Srinivasan, Jianbo Shi and Larry S. Davis, Understanding Videos, Constructing Plots: Learning a Visually Grounded Storyline Model from Annotated Videos, In CVPR 2009.
W. Brendel and S. Todorovic, Learning Spatiotemporal Graphs of Human Activities, ICCV 2011.

34 comments:

  1. In this paper, the authors propose a framework for action classification which combines three principal ideas: hierarchical part/mixture models, motion features, and static features. Their hierarchical model is a mixture of parts. Each feature is matched to parts by a linear weighted sum (i.e., a mixture model).

    The parameters of the mixture model are tuned using an EM algorithm.

    The model's features come in two flavors, "static" and "dynamic". The features themselves seem fairly crude: they assume the background can be segmented out, and use a basic Canny edge detector.

    To detect various classes of actions, they apply a simple SVM to clusters learned from training data.

    Their experimental results show promising detection rates for each action, and further they show that the contribution of dynamic features is much greater than that of static features. They also show that the contribution of the mixture model allows them to classify actions much better than a simple "bag of features" model.
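
    As a sanity check on my own understanding of the EM step, here is a rough stand-in sketch (my toy code, not the authors' model, which also folds in per-part bag-of-features assignments): fit a small Gaussian mixture with EM and read off per-frame likelihoods, which is the kind of score that later feeds the discriminative classifier.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Toy stand-in: each row is a descriptor of one training frame
    # (in the paper it would encode part positions plus feature assignments).
    frame_descriptors = np.random.rand(500, 10)

    # Fit a K-component mixture with EM; each component plays the role of one
    # "pose cluster" / model component within an action class.
    gmm = GaussianMixture(n_components=3, covariance_type='diag', max_iter=100)
    gmm.fit(frame_descriptors)

    # Per-frame log-likelihood under the learned mixture.
    scores = gmm.score_samples(frame_descriptors)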

    ---

    Comments:

    I feel like this paper comes from a different tradition than some of the other papers we've read. Rather than classifying "in the wild" images, they use carefully staged images. This allows them to get away with pretty suspect features like Canny edge detection.

    Their idea of a "mixture of parts" is much more primitive than in DPM, as well.

    I think its big strength is the use of dynamic and static features together.

    -- Matt Klingensmith

    ReplyDelete
    Replies
    1. I think the two types of features - static and dynamic - will be helpful for different kinds of tasks. Maybe they should compare that. And we will probably see stronger methods for combining the two types of features.

      Delete
    2. The paper didn't give me a strong intuition as to why the two types of features are useful. Their results certainly seem to suggest that using static and motion features helps, but beyond a few sentences touting its "intuitiveness," they don't detail any experiments that clearly demonstrate interplay between the two. In fact, the recognition accuracy of motion + static almost looks like the sum of the individual recognition accuracies. A paragraph discussing this would be enlightening.

      Delete
    3. Other papers also used static+dynamic features together [10], but they never discuss what the difference is...

      Delete
    4. @Humphrey I assume you are looking at Figure 9 when you say that the accuracy of dynamic + static = the sum of the individual accuracies. Now, this is very bad practice, but the authors have the Y-axis scale starting at 40%.

      Delete
    5. This paper builds a hierarchical system for action recognition. For some modules, they use complex components, such as the combination of both static and dynamic features, while for other modules, they use simple ones, such as a simplified DPM, bags-of-features, and an edge detector. They don't provide a clear justification of why and how these modules were selected, or which is more crucial to the overall performance. Moreover, they use a simple dataset and baseline to evaluate, which makes their approach seem tuned to the dataset.

      Delete
  2. I'm still writing out the complete summary and critique but worth looking at for others:
    The video action recognition research page of the first author - http://vision.stanford.edu/projects/niebles/humanactions.htm. If you scroll down to the Resources section, the second paper has a video showing their results of action recognition on a video sequence, including some results on ice skating actions, which I found cool.

    Humans can discern actions from images and a recent paper from the same group that attempts image action recognition is http://vision.stanford.edu/documents/YaoFei-Fei_ECCV12.pdf. Here the researchers aim to be view-invariant by using a 2.5D approach - Localizing keypoints in 3D space and using 2D features.

    ReplyDelete
  3. I found it interesting/disturbing that the recognition system uses the aggregated class likelihoods as an input to a discriminative classifier to determine the final class. I suppose this could be viewed as a smarter way to pick out the MLE result, or a way to account for classification errors, as in stacking, but how much of the classification accuracy is accounted for by this step as opposed to their complex hierarchical model?
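
    (To make the question concrete: one could compare the two decision rules on the same scores - pick the max-likelihood class directly vs. train a classifier on the likelihood vector. A toy sketch with hypothetical file names, assuming per-frame, per-class log-likelihoods are already computed:)

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    # Hypothetical inputs: loglik is (num_frames x num_classes) log-likelihoods
    # from the generative models; labels is the true action class per frame.
    loglik = np.load('loglik.npy')
    labels = np.load('labels.npy')

    # Decision rule 1: simply pick the class with the highest likelihood (MLE).
    mle_acc = np.mean(loglik.argmax(axis=1) == labels)

    # Decision rule 2: the paper's extra step - a discriminative classifier
    # trained on the vector of class likelihoods.
    svm_acc = cross_val_score(LinearSVC(), loglik, labels, cv=5).mean()

    print(mle_acc, svm_acc)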

    ReplyDelete
  4. The paper describes a hierarchical model for human actions which includes 4 parts in the higher layer, each associated with a bag of features in the lower layer.
    Features can be static or dynamic or both, all clustered into bags of words. Dynamic features are spatio-temporal features that are pretty good for action classification. In fact, I recall using spatio-temporal features for exactly this task in Martial's CV class, and it performed pretty well using just a simple histogram-of-bag-of-features comparison (rough sketch of what I mean at the end of this comment).
    Training is done by breaking the video sequences into clusters, fitting a 1-component model on them initially, and then performing EM steps until convergence.
    The authors' rationale for using static features - some body parts remain static to form a particular pose during actions - seems reasonable, and is later borne out by the slight performance increase accrued by using static features.
    I liked how occluded parts are managed by allowing a part to have no features assigned to it.
    I liked how, in the results section, the authors showed that intuitive assumptions - using static and dynamic features together, and using a multi-component mixture model - are useful in practice.
    Could there be simpler and more complex actions requiring the video to be split into more clusters than just three components? The authors have chosen very short and simple video sequences for their experiments, so their choice seems reasonable, but there is no discussion of this choice and how it may or may not change for longer videos.
    pjump and jump look to be the same action seen from different views. Could they be classified as the same action in training, and could we look at using a view-invariant model?
    Only four parts are used in the hierarchical model, when sometimes more than 4 body parts are being exercised or in motion. See Figure 6 - that's a jumping jack (http://www.youtube.com/watch?v=5cGbc15CO1Q) and the part distribution looks odd, with no parts for the right arm.
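
    The kind of comparison I mean from the class project looked roughly like this (from memory, so treat it as a hypothetical sketch: k-means codebook over precomputed spatio-temporal descriptors, one normalized histogram per clip, nearest neighbor under a chi-squared distance):

    import numpy as np
    from sklearn.cluster import KMeans

    def bow_histograms(descriptors_per_clip, k=200):
        # descriptors_per_clip: list of (n_i x d) spatio-temporal descriptors,
        # assumed to be precomputed for each video clip
        kmeans = KMeans(n_clusters=k).fit(np.vstack(descriptors_per_clip))
        hists = []
        for desc in descriptors_per_clip:
            h = np.bincount(kmeans.predict(desc), minlength=k).astype(float)
            hists.append(h / h.sum())
        return np.array(hists), kmeans

    def nearest_neighbor_label(query_hist, train_hists, train_labels):
        # chi-squared distance between normalized histograms
        d = 0.5 * np.sum((train_hists - query_hist) ** 2 /
                         (train_hists + query_hist + 1e-10), axis=1)
        return train_labels[np.argmin(d)]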

    ReplyDelete
  5. The algorithm presented in this paper sounds like a simplified version of DPM where the parts are represented by bag-of-words features. The method works by combining static and motion features, which seems to produce better results than using either of them alone. One thing I noticed was their use of shape context descriptors to describe edge features for 2D deformable profile matching, which makes a lot of sense. However, the algorithm as a whole seems to be tailored to the dataset they use, which looks extremely unrealistic and small. The paper doesn't discuss how effective the proposed approach would be at recognizing human actions in the wild.

    ReplyDelete
  6. It felt really awkward when I worked all the way through the mathematical derivation of the complicated generative model part... and poof! - they only use it as a feature-like input for a discriminative classifier!

    I guess it is really complicated and difficult to truly build an overall generative model. The final discriminative part considerably alleviates this problem. It also tolerates possible errors made by the preceding generative models. As long as those generative models correctly capture the relative likelihood increase/decrease trend for each class (no absolute likelihood comparison is required), the method is going to output reasonably good results.

    ReplyDelete
    Replies
    1. I interpret the discriminative classifier used at the end as a calibration-cum-score-analysis method. The generative model is doing the major job of capturing semantic information, which is then refined by a hyperplane.

      Delete
  7. It seems like this paper is using the model it builds to classify each frame into a set of "actions", and then uses the probabilities that each action is occurring in the frame as a bag-of-words-style histogram to feed into an SVM.

    I don't understand why the authors classify the video frame-by-frame. My intuition is that it would be more accurate to classify the entire video as a whole, since a jumping jack looks like a 2-handed wave when your arms are up and looks like a jump for the frames where you are in the air.

    Also, I think that instead of picking the class with the most number of votes, each frame could contribute a "soft" vote based on how sure it was of that class, similar to what we saw with the hierarchical segmentation paper.
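
    (By "soft" vote I mean something as simple as summing per-frame class probabilities instead of counting argmax votes - a toy sketch, assuming a per-frame probability matrix is already available from the classifier, with a hypothetical file name:)

    import numpy as np

    # Hypothetical input: (num_frames x num_classes) per-frame class probabilities.
    frame_probs = np.load('frame_probs.npy')

    # Hard vote: each frame votes once for its most likely class.
    hard_vote = np.bincount(frame_probs.argmax(axis=1),
                            minlength=frame_probs.shape[1]).argmax()

    # Soft vote: each frame contributes its whole probability vector.
    soft_vote = frame_probs.sum(axis=0).argmax()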

    ReplyDelete
    Replies
    1. I think that aggregating over frame-by-frame classification makes them robust to bad performance on odd frames. The "soft" vote idea will probably improve results (like soft kmeans + BoW) but I don't think that would make a very big difference.

      Delete
  8. I found this paper to be really confusing, and I still don't have a good sense of how the part model works or what exactly these parts are. This is what I *think* is going on with the generative model, someone please correct me if I'm wrong:

    (1) take static (edges) and dynamic (gradients over time?) features, use k-means
    (2) assign cluster centers to so-called "parts" (separate set of four parts for each action class)
    (3) part position is given by normal distribution of the locations of the features that are assigned to it?
    (4) probability of part relative positions is given by some multivariate gaussian distribution.

    My impression is that through the action sequence, the four parts for the given action class do not change. I don't understand how these parts are allowed to move through the sequence. It looks like they sort of have parts learned for, say, the arms and legs during jumping jacks. However, these parts necessarily move throughout the sequence - are they moving into a lower-probability region of the Gaussian representing their relative positions? Is there something built in that models how they move relative to each other over time?

    I feel like I must be missing something big here, but I hope I'm not *completely* misunderstanding everything. Someone enlighten me please?
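
    To pin down my reading of steps (1)-(4), here's a rough sketch of what I think is happening (purely my interpretation, not the authors' code - in particular the paper assigns features to parts softly inside EM and uses positions relative to a reference frame, not hard assignments like this):

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy stand-in data: descriptors and image locations of features
    # pooled from training sequences of one action class.
    features = np.random.rand(1000, 64)
    positions = np.random.rand(1000, 2)

    # (1)-(2): cluster the descriptors and assign each codeword to one of 4 "parts"
    # (here randomly; in the paper the assignment is learned within the mixture).
    codebook = KMeans(n_clusters=40).fit(features)
    part_of_word = np.random.randint(0, 4, size=40)
    part_of_feature = part_of_word[codebook.predict(features)]

    # (3): each part's position modeled as a Gaussian over its features' locations.
    part_means = [positions[part_of_feature == p].mean(axis=0) for p in range(4)]
    part_covs = [np.cov(positions[part_of_feature == p].T) for p in range(4)]

    # (4): over many training frames, the vector of part centers would itself be
    # fit with a multivariate Gaussian to model the parts' relative layout.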

    ReplyDelete
    Replies
    1. The description below Eqn. 8 on page 3 says that point (3) above also uses positions of features relative to parts. This way the parts can move through the video. The "relative" position trick is used again in (4), and I think that one allows for motion of the entire person in the video. Maybe this answers the concern.

      Delete
  9. Concerns

    1. Dataset - it is too simple. Everyone's already pointed it out.
    That's also why using their simple features works.

    2. What actually helps? Is it the hierarchical model? What happens if you just take an MLE of the action models? Why do you even need an SVM?! What happens if you vary the number of mixtures?

    3. A simple baseline (anything using BoW aggregated over frames + SVM) would be very helpful (roughly the sketch after this list).

    4. Why is [2] bad if it cannot take decisions frame by frame? Why do we need frame-by-frame decisions? How is this current model more generic? Background subtraction seems reasonable given the dataset. Sure, this may be looked upon as "overfitting to the data", but I think these authors overfit their features too.
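
    (For 3., the baseline I have in mind is tiny - something like this sketch with toy stand-in data; per-frame BoW histograms are assumed to be precomputed:)

    import numpy as np
    from sklearn.svm import LinearSVC

    def video_feature(frame_hists):
        # aggregate per-frame BoW histograms (num_frames x vocab_size)
        # into one normalized histogram per video
        h = frame_hists.sum(axis=0)
        return h / h.sum()

    # toy stand-in data: 20 videos, random per-frame histograms, 3 action classes
    videos = [np.random.rand(30, 200) for _ in range(20)]
    labels = np.random.randint(0, 3, size=20)

    X = np.array([video_feature(v) for v in videos])
    clf = LinearSVC().fit(X, labels)
    print(clf.predict(X[:5]))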

    ReplyDelete
    Replies
    1. Humphrey raises the same point (MLE estimate). Sorry, I missed that.

      Delete
    2. Re 2): I like that they show that static + dynamic works (Figure 9, right); the improvement from adding static looks almost exactly additive. As expected, dynamic features do most of the heavy lifting, but their argument that something like hand-waving has a lot of static content relative to dynamic content is good intuition for including static features.
      I think the need for the SVM is that the generative part doesn't actually do much of the work (related to your 3). The rationale for going generative is that it should tell you something more. However, for the task of action classification (the target application) I don't see what being generative buys you.

      Delete
    3. I would suggest that you need an SVM because to do action classification you really need to do two things. First get some idea of pose and how it's changing (generative part) and then classify that into a semantic category (discriminative part). This seems to be a recurring theme with papers that try to assign labels that are words; natural language is very vague and therefore very hard to learn, so you need hacks like mixture models, etc. to get classes you can actually do anything with.

      Delete
    4. For 2.
      P(class | data, model) = P(data | class, model) * P(class | model)/P(data | model).

      Computing the likelihood alone is insufficient and modeling the other terms is painful. Approximating this with a hyperplane is a simple workaround. But I +1 Mike's intuitive explanation.

      For 4.
      If a classification method can only classify entire videos at a time, we need some other algorithm to cleanly segment a large video into smaller semantic parts. This second problem is not very easy, and a frame-by-frame classifier saves us from the trouble of solving it. But if the desired output is video segments, then frame-by-frame classification needs post-processing to merge frames into video segments, though now with likelihood scores for each frame, which might make that easier.
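
      (That merge step could be as simple as run-length grouping of the per-frame decisions - a toy sketch of my own, not something from the paper:)

      def merge_frames(frame_labels):
          # collapse runs of identical per-frame labels into (start, end, label) segments
          segments, start = [], 0
          for i in range(1, len(frame_labels) + 1):
              if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
                  segments.append((start, i - 1, frame_labels[start]))
                  start = i
          return segments

      print(merge_frames(['walk', 'walk', 'run', 'run', 'run', 'walk']))
      # [(0, 1, 'walk'), (2, 4, 'run'), (5, 5, 'walk')]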

      Delete
    5. There is some literature focusing on the problem of jointly segmenting and classifying videos with large-margin learning and dynamic-programming inference, e.g., http://humansensing.cs.cmu.edu/projects/segreg.html

      Delete
    6. The paper observes that:
      "human actions are results of a sequence of poses, which arise from a few sets of similar body configurations"
      and then proceeds to represent these "few sets of similar body configurations" using a mixture of hierarchical models. Therefore, each model represents an action ('walk', 'run', 'jump', etc.).
      I'm still not sure I've got this right, but if one really HAS to make the final decision using an SVM over the likelihood estimates of each mixture of models (as opposed to picking the maximum as Humphrey, Ishan and others have pointed out), why not instead go all the way and use the SVM over likelihood estimates of the individual hierarchical models directly? Wouldn't this give richer features for the SVM? Is the mixture step really required?

      Delete
  10. This comment has been removed by the author.

    ReplyDelete
  11. I liked the high-level idea of the paper -- using a hierarchy from low-level features to mid-level parts (though I mostly see only 2 levels being used). But quite frankly, that's where my liking of this paper stops.

    Firstly, I didn't like the writing. It was a pain to decipher what the authors were getting at or why a particular flow was written the way it was (e.g., "Fanti et al. proposed in [10] that it is fruitful to utilize a mixture of both static and dynamic features. In their work, the dynamic features are limited to simple velocity description. We therefore propose the hybrid usage of static shape features as well as spatial-temporal features in our framework.").
    I felt really stupid trying to understand and failing after multiple attempts..

    More importantly, I felt cheated in the experiments section (after all the effort to understand the paper and make sense of the math; no promises that I got it right). Why is frame-by-frame a good idea? Doesn't an action involve a sequence of frames anyway? Furthermore, the only form of comparison the authors mention is this -- "The first reported classification results on this dataset appeared on [2]. Their method achieved a classification error rate of 0.39%. It is however, difficult to make a fair comparison. Their method requires a background substraction procedure, global motion compensation, and it cannot take classification decisions frame by frame. Please also note, that our model is general in the sense that it aims to offer a generic framework for human motion and pose categorization." Kind of disappointing given that [2] was from 2005 and this paper is from 2007. More baselines, apart from the ablation analysis, would have been nice.

    Though the authors did do a nice ablation analysis on different aspects of their method, which was a delight to see :)

    Point of discussion: Was it a good idea to have no sharing of low-level features between parts? In the current form, the low-level features for different parts do not overlap, but you can imagine the same low-level feature (like waist keypoints) being shared by multiple parts (like the leg(s) and torso). Something to think about...

    ReplyDelete
    Replies
    1. The way they chose to do frame-by-frame classification also draws my attention. If they are doing frame-by-frame, it is very important for them to show 1-to-1 comparison results like ROC curves, or at least the verification rate (VR) at some particular false-accept rate. When doing 1-to-1, the classification rate always looks "nice", but the VR is usually where things get much uglier.

      Delete
  12. While this approach demonstrates reasonable effectiveness for contrived examples, I feel as if direct reasoning about human action is skipping a few steps in the perception pipeline, specifically 2D semantic segmentation and 3D geometric reasoning. Things like this just strike me as a convoluted way of producing a very specific output, and won't stand the test of time. I understand that the other parts in the pipeline are far from solved, but it seems as if once they're somewhat better, there won't be much that this paper has left to offer, other than the notion of combining static and dynamic features, which doesn't seem that much of a contribution. I place high value on approaches to problems that can be easily integrated into other computer vision tasks, and I don't think standalone action recognition is a desirable enough task to solve ahead of and independent of better scene understanding.

    ReplyDelete
  13. Is it worth trying to build a complex generative model that, in the end, is not actually very generative? It is hard to say whether the trained graphical model adds much to their final problem (Humphrey's point). What it adds is, in effect, a mid-level feature.
    A local bag-of-words plus features representing relative location to a parent may be good enough to train a direct SVM classifier (Ishan's point).


    The probability -> discriminative portion seems like somewhat of a reversal of Platt's idea of using an SVM + sigmoid (logistic regression) to get probabilities out of an SVM. http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=7CEEACDF86A0DF78FB7543FB40C755FC?doi=10.1.1.41.1639&rep=rep1&type=pdf
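
    (Platt's direction is easy to sketch: fit a sigmoid, i.e. a logistic regression, on the SVM's decision values to map margins to probabilities. A toy stand-in, not the paper's pipeline; in practice the sigmoid should be fit on held-out data:)

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.linear_model import LogisticRegression

    # toy two-class data
    X = np.random.randn(200, 5)
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

    svm = LinearSVC().fit(X, y)
    margins = svm.decision_function(X).reshape(-1, 1)

    # Platt scaling: a sigmoid mapping SVM margins -> class probabilities
    platt = LogisticRegression().fit(margins, y)
    probs = platt.predict_proba(margins)[:, 1]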

    ReplyDelete
  14. Regarding the problem with discriminative models, the authors write: "While discriminative frameworks are often very successful in the classification results, they suffer either the laborious training problem or a lack of true understanding of the videos or images." I'm not sure why the authors argue that we should use a generative model instead of a discriminative one, especially when they end up using a discriminative model on top of the generative one. There is no analysis of why they do that, which really confuses me.

    ReplyDelete
  15. "Action Classification" to me is a very suspect goal. First, the problem is ill-defined: What are the classes of human actions? What constitutes a different action? "Walking" for me might look totally different than "walking" for someone else. Second, what use is an action classifier? What we really want to do is understand where a person's limbs are and how their joint angles are changing.

    ReplyDelete
    Replies
    1. One could argue that finding the model of the person's limbs first loses certain important aspects of motion which might be captured by this system, like the subtle movement of clothing, for instance.

      -- Matt K

      Delete
    2. I also doubt that a 2D DPM-like model by itself is enough for such recognition tasks, as human motions essentially consist of complex 3D movements of the limbs. Granted, they used a simple test dataset that simplifies the problem, but I feel a system that doesn't take this into account would be difficult to extend to real scenarios.

      Delete
  16. The generative model proposed in this paper is very expressive while retaining a small number of parameters. I look at this in a few ways:
    1. Generative methods model the data. There is greater hope of making them unsupervised compared to discriminative methods. When used as in this paper, we can also interpret the learnt model.
    2. The generative model maps the raw feature space into a more separable semantic space (of likelihood scores). A discriminative classifier on this more separable space does better than applying discriminative classification to the raw features themselves. Some approaches that work like this have been shown to perform well even when the discriminative part at the end is removed or weakened - e.g., convolutional neural nets without the giant two-layer feedforward network at the end.
    3. Generative models exist all over the place - often as optimization problems rather than probabilistic methods. This includes ICA, RICA, PCA, and some clustering methods.

    ReplyDelete