Thursday, October 17, 2013

Reading for 10/22

Juan Carlos Niebles and Li Fei-Fei. A Hierarchical Model of Shape and Appearance for Human Action Classification. IEEE Computer Vision and Pattern Recognition (CVPR), Minneapolis, 2007
And optionally:

H. Wang, M. M. Ullah, A. Kläser, I. Laptev and C. Schmid, Evaluation of local spatio-temporal features for action recognition, BMVC, 2009.
Abhinav Gupta, Praveen Srinivasan, Jianbo Shi and Larry S. Davis, Understanding Videos, Constructing Plots: Learning a Visually Grounded Storyline Model from Annotated Videos, In CVPR 2009.
W. Brendel and S. Todorovic, Learning Spatiotemporal Graphs of Human Activities, ICCV 2011


  1. In this paper, the authors propose a framework for action detection which combines three principal ideas: hierarchical part/mixture models, motion features, and static features. Their hierarchical model is a mixture of parts. Each feature is matched to parts by a linear weighted sum (i.e., a mixture model).

    The parameters of the mixture model are tuned using an EM algorithm.

    Features of the model represent "static" and "dynamic" features. Their features seem fairly crude. They assume the background can be segmented out, and use a basic Canny edge detector.

    To detect various classes of actions, they apply a simple SVM to clusters learned from training data.

    Their experimental results show promising detection rates for each action, and further they show that the contribution of dynamic features is much greater than that of static features. They also show that the contribution of the mixture model allows them to classify actions much better than a simple "bag of features" model.



    I feel like this paper comes from a different tradition than some of the other papers we've read. Rather than classifying "in the wild" images, they use carefully staged images. This allows them to use pretty suspect features based on Canny edge detection.

    Their idea of a "mixture of parts" is much more primitive than in DPM, as well.

    I think its big strength is the use of dynamic and static features together.

    -- Matt Klingensmith

    1. I think the two types of features, static and dynamic, will be helpful for different kinds of tasks. Maybe they should compare that. And we will probably have stronger methods for combining the two types of features.

    2. The paper didn't give me a strong intuition as to why the two types of features are useful. Their results certainly seem to suggest that using static and motion features helps, but beyond a few sentences touting its "intuitiveness," they don't detail any experiments that clearly demonstrate interplay between the two. In fact, the recognition accuracy of motion + static almost looks like the sum of the individual recognition accuracies. A paragraph discussing this would be enlightening.

    3. Other papers also used static+dynamic features together [10], but they never discuss what the difference is...

    4. @Humphrey I assume you are looking at Figure 9 when you say that accuracy of dynamic and static = sum of individual accuracies. Now this is very bad practice but the authors have the scale on the Y axis starting from 40%.

    5. This paper builds a hierarchical system for action recognition. For some modules, they use complex components, such as the combination of both static and dynamic features, while for other modules they use simple ones, such as a simplified DPM, bags-of-features, and an edge detector. They don't provide a clear justification of why and how these modules were selected, or of which is more crucial to the overall performance. Moreover, they use a simple dataset and baseline for evaluation, which makes their approach seem dataset-tuned.

  2. I'm still writing out the complete summary and critique but worth looking at for others:
    The video action recognition research page of the second author - If you scroll down to the Resources section, the second paper has a video showing their results of action recognition on a video sequence including some results on ice skating actions which I found cool.

    Humans can discern actions from images, and a recent paper from the same group attempts image action recognition. Here the researchers aim to be view-invariant by using a 2.5D approach: localizing keypoints in 3D space and using 2D features.

  3. I found it interesting/disturbing that the recognition system uses the aggregated class likelihoods as an input to a discriminative classifier to determine the final class. I suppose this could be viewed as a smarter way to pick out the MLE result, or a way to account for classification errors, as in stacking, but how much of the classification accuracy is accounted for by this step as opposed to their complex hierarchical model?

  4. Paper describes a Hierarchical model for human actions which includes 4 parts in the higher layer associated with bag of features in the lower layer.
    Features could be static or dynamic or both, all clustered into bags of words. Dynamic features are spatio-temporal features that are pretty good for action classification. In fact I recall using spatio temporal features for action classification in Martial's CV class for exactly this task and it performed pretty well using just a simple histogram of bag of features comparison.
    Training is done by breaking video sequences into clusters, fitting a 1-component model to them initially, and then performing EM steps until convergence.
    The authors' rationale for using static features - some body parts remain static to form a particular pose during actions and this seems reasonable. Later borne out by the slight performance increase accrued by using static features.
    I liked how occluded parts are managed by allowing a part to have no features to be assigned to it.
    I liked how the authors showed how intuitive assumptions like using static and dynamic features and using a multi-component mixture model are useful in practice in the results section.
    Could there be less complex and more complex actions requiring splitting the video into more clusters than just three components? The authors have chosen very short and simple video sequences for their experiments so their choice seems reasonable but there is no discussion of this choice and how it may/may not change for longer videos.
    pjump and jump look to be the same action seen from different views. Could they be classified as the same action in training, and could we look to use a view-invariant model?
    Only four parts are used in the hierarchical model, when sometimes there are more than 4 parts being exercised or in motion. See Figure 6 - that's a jumping jack, and the part distribution looks odd, with no parts for the right arm.

  5. The algorithm presented in this paper sounds like a simplified version of DPM where the parts are represented by bag-of-words features. Their method works by combining static and motion features, which seems to produce better results than using either of them alone. One thing I noticed was their use of shape context descriptors to describe edge features for 2D deformable profile matching, which makes a lot of sense. However, the algorithm as a whole seems to be tailored to the dataset they use, which looks extremely unrealistic and small. The paper doesn't discuss how effective the proposed approach would be in recognizing human actions in the wild.

  6. It felt really awkward when I worked all the way through the mathematical derivation of the complicated generative model part...and puuuf! - they only use that as a feature-like input for a discriminative classifier!

    I guess it is really complicated and difficult to truly build an overall generative model. The last discriminative part considerably alleviates this problem. It also tolerates the possible errors made by the previous generative models. As long as the previous generative models correctly capture the relative likelihood increase/decrease trend for each class (no absolute likelihood comparison is required), the method is going to output reasonably good results.

    1. I interpret the discriminative classifier used at the end as a calibration cum score analysis method. The generative model is doing the major job of capturing semantic information, which is then refined by a hyperplane.

  7. It seems like this paper is using the model it builds to classify each frame with a set of "actions" and then uses the probabilities that the action is occurring in the frame as a bag-of-words histogram to feed into an SVM.

    I don't understand why the authors classify the video frame-by-frame. My intuition is that it would be more accurate to classify the entire video as a whole, since a jumping jack looks like a 2-handed wave when your arms are up and it looks like a jump for the frame you are in the air.

    Also, I think that instead of picking the class with the most number of votes, each frame could contribute a "soft" vote based on how sure it was of that class, similar to what we saw with the hierarchical segmentation paper.

    1. I think that aggregating over frame-by-frame classification makes them robust to bad performance on odd frames. The "soft" vote idea will probably improve results (like soft kmeans + BoW) but I don't think that would make a very big difference.
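
To make the hard-versus-soft voting discussion above concrete, here is a minimal sketch. It assumes each frame yields a dictionary of per-class probabilities; the class names and numbers are made up for illustration and are not taken from the paper.

```python
from collections import Counter

def hard_vote(frame_probs):
    """Each frame votes for its most probable class; the most votes wins."""
    votes = Counter(max(p, key=p.get) for p in frame_probs)
    return votes.most_common(1)[0][0]

def soft_vote(frame_probs):
    """Each frame contributes its per-class probability; the largest total wins."""
    totals = Counter()
    for p in frame_probs:
        for label, prob in p.items():
            totals[label] += prob
    return max(totals, key=totals.get)

# Hypothetical per-frame probabilities for a three-frame clip.
frames = [
    {"jack": 0.45, "wave2": 0.55},  # arms up: slightly favours a two-handed wave
    {"jack": 0.48, "wave2": 0.52},
    {"jack": 0.95, "wave2": 0.05},  # one very confident frame
]

print(hard_vote(frames))  # two unsure frames outvote the confident one
print(soft_vote(frames))  # the confident frame dominates the summed scores
```

On this toy input the two rules disagree, which is the case where soft voting matters; on clips where most frames are confidently correct they would agree, consistent with the guess that the difference is usually small.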

  8. I found this paper to be really confusing, and I still don't have a good sense of how the part model works or what exactly these parts are. This is what I *think* is going on with the generative model, someone please correct me if I'm wrong:

    (1) take static (edges) and dynamic (gradients over time?) features, use k-means
    (2) assign cluster centers to so-called "parts" (separate set of four parts for each action class)
    (3) part position is given by normal distribution of the locations of the features that are assigned to it?
    (4) probability of part relative positions is given by some multivariate gaussian distribution.

    My impression is that through the action sequence, the four parts for the given action class do not change. I don't understand how these parts are allowed to move through the sequence. It looks like they sort of have parts learned, for say, the arms and legs during jumping jacks. However, these parts necessarily move throughout the sequence - are they moving into a lower probability part of the gaussian representing their relative positions? Is there something built in that models how they move relative to each other over time?

    I feel like I must be missing something big here, but I hope I'm not *completely* misunderstanding everything. Someone enlighten me please?

    1. The description below eqn. 8 on page 3 says that point (3) above is also using relative positions of features to parts. This way the parts can move through the video. The "relative" position trick is used again in (4), and I think that one allows for motion of the entire person in the video. Maybe this answers the concern.
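
The relative-position point in the reply above can be illustrated with a small sketch. This is only a toy reading of step (3), not the paper's actual model: it assumes an isotropic 2-D Gaussian over feature positions measured relative to a part centre, and all names and numbers are hypothetical.

```python
import math

def log_gauss_2d(offset, mean, var):
    """Log density of an isotropic 2-D Gaussian (variance `var`) at `offset`."""
    dx, dy = offset[0] - mean[0], offset[1] - mean[1]
    return -math.log(2 * math.pi * var) - (dx * dx + dy * dy) / (2 * var)

def part_log_likelihood(features, part_center, mean_offset, var):
    """Sum of log densities of feature positions measured relative to the part."""
    return sum(
        log_gauss_2d((fx - part_center[0], fy - part_center[1]), mean_offset, var)
        for fx, fy in features
    )

# Hypothetical feature locations and part centre in one frame.
features = [(10.0, 20.0), (12.0, 21.0)]
center = (11.0, 20.0)

# The same configuration after the whole figure moves by (+5, -3).
shifted_features = [(x + 5.0, y - 3.0) for x, y in features]
shifted_center = (center[0] + 5.0, center[1] - 3.0)

before = part_log_likelihood(features, center, (0.0, 0.0), 4.0)
after = part_log_likelihood(shifted_features, shifted_center, (0.0, 0.0), 4.0)
print(before, after)
```

Because only offsets from the part centre enter the density, translating the features together with their part leaves the score unchanged, which is why a part can follow the limb it explains without being penalized.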

  9. Concerns

    1. Dataset - it is too simple. Everyone's already pointed it out.
    That's also why using their simple features works.

    2. What actually helps? Is it the hierarchical model? What happens if you just take a MLE of the action models? Why do you even need an SVM?! What happens if you vary the number of mixtures?

    3. A simple baseline (anything using BoW aggregated over frames+SVM) would be very helpful.

    4. Why is [2] bad if it cannot take decisions frame by frame? Why do we need frame-by-frame decisions? How is this current model more generic? Background subtraction seems reasonable given the dataset. Sure, this may be looked upon as "overfitting to the data", but I think these authors overfit their features too.

    1. Humphrey raises the same point (MLE estimate). Sorry, I missed that.

    2. I like that they show that static + dynamic works (Figure 9, right). It looks to be almost exact in how much it got better when adding static features. As expected, dynamic features do most of the heavy lifting, but their argument that something like hand-waving involves a lot of static features compared to dynamic ones motivates the idea of having static features well.
      I think the need for the SVM is that the generative part doesn't actually do much of the work (related to your 3). The reasoning for a generative model is that it should tell you something more. However, for the task of action classification (the target application), I don't see what the generative model buys you.

    3. I would suggest that you need an SVM because to do action classification you really need to do two things. First get some idea of pose and how it's changing (generative part) and then classify that into a semantic category (discriminative part). This seems to be a recurring theme with papers that try to assign labels that are words; natural language is very vague and therefore very hard to learn, so you need hacks like mixture models, etc. to get classes you can actually do anything with.

    4. For 2.
      P(class | data, model) = P(data | class, model) * P(class | model)/P(data | model).

      Computing the likelihood alone is insufficient and modeling the other terms is painful. Approximating this with a hyperplane is a simple workaround. But I +1 Mike's intuitive explanation.

      For 4.
      If a classification method can only classify entire videos at a time, we need some other algo to cleanly segment a large video into smaller semantic parts. This second problem is not very easy and a frame by frame classifier is saving us from the trouble of solving it. But if the output desired is video segments then frame by frame classification needs post processing to merge frames and make video segments, but now with likelihood scores for each; which might be easier to do.
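
As a small illustration of the Bayes-rule point made under "For 2." above, the sketch below turns per-class log-likelihoods into posteriors. It assumes the class priors are known, which is exactly the part described as painful to model; the class names and numbers are hypothetical.

```python
import math

def posterior(log_likelihoods, priors):
    """P(class | data) from log P(data | class) and P(class).

    P(data) is the sum of the joint terms over classes, so it only normalises.
    """
    joint = {c: log_likelihoods[c] + math.log(priors[c]) for c in log_likelihoods}
    m = max(joint.values())  # subtract the max before exponentiating for stability
    unnorm = {c: math.exp(v - m) for c, v in joint.items()}
    z = sum(unnorm.values())
    return {c: u / z for c, u in unnorm.items()}

# Hypothetical log-likelihoods from two per-class generative models.
log_lik = {"walk": -10.0, "run": -11.0}

uniform = posterior(log_lik, {"walk": 0.5, "run": 0.5})
skewed = posterior(log_lik, {"walk": 0.1, "run": 0.9})

print(max(uniform, key=uniform.get))  # with equal priors the likelihood decides
print(max(skewed, key=skewed.get))    # a strong prior can flip the decision
```

Picking the maximum likelihood is the same as assuming uniform priors; a classifier trained on the likelihood vector can absorb prior and calibration effects without modeling them explicitly.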

    5. There is some literature on the problem of jointly segmenting and classifying videos with large-margin learning and dynamic programming inference. e.g.,

    6. The paper observes that:
      "human actions are results of a sequence of poses, which arise from a few sets of similar body configurations"
      and then proceeds to represent these "few sets of similar body configurations" using a mixture of hierarchical models. Therefore, each model represents an action ('walk', 'run', 'jump', etc.).
      I'm still not sure I've got this right, but if one really HAS to make the final decision using an SVM over the likelihood estimates of each mixture of models (as opposed to picking the maximum as Humphrey, Ishan and others have pointed out), why not instead go all the way and use the SVM over likelihood estimates of the individual hierarchical models directly? Wouldn't this give richer features for the SVM? Is the mixture step really required?

  10. This comment has been removed by the author.

  11. I liked the high-level idea of the paper -- using a hierarchy from low-level features to mid-level parts (though I mostly see only 2 levels being used). But quite frankly, that's where my liking of this paper stops.

    Firstly, I didn't like the writing. It was a pain to decipher what the authors were getting at or why a particular passage was written the way it was (e.g., "Fanti et al. proposed in [10] that it is fruitful to utilize a mixture of both static and dynamic features. In their work, the dynamic features are limited to simple velocity description. We therefore propose the hybrid usage of static shape features as well as spatial-temporal features in our framework.").
    I felt really stupid trying to understand and failing after multiple attempts..

    More importantly, I felt cheated in the experiments section (after all the effort to understand the paper and make sense of the math) (no promises that I got it right). Why is frame-to-frame a good idea? Doesn't an action involve a sequence of frames anyway? Furthermore, the only form of comparison the authors mention is this: "The first reported classification results on this dataset appeared on [2]. Their method achieved a classification error rate of 0.39%. It is however, difficult to make a fair comparison. Their method requires a background substraction procedure, global motion compensation, and it cannot take classification decisions frame by frame. Please also note, that our model is general in the sense that it aims to offer a generic framework for human motion and pose categorization." Kind of disappointing given that [2] was in 2005 and this paper was in 2007. More baselines, apart from the ablation analysis, would have been nice.

    Though, the authors did do nice ablation analysis on different aspects of their method, which was a delight to see :)

    Point of discussion: Was it a good idea to have no sharing of low-level features between parts? In its current form, the low-level features for different parts do not overlap, but you can imagine having the same low-level feature (like waist keypoints) being shared by multiple parts (like leg(s) and torso). Something to think about...

    1. The way they choose to do frame-by-frame classification also caught my attention. If they are doing frame-by-frame, it is very important for them to show 1-to-1 comparison results like some ROC curves or at least the verification rate at some particular false accept rate. When doing 1-to-1, the classification rate would always look "nice", but usually the VR is where it gets way uglier.
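
For readers unfamiliar with the metric mentioned above, here is a minimal sketch of the verification rate (VR) at a fixed false accept rate (FAR). It assumes higher scores mean "same class" and scans every observed score as a threshold; the scores are made up and the function name is hypothetical.

```python
def verification_rate_at_far(genuine, impostor, target_far):
    """Largest fraction of genuine scores accepted while the fraction of
    accepted impostor scores stays at or below target_far."""
    best = 0.0
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)
        if far <= target_far:
            vr = sum(s >= t for s in genuine) / len(genuine)
            best = max(best, vr)
    return best

# Hypothetical match scores: genuine pairs share a class, impostor pairs do not.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.5, 0.3, 0.2, 0.1]

print(verification_rate_at_far(genuine, impostor, 0.0))  # no false accepts allowed
print(verification_rate_at_far(genuine, impostor, 0.5))  # a looser operating point
```

Sweeping `target_far` over a range of values traces out the ROC curve that the comment asks for.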

  12. While this approach demonstrates reasonable effectiveness for contrived examples, I feel as if direct reasoning about human action is skipping a few steps in the perception pipeline, specifically 2D semantic segmentation and 3D geometric reasoning. Things like this just strike me as a convoluted way of producing a very specific output, and won't stand the test of time. I understand that the other parts in the pipeline are far from solved, but it seems as if once they're somewhat better, there won't be much that this paper has left to offer, other than the notion of combining static and dynamic features, which doesn't seem that much of a contribution. I place high value on approaches to problems that can be easily integrated into other computer vision tasks, and I don't think standalone action recognition is a desirable enough task to solve ahead of and independent of better scene understanding.

  13. Is it worth trying to build a complex generative model, that in the end, is not actually very generative? It is hard to say if the trained graphical model is adding much to their final problem (Humphrey's point). What it adds is in effect a mid-level feature.
    A local-bag-of-words plus features representing relative location to a parent may be good enough to train a direct svm classifier (Ishan's point).

    The probability -> discriminative portion seems to be somewhat of a reversal of Platt's idea of using an SVM + sigmoid (logistic regression) to get probabilities out of an SVM.
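
For reference, Platt's direction (scores to probabilities) can be sketched as follows. The sigmoid form is Platt's; the fitting loop is plain gradient descent chosen for brevity rather than the optimizer from his paper, and the scores and labels are made up.

```python
import math

def platt_prob(score, a, b):
    """Platt's sigmoid: P(y=1 | score) = 1 / (1 + exp(a * score + b))."""
    return 1.0 / (1.0 + math.exp(a * score + b))

def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit (a, b) by gradient descent on the negative log-likelihood.

    Simplification: Platt's method also smooths the 0/1 targets and uses a
    more careful optimiser; neither is done here.
    """
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = platt_prob(s, a, b)
            # With this sign convention, d(NLL)/d(a*s + b) = y - p.
            grad_a += (y - p) * s
            grad_b += (y - p)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Hypothetical SVM decision values with overlapping classes.
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 1, 0, 1, 1]

a, b = fit_platt(scores, labels)
print(platt_prob(2.0, a, b), platt_prob(-2.0, a, b))
```

The paper goes the other way: probabilistic outputs of the generative models become the inputs to an SVM, rather than SVM outputs being mapped to probabilities.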

  14. On the problem with discriminative models, the authors say "While discriminative frameworks are often very successful in the classification results, they suffer either the laborious training problem or a lack of true understanding of the videos or images." I'm not sure why the authors argue that we should use a generative model instead of a discriminative one, especially when they use a discriminative model on top of a generative one. There is no analysis of why they should do that, which really confuses me.

  15. "Action Classification" to me is a very suspect goal. First, the problem is ill-defined: What are the classes of human actions? What constitutes a different action? "Walking" for me might look totally different than "walking" for someone else. Second, what use is an action classifier? What we really want to do is understand where a person's limbs are and how their joint angles are changing.

    1. One could argue that finding the model of the person's limbs first loses certain important aspects of motion which might be captured by this system, like the subtle movement of clothing, for instance.

      -- Matt K

    2. I also doubt that a 2D DPM-like model by itself is enough for such recognition tasks, as human motions consist essentially of complex 3D movements of the limbs. Granted, they used a simple test dataset that simplifies the problem. But I feel that a system that doesn't take this into account would be difficult to extend to real scenarios.

  16. The generative model proposed in this paper is very expressive while retaining a small number of parameters. I look at this in a few ways:
    1. Generative methods are modelling the data. There is greater hope in making them unsupervised when compared to discriminative methods. When used like in this paper we can interpret the learnt model.
    2. The generative model is mapping the raw feature space into a more separable semantic space (of likelihood scores). A discriminative classifier on this more separable space is doing better than applying discriminative classification on the raw features themselves. Some approaches that work like this have been shown to perform well even when the discriminative part at the end is removed/weakened - e.g., convolutional neural nets without the giant two-layered feedforward neural net at the end.
    3. Generative models exist all over the place - often as optimization problems rather than probabilistic methods. This includes ICA, RICA, PCA, and some clustering methods.