Tuesday, October 29, 2013

Reading for 11/5

Note: No class on 8/31


Lubomir Bourdev, Jitendra Malik. Poselets: Body Part Detectors Trained Using 3D Human Pose Annotations. In ICCV 2009.

and optionally:

Saurabh Singh, Abhinav Gupta, Alexei A. Efros. Unsupervised Discovery of Mid-Level Discriminative Patches. In ECCV 2012.

Lubomir Bourdev, Subhransu Maji, Thomas Brox, Jitendra Malik. Detecting People Using Mutually Consistent Poselet Activations. In ECCV 2010.

38 comments:

  1. ================
    ### Summary ###
    ================
    This paper addresses the problem of detecting humans and estimating their pose in images using a part-based method. The authors define poselets, which are parts of a pose (for instance, right arm crossing the torso, right profile and shoulder, etc.) and do not necessarily correspond to an anatomical part of a human such as the left forearm or the right leg. Discovering poselets requires annotating 2D images with 3D position and visibility information, which is reminiscent of the earlier paper by Deva Ramanan's group that we read: "Analyzing 3D Objects in Cluttered Images". As described in Section 2, annotating images this way allows them to easily perform some interesting queries, such as estimating probable locations of one part conditioned on the location of another, computing statistics of camera views, and extracting images of people in specific rough poses such as sitting. Their H3D dataset containing these annotations has been made publicly available.
    The major stages of their algorithm are:
    * Clustering to discover poselets: Poselets are discovered by performing clustering with a squared error metric in the 3D configuration space, followed by pruning to reduce the number of clusters.
    * Detecting poselets and humans: For each poselet, a classifier is trained on its associated rectangular window. Objects such as human torsos, as well as keypoints on the body, are detected by combining the scores obtained from running each poselet classifier in a multiscale scanning window over the image; a minimal sketch of this step follows below. Results are shown comparing their algorithm with the Dalal and Triggs pedestrian detector as well as a DPM method, on their own H3D dataset and on the VOC 2007 dataset.
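
    To make that step concrete, here is a minimal sketch in Python. This is my illustration, not the authors' code: poselet_svms (a list of trained classifiers) and the HOG parameters are assumptions.

        import numpy as np
        from skimage.feature import hog
        from skimage.transform import rescale

        def detect_poselets(image, poselet_svms, window=(96, 64),
                            scales=(1.0, 0.75, 0.5), stride=16, thresh=0.0):
            # Run every poselet classifier over every window at several
            # scales; keep activations whose SVM score clears a threshold.
            activations = []
            for s in scales:
                im = rescale(image, s)  # grayscale image assumed
                H, W = im.shape[:2]
                for y in range(0, H - window[0], stride):
                    for x in range(0, W - window[1], stride):
                        patch = im[y:y + window[0], x:x + window[1]]
                        feat = hog(patch, pixels_per_cell=(8, 8),
                                   cells_per_block=(2, 2))
                        for pid, svm in enumerate(poselet_svms):
                            score = svm.decision_function(feat[None, :])[0]
                            if score > thresh:
                                # store location in original image coordinates
                                activations.append((pid, y / s, x / s, score))
            return activations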

    ====================
    ### Contribution ###
    ====================
    * The main contribution of this paper is to address the question "how do we decide what good parts are?" in the context of recognizing humans, by introducing poselets. The data-driven approach of detecting parts of a pose, as opposed to using a natural definition of parts from human anatomy, is a good idea. The authors also compare this to DPM-like modeling of parts as latent variables that facilitate detection, and reason about the cases in which their method performs better or worse.

    ====================================
    ### Points of concern / interest ###
    ====================================
    * The hard negatives used for training the classifiers for each poselet are derived from pictures without humans in them. This seems to suggest that there is not much merit in discriminating between poselets. However, in the results section the authors mention that poselet classifiers for front-facing people might fire on back-facing people, leading to wrong detection of shoulders. It seems possible that discriminatively training front- and back-facing poselets against each other could have helped here.

    * The discovered poselets are pruned from an initial count of 120K down to just 300, using various criteria. One of these is to remove poselets that do not train well. It would have been good to see how many examples in the dataset correspond to the poselets that do not train well. If commonly occurring poselets did not train well, that might indicate a problem with their features or classifier.

    * The authors apply an asymmetric distance metric to describe the distance in configuration space from one example to another. I am not sure how significant this is, but the effect of the asymmetry is not discussed explicitly; one guess at where it comes from is sketched below.
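
    The guess (mine, not the paper's stated formula): if the residual is computed only over the keypoints visible in the reference example, then swapping reference and candidate changes which keypoints enter the sum, making the distance asymmetric.

        import numpy as np

        def config_distance(ref_kps, cand_kps, ref_vis):
            # Illustrative asymmetric distance: only keypoints visible in
            # the *reference* contribute, so in general
            # config_distance(a, b, vis_a) != config_distance(b, a, vis_b).
            vis = ref_vis.astype(bool)
            if not vis.any():
                return np.inf
            diffs = ref_kps[vis] - cand_kps[vis]
            return float(np.mean(np.sum(diffs ** 2, axis=1)))  # squared error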

  2. I like this paper because it introduces parts as mid-level cues for the recognition problem. Here, 2D information is not enough to discover good clusters, so the authors switched to 3D; the second paper proposes a way to discover good patches purely from 2D images (although not in a very principled manner). The lesson from this paper is that we need extra constraints to study the visual world.

  3. On the topic of poselets that don't train well: the matching in configuration space is invariant to scale and rotation because a similarity transform is applied to the keypoints (see the sketch below); however, in the appearance space where the SVMs are trained, the features don't have this invariance. I think the authors are getting at this point in footnote 5. I think the framework should allow a many-to-one mapping from poselets in appearance space to a poselet in configuration space.
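
    That invariance presumably comes from a standard least-squares similarity (Procrustes/Umeyama) alignment of keypoints; a minimal 2D sketch (my illustration, not the authors' code):

        import numpy as np

        def similarity_align(src, dst):
            # Least-squares similarity transform (Umeyama): scale, rotation,
            # and translation mapping src -> dst (both Nx2 keypoint arrays).
            mu_s, mu_d = src.mean(0), dst.mean(0)
            s, d = src - mu_s, dst - mu_d
            U, S, Vt = np.linalg.svd(d.T @ s)        # 2x2 cross-covariance
            sign = np.ones(2)
            if np.linalg.det(U @ Vt) < 0:            # avoid a reflection
                sign[-1] = -1
            R = U @ np.diag(sign) @ Vt
            scale = (S * sign).sum() / (s ** 2).sum()
            return lambda p: scale * (p - mu_s) @ R.T + mu_d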

    Replies
    1. What I think is missing here is the fact that multiple poselets existing in the image appearance space can constrain a pose in configuration space. By considering the poselets to be independent, we lose the key constraint that allows us to make this mapping.

      -- Matt K

    2. Another way of looking at Mike's comment (also hinted at by Srivatsan) is to say that the images defining some poselets vary so widely in appearance that a linear SVM cannot separate the data. The authors might have experimented with different features or nonlinear classifiers, but found that simply rejecting these poselets already gives some improvement over the baseline.

    3. Hm, isn't the different appearance of potential poselets exactly the reason to filter out the majority of them?

    4. Why should the SVM perform that much worse in this application versus the Dalal and Triggs pedestrian detector? At least in this case, the HOG features come from a semi-consistent subset of human poses (roughly the setup sketched below). I agree that the seeming independence of poselets is troubling. It would be interesting to see whether the false positives contain a large number of improbable configurations.
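
      For reference, training a single poselet classifier in the Dalal and Triggs style would be roughly this sketch (scikit-learn; pos_patches/neg_patches are hypothetical lists of equally sized grayscale patches, with negatives taken from person-free images as in the paper):

          import numpy as np
          from skimage.feature import hog
          from sklearn.svm import LinearSVC

          def train_poselet_svm(pos_patches, neg_patches):
              # One linear SVM per poselet on HOG features of its window.
              feats = [hog(p, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
                       for p in pos_patches + neg_patches]
              labels = [1] * len(pos_patches) + [0] * len(neg_patches)
              return LinearSVC(C=0.01).fit(np.array(feats), labels)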

    5. As I understood it, the whole point of poselets (as opposed to DPM-style parts) is that they capture semantic information rather than visual appearance information. Therefore, the visually dissimilar characteristic of potential poselets is not necessarily bad, especially if it captures unique or important semantic meaning.

  4. In this paper, the authors present an approach for simultaneous human detection and pose recognition which maps 3D positions of joints, links, and keypoints to arbitrary 2D "poselets," which are similar to the "parts" in a DPM model. They require an annotated 3D database, and leverage this database to produce distributions on keypoint locations given poses, and vice versa. Poselets that maximally discriminate human configurations are learned from the data. They discovered that the most informative poselets involve various zoom levels of human faces and shoulders -- an expected result given the performance of other human detectors.

    While the method the authors present doesn't always do as well as simpler DPM methods at finding humans and measuring the 2D positions of their parts, it is obviously a much more powerful and complete method for detecting humans and their poses. Humans are made up of 3D objects connected by 3D joints. Their appearance is generated by the position and orientation of their center of mass, combined with the configuration of each joint. It only makes sense to leverage this generative model to produce *correct* distributions on 2D appearances in the image, just from a theoretical standpoint. 2D discriminative methods, in contrast, assume faulty priors in the image space such as Gaussian, linear, or uniform distributions, which don't capture the underlying dynamics of the system. In this way, a 2D parts model is generated *automatically* from the 3D human model and data.

    Further, this work comes closer to giving us "what we want" in terms of the "parts" of a human. Really, what's needed in most applications which consider human pose is the real, 3D, continuous configuration of the human, not a 2D bounding box approximation of parts.

    I think that further work in this direction for detecting poses is necessary. The data collection problem is very difficult, and I don't believe their interface works well enough to be useful. Further, their model of the human is not, in my opinion, complex enough to give us the best representation. A human is not a collection of "balls and sticks", but rather has more or less rigid links of certain lengths, connected by joints with more or less known and constrained properties. Each joint has not only a position, but also an orientation (in 3D), as well as a degree of actuation. The joints also have limits. The human spine is another unmodeled aspect, and it is extremely high dimensional. I think that ultimately, all of these degrees of freedom should be modeled.

    -- Matt Klingensmith

    Replies
    1. I would argue that things like joint limits are implicitly modeled by the poselet generation step. If there are no configurations with improper joint angles, no clusters should appear there.

    2. I agree with Humphrey. Poselets are a good way of doing pose estimation by classification (each pose is a class). They decompose the problem by making pose classes out of two or three parts together rather than the entire human body.

    3. They describe detection of individual poselets at test time, but how do we know that the global configuration of detected poselets makes sense? They use 3D constraints at training time, but are we sure that all the individually detected poselets on a given human are sensible when we put them all together?

    4. I agree with Jacob. Torso detection seems like a relatively safe criterion, since many conflicting poselets may vote for the same torso location (roughly as in the sketch below). While the authors do show results on some keypoint detections, I would have liked to see more qualitative results, especially since using 3D poselets for detection should allow for a somewhat detailed pose estimate (as far as limbs are concerned).
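
      The voting intuition could be implemented Hough-style, roughly like this (my sketch; offsets, holding each poselet's mean window-to-torso displacement, is an assumption, and activations are (poselet id, y, x, score) tuples):

          import numpy as np

          def vote_torso(activations, offsets, grid_shape, cell=16):
              # Each activation casts its score into the accumulator cell
              # of the torso center it predicts.
              acc = np.zeros(grid_shape)
              for pid, y, x, score in activations:
                  dy, dx = offsets[pid]
                  gy, gx = int((y + dy) // cell), int((x + dx) // cell)
                  if 0 <= gy < grid_shape[0] and 0 <= gx < grid_shape[1]:
                      acc[gy, gx] += score
              return np.unravel_index(acc.argmax(), acc.shape)  # best cell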

    5. So it seems that if we want reliable detection at test time, it is better to model everything from 2D images at training time as well? I believe that with enough data we could do so (without any 3D information), but 3D information should provide guidance and remove some false positives during training when the dataset is quite small. That is why all these geometry-based methods work better with smaller training sets.

  5. This comment has been removed by the author.

  6. The paper presents an interesting and different approach to the problem of human pose detection. It effectively says that the datasets we train on are somewhat wrong, since they separate the 3D and the 2D at training time. This claim seems obvious from the viewpoint of replicating human understanding of 2D images: we spend our lives interacting with a 3D world to get information, yet we are able to properly detect and classify objects, people, poses, etc. in a 2D image. This is seen even in the way they construct the dataset, by using human labelers to reverse-generate an approximate 3D configuration. It is impressive that they were able to train on their relatively small H3D dataset and still achieve state-of-the-art performance on PASCAL. Hopefully, with Kinect-like sensors, so much effort won't be needed to create annotated 3D & 2D datasets in the future.

    I am also curious how it would have worked to just do L1-regularized training instead of the greedy search for picking the top candidates; a sketch of what that could look like is below.
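
    Something like this, say, where poselets whose weights the L1 penalty drives to zero get dropped (X and y are hypothetical -- per-image max poselet scores and person/no-person labels -- not anything from the paper):

        import numpy as np
        from sklearn.svm import LinearSVC

        def select_poselets_l1(X, y, C=0.1):
            # X[i, j] = max activation score of poselet j on image i.
            # The L1 penalty zeroes out redundant poselets; keep the rest.
            clf = LinearSVC(penalty='l1', dual=False, C=C).fit(X, y)
            return np.flatnonzero(np.abs(clf.coef_.ravel()) > 1e-6)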

  7. It seems like the power behind this method comes from the intricate dataset. How will this extend to other classes? How large a Mechanical Turk army will we need? Perhaps if we had a way to go from other imaging modalities, such as 3D point clouds or 2.5D depth images, to the configuration-space data they need, this could extend nicely. On that note, wouldn't skeleton data from the Kinect be able to produce useful (albeit biased) data?

    Replies
    1. It seems so, though the problem with current Kinect-style sensors is that they only work indoors. This could make it hard to get a wide variety of lighting, and could bias the hard-negative mining, since backgrounds would only come from indoor settings.

    2. I suspect that even poselets learned from quite biased datasets would still be useful. I think there is definitely merit to coming up with a more automated way to create a dataset like the one presented in this paper.

    3. The Kinect might give us a really biased dataset because of constraints such as being indoors only, or the poses having to be fairly simple.

    4. I think that if we used the Kinect only to get pose information, we might be able to automatically correlate it with a 2D image dataset. I imagine this being a system where the Kinect (which lacks visual variety) enumerates different common poses for humans, and a 2D image dataset (which lacks 3D data) is correlated with the 3D poses to get a larger dataset.

    5. The paper specifically mentioned that there are 15 regions of a person ("face", "upper clothes", etc.) that were annotated manually. These seem like data that would be non-trivial to obtain from Kinect data.

      While I was reading this paper, I thought it would be super cool if all you needed was the skeletal data, and if that could be obtained from the Kinect. If so, this kind of method would be much more feasible. As it stands, we have 5 min/annotation * 2000 annotations, roughly 167 hours, which seems like a very long time.

      I have no idea how necessary it is to have an accurate segmentation of these region types. Maybe some approximation of it via the Kinect skeleton would suffice. If so, it seems to me that this kind of data is relatively easy to get (excluding biases introduced by Kinect limitations). This makes this paper particularly exciting for me, because it seems like something for which generalization is actually feasible.

      Thoughts on why the region segmentations were required? I know they needed some sort of image information, but do people think these segmentations have to be really accurate?

  8. The authors find 200K poselets and prune them down because some of them don't have enough data points or a linear SVM cannot be made to work on them. I think they are losing out on modeling the long, fat tail of the dataset by discarding this rich information.

    Based on our estimates of the number of classes required to solve pose estimation as well as humans do, we probably need a large number of poselets too (not as many as classes, but probably many more than 300).

    If they hung on to everything till the end and used their cross-validation to discard only those poselets which hurt, then they might be able to capture odd poses better (thus improving performance a tiny bit).

    Replies
    1. I don't follow this. If an SVM doesn't train well (which seems to be the case given their description), how would it be selected by cross-validation?
      Unless you are suggesting something closer to "exemplar-poselets"? With few images that vary too much in appearance, an SVM doesn't stand a chance.

    2. Exactly. The pruning/selection step bothers me. Was the cross-validation task pose estimation or person detection? Was it to find poselets that are similar in appearance, evaluated in configuration space?

    3. I agree. The pruning through cross-validation was very vague. Since they mention it is a greedy approach, I wonder if they are doing something like OMP, where you first choose the poselet with the maximum score and then combine poselets sequentially to get the best subset; a sketch of that procedure is below. But wouldn't that be a very expensive operation?
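
      For concreteness, the OMP-style procedure being suggested might look like this (a guess at the flavor of the greedy selection, not the paper's actual step; scores is a hypothetical images-by-poselets matrix and y the +/-1 person labels):

          import numpy as np

          def greedy_poselet_selection(scores, y, k=300):
              # Classic OMP loop: pick the poselet most correlated with the
              # current residual, refit on the chosen set, update residual.
              residual, chosen = y.astype(float), []
              for _ in range(k):
                  corr = np.abs(scores.T @ residual)
                  corr[chosen] = -np.inf          # never pick twice
                  chosen.append(int(corr.argmax()))
                  S = scores[:, chosen]
                  coef, *_ = np.linalg.lstsq(S, y, rcond=None)
                  residual = y - S @ coef
              return chosen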

    4. I don't think they're doing OMP, but it seems like it would be a very natural thing to do... Why would it be expensive?

      Actually, there's a paper similar to this one in that they're also clustering patches, but they cluster at many different scales: http://homes.cs.washington.edu/~lfb/paper/cvpr13.pdf

    5. My bad! I didn't fully write what I was thinking. Just choosing one poselet with the maximum score and adding other poselets onto it might not be too expensive. But how confident are the authors about the distance metric? If the distance metric is not suitable, one run of an OMP-like method might not work. Trying different starting poselets would be required, and that would be an expensive operation.

    6. They introduce a mid-level representation, i.e., poselets, to connect configuration space and appearance space. It seems that in some cases the match between these two spaces via poselets is not so good, so they have to prune. Doing something like exemplar-SVMs for detection might alleviate the influence of large intra-class variation combined with small training samples.

  9. I like this paper for its use of parts which have some grounding (in configuration space). I believe using a grounding apart from appearance (like in DPMs) is a nice way to enrich these part-based models. A2's paper in ICCV finds such a grounding in geometry space.

    I liked the task-specific learning of weights. I think we often fail to exploit domain- or task-specific information and hence make a problem harder than it should be. The authors show a nice way to overcome this.

  10. The authors present a new mid-level representation -- poselets -- which is another attempt to define "parts" for human detection and/or pose estimation. They introduce a new dataset, H3D, which has 3D keypoint annotations for humans, which they use to discover poselets. The key properties they are looking for in these poselets are that they should be easily detectable and should help in estimating the 3D pose/configuration of a person.

    I liked the paper for its merit, its ease of reading, and for describing everything intuitively. However, here are a few concerns (points of discussion):
    1. Going from 120k -> 2k -> 300/256 poselets. I was around when Singh et al. were working on their ECCV 2012 mid-level patches paper, and I know how difficult it was to define a good metric to select good patches, or even to induce a good ordering. Apart from the 120k -> 2k jump, I'm very unclear about how the rest of the pruning via cross-validation was done or engineered. Anyway, I would have liked to see some more details :)
    2. Annotation. The authors try to address this point on page 3 c2 l2, but I was not convinced. I have seen images in PASCAL and they do seem to be pretty arbitrary and complex. Why couldn't we just do similar 3D annotations on those images?
    3. Not related to the paper but in general: it is easy to provide 3D annotations for humans, cars, etc., which usually follow a very standard skeleton/structure. But I doubt such techniques can be extended to most other objects, which is where I think making use of 2.5D or Kinect-like data might come in handy. This point is also raised by a few other students above.

  11. It seems to me that this method could be even more useful if it is known that the camera will see objects only in a subspace of configurations. For example, cameras in public places will always see people from some pitch angle, and car dash cameras will see people and cars from a height of about 1 meter.
    I guess what one needs to do in this case is just to choose the relevant poselets from the poselet database.

  12. This paper specifically focuses on the usage of parts as applied to humans. It seems the implicit argument is that humans are the most complex sets of parts, and so solving the problem for them will render the problem solved for simpler objects.

    They constrain themselves to rectangular patches for simplicity's sake. This approach seems fine as a pre-processing step, but using straight unsegmented patches to put things in configuration space seems non-optimal. Can we create mid-level patches and then use only the regions of interest (and not the background) as input?

    It was interesting to see their results on something like "metadata transfer" of dataset segmentations onto an image with predicted poselets, and I would've liked to see more of such results.

  13. I really like the concept of poselets for doing pose estimation, and the way the authors compared this method with DPMs and showed how the performance varies. They have also created a new dataset, H3D, with annotations that can be used in different ways, which I'm sure is a major contribution. But I had issues with some parts of their paper, which people have mentioned above.

    1. The pruning through cross-validation - It is not clear how this works and what exactly the authors do: why a greedy approach, how confident they are about its end results, how significant the distance metric is, why this distance metric is better than others, etc.
    2. Choosing negative examples without humans in them - This definitely disturbed me a lot. These are not hard negatives if there are no humans in them. I wonder whether they tried "proper" hard negative examples.

  14. I like this paper because it addresses the question of what makes a good part. Rather than using visual constraints only in 2D, this paper uses 3D information about joint positions. It is a good way to combine 2D and 3D information. And it may be possible to apply this algorithm using a depth sensor such as the Kinect; at the least, that would make capturing the intricate dataset easier.

  15. I really liked how the authors tried to make a more comprehensive dataset, but I feel like their dataset, and more specifically the 3D annotated poses, was really the key behind their model. The intuition here is that with the 3D annotated poses, you can get much more detailed information about the part you're tracking, like its orientation and scale. It also takes joint limitations into account. I think this is why they do much better than DPM on their own dataset, but perform similarly on PASCAL VOC.

    I didn't quite like their method for poselet selection. They had to keep using different techniques to prune their poselet candidates, and even in the final set, a lot of the poselets were still redundant. I would have thought a much more natural way of finding poselets would be some clustering approach. Also, I would guess that humans take many more than 300 poses, and given that several of these poselets were redundant, perhaps their poselets were not quite comprehensive or exhaustive. I think this again shows that the 3D annotated data is really what is driving the algorithm here.

    Replies
    1. I dunno, I think 300 might be adequate for estimating poses of humans in street scenes. It seems like the images don't really include funky poses like those in sports images. Additionally, one poselet can be used for many poses, since poses are composed of combinations of poselets.

      In class we talked about the number of human poses, and I think there was a paper somewhere that estimated it at 10^5? I can see how 300 poselets could cover this space pretty well, since there are many combinations of poselets available. What I mean is that, yes, 300 << 10^5, but one poselet doesn't correspond to one pose; see the quick arithmetic below.
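
      A quick back-of-the-envelope check (illustrative arithmetic only):

          import math

          # unordered pairs and triples of 300 poselets
          print(math.comb(300, 2))   # 44850
          print(math.comb(300, 3))   # 4455100 -- already far beyond ~1e5 poses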

  16. I like this mid-level representation using poselets a lot. It is interesting to see the connection between this mid-level feature (poselets) and attribute-based pose estimation. The way the authors have defined poselets makes each body part an attribute already. For example, this subject has an arm crossing the torso at 45 degrees, or this particular arm pose looks like arm pose No. 3 from the reference. That way, simile classifiers could also be harnessed, which may further improve pose estimation tasks, as in the facial attributes paper we discussed.
