Comments on 16-824: Learning-based Methods in Vision (F'13): Reading for 11/5
Carl Doersch

So it seems that if we want reliable detection at test time, it is better to model everything from 2D images at training time as well. I believe that with enough data we can do so (without any 3D information), but 3D information should provide some guidance and remove some false positives during training when the dataset is small. That is why all these geometry-based methods work better with smaller training sets.
-- Anonymous, 2013-11-05 09:48

I like this mid-level representation using poselets a lot. It is interesting to see the connection between this mid-level feature, the poselet, and attribute-based pose estimation. The way the authors define poselets already makes each body part an attribute: for example, this subject has a 45-degree arm across the torso, or this particular arm pose looks like arm pose No. 3 from the reference.
That way, simile classifiers could also be harnessed, which may further improve pose estimation, as we discussed for the facial-attribute paper.
-- Anonymous, 2013-11-05 08:54

I think 300 might be adequate for estimating poses of humans in street scenes. The images don't really include the funky poses found in sports images. Additionally, one poselet can serve many poses, since poses are built from combinations of poselets.

In class we discussed the number of human poses, and I believe a paper estimated it at around 10^5. I can see 300 poselets covering this space fairly well, since many combinations of poselets are available. Yes, 300 << 10^5, but one poselet doesn't correspond to one pose.
-- Anonymous, 2013-11-05 06:08

The paper specifically mentions that 15 regions of a person ("face", "upper clothes", etc.) were annotated manually. These seem like data that would be non-trivial to obtain from Kinect data.

While reading this paper, I thought it would be great if all you needed was the skeletal data, and if that could be obtained from the Kinect. If so, this kind of method would be much more feasible. As it stands, we have 5 min/annotation x 2000 annotations, which is a very long time.

I have no idea how necessary an accurate segmentation of these region types is. Maybe some approximation via the Kinect skeleton would suffice.
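The combinatorial argument above, that 300 poselets can cover roughly 10^5 poses because poses are combinations of poselets, is easy to sanity-check with back-of-envelope arithmetic. The choice of k active poselets per pose below is my own hypothetical modeling assumption, not a number from the paper:

```python
from math import comb

# If a pose is described by a small subset of k active poselets drawn from
# a library of 300, the number of distinct subsets quickly dwarfs the ~10^5
# poses estimated in class. k here is a hypothetical modeling choice.
POSELET_LIBRARY = 300
for k in (2, 3, 4):
    print(f"k={k}: {comb(POSELET_LIBRARY, k):,} subsets")
# k=2 already gives 44,850 subsets; k=3 gives 4,455,100 >> 10^5
```

So even pairs of poselets come close to the estimated pose count, and triples comfortably exceed it, which supports the "300 << 10^5 but one poselet isn't one pose" point.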
If so, this kind of data seems relatively easy to get (excluding biases introduced by Kinect limitations). That makes this paper particularly exciting to me, because generalization actually seems feasible.

Any thoughts on why the region segmentations were required? I know they needed some image information, but do people think these segmentations have to be very accurate?
-- Anonymous, 2013-11-05 05:57

They introduce a mid-level representation, poselets, to connect configuration space and appearance space. For some cases the match between these two spaces via poselets is poor, so they have to prune. Something like exemplar-SVMs for detection might alleviate the influence of large intra-class variation with small training samples.
-- Yuxiong Wang, 2013-11-05 05:52

I agree with Jacob. Torso detection seems like a relatively safe criterion for detection, since many conflicting poselets may vote for the same torso location. While the authors do show results on some keypoint detections, I would have liked to see more qualitative results, especially since using 3D poselets for detection should allow a fairly detailed pose estimate (as far as limbs are concerned).
-- Maheen Rashid, 2013-11-05 05:26

As I understood it, the whole point of poselets (as opposed to DPM-style parts) is that they capture semantic information rather than visual-appearance information. Therefore, the visual dissimilarity of potential poselets is not necessarily bad, especially if it captures unique or important semantic meaning.
-- Priya Deo, 2013-11-04 23:23

I think that if we used the Kinect only to get pose information, we might be able to automatically correlate it with a 2D image dataset. I imagine a system where the Kinect (which lacks visual variety) enumerates common human poses, and a 2D image dataset (which lacks 3D data) is correlated with the 3D poses to get a larger dataset.
-- Priya Deo, 2013-11-04 23:16

I really liked how the authors tried to make a more comprehensive dataset, but I feel their dataset, and specifically the 3D-annotated poses, was really the key behind their model. The intuition is that with 3D-annotated poses you get much more detailed information about the part being tracked, such as its orientation and scale. It also takes joint limitations into account.
I think this is why they do much better than DPM on their own dataset, but perform similarly on PASCAL VOC.

I didn't quite like their method for poselet selection. They kept applying different techniques to prune the poselet set, and even in its final state many poselets were still redundant. A more natural way to find poselets would seem to be some clustering approach. Also, I would guess that humans take far more than 300 poses, and given that several poselets were redundant, perhaps their poselets were not comprehensive or exhaustive. This again suggests that the 3D-annotated data is really what drives the algorithm here.
-- Priya Deo, 2013-11-04 23:11

The Kinect might give us a really biased dataset because of its constraints, such as indoor-only capture or relatively simple poses.
-- Anonymous, 2013-11-04 22:13

I like this paper because it addresses the question of what makes a good part. Rather than using only 2D visual constraints, it uses 3D information about the joint positions. It is a good way to combine 2D and 3D information, and it should be possible to apply this algorithm with a depth sensor such as the Kinect.
At the very least, we could capture a more intricate dataset that way.
-- Anonymous, 2013-11-04 22:09

They describe detection of individual human poselets at test time, but how do we know that the global configuration of detected poselets makes sense? They use 3D constraints during training, but are we sure that all the individually detected poselets on a given human are sensible when put together?
-- Jacob Walker, 2013-11-04 22:07

My bad! I didn't fully write what I was thinking. Just choosing the poselet with the maximum score and adding other poselets onto it might not be too expensive. But how confident are the authors in the distance metric? If the distance metric is not suitable, one iteration of an OMP-like method might not work; trying different starting poselets would be required, and that would be expensive.
-- Divya Hariharan, 2013-11-04 22:05

I really like the concept of poselets for pose estimation, and the way the authors compare this method with DPMs and show how performance varies. They also created a new dataset, H3D, with annotations usable in different ways, which I'm sure is a major contribution. But I had issues with some parts of the paper, which people have mentioned above.

1.
The pruning through cross-validation: it is not clear how this works or what exactly the authors do: why a greedy approach, how confident they are in its end results, how significant the distance metric is, why this distance metric is better than others, etc.
2. Choosing negative examples without humans in them: this bothered me a lot. These are not hard negatives if there are no humans in them. I wonder why they did not try "proper" hard negatives.
-- Divya Hariharan, 2013-11-04 21:56

I don't think they're doing OMP, but it seems like a very natural thing to do. Why would it be expensive?

Actually, there is a paper similar to this one in that it also clusters patches, but at many different scales: http://homes.cs.washington.edu/~lfb/paper/cvpr13.pdf
-- Anonymous, 2013-11-04 21:44

This paper focuses specifically on parts as applied to humans. The implicit argument seems to be that humans are the most complex sets of parts, so solving the problem for them will solve it for simpler objects.

They constrain themselves to rectangular patches for simplicity's sake. That seems fine as a pre-processing step, but using straight unsegmented patches to place things in configuration space seems non-optimal.
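Hard-negative mining, which the earlier complaint about "negatives without humans" alludes to, is the standard alternative: retrain on the negatives the current model scores highest. A minimal sketch, with made-up patch names and scores rather than anything from the paper's pipeline:

```python
# Sketch of hard-negative mining: keep the negatives the current detector
# is most fooled by, then retrain on them. Scores below are hypothetical.
def mine_hard_negatives(score, negatives, keep=100):
    """Return the `keep` negatives with the highest detector scores."""
    return sorted(negatives, key=score, reverse=True)[:keep]

scores = {"patch_a": 0.9, "patch_b": 0.1, "patch_c": 0.7}
hard = mine_hard_negatives(scores.get, list(scores), keep=2)
print(hard)  # the two highest-scoring (most confusing) negatives
```

With "no-human" negatives the top-scoring patches are rarely confusing at all, which is exactly the objection raised above.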
Could we create mid-level patches and then use only the regions of interest (and not the background) as input?

It was interesting to see their results on something like "metadata transfer" of dataset segmentations onto an image with predicted poselets, and I would have liked to see more such results.
-- Anonymous, 2013-11-04 21:38

Why should the SVM perform that much worse in this application than in the Dalal and Triggs pedestrian detector? At least here, the HOG features come from a semi-consistent subset of human poses. I agree that the seeming independence of poselets is troubling. It would be interesting to see whether the false positives contain a large number of improbable configurations.
-- Arne Suppe, 2013-11-04 21:30

I agree. The pruning through cross-validation was very vague. Since they mention it is a greedy approach, I wonder if they are doing something like OMP, where you first choose the poselet with the maximum score and then combine poselets sequentially to get the best subset. But wouldn't that be a very expensive operation?
-- Divya Hariharan, 2013-11-04 21:19

It looks to me like this method could be even more useful when the camera is known to see objects only in a subspace of configurations.
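The OMP-style greedy subset selection being debated in the comments above could look something like the sketch below. The coverage-based score is a hypothetical stand-in for whatever cross-validation criterion the authors actually used:

```python
# Greedy forward selection, OMP-style: start from the highest-gain candidate
# and repeatedly add whichever candidate most improves the score. The score
# function is a stand-in, not the authors' cross-validation criterion.
def greedy_select(candidates, score, budget):
    """Pick up to `budget` candidates, greedily maximizing score(subset)."""
    selected = []
    while len(selected) < budget:
        best, best_gain = None, 0.0
        for c in candidates:
            if c in selected:
                continue
            gain = score(selected + [c]) - score(selected)
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:  # no candidate strictly improves the score
            break
        selected.append(best)
    return selected

# Toy example: each "poselet" covers a set of poses; score = poses covered.
coverage = {"a": {1, 2}, "b": {2, 3}, "c": {4}, "d": {1}}
score = lambda subset: len(set().union(*(coverage[c] for c in subset)))
print(greedy_select(list(coverage), score, budget=3))
```

Each greedy step costs one score evaluation per remaining candidate, so a single pass is cheap; the expense the commenters worry about comes from restarting with different seeds when the distance metric is unreliable.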
For example, cameras in public places always see people from some pitch angle, and car dash cameras see people and cars from a height of about one meter. I guess all one needs to do in such cases is choose the relevant poselets from the poselet database.
-- Anonymous, 2013-11-04 21:17

Exactly. The pruning/selection step bothers me. Was the cross-validation task pose estimation or person detection? Was it meant to find poselets similar in appearance, evaluated in configuration space?
-- Abhinav Shrivastava, 2013-11-04 21:04

Hm, isn't the differing appearance of potential poselets exactly the reason to filter out the majority of them?
-- Anonymous, 2013-11-04 21:04

The authors present a new mid-level representation -- poselets -- which is another attempt to define a "part" for human detection and/or pose estimation. They introduce a new dataset, H3D, with 3D keypoint annotations for humans, which they use to discover poselets. The key properties they look for are that poselets should be easily detectable and should help estimate the 3D pose/configuration of a person.

I liked the paper for its merit, its readability, and its intuitive explanations. However, here are a few concerns (points of discussion):

1.
Going from 120k -> 2k -> 300-256 poselets: I was around when Singh et al. were working on their ECCV 2012 mid-level patches paper, and I know how difficult it is to define a good metric to select good patches, or even to induce a good ordering. Apart from the 120k -> 2k jump, I am very unclear on how the other pruning via cross-validation was done or engineered. I would have liked to see some more details.
2. Annotation: the authors try to address this on page 3 (column 2, line 2), but I was not convinced. PASCAL images do seem arbitrary and complex; why couldn't we do similar 3D annotations on those images?
3. Not related to this paper specifically: it is easy to provide 3D annotations for humans, cars, etc., which follow a fairly standard skeleton/structure, but I doubt such techniques extend to most other objects. That is where making use of 2.5D or Kinect-like data might come in handy, a point a few other students raised above.
-- Abhinav Shrivastava, 2013-11-04 21:00

I like this paper for its use of parts that have some grounding (in configuration space). Using a grounding apart from appearance (as in DPMs) is a nice way to enrich these part-based models. A2's paper in ICCV finds such a grounding in geometry space.

I also liked the task-specific learning of weights. Too often we fail to exploit domain- or task-specific information and thus make a problem harder than it should be.
The authors show a nice way to overcome this.
-- Ishan, 2013-11-04 20:42

I don't follow this. If an SVM doesn't train well (which seems to be the case given their description), how would it be selected by cross-validation? Unless you are suggesting something closer to "exemplar-poselets"? With few images that vary too much in appearance, an SVM doesn't stand a chance.
-- Ishan, 2013-11-04 20:40

The authors find 200K poselets and prune them down because some don't have enough data points or a linear SVM cannot be made to work on them. I think they lose the ability to model the fat, long tail of the dataset by discarding this rich information.

Based on our estimates of the number of classes required to solve pose estimation as well as humans do, we probably need a large number of poselets too (not as many as classes, but probably far more than 300).

If they hung on to everything until the end and used cross-validation to discard only the poselets that hurt, they might capture odd poses better, improving performance a bit.
-- M Aravindh, 2013-11-04 19:09
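The "exemplar-poselets" idea floated in the comments above can be sketched compactly. Below is an exemplar-LDA-style stand-in for a per-exemplar SVM (closed-form, so no hinge-loss training is needed); the features are random placeholders for HOG, and none of the names or numbers come from the paper:

```python
import numpy as np

# Sketch of the "one classifier per exemplar" idea: each poselet exemplar
# gets its own linear weight vector, trained against a shared negative pool.
# This uses the closed-form whitened (exemplar-LDA-style) variant as a
# stand-in for a per-exemplar SVM; features are random stand-ins for HOG.
rng = np.random.default_rng(0)
negatives = rng.normal(size=(500, 36))            # shared negative pool
mu = negatives.mean(axis=0)
cov = np.cov(negatives, rowvar=False) + 1e-3 * np.eye(36)
cov_inv = np.linalg.inv(cov)

def exemplar_classifier(x_pos):
    """Weight vector scoring similarity to this single positive exemplar."""
    return cov_inv @ (x_pos - mu)

exemplar = rng.normal(loc=1.0, size=36)           # one "poselet" exemplar
w = exemplar_classifier(exemplar)
# The exemplar scores above the average negative by construction, since
# w.(x - mu) is a positive-definite quadratic form in (x - mu).
print(w @ exemplar > (negatives @ w).mean())
```

This sidesteps the objection above that an ordinary SVM "doesn't stand a chance" on a few visually varied images: with one classifier per exemplar, no single model has to explain all the appearance variation.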