Tuesday, October 1, 2013

Reading for 10/3

M. Hejrati, D. Ramanan. Analyzing 3D Objects in Cluttered Images. Neural Info. Proc. Systems (NIPS), Lake Tahoe, NV, Dec 2012.

And optionally:

Bojan Pepik, Michael Stark, Peter Gehler, Bernt Schiele. Teaching 3D Geometry to Deformable Part Models, CVPR 2012.

48 comments:

  1. Analyzing 3D Objects in Cluttered Images
    Mohsen Hejrati, Deva Ramanan

    This paper tackles the problem of 3D understanding and reasoning about objects in 2D images; in particular, it focuses on cars. First, the authors use a variant of DPM with a flexible mixture-of-parts (see Yang & Ramanan) to detect cars and predict keypoints corresponding to a 3D structure. Then they enforce SfM-like constraints on these keypoints for refinement.

    2D shape and appearance
    Some background: DPM models objects using global mixtures (multiple models per object), each with its own set of parts. In contrast, [Yang & Ramanan]’s work on pose estimation uses just one model (a single mixture) for humans, but each part has local mixtures (multiple appearance models per part); they call this a flexible mixture-of-parts.

    This paper combines the global mixtures from DPM with the local mixtures from [Yang & Ramanan] and calls the result a compositional model. The difference is that there are no separate parts for each global mixture: there is one set of mixture-of-parts, and each global mixture can compose (cut-and-paste) parts from this shared set.

    They start with a dataset of cars which have labeled 3D keypoints in 2D images. Because of the 2D projection of a 3D object, half of the 3D keypoints are not actually visible in the image, so each keypoint carries an additional visible-or-occluded annotation (since humans might be wrong in marking these occluded keypoints, the locations are treated as latent and updated during training). They initialize one part per keypoint and train a DPM-like mixture model. They add an additional co-occurrence term, which captures the intuition that some parts are almost always occluded together, e.g., if the left-rear wheel is not visible, then the left tail light is probably not visible either.

    One major point (in my view) is that they still train appearance models for occluded parts. I see how this can help (even if a part is occluded, there might still be local appearance evidence for it), but I would have liked to see it turned on and off to understand its contribution. I think the authors do a very nice job of explaining the model in Section 3 (local, relational, and global model), so I won’t repeat it here. I will just note that, as opposed to previous work, they want to reason about all the geometric configurations and occlusion states of the parts.
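
    To make the 2D model concrete, here is a minimal toy sketch of how the score of one candidate configuration might decompose into local appearance, spring, and co-occurrence terms. This is my own reconstruction for intuition, not the authors' code, and all names are hypothetical:

    ```python
    def score_configuration(parts, app_maps, spring, cooccur, tree_edges):
        """Toy score of one global mixture in a compositional model.

        parts:      list of ((x, y), t) per part: pixel location plus
                    local-mixture index t (appearance/occlusion state).
        app_maps:   app_maps[i][t] is a precomputed filter-response map
                    for part i under local mixture t.
        spring:     spring[(i, j)][ti][tj] = (wx, wy), quadratic spring
                    weights that may depend on the mixture types.
        cooccur:    cooccur[(i, j)][ti][tj] = bias rewarding consistent
                    states, e.g. left-rear wheel and left tail light
                    occluded together.
        tree_edges: parent-child pairs of the part tree.
        """
        total = 0.0
        for i, ((x, y), t) in enumerate(parts):
            total += app_maps[i][t][y][x]        # local appearance evidence
        for (i, j) in tree_edges:
            (xi, yi), ti = parts[i]
            (xj, yj), tj = parts[j]
            wx, wy = spring[(i, j)][ti][tj]
            total -= wx * (xi - xj) ** 2 + wy * (yi - yj) ** 2  # deformation
            total += cooccur[(i, j)][ti][tj]     # occlusion co-occurrence
        return total
    ```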


    3D shape and viewpoint
    This paper enforces SfM-like constraints, for refinement, on the keypoints predicted by the DPM-like 2D model. They assume each car can be represented as a combination of 3D basis shapes, and they minimize the difference between the detected keypoint positions and the projection of that combination.
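
    For intuition, the refinement they describe amounts to something like the alternating least-squares sketch below, under a scaled-orthographic (weak-perspective) camera. This is a minimal sketch under my own assumptions (unconstrained coefficients, arbitrary initialization and iteration count), not the authors' optimizer:

    ```python
    import numpy as np

    def refine(P, B, iters=20):
        """Fit 2D keypoints P (2 x n) with a linear combination of 3D basis
        shapes B (K x 3 x n) under a scaled-orthographic camera.

        Alternates between (1) a least-squares camera given the current
        shape and (2) least-squares basis coefficients given the camera."""
        t = P.mean(axis=1, keepdims=True)
        Pc = P - t                                # remove 2D translation
        Bc = B - B.mean(axis=2, keepdims=True)    # center each basis shape
        K = B.shape[0]
        alpha = np.zeros(K)
        alpha[0] = 1.0                            # start from the first shape
        for _ in range(iters):
            S = np.tensordot(alpha, Bc, axes=1)   # current 3 x n shape
            # camera: 2x3 least squares, then project to a scaled rotation
            M = Pc @ S.T @ np.linalg.pinv(S @ S.T)
            U, sig, Vt = np.linalg.svd(M, full_matrices=False)
            sR = sig.mean() * (U @ Vt)            # orthonormal rows x scale
            # coefficients: each projected basis shape is one design column
            A = np.stack([(sR @ Bc[k]).ravel() for k in range(K)], axis=1)
            alpha, *_ = np.linalg.lstsq(A, Pc.ravel(), rcond=None)
        return alpha, sR @ np.tensordot(alpha, Bc, axes=1) + t
    ```

    The gap between the detected keypoints and the returned projections is then the reprojection residual that such a refinement drives down.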

    I thought details were missing with regard to the basis shapes. Unlike the previous section, this one lacked clarity, description, and a convincing argument.

    1. My take:
      Pros:
      - Overall, I love how the authors build the compositional model, with one bag of parts and each global mixture choosing the parts it likes for its viewpoint.
      - Enforcing constraints from SfM seems like a good start (also see cons).
      - The paper is well written for the most part and does a good job of describing the various learning components.
      - I like how they can reason about all the geometric configurations and occlusion states of parts.


      Cons:
      - I would have liked to see more description and detail in Section 4 (regarding the basis, etc.).
      - The need for 3D landmark annotation.
      - More diagnosis of some design decisions (like whether or not to train appearance models for occluded parts).
      - Some other things I would like to see in the experiments: they have 723 cars in total and use 50 global mixture components, which means on average only about 14 cars per component. The number of global mixtures seems extremely high compared to other DPM works. Also, there are 20 parts, each with 9 local mixtures. For diagnosis, I would have liked to see how performance varies with the number of global and local mixture components. Also, how do they choose the basis shapes, and why just 5? They mention “We found results relatively robust to these settings”, but some quantification would be good (at least for the global mixtures). For baselines, they don’t mention what settings they use for DPM; the standard is 6-8 mixture components, but they could have trained DPMs with 25 or 50 components to compare against their 50-component model.

      In conclusion, I think this paper has good ideas and techniques that start to look at the problem of 3D understanding of objects from 2D images! The first thing to try is to label things (3D keypoints in this case) and see how well we can use them. But moving forward, I would like to see work that doesn’t require this annotation, or at least works with limited annotation and scales to images without any such specific annotation.

    2. I have to admit that I am confused as to what was meant by "basis shapes," and was led to think they meant some number of geometric primitives. However, this wasn't entirely clear to me. It seems like they just used a simple a priori 3D model; in that sense it is similar to the 3D pictorial model presented in last class.

      -- Matt K

    3. The paper mentions non-rigid structure from motion (NRSFM) as a regressor of sorts to learn the basis shapes. My understanding is that the 3D points are approximated as a linear combination of weighted basis shapes, and that the basis shapes are selected via data-subspace methods like PCA. Their choice of 5 shapes does seem very arbitrary, however:

      http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4359359

      Given the relation between SfM motion-rank deficiencies and articulated structures, it would be interesting to see if the 3D model could be extended to learn and handle large deformations, such as the Ferrari doors.
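
      If the basis really is a data subspace, a minimal PCA version would look like the sketch below. This is an assumption on my part; their publicly available NRSFM code may construct its basis quite differently:

      ```python
      import numpy as np

      def pca_basis_shapes(shapes, n_basis=5):
          """shapes: (n_cars, 3, n_landmarks) aligned 3D keypoint sets.
          Returns a mean shape plus the top principal deformation modes,
          so a car is approximated as mean + sum_k alpha_k * mode_k."""
          n_cars, d, n_pts = shapes.shape
          X = shapes.reshape(n_cars, -1)          # flatten to (n_cars, 3K)
          mean = X.mean(axis=0)
          _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
          modes = Vt[:n_basis]                    # top variance directions
          return mean.reshape(d, n_pts), modes.reshape(n_basis, d, n_pts)
      ```

      With something like this, n_basis=5 would reproduce their setting, and the singular-value spectrum would show how arbitrary (or not) that choice is.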

    4. My understanding is that basis shapes help handle outliers in the 2D detection.

      Suppose that a landmark is badly localized. It will skew the least-squares optimization done for the 3D understanding, because no explicit cost terms are used to manage outliers. The algorithm gets away with this because it forces everything into the narrow search space of convex combinations of a very small number of basis shapes. In my opinion, they use a small number precisely because they don't want the model to be very expressive.

    5. Exactly, but how do they come up with these basis shapes? And are we sure such a small number of basis shapes will work for all objects?

    6. [trying to justify 50 mixture models]
      Actually, if you look at Fig. 1 and imagine rotating the car, you'll see how the ratios between different edges (that is, springs) vary, even when rotating by a small angle. So to me it makes sense that every such angle is a mixture component; they describe it as a "quantized viewpoint". Another thing is that the database looks tiny for 50 mixtures.
      What I don't know is how just 3 mixtures - left, front, right - could describe a face looking at you. There must be so much spring "distortion".

    7. I think Abhinav mentioned during his presentation that, for the DPM model, different objects seem to perform better with different numbers of parts. I would assume the same reasoning holds here: different objects will need differently sized mixture models and basis sets. For example, a table or car might need relatively few basis shapes, while a person or cat would need many more.

    8. Completely agree with Abhinav on experiments varying the number of parts and mixtures. I think it puts the baseline at a real disadvantage, especially given that they handpicked viewpoint-varying training images; the 2D DPM would have had a tough time with that (not having enough data to bias it toward good parts).

    9. Coming back to basis shapes: objects are not a linear combination of a few basis shapes, since they can take a very wide range of poses. But we need to include a parametric bias in the system to gain anything in 2D detection by reasoning in 3D; otherwise no new information is added by the 3D reasoning, and not much improvement will be seen in landmark localization or AP scores.

    10. I think their idea of basis shapes is interesting and reasonable. In other cases, when the scenario is a bit more complex, we typically try to construct a set of bases that spans the whole space and simplifies the overall problem. However, one concern here is the lack of concrete justification for the number of basis shapes. At the least, a parameter-sweep experiment would verify empirically that their choice of basis number is suitable. The same goes for other parameters in this paper, such as the number of mixtures and the number of landmarks.

  2. In Figure 2, the viewpoint label errors appear to have two distinct modes: one at the median of 9 degrees, and another around 180 degrees. Given that a car is bilaterally symmetric, I wonder whether this error mode is significant. Is it a failing of the system, or just an artifact of the fronts and rears of cars looking similar? Looking at the landmarks/keypoints in Figure 4, I suspect it may be related to their choice of car landmarks being too symmetric.
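
    One quick diagnostic (hypothetical, not in the paper) would be to fold the error about 180 degrees and check whether the second mode collapses:

    ```python
    def angular_error(pred_deg, gt_deg):
        """Absolute viewpoint error on the circle, in [0, 180]."""
        d = abs(pred_deg - gt_deg) % 360.0
        return min(d, 360.0 - d)

    def symmetry_folded_error(pred_deg, gt_deg):
        """Error after allowing a 180-degree (front/back) flip. If the
        mode near 180 in Figure 2 vanishes under this metric, those
        failures are symmetry confusions, not arbitrary mistakes."""
        d = angular_error(pred_deg, gt_deg)
        return min(d, 180.0 - d)
    ```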

    1. I would also like to see the distribution of the number of objects per viewpoint label, to understand that figure better.

  3. In this work, the authors explore combining 2D part-based models with 3D geometric knowledge about the objects being detected. They do so by creating a two-stage recognition and localization pipeline. In the first stage, they construct a 2D deformable parts model based on mixtures of trees, which is applied to a scene for recognition. In the second stage, the authors use an a priori association of each part in the tree with a point on a deformable 3D model; this is used in an expectation-maximization algorithm to adjust and perfect the 3D match of the object to the image. This allows them to localize objects under heavy clutter, and it provides excellent performance in comparison with state-of-the-art 2D approaches.

    I believe this paper is going in the right direction. As a roboticist primarily interested in using vision algorithms to enable autonomous behavior, I find 2D approaches limited -- nearly useless, in fact -- in the amount of knowledge they give the robot. What we really want is 3D knowledge about the scene so that we can navigate through it and interact with objects. For this, approaches which give us more than a simple bounding box are absolutely essential.

    One concern I have with this algorithm is its reliance on a simple a priori 3D model to detect cars. I am wondering how their approach scales to more objects.

    -- Matt Klingensmith

  4. The algorithm in this paper constructs a simple 3D model of cars by identifying landmark points (both visible and occluded) in single 2D images, using a compositional representation of cars. The authors implement a two-stage process: the first stage identifies the visible and occluded landmarks using 2D shape and appearance (taking viewpoint and intra-class variation into account), and the second stage takes that output and refines it further using 3D geometric techniques like non-rigid SfM. The paper attempts to integrate the goals of early 3D-geometry-based approaches with recent DPM-like methods that build a statistical model of appearance in 2D for object recognition.

    Points of interest to me:
    1. The way occluded landmarks are handled: the authors' idea that the classifier can decide for itself whether to ignore an occluded landmark or to find features correlated with occlusion is very interesting, and one I haven't come across before. In addition, the model's ability to detect occluded landmarks is really good.
    2. Using non-rigid structure from motion to model intra-class and viewpoint variation and to refine the predicted landmarks is a novel idea to me; I've only seen non-rigid SfM used on deformations of the same object over time.

    What I would have liked to see/possible future work:
    1. Training requires a visibility flag that tells the algorithm whether a point is occluded. Even if landmark annotation can't be done away with completely, it would be good if the algorithm could at least work without the occlusion flags. In fact, it would be great if it didn't require occluded parts to be explicitly labeled, but somehow identified them from 3D geometry.
    2. Separating the process into two stages was maybe good for a start. But ideally, I think the 3D geometric constraints and the 2D appearance-based identification should be applied simultaneously within an iterative process.

    1. I like your idea of doing away with visibility annotation for landmarks. I think the algorithm should be able to determine visibility, for the most part, from the annotated 3D keypoints. The only problem I see is when the object is occluded by something else: since the dataset the paper uses has a lot of self- and other-occlusions, I don't think the algorithm would be able to learn whether the car was occluded by another object.

  5. This comment has been removed by the author.

  6. The results section does not contain error bars. They must have tried several train/test splits and obtained a range of results. I think it's important to see these error bars, because all of the baselines are comparable to, and sometimes better than, the proposed approach.

    The above echoes a more general sense of unhappiness. Before the results section, I was going to say that this is the marriage of 3D geometric vision and 2D semantic analysis: they reason in 3D to correct for faults in 2D, and at the same time they are able to pick out cars from complicated scenes and predict viewpoint with decent accuracy. But looking at the results, I didn't see the 10-20% jump in performance that I had expected. Is this reason to believe that 3D reasoning (even with full annotation) is not going to help much?

    1. I must clarify that the phrase "not going to help much" means that it will not help the 2D detection task. I do not mean to counter Matt's argument that 3D results are very important for robots. I fully agree with him as I've done some robotics too.

    2. This is a good question. Maybe the 3D information is not helpful for solving this task. Maybe the approach is not leveraging the 3D information to its full potential. Maybe this is the wrong 3D information to be using.

    3. Yeah, maybe they are not using the right 3D information. Or maybe they are trying to handle too much occlusion, which affects the overall model? I would like to see the performance of this algorithm on a simpler dataset with less occlusion (say 20% max). That would give us a better idea of how important 3D reasoning is for 2D detection.

    4. I think they could use some ground-truth 3D information in the model to find out whether 3D information helps at all, and how much it would help if we had better 3D information.

    5. I definitely agree with Aravindh's view that one would expect using 3D geometry here to give a more significant boost in performance. However, the geometric reasoning in this paper seems to have been tacked on rather than properly integrated. As Abhinav mentioned earlier, this part of the paper does not give many details; they just mention that they use some publicly available non-rigid SfM code. I suspect this aspect of the problem was not explored in much depth. Their way of using non-rigid SfM appears clever, but I would have liked to see it integrated with the first stage in a feedback loop rather than as a separate 'refinement' stage. Maybe we don't see the performance jump because, if the output of the first stage is messed up at a few points, the second stage cannot recover.

    6. I think the full annotation is also only in 2D, not 3D. If I recall, they only hand-labeled the landmark locations in the images, and the 3D shape is somehow computed from those using SfM. I'm not completely sure how they use the SfM; from the reading, it seems they use correspondences between the landmarks across images/cars.

      We could see much more of a performance boost if there were some sort of ground truth on the 3D structure (CAD models, etc.) and they iterated between 3D pose estimation and 2D appearance modeling. Right now it feels much more like a pipeline, even though the two are actually tightly coupled.

    7. Also, were there any follow-up papers that applied this to a larger range of objects? It would be easier to see how useful the 3D information is if we could compare across a variety of object categories. 2% seems quite small to me; if the boost is even smaller for other objects, such as chairs, perhaps this stage isn't really helping at all.

    8. I agree that the performance increase is not what you would hope for, especially looking at the precision-recall plot in Figure 3a, where they never really exceed their competitors anywhere on the curve. The authors place their contribution not in object recognition performance, but in being able to fit a model with greater precision than their peers, using a more compact representation, at pretty much the same accuracy.

      It seems that the other techniques in the P-R plot also utilize 3D/model data, so it is interesting that despite the different techniques, the results are all pretty much the same. Does this mean there is no further benefit to this kind of approach? Some analysis of the failure cases would have been nice, in particular to see whether their failures are complementary to their peers'.

    9. That the performance gain is quite small also suggests that the 2D model is already good enough to capture the information.

  7. The training set is very complicated, and so is the test set. I can see that the test set should be complicated to convince users to adopt the system, but the training set could be made simple, right?

    I think of the analogy of teaching a kid about numbers by starting with complex numbers and hoping the kid picks up real numbers, integers, and fractions in the process. I don't think that will work, even if we have plots drawing them on the complex plane (a.k.a. annotation).

    Training with random images is good, but I think people purposefully choose bad images to make their datasets difficult for no reason. Or was it that datasets were more contrived in the past, while now we get them by randomly sampling internet images? If the latter is true, then (1) instead of annotating in so much detail, can we use simpler training data and learn it all (picking simpler images is easier than detailed annotation, in my opinion)? Or (2) are internet images the right training data? Maybe driving a car around Pittsburgh, annotating those images, and then augmenting with internet images would be better.

  8. It makes me wonder whether 20 landmark points are essential, or whether we could do with fewer. I think the important thing for VPC is figuring out the headlights and the windshield. I would've liked a teeny experiment with fewer points (12 seemed reasonable to model a car), just to show what exactly it is that helps them with VPC.

    1. I agree - in general there are quite a few parameters I wish they had varied and measured (#mixtures, #basis shapes, #landmarks). I guess having more landmarks might help deal with some occlusion (as long as you can detect the others), but it's really not clear how many of each you need, and a better way to choose these numbers would be nice.

  9. The idea in this paper of using a two-stage model to recognize objects is really interesting. The first stage uses DPMs to give a 2D estimate, and the second stage uses 3D models to refine it. However, I don't quite understand how the SfM model captures 3D.

    I'm also curious how this model would apply to other object recognition problems. Is there any discussion of what kinds of objects are suitable for it? I suspect only objects with really simple and sharp 3D structure will be suitable for this algorithm; things like animals or clothes may not be.

    1. My speculation about the SfM is that they take the landmarks from each image (K landmarks, each specified by 2 coordinates), stack them into an N x 2K matrix, and hand it off to the SfM software. So they are effectively using the landmark positions across images/different cars as correspondences to learn a "3D morphable basis". My guess is that this captures how the landmarks sit in 3D space with respect to each other - some sort of 'average' shape. Since they hand-annotated 20 landmarks in each image, I wonder if they seed this with some ground-truth knowledge of the 3D positions.
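
      Under that speculation, the input would be the classic factorization measurement matrix; a sketch (hypothetical shapes and names, not their actual preprocessing):

      ```python
      import numpy as np

      def measurement_matrix(landmarks):
          """landmarks: (n_images, n_landmarks, 2) annotated 2D keypoints,
          with the same landmark identity in each column (these are the
          'correspondences'). Returns the centered 2N x K matrix that
          NRSFM factorization codes typically take; under the linear-basis
          model its rank is at most 3 * n_basis, so its numerical rank
          hints at how many basis shapes the data supports."""
          n, k, _ = landmarks.shape
          W = landmarks.transpose(0, 2, 1).reshape(2 * n, k)  # x-row, y-row per image
          return W - W.mean(axis=1, keepdims=True)  # remove per-image translation
      ```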

    2. I agree with you. The effectiveness of the proposed algorithm should also be evaluated on other objects to test its generalization. What characteristics should an object satisfy? How about objects with quite smooth boundaries?

  10. It was cool that they were able to use 2D images to reason about 3D structure, but I'm not convinced that the 3D buys them a lot (~2%). That isn't bad, but it seems like 3D structure, which defines our world, should be able to get us more.
    Also, it would be interesting if future work looked at detecting the landmarks automatically. Picking good landmarks that work well for interpreting 3D structure would be difficult, but for the system to become easily usable on any database of category images, it has to be more unsupervised. Also, it was unclear what their 3D basis shapes were: were they crafted to be car-like shapes, or were they more generic?

    1. You could do something like this for non-occluded things (or, if your dataset had all views of the classes, it could possibly work for occluded things as well): http://graphics.cs.cmu.edu/projects/discriminativePatches/

    2. @nick I've actually tried doing this for cars. It turns out that non-rigid SfM algorithms need pretty accurate correspondences. If you take a look at the sort of correspondences you get:

      http://ladoga.graphics.cs.cmu.edu/cdoersch/hn25/france_run/ac_cars3_out/bestbinsort1/bbhtml.html

      you can see that there are occasional erroneous detections, which really mess up SfM (the algorithm used in today's paper doesn't handle outliers). The other thing to note is that HOG is designed to be deformation-invariant, so the registration isn't very good. That's where I tried using SIFT-flow to see if I could get better correspondences, but it never worked well enough for SfM to succeed.

  11. This paper describes an interesting way of inferring 3D structure from 2D image detections. I thought it was interesting that, even though SfM usually suffers from occlusion, their algorithm can handle occluded landmarks, since the 2D model provides estimates of the occluded landmarks' locations.

    One thing to note is that they make two important assumptions about the scene: 1) the depth variation of objects is small compared to the distance from the camera (which seems pretty reasonable for most outdoor scenes), and 2) object instances can be written as linear combinations of a few basis shapes. The second assumption seems more limiting. I can see it working for something with fairly consistent intra-class variation, like the cars they tested on, but I'm not sure how it would work for other classes; it's hard to tell because I'm not really sure what these basis shapes represent (I wish they had shown what some of them looked like).

    1. Yes, assuming that 3D objects can be written as a linear combination of basis shapes severely limits the variation those objects can exhibit. I would like to see a DPM where the parts are allowed to move in 3D, but I guess the difficulty is that you then have to model how the 2D features change.

  12. I think this is an interesting paper. I believe 3D structure and geometry information can be derived as latent variables from 2D images. I long ago learned that humans have two eyes simply because they give us the ability to understand the 3D world, but I am not convinced by that argument. What about people who are born with a single eye? What about animals whose eyes are not at the front but on the sides? They are able to navigate the world as well. Two eyes extend our ability to perceive the world from different aspects, and I believe the underlying structure that processes the information, and thus supports vision, can handle animals having more eyes; they are just inferring the underlying latent variables. Another supporting fact is that humans cannot reason about the world as accurately as computers do. Take a truncated triangular pyramid (http://www.korthalsaltes.com/photo/truncated_triangular_pyramid.jpg): even if it is not an object that could exist in the real world (e.g., the extensions of the lines do not join together), humans can still be fooled easily. This indicates humans are not that good at *accurately* inferring the latent structure, though they tend to infer in the right direction. Computers, on the other hand, just don't know what to do, even though, given the right instruction (join the edges to see if it is a real object in the world), they can do better on accuracy.

  13. I invite everyone to take a quick look at this related paper from CMU, presented at this year's CVPR, entitled "Correlation Filters for Object Alignment": http://www.cs.cmu.edu/~vboddeti/papers/cvpr_2013.pdf In that paper, a landmark detection model based on correlation filters is proposed, which is robust to occlusions as well. I think the object alignment paper and the 3D object analysis paper under discussion could be used in tandem: the appearance model needs a fairly good initialization, which the 3D analysis paper can provide, and the correlation-filter-based appearance model can then correct the estimation error further. It would be good to see how these two can be fused.

  14. Given the name, the amount of actual "deformation" happening in DPM seems very limited: Why can't the parts shear? Why can't the parts rotate? I understand that such things complicate the inference step, but I'd say object detection schemes should be invariant to a set of transforms including at least rotation, reflection, and scaling.

  15. I like the idea of using landmarks in the ground truth to establish frames for 3D models, but to echo Arun's sentiment, it seems suboptimal that they are chosen by humans.

    It seems like 3D model validation could be a good method for discarding object detections: if you can fit a 3D model to a proposed detection easily, it's likely a true detection, while for false positives or poorly overlapping detections, model fitting should be difficult and low-scoring.
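
    As a sketch of that verification idea (hypothetical; fit_fn stands in for any 3D fitting routine, e.g. a basis-shape refinement like the one sketched earlier in this thread):

    ```python
    import numpy as np

    def verify_detection(keypoints_2d, basis, fit_fn, max_residual=5.0):
        """Keep a candidate detection only if a 3D shape model explains
        its 2D keypoints: fit_fn returns (coefficients, fitted 2 x n
        projections), and the mean reprojection residual in pixels
        scores the detection."""
        _, fitted = fit_fn(keypoints_2d, basis)
        residual = np.linalg.norm(keypoints_2d - fitted, axis=0).mean()
        return residual < max_residual, residual
    ```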

    1. I was just thinking the same thing: perhaps the machine can pick better landmarks for itself. I thought I saw something at MISC where the algorithm selected its own bounding boxes for detection, and we could see that for cars it included a bit of the road in the box, as opposed to the tight bounding box a human would have given it. Was it you who gave that talk? I don't remember the details, but could something similar be applied to choosing the landmarks?

    2. @ada was it this one?

      http://www.cs.cmu.edu/~cdoersch/precvpr2013.pdf

      If so, that was me :-)

  16. This paper proposes an extension of the DPM model to 3D scenarios. The proposed method is a two-stage model. In the first stage, 2D shape and appearance variations are reasoned about through dynamic programming, which finds the maximum score over a limited set of configurations. In the second stage, the inferred 2D landmarks are regularized by assuming a weak-perspective camera model and modeling the shape as a linear combination of basis shapes.
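
    For reference, the stage-one maximization over a tree of parts can be done exactly with dynamic programming. Below is a brute-force toy version, my own illustration rather than the authors' code; real implementations replace the inner max with distance transforms:

    ```python
    import numpy as np

    def tree_dp(scores, children, deform, root=0):
        """Exact max-sum inference on a tree of parts.

        scores:   list of (H, W) numpy appearance maps, one per part.
        children: children[i] lists the child parts of part i.
        deform:   deform(parent_loc, child_loc) -> spring cost.
        Brute force is O(n * (H*W)^2); distance transforms give O(n * H*W)."""
        H, W = scores[root].shape
        locs = [(y, x) for y in range(H) for x in range(W)]

        def message(i):
            # best subtree score rooted at part i, per candidate location
            m = scores[i].astype(float).copy()
            for c in children[i]:
                mc = message(c)
                for (y, x) in locs:
                    m[y, x] += max(mc[yc, xc] - deform((y, x), (yc, xc))
                                   for (yc, xc) in locs)
            return m

        return message(root).max()
    ```
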
    Pros:
    The overall niche of this paper is that it extends the generation of possible model configurations from 2D to 3D. In this way, occlusions, appearance variation, and shape variations are handled in a much better-organized way than by assuming a single-view detector. This accords with human understanding: single-view detection (or direct matching) should be used at smaller scales, for more discriminative and less variant patches, while globally we may include higher-order reasoning.
    The experiments show really good performance in handling occlusions as well as pose estimation.
    Cons:
    Training requires samples given as image-landmark triplets, unlike the original DPM, which directly infers these parameters as latent states in a latent SVM. This greatly increases the difficulty of training. I suspect the model is too complicated to train with a latent SVM, as it might overfit badly and converge to incorrect training configurations.
    A big question for this paper: is it really necessary to infer a full 3D model even for occluded parts? Would it be better if sub-category classification were combined with 3D inference? Say we use sub-category classification to coarsely bin objects into different categories and viewing angles, and then infer 3D information only for the non-occluded parts.

  17. I think it's really interesting that the authors model the 3D shape of the object for better detection, and it definitely seems such an approach is better able to handle multiple viewpoints and occlusion than algorithms that use 2D information only. My concern is that I don't see how this could ever be generalized to object detection on a larger scale. A significant amount of human time is required to create enough training data for detecting cars alone. How much time would it take to provide enough training data for the 20 classes in the PASCAL dataset? And even then, we would only be able to detect 20 objects.

    Granted, other algorithms are maybe not much better, since they at least require bounding boxes, and perhaps require more labeled images than this method, resulting in equal or more time spent labeling. I simply think we should consider how easy it is to provide training data for a particular algorithm; those for which we will never be able to provide sufficient training data should maybe not be pursued.

    However, maybe if this algorithm could choose its own landmarks, instead of needing each of them annotated specifically, it would generalize more easily.

  18. Creating appearance templates for occluded landmarks is a really interesting idea, and unfortunately the paper doesn't analyze how much these occlusion templates contribute to the accuracy of the algorithm. However, assuming that "typical" occlusion templates exist, it would be interesting to see whether a link can be established between overall scene appearance and occluded parts. In other words, is it possible to use parts trained on occluded landmarks to say something about the object's interaction with the entire scene, or even the layout of the scene as a whole?

  19. The nice things about this paper:
    1. The idea of exploiting 3D constraints on top of the DPM detections (more specifically, a variant of DPM), which helps prune 2D appearance detections that do not respect the 3D structure.
    2. A specifically designed energy function for 2D appearance inference that captures intuitions about occlusion.

    Cons:
    1. According to my understanding (please correct me if I am wrong), due to the two-step inference procedure, the 3D information is only used to prune detections that do not make sense; it is not involved in the process of determining the 2D detections. Perhaps some approximate joint inference could be used here (I know people may disagree)?
    2. Lack of diagnostic analysis: we do not know what's wrong, since we only get a 2% gain even though we expect more from 3D information. Is it because the 2D correspondence fails, or because we need more robustness in the 3D registration?
