M. Hejrati, D. Ramanan. Analyzing 3D Objects in Cluttered Images. Neural Info. Proc. Systems (NIPS), Lake Tahoe, NV, Dec 2012.
And optionally:
Bojan Pepik, Michael Stark, Peter Gehler, Bernt Schiele, Teaching 3D Geometry to Deformable Part Models, CVPR 2012.
 
Analyzing 3D Objects in Cluttered Images
Mohsen Hejrati, Deva Ramanan
This paper tackles the problem of 3D understanding and reasoning about objects in 2D images. In particular, the authors focus on cars. First, they use a variant of DPM and the flexible mixture-of-parts model (see Yang & Ramanan) to detect cars and predict keypoints corresponding to a 3D structure. Then, they enforce SFM-like constraints on these keypoints for refinement.
2D shape and appearance
Some background: DPM models objects using global mixtures (multiple models for each object), and each mixture has its own set of parts. On the other hand, [Yang & Ramanan]’s work on pose estimation uses just one model (a single mixture) for modelling humans, but each part has local mixtures (multiple appearance models for each part); they call this a flexible mixture-of-parts.
This paper combines the global mixtures from DPM and the local mixtures from [Yang & Ramanan] and calls the result a compositional model. The difference is that there are no separate parts for each global mixture. There is one set of mixtures-of-parts, and each global mixture can use (compose or cut-and-paste) the parts from this one set.
They start with a dataset of cars which have labeled 3D keypoints in 2D images. Because of the 2D projection of a 3D object, half of the 3D keypoints are not actually visible in the image, so they have an additional visible-or-occluded annotation for each keypoint (since humans might be wrong in marking these occluded keypoints, they assume these locations are latent and update them during training). They initialize one part for each of these keypoints and train a DPM-like mixture model. They add an additional co-occurrence term, which captures the intuition that some parts are almost always occluded together, e.g., if the left-rear wheel is not visible, then the left tail light is probably not visible either.
One major point (according to me) is that they still train appearance models for occluded parts. I see how this is helpful (if a part is occluded, there might still be local appearance evidence for that), but I would have still liked to see this turned on and off to understand it better. I think the authors do a very nice job of explaining the model in Section 3 (local, relational and global model), so I won’t repeat it here. I would just mention that as opposed to the previous works, they want to reason about all the geometric configurations and occlusion states of parts.
3D shape and viewpoint
This paper enforces SFM-like constraints, for refinement, on the keypoints that were predicted by the DPM-like 2D model. They assume that each car can be represented by a combination of 3D basis shapes. They minimize the difference between the detected keypoint positions and the projection of the combination of basis shapes.
I thought that details were missing with regard to their basis shapes. Unlike the previous section, I think this section lacked clarity, description and a convincing argument.
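To make the refinement step above concrete, here is a minimal sketch of the inner least-squares problem: fitting basis-shape coefficients so that the projected 3D shape matches the detected 2D keypoints. It assumes the camera (rotation, scale, translation) is already known and that the shape is a plain weighted sum of basis shapes; the function name and argument layout are hypothetical, and the paper's actual optimization (which must also estimate the camera) is likely more involved.

```python
import numpy as np

def fit_shape_coefficients(landmarks_2d, basis, rotation, scale, translation):
    """Least-squares basis coefficients under a known weak-perspective camera.

    landmarks_2d: (K, 2) detected keypoints
    basis:        (B, K, 3) basis shapes (assumed given)
    rotation:     (2, 3) first two rows of a rotation matrix
    scale:        float, weak-perspective scale
    translation:  (2,) image-plane offset
    """
    # Project every basis shape into the image: (B, K, 3) @ (3, 2) -> (B, K, 2)
    projected_basis = scale * (basis @ rotation.T)
    # Each basis shape becomes one column of a (2K, B) design matrix
    design = projected_basis.reshape(len(basis), -1).T
    # Remove the translation so the target is linear in the coefficients
    target = (landmarks_2d - translation).ravel()
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coeffs
```

With only a handful of basis shapes (the paper uses 5) and 20 keypoints, this system is heavily overdetermined, which is one reason a few badly localized keypoints cannot pull the fitted shape very far.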
My take:
Pros:
- Overall, I love how the authors set up the compositional model, with one bag of parts and each global mixture choosing what it likes for that viewpoint.
- Enforcing constraints from SFM seems like a good start (also see cons).
- The paper is well written for the most part and does a good job of describing the various learning components.
- I like how they can reason about all the geometric configurations and occlusion states of parts.
Cons:
- I would have liked to see more description and details in Section 4 (regarding basis etc.)
- Need for 3D landmark annotation.
- More diagnosis on some design decisions (like training or not for occluded parts)
- Some other things I would like to see in experiments: They have 723 cars in total. They use 50 global mixture components, which means on average they have only about 14 cars per component. The number of global mixtures seems extremely high compared with other DPM works. Also, there are 20 parts, each with 9 local mixtures. For diagnosis, I would have liked to see how the performance varies with the number of global and local mixture components. Also, how do they choose the basis shapes, and why just 5? They mention “We found results relatively robust to these settings”, but some quantification would be good (at least for global mixtures). For baselines, they don’t mention what settings they use for DPM. If standard, they use 6-8 mixture models, but they could have trained 50 or 25 mixture models to compare against their 50 mixture models.
In conclusion, I think that this paper has good ideas and techniques which start to look at the problem of 3D understanding of objects from 2D images! First thing to try is to label things (3D keypoints in this case) and see how well can we use it. But moving forward, I would like to see works that don’t require this annotation, or at least works with limited annotation and scales to images without any specific annotation like this.
I have to admit that I am confused as to what was meant by "basis shapes," and was led to think they meant some number of geometric primitives. However, this wasn't entirely clear to me. It seems like they just used a simple a-priori 3D model. In that sense it is similar to a 3D pictorial model like the one presented in the last class.
-- Matt K
The paper mentions non-rigid structure from motion (NRSFM) as a regressor of sorts to learn the basis shapes. My understanding is that the 3D points are approximated as a weighted linear combination of basis shapes, and that the basis shapes are selected via data subspace methods like PCA. Their choice of 5 shapes does seem very arbitrary, however.
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4359359
Given the relation between SFM motion rank deficiencies and articulated structures, it would be interesting to see if the 3D model could be extended to learn and handle large deformations, such as the Ferrari doors.
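As a rough illustration of the "basis shapes via PCA" reading above, the sketch below extracts a mean shape and a few principal deformation modes from a set of 3D landmark configurations. It assumes the 3D shapes are already available and aligned, which is a simplification: NRSFM has to recover the basis from 2D observations only. The function name is hypothetical.

```python
import numpy as np

def pca_basis_shapes(shapes, num_basis):
    """Mean shape and top principal deformation modes.

    shapes:    (N, K, 3) aligned 3D landmark sets (assumed given)
    num_basis: number of basis shapes to keep
    Returns (mean, basis) with shapes (K, 3) and (num_basis, K, 3).
    """
    n, k, _ = shapes.shape
    flat = shapes.reshape(n, -1)              # one row per instance
    mean = flat.mean(axis=0)
    # Right singular vectors of the centered data are the PCA directions
    _, _, vt = np.linalg.svd(flat - mean, full_matrices=False)
    basis = vt[:num_basis]
    return mean.reshape(k, 3), basis.reshape(num_basis, k, 3)
```

Inspecting how quickly the singular values decay would be one way to judge whether 5 basis shapes is enough for a given category.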
My understanding is that basis shapes help handle outliers in the 2D detection.
Suppose that a landmark is badly localized. It will skew the least squares optimization done for 3D understanding because no explicit cost terms are used to manage outliers. The algorithm gets away without this problem because it forces everything to be inside the narrow search space of convex combinations of a very small number of basis shapes. They are using a small number because they don't want the model to be very expressive --- in my opinion.
Exactly, but how do they come up with these basis shapes? Are we sure such a small number of basis shapes is enough for all objects?
[trying to justify 50 mixture models]
Actually, if you look at fig.1 and imagine rotating the car, you'll see how the ratios between different edges (that is, springs) vary, even when rotating by a small angle. So to me it makes sense that every such angle is a mixture component. They describe it as "quantized viewpoint". Another thing is that the database looks tiny for 50 mixtures.
What I don't know is how just 3 mixtures - left, front, right - can describe a face that looks at you. There must be so many spring "distortions".
I think Abhinav mentioned this during his presentation that for the DPM model, different objects seem to perform better with different numbers of parts. I would assume that the same reasoning holds true for this case; that different objects will have differently sized mixture models and basis shapes. For example, a table or car might have relatively few basis shapes, while a person or cat would need many more.
Completely agree with Abhinav on experiments varying #parts and #mixtures. I think it puts the baseline at a real disadvantage, especially given the fact that they handpicked viewpoint variant training images, and the 2D DPM would have had a tough time with it (not having enough data to bias it toward a good part).
Coming back to basis shapes. Objects are not a linear combination of a few basis shapes, as we can generate a very wide range of poses. But we need to include parametric bias in the system to be able to gain something in the 2D detection by reasoning in 3D. Otherwise no new information is added by 3D reasoning and not much improvement will be seen for landmark localization and AP scores.
I think their idea about basis shapes is interesting and reasonable. In other cases, when the scenario is a little complex, we typically try to construct a set of basis elements to span the whole space and simplify the overall problem. However, one concern here is the lack of a concrete justification for the number of basis shapes. At the least, I think it would be better to run a parameter sweep to verify empirically that their choice of basis number is suitable. This also applies to other parameters in this paper, such as the number of mixtures and the number of landmarks.
In Figure 2, the viewpoint label errors appear to have two distinct modes: one at the median of 9 degrees, and another around 180 degrees. Considering that a car is bilaterally symmetric, I wonder if this error mode is significant. Is this a failing of the system, or just an artifact because the fronts and rears of cars are similar? Looking at the landmarks/keypoints in Figure 4, I suspect this may be related to their choice of car landmarks being too symmetric.
I would also like to see what the distribution of # of objects vs. viewpoint labels was, to understand that figure better.
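One way to separate the two error modes discussed above is to compute the viewpoint error with wraparound and then flag errors that sit near 180 degrees as front/back confusions. The sketch below is only an illustration; the 20 degree tolerance is an arbitrary, hypothetical choice and is not taken from the paper.

```python
def viewpoint_error_deg(pred, truth):
    """Smallest absolute angular difference in degrees, in [0, 180]."""
    diff = abs(pred - truth) % 360.0
    return min(diff, 360.0 - diff)

def is_front_back_flip(pred, truth, tol=20.0):
    """True if the error is within `tol` degrees of a 180 degree flip.

    `tol` is a hypothetical threshold chosen for illustration.
    """
    return abs(viewpoint_error_deg(pred, truth) - 180.0) <= tol
```

Reporting the fraction of test cars flagged as flips alongside the median error would show how much of the second mode is plain front/rear ambiguity.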
In this work, the authors explore combining 2D part based models with 3D geometric knowledge about the objects being detected. They do so by creating a two stage recognition and localization pipeline. In the first stage of the pipeline, they construct a 2D deformable parts model based on mixtures of trees. This model is then applied to a scene for recognition purposes. In the second stage, the authors use an a-priori association of each part in the tree with a point on a deformable 3D model. This is used in an expectation maximization algorithm to adjust and perfect the 3D match of the object to the image. This allows them to localize objects under heavy clutter, and provides excellent performance in comparison with state of the art 2D approaches.
I believe this paper is going in the right direction. As a roboticist primarily interested in using vision algorithms to allow autonomous behavior in a robot, 2D approaches are limited -- nearly useless in fact -- in the amount of knowledge they give the robot. What we really want is 3D knowledge about the scene so that we can navigate through the scene and interact with objects. For this, approaches which give us more than a simple bounding box are absolutely essential.
One concern I have with this algorithm is their reliance on a simple a-priori 3D model to detect cars. I am wondering how their approach scales to more objects.
-- Matt Klingensmith
The algorithm in this paper constructs a simple 3D model of cars by identifying landmark points (both visible and occluded) in single 2D images by using a compositional representation of cars. The authors have implemented a two-stage process where the first stage identifies the visible and occluded landmarks using 2D shape and appearance (taking into account viewpoint and intra-class variations). The second stage takes the output of the first stage and refines it further using 3D geometric techniques like non-rigid SFM. This paper attempts to integrate the goals of early 3D geometry based approaches and recent DPM - like methods that build a statistical model of appearance in 2D, towards object recognition.
Points of interest to me:
1. The way occluded landmarks are handled - the authors' idea that the classifier can decide for itself whether to ignore it or to find features correlated with occlusions is very interesting, and also one that I haven't come across before. In addition, the model's capability to detect occluded landmarks is really good.
2. Using non-rigid structure from motion to model intra-class and viewpoint variation and refine the predicted landmarks is a novel idea to me. I've only seen non-rigid SfM being used on deformations of the same object over time.
What I would have liked to see/possible future work:
1. The training requires a visibility flag that tells the algorithm if a point is occluded or not. Rather than completely doing away with landmark annotation, it would be good if the algorithm can at least work without the occlusion flags. In fact, it would be great if it didn't require occluded parts to be explicitly labeled, but somehow identified them from 3D geometry.
2. The separation of the process into two stages was maybe good for a start. But ideally, I think applying the 3D geometry constraints and 2D appearance based identification should be done simultaneously within an iterative process.
I like your idea of doing away with visibility annotation for landmarks. I think that the algorithm should be able to determine these, for the most part, through the annotated 3D keypoints. The only problem I see with this is if the object is occluded by something else. Since the dataset the paper uses has a lot of self/other occlusions, I don't think the algorithm would be able to learn whether or not the car was occluded by another object.
The results section does not contain error bars. They must have tried several train/test splits and obtained a range of results. I think it's important to see these error bars because all of the baselines are comparable to, and sometimes better than, their proposed approach.
The above echoes a more general sense of unhappiness. Before the results section, I was going to say that this is the marriage of 3D geometric vision and 2D semantic analysis. They are reasoning in 3D and correcting for faults in 2D. At the same time they are able to pick up cars from complicated scenes and predict viewpoint with decent accuracy. But looking at the results I didn't see the 10-20% jump in performance that I had expected. Is this reason to believe that 3D reasoning (even with full annotation) is not going to help much?
I must clarify that the phrase "not going to help much" means that it will not help the 2D detection task. I do not mean to counter Matt's argument that 3D results are very important for robots. I fully agree with him as I've done some robotics too.
This is a good question. Maybe the 3D information is not helpful for solving this task. Maybe the approach is not leveraging the 3D information to its full potential. Maybe this is the wrong 3D information to be using.
Yeah, maybe they are not using the correct 3D information. Or I think they are trying to handle too much occlusion in this problem, which probably affects the overall model? I would like to see the performance of this algorithm on a simpler dataset where there is less occlusion (say 20% max). That would give us a better idea of how important 3D reasoning is for 2D detection.
I think they could try to use some ground truth 3D information in the model to find out whether the 3D information helps at all, and how much more it would help if we had better 3D information.
I definitely agree with Aravindh's view that one would expect using 3D geometry in this problem to give a more significant boost in performance. However, the geometric reasoning used in this paper seems to have been just tacked on and not properly applied. As Abhinav mentioned earlier, this part of the paper does not give too many details and they just mention that they use some publicly available non-rigid SfM code. What I suspect is that maybe this aspect of the problem was not explored in too much detail. Their way of using non-rigid SfM appears clever, but I would have liked to see it being integrated with the first stage into a feedback-loop kind of process rather than as a separate 'refinement' stage. Maybe we don't see the performance jump because if the output of the first stage is messed up at a few points, the second stage might not be able to recover from that.
I think the full annotation is also only in 2D - not 3D. If I recall, they only hand labeled the landmark locations in the images and the 3D shape is somehow computed using SFM. I'm not completely sure how they use the SFM (from the reading, it seems that they use correspondences between the landmarks across images/cars).
We could see much more of a performance boost if there was some sort of ground truth on the 3D structure (CAD models, etc) and they did some iteration with 3D pose estimation and 2D appearance modeling. It seems much more like a pipeline now even though the two are actually tightly coupled.
Also, were there any follow-up papers that applied this to a larger range of objects? It would be easier to see how useful the 3D information is if we could compare across a variety of object categories. 2% seems quite small to me. If the boost is even smaller for other objects such as chairs, perhaps this stage isn't really helping at all.
I agree that the performance increase is not what you would hope, especially looking at the precision-recall plot in figure 3a where they really never exceed their competitors anywhere on the curve. The authors place their contribution not in object recognition performance, but rather that they can fit a model with greater precision than their peers using a more compact representation with pretty much the same accuracy.
It seems that the other techniques in the P-R plot also utilize 3D/model data, so it is interesting that despite different techniques, the results are all pretty much the same. Does this mean that there is no further benefit to this kind of approach? Some analysis on the failure cases would have been nice, in particular to see if their failures are complementary to their peers.
That the performance gain is quite small also suggests that the 2D model is good enough to capture the information.
The training set is very complicated and so is the test set. I can see that the test set should be complicated to convince users to use the system, but training sets can be made simple, right?
I consider the analogy of trying to teach a kid about numbers and starting with complex numbers and hoping that the kid understands real numbers, integers and fractions in the process. I don't think that will work --- even if we have plots drawing them on the complex plane (aka annotation).
Training with random images is good, but I think people are purposefully choosing bad images to make their datasets difficult for no reason. Or was it that datasets were more contrived in the past but are now what we get by randomly sampling internet images? If the latter is true, then (1) instead of annotating in so much detail, can we use simpler training data and learn it all (picking simpler images is easier than detailed annotation, in my opinion), or (2) are internet images the right training data? Maybe driving a car around Pittsburgh, annotating the images, and then using internet images to augment this would be better.
It makes me wonder if 20 landmark points are essential, or can we do with fewer? I think the important thing for VPC is figuring out the headlights and the windshield. I would've liked a teeny experiment with fewer points (12 seemed reasonable to model a car), just to show what exactly it is that helps them with VPC.
I agree - in general there are quite a few parameters I wish they had varied / measured (#mixtures, #basis shapes, #landmarks). I guess having more landmarks might help deal with some occlusion (as long as you can detect the others), but it's really not clear how many of each thing you need, and having a better way to choose these numbers would be nice.
The idea in this paper of using a two-stage model to recognize objects is really interesting. The first stage uses DPMs to give a 2D estimate and the second stage uses 3D models to refine that estimate. However, I don't quite understand how the SFM model can model 3D.
I'm also curious about how this model could apply to other object recognition problems. Is there any discussion of what kinds of objects may be suitable for this model? To me, it seems that only objects with really simple and sharp 3D models will be suitable for this algorithm, while things like animals or clothes may not be.
My speculation with the SFM is that they take the landmarks from each image (K landmarks), each specified by 2 coordinates. Then they just stack them into an Nx2K matrix and hand it off to the SFM software. So it seems that they are effectively using the landmark positions across images/different cars as correspondences to learn a "3D morphable basis". My guess is that this is how the landmarks are in 3D space with respect to each other - some sort of 'average' transform. Since they hand annotated 20 landmarks in each image, I wonder if they seed this with some sort of ground truth knowledge of 3D positions.
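The stacking speculated about above could look like the following sketch. It is a guess at the data layout, not the authors' code: the row layout (all x coordinates, then all y coordinates) and the per-image centering used to remove translation are assumptions, and the function name is hypothetical.

```python
import numpy as np

def stack_landmarks(landmarks):
    """Stack per-image 2D landmarks into an (N, 2K) measurement matrix.

    landmarks: (N, K, 2) array, N images with K annotated landmarks each.
    Each row is [x_1..x_K, y_1..y_K], centered per image so that the
    image-plane translation drops out before factorization.
    """
    centered = landmarks - landmarks.mean(axis=1, keepdims=True)
    return np.concatenate([centered[..., 0], centered[..., 1]], axis=1)
```

The singular values of this matrix, e.g. via `np.linalg.svd(W, compute_uv=False)`, give a quick look at how low-dimensional the landmark configurations actually are across cars.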
I agree with you. The effectiveness of the proposed algorithm should also be evaluated on other objects to see how well it generalizes. What kinds of characteristics should the object satisfy? How about objects with quite smooth boundaries?
It was cool that they were able to use 2D images to reason about 3D structure. I'm not convinced that the 3D buys them a lot (~2%). This isn't bad, but it seems like 3D structure, which defines our world, should be able to get us more.
Also, it would be interesting if their future work looked to automatically detect the landmarks. Picking good landmarks that work well for interpreting 3D structure would be difficult. However, for the system to become easily usable with some database of images of a category, it has to be more unsupervised. Also, it was unclear to me what their 3D basis shapes were. Were they crafted to car-like shapes or were they more generic?
You could do something like this for non-occluded things (or if your dataset had all views of classes, then it could possibly work for occluded things as well) http://graphics.cs.cmu.edu/projects/discriminativePatches/
@nick I've actually tried doing this for cars. It turns out that non-rigid SFM algorithms need pretty accurate correspondences. If you take a look at the sort of correspondences you get:
http://ladoga.graphics.cs.cmu.edu/cdoersch/hn25/france_run/ac_cars3_out/bestbinsort1/bbhtml.html
you can see that there are occasional erroneous detections, which really mess up SFM (the algorithm used in today's paper doesn't handle outliers). The other thing to note is that HOG is designed to be deformation-invariant, and so the registration isn't very good. This is where I tried using SIFT-flow to see if I could get better correspondences, but it never worked well enough for SFM to work.
This paper describes an interesting way of inferring 3d structure from 2d image detections. I thought it was interesting that even though SFM usually suffers from occlusion, their algorithm can handle occlusion of landmarks since the 2D model provides information about estimated locations of occluded landmark points.
One thing to note is that they make two important assumptions about the scene: 1. the depth variation of objects is small compared to the distance from the camera (which I think seems pretty reasonable for most outdoor scenes), and 2. object instances can be written as linear combinations of a few basis shapes. The second assumption I think is a little more limiting. I can see this working for something with fairly consistent intra-class variation like the car they tested on, but I'm not sure how it would work on some other classes - it's hard to tell because I'm not really sure what these basis shapes are representing (I wish they had shown what some of these shapes looked like).
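The first assumption above (weak perspective) can be checked numerically. The sketch below compares full perspective projection with its weak-perspective approximation, which replaces each point's depth by the object's mean depth. The pinhole model with a single focal length and the function names are assumptions made for illustration.

```python
import numpy as np

def perspective_project(points, focal=1.0):
    """Pinhole projection of (K, 3) camera-frame points; z is depth."""
    return focal * points[:, :2] / points[:, 2:3]

def weak_perspective_project(points, focal=1.0):
    """Same projection, but every point shares the object's mean depth."""
    return (focal / points[:, 2].mean()) * points[:, :2]

def max_projection_gap(points, focal=1.0):
    """Largest absolute coordinate difference between the two models."""
    gap = perspective_project(points, focal) - weak_perspective_project(points, focal)
    return float(np.abs(gap).max())
```

For an object of fixed size, the gap shrinks roughly with the square of the distance, so the approximation is mild for cars seen from across a street and worse for close-up views.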
Yes, assuming that the 3D objects can be written as a linear combination of basis shapes severely limits the variation in those objects. I would like to see someone do DPM but where the parts are allowed to move in 3D, but I guess the difficulty is that you then have to model how the 2D features change.
I think it is an interesting paper. I believe 3D structure and geometry information can be derived as a latent variable from 2D images. I was taught long ago that humans have two eyes simply because it gives us the ability to understand the 3D world, but I am not convinced by that argument. What about people who are born with a single eye? What about animals whose eyes are on the sides of the head rather than the front? They are able to navigate the world as well. Two eyes extend our ability to perceive the world from different aspects, and I believe the underlying structure that processes the information and thus supports vision can also cope when animals have more eyes. They are just inferring the underlying latent variables. Another fact that supports this is that humans are not able to reason about the world as accurately as computers do. Take a truncated triangular pyramid (http://www.korthalsaltes.com/photo/truncated_triangular_pyramid.jpg): even if it is not an object that can exist in the real world (the extensions of the lines do not meet at a point), humans can still be fooled easily. This indicates that humans are not that good at *accurately* inferring the latent structure, but they tend to infer in the right direction. Computers, however, just don't know what to do, even though, given the right instruction (please extend the edges to see if it is a real object), they are able to do better on accuracy.
I invite everyone to take a quick look at this related paper from CMU presented at this year's CVPR, entitled "Correlation Filters for Object Alignment". http://www.cs.cmu.edu/~vboddeti/papers/cvpr_2013.pdf In this paper, a landmark detection model based on correlation filter features is proposed, which is robust to occlusions as well. I think the object alignment paper and the 3D object analysis paper under discussion can be used in alternation, in the sense that the appearance model needs a fairly good initialization, and the 3D analysis paper can give a good estimate. With that, the correlation filter based appearance model can correct the estimation error further. It would be good to see how these two can be fused together.
Given the name, the amount of actual "deformation" happening in DPM seems very limited: Why can't the parts shear? Why can't the parts rotate? I understand that such things complicate the inference step, but I'd say object detection schemes should be invariant to a set of transforms including at least rotation, reflection, and scaling.
I like the idea of using landmarks in the ground truth to establish frames for 3D models, but to echo Arun's sentiment, it seems perhaps nonoptimal that they are chosen by the humans.
It seems like 3D model validation could be a good method for discarding object detections. If you can fit a 3D model to a proposed detection easily, then it's likely a detection, however, if you have false positives or poorly overlapping detections, model fitting should be difficult and low scoring.
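The validation idea above could be prototyped as a reprojection check: after fitting a 3D shape and camera to a detection, measure how far the projected landmarks land from the detected ones and reject detections with a large residual. Everything here is a hypothetical sketch; the weak-perspective camera, the normalization by the box diagonal, and the 10% threshold are illustrative assumptions, not something the paper reports.

```python
import numpy as np

def reprojection_rmse(landmarks_2d, shape_3d, rotation, scale, translation):
    """RMS distance between detected landmarks and the projected 3D fit.

    landmarks_2d: (K, 2), shape_3d: (K, 3), rotation: (2, 3),
    scale: float, translation: (2,).
    """
    projected = scale * (shape_3d @ rotation.T) + translation
    squared_dist = np.sum((landmarks_2d - projected) ** 2, axis=1)
    return float(np.sqrt(squared_dist.mean()))

def keep_detection(rmse, box_diagonal, max_relative_error=0.1):
    """Keep a detection if the residual is small relative to its size.

    `max_relative_error` is a hypothetical threshold.
    """
    return rmse <= max_relative_error * box_diagonal
```

Because the basis-shape model is deliberately inexpressive, a false positive should leave a large residual no matter which coefficients are chosen, which is what makes the residual usable as a rescoring signal.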
I was just thinking the same thing - that perhaps the machine can pick better landmarks for itself. I thought I saw something at MISC where the algorithm selected its own bounding boxes for detection and we could see that for cars, it included a bit of the road in the box as opposed to using a tight bounding box, which is what humans would have given it. Was it you who gave this talk? I don't remember the details, but could something similar be applied for choosing the landmarks?
@ada was it this one?
http://www.cs.cmu.edu/~cdoersch/precvpr2013.pdf
If so, that was me :-)
This paper proposes an extension of the DPM model to 3D scenarios. The proposed method is a two-stage model. In the first stage, the 2D shape and appearance variations are reasoned about through dynamic programming, which finds the maximum score in a limited set of configurations. In the second stage, the inferred 2D landmarks are regularized by assuming a weak-perspective camera model and assuming the 3D shape is a linear combination of basis shapes.
Pros:
The overall niche of this paper is that it extended the generation of possible model configurations from 2D to 3D. In this case occlusions, appearance variation as well as shape variations are handled in a much better and organized way than assuming a single view detector. This indeed comes in accordance with human understanding, as single view detection (or direct matching) should be used for smaller scales, more discriminative and less variant patches, while globally we may include higher order reasonings.
The experiments show really good performance in handling occlusions as well as pose estimation.
Cons:
The training part requires training samples given image-landmark triplets, which is different from original DPM that directly infers these parameters as latent states in a latent-SVM. This greatly increases the difficulties of training. I suspect that the model is too complicated to use latent-SVM as it may generate lots of over-fitting which results in incorrect training configurations.
A big question for this paper is that: is it really necessary to infer a full-3D model even for occluded parts? Will it be better if sub-category classification can be combined with 3D inference? Say we use sub-category classification to coarsely classify objects approximately into different categories as well as viewing-angles, and then infer 3D information only for the non-occluded parts.
I think it's really interesting that the authors are trying to model the 3D shape of the object for better detection, and it definitely seems like such an approach would be better able to handle multiple viewpoints and occlusion than algorithms that use 2D information only. My concern is that I don't see how this could ever be generalized to solve object detection on a larger scale. It seems like a significant amount of human time is required to create enough training data to do object detection of cars only. How much time would it take to provide enough training data for the 20 classes in the PASCAL dataset? And even then, we would only be able to detect 20 objects.
Granted, other algorithms are maybe not much better, since they at least require bounding boxes, and perhaps require more images to be labeled than this method, resulting in equal or more time spent labeling. I simply think we should consider how easy it is to provide training data to a particular algorithm. Those for which we will never be able to provide sufficient training data should maybe not be pursued.
However, maybe if this algorithm could choose its own landmarks, instead of needing each of them annotated specifically, it could be more easily generalizable.
Creating appearance templates for occluded landmarks is a really interesting idea and unfortunately the paper doesn't analyze how much these occlusion templates contribute to the accuracy of the algorithm. However, assuming that "typical" occlusion templates exist, it would be interesting to see if a link can be established between overall scene appearance and occlusion parts. In other words, is it possible to use parts trained on occluded landmarks to say something about the object's interaction with the entire scene, or even the layout of the scene as a whole?
The nice thing about this paper:
1. The idea of exploiting the 3D constraints over DPM detection (more specifically, a variant of DPM), which helps us to prune out the 2D appearance detections which do not respect the 3D structure.
2. Specifically designed energy function for 2D appearance inference capturing intuitions about occlusion.
Cons:
1. According to my understanding (please correct me if I am wrong...). Due to the two-step inference procedure, the 3D information is only used for pruning out detections which do not make sense, it is not involved in the process we determine 2D detections. Probably some approximate inference could be used here (I know people may disagree...)?
2. Lack of diagnostic analysis, we do not know what's wrong since we only get 2% gain even though we expect more from 3D information. Whether it is because the 2D correspondence fails or because we need more robustness from 3D registration?