I'm not the reviewer, but here is my unsolicited opinion anyways:
The overall method seems very geometry-driven, as objects do not interact but depend only on local evidence. However, the local evidence takes the form of inferred object size and orientation, and is at a higher scene level in comparison to image level evidence like HOG, etc.
I remember someone in class mentioned that all of CV is about overfitting to the world. In this case, the authors are "overfitting" to a distribution of object sizes, common orientations, and viewpoints. I would argue that the object size distribution is the most questionable, as we have discussed how humans tend to categorize objects based on function (i.e., a toy car vs. a sedan vs. a mining truck are all "cars" despite very different sizes).
From a robotics standpoint, it would be interesting to see how well this approach works when the viewpoint is very well known, as tends to be the case in robotic applications. I also wonder if this method could be easily integrated into an existing probabilistic system for joint pose estimation and detection.
This paper uses scene geometry and camera position to inform object detection, and vice versa. This approach makes a lot of intuitive sense following the points discussed in class last lecture. However, it also makes a lot of assumptions about the individual relationships between scene geometry, camera position, and object location.
One, for example, is that objects are on the ground plane. This assumption is generally false for certain object classes (like birds, lamps, and computer monitors) and may be difficult to verify under heavy occlusion. I would have liked to see a mention of this problem and perhaps some ways to reconcile the framework to include these object classes. For example, how would this framework handle pictures of birds mid-flight or sitting in trees? Even for the person class, how does this framework perform for close-ups of people (even knee up) where the ground plane is not visible?
Another assumption that the approach seems to rely on is that the true camera position is close to the viewpoint priors. This prior seems to ignore top-down pictures, where a person might be in a building taking pictures of the ground beneath them. We could even consider pictures where the camera is angled downward, so that the horizon is not visible. Would the algorithm be able to recover from the incorrect prior? And what would the horizon be in such a situation?
I think that given a set of random pictures of city outdoor scenes from the web, the mentioned extreme cases would produce some noise in the ROC curve.
Well, yes, indeed, the authors never say explicitly in the beginning that they focus only on cars and pedestrians in city outdoor scenes. But that looks like quite a large portion of tasks to me anyway, considering that they claim an increase of 20% in recognition rate at a fixed FPR.
With respect to your person class example, if the goal was to perform detection for close-up images, I don't see why any geometry would be required at all. If the image has only a person and mostly nothing else, all that we can do is detect a person using just the information from the image, and one wouldn't really want to know about the geometry of the scene. If the image has one close-up person and a lot of other objects like cars, we can still estimate the geometry. I think geometry comes into play when there are many objects and we want to get an understanding of the 3D world using these objects. However, I do agree with your concern about the ability of this method to perform on objects that are not on the ground plane. Maybe the authors just wanted the relatively simpler problem to be solved first :)
The problem is more general. Even in Tuesday's presentation, the issue of a mismatch with model assumptions was not dealt with carefully. This shows up in the catastrophic failure cases typical of these approaches.
I think the solution lies in reconsidering a design decision made early on in the paper - replacing object-to-object interactions with object-scene / object-viewpoint interactions. I'm not entirely convinced that this is the right way to go. Since viewpoint/geometry information is not really in the image, we have to use strong priors - people on road, cars on road, horizon line within image, Gaussian assumption on camera height. On the other hand, a lot of object-object interactions are visible in the image and we just haven't found a way to use them to the same effect (precision improvements).
At the risk of sounding like a broken record: A deep neural network is combining information from the entire scene to reason about semantic categories. These object-object interactions are already captured within the limits of the local receptive field size * 2^(depth-1). A smarter way to construct the network architecture might provide a better solution.
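To make the receptive-field limit concrete, here is a minimal sketch of the standard receptive-field recursion for a stack of convolution/pooling layers. The function name and the layer lists are my own illustration, not anything from the paper or from a particular network.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    `layers` is a list of (kernel_size, stride) pairs, input side first.
    Each layer widens the field by (kernel_size - 1) times the cumulative
    stride ("jump") of the layers below it.
    """
    field, jump = 1, 1
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


# Four layers of 2x2 kernels with stride 2 (hypothetical architecture).
print(receptive_field([(2, 2)] * 4))
# Four layers of 3x3 kernels with stride 2 (hypothetical architecture).
print(receptive_field([(3, 3 - 1)] * 4))
```

For 2x2 kernels with stride 2 the recursion agrees with "kernel size * 2^(depth-1)"; for other kernel sizes that expression is only a rule of thumb, but the point stands that the context a unit can see is bounded by the architecture.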
I think that making assumptions and solving the simpler problem is the way to go (since you can't solve a more complicated problem without at least understanding how to do the easier version). But I think the authors could have spent a bit of effort commenting on how this can be extended to objects that aren't on the ground plane, especially since they suggested that their algorithm could be combined with any object detector.
I didn't think the ground plane support assumption was such a big deal.
(1) Very reasonable for the authors' application (which we have already agreed upon)
(2) Even without the ground plane support assumption, you can still at least use the geometry to narrow down your search space a little, since I still think it's reasonable to assume most objects are vertical surfaces.
(3) Objects are still supported by *something* (unless we're talking about things that can float/fly). If we want to push the detection to finding monitors and mugs, maybe we should also be able to find tables and chairs and say that detected objects can serve as ground planes for additional objects, and then do something similar to what Abhinav presented on Tuesday with regards to stability of objects resting on each other.
I think the intuition of using geometry as a cue to limit the search space of object detection is reasonable. However, the geometry information used in this paper is limited. It also makes several assumptions, which restrict it mostly to outdoor scenes and objects such as cars and pedestrians. By leveraging a lot of other images rather than just this single image, we could obtain more useful contextual information to model more kinds of variation.
Re: incorrect prior, I agree that graceful failure is lacking in some of these methods. Ideally we could use strong priors (box-shaped rooms, horizontal horizon near the middle of the image) when they make sense, but have a way to discard them when they really aren't supported by what is in the image.
Most object detection methods try to detect objects at all image locations and scales. This paper proposes a framework for combining conventional object detection with estimates of the scene's 3D geometry. 3D information is incorporated in two ways: the scene's geometry estimate (surface orientation and types) and the camera's viewpoint. The viewpoint is defined by the horizon position and camera height. The entire system is modeled as a graphical model where object identities (type of object and bounding box) are considered independent given camera viewpoint, and local surface geometry is considered independent given object identities. Furthermore, each object identity and corresponding surface geometry patch has an associated evidence probability, which is observed.
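Written out, the independence structure described above would give a factorization along these lines. This is a sketch reconstructed from the description in this comment, not copied from the paper; the symbols (viewpoint \theta, object identities o_i, local surface geometries g_i, observed evidence e_{o_i} and e_{g_i}) are my own notation.

```latex
P(\theta, \mathbf{o}, \mathbf{g}, \mathbf{e}_o, \mathbf{e}_g)
  = P(\theta) \prod_{i} P(o_i \mid \theta)\, P(e_{o_i} \mid o_i)\,
    P(g_i \mid o_i)\, P(e_{g_i} \mid g_i)
```

Because the only thing the objects share is the viewpoint, the graph is a tree, and everything couples through \theta.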
Pros
I really like the idea of using perspective and prior information about typical object heights to make object detection better. The core idea of modeling distance in the image in a manner representative of the 3D world is intuitive and mathematically sound.
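As a rough illustration of the perspective reasoning, here is a sketch of the usual ground-plane relation between image height and world height. It assumes a camera with little tilt or roll, an object standing on the ground plane, and image rows measured upward from the bottom of the image; the function and argument names are hypothetical, not the paper's.

```python
def world_height(v_bottom, v_top, v_horizon, camera_height):
    """Estimate an object's real-world height from image rows.

    Rows are measured upward from the bottom of the image (an assumption
    of this sketch), so the horizon row is larger than the row of the
    object's base when the base is on the ground in front of the camera.
    """
    if v_horizon <= v_bottom:
        raise ValueError("object base must lie below the horizon")
    image_height = v_top - v_bottom
    # Similar triangles: the camera height spans (v_horizon - v_bottom)
    # rows at the object's depth, so scale the object's rows accordingly.
    return camera_height * image_height / (v_horizon - v_bottom)


# An object whose top touches the horizon is as tall as the camera is high.
print(world_height(v_bottom=100, v_top=300, v_horizon=300, camera_height=1.7))
```

Combined with a prior over typical heights per class, this is what lets a single viewpoint estimate say which box sizes are plausible at which image rows.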
Simple representation. The a priori estimation of camera viewpoint is quite simple, and shows good results. In fact overall the probabilistic model is simple to understand and well explained.
Experiments. The authors set out to show that modeling different aspects of the image information together works better than using just one ingredient alone. The experiments are well constructed to demonstrate this and the results are compelling.
Plugin detectors. There is no dependence on the type of detector that is being used. This makes the system adaptable.
Cons
Probability of background scene geometry label. The experimentally learned probability values do not work. Is this because of insufficient data, or is modeling the background label too hard a problem?
Qualitative results. I would have really liked to see what the strengths of using scene geometry and camera viewpoint individually are and how their combination leads to both compensating for the other's weakness. It would have been nice to see more pictures to get an intuitive feel of what each part is doing and whether it makes sense.
Types of scenes and objects. Would the same approach work in indoor scenes, or in scenes with more types of objects that have more intraclass height variation?
Typical viewpoint. There is an implicit assumption that camera viewpoints will not be very different across images. So while this system would be great for security cameras that have very typical views, it may not generalize well.
With respect to your point regarding different types of scenes and objects, I can't think of a reason why this approach wouldn't work. All that they are trying to do is model a system where you can take into account the geometrical aspects of the scene, which should ideally give more information to the object detector. And with respect to objects that have more intraclass height variation, if you have enough training data, I think it should work.
One way to look at the "more intraclass height variation" problem is to consider a class with a multimodal distribution over heights. This might be true for trees or even pedestrians (kids + non-kids). I think their machinery is general enough to deal with such distributions.
I agree. I think that since they model object height using a probability function, you just have a larger variance on the model for objects with large intraclass height variation. But at least this gives you the ability to say something like: people will never be greater than 10 ft tall and cars will never be shorter than 3 ft tall.
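A multimodal height prior of the kind discussed in this sub-thread could be sketched as a Gaussian mixture. The component weights, means, and standard deviations below are made-up numbers for illustration, not values from the paper.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, components):
    """Density of a Gaussian mixture; components are (weight, mean, std)."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)


# Hypothetical pedestrian height prior in metres: mostly adults, some kids.
PEDESTRIAN_HEIGHT = [(0.8, 1.7, 0.1), (0.2, 1.1, 0.15)]

for height in (1.1, 1.4, 1.7, 3.0):
    print(height, mixture_pdf(height, PEDESTRIAN_HEIGHT))
```

A single wide Gaussian would put a lot of mass on in-between heights that are actually uncommon, whereas the mixture keeps both modes sharp and still assigns essentially zero density to a 3 m person.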
This paper combines object detection with local scene geometry and view point information. The combination is done using graphical models - a methodology that has been repeatedly used for combining domain knowledge with statistical methods. The improvement in performance is huge - great return on added investment in terms of complexity.
I am wondering if there is an algorithmic interpretation of the message passing that happens in this model. Do messages passed down the tree look like an intuitive use of the viewpoint information?
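One way to get a feel for this is to write the message passing out for a stripped-down version of the model: a single discrete viewpoint variable and a set of binary "is there an object here" candidates, with the surface geometry nodes left out. This is my own simplification for illustration; the function name, argument layout, and numbers are hypothetical.

```python
def infer(prior, p_obj_given_view, likelihood):
    """Exact sum-product on a star-shaped tree: one viewpoint, N candidates.

    prior[t]               : P(theta = t)
    p_obj_given_view[i][t] : P(o_i = present | theta = t)
    likelihood[i]          : (P(e_i | absent), P(e_i | present))
    Returns (posterior over theta, posterior P(o_i = present) per candidate).
    """
    n_views, n_objs = len(prior), len(likelihood)

    # Upward messages: each candidate tells the viewpoint how well each
    # viewpoint value explains its detector evidence.
    up = [[(1 - p_obj_given_view[i][t]) * likelihood[i][0]
           + p_obj_given_view[i][t] * likelihood[i][1]
           for t in range(n_views)]
          for i in range(n_objs)]

    # Viewpoint belief: prior times all upward messages, normalized.
    belief = list(prior)
    for msg in up:
        belief = [b * m for b, m in zip(belief, msg)]
    total = sum(belief)
    view_post = [b / total for b in belief]

    # Downward messages: what the viewpoint believes from everything
    # except candidate i, combined with candidate i's own evidence.
    obj_post = []
    for i in range(n_objs):
        present = absent = 0.0
        for t in range(n_views):
            down = prior[t]
            for j in range(n_objs):
                if j != i:
                    down *= up[j][t]
            present += down * p_obj_given_view[i][t] * likelihood[i][1]
            absent += down * (1 - p_obj_given_view[i][t]) * likelihood[i][0]
        obj_post.append(present / (present + absent))
    return view_post, obj_post


# Candidate 0 has strong detector evidence; candidate 1 is ambiguous.
views, objs = infer(
    prior=[0.5, 0.5],
    p_obj_given_view=[[0.9, 0.1], [0.8, 0.2]],
    likelihood=[(0.1, 0.9), (0.5, 0.5)],
)
print(views, objs)
```

In this toy run the confident candidate pulls the viewpoint belief toward the first viewpoint, and the downward message then raises the ambiguous candidate above the 0.5 it would get on its own. So the downward messages read as "given what the other detections say about the viewpoint, how plausible is a box of this size here". Since the graph is a tree, one upward and one downward pass give exact marginals, with no iteration to convergence needed.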
I like this approach; it seems that scene geometry has a lot to say about object detection and vice versa. There is evidence that these processes are interrelated in biological vision. There is strong evidence for the existence of two streams in the visual system - the ventral stream (Object Recognition) and the dorsal stream (Motion, Tracking, Object Localization in Space).
What researchers have noted is that these streams are interconnected despite having roughly different functions - perhaps mutual information is needed for either problem?
I like the paper for two reasons: 1. The use of the viewpoint-to-size constraint for object detection. 2. Using the estimates of both the objects and the horizon to help each other.
For the viewpoint constraint, I remember reading this paper (http://europa.informatik.uni-freiburg.de/files/sudowe-geometric-constraints-icvs11.pdf), which I found to be very simple, yet efficient. Bastian Leibe's group is known for using such simple tricks for building really good detection systems for driving cars.
I like the paper as a cute small paper with nice ideas :) The idea of using scene geometry for imposing constraints on where objects can occur and using objects for imposing constraints on possible scene geometry is not new, but this paper provides a pretty straightforward way of incorporating it into standard object detectors.
Though I don't imagine this being 'super' useful for tasks like object detection on PASCAL, where there is just too much variation in size and viewpoint to capture with a simplistic model, I think it would work great for tasks like the KITTI dataset or autonomous driving, where we are most interested in finding objects on the ground with a camera position that is more or less fixed.
I wonder if it is a good idea to reduce the search space for objects first, i.e., find the scales and locations where an object can occur and only then run our object detectors, or should we just run detectors everywhere, get all the results, and then post-process them using the constraints mentioned in the paper?
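The "reduce the search space first" option could look something like the sketch below. It uses a hard cutoff, which is my own simplification (a soft, probabilistic reweighting would keep more borderline windows), and it assumes a known horizon row, a known camera height, rows measured upward from the image bottom, and objects resting on the ground plane. All names and numbers are hypothetical.

```python
def plausible_window_heights(v_bottom, v_horizon, camera_height,
                             min_height, max_height):
    """Pixel-height range for a window whose base sits on row v_bottom.

    Returns None when the base is at or above the horizon, where a
    ground-supported object cannot sit under these assumptions.
    """
    gap = v_horizon - v_bottom
    if gap <= 0:
        return None
    # Pixels per metre at this depth, from the ground-plane relation.
    scale = gap / camera_height
    return (min_height * scale, max_height * scale)


def prune_windows(windows, v_horizon, camera_height, min_height, max_height):
    """Keep (v_bottom, pixel_height) windows consistent with the viewpoint."""
    kept = []
    for v_bottom, pixel_height in windows:
        limits = plausible_window_heights(
            v_bottom, v_horizon, camera_height, min_height, max_height)
        if limits is not None and limits[0] <= pixel_height <= limits[1]:
            kept.append((v_bottom, pixel_height))
    return kept


candidates = [(50, 150), (50, 400), (180, 25), (250, 50)]
print(prune_windows(candidates, v_horizon=200, camera_height=1.5,
                    min_height=1.4, max_height=2.0))
```

Pruning first saves detector evaluations but bakes in the viewpoint estimate; running the detector everywhere and rescoring afterwards costs more but lets the detections correct the viewpoint in return.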
With PASCAL, there are probably still some typical types of viewpoints. If we can model those as well, would it be helpful (i.e., more top-down images, horizontal images, etc.)? This is like a mixture of 3D scenes on top of our mixture of DPMs. I wonder if this would buy us anything. In some sense, some of the DPM mixtures will end up capturing the different viewpoint variations (e.g. if the same object is viewed from different angles, as opposed to the object having multiple poses such as sleeping/standing). Having a mixture of viewpoints could serve as a higher level on top of the DPM. Not sure if this would actually help though...
I think the idea of using surface geometry and camera viewpoints to get 3D information about the scene, and using that as contextual information to improve the performance of object detection, is very intuitive. The method shown in the paper gives a significant improvement in object detection (with other detectors too). I think this method is the right way forward towards making object detection better in a 3D sense (at least for objects for which some relation with the ground plane can be obtained from a 2D image).
This is also one of the papers I like most. It proposes an object detection framework that incorporates scale cues to reject some of the impossible configurations of object hypotheses, thereby boosting the detector performance considerably. I like the paper since it tries to explore the cues humans naturally use in scene understanding, particularly when strong perspective evidence is available.
But I do want to point out some missing points here. First of all, I don't totally agree with the example given in the beginning, despite the fact that I do think it is an extremely interesting and informative example. The reason why it is so difficult to recognize a car in the green box is not entirely because of context. Rather, it is because the image is simply too large and too many high-frequency details are occupying you. If you move away from the computer and look at the blocky image again, you'd instantly find that it looks pretty much like a car.
This brings out the very issue I want to point out here that the paper might be missing. The paper puts a lot of emphasis on using scale to reject impossible configurations. But it seems the paper failed to discuss the influence of object scale on feature scales, as well as the potential problems of matching examples across scales. The paper used a unified scale for every object patch. Large patches are OK since a Gaussian pyramid might be imposed. But what about patches that are naturally small? If you up-sample them, clearly they carry less detail than the down-sampled ones. In this case, should we treat them differently?
Are you sure that the reason you can clearly see a car once you move away from the computer isn't because you already know it's a car? If you convince yourself that it looks like a computer mouse, and then move away, it looks a lot like a computer mouse (to me).
I actually agree with zhiding that it looks more like a car than the other things named in the paper. Yes, a human may not be 100% confident of what exactly it is, but I can hardly imagine that a computer mouse would be more likely than a car. I think human eyes are not used to viewing such pixelized images. Imagining beyond the pixels gives us other possibilities. But when you make the pixels as small as they should be, we don't bother to do that again, and recognize it as a car easily.
Geometric cues seem to be very useful evidence for almost any computer vision task that involves objects at multiple scales. While the viewpoint estimation did seem highly tailored to the specific "street-scene" scenario, I think it would be possible to extend it to much larger datasets that only have some of the horizon GTs. Hopefully, a system would be able to simply treat the horizon and viewpoint estimates only as evidence, and place no hard constraints.
3D evidence and predictions are clearly integral to solving tasks involving scenes, but it seems difficult to engineer systems that reason in 3D and are generalizable in the sense that they won't hurt performance in any arbitrary scene environment. Making iterative predictions, and no hard assignments, seems like the only way to approach the "chicken-and-egg" problem the authors mention.
I liked this paper and the idea of trying to get a more global scene understanding to improve object detection. Figure 2g is really cool in that all those bounding boxes seem like very plausible places for people to be - walking on the sidewalk - even though 2g just shows the probability distribution of people given the viewpoint and geometry.
The process of using object detections to improve viewpoint and geometry estimates is stable because of how belief propagation works in graphical models, right? Guaranteed to converge? (I can't remember at the moment exactly how all the message passing stuff works, sorry if this is all very silly.)
I was surprised the paper didn't perform that well on the PASCAL data set. Is it because they had a bad object detector? I wonder what kind of precision they could get if they had plugged the winning algorithm into their system.
My guess is that the PASCAL dataset contains many close-up framed images of individual objects, whereas their system seems to rely on having a good estimate of where the horizon and vertical support surfaces are before making a detection. I suspect their learned prior model caused them to throw out many good detections from their underlying object detector.
I like the idea of imposing geometric constraints for object detection, which is quite intuitive and is demonstrated to be useful in the experiments of this paper. Many priors learned in this paper seem to match what we humans would do in localizing and recognizing objects in a scene.
I'm wondering why context, which seems to play such an intuitive and important role in human vision, does not help that much in current object detection systems. Maybe it is because of the way we make use of context. As in this paper, many works make use of context by imposing a higher-level model on top of different types of context. Maybe we should instead let the learning processes for the individual context components communicate with each other, which means we need to train a joint model instead of a piecewise model for different contexts. Another possible reason is that we are still not at the right scale to use context. The impression I got from this paper is that we really need a huge amount of data to reduce the variance and uncover the underlying context (e.g. estimating the viewpoint).
I agree with the idea that we might need to train a joint model instead of a piecewise model. It is more intuitive to do it this way, although a pairwise model is easier to compute. After all, we humans do not consider the contextual relationship between every pair of objects. We just group things together.
A bigger picture here is Geometrically Coherent Image Interpretation, which jointly models the elements that make up the scene with the geometric context of the 3D space that they occupy. The emphasis here is on "coherent", where all the measured quantities of the image should be considered together in a coherent way. I invite everyone to scan through Derek Hoiem's PhD thesis "Seeing the world behind the image - Spatial layout for 3D scene understanding - 2007" http://www.cs.uiuc.edu/homes/dhoiem/publications/thesis_derek.pdf
Also "Closing the loop in scene interpretation - cvpr 2008" http://www.cs.uiuc.edu/homes/dhoiem/publications/cvpr2008SceneInterpretation.pdf
I liked how the paper's background was a little more historical than just doing a bunch of relevant citations. It went well with the overall flavor of the paper, namely that it is inspired by the way images are created and what they come from - something that they point out was how early vision researchers viewed it as well. Images - and we say images - are usually photos of the world taken by regular people in a regular fashion (i.e. sky is up, grass is down, etc.). As such, as the paper points out, the photos have regular structure in their 3D representation that is then lost when we project into the image. By doing geometric interpretation and representing relationships in the graphical model, we can achieve better performance. This approach seems 'correct' in many ways.
I agree, that's also a point which I like a lot. Just like what Abhinav mentioned in class, there are many good points in the perspective of early vision research. It was only because of the limitations at that time (they didn't have enough data!) that they could not capture their intuition in a reasonable way that respects the objective visual world. Since we now have tons of data, it is time to re-examine some very good points in early research to see if we can find something to help us go forward.
I really like this paper because the idea is really intuitive and is really like what we humans think when we look at an image. The paper is really short but clear enough to demonstrate its basic idea.
The way that the paper combines viewpoint, objects and surface geometry information is really interesting. The prior knowledge it uses is reasonable. A framework that can combine research results from different small areas in computer vision is interesting to me. Using a graphical model to build the framework is intuitive for now; however, we might build other, more powerful frameworks which could combine more things, like prior knowledge of the relationships between different objects and human-object interactions.
The authors set a high goal - complete image understanding as humans do. Humans leverage context and some idea about scene geometry. Indirectly, some of these cues were being used as features in earlier 2D scene understanding papers we read. To be more explicit, using inference information from neighboring superpixels gives semantic context, and scene geometry features give geometric context.
However, the authors model the problem as complete 3D scene understanding with priors for viewpoints, geometry and objects. They have some fairly reasonable assumptions as to how images are taken, what the viewpoints are, object sizes in the real world, etc. Comments - What I like: 1. Attempt at complete scene understanding. 2. Flexibility to change object detectors, possibly add scene categorization to the graphical model. 3. Good results :)
General critical comments: 1. Would like some understanding as to how geometry and viewpoints are helping in detection. 2. I feel that since we are moving into 3D there should be some notion of what depth the object is detected at. The logical extension of this work is to have a bounding box in 3D space instead of the perspective view (will require camera intrinsics though)
Some other points I'd like to raise: As Humphrey and Maheen mentioned, this system is useful when used on a robot or with a surveillance cam. I'd like to see it integrated with streaming video and using the physical world prior of "moving objects on roads are cars" or "people move slower than cars"
This paper seems to handle top-down vs bottom-up in the right way, that is, allowing information to flow in both directions. A disturbing consequence of this sort of design, though, is that it becomes harder to encapsulate the pieces of a vision system: my horizon detector has to know how to use the output of your object detector and vice versa. Perhaps that elusive final 10 or 20 percent accuracy that seems invariably missing in any given application can be achieved by a Frankenstein system that tries to do everything at once.
Is it so terrible that the pipeline has to flow both ways? I could imagine something iterative like stacked labeling where each module is only required to adopt some probabilistic input/output interface.
It's still the maximum a posteriori over the joint distribution. We aren't losing anything by having to iterate twice - it's not a greedy solution. This itself is the Frankenstein system.
In this paper, the authors propose a novel modification of the typical "bag of words" or "black box" approach to object detection. They note (rightly) that what is really going on with sliding window detectors is that they assume a uniform prior on object locations and scales. They point out that this assumption is very often false in common scenes.
Instead, they develop a graphical model of objects, surface geometries and the horizon which is able to distinguish good candidate bounding boxes from bad ones, giving them a much stronger prior of potential object locations. By learning the parameters of this model from real image data, they significantly improve the performance of typical object detectors, and present compelling results in that area.
Positives: Something has always bothered me about the sliding window detectors. I always got the feeling that they were just throwing the image into a blender and then running it through a black box detector, without any regard for the heuristic assumptions necessary to make predictions about images.
In a sense, these detectors are reactionary. Early vision work used heuristics exclusively, whereas sliding window detectors throw away any assumptions about the image. This paper provides an excellent middle ground -- it is possible to learn the *parameters of a model* and get the best of both worlds.
In general, I think this is the right way to go about doing machine learning. Statistical black-box magic can only get you so far, and the end result is not even very useful (who cares about bounding boxes, really?) If instead, we have strong priors based on *generative models,* we are able to learn much more deeply about the scene and actually *use* the data we get.
Negatives: I feel like an even more explicit model is necessary to get the kind of performance these authors dream of. The camera parameters should be explicitly modeled. The horizon should be a ground plane, fitted to the data. The objects should be 3D shapes. We've seen elements of these ideas in other papers already. They only need to be combined.
Some people mentioned the limits of how much geometry can help object recognition and vice versa. They mentioned that the geometry can be reliably estimated from an image of a street with buildings on either side, but it might not help a lot for an image that contains only a group of salient objects. I think it falls back to the object image vs. scene image argument. And I think the notions of saliency and gaze are related as well. It is very helpful to estimate where to look given an image. For example, context is the cue to look at if the object is too tiny, but for larger, stand-alone ones the devil is in the details. To what extent we should use each type of information is the intelligence that underlies vision.
I'm not the reviewer, but here is my unsolicited opinion anyways:
ReplyDeleteThe overall method seems very geometry-driven, as objects do not interact but depend only on local evidence. However, the local evidence takes the form of inferred object size and orientation, and is at a higher scene level in comparison to image level evidence like HOG, etc.
I remember someone in class mentioned that all of CV is about overfitting to the world. In this case, the authors are "overfitting" to a distribution of object sizes, common orientations, and viewpoints. I would argue that the object size distribution is the most questionable, as we have discussed how humans tend to categorize objects based on function (ie. a toy car vs. a sedan vs. a mining truck).
From a robotics standpoint, it would be interesting to see how well this approach works when the viewpoint is very well known, as tends to be the case in robotic applications. I also wonder if this method could be easily integrated into an existing probabilistic system for joint pose estimation and detection.
This paper uses scene geometry and camera position to inform object detection, and vice versa. This approach makes a lot of intuitive sense following the points discussed in class last lecture. This approach makes a lot of assumptions about the individual relationships between scene geometry, camera position, and object location.
ReplyDeleteOne for example, is that objects are on the ground plane. This assumption is generally false for certain object classes (like birds, lamps, and computer monitors) or may be difficult to verify under heavy occlusion. I would have liked to see a mention of this problem and perhaps some ways to reconcile the framework to include these object classes. For example, how would this framework include pictures of birds mid-flight or sitting in trees. Even for the person class, how does this framework perform for close-ups of people (even knee up) where the ground plane is not visible.
Another assumption that the approach seems to rely on is that the true camera position is close to the viewpoint priors. This prior seems to ignore top-down pictures, where a person might be in a building taking pictures of the ground beneath them. We could even consider pictures where the camera is angled downward, so that the horizon is not visible. Would the algorithm be able to recover from the incorrect prior? And what would the horizon be in such a situation?
I think that given a set of random pictures of city outdoor scenes from the web, the mentioned extreme cases would produce some noise in the ROC curve.
DeleteWell, yes, indeed, the authors never say explicitly in the beginning that they focus only on cars and pedestrians in city outdoor scenes. But that looks like a quite large portion of tasks to me anyway. Considering they they claim increase of 20% in recognition rate with a fixed FPR.
With respect to your person class example, if the goal was to perform detection for close-up images, I don't think why any geometry would be required at all. If the image has only a person and mostly nothing else, all that we can do is detect a person and with just information from the image, one wouldn't possibly want to know about the geometry of the scene. If the image has one close-up person and a lot of other objects like cars, we can still estimate the geometry. I think geometry comes into play when there are many objects and we want to get understanding of the 3D world using these objects. However, I do agree with your concern of the ability of this method to perform on objects that are not on the ground plane. Maybe the authors just wanted the relatively simpler problem to be solved first :)
DeleteThe problem is more general. Even in Tuesday's presentation, the issue of a mismatch with model assumptions was not dealt with carefully. This shows up in catastrophic failure cases typical to these approaches.
DeleteI think the solution lies in reconsidering a design decision made early on in the paper - replace object to object interactions with object scene / object viewpoint interactions. I'm not entirely convinced that this is the right way to go. Since viewpoint/geometry information is not really in the image we have to use strong priors - people on road, cars on road, horizon line within image, gaussian assumption on camera height. On the other hand, a lot of object-object interactions are visible in the image and we just haven't found a way to use them to the same effect (precision improvements).
At the risk of sounding like a broken record:
A deep neural network is combining information from the entire scene to reason about semantic categories. These object-object interactions are already captured within the limits of the local receptive field size * 2^(depth-1). A smarter way to construct the network architecture might provide a better solution.
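A rough sketch of how that receptive field grows, under assumptions that are mine and not from the comment: every layer has the same kernel size and a stride of 2 (which is what makes a factor of 2 per layer plausible). The function name `receptive_field` and the example numbers are hypothetical. It uses the standard recurrence (each layer adds `(kernel - 1)` times the current spacing between neighboring outputs) and prints it next to the rule of thumb quoted above.

```python
def receptive_field(layers):
    """Receptive field, in input pixels, of a stack of conv/pool layers.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    """
    rf = 1    # a single input pixel sees itself
    jump = 1  # spacing, in input pixels, between neighboring outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


if __name__ == "__main__":
    kernel = 3  # assumed kernel size, for illustration only
    for depth in range(1, 6):
        exact = receptive_field([(kernel, 2)] * depth)
        rule_of_thumb = kernel * 2 ** (depth - 1)
        print(depth, exact, rule_of_thumb)
```

With stride 1 everywhere the growth is only linear in depth, so the exponential bound really depends on downsampling between layers.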
I think that making assumptions and solving the simpler problem is the way to go (since you can't solve a more complicated problem without at least understanding how to do the easier version). But I think the authors could have spent a bit of effort commenting on how this can be extended to objects that aren't on the ground plane, especially since they suggested that their algorithm could be combined with any object detector.
I didn't think the ground plane support assumption was such a big deal.
(1) Very reasonable for the authors' application (which we have already agreed upon)
(2) Even without the ground plane support assumption, you can still at least use the geometry to narrow down your search space a little, since I still think it's reasonable to assume most objects are vertical surfaces.
(3) Objects are still supported by *something* (unless we're talking about things that can float/fly). If we want to push the detection to finding monitors and mugs, maybe we should also be able to find tables and chairs and say that detected objects can serve as ground planes for additional objects, and then do something similar to what Abhinav presented on Tuesday with regards to stability of objects resting on each other.
I think the intuition of using geometry as a cue to limit the search space of object detection is reasonable. However, the geometry information used in this paper is limited. It also makes several assumptions, which tends to restrict it to outdoor scenes and objects such as cars and pedestrians. By leveraging a lot of other images rather than just this single image, we could obtain more useful contextual information to model more kinds of variation.
Re: incorrect prior, I agree that graceful failure is lacking in some of these methods. Ideally we could use strong priors (box-shaped rooms, horizontal horizon near the middle of the image) when they make sense, but have a way to discard them when they really aren't supported by what is in the image.
Most object detection methods try to detect objects at all image locations and scales. This paper proposes a framework for combining conventional object detection with estimates of the scene's 3D geometry. 3D information is incorporated in two ways: the scene's geometry estimate (surface orientation and types) and the camera's viewpoint. The viewpoint is defined by the horizon position and height. The entire system is modeled as a graphical model where object identities (type of object and bounding box) are considered independent given camera viewpoint, and local surface geometry is considered independent given object identities. Furthermore each object identity and corresponding surface geometry patch have an associated evidence probability - which is observed.
Pros
I really like the idea of using perspective and prior information about typical object heights to make object detection better. The core idea of modeling distance in the image in a manner representative of the 3D world is intuitive and mathematically sound.
Simple representation. The a priori estimation of camera viewpoint is quite simple, and shows good results. In fact overall the probabilistic model is simple to understand and well explained.
Experiments. The authors set out to show that modeling different aspects of the image information together works better than using just one ingredient alone. The experiments are well constructed to demonstrate this and the results are compelling.
Plugin detectors. There is no dependence on the type of detector that is being used. This makes the system adaptable.
Cons
Probability of background scene geometry label. The experimentally learned probability values do not work. Is this because of insufficient data, or is modeling the background label too hard a problem?
Qualitative results. I would have really liked to see what the strengths of using scene geometry and camera viewpoint individually are and how their combination leads to both compensating for the other's weakness. It would have been nice to see more pictures to get an intuitive feel of what each part is doing and whether it makes sense.
Types of scenes and objects. Would the same approach work in indoor scenes, or in scenes with more types of objects that have more intraclass height variation?
Typical viewpoint. There is an implicit assumption that some camera viewpoints will not be very different across images. So while this system would be great for security cameras that have very typical views - it may not generalize well.
With respect to your point regarding different types of scenes and objects, I can't think of a reason why this approach wouldn't work. All that they are trying to do is model a system where you can take into account the geometrical aspects of the scene, which should ideally give more information to the object detector. And with respect to objects that have more intraclass height variation, if you have enough training data, I think it should work.
One way to look at the "more intraclass height variation" problem is to consider a class with a multimodal distribution over heights. This might be true for trees or even pedestrians (kids + non kids). I think their machinery is general enough to deal with such distributions.
I agree. I think that since they model object height using a probability function, you just have a larger variance on the model for objects with large intraclass height variation. But at least this gives you the ability to say something like, people will never be greater than 10 ft tall and cars will never be shorter than 3 ft tall.
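To make the multimodal height idea concrete, here is a minimal sketch of a two-component Gaussian mixture prior over pedestrian height. Everything numeric in it (weights, means, standard deviations, in meters) is a made-up placeholder, and the names `mixture_pdf` and `PEDESTRIAN_HEIGHT` are hypothetical; the paper's actual height model is not reproduced here.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, components):
    """Density of a Gaussian mixture; components are (weight, mean, std)."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)


# Hypothetical numbers: 80% adults around 1.7 m, 20% kids around 1.1 m.
PEDESTRIAN_HEIGHT = [(0.8, 1.7, 0.10), (0.2, 1.1, 0.15)]

if __name__ == "__main__":
    for height in (0.3, 1.1, 1.4, 1.7, 3.0):
        print(height, mixture_pdf(height, PEDESTRIAN_HEIGHT))
```

A single wide Gaussian would put noticeable mass on heights between the two modes, whereas the mixture keeps both peaks sharp and still assigns essentially zero density to a 3 m pedestrian.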
This paper combines object detection with local scene geometry and view point information. The combination is done using graphical models - a methodology that has been repeatedly used for combining domain knowledge with statistical methods. The improvement in performance is huge - great return on added investment in terms of complexity.
I am wondering if there is an algorithmic interpretation of the message passing that happens in this model. Do messages passed down the tree look like an intuitive use of the viewpoint information?
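One way to get a feel for those messages is a toy version of the tree. The sketch below is not the paper's model: it assumes a discrete viewpoint variable with a handful of hypotheses, binary object labels (present / absent) per candidate window, and no surface geometry nodes. The function name `infer` and all numbers are hypothetical. Each candidate sends an upward message saying how well it is explained under each viewpoint; the viewpoint belief then flows back down and raises or lowers each candidate.

```python
def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def infer(prior_theta, p_obj_given_theta, detector_likelihood):
    """Exact sum-product inference in a two-level tree (viewpoint -> objects).

    prior_theta[t]           : P(theta = t)
    p_obj_given_theta[i][t]  : P(o_i = 1 | theta = t)
    detector_likelihood[i]   : (P(e_i | o_i = 0), P(e_i | o_i = 1))
    """
    n_theta = len(prior_theta)
    n_obj = len(p_obj_given_theta)

    # Upward messages: each candidate sums out its own label.
    up = []
    for i in range(n_obj):
        l0, l1 = detector_likelihood[i]
        up.append([
            (1.0 - p_obj_given_theta[i][t]) * l0 + p_obj_given_theta[i][t] * l1
            for t in range(n_theta)
        ])

    # Viewpoint posterior: prior times all upward messages.
    post_theta = []
    for t in range(n_theta):
        p = prior_theta[t]
        for i in range(n_obj):
            p *= up[i][t]
        post_theta.append(p)
    post_theta = normalize(post_theta)

    # Downward messages: viewpoint belief from everyone except candidate i.
    post_obj = []
    for i in range(n_obj):
        l0, l1 = detector_likelihood[i]
        b0 = b1 = 0.0
        for t in range(n_theta):
            others = prior_theta[t]
            for j in range(n_obj):
                if j != i:
                    others *= up[j][t]
            b0 += others * (1.0 - p_obj_given_theta[i][t]) * l0
            b1 += others * p_obj_given_theta[i][t] * l1
        post_obj.append(b1 / (b0 + b1))
    return post_theta, post_obj


if __name__ == "__main__":
    prior = [0.5, 0.5]  # two horizon hypotheses, A and B
    # Candidates 0-2 have plausible sizes under A, candidate 3 only under B.
    p_obj = [[0.3, 0.01], [0.3, 0.01], [0.3, 0.01], [0.01, 0.3]]
    # Candidates 0 and 1 are confident detections, 2 and 3 are weak ones.
    evidence = [(0.1, 0.9), (0.1, 0.9), (0.4, 0.6), (0.4, 0.6)]
    print(infer(prior, p_obj, evidence))
```

In this toy run the two confident detections pull the viewpoint toward hypothesis A, and the downward message then favors the weak candidate that is consistent with A over the equally weak one that is only consistent with B. Because the graph is a tree, one upward and one downward pass give the exact marginals.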
I like this approach; it seems that scene geometry has a lot to say about object detection and vice versa. There is evidence that these processes are interrelated in biological vision. There is strong evidence for the existence of two streams in the visual system - the ventral stream (Object Recognition) and the dorsal stream (Motion, Tracking, Object Localization in Space).
http://en.wikipedia.org/wiki/Two-streams_hypothesis
What researchers have noted is that these streams are interconnected despite having roughly different functions - perhaps mutual information is needed for either problem?
I like the paper for two reasons
1. The use of the viewpoint-to-size constraint for object detection.
2. Using both the estimate of objects and horizon to help each other.
For the viewpoint constraint, I remember reading this paper - (http://europa.informatik.uni-freiburg.de/files/sudowe-geometric-constraints-icvs11.pdf) which I found to be very simple, yet efficient. Bastian Leibe's group is known for using such simple tricks for building really good detection systems for driving cars.
I like the paper as a cute small paper with nice ideas :)
The idea of using scene geometry for imposing constraints on where objects can occur and using objects for imposing constraints on possible scene geometry is not new, but this paper provides a pretty straightforward way of incorporating the same for standard object detectors.
Though I don't imagine this being 'super' useful for tasks like object detection on PASCAL, where there is just too much variation in size and viewpoint to capture with a simplistic model, I think it would work great for tasks like the KITTI dataset or autonomous driving, where we are most interested in finding objects on the ground with a camera position that is more or less fixed.
I wonder if it is a good idea to reduce the search space for objects first, i.e., find the scales and locations where an object can occur and then run our object detectors only there, or whether we should just run detectors everywhere, get all the results, and then post-process them using the constraints mentioned in the paper.
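A minimal sketch of the post-processing variant, under assumptions that are not spelled out in this thread: a pinhole camera with negligible tilt and roll, objects resting on the ground plane, image rows increasing downward, and a single Gaussian prior on world height. Under those assumptions similar triangles give world height ≈ camera height × box height / (box bottom row − horizon row). The names `implied_height` and `rescore` and all numbers are hypothetical.

```python
import math


def implied_height(box_top, box_bottom, horizon_row, camera_height):
    """World height (m) implied by a box resting on the ground plane.

    Returns None when the box bottom is not below the horizon, since a
    ground-supported object cannot touch the ground there.
    """
    below_horizon = box_bottom - horizon_row
    if below_horizon <= 0:
        return None
    return camera_height * (box_bottom - box_top) / below_horizon


def rescore(detections, horizon_row, camera_height, mean_height, std_height):
    """Down-weight (score, top, bottom) detections with implausible heights."""
    rescored = []
    for score, top, bottom in detections:
        height = implied_height(top, bottom, horizon_row, camera_height)
        if height is None:
            weight = 0.0  # hard rejection; a small floor would be softer
        else:
            z = (height - mean_height) / std_height
            weight = math.exp(-0.5 * z * z)
        rescored.append((score * weight, top, bottom))
    return rescored


if __name__ == "__main__":
    # Horizon at row 200, camera 1.6 m above the ground (made-up values).
    detections = [(0.7, 190, 360), (0.9, 300, 360), (0.8, 100, 150)]
    for item in rescore(detections, 200, 1.6, mean_height=1.7, std_height=0.1):
        print(item)
```

The search-space-first variant would apply the same `implied_height` test to each window before running the detector, which saves computation but commits to the viewpoint estimate early. Rescoring afterwards keeps every detection around, so a wrong horizon can still be revised.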
With PASCAL, there are probably still some typical types of viewpoints. If we can model those as well, would it be helpful? (i.e. more top-down images, horizontal images, etc.). This is like a mixture of 3D scenes on top of our mixture of DPMs. I wonder if this would buy us anything. In some sense, some of the DPM mixtures will end up capturing the different viewpoint variations (e.g. if the same object is viewed from different angles as opposed to the object having multiple poses (sleeping/standing)). Having a mixture of viewpoints could serve as a higher level above the DPM. Not sure if this would actually help though...
I think the idea of using surface geometry and camera viewpoints to get 3D information about the scene, and using that as contextual information to improve the performance of object detection, is very intuitive. The method in the paper shows significant improvement in object detection (with other detectors too). I think this method is the right way forward towards making object detection better in a 3D sense (at least for objects for which some relation with the ground plane can be obtained from a 2D image).
This paper is also one of the papers I like most. It proposes an object detection framework that incorporates scale cues to reject some of the impossible configurations of object hypotheses, thereby boosting detector performance considerably. I like the paper since it tries to explore the cues humans naturally use in scene understanding, particularly when strong perspective evidence is available.
But I do want to point out some missing points here. First of all, I don't totally agree with the example given in the beginning, despite the fact that I do think it is an extremely interesting and informative example. The reason why it is so difficult to recognize a car from the green box is not totally because of context. Rather, it is because the image is simply too large and too many high frequency details are distracting you. If you move away from the computer and look at the blocky image again, you'd find that it instantly looks pretty much like a car.
This brings out the very issue I want to point out here that the paper might be missing. The paper puts a lot of emphasis on using scale to reject impossible configurations. But it seems the paper fails to discuss the influence of object scale on feature scale, as well as potential problems of matching examples across scales. The paper used unified scales for every object patch. Large patches are ok since a Gaussian pyramid might be imposed. But what about patches that are naturally small? If you up-sample them, they clearly carry less detail information than the down-sampled ones. In this case, should we treat them differently?
Are you sure that the reason you can clearly see a car once you move away from the computer isn't because you already know it's a car? If you convince yourself that it looks like a computer mouse, and then move away, it looks a lot like a computer mouse (to me).
I actually agree with zhiding that it looks more like a car than the other things named in the paper. Yes, a human may not be 100% confident of what exactly it is, but I can hardly imagine that a computer mouse would be more likely than a car. I think human eyes are not used to viewing such pixelated images. Imagining what lies beyond the pixels gives us other possibilities. But when you make the pixels as small as they should be, we don't bother to do that anymore, and recognize it as a car easily.
Geometric cues seem to be very useful evidence for almost any computer vision task that involves objects at multiple scales. While the viewpoint estimation did seem highly tailored to the specific "street-scene" scenario, I think it would be possible to extend it to much larger datasets that only have some of the horizon GTs. Hopefully, a system would be able to simply treat the horizon and viewpoint estimates as evidence only, and place no hard constraints.
3D evidence and predictions are clearly integral to solving tasks involving scenes, but it seems difficult to engineer systems that reason in 3D and are generalizable in the sense that they won't hurt performance in an arbitrary scene environment. Making iterative predictions, and no hard assignments, seems like the only way to approach the "chicken-and-egg" problem the authors mention.
I liked this paper and the idea of trying to get a more global scene understanding to improve object detection. Figure 2g is really cool in that all those bounding boxes seem like very plausible places for people to be - walking on the sidewalk - even though 2g just shows the probability distribution of people given the viewpoint and geometry.
The process of using object detections to improve viewpoint and geometry estimates is stable because of how belief propagation works in graphical models, right? Guaranteed to converge? (I can't remember at the moment exactly how all the message passing stuff works, sorry if this is all very silly.)
I was surprised the paper didn't perform that well on the PASCAL data set. Is it because they had a bad object detector? I wonder what kind of precision they could get if they had plugged the winning algorithm into their system.
My guess is that the PASCAL dataset contains many close-up framed images of individual objects, whereas their system seems to rely on having a good estimate of where the horizon and vertical support surfaces are before making a detection. I suspect their learned prior model caused them to throw out many good detections from their underlying object detector.
I like the idea of imposing geometric constraints for object detection, which is quite intuitive and is demonstrated to be useful in the experiments of this paper. Many priors learned in this paper seem to match what we humans would do in localizing and recognizing objects in a scene.
I'm wondering why context, which seems to play such an intuitive and important role in human vision, does not help that much in current object detection systems. Maybe it is because of the way we make use of context. As in this paper, many works use context by imposing a higher-level model on top of different types of context. Maybe we should instead let the learning processes for the individual context components communicate with each other, which means training a joint model instead of piecewise models for the different contexts. Another possible reason is that we are still not at the right scale to use context; the impression I got from this paper is that we really need a huge amount of data to reduce the variance and uncover the underlying context (e.g. estimating the viewpoint).
I agree with the idea that we might need to train a joint model instead of piecewise models. It is more intuitive to do it this way, although pairwise models are easier to compute. After all, we humans do not consider the contextual relationship between every pair of objects; we just group things together.
A bigger picture here is Geometrically Coherent Image Interpretation, which jointly models the elements that make up the scene with the geometric context of the 3D space that they occupy. The emphasis here is "coherent", where all the measured quantities of the image should be considered together in a coherent way. I invite everyone to scan through Derek Hoiem's PhD thesis "Seeing the world behind the image - Spatial layout for 3D scene understanding - 2007" http://www.cs.uiuc.edu/homes/dhoiem/publications/thesis_derek.pdf
Also "Closing the loop in scene interpretation - cvpr 2008" http://www.cs.uiuc.edu/homes/dhoiem/publications/cvpr2008SceneInterpretation.pdf
I liked how the paper's background was a little more historical than just doing a bunch of relevant citations. It went well with the overall flavor of the paper, which is inspired by the way images are created and what they come from - something that, as they point out, is how early vision researchers viewed it as well. Images - and we say images - are usually photos of the world taken by regular people in a regular fashion (i.e. sky is up, grass is down, etc.). As such, as the paper points out, the photos have regular structure in their 3D representation that is then lost when we project into the image. By doing geometric interpretation and representing relationships in the graphical model, we can achieve better performance. This approach seems 'correct' in many ways.
I agree, that's also a point which I like a lot. Just like what Abhinav mentioned in class, there are many good points in the perspective of early vision research. It is only because of the limitations at that time (they didn't have enough data!) that they could not capture their intuition in a reasonable way that respects the objective visual world. Since we now have tons of data, it is time to re-examine some very good points in early research to see if we can find something that helps us go forward.
I really like this paper because the idea is really intuitive and is a lot like how we humans think when we look at an image. The paper is really short but clear enough to demonstrate its basic idea.
The way that the paper combines viewpoint, objects and surface geometry information is really interesting. The prior knowledge it uses is reasonable. A framework that can combine research results from different small areas in computer vision is interesting to me. Using a graphical model to build the framework is intuitive for now; however, we might build other powerful frameworks which could combine more things, like prior knowledge of the relationships between different objects and of human-object interactions.
The authors set a high goal - complete image understanding as humans do. Humans leverage context, some idea about scene geometry. Indirectly some of these cues were being used as features in earlier 2D scene understanding papers we read. Using inference information from neighboring superpixels gives semantic context, and scene geometry features gives geometry context to be more explicit.
However the authors model the problem as complete 3D scene understanding with priors for viewpoints, geometry and objects. They have some fairly reasonable assumptions as to how images are taken, what viewpoints, object sizes in the real world etc.
Comments - What I like:
1. Attempt at complete scene understanding.
2. flexibility to change object detectors, possibly add scene categorization to the graphical model.
3. Good results :)
General critical comments:
1. Would like some understanding as to how geometry and viewpoints are helping in detection.
2. I feel that since we are moving into 3D there should be some notion of what depth the object is detected at. The logical extension of this work is to have a bounding box in 3D space instead of the perspective view (will require camera intrinsics though)
Some other points I'd like to raise:
As Humphrey and Maheen mentioned, this system is useful when used on a robot or with a surveillance cam. I'd like to see it integrated with streaming video and using the physical world prior of "moving objects on roads are cars" or "people move slower than cars"
This paper seems to handle top-down vs bottom-up in the right way, that is, allowing information to flow in both directions. A disturbing consequence of this sort of design, though, is that it becomes harder to encapsulate the pieces of a vision system: my horizon detector has to know how to use the output of your object detector and vice versa. Perhaps that elusive final 10 or 20 percent accuracy that seems invariably missing in any given application can be achieved by a Frankenstein system that tries to do everything at once.
Is it so terrible that the pipeline has to flow both ways? I could imagine something iterative like stacked labeling where each module is only required to adopt some probabilistic input/output interface.
It's still the maximum a posteriori over the joint distribution. We aren't losing anything by having to iterate twice - it's not a greedy solution. This itself is the Frankenstein system.
In this paper, the authors propose a novel modification of the typical "bag of words" or "black box" approach to object detection. They note (rightly) that what is really going on with sliding window detectors is that they assume a uniform prior on object locations and scales. They point out that this assumption is very often false in common scenes.
Instead, they develop a graphical model of objects, surface geometries and the horizon which is able to distinguish good candidate bounding boxes from bad ones, giving them a much stronger prior of potential object locations. By learning the parameters of this model from real image data, they significantly improve the performance of typical object detectors, and present compelling results in that area.
Positives:
Something has always bothered me about the sliding window detectors. I always got the feeling that they were just throwing the image into a blender and then running it through a black box detector, without any regard for the heuristic assumptions necessary to make predictions about images.
In a sense, these detectors are reactionary. Early vision work used heuristics exclusively, whereas sliding window detectors throw away any assumptions about the image. This paper provides an excellent middle ground -- it is possible to learn the *parameters of a model* and get the best of both worlds.
In general, I think this is the right way to go about doing machine learning. Statistical black-box magic can only get you so far, and the end result is not even very useful (who cares about bounding boxes, really?) If instead, we have strong priors based on *generative models,* we are able to learn much more deeply about the scene and actually *use* the data we get.
Negatives:
I feel like an even more explicit model is necessary to get the kind of performance these authors dream of. The camera parameters should be explicitly modeled. The horizon should be a ground plane, fitted to the data. The objects should be 3D shapes. We've seen elements of these ideas in other papers already. They only need to be combined.
That was Matt Klingensmith btw... (keep forgetting.)
Some people mentioned the limits of how geometry can help object recognition and vice versa. They mentioned that the geometry can be reliably estimated from an image of a street with buildings alongside, but it might not help much for an image that contains only a group of salient objects. I think it falls back to the object image vs. scene image argument. And I think the notions of saliency and gaze are related as well. It is very helpful to estimate where to look in a given image. For example, context is the cue to look at if the object is too tiny, but for larger, stand-alone ones the devil is in the details. To what extent we should use each type of information is the intelligence that underlies vision.