16-824: Learning-based Methods in Vision (F'13): Reading for 10/17

Tuesday, October 15, 2013

Reading for 10/17

D. Hoiem, A.A. Efros, and M. Hebert, Putting Objects in Perspective, IJCV 2008.

and optionally

A. Torralba. Contextual priming for object detection. IJCV 2003.

C. Galleguillos and S. Belongie. Context Based Object Categorization: A Critical Survey. CVIU.

39 comments:

Humphrey Hu, October 16, 2013 at 12:23 PM
I'm not the reviewer, but here is my unsolicited opinion anyways:

The overall method seems very geometry-driven, as objects do not interact but depend only on local evidence. However, the local evidence takes the form of inferred object size and orientation, and is at a higher scene level in comparison to image level evidence like HOG, etc.

I remember someone in class mentioned that all of CV is about overfitting to the world. In this case, the authors are "overfitting" to a distribution of object sizes, common orientations, and viewpoints. I would argue that the object size distribution is the most questionable, as we have discussed how humans tend to categorize objects based on function (i.e., a toy car vs. a sedan vs. a mining truck).

From a robotics standpoint, it would be interesting to see how well this approach works when the viewpoint is very well known, as tends to be the case in robotic applications. I also wonder if this method could be easily integrated into an existing probabilistic system for joint pose estimation and detection.
Priya Deo, October 16, 2013 at 3:45 PM
This paper uses scene geometry and camera position to inform object detection, and vice versa. The approach makes a lot of intuitive sense given the points discussed in class last lecture. However, it also makes a lot of assumptions about the individual relationships between scene geometry, camera position, and object location.

One, for example, is that objects are on the ground plane. This assumption is generally false for certain object classes (like birds, lamps, and computer monitors) and may be difficult to verify under heavy occlusion. I would have liked to see a mention of this problem and perhaps some ways to extend the framework to include these object classes. For example, how would this framework handle pictures of birds mid-flight or sitting in trees? Even for the person class, how does this framework perform for close-ups of people (even knee up) where the ground plane is not visible?

Another assumption that the approach seems to rely on is that the true camera position is close to the viewpoint priors. This prior seems to ignore top-down pictures, where a person might be in a building taking pictures of the ground beneath them. We could even consider pictures where the camera is angled downward, so that the horizon is not visible. Would the algorithm be able to recover from the incorrect prior? And what would the horizon be in such a situation?
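
On the last question, where the horizon ends up when the camera tilts down can be worked out from a pinhole model. The following is a minimal sketch, assuming a pinhole camera with no roll, a focal length given in pixels, and image rows that increase downward. The function name and all numbers are illustrative and do not come from the paper.

```python
import math

def horizon_row(focal_px, principal_row, pitch_down_rad):
    """Image row of the horizon for a pinhole camera with no roll.

    Rows increase downward; pitch_down_rad > 0 means the camera looks down.
    """
    return principal_row - focal_px * math.tan(pitch_down_rad)

# Hypothetical 480-row image, focal length 500 px, principal point at row 240.
level = horizon_row(500.0, 240.0, 0.0)                  # horizon through the image centre
tilted = horizon_row(500.0, 240.0, math.radians(45.0))  # negative row: above the top edge
```

Under these assumptions the horizon is still well defined for a camera tilted 45 degrees downward, but it lies about 260 rows above the top of the image, so a prior concentrated inside the frame would have to be broadened to cover it.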
Maheen Rashid, October 16, 2013 at 4:24 PM
Most object detection methods try to detect objects at all image locations and scales. This paper proposes a framework for combining conventional object detection with estimates of the scene's 3D geometry. 3D information is incorporated in two ways: the scene's geometry estimate (surface orientation and types) and the camera's viewpoint. The viewpoint is defined by the horizon position and the camera height. The entire system is modeled as a graphical model where object identities (type of object and bounding box) are considered independent given camera viewpoint, and local surface geometry is considered independent given object identities. Furthermore, each object identity and corresponding surface geometry patch has an associated evidence probability, which is observed.
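
Written out, the independence assumptions described above give a joint distribution of roughly the following form. The symbols are assumed here for illustration: \theta is the viewpoint, o_i the object hypotheses, g_i the local surface geometry, and e_{o,i}, e_{g,i} the corresponding image evidence.

```latex
P(\theta, \mathbf{o}, \mathbf{g}, \mathbf{e}_o, \mathbf{e}_g)
  = P(\theta) \prod_i P(o_i \mid \theta)\, P(e_{o,i} \mid o_i)\, P(g_i \mid o_i)\, P(e_{g,i} \mid g_i)
```

Because every object hangs off the single viewpoint node and every geometry node hangs off its object, the graph is a tree.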

Pros
I really like the idea of using perspective and prior information about typical object heights to make object detection better. The core idea of modeling distance in the image in a manner representative of the 3D world is intuitive and mathematically sound.

Simple representation. The a priori estimation of camera viewpoint is quite simple, and shows good results. In fact overall the probabilistic model is simple to understand and well explained.

Experiments. The authors set out to show that modeling different aspects of the image information together works better than using just one ingredient alone. The experiments are well constructed to demonstrate this and the results are compelling.

Plugin detectors. There is no dependence on the type of detector that is being used. This makes the system adaptable.

Cons

Probability of background scene geometry label. The experimentally learned probability values do not work. Is this because of insufficient data, or is modeling the background label too hard a problem?

Qualitative results. I would have really liked to see what the individual strengths of scene geometry and camera viewpoint are, and how their combination lets each compensate for the other's weakness. It would have been nice to see more pictures to get an intuitive feel for what each part is doing and whether it makes sense.

Types of scenes and objects. Would the same approach work in indoor scenes, or in scenes with more types of objects that have more intraclass height variation?

Typical viewpoint. There is an implicit assumption that some camera viewpoints will not be very different across images. So while this system would be great for security cameras that have very typical views - it may not generalize well.
M Aravindh, October 16, 2013 at 8:31 PM
This paper combines object detection with local scene geometry and view point information. The combination is done using graphical models - a methodology that has been repeatedly used for combining domain knowledge with statistical methods. The improvement in performance is huge - great return on added investment in terms of complexity.

I am wondering if there is an algorithmic interpretation of the message passing that happens in this model. Do messages passed down the tree look like an intuitive use of the viewpoint information?
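
One way to get a feel for the messages is to run sum-product by hand on a stripped-down version of the model. The sketch below is an illustration under strong simplifying assumptions, not the paper's implementation: the viewpoint is a small discrete variable, each candidate is only present or absent, and the surface geometry nodes are left out. All names and numbers are hypothetical.

```python
def viewpoint_and_object_posteriors(prior_theta, p_obj_given_theta, evidence_lik):
    """Exact sum-product on a star-shaped tree: one viewpoint node, N object nodes.

    prior_theta[k]          : P(theta = k)
    p_obj_given_theta[i][k] : P(o_i = present | theta = k)
    evidence_lik[i]         : (P(e_i | o_i = absent), P(e_i | o_i = present))
    """
    K, N = len(prior_theta), len(evidence_lik)

    # Upward messages: what each candidate's evidence says about the viewpoint.
    up = [[(1.0 - p_obj_given_theta[i][k]) * evidence_lik[i][0]
           + p_obj_given_theta[i][k] * evidence_lik[i][1]
           for k in range(K)] for i in range(N)]

    # Viewpoint posterior: prior times all upward messages, normalised.
    theta_post = list(prior_theta)
    for msg in up:
        theta_post = [t * m for t, m in zip(theta_post, msg)]
    total = sum(theta_post)
    theta_post = [t / total for t in theta_post]

    # Downward messages: viewpoint belief built from everything except candidate j.
    obj_post = []
    for j in range(N):
        down = list(prior_theta)
        for i in range(N):
            if i != j:
                down = [d * m for d, m in zip(down, up[i])]
        present = evidence_lik[j][1] * sum(
            d * p_obj_given_theta[j][k] for k, d in enumerate(down))
        absent = evidence_lik[j][0] * sum(
            d * (1.0 - p_obj_given_theta[j][k]) for k, d in enumerate(down))
        obj_post.append(present / (present + absent))
    return theta_post, obj_post

# Two viewpoint states. Candidate 0 is plausible under viewpoint 0 and has strong
# detector evidence; candidate 1 is plausible under viewpoint 1 and has none.
theta_post, obj_post = viewpoint_and_object_posteriors(
    prior_theta=[0.5, 0.5],
    p_obj_given_theta=[[0.9, 0.1], [0.1, 0.9]],
    evidence_lik=[(0.1, 0.9), (0.5, 0.5)],
)
```

In this toy example the confident detection pulls the viewpoint belief towards state 0 (0.82 vs. 0.18), and the downward message then lowers the second candidate from 0.5 to about 0.24, because it only makes sense under the other viewpoint. That is the intuitive reading: a message down the tree is the viewpoint belief implied by all the other candidates.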
Jacob Walker, October 16, 2013 at 9:21 PM
I like this approach; it seems that scene geometry has a lot to say about object detection and vice versa. There is evidence that these processes are interrelated in biological vision. There is strong evidence for the existence of two streams in the visual system - the ventral stream (Object Recognition) and the dorsal stream (Motion, Tracking, Object Localization in Space).

http://en.wikipedia.org/wiki/Two-streams_hypothesis

What researchers have noted is that these streams are interconnected despite having roughly different functions - perhaps mutual information is needed for either problem?

Ishan, October 16, 2013 at 10:01 PM
I like the paper for two reasons:
1. The use of a viewpoint-to-size constraint for object detection.
2. Using both the estimate of objects and horizon to help each other.

For the viewpoint constraint, I remember reading this paper - (http://europa.informatik.uni-freiburg.de/files/sudowe-geometric-constraints-icvs11.pdf) which I found to be very simple, yet efficient. Bastian Leibe's group is known for using such simple tricks to build really good detection systems for driving cars.
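
The viewpoint-to-size constraint mentioned above boils down to a one-line relation between pixel height, horizon position, and camera height. The sketch below uses the standard ground-plane approximation and assumes a roughly level camera, an object standing on the ground, and image rows that increase downward. The function name and the numbers are hypothetical.

```python
def world_height(pixel_height, bottom_row, horizon_row, camera_height_m):
    """Approximate real-world height of an object resting on the ground plane.

    Assumes a roughly level camera and image rows that increase downward,
    so an object on the ground has bottom_row > horizon_row.
    """
    rows_below_horizon = bottom_row - horizon_row
    if rows_below_horizon <= 0:
        raise ValueError("object base must lie below the horizon")
    return pixel_height * camera_height_m / rows_below_horizon

# Hypothetical scene: horizon at row 200, camera 1.6 m above the ground.
near = world_height(pixel_height=170, bottom_row=360, horizon_row=200, camera_height_m=1.6)
far = world_height(pixel_height=17, bottom_row=216, horizon_row=200, camera_height_m=1.6)
```

Both boxes come out at about 1.7 m even though one is ten times taller in pixels, which is why a box's position relative to the horizon says so much about which pixel sizes are plausible for it.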
Abhinav Shrivastava, October 16, 2013 at 10:12 PM
I like the paper as a cute small paper with nice ideas :)
The idea of using scene geometry to impose constraints on where objects can occur, and using objects to impose constraints on possible scene geometry, is not new, but this paper provides a pretty straightforward way of incorporating it into standard object detectors.

Though I don't imagine this being 'super' useful for tasks like object detection on PASCAL, where there is just too much variation in size and viewpoint to capture with a simplistic model, I think it would work great for tasks like the KITTI dataset or autonomous driving, where we are most interested in finding objects on the ground with a camera position that is more-or-less fixed.

I wonder if it is a good idea to reduce the search space for objects first, i.e., find the scales and locations where objects can occur and then run our object detectors, or should we just run detectors everywhere, get all results, and then post-process these results using the constraints mentioned in the paper?
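
The first option, shrinking the search space before detection, can be sketched as a hard filter on candidate windows. This is an illustration only, assuming the horizon row and camera height are already known, a roughly level camera, and rows increasing downward. The function names, window format, and numbers are hypothetical.

```python
def plausible_pixel_heights(bottom_row, horizon_row, camera_height_m,
                            min_height_m, max_height_m):
    """Range of pixel heights an object of plausible world size could have."""
    rows_below_horizon = bottom_row - horizon_row
    if rows_below_horizon <= 0:
        return None  # base at or above the horizon: not on the ground plane
    scale = rows_below_horizon / camera_height_m  # pixels per metre at this depth
    return (min_height_m * scale, max_height_m * scale)

def prune_windows(windows, horizon_row, camera_height_m, min_height_m, max_height_m):
    """Discard (top_row, bottom_row) windows whose size is implausible."""
    kept = []
    for top_row, bottom_row in windows:
        limits = plausible_pixel_heights(bottom_row, horizon_row, camera_height_m,
                                         min_height_m, max_height_m)
        if limits is not None and limits[0] <= bottom_row - top_row <= limits[1]:
            kept.append((top_row, bottom_row))
    return kept

# Hypothetical pedestrian search: horizon at row 200, camera at 1.6 m.
windows = [(190, 360), (300, 360), (199, 216), (100, 150)]
kept = prune_windows(windows, horizon_row=200, camera_height_m=1.6,
                     min_height_m=1.4, max_height_m=2.0)
```

Here only the first and third windows survive. The trade-off is that pruning commits to one viewpoint estimate up front, whereas running the detector everywhere and reweighting afterwards keeps the scores around so that detections can still inform the viewpoint.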
Divya Hariharan, October 16, 2013 at 10:14 PM
I think the idea of using surface geometry and camera viewpoint to get 3D information about the scene, and using that as contextual information to improve the performance of object detection, is very intuitive. The method in the paper shows significant improvement in object detection (with other detectors too). I think this method is the right way forward towards making object detection better in a 3D sense (at least for objects for which some relation with the ground plane can be obtained from a 2D image).
Unknown, October 16, 2013 at 10:38 PM
This paper is also one of the papers I like most. It proposes an object detection framework that incorporates scale cues to reject some of the impossible configurations of object hypotheses, thereby boosting the detector performance considerably. I like the paper since it tries to explore the cues humans naturally use in scene understanding, particularly when strong perspective evidence is available.

But I do want to point out some missing points here. First of all, I don't totally agree with the example given in the beginning, despite the fact that I do think it is an extremely interesting and informative example. The reason why it is so difficult to recognize a car from the green box is not totally because of context. Rather, it is because the image is simply too large and too many high frequency details are distracting you. If you move away from the computer and look at the blocky image again, you'd instantly find that it looks a lot like a car.

This brings up the issue that I think the paper might be missing. The paper puts a lot of emphasis on using scale to reject impossible configurations, but it does not seem to discuss the influence of object scale on feature scale, or the potential problems of matching examples across scales. The paper uses a unified scale for every object patch. Large patches are OK, since a Gaussian pyramid can be used to down-sample them. But what about patches that are naturally small? If you up-sample them, they clearly carry less detail than down-sampled ones. In this case, should we treat them differently?
Unknown, October 16, 2013 at 10:51 PM
Geometric cues seem to be very useful evidence for almost any computer vision task that involves objects at multiple scales. While the viewpoint estimation did seem highly tailored to the specific "street-scene" scenario, I think it would be possible to extend it to much larger datasets in which only some images have horizon ground truth. Hopefully, a system would be able to simply treat the horizon and viewpoint estimates as evidence, and place no hard constraints.

3D evidence and predictions are clearly integral to solving tasks involving scenes, but it seems difficult to engineer systems that reason in 3D and generalize, in the sense that they won't hurt performance in an arbitrary scene environment. Making iterative predictions, and no hard assignments, seems like the only way to approach the "chicken-and-egg" problem the authors mention.
Unknown, October 17, 2013 at 12:17 AM
I liked this paper and the idea of trying to get a more global scene understanding to improve object detection. Figure 2g is really cool in that all those bounding boxes seem like very plausible places for people to be - walking on the sidewalk - even though 2g just shows the probability distribution of people given the viewpoint and geometry.

The process of using object detections to improve viewpoint and geometry estimates is stable because of how belief propagation works in graphical models, right? Guaranteed to converge? (I can't remember at the moment exactly how all the message passing stuff works, sorry if this is all very silly.)

I was surprised the paper didn't perform that well on the PASCAL data set. Is it because they had a bad object detector? I wonder what kind of precision they could get if they had plugged the winning algorithm into their system.
Unknown, October 17, 2013 at 1:17 AM
I like the idea of imposing geometric constraints for object detection, which is quite intuitive and is demonstrated to be useful in the experiments of this paper. Many priors learned in this paper seem to match what we humans would do in localizing and recognizing objects in a scene.

I'm wondering why context, which seems so intuitive and important in human vision, does not help that much in current object detection systems. Maybe it is because of the way we make use of context. As in this paper, many works use context by imposing a higher-level model on top of different types of context. Maybe we should instead let the learning processes for the individual context components communicate with each other, which means training a joint model instead of a piecewise model for the different contexts. Another possible reason is that we are still not at the right scale to use context: the impression I got from this paper is that we really need a huge amount of data to reduce the variance and uncover the underlying context (e.g. estimating the viewpoint).
Unknown, October 17, 2013 at 1:39 AM
A bigger picture here is Geometrically Coherent Image Interpretation, which jointly models the elements that make up the scene with the geometric context of the 3D space that they occupy. The emphasis here is on "coherent": all the measured quantities of the image should be considered together in a coherent way. I invite everyone to scan through Derek Hoiem's PhD thesis "Seeing the world behind the image - Spatial layout for 3D scene understanding - 2007" http://www.cs.uiuc.edu/homes/dhoiem/publications/thesis_derek.pdf

Also "Closing the loop in scene interpretation - cvpr 2008" http://www.cs.uiuc.edu/homes/dhoiem/publications/cvpr2008SceneInterpretation.pdf
Arun, October 17, 2013 at 1:41 AM
I liked how the paper's background was a little more historical than just a bunch of relevant citations. It went well with the overall flavor of the paper, which is inspired by the way images are created and what they come from - something the authors point out was how early vision researchers viewed it as well. Images - and we say images - are usually photos of the world taken by regular people in a regular fashion (i.e. sky is up, grass is down, etc.). As such, as the paper points out, the photos have regular structure in their 3D representation that is then lost when we project into the image. By doing geometric interpretation and representing relationships in the graphical model, we can achieve better performance. This approach seems 'correct' in many ways.
Unknown, October 17, 2013 at 2:02 AM
I really like this paper because the idea is very intuitive and close to how we humans think when we look at an image. The paper is short but clear enough to convey its basic idea.

The way the paper combines viewpoint, object, and surface geometry information is really interesting. The prior knowledge it uses is reasonable. A framework that can combine research results from different subareas of computer vision is interesting to me. Using a graphical model to build the framework is intuitive, but we might build other, more powerful frameworks that combine more things, like prior knowledge of the relationships between different objects and human-object interactions.
Gaurav, October 17, 2013 at 3:42 AM
The authors set a high goal - complete image understanding as humans do it. Humans leverage context and some idea of scene geometry. Indirectly, some of these cues were being used as features in the earlier 2D scene understanding papers we read. To be more explicit: using inference information from neighboring superpixels gives semantic context, and scene geometry features give geometric context.

However, the authors model the problem as complete 3D scene understanding with priors for viewpoints, geometry and objects. They make some fairly reasonable assumptions about how images are taken, from what viewpoints, what object sizes are in the real world, etc.
Comments - What I like:
1. Attempt at complete scene understanding.
2. Flexibility to change object detectors, possibly add scene categorization to the graphical model.
3. Good results :)

General critical comments:
1. Would like some understanding as to how geometry and viewpoints are helping in detection.
2. I feel that since we are moving into 3D there should be some notion of the depth at which the object is detected. The logical extension of this work is to have a bounding box in 3D space instead of the perspective view (this will require camera intrinsics, though).
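
On the second point, a rough depth for a detected object follows from where its base touches the ground, and this is exactly where the intrinsics come in. The sketch below is a minimal illustration, assuming a level pinhole camera with a known focal length in pixels, an object standing on a flat ground plane, and image rows that increase downward. The function name and numbers are hypothetical.

```python
def ground_depth_m(bottom_row, horizon_row, camera_height_m, focal_px):
    """Distance along the ground to an object's base, for a level pinhole camera.

    A ground point at depth z lies camera_height_m below the optical axis and
    projects focal_px * camera_height_m / z rows below the horizon.
    """
    rows_below_horizon = bottom_row - horizon_row
    if rows_below_horizon <= 0:
        raise ValueError("object base must lie below the horizon")
    return focal_px * camera_height_m / rows_below_horizon

# Hypothetical camera: 800 px focal length, mounted 1.6 m above the ground.
near_depth = ground_depth_m(bottom_row=360, horizon_row=200, camera_height_m=1.6, focal_px=800.0)
far_depth = ground_depth_m(bottom_row=216, horizon_row=200, camera_height_m=1.6, focal_px=800.0)
```

With these numbers the two bases sit at roughly 8 m and 80 m. Unlike the object's world height, which needs only the horizon and camera height, the depth cannot be recovered without the focal length.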

Some other points I'd like to raise:
As Humphrey and Maheen mentioned, this system is useful when used on a robot or with a surveillance cam. I'd like to see it integrated with streaming video, using physical-world priors such as "moving objects on roads are cars" or "people move slower than cars".
Mike McCann, October 17, 2013 at 5:49 AM
This paper seems to handle top-down vs bottom-up in the right way, that is, allowing information to flow in both directions. A disturbing consequence of this sort of design, though, is that it becomes harder to encapsulate the pieces of a vision system: my horizon detector has to know how to use the output of your object detector and vice versa. Perhaps that elusive final 10 or 20 percent accuracy that seems invariably missing in any given application can be achieved by a Frankenstein system that tries to do everything at once.
Anonymous, October 17, 2013 at 7:34 AM
In this paper, the authors propose a novel modification of the typical "bag of words" or "black box" approach to object detection. They note (rightly) that what is really going on with sliding window detectors is that they assume a uniform prior on object locations and scales. They point out that this assumption is very often false in common scenes.

Instead, they develop a graphical model of objects, surface geometries and the horizon which is able to distinguish good candidate bounding boxes from bad ones, giving them a much stronger prior of potential object locations. By learning the parameters of this model from real image data, they significantly improve the performance of typical object detectors, and present compelling results in that area.
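
The contrast with a uniform prior can be made concrete by reweighting a detector score with a prior on real-world object height. This is a simplified sketch, not the paper's model: it assumes a known horizon and camera height, a level camera, rows increasing downward, a Gaussian height prior with made-up parameters, and a detector score that can be treated as a likelihood. All names are hypothetical.

```python
import math

def rescore(detector_prob, pixel_height, bottom_row, horizon_row,
            camera_height_m, mean_height_m, std_height_m):
    """Reweight a detector confidence by a Gaussian prior on world height."""
    rows_below_horizon = bottom_row - horizon_row
    if rows_below_horizon <= 0:
        return 0.0  # base above the horizon: impossible for a grounded object
    height_m = pixel_height * camera_height_m / rows_below_horizon
    z = (height_m - mean_height_m) / std_height_m
    prior = math.exp(-0.5 * z * z)  # unnormalised; equals 1.0 at the mean height
    return detector_prob * prior

# Two windows with the same detector score and the same base row (hypothetical values).
plausible = rescore(0.6, pixel_height=170, bottom_row=360, horizon_row=200,
                    camera_height_m=1.6, mean_height_m=1.7, std_height_m=0.1)
implausible = rescore(0.6, pixel_height=60, bottom_row=360, horizon_row=200,
                      camera_height_m=1.6, mean_height_m=1.7, std_height_m=0.1)
```

A plain sliding-window detector would rank these two windows equally. With the height prior, the window that corresponds to a 1.7 m person keeps its score, while the one that would imply a 0.6 m person is suppressed to almost zero.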

Positives:
Something has always bothered me about the sliding window detectors. I always got the feeling that they were just throwing the image into a blender and then running it through a black box detector, without any regard for the heuristic assumptions necessary to make predictions about images.

In a sense, these detectors are reactionary. Early vision work used heuristics exclusively, whereas sliding window detectors throw away any assumptions about the image. This paper provides an excellent middle ground -- it is possible to learn the *parameters of a model* and get the best of both worlds.

In general, I think this is the right way to go about doing machine learning. Statistical black-box magic can only get you so far, and the end result is not even very useful (who cares about bounding boxes, really?). If instead we have strong priors based on *generative models,* we are able to learn much more deeply about the scene and actually *use* the data we get.

Negatives:
I feel like an even more explicit model is necessary to get the kind of performance these authors dream of. The camera parameters should be explicitly modeled. The horizon should be a ground plane, fitted to the data. The objects should be 3D shapes. We've seen elements of these ideas in other papers already. They only need to be combined.
Unknown, October 17, 2013 at 8:28 AM
Some people mentioned the limits of how much geometry can help object recognition and vice versa. They mentioned that geometry can be reliably estimated from an image of a street with buildings alongside, but it might not help much for an image that contains only a group of salient objects. I think it falls back to the object image vs. scene image argument. And I think the notions of saliency and gaze are related as well. It is very helpful to estimate where to look in a given image. For example, context is the cue to look at if the object is too tiny, but for larger, stand-alone ones the devil is in the details. To what extent we should use each type of information is the intelligence that underlies vision.