Comments on 16-824: Learning-based Methods in Vision (F'13): Reading for 10/17 — Carl Doersch

2013-10-17 10:05
It's still the maximum a posteriori over the joint distribution. We aren't losing anything by having to iterate twice; it's not a greedy solution. This itself is the Frankenstein system.
— M Aravindh

2013-10-17 09:59
Is it so terrible that the pipeline has to flow both ways? I could imagine something iterative, like stacked labeling, where each module is only required to adopt some probabilistic input/output interface.
— Humphrey Hu

2013-10-17 08:34
I actually agree with Zhiding that it looks more like a car than the other objects named in the paper. Yes, humans may not be 100% confident about what exactly it is, but I can hardly imagine that a computer mouse would be more likely than a car. I think human eyes are just not used to viewing such pixelated images; imagining beyond the pixels gives us other possibilities. But when you make the pixels smaller, as they should be, we don't bother doing that again, and recognize it as a car easily.
— Jackie

2013-10-17 08:28
Some people mentioned the limits of how much geometry can help object recognition and vice versa. They noted that geometry can be reliably estimated from an image of a street with buildings alongside, but that it might not help much for an image containing only a group of salient objects. I think this comes back to the object-image vs. scene-image argument, and I think the notions of saliency and gaze are related as well. It is very helpful to estimate where to look in an image: for example, context is the cue to use when the object is too tiny, but for larger, stand-alone objects the devil is in the details. Deciding to what extent to use each type of information is the intelligence underlying vision.
— Anonymous

2013-10-17 07:37
My guess is that the PASCAL dataset contains many close-up, framed images of individual objects, whereas their system seems to rely on having a good estimate of where the horizon and vertical support surfaces are before making a detection. I suspect their learned prior model caused them to throw out many good detections from their underlying object detector.
— Anonymous

2013-10-17 07:34
That was Matt Klingensmith, btw...
(keep forgetting.)
— Anonymous

2013-10-17 07:34
In this paper, the authors propose a novel modification of the typical "bag of words" or "black box" approach to object detection. They note (rightly) that what sliding-window detectors are really doing is assuming a uniform prior over object locations and scales, and they point out that this assumption is very often false in common scenes.

Instead, they develop a graphical model of objects, surface geometries, and the horizon that can distinguish good candidate bounding boxes from bad ones, giving them a much stronger prior over potential object locations. By learning the parameters of this model from real image data, they significantly improve the performance of typical object detectors, and they present compelling results.

Positives:
Something has always bothered me about sliding-window detectors. I always got the feeling that they were just throwing the image into a blender and running it through a black-box detector, without any regard for the heuristic assumptions necessary to make predictions about images.

In a sense, these detectors are reactionary. Early vision work used heuristics exclusively, whereas sliding-window detectors throw away any assumptions about the image. This paper provides an excellent middle ground: it is possible to learn the *parameters of a model* and get the best of both worlds.

In general, I think this is the right way to go about doing machine learning. Statistical black-box magic can only get you so far, and the end result is not even very useful (who cares about bounding boxes, really?)
If instead we have strong priors based on *generative models*, we can learn much more deeply about the scene and actually *use* the data we get.

Negatives:
I feel that an even more explicit model is necessary to get the kind of performance these authors dream of. The camera parameters should be explicitly modeled. The horizon should come from a ground plane fitted to the data. The objects should be 3D shapes. We've seen elements of these ideas in other papers already; they only need to be combined.
— Anonymous

2013-10-17 07:13
I agree; that's also a point I like a lot. As Abhinav mentioned in class, there were many good ideas in early vision research; it was only because of the limitations of the time (they didn't have enough data!) that researchers couldn't capture their intuitions in a way that respects the real visual world. Since we now have tons of data, it is time to re-examine some of those good ideas and see whether they can help us move forward.
— Anonymous

2013-10-17 05:58
Re: the incorrect prior, I agree that graceful failure is lacking in some of these methods.
Ideally we could use strong priors (box-shaped rooms, a horizontal horizon near the middle of the image) when they make sense, but have a way to discard them when they really aren't supported by what is in the image.
— Mike McCann

2013-10-17 05:49
This paper seems to handle top-down vs. bottom-up the right way, that is, by allowing information to flow in both directions. A disturbing consequence of this sort of design, though, is that it becomes harder to encapsulate the pieces of a vision system: my horizon detector has to know how to use the output of your object detector, and vice versa. Perhaps that elusive final 10 or 20 percent of accuracy that seems invariably missing in any given application can be achieved by a Frankenstein system that tries to do everything at once.
— Mike McCann

2013-10-17 05:44
I think the intuition of using geometry as a cue to limit the search space of object detection is reasonable. However, the geometric information used in this paper is limited, and the paper makes several assumptions that focus it on outdoor scenes and on objects such as cars and pedestrians. By leveraging many other images rather than this single image, we could obtain more useful contextual information and model more kinds of variation.
— Yuxiong Wang

2013-10-17 03:42
The authors set a high goal: complete image understanding, as humans do it. Humans leverage context and some idea of scene geometry. Some of these cues were used indirectly as features in the earlier 2D scene-understanding papers we read: inference over neighboring superpixels gives semantic context, and scene-geometry features give geometric context more explicitly.

Here, however, the authors model the problem as complete 3D scene understanding, with priors on viewpoint, geometry, and objects. They make some fairly reasonable assumptions about how images are taken: viewpoints, object sizes in the real world, etc.

What I like:
1. The attempt at complete scene understanding.
2. The flexibility to swap object detectors, and possibly to add scene categorization to the graphical model.
3. Good results :)

General critical comments:
1. I would like some understanding of how geometry and viewpoint are actually helping detection.
2. Since we are moving into 3D, there should be some notion of the depth at which an object is detected. The logical extension of this work is a bounding box in 3D space instead of in the perspective view (though that would require camera intrinsics).

Some other points I'd like to raise: as Humphrey and Maheen mentioned, this system is useful on a robot or with a surveillance camera.
I'd like to see it integrated with streaming video, using physical-world priors such as "moving objects on roads are cars" or "people move slower than cars".
— Gaurav

2013-10-17 02:06
I agree with the idea that we might need to train a joint model instead of a piecewise model. It is more intuitive that way; however, pairwise modeling is easier to compute, and we humans do not consider the context relationship between every pair of objects; we just group things together.
— Anonymous

2013-10-17 02:02
I really like this paper because the idea is intuitive and close to what we humans do when we look at an image. The paper is short, but clear enough to demonstrate its basic idea.

The way the paper combines viewpoint, object, and surface-geometry information is really interesting, and the prior knowledge it uses is reasonable. A framework that can combine results from different subareas of computer vision is appealing to me. Building the framework with a graphical model is intuitive for now, but we might build other, more powerful frameworks that incorporate more, such as prior knowledge of the relationships between different objects and of human-object interactions.
— Anonymous

2013-10-17 01:47
With PASCAL, there are probably still some typical viewpoints. If we could model those as well, would it be helpful (i.e., more top-down images, horizontal images, etc.)? This would be like a mixture of 3D scenes on top of our mixture of DPMs, and I wonder if it would buy us anything. In some sense, some of the DPM mixture components already end up capturing viewpoint variation (e.g., when the same object is viewed from different angles, as opposed to the object taking multiple poses, like sleeping vs. standing). A mixture of viewpoints could serve as a level above the DPM. Not sure if this would actually help, though...
— Arun

2013-10-17 01:41
I liked that the paper's background was a little more historical rather than just a bunch of relevant citations. That fit the overall flavor of the paper, which is inspired by how images are created and where they come from, something the authors point out early vision researchers focused on as well. Images (and we do mean everyday photographs) are usually photos of the world taken by regular people in a regular fashion (i.e., sky up, grass down, etc.). As such, as the paper points out, photos have regular structure in their 3D representation that is then lost when we project into the image.
By doing geometric interpretation and representing these relationships in the graphical model, we can achieve better performance. This approach seems "correct" in many ways.
— Arun

2013-10-17 01:39
The bigger picture here is geometrically coherent image interpretation, which jointly models the elements that make up the scene together with the geometric context of the 3D space they occupy. The emphasis is on "coherent": all the measured quantities of the image should be considered together in a coherent way. I invite everyone to scan through Derek Hoiem's PhD thesis, "Seeing the world behind the image - Spatial layout for 3D scene understanding" (2007): http://www.cs.uiuc.edu/homes/dhoiem/publications/thesis_derek.pdf

See also "Closing the loop in scene interpretation" (CVPR 2008): http://www.cs.uiuc.edu/homes/dhoiem/publications/cvpr2008SceneInterpretation.pdf
— Anonymous

2013-10-17 01:17
I like the idea of imposing geometric constraints for object detection; it is quite intuitive, and the paper's experiments demonstrate that it is useful. Many of the priors learned in this paper seem to match what we humans would do when localizing and recognizing objects in a scene.

I'm wondering why context, seemingly so intuitive and important in human visual function, does not help that much in current object detection systems. Maybe it is because of the way we make use of context.
Like in this paper, many works make use of context by imposing a higher-level model on top of the different types of context. Maybe we should instead let the learning processes for the different context components communicate with each other, which means training a joint model rather than piecewise models for the different contexts. Another possible reason is that we are still not at the right scale to use context: the impression I got from this paper is that we really need a huge amount of data to reduce the variance and uncover the underlying context (e.g., when estimating the viewpoint).
— Anonymous

2013-10-17 00:17
I liked this paper and the idea of using a more global scene understanding to improve object detection. Figure 2g is really cool in that all those bounding boxes seem like very plausible places for people to be (walking on the sidewalk), even though 2g shows only the probability distribution over people given the viewpoint and geometry.

The process of using object detections to improve viewpoint and geometry estimates is stable because of how belief propagation works in graphical models, right? Guaranteed to converge? (I can't remember at the moment exactly how all the message passing works; sorry if this is all very silly.)

I was surprised the paper didn't perform that well on the PASCAL dataset. Is it because they had a bad object detector?
I wonder what precision they could have gotten if they had plugged the winning algorithm into their system.
— Anonymous

2013-10-17 00:03
Are you sure that the reason you can clearly see a car once you move away from the computer isn't that you already know it's a car? If you convince yourself that it looks like a computer mouse and then move away, it looks a lot like a computer mouse (to me).
— Anonymous

2013-10-16 23:57
I didn't think the ground-plane support assumption was such a big deal.

(1) It is very reasonable for the authors' application (which we have already agreed upon).

(2) Even without the ground-plane support assumption, you can still use the geometry to narrow the search space a little, since I think it's still reasonable to assume most objects are on vertical surfaces.

(3) Objects are still supported by *something* (unless we're talking about things that can float/fly).
If we want to push detection to finding monitors and mugs, maybe we should also find tables and chairs, let detected objects serve as ground planes for additional objects, and then do something similar to what Abhinav presented on Tuesday regarding the stability of objects resting on each other.
— Anonymous

2013-10-16 23:10
I think that making assumptions and solving the simpler problem is the way to go (since you can't solve a more complicated problem without at least understanding how to do the easier version). But I think the authors could have spent a bit of effort commenting on how this can be extended to objects that aren't on the ground plane, especially since they suggest that their algorithm can be combined with any object detector.
— Priya Deo

2013-10-16 23:04
I agree. Since they model object height with a probability distribution, you simply get a larger variance for classes with large intraclass height variation. But at least this gives you the ability to say things like: people will never be taller than 10 ft, and cars will never be shorter than 3 ft.
— Priya Deo

2013-10-16 22:51
Geometric cues seem to be very useful evidence for almost any computer vision task that involves objects at multiple scales.
While the viewpoint estimation did seem highly tailored to the specific "street scene" scenario, I think it would be possible to extend it to much larger datasets that have horizon ground truth for only some images. Hopefully a system could treat the horizon and viewpoint estimates only as evidence, and place no hard constraints on them.

3D evidence and predictions are clearly integral to solving tasks involving scenes, but it seems difficult to engineer 3D-reasoning systems that are generalizable in the sense that they won't hurt performance in an arbitrary scene environment. Making iterative predictions, with no hard assignments, seems like the only way to approach the "chicken-and-egg" problem the authors mention.
— Anonymous
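Several comments in this thread circle around the paper's central geometric constraint: for an object resting on the ground plane, the image height of its bounding box, the horizon position, and the camera height jointly determine its implied world height, which can then be scored under a class-specific height prior (e.g., people cluster around 1.7 m). Here is a minimal sketch of that idea, assuming a level camera over a flat ground plane; the function names and the Gaussian prior parameters are illustrative choices, not the paper's actual implementation:

```python
import math

def implied_world_height(v_top, v_bottom, v_horizon, cam_height):
    """World height of a ground-supported object from its bounding box.

    Standard pinhole relation for a level camera above a flat ground
    plane: h / y_c = (v_bottom - v_top) / (v_bottom - v_horizon), with
    image v-coordinates measured downward from the top of the image.
    """
    if v_bottom <= v_horizon:
        raise ValueError("a ground-supported object must extend below the horizon")
    return cam_height * (v_bottom - v_top) / (v_bottom - v_horizon)

def height_log_prior(h, mu, sigma):
    """Log-density of a Gaussian prior N(mu, sigma^2) on world height."""
    return -0.5 * ((h - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

# A candidate "person" box whose bottom edge sits 200 px below the horizon:
h = implied_world_height(v_top=230, v_bottom=400, v_horizon=200, cam_height=1.6)
# h = 1.6 * 170 / 200 = 1.36 m: short for a person, so a prior centered
# at 1.7 m down-weights this box relative to a correctly sized one.
assert height_log_prior(h, mu=1.7, sigma=0.15) < height_log_prior(1.7, mu=1.7, sigma=0.15)
```

Adding such a geometric log-prior to a detector's confidence score is one simple way to realize the reweighting discussed above; the paper itself instead infers viewpoint, geometry, and detections jointly in a graphical model, so the constraint flows in both directions.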