16-824: Learning-based Methods in Vision (F'13): Reading for 10/1

Thursday, September 26, 2013

P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan, Object Detection with Discriminatively Trained Part-Based Models, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Sept. 2010.

And optionally:

Tomasz Malisiewicz, Abhinav Gupta, Alexei A. Efros, Ensemble of Exemplar-SVMs for Object Detection and Beyond, In ICCV 2011.

B. Hariharan, J. Malik, D. Ramanan. Discriminative Decorrelation for Clustering and Classification, ECCV 2012.

59 comments:

Abhinav Shrivastava, September 26, 2013 at 12:42 PM
I would also (highly) recommend:
N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, In CVPR 2005
Yuxiong Wang, September 30, 2013 at 3:22 AM
1. Summary

While the parts-based representation seems intuitive, it is also grounded in physiological and psychological evidence: perception of the whole is based on perception of its parts, a core concept in certain computational theories of recognition. The Deformable Part Model (DPM), proposed and developed in this paper and a series of related papers, is an elegant attempt in computer vision to formulate a feasible model for learning object parts, tackling in particular the intra-category diversity problem in object detection via decomposed parts and mixture models. This complete learning-based system has become a standard pipeline for detecting and localizing objects in images, and is also used in facial analysis and articulated pose estimation. The key idea is to introduce latent-variable models and combine them with the invariant features used by typical methods. Latent-variable models provide a natural formalism for dealing with variation in object appearance, so objects are represented as mixtures of deformable part models. Models are trained using a weakly supervised discriminative method that only requires bounding boxes for the objects in an image. Specifically, the system's high performance rests on three components: (1) strong low-level features based on histograms of oriented gradients (HOG); (2) efficient matching algorithms for deformable part-based models (pictorial structures); (3) discriminative learning with latent variables (latent SVM). Experimental results in this paper and by other researchers have demonstrated the state-of-the-art performance of this approach on the PASCAL and INRIA person datasets. This work was also awarded the PASCAL VOC "Lifetime Achievement" Prize in 2010.
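
For reference, the matching score and the training objective behind components (2) and (3) can be sketched as follows. The notation is my reconstruction of the paper's: F_i are the root and part filters, \phi(H, p_i) the HOG features of the window at placement p_i, d_i the deformation weights with \phi_d(dx, dy) = (dx, dy, dx^2, dy^2), b a bias term, z the latent part placements, and C the usual SVM regularization constant.

```latex
% Score of a hypothesis: filter responses minus deformation costs plus bias
\operatorname{score}(p_0,\dots,p_n) = \sum_{i=0}^{n} F_i \cdot \phi(H, p_i)
    - \sum_{i=1}^{n} d_i \cdot \phi_d(dx_i, dy_i) + b

% Latent SVM classifier: maximize over the latent values z
f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z)

% Training objective: regularized hinge loss
L_D(\beta) = \tfrac{1}{2}\lVert \beta \rVert^2
    + C \sum_{i=1}^{N} \max\bigl(0,\; 1 - y_i f_\beta(x_i)\bigr)
```

Because f_\beta is a maximum of linear functions it is convex in \beta, so the hinge term is convex for negative examples; it becomes convex for positives only once their latent values are fixed.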

2. Issues of interest

(1) Parts-based or holistic representation for object detection

This issue is, and will probably remain, under debate. The two views seem to rest on opposite understandings of the human perception mechanism. Both also have representative algorithms that perform comparably in practice, such as DPM for the parts-based representation and Exemplar-SVMs (T. Malisiewicz, A. Gupta, A. Efros, Ensemble of Exemplar-SVMs for Object Detection and Beyond, In ICCV, 2011) for the holistic representation. Besides, a man-made AI system need not follow the biological mechanisms of humans.

(2) Category level or exemplar based object detection

A similar issue to the previous one. One additional benefit of exemplar-based detection is the ease of transferring metadata.

(3) Representation of detection results

Currently, most detection pipelines use bounding boxes, both when preparing training samples and when reporting final detections. A bounding box is a relatively simple, uniform representation compared with the exact shape contour. Is the contour necessary or important for detection? How can fine-grained detection results be achieved?

(4) Diverse performance across categories (category bias)

Are there any clues about this issue?

(5) Possible new construction approach for object detectors

Are there new schemes that break away from the traditional training and testing framework?

(6) Object detection performance limit

What is the limit of current techniques for image representation and object detection? How much more performance can we still gain, given that all of the available information comes from HOG features?
Yuxiong Wang, September 30, 2013 at 3:27 AM
3. Pros

The paper is really well written. Its underlying philosophy, conceptual elaboration, mathematical derivation, pipeline modules, visual illustrations, and quantitative results are organized coherently. Reading it feels like enjoying scenery or listening to a symphony. It also reminds you of a variety of important techniques widely used in detection systems.

4. Cons

(1) Analysis of the key ingredients in DPM

A big question is left open in this paper: why DPM works, or which component of DPM makes it work. Probably this was not entirely clear to the authors at the time. Later papers discuss the issue, for example S. Divvala, A. Efros, M. Hebert, How important are Deformable Parts in the Deformable Parts Model? Parts and Attributes Workshop, In ECCV, 2012. One of the authors of this paper also has recent work on it: R. Girshick, J. Malik, Training Deformable Part Models with Decorrelated Features, In ICCV, 2013. It says "Ever wonder what makes DPM tick? We dissect DPM training to figure out what bits are important."

(2) Speed

The efficiency of DPM is not discussed in much detail here. In fact, DPM is not known for its speed; its selling point is its ability to detect complex objects. That is why a fast cascade detection algorithm was proposed as an improvement (P. Felzenszwalb, R. Girshick, D. McAllester, Cascade Object Detection with Deformable Part Models, In CVPR, 2010).

(3) Other choices for model structure and dynamic performance making the models more deformable and adaptive

Here, a star-structured pictorial structure model is used. More adaptive model structures are possible.

Part models here can only be placed at a fixed, predetermined scale relative to that of the root model (at twice the resolution). This makes it easy to find the optimal placement of each part efficiently. Methods that also deform the parts across scales efficiently should be possible.
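
To make the placement step concrete, here is a minimal 1D sketch: for a given anchor, pick the position that maximizes the part's filter response minus a quadratic deformation cost. It is a brute-force illustration only, not the paper's implementation (which uses generalized distance transforms to avoid the exhaustive search); the function name `best_part_placement` and the cost weights `d_lin` and `d_quad` are hypothetical.

```python
def best_part_placement(response, anchor, d_lin=0.0, d_quad=1.0):
    """Return (best_score, best_position) for one part along a 1D row.

    response: list of part-filter responses, one per candidate position.
    anchor:   ideal position of the part relative to the root.
    The deformation cost is d_lin * dx + d_quad * dx**2 with dx = pos - anchor.
    """
    best_score, best_pos = float("-inf"), None
    for pos, r in enumerate(response):
        dx = pos - anchor
        score = r - (d_lin * dx + d_quad * dx * dx)
        if score > best_score:
            best_score, best_pos = score, pos
    return best_score, best_pos


# The strong response at position 4 is two cells from the anchor, so with a
# stiff spring it loses to the weaker response sitting on the anchor itself.
responses = [0.0, 1.0, 0.5, 0.0, 3.0]
stiff = best_part_placement(responses, anchor=2, d_quad=1.0)
loose = best_part_placement(responses, anchor=2, d_quad=0.1)
print(stiff, loose)
```

With a looser spring (`d_quad=0.1`) the same part prefers the strong response at position 4, which is the trade-off the deformation weights are learned to control.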

The authors went on to pursue grammar-based models that generalize DPM to allow for objects with variable structure, mixture models at the part level, and reuse of parts across components and object classes. (R. Girshick, P. Felzenszwalb, D. McAllester, Object Detection with Grammar Models, In NIPS, 2011.)

(4) Sensitivity to initialization

The algorithm is sensitive to initialization. One current way to reduce this sensitivity is to use partially or fully annotated data with part and mixture labels.
Unknown, September 30, 2013 at 8:00 AM
With respect to DPM, the paper "How important are Deformable Parts in the Deformable Parts Model?" that Yuxiong mentioned is another very important paper. I read both papers and they are very insightful.

In fact, there has long been discussion and argument between DPM and exemplar-matching based methods. The later Efros paper sits between the two. If there is a good configuration of sub-categories that handles viewing angles as well as sub-class appearance differences, then there is no need to train a complicated DPM to handle possible deformation.

Whichever way is used, one of the biggest concerns for detection is always the appearance/shape deformation caused by angle and pose. Both papers try to handle that. I like the pictorial description of objects and the extremely smart methods in the DPM paper, which make it look like a piece of art. But personally I lean more towards the idea of the Efros paper. In the real world, do humans really care about parts under most circumstances? Is discriminativeness more important, or parts and pictorial description? I choose the first.
Unknown, September 30, 2013 at 12:20 PM
I think there should be a good theory that explains/models the underlying inter-class deformation. Such a theory would extract the important parts of the object and allow other parts to deform in a reasonable way (by reasonable I mean functionally, physically, etc.). A step further might also include view-angle change, scale change, and full geometric reasoning. We humans do detection effortlessly: even without very precise depth information from our eyes, not to mention geometry information, we do the detection task well. So I am not convinced that more information (such as an RGB-D camera) can help discover the underlying principles of recognition.
Unknown, September 30, 2013 at 12:22 PM
In the latent SVM updating scheme using stochastic gradient descent, my experience tells me that the balance between positive and negative samples and the initialization are both important, though in different contexts. I was wondering whether these issues are a big deal in object detection?
Anonymous, September 30, 2013 at 12:25 PM
In this paper, the authors present a method for detecting and localizing objects using deformable part models. Their novel contribution seems to be a way of training such models from a dataset containing only bounding boxes by using a latent SVM, combined with some very strong prior assumptions about how such models should be represented. This allows them to detect multiple objects at multiple scales. They show fairly good performance on object detection datasets with this method. They also make certain claims about context recognition.

Positives:
- By including deformable parts, they are able to account for different aspects, sizes, and shapes of particular instances of objects. This gives them a significant advantage over single-unit templates.
- By training the deformable parts using a discriminative SVM model, the authors are able to efficiently learn new templates (really, cost functions) from few training examples.
- Multiple resolutions and scales are automatically supported to an extent.
- Their approach is robust to translation and rotation to an extent.

Negatives:
- The initialization assumptions are exceedingly strong. They assume bilateral symmetry and a fixed number of parts (six!). These seem very fishy.
- Some of their failure cases show that context is clearly not being taken into account correctly. If we look just at the local features, we see something that looks superficially like a sofa, or a bottle, etc., but when taken as a part of the scene as a whole, the object clearly does not belong to that class. So I think their claims about context are not exactly right.
- It's not clear that the approach can handle multiple object instances occluding one another. Their non-maximum-suppression phase seems to eliminate such cases.
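
The concern about non-maximum suppression can be seen in a small sketch of the greedy scheme: keep the highest-scoring detection, then skip any later detection whose box is at least 50% covered by one already kept. This is an illustration under assumptions, not the authors' code; the helper names `covered_fraction` and `greedy_nms` and the (x1, y1, x2, y2) box layout are hypothetical.

```python
def covered_fraction(box, other):
    """Fraction of `box` area lying inside `other`; boxes are (x1, y1, x2, y2)."""
    w = min(box[2], other[2]) - max(box[0], other[0])
    h = min(box[3], other[3]) - max(box[1], other[1])
    if w <= 0 or h <= 0:
        return 0.0
    area = (box[2] - box[0]) * (box[3] - box[1])
    return (w * h) / area


def greedy_nms(detections, max_cover=0.5):
    """detections: list of (score, box). Returns the kept detections, best first."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Keep the detection only if no already-kept box covers it too much.
        if all(covered_fraction(box, k[1]) < max_cover for k in kept):
            kept.append((score, box))
    return kept


# Two people standing close together: the second box is 60% covered by the
# first, so it is suppressed even though it is a genuine second instance.
dets = [(0.9, (0, 0, 10, 20)), (0.8, (4, 0, 14, 20)), (0.5, (20, 0, 30, 20))]
print(greedy_nms(dets))
```

In this toy case the middle detection is dropped, which is exactly the failure mode for mutually occluding instances raised above.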
Ishan, September 30, 2013 at 5:29 PM
Personally, I think the DPMs mark an important change in object
detection approaches. Although the idea of part-based-models is old
(dating back to at least Binford's generalized cylinders), this paper
showed for the first time a successful way to use them for object
detection on a challenging dataset. I see the following major
contributions of this paper:
1. A more discriminative HOG feature. I especially like the PCA
analysis done in the paper. They make the feature
contrast-sensitive, tweak the normalization, and introduce 4
additional components (which capture texture information).
2. A clean theoretical framework for modeling/learning parts using
Latent-SVMs.
3. A simple way to deal with the non-convexity issues in the
optimization. I've read more complex workarounds, but this is the
one I prefer.
4. Their NIPS 2011 paper introduces object grammars (with occlusion
models) and shows a nice extension of this DPM framework.
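
As a rough illustration of the feature analysis in point 1: as I read the paper, the PCA of 36-dimensional HOG cells (9 orientations times 4 normalizations) suggests that nearly the same subspace is spanned by simple row and column sums of the 4 x 9 cell matrix, giving 13 values per cell. The sketch below shows only that analytic projection; the function name `project_hog_cell` and the list-of-lists layout are assumptions, and the paper's final 31-dimensional feature adds contrast-sensitive orientation channels on top of this idea.

```python
def project_hog_cell(cell):
    """Project one HOG cell onto row and column sums.

    cell: 4 rows (one per normalization) x 9 columns (one per orientation).
    Returns 13 values: 9 sums over the normalizations (one per orientation),
    followed by 4 sums over the orientations (one per normalization). The
    last 4 are the gradient-energy ("texture") components.
    """
    if len(cell) != 4 or any(len(row) != 9 for row in cell):
        raise ValueError("expected a 4 x 9 cell")
    per_orientation = [sum(row[k] for row in cell) for k in range(9)]
    per_normalization = [sum(row) for row in cell]
    return per_orientation + per_normalization


# Toy cell whose rows are constant, so the sums are easy to check by hand.
cell = [[1] * 9, [2] * 9, [3] * 9, [4] * 9]
print(project_hog_cell(cell))
```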

I find this paper very educational especially because of the attention
to detail (e.g. explaining the hard negative mining steps, post
processing - NMS).

I like the DPMs for object detection mainly because of their
false-positive rate. The root filter and part filter interactions make
the DPM less susceptible to false-positives (as compared to other
methods using the same data for training).
M Aravindh, September 30, 2013 at 6:13 PM
I have a doubt about the stochastic gradient descent. The authors begin by saying that they will solve the combined convex problem in both the latent variables of the negative examples and the model parameters. But then they first solve for the latent variables (for the negative examples) with the parameters fixed, and then solve for the parameters with everything else fixed. I don't think that's a good way to solve jointly for both. Am I missing something?
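
One possible reading of that step: the classifier score is a maximum of linear functions of the parameters, and a subgradient of such a maximum is the gradient of whichever term attains it. Picking the best latent value at the current parameters is then part of computing a single subgradient of the convex objective, rather than a separate alternation. The sketch below illustrates this under assumptions: the names `latent_score` and `sgd_step` are hypothetical, and each example is given as an explicit list of candidate feature vectors, one per latent value.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def latent_score(beta, candidates):
    """f_beta(x) = max over latent values z of beta . Phi(x, z).

    candidates: list of feature vectors Phi(x, z), one per latent value z.
    Returns (score, index of the maximizing latent value).
    """
    scores = [dot(beta, phi) for phi in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best


def sgd_step(beta, candidates, y, C, n_examples, lr):
    """One subgradient step on 0.5*||beta||^2 + C * sum of hinge losses,
    using a single sampled example (x, y) with y in {-1, +1}."""
    score, best = latent_score(beta, candidates)
    grad = list(beta)  # gradient of the regularizer
    if y * score < 1:  # margin violated, so the hinge term is active
        phi = candidates[best]  # gradient of the max is the argmax's features
        grad = [g - C * n_examples * y * p for g, p in zip(grad, phi)]
    return [b - lr * g for b, g in zip(beta, grad)]


# A negative example with two latent placements; starting from beta = 0 the
# margin is violated and beta moves away from the best-scoring placement.
beta = sgd_step([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], y=-1, C=1.0, n_examples=1, lr=0.1)
print(beta)
```

For positive examples the candidate list would hold only the single latent value fixed in the relabeling step, which is what keeps the overall objective convex in the parameters.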
M Aravindh, September 30, 2013 at 6:25 PM
I'm still reading the paper but I'm going to write these few comments here so as to get more feedback. People stop commenting close to the deadline :( .

An object is made of parts and parts can be deformed. Similarly, an image is made of objects and their positions can be deformed. Of course the latter does not follow a spring model, and maybe something non-parametric needs to be used. The intuition is that we might be able to get away without any annotations (completely unsupervised, or maybe semi-supervised) by building this extra layer on top of the DPM PAMI paper we read for Tuesday. I discussed this with Dr. Ramanan but he was not interested. What do you think?
Unknown, September 30, 2013 at 6:32 PM
Read Lana Lazebnik's paper on scene DPMs. I think the question is how strong the parts in DPM are, and whether they really work as claimed in the paper or only in the ideal case. Do you guys see any consistency in the parts? Do you see enough spring in the parts? Can you do parts only, with no root model?
M Aravindh, September 30, 2013 at 6:34 PM
I see an intuitive relation between DPMs and deep networks.

In deep networks we fix the local receptive field and compute a function of the features in that local receptive field. The features in this case can be thought of as an object's part correlations, while the part positions and the corresponding deformation costs are held constant. The idea is that, just as a single template can capture some variability in object pose by weighting different gradient components differently, fixing the part position and deformation cost can be compensated for to some extent by adjusting the filters and making them fuzzy.

If instead we don't fix the local receptive field and let the parts move arbitrarily far from the root, we get DPM. Also set the number of layers to 2.

This conceptual difference between a fixed receptive field with a fixed deformation cost and a moving field with a cost that depends on part placement has interesting consequences.
1. DPMs need to move the parts and search - O(nk) - while deep nets don't.
2. DPMs are probably more discriminative because of this freedom. But the deformation is parametric and the part has to work at all locations, which need not be true. This means they are making their filters fuzzy too. I don't think it takes them very far.

Any thoughts?
Arun, September 30, 2013 at 9:54 PM
From their papers, deformable parts models seem to be a convincing approach for object detection. It has good intuitive reasoning - a whole is made by a sum of its parts (more on this later). I had gone to the RI seminar for this work and was also pretty impressed then. I think the coolest part is the lack of specifying any parts (I don't know if this is innovative in itself - this could be common in the literature). It's cool that their model learns the part location distributions, etc. from the data.

Now back to the topic of a whole being a sum of its parts. The question becomes (and we've had this discussion in class): how many parts is that exactly? For a few object categories, like cars and humans, there are a few parts that people largely agree upon (wheels/doors/hood and face/torso/legs/arms respectively), but it becomes hard to decide how many parts are necessary. In the authors' system, this was hardcoded to be six. My guess is that this should instead become a parameter to cross-validate per category. Some categories' detectors may do better with more parts than others. This part seems the most hack-ish, and would be good future work to optimize.

Unknown, September 30, 2013 at 11:06 PM
This paper really interests me a lot. Both the intuition and the math are interesting.

The math in this paper is really clear. The PCA analysis is also interesting and clearly presented through the visualizations.

From my perspective, the intuition behind DPMs is close to how humans understand images by looking at them. I think the way we understand an image by sight is that we first memorize the entire figure and then try to extract its essential parts.

However, the "visual grammars" mentioned in the introduction really interest me. Because I'm not working on object recognition, I'm not sure about the current research on this topic. However, I think this relates to the idea we talked about in class on how to define machine learning: the computer should understand the object and be able to reconstruct it. I think maybe we can use NLP and a knowledge base such as "WordNet" to get the grammar or the structure of objects, so that we only need to define and model some basic "visual words". I think a computer could really understand objects this way, because it could not only recognize an object but also draw a picture of it.

I think there must be some work like this. Please give me some links...
Priya Deo, September 30, 2013 at 11:19 PM
This paper describes a DPM model for detecting multiple object classes. The key idea for this approach is to manipulate a template to fit the image in question and develop a cost function that describes how much energy each manipulation costs.

I really liked the use of data mining to get "hard" negative examples because it reduces the space of negative examples by removing the ones that the algorithm would succeed at anyway. I think this lets the optimization focus on a more precise boundary between positive and negative examples.
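
A minimal sketch of that cache update, assuming negatives are stored as feature vectors and the current detector is available as a scoring function. The names `mine_hard_negatives`, `score_fn` and `max_cache` are hypothetical. A negative counts as easy when it is classified correctly with margin (score below -1); everything else is kept as hard.

```python
def mine_hard_negatives(score_fn, cache, new_negatives, max_cache):
    """One round of negative-cache maintenance (all examples have label -1).

    score_fn:      callable mapping a feature vector to a detector score.
    cache:         negatives currently used for training.
    new_negatives: freshly scanned negative windows.
    max_cache:     upper bound on the number of cached negatives.
    """
    def is_hard(x):
        # For y = -1, the margin condition y * score <= 1 means score >= -1.
        return score_fn(x) >= -1.0

    kept = [x for x in cache if is_hard(x)]  # shrink: drop the easy examples
    for x in new_negatives:                  # grow: add newly found hard ones
        if len(kept) >= max_cache:
            break
        if is_hard(x):
            kept.append(x)
    return kept


# Toy detector that scores a window by its first feature.
score = lambda x: x[0]
cache = mine_hard_negatives(score, [[-2.0], [0.5]], [[-3.0], [-0.5], [2.0]], max_cache=3)
print(cache)
```

Alternating this update with retraining on the cache is what lets training see only the negatives near the decision boundary instead of every window in every negative image.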

I don't think the paper gives much intuition for why the part appearances are computed two levels lower in the image pyramid than the root filter. Why not have the root filter at the same resolution as the parts?

Looking at their precision/recall curves, I think it is interesting that for the person category it is more important to have parts, while for the car category it is more important to model multiple views. Maybe this is because people have much more shape variation than cars, so having multiple mixture components is not as useful. For cars, which have only so many moving parts, shape variation is due to viewpoint, and therefore the mixture components are more important. Maybe it is indicative of some dataset bias, where cars never have doors or trunks open, and so have very little deformation at a given viewpoint.
Unknown, October 1, 2013 at 5:08 AM
My main criticisms of this paper are that:

1. It approaches the problem of multiple object detection in an image with a sliding window approach, using filtering and a context hack as an afterthought. This seems fundamentally wrong, as essentially the type-signature of their solution does not match the type-signature of the desired output (single detection uncoupled / poorly coupled with other detections VS. multiple detections all coupled together)

2. As far as I can tell, there is no notion of occlusion, and I couldn't find anything regarding how this approach performs under occlusion.

3. Restricting 'parts' to be in rectangles seems to be restricting the generalizing power of deformability to a few object classes / scales of part representation.
Unknown, October 1, 2013 at 8:00 AM
Personally, I think of this paper as a really important milestone for object detection, since it builds a framework that is general and *robust* enough to be adopted as a building block for many, many detection systems. Quote from Ross Girshick's thesis: "The methods described in this dissertation account for two-thirds of the performance gain between 2007 and 2011."...

Several points I want to highlight:
1. Parts representation and tree structure.
It establishes an intuitive model structure that is also efficient for inference. Regarding the importance of parts, there is a paper which I think sheds some light on this, http://web.mit.edu/vondrick/largetrain.pdf (Actually I'm really interested in this question: more data or a better model?)
2. LSVM practicalities, including initialization, hard-negative mining and SGD.
3. Improved features from PCA analysis.
4. Context post-processing, which aims at reconciling detections from different models.

Some issues call for further research:
1. How many parts and how many components are needed to capture the appearance variations of different objects? There should be a better way than hand-coded numbers.
2. The contextual post-processing seems not too hacky. However, there is a paper about incorporating NMS and context information into the training process using structured prediction, and only a small performance gain is observed (http://www.ics.uci.edu/~dramanan/papers/nms.pdf). Smarter ways of exploiting context information are also worth researching in the future.

Anyway, I appreciate this paper a lot both for technical reason and for its writing style.

Divya Hariharan, October 1, 2013 at 8:17 AM
I personally enjoyed the paper a lot. There were many interesting math optimizations, which were pretty intuitive.

1. Latent SVM - A detailed theoretical explanation of the model.

2. Applying PCA to reduce dimensionality - A nice analysis of using PCA on HOG features to cover the entire feature space with fewer dimensions. Given that the generative PCA model works for this problem, it might be interesting to try whether the Fukunaga-Koontz Transform (FKT) or Kernel-FKT (which generates a discriminative feature space) would make the classification any better.

3. Data mining hard examples - This was my favorite one - a very clever and neat way of removing negative samples that are classified properly.
Unknown, October 1, 2013 at 8:29 AM
I like the part-based approach to object detection presented in this paper. Of course, as mentioned in previous comments, this approach still doesn't capture all the information that could be relevant to object detection (e.g. 3D pose), but what it does very well is allow for some generalization over object pose while remaining computationally tractable. As it turns out, just modeling how far each part can translate matters a lot (looking at the performance results).

What I didn't like about this paper was their somewhat (I think) ad-hoc approach to including context. In their results, they show that context is quite important (accounting for about a 2% performance gain over pure DPMs), but their only way of accounting for context is essentially a re-normalization of class detections, which seems to work for the PASCAL datasets. I guess this is why it's in the post-processing section of the paper. I think scene context, as well as other information, could help drive object detection directly instead of as a post-process.
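
For concreteness, here is a sketch of the kind of feature vector the rescoring step builds for each detection, as I understand it from the paper: the squashed detection score, the box corners normalized by image size, and the squashed top score of every class detector in the same image. A per-class classifier trained on these vectors then produces the final score. The function names, the argument layout and the exact form of the logistic (including its factor of 2) are assumptions that should be checked against the paper.

```python
import math


def sigma(x):
    """Logistic squashing of a raw detector score (the factor 2 is assumed)."""
    return 1.0 / (1.0 + math.exp(-2.0 * x))


def context_feature(score, box, image_size, best_scores_per_class):
    """Build the rescoring feature for one detection.

    score:                 raw score of this detection.
    box:                   (x1, y1, x2, y2) in pixels.
    image_size:            (width, height) in pixels.
    best_scores_per_class: top raw score of each class detector in this
                           image (20 values for PASCAL VOC).
    """
    w, h = image_size
    x1, y1, x2, y2 = box
    location = [x1 / w, y1 / h, x2 / w, y2 / h]
    context = [sigma(s) for s in best_scores_per_class]
    return [sigma(score)] + location + context


feature = context_feature(0.0, (0, 0, 50, 100), (100, 200), [0.0] * 20)
print(len(feature), feature[:5])
```

With 20 classes this gives a 25-dimensional vector, and nothing in it describes the scene beyond which other detectors fired, which is consistent with the criticism that context only enters as a post-process.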
Unknown, October 1, 2013 at 10:27 AM
I like the idea of using parts for object detection instead of rigid templates. It seems that this should produce more robust and generalizable algorithms. However, the authors themselves state that "simple models have historically outperformed sophisticated models in computer vision, speech recognition, machine translation and information retrieval." They then go on to justify how they are trying to make their complicated model work: by "gradual enrichment" to "maintain performance."

Are these complicated models really better? If the simpler ones work, can we make them work better? Are we moving to more complicated models because the simpler ones have reached their performance limit, or because we philosophically like the more complicated ones better?