P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan, Object Detection with Discriminatively Trained Part-Based Models, IEEE Pattern Analysis and Machine Intelligence (PAMI). Sept 2010.
And optionally:
Tomasz Malisiewicz, Abhinav Gupta, Alexei A. Efros, Ensemble of Exemplar-SVMs for Object Detection and Beyond, In ICCV 2011.
B. Hariharan, J. Malik, D. Ramanan. Discriminative Decorrelation for Clustering and Classification, ECCV 2012.
I would also (highly) recommend:
N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, In CVPR 2005
1. Summary
While the parts-based representation seems intuitive, it also rests on physiological and psychological evidence: perception of the whole is based on perception of its parts, one of the core concepts in certain computational theories of recognition. Deformable Part Models (DPM), proposed and developed in this paper together with a series of related papers, are an elegant example in computer vision of a feasible model for learning object parts, and in particular they tackle intra-category diversity in object detection via decomposed parts and mixture models. This complete learning-based system has become the standard pipeline for detecting and localizing objects in images, and is also used in facial analysis and articulated pose estimation. The key is to introduce latent-variable models and combine them with the invariant features used by typical methods. Latent-variable models provide a natural formalism for dealing with object appearance variation. Objects are thus represented as mixtures of deformable part models. Models are trained using a weakly supervised discriminative method that only requires bounding boxes for the objects in an image. To be specific, the system mainly consists of three components responsible for its high performance: (1) strong low-level features based on histograms of oriented gradients (HOG); (2) efficient matching algorithms for deformable part-based models (pictorial structures); (3) discriminative learning with latent variables (latent SVM). Experimental results in this paper and by other researchers have demonstrated the state-of-the-art performance of this approach on the PASCAL and INRIA person datasets. This work was also awarded the PASCAL VOC "Lifetime Achievement" Prize in 2010.
2. Issues of interest
(1) Parts-based or holistic representation for object detection
This issue is, and will probably remain, under debate. Philosophically, the two views seem to rest on opposite understandings of the human perception mechanism. Both also have representative algorithms that work comparably well in practice, such as DPM here for the parts-based representation and Exemplar-SVMs (T. Malisiewicz, A. Gupta, A. Efros, Ensemble of Exemplar-SVMs for Object Detection and Beyond, In ICCV, 2011) for the holistic representation. Besides, a man-made AI system need not follow the biological mechanisms of humans.
(2) Category level or exemplar based object detection
A similar issue to the previous one. One additional benefit of exemplar-based detection is the ease of transferring metadata.
(3) Representation of detection results
Currently, most detection results, both when preparing training samples and when reporting final detections, use bounding boxes. A bounding box is a simple, uniform representation compared with the exact shape contour. Is the contour necessary or important for detection? How can fine-grained detection results be achieved?
(4) Diverse performance for different categories (category bias)
Some clues for this issue?
(5) Possible new construction approach for object detectors
Some new schemes to break through the traditional training and testing framework?
(6) Object detection performance limit
What is the limit of current techniques for image representation and object detection? How much higher can we still go, given that all of the available information comes from HOG features?
My views on the issues:-
(1) Part-based detection with scenes providing a prior seems a good balance of both.
(2) We need a ton of exemplars and that means a bad sample complexity.
(3) I think contour level details are a post process. Detection/recognition need not map back into pixel space at all and need only map into the representation used for whichever task is required to be done with it.
(4) It's hard to say, given that data bias will itself make some classes much harder than others. Are the absolute numbers for aeroplane low because aeroplanes are hard, or because aeroplanes don't satisfy the model assumptions? Relative scores are hard to judge as well ... did the competitor tune his model to work well on cars, or are cars bad for this DPM?
(6) There's a NIPS 2013 paper by Google that uses deep nets for object detection - http://nips.cc/Conferences/2013/Program/event.php?ID=4018 ... maybe they can show what we can do with just raw pixels and feature learning.
Drop by, I can show you that paper :P
3. Pros
The paper is really well written. Its underlying philosophy, conceptual elaboration, mathematical derivation, pipeline modules, visual illustrations, and quantitative results are organized coherently. Reading the paper feels like enjoying scenery or listening to a symphony. It also reminds you of a variety of important techniques widely used in detection systems.
4. Cons
(1) Analysis of the key ingredients in DPM
A big issue is left open in this paper: why DPM works, or which component of DPM makes it work. The authors themselves may not have been entirely clear on this at the time. Later, some papers discussed the issue. For example, S. Divvala, A. Efros, M. Hebert, How important are Deformable Parts in the Deformable Parts Model? Parts and Attributes Workshop, In ECCV, 2012. One of the authors of this paper also has a recent work: R. Girshick, J. Malik, Training Deformable Part Models with Decorrelated Features, In ICCV, 2013. It says "Ever wonder what makes DPM tick? We dissect DPM training to figure out what bits are important. "
(2) Speed
The efficiency of DPM is not discussed much here. Actually, DPM is not known for its speed. Its selling point lies in its ability to identify complex objects. That's why a fast cascade detection algorithm was later proposed as an improvement (P. Felzenszwalb, R. Girshick, D. McAllester, Cascade Object Detection with Deformable Part Models, In CVPR, 2010).
(3) Other choices for model structure and dynamic performance making the models more deformable and adaptive
Here, the star-structured pictorial structure model is used. Some more adaptive model structures are possible.
Part models here can only deform at a fixed predetermined scale relative to that of the root model (at twice the resolution). In this way it is easy to find the optimal placement of each part efficiently. Some methods to efficiently deform the parts across scales as well are possible.
The authors pursued grammar-based models that generalized DPM to allow for objects with variable structure, mixture models at the part level and reusability of parts across components and object classes. (R. Girshick, P. Felzenszwalb, D. McAllester, Object Detection with Grammar Models, In NIPS, 2011.)
(4) Sensitivity to initialization
The algorithm is sensitive to initialization. The current mitigating method to reduce this sensitivity is to use partially or fully annotated data with part and mixture labels.
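To make the part-placement point in (3) concrete, here is a minimal sketch of how one part could be placed for a single fixed root location: maximize the part-filter response minus a quadratic deformation cost around an anchor. The function names, the toy response grid, and the weight values are all assumptions for illustration. The brute-force loop stands in for the generalized distance transform that the paper uses to do this for every root location at once.

```python
def best_part_placement(response, anchor, d):
    """Return (score, (x, y)) maximizing response[y][x] minus deformation cost.

    response: 2D list of part-filter scores, indexed [y][x] (toy data here).
    anchor:   (ax, ay), the ideal part position on this grid.
    d:        (d1, d2, d3, d4), weights on dx, dy, dx^2, dy^2 (hypothetical values).
    """
    ax, ay = anchor
    best = None
    for y, row in enumerate(response):
        for x, r in enumerate(row):
            dx, dy = x - ax, y - ay
            cost = d[0] * dx + d[1] * dy + d[2] * dx * dx + d[3] * dy * dy
            score = r - cost
            if best is None or score > best[0]:
                best = (score, (x, y))
    return best


def star_score(root_score, part_responses, anchors, deform_weights, bias=0.0):
    """Star-model score: root response + best placement of each part + bias."""
    total = root_score + bias
    for response, anchor, d in zip(part_responses, anchors, deform_weights):
        total += best_part_placement(response, anchor, d)[0]
    return total


# Toy example: a strong response one cell to the right of the anchor.
resp = [[0.0, 0.0, 0.0],
        [0.0, 1.0, 5.0],
        [0.0, 0.0, 0.0]]
print(best_part_placement(resp, (1, 1), (0.0, 0.0, 1.0, 1.0)))    # -> (4.0, (2, 1))
print(best_part_placement(resp, (1, 1), (0.0, 0.0, 10.0, 10.0)))  # -> (1.0, (1, 1))
```

With a cheap deformation cost the part moves to the stronger response; with an expensive one it stays at the anchor. Because each part depends only on the root, the parts can be optimized independently, which is what keeps matching tractable for the star structure.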
I would consider the speed comment slightly differently. I think DPMs reached a level of detection performance where people actually bothered to see how fast it was. There has been a lot of work in making them fast. The most recent being - "Fast, Accurate Detection of 100,000 Object Classes on a Single Machine - Dean et al"
Other work
1. Bounding Part Scores for Rapid Detection with Deformable Part Models - I. Kokkinos
2. Exact Acceleration of Linear Object Detectors - Dubout and Fleuret
With respect to DPM, the paper "How important are Deformable Parts in the Deformable Parts Model?" that Yuxiong mentioned is another very important paper. I read both papers and they are very insightful.
In fact, there has long been debate between DPM and the exemplar-matching based methods. The later Efros paper sits in the middle of both. If there is a good configuration of sub-categories that handles the viewing angles as well as sub-class appearance differences, then there is no need to train a complicated DPM to handle possible deformation.
No matter which way is used, one of the biggest concerns for detection is always the appearance/shape deformations caused by angle and pose. Both papers were trying to handle that. I like the pictorial description of objects and the extremely smart methods in the DPM paper, which make it look like a piece of art. But personally I lean more towards the idea of the Efros paper. In the real world, do humans really care about parts under most circumstances? Is discriminativeness more important, or parts and pictorial description? I choose the first one.
I also feel that exemplar-matching is appealing and can take us very far toward solving the problem. I think that there is a class of object detection problems that cannot be solved that way, though. Imagine the problem of detecting "cup," when what we mean is "something that can hold water." Here an understanding of physics and 3D is necessary.
Remove the parts from DPM and then increase the number of mixture components to be equal to the number of positive examples. Now perform a few convex relaxations (union bound etc.) and you'll be able to decompose the problem into n+ independent SVM problems. That's exemplar SVM!!! [This is a rigorous argument and I have a well written proof which I'll try to publish. I don't think anyone in the community will be interested though - any suggested venues? tiny workshops?]
DPM is trying to get away with fewer mixture components. This gives it statistical strength. The parametric nature of the deformation cost makes it generalize better to unseen deformations. This won't happen for exemplar SVM.
I had the pleasure of being part of a conversation (over beer) with Santosh and Pedro which touched on the point you describe. I'll try both points of view, give my personal take, and then leave it open for discussion.
I think there should be a good theory that explains/models the underlying inter-class deformation. This theory would extract the important parts of the object and allow other parts to deform in a reasonable way (by reasonable I mean functionally, physically, etc.). A step further might also include view-angle change, scale change, and full geometric reasoning. We humans do detection effortlessly: we do not have very precise depth information from our eyes, not to mention geometry information, yet we are able to do the detection task well. So I am not convinced that more information (like an RGB-D camera) can help discover the underlying principles of recognition.
I agree. One sort of "fishy" thing about this paper that I see is the lack of rotation, shearing, or any 3D deformation of each of the parts. I don't think that it's enough to capture the large space of possible deformations. For this reason, I think 3D deformable parts are needed.
-- Matt Klingensmith
Agreed, I look forward to reading the paper for Thursday as it seems to address this.
Part deformations are actually able to approximate perspective changes and rotations to a certain degree. This paper gives a much better intuition for why this is true: Yi Yang, Deva Ramanan. "Articulated Pose Estimation with Flexible Mixtures of Parts".
Also, the paper describes using a mixture model for each object category. Each component of the mixture model can capture a distinct viewpoint (front and side, for example). So I think these two pieces in conjunction actually make the DPM more robust to rotation, viewpoint, and 3D deformation.
Besides parts being non-deformable, they are also only tied to the center (star-like). I mean the model in this paper is kept consistently simple. To me it was interesting that they in fact use mixture models to mitigate this.
What about objects that are *very* non-rigid, such as clothes or animals? It seems that DPM would totally break with these classes. The authors themselves note that their model struggled a lot with cat detection. Do you know of any methods that have been able to reliably (relatively speaking) detect objects of such categories?
In the latent SVM updating scheme using stochastic gradient descent, my experience tells me that the balance between positive and negative samples and the initialization are both important, though in different contexts. I was wondering whether these issues are a big deal in object detection.
Well, there is definitely an imbalance between positive and negative samples in object detection, but because of the use of hard-mining, the problem is mitigated. I'll briefly discuss this.
I don't think this is specifically related to hard-mining, and this is a little more speculative - in the scenario of object detection, there are many more things in the world (dataset) that are NOT the category than are. So there's some intuition that you will need to provide a lot more negative examples to show the learner what is not an object than what is.
I believe their hard-mining is only optimal with regard to their dataset - there is no claim that it is real-life optimal (whatever that means :P). It just means that it will be the optimal learner given the negative examples and the positive examples, as if you had trained it in a batch or standard online way.
With regard to the imbalance between pos and neg data, which can be mitigated by placing different weights, I think initialization is much more crucial for these "latent" things. The initialization described in this paper is simple and effective. Is there a better way of doing this initialization?
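The hard-mining loop discussed in this thread can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `train_linear` is a hypothetical stand-in for a real SVM solver, the features are tiny hand-made tuples, and the cache sizes are arbitrary. The idea is to train on a small cache of negatives, then repeatedly drop the cached negatives that are now easy (score below -1) and add pool negatives that still violate the margin.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_linear(pos, neg, epochs=100, lr=0.05, reg=0.01):
    """Subgradient descent on a hinge loss; a stand-in for a real SVM solver."""
    w = [0.0] * len(pos[0])
    data = [(x, 1.0) for x in pos] + [(x, -1.0) for x in neg]
    for _ in range(epochs):
        for x, y in data:
            w = [wi * (1.0 - lr * reg) for wi in w]   # shrink (regularizer)
            if y * dot(w, x) < 1.0:                   # margin violated
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w


def mine_hard_negatives(pos, neg_pool, initial=2, per_round=2, max_rounds=10):
    """Alternate between training on a cache and refreshing the cache."""
    cache = list(neg_pool[:initial])
    w = train_linear(pos, cache)
    for _ in range(max_rounds):
        # Hard negatives: pool examples not in the cache that score above -1.
        hard = [x for x in neg_pool if x not in cache and dot(w, x) > -1.0]
        if not hard:
            break  # no uncached negative violates the margin
        # Shrink: keep only cached negatives that are still hard.
        cache = [x for x in cache if dot(w, x) > -1.0]
        # Grow: add a few of the newly found hard negatives and retrain.
        cache.extend(hard[:per_round])
        w = train_linear(pos, cache)
    return w, cache


# Toy data: last feature is a constant bias term.
pos = [(2.0, 1.0), (3.0, 1.0)]
neg_pool = [(-1.0, 1.0), (-2.0, 1.0), (0.5, 1.0), (-3.0, 1.0), (0.0, 1.0)]
w, cache = mine_hard_negatives(pos, neg_pool)
print(w, cache)
```

The point of the cache is that the learner never has to hold the full (huge) set of negative windows in memory, only the ones near the decision boundary.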
In this paper, the authors present a method of detecting and localizing objects using deformable part models. Their novel contribution seems to be a way of training such models from a dataset containing only bounding boxes by using a latent SVM, combined with some very strong prior assumptions about how such models should be represented. This allows them to detect multiple objects at multiple scales. They show fairly good performance on object detection databases using this result. They also make certain claims about context recognition.
Positives:
- By including deformable parts, they are able to account for different aspects, sizes, and shapes of particular instances of objects. This gives them a significant advantage over single-unit templates.
- By training the deformable parts using a discriminative SVM model, the authors are able to efficiently learn new templates (really, cost functions) from few training examples.
- Multiple resolutions and scales are automatically supported to an extent.
- Their approach is robust to translation and rotation to an extent.
Negatives:
- The initialization assumptions are exceedingly strong. They assume bilateral symmetry and a fixed number of parts (six!). These seem very fishy.
- Some of their failure cases show that context is clearly not being taken into account correctly. If we look just at the local features, we see something that looks superficially like a sofa, or a bottle, etc., but when taken as a part of the scene as a whole, the object clearly does not belong to that class. So I think their claims about context are not exactly right.
- It's not clear that the approach can handle multiple object instances occluding one another. Their non-maximum-suppression phase seems to eliminate such cases.
This is Matthew Klingensmith btw (6cc9cd6c)
Well, they tried different numbers of parts (I'll present that). Plus, bilateral symmetry is assumed because during training (and sometimes testing) they use both the original and the flipped image/object. Since all objects have a flipped variant, bilateral symmetry is obvious (though I don't support assuming it).
I'll briefly discuss NMS, but that's a problem with using NMS. You can get multiple objects occluding each other, but it depends a lot on the NMS threshold etc.
The fact that all objects have a flipped version would argue for adding a latent variable for chirality, rather than making all models bilaterally symmetric. This would allow models like "left-facing car" and "right-facing car."
I agree with your assessment of the negatives, Matthew. Adjacent or occluding objects of the same category are very difficult for the NMS hack to cope with.
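A small sketch of greedy non-maximum suppression shows why the threshold matters for adjacent or occluding instances. Note the assumption: intersection-over-union is used here as the overlap measure for simplicity, whereas the paper's criterion, as far as I recall, is based on how much of a box is covered by an already selected detection. The boxes and scores are made up.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(detections, threshold=0.5):
    """detections: list of (score, box). Visit boxes in decreasing score order
    and keep a box only if it overlaps every kept box by at most `threshold`."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, k[1]) <= threshold for k in kept):
            kept.append((score, box))
    return kept


# Two people standing side by side, boxes overlapping (IoU is about 0.43).
dets = [(0.9, (0.0, 0.0, 10.0, 20.0)), (0.8, (4.0, 0.0, 14.0, 20.0))]
print(len(greedy_nms(dets, threshold=0.5)))  # -> 2, both survive
print(len(greedy_nms(dets, threshold=0.3)))  # -> 1, the second person is suppressed
```

The same pair of detections is kept or dropped purely depending on the threshold, which is the sensitivity mentioned above.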
Personally, I think the DPMs mark an important change in object
detection approaches. Although the idea of part-based-models is old
(dating back to at least Binford's generalized cylinders), this paper
showed for the first time a successful way to use them for object
detection on a challenging dataset. I see the following major
contributions of this paper
1. A more discriminative HOG feature. I especially like the PCA
analysis done in the paper. They make the feature
contrast-sensitive, tweak the normalization, and introduce 4
additional components (which capture texture information).
2. A clean theoretical framework for modeling/learning parts using
Latent-SVMs.
3. An easy solution to solving the non-convexity issues in the
optimization. I've read more complex workarounds, but this is the
one I prefer the most.
4. Their NIPS 2011 paper introduces object grammars (with occlusion
models) and shows a nice extension of this DPM framework.
I find this paper very educational especially because of the attention
to detail (e.g. explaining the hard negative mining steps, post
processing - NMS).
I like the DPMs for object detection mainly because of their
false-positive rate. The root filter and part filter interactions make
the DPM less susceptible to false-positives (as compared to other
methods using the same data for training).
I've a doubt about stochastic gradient descent. The authors begin by saying that they will solve the combined convex problem in both the latent variables of the negative examples and the model parameters. But then they first solve for the latent variables (for negative examples) by fixing the parameters and then solve for the parameters fixing everything else. I don't think that's a good way to solve jointly for both. Am I missing something?
Yes, there are indeed more complicated ways of doing this.
The Max-margin clustering (MMC, NIPS 2005) paper relaxes some constraints to get an SDP solution.
A recent CVPR 2013 paper does this in an iterative fashion (much akin to the DPM paper) - Large-Scale Video Summarization Using Web-Image Priors.
In fact, there are tons of follow up papers on MMC which basically try very simple alternatives around the NP-hard convexity. They all seem to work better than the original SDP solution (both quality and speed).
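For reference, the objective under discussion can be written as below. This is my reading of the latent SVM formulation in the paper, with \beta the model parameters, z the latent part placements, and \Phi(x, z) the feature vector for a given placement.

```latex
f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z)
\qquad
L_D(\beta) = \frac{1}{2}\,\lVert \beta \rVert^2
  + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i f_\beta(x_i)\bigr)
```

Since f_\beta is a maximum of functions linear in \beta, it is convex in \beta. For a negative example (y_i = -1) the hinge term is \max(0, 1 + f_\beta(x_i)), which is therefore convex, so the latent variables of the negatives do not break convexity. For a positive example the term \max(0, 1 - f_\beta(x_i)) is not convex in general, but it becomes convex once z is fixed for that example. This is why the alternation only needs to fix the latent values of the positives: with those fixed, optimizing \beta is a convex problem, and the procedure as a whole is coordinate descent that reaches a local optimum, not a joint global solve.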
I'm still reading the paper but I'm going to write these few comments here so as to get more feedback. People stop commenting close to the deadline :( .
An object is made of parts and parts can be deformed. Similarly, an image is made of objects and their positions can be deformed. Of course the latter does not follow a spring model, and maybe something non-parametric needs to be used. The intuition is that we might be able to get away without any annotations (completely unsupervised or maybe semi-supervised) by building this extra layer on top of the DPM PAMI paper we read for Tuesday. I discussed this with Dr. Ramanan but he was not interested. What do you think?
Well, if you model all the objects (and everything else in the image), it might be of slight help. Lana's paper that Abhinav mentions (Scene DPMs) and Pedro's paper that Xinlei mentions below are both in this direction. In particular, the future work section of Pedro's paper (Reconfigurable Models for Scene Recognition) mentions something in that vein (without modelling explicit objects).
One problem with just modelling explicit objects is that the space of their configurations is just too huge. On a small scale, visual phrases and object-groups for scene classification (https://computing.ece.vt.edu/~parikh/Publications/LiParikhChen_CVPR_2012_groups_of_objects.pdf) have that flavor.
Read Lana Lazebnik's paper on Scene DPMs.... I think the question is how strong the parts in DPM are, and whether they really work as claimed in the paper or only in the ideal case. Do you guys see any consistency in the parts? Do you see enough spring in the parts? Can you do parts only and no root model?
Another paper for scene recognition:
S. Naderi Parizi, J. Oberlin, P. Felzenszwalb
Reconfigurable Models for Scene Recognition
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012
With rigid grids...
Yes, you can do parts only and no root model. All the pose-estimation work (starting from Yang and Ramanan) uses only parts and no root. The tree structure of these models starts from the head. I'll touch on this difference in tomorrow's class.
Is there any work on detecting one part first and then inferring the root? I think this is more like what we humans do when finding an object in a big, coarse image. We just find some significant parts and then widen our field of view (finding the root model). Finally, we double-check by looking at other parts.
In this paper finding the root first is necessary, because the root is searched for at a lower resolution than the parts. I guess they imply that it creates far fewer false positives as input to the optimization.
By the way, I remember Martial mentioned in the CV class that the problem with non-tree-like springs is the complexity of the optimization (correct me here!). So if you want to take care of this, you may want to go further and make the parts deformable as well.
@Ruikun--the most famous paper I can recall that tries to do parts first is the "implicit shape model" of Leibe, Leonardis, and Schiele, and there's lots of follow-up work. That paper focuses on the very easy UIUC cars dataset, though; I don't think there's much evidence that this approach works well on newer datasets. Unless, of course, you count deep learning as an instance of a parts-first algorithm.
I think parts are important, and some parts are more important (in the sense of indicative) than others, sometimes even more important than the whole for recognition (because they are more rigid while the whole might be very deformable).
Touching on the part about finding the root first at a lower resolution: I was confused about the purpose of lambda. It seemed that lambda described how many pyramid levels were between the root and the level with twice the resolution of the root. However, I didn't understand what they did with the levels in between.
This is slightly unrelated, but I am also curious as to what their failure cases looked like. All the images they showed had great looking parts. They also talked a bit about how some of their failures on the PASCAL data set weren't "real" failures. I wonder if they ever detected parts in completely erroneous locations, or if their root model ever threw the whole thing off.
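On the lambda question, a small sketch may help. As I understand the paper, lambda is the number of pyramid levels per octave, so going lambda levels toward the fine end doubles the resolution. The intermediate levels are not skipped: each of them is itself a candidate root level (a slightly different object scale), and its parts are looked up lambda levels finer. The level indexing below (level 0 is the finest) and the function names are assumptions made for illustration.

```python
def pyramid_scales(num_levels, lam):
    """Resolution of each level relative to level 0 (assumed finest).

    Consecutive levels differ by a factor of 2 ** (-1 / lam), so levels
    that are `lam` apart differ by exactly a factor of two.
    """
    return [2.0 ** (-level / lam) for level in range(num_levels)]


def part_level(root_level, lam):
    """Parts live `lam` levels finer than the root, i.e. at twice its resolution."""
    return root_level - lam


lam = 5
scales = pyramid_scales(11, lam)
# Every level from `lam` upward can host a root filter; each gets its own part level.
for root in range(lam, len(scales)):
    parts = part_level(root, lam)
    print(root, parts, scales[parts] / scales[root])
```

So with lam = 5, a root at level 7 has its parts at level 2, a root at level 6 has them at level 1, and so on. The in-between levels give finer steps in object scale than a plain factor-of-two pyramid would.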
I see an intuitive relation between DPMs and deep networks.
In deep networks we fix the local receptive field and compute a function of the features in that local receptive field. The features in this case can be thought of as correlations with the object's parts, and the part position and corresponding deformation cost are made constant. The idea is that, just as a single template can capture some variability in object pose by weighting different gradient components differently, fixing the part position and deformation cost can be compensated for to some extent by adjusting the filters and making them fuzzy.
If instead, we don't fix the local receptive field and let the parts go arbitrarily away from the root we get DPM. Also make the number of layers to be 2.
This conceptual difference between a fixed receptive field and a fixed deformation cost vs a moving field and a cost depending on part placement has interesting consequences.
1. DPMs need to move the parts and search - O(nk) - while deep nets don't.
2. DPMs are probably more discriminative because of this freedom. But the deformation is parametric and the part has to work for all places, and that need not be true. This means they are making their filters fuzzy too. I think it's not taking them very far.
Any thoughts?
Yes, there is an obvious connection to deep nets. You can imagine learning the low-level filters for parts and root, and also learning the parameters of the latent relationships between them, via deep learning. This way, you can learn more complicated relationships between parts+root, as opposed to just a star-model.
Actually, I am wondering why the star model is called "star". To me it is more like a tree model (a root + several children as leaves). Does it necessarily form a star-like relationship? Or does the algorithm penalize the model for placing its parts in weird ways? (For example, all parts could be aligned on the same line.)
Although I did not see it specified anywhere (I could have missed it), the choice of a star or tree is very important. It means that you can't have loops, which means that you can't reason about an arbitrary network of parts. For example, there can't be relationships between head-torso, torso-arm, arm-head.
The reason is probably that in a spring model such as this, you can create oscillations if you have a loop. The mass and spring model was a common early model for placing transistors on the surface of a microchip. If I recall correctly, the solution was to add the equivalent of viscous damping and iterate until the system stabilized. Somehow I think the authors would rather keep things simple and fast. Still, I wonder how performance would improve if more complex relationships were allowed.
From their papers, deformable parts models seem to be a convincing approach for object detection. It has good intuitive reasoning - a whole is made by a sum of its parts (more on this later). I had gone to the RI seminar for this work and was also pretty impressed then. I think the coolest part is the lack of specifying any parts (I don't know if this is innovative in itself - this could be common in the literature). It's cool that their model learns the part location distributions, etc. from the data.
Now back on the topic of a whole being a sum of the parts. The question becomes (and we've had this discussion in class), how many parts is that exactly? For a few object categories, like cars and humans, there are a few parts that people agree highly upon (wheels/doors/hood & face/torso/legs/arms respectively), but it becomes hard to decide how many parts are necessary. In the authors' system, this was hardcoded to be six. My guess is that this should instead become a parameter to cross-validate over categories. Some categories' detectors may do better with more parts than others. This part seems the most hack-ish, and would be good future work to optimize.
I had the same idea. I noticed that when the authors defined the mixture components, they allowed for each component to have a different number of parts. It would be interesting to see mixture components with say 3, 6, and 9 parts so that the model can automatically determine the optimal number of parts.
So long as the object is "oversegmented," it seems like the model learning should take care of the rest. There would only be big problems when 6 parts is not enough to describe the in-class variations.
This paper really interests me a lot. Both the intuition and the math are interesting.
The math in this paper is really clear. The PCA analysis is also interesting and clearly presented in the visualized figures.
From my perspective, the intuition behind DPMs resembles how humans understand images by looking at them. I think we first memorize the entire figure and then try to extract its essential parts.
However, the "visual grammars" mentioned in the introduction really interest me. Because I'm not working on object recognition, I'm not sure about the current research on this topic. However, I think this relates to the idea we talked about in class on how to define machine learning - the computer should understand the object and reconstruct it. I think maybe we can use NLP and knowledge bases on the web like "WordNet" to get the grammar or the structure of the objects. Then we only need to define and model some basic "visual words". I think the computer could really understand objects this way, because it could not only recognize an object but also draw a picture of it.
I think there will be some work like this. Please give me some links...
This paper describes a DPM model for detecting multiple object classes. The key idea for this approach is to manipulate a template to fit the image in question and develop a cost function that describes how much energy each manipulation costs.
I really liked the use of data mining to get "hard" negative examples because it reduces the space of negative examples by removing the ones that the algorithm would succeed at anyway. I think this lets the optimization focus on a more precise boundary between positive and negative examples.
I don't think the paper gives much intuition for why the part appearances are computed two levels lower in the image pyramid than the root filter. Why not have the root filter at the same resolution as the parts?
Looking at their precision/recall curves, I think it is interesting that for the person category, it is more important to have parts, while for the car category, it is more important to model multiple views. Maybe this is because people have much more shape variation than cars, so having multiple mixture components is not as useful. Whereas for cars, which only have so many moving parts, shape variation is due to viewpoint, and therefore the mixture components are more important. Maybe it is indicative of some dataset bias, where cars never have doors or trunks open, and so have very little deformation at a given viewpoint.
I think having the root filter at the same resolution as the parts was not required for the problem because the method aims to capture coarse edges with the root filter and finer details with the part filters. And for capturing coarse edges, high resolution features are not necessary. This is my understanding, correct me if I'm wrong.
My main criticisms of this paper are that:
1. It approaches the problem of multiple object detection in an image with a sliding window approach, using filtering and a context hack as an afterthought. This seems fundamentally wrong, as essentially the type-signature of their solution does not match the type-signature of the desired output (single detection uncoupled / poorly coupled with other detections VS. multiple detections all coupled together)
2. As far as I can tell, there is no notion of occlusion, and I couldn't find anything regarding how this approach performs under occlusion.
3. Restricting 'parts' to be in rectangles seems to be restricting the generalizing power of deformability to a few object classes / scales of part representation.
1. I agree with you on the "multiple object detection" seeming like a hack. Unfortunately, I have not seen any successful approaches that handle it otherwise.
2. Their NIPS 2011 paper handles occlusion by using object grammars.
3. Similar restriction for the root filter? One can argue that the "rectangular bounding-box" framework in itself has flaws.
Occlusion is handled in a summed-up way, which is the part I do not like. I think some parts are more important than others. For example, seeing a cat's face, we can probably recognize that it is a cat, although we might not be able to accurately locate it.
Personally, I think of this paper as a really important milestone for object detection, since it builds a framework that is general and *robust* enough to be adopted as a building block for many detection systems. Quote from Ross Girshick's thesis "The methods described in this dissertation account for two-thirds of the performance gain between 2007 and 2011."...
Several points I want to highlight:
1. Parts representation and tree structure.
They established an intuitive model structure that also allows efficient inference. Regarding the importance of parts, there is a paper which I think sheds some light on this, http://web.mit.edu/vondrick/largetrain.pdf (Actually I'm really interested in this question: more data or a better model?)
2. LSVM practicalities, including initialization, hard-negative mining and SGD.
3. Improved features from PCA analysis.
4. Context post-processing, which aims at reconciling detections from different models.
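For point 2, it may help to have the latent SVM objective written down. This is a sketch in the paper's notation as I remember it: $\beta$ is the model vector, $z$ ranges over the latent placements $Z(x)$ for example $x$, and $\Phi(x, z)$ is the feature vector for that placement.

```latex
f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z)
\qquad
L_D(\beta) = \frac{1}{2}\lVert\beta\rVert^2
  + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i f_\beta(x_i)\bigr)
```

Since $f_\beta$ is a maximum of linear functions, the hinge term is convex for negative examples, and it becomes convex for positives once their latent values are fixed. That is why training alternates between relabeling positives and optimizing $\beta$.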
Some issues call for further research:
1. How many parts and how many components are needed to capture the appearance variations of different objects? There should be a better way than just hand-coded numbers.
2. The contextual post-processing seems somewhat hacky. However, there is a paper about incorporating NMS and context information into the training process via structured prediction, and only a small performance gain is observed (http://www.ics.uci.edu/~dramanan/papers/nms.pdf). Smarter ways of exploiting context information are also worth researching in the future.
Anyway, I appreciate this paper a lot, both for its technical content and for its writing style.
I agree that how many parts / components is an important question. Also their representation of parts is very simple (essentially just another object detector with limited x-y domain), and such a part model probably does not capture the full range of deformations for many types of objects. I also agree that their incorporation of contextual information leaves a little to be desired and would probably be nicer if it were incorporated into the model directly instead of as an extra post processing step.
I personally enjoyed the paper a lot. There were many interesting math optimizations, which were pretty intuitive.
1. Latent SVM - A detailed theoretical explanation of the model.
2. Applying PCA to reduce dimensionality - A nice analysis on using PCA for HOG features to cover the entire feature space using fewer dimensions. Given that the generative PCA model works for this problem, it might be interesting to see whether the Fukunaga Koontz Transform (FKT) or Kernel-FKT (which generates a discriminative feature space) would make the classification any better.
3. Data mining hard examples - This was my favorite one - a very clever and neat way of removing negative samples that are classified properly.
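To make point 3 concrete, here is a rough sketch of the cache-based mining loop. It is not the authors' implementation: the function names, the plain subgradient trainer, and the cache size are all my own assumptions, and a "hard" example is taken to be one that is misclassified or inside the margin.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, epochs=200, lr=0.01):
    """Subgradient descent on the hinge loss (stand-in for a real solver)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        viol = y * (X @ w) < 1  # examples violating the margin
        grad = w - C * (y[viol, None] * X[viol]).sum(axis=0) / len(X)
        w -= lr * grad
    return w

def is_hard(w, X, y):
    """Hard = misclassified or inside the margin under the current model."""
    return y * (X @ w) < 1

def mine_hard_negatives(pos, neg_pool, cache_size=50, rounds=5, seed=0):
    """Alternate between training on a small cache and refreshing the cache."""
    rng = np.random.default_rng(seed)
    in_cache = np.zeros(len(neg_pool), dtype=bool)
    start = rng.choice(len(neg_pool), size=min(cache_size, len(neg_pool)),
                       replace=False)
    in_cache[start] = True
    w = np.zeros(pos.shape[1])
    for _ in range(rounds):
        X = np.vstack([pos, neg_pool[in_cache]])
        y = np.concatenate([np.ones(len(pos)), -np.ones(int(in_cache.sum()))])
        w = train_linear_svm(X, y)
        hard = is_hard(w, neg_pool, -1.0)
        if not (hard & ~in_cache).any():
            break  # no hard negatives left outside the cache
        in_cache &= hard  # shrink: drop easy negatives from the cache
        room = cache_size - int(in_cache.sum())
        new = np.flatnonzero(hard & ~in_cache)[:room]
        in_cache[new] = True  # grow: add hard negatives from the pool
    return w, in_cache
```

The point of the shrink step is that easy negatives contribute nothing to the hinge loss, so dropping them keeps the cache small without changing the solution for the current model.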
I like the part-based approach to object detection presented by this paper. Of course, as mentioned by previous comments, this approach still doesn't capture all the information that could be relevant to object detection (i.e., 3D pose, etc.), but what it does do very well is allow for some generalization of object pose while remaining computationally tractable. As it turns out, just modeling the amount of translation each part can move around matters a lot (looking at the performance results).
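The translation modeling amounts to letting each part trade filter response against a quadratic displacement penalty. Below is a brute-force sketch of that trade-off for a single part. The function name and the search radius are my own assumptions, and the paper computes the same maximum far more efficiently with generalized distance transforms rather than an explicit search.

```python
import numpy as np

def best_part_placement(response, anchor, d, max_disp=4):
    """Best placement of one part around its anchor (illustrative sketch).

    response: 2D array of part-filter responses.
    anchor: (row, col) ideal part location relative to the root.
    d: weights (d1, d2, d3, d4) on the displacement features
       (dx, dy, dx^2, dy^2).
    Returns (score, (dy, dx)) maximizing response minus deformation cost.
    """
    ay, ax = anchor
    h, w = response.shape
    best = (-np.inf, (0, 0))
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            y, x = ay + dy, ax + dx
            if 0 <= y < h and 0 <= x < w:
                cost = d[0] * dx + d[1] * dy + d[2] * dx * dx + d[3] * dy * dy
                score = response[y, x] - cost
                if score > best[0]:
                    best = (score, (dy, dx))
    return best
```

With purely quadratic weights such as `d = (0, 0, 0.1, 0.1)`, a part only moves away from its anchor when the gain in filter response outweighs the penalty for the shift.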
What I didn't like about this paper was their somewhat ad-hoc (I think) approach to including context. In their results, they show that context is quite important (accounting for roughly a 2% performance gain over pure DPMs), but their only way of accounting for context is essentially a re-normalization of class detections, which seems to work for the PASCAL datasets. I guess this is why it's in the post-processing section of the paper. I think scene context as well as information could help drive object detection directly instead of as a post-process.
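For anyone who skipped that section, the rescoring input is just a short feature vector per detection. The sketch below follows my reading of the paper's rescoring step: the squashed detection score, the box corners normalized by image size, and the squashed best score of every category in the image. The function names are my own, and the classifier trained on top of these features is omitted.

```python
import numpy as np

def squash(x):
    """Logistic squashing of raw detector scores, sigma(x) = 1 / (1 + exp(-2x))."""
    return 1.0 / (1.0 + np.exp(-2.0 * np.asarray(x, dtype=float)))

def context_feature(box, score, best_scores, img_w, img_h):
    """Build the rescoring feature for one detection (illustrative sketch).

    box: (x1, y1, x2, y2) in pixels.
    score: raw score of this detection.
    best_scores: highest raw detection score of each category in the image.
    """
    x1, y1, x2, y2 = box
    own = [squash(score), x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
    return np.concatenate([own, squash(best_scores)])
```

With the 20 PASCAL categories this gives a 25-dimensional vector. Nothing in it describes the scene itself, only what the other detectors fired on, which is the limitation I am complaining about.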
I like the idea of using parts for object detection instead of rigid templates. It seems that this should produce more robust and generalizable algorithms. However, the authors themselves state that "simple models have historically outperformed sophisticated models in computer vision, speech recognition, machine translation and information retrieval." They then go on to justify how they are trying to make their complicated model work: by "gradual enrichment" to "maintain performance."
Are these complicated models really better? If the simpler ones work, can we make them work better? Is the reason we're moving to more complicated models because the simpler ones have reached their performance limit, or because we philosophically like the more complicated ones better?