16-824: Learning-based Methods in Vision (F'13): Reading for 11/21

Tuesday, November 19, 2013

Reading for 11/21

You are required to read the following paper, but not summarize it or comment on the blog.

R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation. Tech Report, 2013.

And optionally:

Abhinav Gupta and Larry S. Davis, Beyond Nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers, In ECCV 2008

Berg et al., Understanding and Predicting Importance in Images. CVPR 2012

Krishnamurthy and Kollar, Jointly Learning to Parse and Perceive: Connecting Natural Language to the Physical World. Transactions of the ACL.

5 comments:

M Aravindh, November 20, 2013 at 8:49 PM
You can read http://arxiv.org/pdf/1310.1531v1.pdf for the details of their implementation. It works with GPUs, and they also provide the learnt features for download in case you want to use them for your projects.
Unknown, November 21, 2013 at 8:03 AM
I was very excited to read this paper, which again shows the power of deep learning. The paper can be seen as an adaptation of previous deep learning work, with the network's inputs selected by region proposals. As reported, the system also scales to large data.

I was a little disappointed not to see any qualitative results for semantic segmentation. Also, CNNs have always been like a magic box to me. Using a CNN with large-scale data might require some expertise and system-level experience. It's not as simple and straightforward as an SVM. But I do think people should learn more about systems work and play more with big data.
Anonymous, November 21, 2013 at 9:14 AM
At first, I hated CNNs because they seemed like they were going even further in the direction of "dumb learning," in the sense that the results are incomprehensible, the knowledge learned from them can't be extended easily to other domains, etc. But after reading this paper, I have started to see how they really work and have more respect for them. It's easy to see science as a means of understanding the world through natural "laws." We want to be able to write down clear, concise rules which constitute an understanding of the underlying system.

When we write down the equation for computing a HOG feature (or, in the case of older, geometric computer vision, write down a set of rules for various intersections of lines and their relationship to 3D space), we imagine ourselves as Newton, discovering an underlying law of the universe which our machines can use to reason about the world. For that reason, we tend to praise "general" solutions which are clear, closed-form, and reducible to simple metaphors. We hold "ad-hoc" solutions, which are blindly constructed for particular problems, in disdain, often with reference to "Occam's Razor" for cutting away "multitudes of entities."

Unfortunately, the real world is much more complicated than our metaphors and simplifications can handle. There are always exceptions to every rule. In the case of features and classifiers, there are some kinds of features which work well for some kinds of data -- but there will always be data that a particular feature can't deal with. When this happens, you can either throw up your hands and say "the theory needs to be developed further" and look for a new Newton, or you can solve the problem with an ad-hoc solution.

Neural networks, in my opinion, represent a way of *generating* ad-hoc solutions to particular problems in an incredibly powerful, expressive way. What they're really encoding is a vast array of simple ad-hoc rules, and ad-hoc rules upon those rules. The qualitative results of this paper show the power of such rules. Features from simple textures up to specific kinds of "animal faces" and "blob detectors" are encoded in this network. Further, most of the rules are useless, arbitrary, and don't contribute to the result (the authors claim 90% of the parameters can be cut away without a significant performance loss).

Nobody would want to write a paper where they claim to have discovered a new great feature for detecting cars:

P(car) = 0.02 * P(red_blob) + 0.01 * P(wheel) + 0.001 * P(cat) + ...

Yet that's what ends up *actually working in practice!*
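
To make the formula above concrete, here is a minimal sketch in Python. It only illustrates the "weighted sum of unrelated detectors" idea. The detector names and weights are copied from the formula, the example scores are made up, and the formula's trailing terms are left out. None of this comes from the paper itself.

```python
# Hypothetical weights, copied from the illustrative formula above.
# The formula's trailing "..." terms are left out rather than guessed.
WEIGHTS = {"red_blob": 0.02, "wheel": 0.01, "cat": 0.001}


def car_score(detector_probs):
    """Weighted sum of the listed detector outputs for one image region.

    Detectors missing from `detector_probs` contribute zero.
    """
    return sum(w * detector_probs.get(name, 0.0) for name, w in WEIGHTS.items())


# Made-up detector outputs for a single region, purely for illustration.
region = {"red_blob": 0.9, "wheel": 0.8, "cat": 0.05}
print(car_score(region))
```

No single term means much on its own. The score only becomes useful once many such weak, arbitrary-looking terms are combined.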

This result is in a way depressing -- does it mean there is no general solution to our vision problems? But it's also powerful. If the paradigm of CNNs holds true, we can ignore the messy ad-hoc stuff underneath, and write our "beautiful" general solutions on top of it. Our problem space changes from finding the "perfect feature" or the "perfect detector" to actually solving problems in practice.

-- Matt K