Tuesday, September 17, 2013

The reading for Thursday 9/19 is:

S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR 2006.

And optionally:

J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN Database: Large-scale Scene Recognition from Abbey to Zoo. CVPR 2010.

You will need to come to class with printed paper summaries. You will also need to post something to this blog, using the "comment" function on this post. You can argue for or against something in the paper, ask a question, or respond to another comment/question.

Here's Zhiding's review.

-Carl

50 comments:

  1. Summarizing Zhiding's point:
    1) Images vs. Objects
    This paper provides a good intuition and a direction we need to look into: taking some level of structural information into account helps. But the way the structural information is organized in this paper is clearly too naïve and rigid. The essence of the weakness lies in the fact that scene information is by nature embedded not only in image-level structures but also at the object level. This brings us to the philosophy of “Images versus Objects”. Some scenes are more globally structured and lean more on image-level features, while others are defined mainly by the objects they contain. Which level is ultimately the most important one in defining a scene? In my personal opinion, I tend to choose objects over images. Image-level structural information is sometimes simply too difficult to generalize, while generalizing objects can be relatively easier. In addition, recognized objects can potentially be composed into image-level structures.
    A more reasonable formulation for object-oriented or cluttered scenes might be to perform spatial pyramid matching on object-level patches and match the images in a less rigid way.
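    For reference, the grid-and-histogram construction under discussion is roughly the following (a minimal sketch, assuming keypoints already assigned to visual words and coordinates normalized to [0, 1); the names are mine, not the authors'):

    import numpy as np

    def spatial_pyramid_feature(xy, words, vocab_size, L=2):
        # xy    : (N, 2) keypoint coordinates normalized to [0, 1)
        # words : (N,) integer visual-word index of each keypoint
        # Concatenates per-cell word histograms over 2^l x 2^l grids,
        # weighted so that plain histogram intersection on the long
        # vector reproduces the pyramid match kernel.
        feats = []
        for l in range(L + 1):
            cells = 2 ** l
            cx = np.minimum((xy[:, 0] * cells).astype(int), cells - 1)
            cy = np.minimum((xy[:, 1] * cells).astype(int), cells - 1)
            hist = np.zeros((cells * cells, vocab_size))
            np.add.at(hist, (cy * cells + cx, words), 1)
            w = 1.0 / 2 ** L if l == 0 else 1.0 / 2 ** (L - l + 1)
            feats.append(w * hist.ravel())
        return np.concatenate(feats)

    The object-level variant proposed above would presumably replace the fixed grid cells with detected object regions.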

    Replies
    1. I can recall a paper from ECCV 2012 about this topic, although I only remember the title:

      Object-centric spatial pooling for image classification
      http://ai.stanford.edu/~olga/papers/eccv12-OCP.pdf

      Also a relevant paper in CVPR 2012 of receptive field learning:

      Beyond Spatial Pyramids: Receptive Field Learning for Pooled Image Features
      http://www.eecs.berkeley.edu/~jiayq/assets/pdf/cvpr12_pooling.pdf

  2. 2) Image Partitioning
    The proposed way of partitioning an image is clearly not invariant to scale, translation, or rotation. Recently there have been papers proposing “Spatial-Bag-of-Features”, which encodes geometric information and is invariant to scale, translation, and rotation. They introduce two different ways of partitioning an image. The first is the linear ordered bag-of-features, in which the image is partitioned into strips along a line at an arbitrary angle. The second is the circular ordered bag-of-features, in which a center point is given and the image is evenly divided into several sectors of equal angle. By enumerating different line angles (ranging from 0° to 360°) and center locations, a family of linear and circular ordered bag-of-features can be obtained. See the paper “Spatial-bag-of-features” by Cao et al. in CVPR 2010 for more details; a rough sketch of the two partitions appears at the end of this comment.
    3) Dataset Biases
    From the paper we also know that the authors have a certain taste in datasets. While these datasets contained a considerable number of images for their time, we now know they are in some sense biased. For example, the fifteen scene categories typically consist of scenes with nice viewing angles and global structures, which clearly favor spatial pyramid matching. The bias may result from the relatively restricted locations (MIT), the (fixed) way the images were selected, as well as the (fixed) way a photographer takes photos.
    Caltech-101 shows the same problem. The dataset seldom contains images with the cluttered backgrounds commonly encountered in real life.
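    For concreteness, here is how I read the two partitioning schemes in Cao et al. (a rough sketch under my own assumptions about normalization, not their code):

    import numpy as np

    def linear_ordered_bins(xy, angle_deg, n_strips):
        # Assign each keypoint to one of n_strips parallel strips
        # along the direction angle_deg ("linear ordered" BoF).
        theta = np.deg2rad(angle_deg)
        proj = xy[:, 0] * np.cos(theta) + xy[:, 1] * np.sin(theta)
        proj = (proj - proj.min()) / (proj.max() - proj.min() + 1e-9)
        return np.minimum((proj * n_strips).astype(int), n_strips - 1)

    def circular_ordered_bins(xy, center, n_sectors):
        # Assign each keypoint to one of n_sectors equal-angle sectors
        # around a chosen center ("circular ordered" BoF).
        d = xy - np.asarray(center, dtype=float)
        ang = np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)
        return np.minimum((ang / (2 * np.pi) * n_sectors).astype(int), n_sectors - 1)

    Enumerating angle_deg and center then gives the family of partitions described above, and each bin gets its own word histogram, just as the grid cells do in the original paper.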

    Replies
    1. 2) But on the other hand, cutting the image into strips or circles is in a way a fine-tuning of the general idea, and fine-tuning always brings more parameters to fix. And more parameters tend to make models less flexible. I think the strength of the original paper is the simplicity of the idea and of the implementation (especially considering the recognition-rate boost). Also, a nice property is that only one part of the pipeline is changed: the step that produces the histogram from the visual words.

      3) This is true, and in fact the authors mention this. Another feature of Caltech-101 they mention is that objects are centered and large relative to the image. This also must have made life easier.

  3. Just a comment: I think pyramid kernels in vision are basically trying to do n-grams. Since we do not have an ordering of "words", we cannot form n-grams in the usual way, so we form these n-grams based on some other locality measure.
    This recent CVPR 2013 paper from Kristen Grauman's group uses spatial pyramids for correspondences (http://people.csail.mit.edu/celiu/pdfs/CVPR13-DSPM.pdf).
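    To make the analogy concrete, here is a toy version of what a "spatial bigram" count could look like, as opposed to the per-cell unigram counts the pyramid actually uses (purely illustrative; the nearest-neighbor pairing rule is my own choice):

    import numpy as np
    from collections import Counter

    def spatial_bigrams(xy, words):
        # Pair each keypoint's visual word with the word of its nearest
        # spatial neighbor -- an n-gram-style co-occurrence statistic.
        bigrams = Counter()
        for i in range(len(xy)):
            d = np.linalg.norm(xy - xy[i], axis=1)
            d[i] = np.inf                 # ignore the point itself
            j = int(np.argmin(d))
            bigrams[(int(words[i]), int(words[j]))] += 1
        return bigrams

    The pyramid itself never forms such pairs; it only counts single words per region, so whether the n-gram analogy holds probably depends on how loosely one defines an n-gram.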

    Replies
    1. I think calling it an n-gram might be a stretch. It is just simple frequency statistics. Then, maybe having frequency stats over the entire image is a bit too coarse, so you capture these stats at multiple coarse-to-fine partitions.

      That being said, depending on how they are used, it might have more of an n-gram flavor, e.g. the paper you cite, where they explicitly model relationships.

    2. Isn't capturing frequency statistics at multiple partitions similar to n-grams?

    3. Aren't n-grams trying to capture co-occurrence statistics of a set of n features together? Whether at one partition or multiple shouldn't matter. Let's discuss the details offline!

  4. I agree with Zhiding's point that the image partitioning method used in this paper is not the best way to go about the problem. The Spatial Bag-of-Features paper looks really interesting, and it addresses the problem of scale, rotation, and translation invariance.

    With respect to the philosophy of "Images versus Objects", I personally feel that going down to the object level is not required for the scene classification problem. The human brain can recognize scenes really fast even if the objects in the scene change. Given that, and the fact that segmentation/object detection for all objects in the scene can become computationally expensive, it is probably better to just consider the full image rather than break it down into objects and combine them to identify the scene.

    Replies
    1. While I agree that it may not be necessary to detect all objects in a scene to do scene classification, I would lean much farther toward the side of objects than you. If you see a couch in an image, you'll probably say it's a living room. If you see a bed, then you'd say bedroom. Computer monitor - office. Water - coast. House - suburb. Road and building - street. Just road - highway. My point is that, assuming we could detect these specific objects easily, classification of certain scenes might not be too hard (I dare not say "easy").

      You say that humans can recognize scenes really fast even if the objects change. Suppose we showed a human a bedroom picture without a bed in it. Depending on the other objects in the image, the human may spend some time deciding whether the image was a bedroom, office, or living room. Unnecessary objects could change - bedside tables, dressers, etc. But I think the bed is pretty critical for classification.

    2. I am not very confident of my references to existing neuroscience literature here, so please correct me if I am wrong.

      There is evidence that humans (or maybe primates) use peripheral vision to decide scene semantics. This happens before object-level details are decoded. Peripheral vision is too low-resolution to decipher fine object categories but is perhaps detailed enough to get an overall sense of the scene.
      That said, I would add that humans do not necessarily succeed at this instantaneous/subconscious version of scene classification. For such difficult images (a bedroom without a bed) they would perhaps use object cues to logically put together the best possible answer.
      The latter is slow, and some may think of it as outside the domain of scene classification (maybe a subject for the entire AI module of our system taken together), depending on the task that the scene classification sub-sub-system wants to achieve.

    3. I have seen the point that Aravindh mentioned somewhere, though I couldn't find the exact reference. Humans do use peripheral vision for scene classification. One more point I would like to add: when you want to classify a scene using a couch/bed/water/building etc., these objects generally cover a major part of the scene you are trying to classify. Hence I think it is a safe approximation to just consider the entire image rather than going down to the level of object detection.

    4. Well, for me, this discussion has 2 components:

      1) Scenes vs. objects and 2) Human perception vs. Machine perception

      I frankly think what humans do might or might not be the right thing to do for machines. But in any case, saying that because humans do it this way, machines should also do it this way is not correct. Human perception can indeed give insights and potential new ways of dealing with machine perception, but humans have other sensory input, and there might be other things going on in our heads which we can't observe yet. On the other hand, it might be easier for a computer to do both object->scene and scene->object parsing in parallel, and then combine the noisy results from both to give the final output. Just a thought :)

    5. I would differ with Divya with regard to not needing to go down to the object level - I think the confidence of machine scene recognition must always be established, so the machine can make a yes/no decision on whether to parse objects and establish object relationships to get a more confident answer.

    6. I agree with Abhinav Shrivastava's point. Humans generally have other senses, like taste and touch. We also have other prior knowledge: for example, entering a living room usually happens after entering a house, so if we already know we are in a house, we are more likely to be in a living room than in a conference room.

      For me, if we only look at a single image, it is somehow better to recognize some ‘characteristic object’ and infer the scene from it. However, if we have more temporal information, like a video captured by a rescue robot, we could use that kind of prior knowledge rather than objects.

    7. I agree that having a 'characteristic object' and classifying the scene based on it can work well. But is it worth the computational cost? Is it actually required? How accurate is the object detector? These are questions that need to be addressed before we go down to object detection for the scene classification problem. There are methods that perform reasonably well without doing object detection, and I personally feel that should be sufficient in general.

  5. This id thing is pretty ridiculous. I'm not signing up to one of these services, so I will just say this: I am Matthew Klingensmith (mklingen).

    I'm wondering whether these "pyramids" are the right shape, per se. I agree that the multi-resolution approach is probably on the right track for representing the "gist" of an image, but why squares? Suppose we have features in the image which are best understood via long, thin lines (like their ant example). Then these may not be captured by the multi-resolution square approach.

    Replies
    1. I think the problem you mention extends to more than just the choice of squares vs. other shapes. The parameters chosen for the relative weights between these histograms, the decision not to let the squares overlap, etc. seem to be arbitrary, suboptimal choices. I think that instead of using a rigid pyramid matching kernel the authors should have experimented with a powerful kernel (like RBF) in the single-level L=3 setting. Given enough data they would learn the right composition of these parts rather than bias the algorithm.
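      As a sketch of the comparison I have in mind (hypothetical names; X_l3_train would hold the concatenated histograms of the single finest level, y_train the scene labels):

      import numpy as np
      from sklearn.svm import SVC

      def histogram_intersection(A, B):
          # The kernel used in the paper: sum of element-wise minima.
          return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

      svm_hik = SVC(kernel=histogram_intersection)   # the paper's choice
      svm_rbf = SVC(kernel='rbf', gamma='scale')     # the 'powerful kernel' alternative
      # svm_hik.fit(X_l3_train, y_train); svm_rbf.fit(X_l3_train, y_train)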

    2. Frankly, I like the idea of starting with a simple square pyramid and showing that this simple idea worked (at the time)! Especially given the fact that pyramid match kernels had the necessary theory to provide the correct weights to combine these levels, it was a simple extension that worked.

      But going forward, I agree with both of you that exploring other partitions is a must. An excellent example is http://koen.me/research/colordescriptors/. They used not just squares, but also horizontal stripes, vertical stripes, stripes of different sizes, etc. to build the pyramid. They even gave an option where you can provide your own partitioning (a combination of squares and rectangles). This kind of goes in the direction Matt mentioned, but these partitions are still pretty simplistic.
      Aravindh's idea of using a powerful kernel also seems good, but I don't know if that will help or hurt. It could either help capture regions which have consistent words/features, similar to what mid-level patches try to do, or it might lose the spatial robustness given by this pooling. But it is a great idea for someone to try!

    3. Would creating partition boundaries by clustering pixels together by color intensity, or by the actual feature vectors (or even a subset of them), be a better idea than using rectangular or any other fixed partitions? More simply: divide the whole image into 4 giant superpixels at the first level, each of those into 4 more at the next level, and so on.
      Let's assume one image has the sky taking up approximately half its height and a second image has only a quarter of its height taken up by the sky because of a slightly different camera pitch. I would expect that if partitions were created by clustering similar-looking pixels as mentioned above, rather than using rectangular regions, the partitions containing the sky would have very similar feature distributions (after normalizing, of course).
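      A rough sketch of what I mean, recursively splitting with k-means on color plus a weighted position term (my own guess at a formulation, not anything from the paper):

      import numpy as np
      from sklearn.cluster import KMeans

      def hierarchical_color_cells(image, levels=2, k=4, pos_weight=0.5):
          # Returns one label map per level; each level splits every region
          # of the previous level into (up to) k color-coherent children.
          h, w, _ = image.shape
          yy, xx = np.mgrid[0:h, 0:w]
          feats = np.column_stack([
              image.reshape(-1, 3).astype(float) / 255.0,
              pos_weight * yy.ravel() / h,
              pos_weight * xx.ravel() / w,
          ])
          labels = np.zeros(h * w, dtype=int)
          maps = []
          for _ in range(levels):
              new_labels = np.zeros_like(labels)
              for c in np.unique(labels):
                  idx = np.where(labels == c)[0]
                  km = KMeans(n_clusters=min(k, len(idx)), n_init=4).fit(feats[idx])
                  new_labels[idx] = labels[idx] * k + km.labels_
              labels = new_labels
              maps.append(labels.reshape(h, w).copy())
          return maps

      Each resulting region would then get its own word histogram, exactly as the grid cells do now.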

  6. This comment is from Priya; apparently blogger isn't working for him...

    I would disagree slightly with the comment that the image partitioning method is not rotation- and translation-invariant, because the algorithm considers the entire image in the first level and does not go too many levels deep. A rotated/translated image will match well on the first level, which considers the entire image. For subsequent levels the similarity will decrease, but it will still be high for the first few levels. Some of this invariance is evident from the fact that the algorithm performs well on the minaret class in the Caltech-101 dataset. Also, we could explain the diminishing returns of increasing L from 2 to 3 as an effect of losing rotation and translation invariance due to the decreased weighting of the first level.

    Replies
    1. The authors themselves note that the minaret results are artificially high due to the rotation artifacts (black corners). I do agree with you about rigid-transformation invariance, though only for small displacements. While even dramatic rotations will not affect the L0 histograms, those histograms are downweighted.
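      Concretely, if I read the kernel weighting right, level 0 gets weight 1/2^L and level l >= 1 gets 1/2^(L-l+1), so the whole-image histogram contributes the least:

      def level_weights(L):
          # weight of level 0, then of levels 1..L
          return [1.0 / 2 ** L] + [1.0 / 2 ** (L - l + 1) for l in range(1, L + 1)]

      print(level_weights(3))   # [0.125, 0.125, 0.25, 0.5]

      So a match that survives only at level 0 is worth a quarter of one at the finest level, which is why large rotations still hurt.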

    2. I agree with Priya regarding the synthetic example in the report. Yes, the finer levels won't match well, but the global level would match perfectly. And given the weights, this global match will go a long way!

    3. Priya and Humphrey seem to be correct in their assessment. Even at the coarser grid levels, the representation is only rotationally/translationally invariant up to some relative amount depending on the level (i.e. at the first level you can tolerate larger rotations/translations than at the L=3 level). The learning should take care of this and give more importance to the lower levels than to the higher ones if the dataset has lots of rotations/translations.

  7. I am amazed that the authors are getting such good results with such a small dataset. I feel that the true power of these methods is hidden because they are using only 30/100 training images per class and maybe [just a guess] having to use a large regularization penalty to get a reasonable but not the best result. 34,000 effective dimensions and just 100 training images per class is asking too much of the Support Vector Machines.

    Replies
    1. I think a lot of the performance is coming from the fact that the datasets they use have at most 15 scene categories - it's really easy to discriminate between these categories, especially since there is little overlap between them. Things get a lot worse as the number of categories increases and it becomes harder to discriminate (it also may not even be clear to a human which category an image belongs to). Looking at the SUN results, http://people.csail.mit.edu/jxiao/SUN/benchmark.png shows the performance of the spatial pyramid on 397 categories (the hog2x2 line), which is much lower.

  8. This is kind of a simple question, but why not just add spatial information into the feature vector itself? They mention this in a small footnote at the bottom of the third page (footnote 1), but they never try it. I imagine it would help performance - but by how much, in comparison to the pyramid scheme?
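    Something like the following is what I have in mind (a sketch; alpha is a made-up knob for how strongly position counts relative to appearance):

    import numpy as np

    def append_position(descriptors, xy, image_size, alpha=0.5):
        # Augment each local descriptor with its normalized image position
        # before vector quantization, instead of using a spatial pyramid.
        h, w = image_size
        pos = xy / np.array([w, h], dtype=float)   # x, y scaled to [0, 1]
        return np.hstack([descriptors, alpha * pos])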

    Replies
    1. I doubt that simply adding image coordinates to the feature vector would work. To me, the raw coordinates don't really mean much; however, I do think the relative distances between patches would serve better for embedding structural information. In a paper we will read later, Felzenszwalb proposed a deformable model for capturing exactly this kind of property. But it seems we could also treat this as a mid-level method and keep the features as simple as they should be.

    2. Along the lines of what Jackie said, adding XY doesn't make much sense here because of the way the pooling happens (capturing frequencies). In the limit, you can imagine making the pyramid levels so fine that each pixel is its own partition. In that case it would be similar to what you described, but it would lose the invariance.

    3. Actually, you can think of this pyramid as expanding the histogram vector along another dimension into a histogram matrix, by factorizing the "sum" over different spatial cells.
      Following Zhiding's comment on "Images vs. Objects", I'm actually wondering about the real importance of incorporating this spatial pyramid. Say we had a perfect detector for all the objects we are interested in: would we still need the spatial pyramid, or is spatial invariance all we really need?

  9. It seems like the feature space they're using for the images is pretty large - 4200 dimensions (for the smaller vocabulary). Furthermore, the standard approach, one histogram for the whole image, would be much smaller. I wonder if some of their performance comes from the fact that they're throwing in a whole lot more features.

    Replies
    1. More features do not necessarily increase performance. I think the way they get more features is what matters. Since they compute features at varying spatial resolutions, they preserve more information, and there is a better chance that the features represent the image better than one histogram over the whole image does.

    2. I agree. The results indicate that the quantity of features used correlates with performance. However, it would have been interesting to see whether a smaller set of features used over more than 3 levels can give comparable results, or whether its performance also plateaus as it does for the strong features with vocabularies of 200 or more.

    3. I think the way a longer concatenated feature works is that it consists of features complementing each other. The more one feature captures the information lost by another, the better the performance. The classifier then does the final job of automatically figuring out the decision boundary.

      But having a feature vector that is too long hurts, probably due to the "curse of dimensionality".

  10. For me, as opposed to just using frequencies over the whole image (unordered), the paper adds structure in the form of spatial partitioning and captures the frequencies at all these separate locations. So in some sense it is adding order to completely unordered statistics, and using both together for matching. The final classifier can then choose what it likes from both the ordered and unordered stats.

  11. (gsingh1 says) I like the authors' overall thinking of performing feature matching over a hierarchy of partitions, and I appreciate that choosing fixed rectangular grids is computationally efficient. However, for accurate classification a superpixel-based sub-region division at each level makes sense intuitively. Also, when creating the feature vector for superpixels, we could just add position information, like the median or average XY values of each superpixel, to encode geometric information directly, as Paul points out.

    Replies
    1. It's actually been quite rare in recent years to see a segmentation as an input to higher-level processing like classification. I think the issue is that segmentation--even into tiny segments like superpixels--is inherently unstable. That is, if you make a small, local change to the image, you have a good chance of making a drastic, global change to the whole set of segments. If you're hoping to match images based on the similarity of your descriptors, then this is a very bad property for the descriptors to have.

    2. You make a good point, Carl. But take a look at Daniel Munoz's method in http://www.ri.cmu.edu/pub_files/2010/9/munoz_eccv_10.pdf. Modeling label proportions over regions apparently helps reduce the effect of imperfect segmentation, and he also shares label proportions across segmentation layers. This approach actually works very well for semantic scene segmentation.

  12. Would this be a feasible approach for object detection? How much would the intra-class variation of, say, cars dictate the amount of data needed to perform reasonably well on object detection tasks? This seems like a very simple way to do object detection, and I wouldn't be surprised if someone has done it already (indeed, a search for "object detection spatial pyramid" yields promising results). The main thing I like about spatial pyramids is the 'bang for your buck' they provide. A very simplistic feature quantization and comparison yields decent results on a rather difficult task.

    Replies
    1. Well, if you use something like oriented edges as your feature space, you are going to get something similar to HOG at different scales, with these pyramids, correct?

    2. Though HOG also pools gradients at different scales, it is different from what is done here, since it pools gradients over nearby regions. There is some work using spatial pyramids for detection, e.g. http://research.google.com/pubs/pub40665.html

  13. This comment has been removed by the author.

  14. The discussion of the minaret images is dumbfounding to me. These are not representative of any class of natural images. Why would the authors spend any time reporting results on them? Why do they remain in the dataset?

    [deleted previous comment since there seems to be no way to edit]

    Replies
    1. I was curious, so I went and figured it out. Here's the answer from the original Caltech-101 paper:

      -------------------
      Additionally, categories with a predominantly vertical structure were rotated to an arbitrary angle, as the model parts are ordered by their x-coordinate, so have trouble with vertical structures. One could also avoid rotating the image by choosing the y-coordinate as ordering reference.
      -------------------

      Basically the model relied on the x-ordering of some form of parts (specifically, it normalized the locations of all part detections with respect to the leftmost part detection, regardless of its vertical position), and this ordering becomes highly variable if there are lots of part detections one above the other. Hence, they hacked their dataset to make their algorithm work.

  15. Taking a cue from HOG, would it not be better to do soft-voting for visual words across "grid partitions"? I'm sure someone must have done this.
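    For what it's worth, here is a minimal sketch of the HOG-style soft-voting I mean for a single pyramid level, assuming keypoint coordinates normalized to [0, 1) (this is not from the paper):

    import numpy as np

    def soft_voted_level(xy, words, vocab_size, cells):
        # Bilinearly split each keypoint's vote between the neighboring
        # grid cells, instead of a hard assignment to a single cell.
        hist = np.zeros((cells, cells, vocab_size))
        g = xy * cells - 0.5                    # position in cell-center units
        g0 = np.floor(g).astype(int)
        f = g - g0                              # fractional offsets
        for dx in (0, 1):
            for dy in (0, 1):
                cx = np.clip(g0[:, 0] + dx, 0, cells - 1)
                cy = np.clip(g0[:, 1] + dy, 0, cells - 1)
                wgt = (f[:, 0] if dx else 1 - f[:, 0]) * (f[:, 1] if dy else 1 - f[:, 1])
                np.add.at(hist, (cy, cx, words), wgt)
        return hist.reshape(-1, vocab_size)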

    Replies
    1. People do that a lot! They even use overlapping partitions.

  16. I'm a little disturbed by the decrease in performance in a few cases when L increased. It would seem that, given a large enough dataset, the learning algorithm should be able to down-weight the finer-resolution (L=3) histogram features relative to the coarser ones (L=0, 1). That should be the power of learning.
    This seems to be either a result of 1) having a small dataset (as Aravindh, I believe, mentions above), so that we are unable to overcome the noise that the larger L levels bring to the classification,
    or 2) a problem stemming from improper regularization. We see that for the weak features (M=16), increasing L improves the results, but as M gets large, the performance eventually drops from L=2 to L=3 for scene classification and on the Caltech-101 set. It would have been interesting to see how high L can go for the M=16 case in either setting. This all seems symptomatic of regularization problems in the high-dimensional space (as M increases, the feature vector size for learning goes up). Without enough data or regularization, performance will drop.

  17. In their approach, a multi-scale image pyramid is not considered; I think part of the reason is that the features they use (e.g. SIFT, GIST) already incorporate multi-scale information. But the pyramid sizes are chosen in a discrete manner, so I think it would help if the pyramid inside the descriptor could be designed to interlace with the spatial pyramid over the scene image.

  18. This comment has been removed by the author.

  19. "2) But on the other hand, cutting image into straps or circles is in a way fine-tuning of the general idea, and fine-tuning always results in more parameters to fix. And more parameters tend to make models less flexible. I think the strength of the original paper is the simplicity of the idea and of the implementation (especially, considering the recognition rate boost). Also the good thing is that only a part of the pipeline is changed - only producing the histogram from the visual words part is changed."

    I agree with this point. That's exactly the issue I thought of when thinking about adding more freedom.

    And I tend to think the levels of importance of translation, rotation, and scale are different. When we talk about scenes, we are typically talking about a canonical photography setting where the resulting images are somewhat biased: they are shot with the camera pointing toward the horizon. So horizontal translation and scale changes might be somewhat more important factors than rotation.

    And many thanks to Abhinav for helping to report some of my points :)

    ReplyDelete