Tuesday, September 17, 2013

The reading for Thursday 9/19 is:

S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR 2006.

And optionally:

J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN Database: Large-scale Scene Recognition from Abbey to Zoo. CVPR 2010.

You will need to come to class with printed paper summaries. You will also need to post something to this blog, using the "comment" function on this post. You can argue for or against something in the paper, ask a question, or respond to another comment/question.

Here's Zhiding's review.

-Carl

50 comments:

  1. Summarizing Zhiding's point:
    1) Images vs. Objects
    This paper provides a good intuition and a direction we need to look into: taking some level of structural information into account helps. But the way the structural information is organized in this paper is clearly too naïve and rigid. The essence of the weakness lies in the fact that scene information is by nature embedded not only in image-level structures but also at the object level. This brings us to the philosophy of “Images versus Objects”. Some scenes are more globally structured and lean more on image-level features, while others are defined mainly by the objects they contain. Which level is ultimately the most important one in defining a scene? In my personal opinion, I tend to choose objects over images. Image-level structural information is sometimes simply too difficult to generalize, while generalizing objects can be relatively easier. In addition, recognized objects can potentially be composed into image-level structures.
    A more reasonable formulation for object-oriented or cluttered scenes might be to perform spatial pyramid matching on object-level patches and match the images in a less rigid way.
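    For reference, the grid-and-histogram construction under discussion is roughly the following (a minimal sketch, assuming keypoints already assigned to visual words and coordinates normalized to [0, 1); the names are mine, not the authors'):

    import numpy as np

    def spatial_pyramid_feature(xy, words, vocab_size, L=2):
        # xy    : (N, 2) keypoint coordinates normalized to [0, 1)
        # words : (N,) integer visual-word index of each keypoint
        # Concatenates per-cell word histograms over 2^l x 2^l grids,
        # weighted so that plain histogram intersection on the long
        # vector reproduces the pyramid match kernel.
        feats = []
        for l in range(L + 1):
            cells = 2 ** l
            cx = np.minimum((xy[:, 0] * cells).astype(int), cells - 1)
            cy = np.minimum((xy[:, 1] * cells).astype(int), cells - 1)
            hist = np.zeros((cells * cells, vocab_size))
            np.add.at(hist, (cy * cells + cx, words), 1)
            w = 1.0 / 2 ** L if l == 0 else 1.0 / 2 ** (L - l + 1)
            feats.append(w * hist.ravel())
        return np.concatenate(feats)

    The object-level variant proposed above would presumably replace the fixed grid cells with detected object regions.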

    Replies
    1. I can recall a paper from ECCV 2012 about this topic, although I only remember the title:

      Object-centric spatial pooling for image classification
      http://ai.stanford.edu/~olga/papers/eccv12-OCP.pdf

      Also a relevant paper in CVPR 2012 of receptive field learning:

      Beyond Spatial Pyramids: Receptive Field Learning for Pooled Image Features
      http://www.eecs.berkeley.edu/~jiayq/assets/pdf/cvpr12_pooling.pdf

  2. 2) Image Partitioning
    The proposed way of partitioning an image is clearly not invariant to scale, translation, or rotation. Recently there have been papers proposing “Spatial-Bag-of-Features”, which encodes geometric information and is invariant to scale, translation, and rotation. They introduce two different ways of partitioning an image. The first is the linear ordered bag-of-features, in which the image is partitioned into strips along a line at an arbitrary angle. The second is the circular ordered bag-of-features, in which a center point is given and the image is evenly divided into several sectors of equal angle. By enumerating different line angles (ranging from 0° to 360°) and center locations, a family of linear and circular ordered bag-of-features can be obtained. See the paper “Spatial-bag-of-features” by Cao et al. in CVPR 2010 for more details; a rough sketch of the two partitions appears at the end of this comment.
    3) Dataset Biases
    From the paper we also know that the authors have a certain taste in datasets. While these datasets contained a considerable number of images for their time, we now know they are in some sense biased. For example, the fifteen scene categories typically consist of scenes with nice viewing angles and global structures, which clearly favor spatial pyramid matching. The bias may result from the relatively restricted locations (MIT), the (fixed) way the images were selected, as well as the (fixed) way a photographer takes photos.
    Caltech-101 shows the same problem. The dataset seldom contains images with the cluttered backgrounds commonly encountered in real life.
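    For concreteness, here is how I read the two partitioning schemes in Cao et al. (a rough sketch under my own assumptions about normalization, not their code):

    import numpy as np

    def linear_ordered_bins(xy, angle_deg, n_strips):
        # Assign each keypoint to one of n_strips parallel strips
        # along the direction angle_deg ("linear ordered" BoF).
        theta = np.deg2rad(angle_deg)
        proj = xy[:, 0] * np.cos(theta) + xy[:, 1] * np.sin(theta)
        proj = (proj - proj.min()) / (proj.max() - proj.min() + 1e-9)
        return np.minimum((proj * n_strips).astype(int), n_strips - 1)

    def circular_ordered_bins(xy, center, n_sectors):
        # Assign each keypoint to one of n_sectors equal-angle sectors
        # around a chosen center ("circular ordered" BoF).
        d = xy - np.asarray(center, dtype=float)
        ang = np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)
        return np.minimum((ang / (2 * np.pi) * n_sectors).astype(int), n_sectors - 1)

    Enumerating angle_deg and center then gives the family of partitions described above, and each bin gets its own word histogram, just as the grid cells do in the original paper.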

    Replies
    1. 2) But on the other hand, cutting the image into strips or circles is in a way a fine-tuning of the general idea, and fine-tuning always brings more parameters to fix. And more parameters tend to make models less flexible. I think the strength of the original paper is the simplicity of the idea and of the implementation (especially considering the recognition-rate boost). Also, a nice property is that only one part of the pipeline is changed: the step that produces the histogram from the visual words.

      3) This is true, and in fact the authors mention this. Another feature of Caltech-101 they mention is that objects are centered and large relative to the image. This also must have made life easier.

  3. Just a comment: I think pyramid kernels in vision are basically trying to do n-grams. Since we do not have an ordering of "words", we cannot form n-grams in the usual way, so we form these n-grams based on some other locality measure.
    This recent CVPR 2013 paper from Kristen Grauman's group uses spatial pyramids for correspondences (http://people.csail.mit.edu/celiu/pdfs/CVPR13-DSPM.pdf).
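    To make the analogy concrete, here is a toy version of what a "spatial bigram" count could look like, as opposed to the per-cell unigram counts the pyramid actually uses (purely illustrative; the nearest-neighbor pairing rule is my own choice):

    import numpy as np
    from collections import Counter

    def spatial_bigrams(xy, words):
        # Pair each keypoint's visual word with the word of its nearest
        # spatial neighbor -- an n-gram-style co-occurrence statistic.
        bigrams = Counter()
        for i in range(len(xy)):
            d = np.linalg.norm(xy - xy[i], axis=1)
            d[i] = np.inf                 # ignore the point itself
            j = int(np.argmin(d))
            bigrams[(int(words[i]), int(words[j]))] += 1
        return bigrams

    The pyramid itself never forms such pairs; it only counts single words per region, so whether the n-gram analogy holds probably depends on how loosely one defines an n-gram.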

    Replies
    1. I think calling it an n-gram might be a stretch. It is just simple frequency statistics. Then, maybe having frequency stats over the entire image is a bit too coarse, so you capture these stats at multiple coarse-to-fine partitions.

      That being said, depending on how they are used, it might have more of an n-gram flavor, e.g. the paper you cite, where they explicitly model relationships.

    2. Isn't capturing frequency statistics at multiple partitions similar to n-grams?

    3. Aren't n-grams trying to capture co-occurrence statistics of a set of n features together? Whether at one partition or multiple shouldn't matter. Let's discuss the details offline!

  4. I agree with Zhiding's point that the image partitioning method used in this paper is not the best way to go about the problem. The Spatial Bag-of-Features paper looks really interesting, and it addresses the problem of scale, rotation, and translation invariance.

    With respect to the philosophy of "Images versus Objects", I personally feel that going down to the object level is not required for the scene classification problem. The human brain can recognize scenes really fast even if the objects in the scene change. Given that, and the fact that segmentation/object detection for all objects in the scene can become computationally expensive, it is probably better to just consider the full image rather than break it down into objects and combine them to identify the scene.

    Replies
    1. While I agree that it may not be necessary to detect all objects in a scene to do scene classification, I would lean much farther toward the side of objects than you. If you see a couch in an image, you'll probably say it's a living room. If you see a bed, then you'd say bedroom. Computer monitor - office. Water - coast. House - suburb. Road and building - street. Just road - highway. My point is that, assuming we could detect these specific objects easily, classification of certain scenes might not be too hard (I dare not say "easy").

      You say that humans can recognize scenes really fast even if the objects change. Suppose we showed a human a bedroom picture without a bed in it. Depending on the other objects in the image, the human may spend some time deciding whether the image was a bedroom, office, or living room. Unnecessary objects could change - bedside tables, dressers, etc. But I think the bed is pretty critical for classification.

    2. I am not very confident of my references to existing neuroscience literature here, so please correct me if I am wrong.

      There is evidence that humans (or maybe primates) use peripheral vision to decide scene semantics. This happens before object-level details are decoded. Peripheral vision is too low-resolution to decipher fine object categories but is perhaps detailed enough to get an overall sense of the scene.
      That said, I would add that humans do not necessarily succeed at this instantaneous/subconscious version of scene classification. For such difficult images (a bedroom without a bed) they would perhaps use object cues to logically put together the best possible answer.
      The latter is slow, and some may think of it as outside the domain of scene classification (maybe a subject for the entire AI module of our system taken together), depending on the task that the scene classification sub-sub-system wants to achieve.

    3. I have seen the point that Aravindh mentioned somewhere, though I couldn't find the exact reference. Humans do use peripheral vision for scene classification. One more point I would like to add: when you want to classify a scene using a couch/bed/water/building etc., these objects generally cover a major part of the scene you are trying to classify. Hence I think it is a safe approximation to just consider the entire image rather than going down to the level of object detection.

    4. Well, for me, this discussion has 2 components:

      1) Scenes vs. objects and 2) Human perception vs. Machine perception

      I frankly think what humans do might or might not be the right thing to do for machines. But in any case, saying that because humans do it this way, machines should also do it this way is not correct. Human perception can indeed give insights and potential new ways of dealing with machine perception, but humans have other sensory input, and there might be other things going on in our heads which we can't observe yet. On the other hand, it might be easier for a computer to do both object->scene and scene->object parsing in parallel, and then combine the noisy results from both to give the final output. Just a thought :)

    5. I would differ with Divya with regard to not needing to go down to the object level - I think the confidence of machine scene recognition must always be established, so the machine can make a yes/no decision on whether to parse objects and establish object relationships to get a more confident answer.

    6. I agree with Abhinav Shrivastava's point. Humans generally have other senses, like taste and touch. We also have other prior knowledge: for example, entering a living room usually happens after entering a house, so if we already know we are in a house, we are more likely to be in a living room than in a conference room.

      For me, if we only look at a single image, it is somehow better to recognize some ‘characteristic object’ and infer the scene from it. However, if we have more temporal information, like a video captured by a rescue robot, we could use that kind of prior knowledge rather than objects.

    7. I agree that having a 'characteristic object' and classifying the scene based on it can work well. But is it worth the computational cost? Is it actually required? How accurate is the object detector? These are questions that need to be addressed before we go down to object detection for the scene classification problem. There are methods that perform reasonably well without doing object detection, and I personally feel that should be sufficient in general.

  5. This id thing is pretty ridiculous. I'm not signing up to one of these services, so I will just say this: I am Matthew Klingensmith (mklingen).

    I'm wondering whether these "pyramids" are the right shape, per se. I agree that the multi-resolution approach is probably on the right track for representing the "gist" of an image, but why squares? Suppose we have features in the image which are best understood via long, thin lines (like their ant example). Then these may not be captured by the multi-resolution square approach.

    Replies
    1. I think the problem you mention extends to more than just the choice of squares vs. other shapes. The parameters chosen for the relative weights between these histograms, the decision not to let the squares overlap, etc. seem to be arbitrary, suboptimal choices. I think that instead of using a rigid pyramid matching kernel the authors should have experimented with a powerful kernel (like RBF) in the single-level L=3 setting. Given enough data they would learn the right composition of these parts rather than bias the algorithm.
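      As a sketch of the comparison I have in mind (hypothetical names; X_l3_train would hold the concatenated histograms of the single finest level, y_train the scene labels):

      import numpy as np
      from sklearn.svm import SVC

      def histogram_intersection(A, B):
          # The kernel used in the paper: sum of element-wise minima.
          return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

      svm_hik = SVC(kernel=histogram_intersection)   # the paper's choice
      svm_rbf = SVC(kernel='rbf', gamma='scale')     # the 'powerful kernel' alternative
      # svm_hik.fit(X_l3_train, y_train); svm_rbf.fit(X_l3_train, y_train)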

    2. Frankly, I like the idea of starting with a simple square pyramid and showing that this simple idea worked (at the time)! Especially given the fact that pyramid match kernels had the necessary theory to provide the correct weights to combine these levels, it was a simple extension that worked.

      But going forward, I agree with both of you that exploring other partitions is a must. An excellent example is http://koen.me/research/colordescriptors/. They used not just squares, but also horizontal stripes, vertical stripes, stripes of different sizes, etc. to build the pyramid. They even gave an option where you can provide your own partitioning (a combination of squares and rectangles). This kind of goes in the direction Matt mentioned, but these partitions are still pretty simplistic.
      Aravindh's idea of using a powerful kernel also seems good, but I don't know if that will help or hurt. It could either help capture regions which have consistent words/features, similar to what mid-level patches try to do, or it might lose the spatial robustness given by this pooling. But it is a great idea for someone to try!

    3. Would creating partition boundaries by clustering pixels together by color intensity, or by the actual feature vectors (or even a subset of them), be a better idea than using rectangular or any other fixed partitions? More simply: divide the whole image into 4 giant superpixels at the first level, each of those into 4 more at the next level, and so on.
      Let's assume one image has the sky taking up approximately half its height and a second image has only a quarter of its height taken up by the sky because of a slightly different camera pitch. I would expect that if partitions were created by clustering similar-looking pixels as mentioned above, rather than using rectangular regions, the partitions containing the sky would have very similar feature distributions (after normalizing, of course).
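      A rough sketch of what I mean, recursively splitting with k-means on color plus a weighted position term (my own guess at a formulation, not anything from the paper):

      import numpy as np
      from sklearn.cluster import KMeans

      def hierarchical_color_cells(image, levels=2, k=4, pos_weight=0.5):
          # Returns one label map per level; each level splits every region
          # of the previous level into (up to) k color-coherent children.
          h, w, _ = image.shape
          yy, xx = np.mgrid[0:h, 0:w]
          feats = np.column_stack([
              image.reshape(-1, 3).astype(float) / 255.0,
              pos_weight * yy.ravel() / h,
              pos_weight * xx.ravel() / w,
          ])
          labels = np.zeros(h * w, dtype=int)
          maps = []
          for _ in range(levels):
              new_labels = np.zeros_like(labels)
              for c in np.unique(labels):
                  idx = np.where(labels == c)[0]
                  km = KMeans(n_clusters=min(k, len(idx)), n_init=4).fit(feats[idx])
                  new_labels[idx] = labels[idx] * k + km.labels_
              labels = new_labels
              maps.append(labels.reshape(h, w).copy())
          return maps

      Each resulting region would then get its own word histogram, exactly as the grid cells do now.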

  6. This comment is from Priya; apparently blogger isn't working for him...

    I would disagree slightly with the comment that the image partitioning method is not rotation- and translation-invariant, because the algorithm considers the entire image in the first level and does not go too many levels deep. A rotated/translated image will match well on the first level, which considers the entire image. For subsequent levels the similarity will decrease, but it will still be high for the first few levels. Some of this invariance is evident from the fact that the algorithm performs well on the minaret class in the Caltech-101 dataset. Also, we could explain the diminishing returns of increasing L from 2 to 3 as an effect of losing rotation and translation invariance due to the decreased weighting of the first level.

    Replies
    1. The authors themselves note that the minaret results are artificially high due to the rotation artifacts (black corners). I do agree with you about rigid-transformation invariance, though only for small displacements. While even dramatic rotations will not affect the L0 histograms, those histograms are downweighted.
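      Concretely, if I read the kernel weighting right, level 0 gets weight 1/2^L and level l >= 1 gets 1/2^(L-l+1), so the whole-image histogram contributes the least:

      def level_weights(L):
          # weight of level 0, then of levels 1..L
          return [1.0 / 2 ** L] + [1.0 / 2 ** (L - l + 1) for l in range(1, L + 1)]

      print(level_weights(3))   # [0.125, 0.125, 0.25, 0.5]

      So a match that survives only at level 0 is worth a quarter of one at the finest level, which is why large rotations still hurt.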

    2. I agree with Priya regarding the synthetic example in the report. Yes, the finer levels won't match well, but the global level would match perfectly. And given the weights, this global match will go a long way!

    3. Priya and Humphrey seem to be correct in their assessment. Even at the coarser grid levels, the representation is only rotationally/translationally invariant up to some relative amount depending on the level (i.e. at the first level you can tolerate larger rotations/translations than at the L=3 level). The learning should take care of this and give more importance to the lower levels than to the higher ones if the dataset has lots of rotations/translations.

  7. I am amazed that the authors are getting such good results with such a small dataset. I feel that the true power of these methods is hidden because they are using only 30/100 training images per class and maybe [just a guess] having to use a large regularization penalty to get a reasonable but not the best result. 34,000 effective dimensions and just 100 training images per class is asking too much of the Support Vector Machines.

    Replies
    1. I think a lot of the performance is coming from the fact that the datasets they use have at most 15 scene categories - it's really easy to discriminate between these categories, especially since there is little overlap between them. Things get a lot worse as the number of categories increases and it becomes harder to discriminate (it also may not even be clear to a human which category an image belongs to). Looking at the SUN results, http://people.csail.mit.edu/jxiao/SUN/benchmark.png shows the performance of the spatial pyramid on 397 categories (the hog2x2 line), which is much lower.

  8. This is kind of a simple question, but why not just add spatial information into the feature vector itself? They mention this in a small footnote at the bottom of the third page (footnote 1), but they never try it. I imagine it would help performance - but by how much, in comparison to the pyramid scheme?
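    Something like the following is what I have in mind (a sketch; alpha is a made-up knob for how strongly position counts relative to appearance):

    import numpy as np

    def append_position(descriptors, xy, image_size, alpha=0.5):
        # Augment each local descriptor with its normalized image position
        # before vector quantization, instead of using a spatial pyramid.
        h, w = image_size
        pos = xy / np.array([w, h], dtype=float)   # x, y scaled to [0, 1]
        return np.hstack([descriptors, alpha * pos])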

    Replies
    1. I doubt that simply adding image coordinates to the feature vector would work. To me, the raw coordinates don't really mean much; however, I do think the relative distances between patches would serve better for embedding structural information. In a paper we will read later, Felzenszwalb proposed a deformable model for capturing exactly this kind of property. But it seems we could also treat this as a mid-level method and keep the features as simple as they should be.

    2. Along the lines of what Jackie said, adding XY doesn't make much sense here because of the way the pooling happens (capturing frequencies). In the limit, you can imagine making the pyramid levels so fine that each pixel is its own partition. In that case it would be similar to what you described, but it would lose the invariance.

    3. Actually, you can think of this pyramid as expanding the histogram vector along another dimension into a histogram matrix, by factorizing the "sum" over different spatial cells.
      Following Zhiding's comment on "Images vs. Objects", I'm actually wondering about the real importance of incorporating this spatial pyramid. Say we had a perfect detector for all the objects we are interested in: would we still need the spatial pyramid, or is spatial invariance all we really need?

  9. It seems like the feature space they're using for the images is pretty large - 4200 dimensions (for the smaller vocabulary). Furthermore, the standard approach, one histogram for the whole image, would be much smaller. I wonder if some of their performance comes from the fact that they're throwing in a whole lot more features.

    Replies
    1. More features do not necessarily increase performance. I think the way they get more features is what matters. Since they compute features at varying spatial resolutions, they preserve more information, and there is a better chance that the features represent the image better than one histogram over the whole image does.

    2. I agree. The results indicate that the quantity of features used correlates with performance. However, it would have been interesting to see whether a smaller set of features used over more than 3 levels can give comparable results, or whether its performance also plateaus as it does for the strong features with vocabularies of 200 or more.

    3. I think the way a longer concatenated feature works is that it consists of features complementing each other. The more one feature captures the information lost by another, the better the performance. The classifier then does the final job of automatically figuring out the decision boundary.

      But having a feature vector that is too long hurts, probably due to the "curse of dimensionality".

  10. For me, as opposed to just using frequencies over the whole image (unordered), the paper adds structure in the form of spatial partitioning and captures the frequencies at all these separate locations. So in some sense it is adding order to completely unordered statistics, and using both together for matching. The final classifier can then choose what it likes from both the ordered and unordered stats.

  11. (gsingh1 says) I like the authors' overall thinking of performing feature matching over a hierarchy of partitions, and I appreciate that choosing fixed rectangular grids is computationally efficient. However, for accurate classification a superpixel-based sub-region division at each level makes sense intuitively. Also, when creating the feature vector for superpixels, we could just add position information, like the median or average XY values of each superpixel, to encode geometric information directly, as Paul points out.

    Replies
    1. It's actually been quite rare in recent years to see a segmentation as an input to higher-level processing like classification. I think the issue is that segmentation--even into tiny segments like superpixels--is inherently unstable. That is, if you make a small, local change to the image, you have a good chance of making a drastic, global change to the whole set of segments. If you're hoping to match images based on the similarity of your descriptors, then this is a very bad property for the descriptors to have.

    2. You make a good point, Carl. But take a look at Daniel Munoz's method in http://www.ri.cmu.edu/pub_files/2010/9/munoz_eccv_10.pdf. Modeling label proportions over regions apparently helps reduce the effect of imperfect segmentation, and he also shares label proportions across segmentation layers. This approach actually works very well for semantic scene segmentation.

  12. Would this be a feasible approach for object detection? How much would the intra-class variation of, say, cars dictate the amount of data needed to perform reasonably well on object detection tasks? This seems like a very simple way to do object detection, and I wouldn't be surprised if someone has done it already (indeed, a search for "object detection spatial pyramid" yields promising results). The main thing I like about spatial pyramids is the 'bang for your buck' they provide. A very simplistic feature quantization and comparison yields decent results on a rather difficult task.

    Replies
    1. Well, if you use something like oriented edges as your feature space, you are going to get something similar to HOG at different scales, with these pyramids, correct?

    2. Though HOG also pools gradients at different scales, it is different from what is done here, since it pools gradients over nearby regions. There is some work using spatial pyramids for detection, e.g. http://research.google.com/pubs/pub40665.html

  13. This comment has been removed by the author.

  14. The discussion of the minaret images is dumbfounding to me. These are not representative of any class of natural images. Why would the authors spend any time reporting results on them? Why do they remain in the dataset?

    [deleted previous comment since there seems to be no way to edit]

    Replies
    1. I was curious, so I went and figured it out. Here's the answer from the original Caltech-101 paper:

      -------------------
      Additionally, categories with a predominantly vertical structure were rotated to an arbitrary angle, as the model parts are ordered by their x-coordinate, so have trouble with vertical structures. One could also avoid rotating the image by choosing the y-coordinate as ordering reference.
      -------------------

      Basically the model relied on the x-ordering of some form of parts (specifically, it normalized the locations of all part detections with respect to the leftmost part detection, regardless of its vertical position), and this ordering becomes highly variable if there are lots of part detections one above the other. Hence, they hacked their dataset to make their algorithm work.

  15. Taking a cue from HOG, would it not be better to do soft-voting for visual words across "grid partitions"? I'm sure someone must have done this.
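    For what it's worth, here is a minimal sketch of the HOG-style soft-voting I mean for a single pyramid level, assuming keypoint coordinates normalized to [0, 1) (this is not from the paper):

    import numpy as np

    def soft_voted_level(xy, words, vocab_size, cells):
        # Bilinearly split each keypoint's vote between the neighboring
        # grid cells, instead of a hard assignment to a single cell.
        hist = np.zeros((cells, cells, vocab_size))
        g = xy * cells - 0.5                    # position in cell-center units
        g0 = np.floor(g).astype(int)
        f = g - g0                              # fractional offsets
        for dx in (0, 1):
            for dy in (0, 1):
                cx = np.clip(g0[:, 0] + dx, 0, cells - 1)
                cy = np.clip(g0[:, 1] + dy, 0, cells - 1)
                wgt = (f[:, 0] if dx else 1 - f[:, 0]) * (f[:, 1] if dy else 1 - f[:, 1])
                np.add.at(hist, (cy, cx, words), wgt)
        return hist.reshape(-1, vocab_size)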

    Replies
    1. People do that a lot! They even use overlapping partitions.

  16. I'm a little disturbed by the decrease in performance in a few cases when L increased. It would seem that, given a large enough dataset, the learning algorithm should be able to down-weight the finer-resolution (L=3) histogram features relative to the coarser ones (L=0, 1). That should be the power of learning.
    This seems to be either a result of 1) having a small dataset (as Aravindh, I believe, mentions above), so that we are unable to overcome the noise that the larger L levels bring to the classification,
    or 2) a problem stemming from improper regularization. We see that for the weak features (M=16), increasing L improves the results, but as M gets large, the performance eventually drops from L=2 to L=3 for scene classification and on the Caltech-101 set. It would have been interesting to see how high L can go for the M=16 case in either setting. This all seems symptomatic of regularization problems in the high-dimensional space (as M increases, the feature vector size for learning goes up). Without enough data or regularization, performance will drop.

  17. In their approach, a multi-scale image pyramid is not considered; I think part of the reason is that the features they use (e.g. SIFT, GIST) already incorporate multi-scale information. But the pyramid sizes are chosen in a discrete manner, so I think it would help if the pyramid inside the descriptor could be designed to interlace with the spatial pyramid over the scene image.

  18. This comment has been removed by the author.

  19. "2) But on the other hand, cutting image into straps or circles is in a way fine-tuning of the general idea, and fine-tuning always results in more parameters to fix. And more parameters tend to make models less flexible. I think the strength of the original paper is the simplicity of the idea and of the implementation (especially, considering the recognition rate boost). Also the good thing is that only a part of the pipeline is changed - only producing the histogram from the visual words part is changed."

    I agree with this point. That's exactly the issue I thought of when thinking about adding more freedom.

    And I tend to think the levels of importance of translation, rotation, and scale are different. When we talk about scenes, we are typically talking about a canonical photography setting where the resulting images are somewhat biased: they are shot with the camera pointing toward the horizon. So horizontal translation and scale changes might be somewhat more important factors than rotation.

    And many thanks to Abhinav for helping to report some of my points :)

    ReplyDelete