16-824: Learning-based Methods in Vision (F'13): Reading for 10/8

Thursday, October 3, 2013

Reading for 10/8

Ali Farhadi, Ian Endres, Derek Hoiem, David Forsyth, Describing Objects by their Attributes, CVPR 2009

And optionally:

Neeraj Kumar, Alex Berg, Peter Belhumer, Shree Nayar. Attribute and Simile Classifiers for Face Verification. ICCV 2009

Devi Parikh and K. Grauman, Relative Attributes, ICCV 2011.

39 comments:

Humphrey HuOctober 7, 2013 at 9:17 AM
Describing Objects by their Attributes
Ali Farhadi, Ian Endres, Derek Hoiem, David Forsyth

This work focuses on learning semantic attributes to describe object categories, and studying the effectiveness of attributes for object recognition. Attribute-based description and recognition are not new concepts, but the presented work makes significant contributions in learning and generalizing their classifier.

*Summary
Using a new dataset annotated with semantic attributes, the system learns a set of attribute classifiers from a set of base features. Each classifier is learned using selected features such that the classifier generalizes across classes. The predicted attributes are then used to produce semantic models for object classes. Experiments study the accuracy of the attribute prediction, the learned semantic models, and the generalizability of the system.

*Contributions
This work has two identifiable contributions. The first is the feature selection technique for learning the attribute classifiers. As there are many semantically-irrelevant class-feature correlations (i.e. metal wheels on cars, as described in the paper), the authors first select features that strongly correlate to the attribute within classes, and then learn an overall classifier using all of these selected in-class features. The authors also present statistics demonstrating that this technique significantly reduces incorrect prediction correlations when learning from a biased dataset for their specific "metal wheel" example. This generalization technique differs from existing approaches, such as dimensional reduction, by being much more heuristic in nature. Though it is intuitive, it would be nice to see a more rigorous explanation as to why features that discriminate within a class are important, and why a classifier learned over all such features generalizes to unseen classes.

The second contribution is the expansive semantic category model. A similar concept has been shown before for outdoor scenes by Vogel and Schiele [24], but using a distribution of semantic occurrences and only texture/material attributes instead. The presented work uses three categories: shape to describe the form of objects, part to identify common components, and material to identify common materials and material properties. The authors demonstrate the descriptive power of their semantic lexicon by demonstrating object recognition from purely textual descriptions and by learning new classes with very few examples.

*Pros
The method presented is intuitive and demonstrates good performance in the attribute prediction task. The experiments present a broad evaluation of the approach. Stylistically, the paper is well-written and easy to follow.

*Cons
The paper is a bit sparse on technical details, but given the simplicity of the approach and length restrictions, this is mostly forgivable. It would also be nice to see more correlation statistics for feature selection across other classes and features.
ReplyDelete
Replies
M AravindhOctober 7, 2013 at 1:54 PM
Object recognition often uses linguistic information to learn representations (mid level/high level). The language input has predominantly been a class label per bounding box. This paper pushes it one step further and uses language input for the mid level representation too.

One can imagine an alternate formulation where mid level and high level representations are learnt in an unsupervised manner using visual information alone. These can now connect with language input through a cross modal linkage. A transfer of information across this linkage lets us understanding the visual input corresponding to textual description and vice versa. Such a model would perhaps not be biased by the limited information present in text (which partly forced the authors to use a tailor made feature selection method) and is still able to use text information through the linkage. I see this formulation as having all the advantages mentioned in the paper without constraining the mid level representation to be language based.

I see the discriminative attributes as an escape from limitations of language based attributes ... why not do away with them all together and use visual mid level features to relate to language externally. This external process can connect in more sophisticated ways enabling a more expressive model to be used.
ReplyDelete
Replies
Srivatsan VaradharajanOctober 7, 2013 at 2:05 PM
This paper takes a novel approach to the recognition problem by shifting the focus from identifying object classes to detecting the presence of visual properties or attributes.
There are a plenty of reasons to like this paper:
1. Moving towards recognizing inter-class "attributes" is overall a great idea. Being able to describe, using attributes, objects that haven't been seen before and the ability to learn from attributes directly without really seeing an example of what needs to be learnt are both significant achievements. Taking this idea to the extreme, I would really like to see a system that can learn from a small number of annotated images and lots of text data (for example, wikipedia pages for each object of interest) with weakly supervised methods.
2. The idea of using semantic as well as discriminative attributes is good, although I'm not entirely convinced about the way they select discriminative attributes.
3. On a lighter note: the paper manages to convey a really interesting and useful idea without any mathematical notation at all. I don't remember the last time I saw this happen in a computer vision paper.
Issues:
1. The dataset for this paper contains 64 semantic attributes that seem to have been selected manually. It would have been good to see a method with a sound basis for the selection of semantic attributes. Also, does treating all attributes (Shape, Part and Material attributes) equally make sense? Would a structured model for attributes (something like a hierarchy) be better?
2. Since semantic attributes don't always distinguish classes (cat and dog for instance), the paper proposes creating discriminative attributes. These discriminative attributes are computed by randomly selecting pairs of classes (or pairs of class-groups) and testing to see if they are well separated in the feature space. There is no justification given for why it is good enough to do this selection in a random manner.
ReplyDelete
Replies
UnknownOctober 7, 2013 at 2:41 PM
Why attributes are useful? I have two views about it:

1. This is actually from Abhinav, because for attributes, we usually have a lot more data to train with. For example, we might have 1000 images for cars, and 1000 images for yellow objects, yet only 10 images for yellow cars. Since data is essential for most of the machine learning algorithms, we can train two separate classifiers for yellow and car independently and whenever they co-fires, we say that is a yellow car.

2. Attributes often corresponds to modifiers (adjectives, adverbs, etc.) in a language, while traditional categories are usually nouns (scene, objects, etc.) or verbs (actions,etc.). The truth is, humans do not have a compact semantically meaningful word to describe every object in the world, and whenever the don't have a proper phrase, they often resort to attributes: modifiers that can discriminate this particular thing from the others. So attributes are in fact a very good at classification, they are born with that purpose. It is like a hash map that are within the language as it evolves.

However, I am not convinced that attribute should correspond to specific words in a language, I like the idea of unsupervised attribute discovery a lot, I don't think the language has exhausted the discriminative bits in our visual world.
ReplyDelete
Replies
UnknownOctober 7, 2013 at 2:58 PM
Describing Objects by their Attributes
Ali Farhadi, Ian Endres, Derek Hoiem, David Forsyth

Summary:
This paper proposed a attribute-based object recognition framework. By characterizing objects into semantic and discriminative attributes, a classifier is then built upon those attributes to name the object.

Key points:
1. Propose to use the attribute as an intermediate level representation for vision tasks.
2. Across category generalization via within category prediction. Use feature selection to reduce variance during learning which enables learning with much less training data.
3. Show some new visual functions provided by this mid-level representation.
4. Provide insights about the dataset bias issue.

For the above contributions, I give some comments point by point:
I like the idea of constructing mid-level representation connecting low-level features like color, gradient to the category labels of visual data, since it could possibly provide a compact yet informative representation which reduces the amount of training data we need to learn vision models. Using semantics of images as features seems to be an intuitive thing to do. However, the authors didn't mention how did they learn to predict the attributes (whether through a linear or non-linear machine learning algorithm?). If they learn this through a linear way, it seems to be a little bit problematic to me (here I refer to their method without feature selection, aka "whole feature" version), since they are using linear SVM, building a linear classifier over another linear classifier does not make any sense. And this pretty much explains why they cannot get better performence (Table 1 "whole features"). The author proposes many new visual functions using this mid-level representation, even though, some of them seem to be not that useful (personal view...). For example, instead of saying "there is some probability of ocurring a face in a bus picture" using this method, why don't we just run both a face and a bus detector?

Across category generalization via within category prediction is a nice idea since it helps to reduce the number of factors which correlate with the target attribute we want to learn, as shown with a statistical experiment. Feature selection via sparsity regularized regression is something people in ML field have always been doing (e.g. Lasso). Doing this here makes a lot sense since it greatly reduces the number of training data we need by discarding some confusion features. In fact, this can be regarded as a denoising process in learning the models.

Another very nice point about this paper is that the authors discussed the dataset bias issue at the end of the paper which people unintentionally ignored usually.

Generally, I like the basic idea presented by this paper, but concern about the results in the paper. It seems that what really makes sense is to do feature selection in this mid-level representation space. The authors did not prove the superiority of the representation themselves in terms of the performance in recognition task.
ReplyDelete
Replies
Divya HariharanOctober 7, 2013 at 6:13 PM
The paper presents a very interesting and novel way to describe objects using semantic attributes. It is very intuitive because I'm guessing that is how people identify objects. But there is one point very confusing to me. The authors say they are learning a "wheel" classifier. So I'm assuming they are learning a classifier for each semantic attribute. If that is the case, it is not very clear from the paper how they handle cases where there is no (or very less) data without a semantic attribute (say, a person without a head?).
ReplyDelete
Replies
IshanOctober 7, 2013 at 7:49 PM
This paper proposes an interesting idea of using semantic
representations as features. They pose the problem as that of
inferring attributes for test data, rather than "class label
association". This attribute association is challenging also because
the number of attributes is greater than number of classes (hence more
distractors).

Ideas I liked
- The way of learning the discriminative attributes via random
splits. In cases where you don't know what to do, random helps!

Would have liked to see
- The authors don't say why their feature selection performs poorly
across datasets (Fig 4). I am certain this is a dataset bias issue,
which weakens the authors' claim of "selection of features" actually
learning the semantics as opposed to some correlation (the wheel
example).
- How does this approach scale with number of attributes (# of
distractors)?
- How were the reliable attribute classifiers determined? This would
make sense if it was done via cross-validation.
- Feature vs. attribute correlation. A simple experiment showing that
say texture features help these attributes. It would be a nice
insight into the way features work themselves.
ReplyDelete
Replies
UnknownOctober 7, 2013 at 8:57 PM
This paper presented a novel task and interesting approach for solving it. Attributes are very intuitive for humans, and thus an algorithm that reasons about them in images seems appealing. However, I wish the authors would've substantiated their claim that attributes are 'essential' for object detection. Perhaps attributes are how humans perform object recognition, but also, perhaps not, and even if so, why do we want our systems to do so as well? If the final output from an algorithm is useful (either to humans or to another algorithm), then why bother?

The main benefit I see to this approach is the ability to name unknown classes. I think attributes can be fairly well characterized by features, and thus their underlying nature "has " or "is " aren't relevant for the reasoning within the algorithm itself. It would be interesting to try a lifelong learning approach that introspects every so often to attempt to interatively cluster attributes as data is encountered, adding additional attribute categories as they correspond to new clusters. This would require either very robust features, such that features of truly novel "attributes" are sufficiently discriminable from previous attributes.

I think their idea can be much improved upon, the marginal benefit coming from (in order):

1. segmented ground truth labeling (not bounding boxes)
2. much more data
3 (or maybe, 0?). a process of information sharing and iterative prediction across attribute classifiers
ReplyDelete
Replies
UnknownOctober 7, 2013 at 9:24 PM
I think this paper definitely presents an interesting new way of representing objects other than directly using features. It partially answered my problem raised in last course about uneven importance of features in an object for classifications.

But the point I want to address here is that: For object categorization, the concept of "attribute" doesn't matter. The important thing about this paper is that it is actually addressing the problem of "a set of different object classes sharing highly correlated features". These features on one hand confuses the discrimination between these classes but on the other hand may help to distinguish these object classes from other object classes. The paper clearly realized this problem and emphasized the problem of "correlation" in this paper for many times.

Given this fact, the strategy of using a "flat" classification scheme based on single features clearly would fail under many circumstances. The biggest problem with this kind of strategy is that it overlooked the inherent hierarchy and the potentially unbalanced levels of appearance/feature differences required to distinguish objects. These factors are essentially the factors that are making object recognition tasks subtle. A good example would be A sheep actually looks very similar to a cow as they both share four legs, which corresponds to large vertical HOG responses at the bottom of the objects. It is relatively easy to distinguish them from non-animal categories such as buildings, but to further distinguish them more subtle features are needed.

A single classification process clearly can not achieve good performance with respect to the above situation. A better strategy might be: 1. Accept the fact that certain object classes do share features. 2. Find discriminative features that generalizes well in one or several categories (e.g., separate 4-legged animals from buildings and natural scenes). 3. Find discriminative features that further separates these categories (e.g., distinguish cows and sheep, or cats and dogs based on more subtle features).
ReplyDelete
Replies
UnknownOctober 7, 2013 at 9:34 PM
Another interesting perspective of attributes is that they are in a sense similar to dictionary words created by supervised learning.

From a machine learning view, these attributes are like sparse dictionary components that represent each object in a very compact way. They are also like principle components which reduces the feature space dimension which help to overcome "curse of dim".
ReplyDelete
Replies
Priya DeoOctober 7, 2013 at 10:13 PM
This paper classifies objects using an attribute feature set rather than the traditional image feature set. Using attributes helps the paper generalize to new categories with fewer visual examples. Attributes describe the shape, parts, or material properties of the object. The authors also include many discriminative attributes that help distinguish between similar classes.

Pros:

Seems like you can get an arbitrarily descriptive classification using this approach. I.e. has wings + has beak indicates a bird; has wings + has beak + is blue indicates a bluejay.

They take a linguistic approach to classification, which is how humans describe their world, so it learns very much the same way a human would learn.

Attempt to correct/generalize over dataset bias.

Cons:

I am not convinced by their argument for generalizing attributes across all categories. Human legs look completely different from bird legs and cat legs. Does it even make sense to group all of these together?

The discriminitive attributes seem like the major driving force for this algorithm, especially since there are 1000 of these. Its difficult to really comment on this since they never describe how they enumerate the semantic attributes or how many of semantic attributes there are.
ReplyDelete
Replies
ArunOctober 7, 2013 at 10:56 PM
The selection of features is a good idea, but it isn't explored why it was good to do it per class. To recap, they use an L1 - regularizer to select which features to use on a per attribute per class basis. This seems like an 'incorrect' way to do it. It "should" be equally good (and arguable more 'correct') to let that selection come from the learning by doing an L1 regularized selection across all classes per attribute. If the argument is that metallic vs nonmetallic wheels show up in the car data, then they should also show up when pooled and the L1 regularizer should deal with that by trying for sparsity in the whole set. Their 'whole features' comparison isn't completely telling the whole picture. It would be necessary to show L1 regularized selected features on [all classes] vs [per class + pooling] to prove that their novel feature selection scheme is better at generalization.
ReplyDelete
Replies
UnknownOctober 8, 2013 at 12:05 AM
I have a very whimsical mental experiment about the comparison between attributes and objects, whether attributes should lay as a mid-level representation for objects, or not. The idea is actually stolen from Daniel's stacked hierarchical labeling machine. In each layer we train a classifier for every attribute & object, and then see what would happen?
ReplyDelete
Replies
Abhinav ShrivastavaOctober 8, 2013 at 3:02 AM
I love the idea of sharing properties between objects using named or un-named attributes. This paper tries few most simple and intuitive experiments to demonstrate the idea. Even though there were some other works (like [10]) which were along the same lines, I think the simplicity of this paper made it so famous. Also for me, the idea of using attributes (because of their shared nature) for describing unseen objects is very inviting!

How I see attributes (semantic and non-semantic, named or unnamed) is that they just give you multiple partitions of the same data. For each binary attribute, you can use images from all objects as either positive or negative, which makes them easy to learn. From that standpoint, it can be thought of as asking our recognition/classification system to predict multiple things (object class, scene, attribute etc.) and then combining these results together.. (yes, that reminds me of multi-task learning). Another paper on scene attributes (http://cs.brown.edu/~gen/sunattributes.html) shows this idea partitioning of data pictorially.

I like most of the paper, so not listing everything, but here are my few concerns:
1. I agree with this discussion above -- why did they do just within class category prediction (cars with wheels and no wheels)? If they had labels for all the images, they could have easily trained all wheels and no wheels together. Since (still) there are metallic stuff occurring on the negative side (like cars and buses without visible wheels), the classifier could have de-correlated the wheel and metallic features. I think this should have been included to justify single-class learning.
2. The discussion/comments on L1 sparsity by Arun..
3. One likely experiment they could have tried was to prune out false positives of any standard detector (like DPM) by matching lists of attributes. :)
4. I would have liked to see 3 more analysis experiments (even at the cost of some of the current ones)
4.1 Which attributes are most confused with others? (Ishan mentioned this)
4.2 Which features are helpful for which attribute? This would give us insights in choosing/designing features.
4.3 Which are the easiest and most difficult attributes to learn? By extension, which are the most and least reliable attributes? By further extension, which are the most and least helpful attributes?

Minor comments on writing:
- Too add to the \small{} in figure descriptions, the texts in graphs were too small to be easily read in a print-out.
- Paper had lot of typos (but hey, even my papers have that!)
- I didn't like the closing argument regarding INRIA and PASCAL and pedestrian and person detection. May be I'm just being pedantic or nit-picking, but INRIA dataset is just meant for pedestrians. Papers written for it only claim that they work on pedestrians. It's not a dataset with people in the wild. So yes it is clear that pedestrian datasets are special, they have people standing/walking in upright pose, as opposed to random dancing, jumping and playing person from images in the wild... alright, I'll stop this now..

Overall, I love the idea of attributes and how the authors presented it. For me, it indeed is very exciting idea and very critical to computer vision (recognition).
ReplyDelete
Replies
UnknownOctober 8, 2013 at 7:23 AM
I like the idea behind attributes and learning the features which are useful behind each attribute. Compared to something very coarse like full scene categorization, which I think is very poorly defined as scene categories start to overlap one another, I think attributes provides a good way to essentially group things which are similar in an image (which probably has a fairly consistent feature representation).

One experiment I thought was interesting was using attributes to learn from textual descriptions of images (since attributes naturally lends itself to transferring text labels to images), however I wish there was more discussion about what the attributes are / which features represent each attribute / and the similarity between attributes (does it affect recognition).
ReplyDelete
Replies
UnknownOctober 8, 2013 at 9:37 AM
I really like this paper even though there lacks a lot of technical details and math formulations and also some methods here are killing the result or make no sense as a lot of people says.

In this paper, the idea that using semantic attributes as mid level features is really a good idea. The contributes of this paper is that the semantic attributes can generalize both within and across the categories and can not only recognize the known objects but also give a description of unknown objects which means that the computer is learning the object structures. The feature selection method in the paper is also interesting. Although it is sensible as M Aravindh said, the intuition is really good and make sense. It is really hard to select features for all categories which may somehow decrease the performance but selecting features within the categories and across some subset of categories really helps. However, the method to predict attributes is not that clear and may be not right in this paper, we could still have further study on that.
ReplyDelete
Replies
UnknownOctober 8, 2013 at 10:18 AM
I like the idea of attributes. For the last paper, I mentioned in the blog a concern that the authors' algorithm wouldn't be easily generalizable to solving object recognition as a whole since we would need to do so much hand labeling. I think this algorithm's ability to learn new objects makes it very promising. However, I really don't understand their experiment for "standard object recognition in new categories." It seems like they trained on classes in a-Pascal and tried to recognize classes in a-Yahoo, but there are no overlapping classes in the datasets, so I don't understand how they manage to recognize a centaur after only having trained on humans and horses. Unless they're simply naming attributes... Or maybe the point was that they could learn the new objects with much fewer training examples?

The idea of using attributes for object recognition reminds me a bit of the idea of using object recognition to do scene recognition. I remember there being a somewhat involved discussion here about whether we should do scene recognition through objects or as a whole. My impression was that most people thought we should do it as a whole, since object recognition was not robust, and it would take a lot of computation time to try to recognize all the objects in a scene. Interestingly here, it seems like most people are in support of this attributes for object recognition methodology. Is it because we think attribute detection is more robust? Is it more computationally efficient than object recognition in a scene?
ReplyDelete
Replies

Add comment