Thursday, October 3, 2013

Reading for 10/8

Ali Farhadi, Ian Endres, Derek Hoiem, David Forsyth, Describing Objects by their Attributes, CVPR 2009

And optionally:

Neeraj Kumar, Alex Berg, Peter Belhumeur, Shree Nayar. Attribute and Simile Classifiers for Face Verification. ICCV 2009

Devi Parikh and K. Grauman, Relative Attributes, ICCV 2011.


  1. Describing Objects by their Attributes
    Ali Farhadi, Ian Endres, Derek Hoiem, David Forsyth

    This work focuses on learning semantic attributes to describe object categories, and studying the effectiveness of attributes for object recognition. Attribute-based description and recognition are not new concepts, but the presented work makes significant contributions in learning and generalizing their classifier.

    Using a new dataset annotated with semantic attributes, the system learns a set of attribute classifiers from a set of base features. Each classifier is learned using selected features such that the classifier generalizes across classes. The predicted attributes are then used to produce semantic models for object classes. Experiments study the accuracy of the attribute prediction, the learned semantic models, and the generalizability of the system.

    This work has two identifiable contributions. The first is the feature selection technique for learning the attribute classifiers. As there are many semantically-irrelevant class-feature correlations (e.g. metal wheels on cars, as described in the paper), the authors first select features that strongly correlate with the attribute within classes, and then learn an overall classifier using all of these selected in-class features. The authors also present statistics demonstrating that this technique significantly reduces incorrect prediction correlations when learning from a biased dataset for their specific "metal wheel" example. This generalization technique differs from existing approaches, such as dimensionality reduction, by being much more heuristic in nature. Though it is intuitive, it would be nice to see a more rigorous explanation as to why features that discriminate within a class are important, and why a classifier learned over all such features generalizes to unseen classes.
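To make the within-class idea concrete, here is a minimal sketch on made-up data. It is not the authors' method: instead of their learned selection, it scores each feature by the gap between its mean on positives and its mean on negatives, which is a deliberately simpler stand-in. The class names, the two features ("metal", "round") and the 0.5 threshold are all hypothetical.

```python
from collections import defaultdict

FEATURES = ["metal", "round"]

# Hypothetical toy samples: (class, (metal, round), has_wheel)
SAMPLES = (
    [("car", (1, 1), True)] * 4 + [("car", (1, 0), False)]
    + [("cat", (0, 0), False)] * 4
    + [("cart", (0, 1), True)] * 2 + [("cart", (0, 0), False)] * 2
)

def mean_gap(samples):
    """Per-feature |mean on positives - mean on negatives|; None if one side is empty."""
    pos = [x for _, x, y in samples if y]
    neg = [x for _, x, y in samples if not y]
    if not pos or not neg:
        return None
    return [
        abs(sum(p[j] for p in pos) / len(pos) - sum(n[j] for n in neg) / len(neg))
        for j in range(len(FEATURES))
    ]

def select_pooled(samples, threshold=0.5):
    """Score features on all classes at once."""
    gaps = mean_gap(samples)
    return {FEATURES[j] for j, g in enumerate(gaps) if g > threshold}

def select_within_class(samples, threshold=0.5):
    """Score features separately inside each class, then pool the selected ones."""
    by_class = defaultdict(list)
    for sample in samples:
        by_class[sample[0]].append(sample)
    selected = set()
    for class_samples in by_class.values():
        gaps = mean_gap(class_samples)
        if gaps is None:  # class has no positives or no negatives for this attribute
            continue
        selected |= {FEATURES[j] for j, g in enumerate(gaps) if g > threshold}
    return selected

print(select_pooled(SAMPLES))        # pooled scoring also picks up "metal"
print(select_within_class(SAMPLES))  # within-class scoring keeps only "round"
```

In this toy set, "metal" looks predictive of wheels only because cars dominate the positives; inside any single class it does not vary with the attribute, so the within-class pass drops it.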

    The second contribution is the expansive semantic category model. A similar concept has been shown before for outdoor scenes by Vogel and Schiele [24], but using a distribution of semantic occurrences and only texture/material attributes instead. The presented work uses three categories: shape to describe the form of objects, part to identify common components, and material to identify common materials and material properties. The authors demonstrate the descriptive power of their semantic lexicon by demonstrating object recognition from purely textual descriptions and by learning new classes with very few examples.

    The method presented is intuitive and demonstrates good performance in the attribute prediction task. The experiments present a broad evaluation of the approach. Stylistically, the paper is well-written and easy to follow.

    The paper is a bit sparse on technical details, but given the simplicity of the approach and length restrictions, this is mostly forgivable. It would also be nice to see more correlation statistics for feature selection across other classes and features.

    1. Discussion Points
      1. Is the feature selection process sensible? What happens if we take classification to the extreme with exemplar approaches?

      2. If the attributes generalize to unseen classes, why does it help with the rejection task? Is this because of the semantic attributes or discriminative attributes?

    2. 1. The feature selection process seems intuitive as you said and also sensible.

      The authors are trying to study a category with an attribute and the same category without that attribute. This is in contrast to a standard approach where we are trying to predict the attribute across all categories that have it.

      The latter of these is tough because each of the categories will want their own set of features to be selected. A simple L1 regularized logistic regression is unable to do that well enough. The authors are simplifying its job by reducing the diversity in the train set.

      Another reason to do this is to avoid learning "metal" instead of "wheel":
      If you pick what's common across all the categories, you'll end up picking metal, because wheels all look different. But if you do it within categories, then wheels look more similar to each other than the metal does.

    3. Regarding the per class argument: I'm not fully convinced that per-class is necessary. Why not L1 regularize over all the classes? They don't compare to this or mention it, but it would make their claim more substantial.

      I don't think they look at the exemplar approach, but I wonder how well attribute transfer with exemplars would work. There would be some generalization problems, but as the datasets grow, that again becomes less and less of a problem.

  2. Object recognition often uses linguistic information to learn representations (mid level/high level). The language input has predominantly been a class label per bounding box. This paper pushes it one step further and uses language input for the mid level representation too.

    One can imagine an alternate formulation where mid level and high level representations are learnt in an unsupervised manner using visual information alone. These can then connect with language input through a cross modal linkage. A transfer of information across this linkage lets us understand the visual input corresponding to a textual description and vice versa. Such a model would perhaps not be biased by the limited information present in text (which partly forced the authors to use a tailor made feature selection method) and would still be able to use text information through the linkage. I see this formulation as having all the advantages mentioned in the paper without constraining the mid level representation to be language based.

    I see the discriminative attributes as an escape from the limitations of language based attributes ... why not do away with them altogether and use visual mid level features to relate to language externally? This external process can connect in more sophisticated ways, enabling a more expressive model to be used.

    1. Yes! Forcing the midlevel features to have names creates some problems: Figure 7 shows an airplane with a "beak" which is considered a false positive. But from the information present in the image, I'd say the airplane really does have a "beak." The concept that "airplanes don't have beaks" is not needed for classification, as you can distinguish airplanes from birds in other ways.

    2. I agree that a lot of object / scene recognition is tied to linguistic information, which I personally don't like. For the most part, I think a lot of these language representations are solely there to help humans understand what is going on and don't really add much to help with object detection and similar tasks (and I think we should avoid language binding until as late as possible - i.e. when the result needs to be interpreted). I agree that it would be interesting to instead let the algorithm determine the mid and high level representations + connections. And what would be cool is if some of these representations actually corresponded to some attributes / categories (e.g. grassy, sky, etc.).

    3. I partially agree with the argument against tying verbal attributes to visual data. One problem with hard-associations of verbal attributes with visual data is that attributes need not necessarily be directly linked to the local features of an object, but might depend on other things such as functionality or context. For instance, attributes that describe functionality can look vastly different between different classes (wings on an aeroplane and a bird, a bike saddle and a horse saddle, etc. - these are actual attributes used in this paper.) In this case, as Aravindh and Paul mention, it may be a good idea to:

      - avoid language binding until as late as possible, i.e., when the interpretation of objects has reached the level where their functions and capabilities can be described.
      - link to language attributes through some external processes

      However, I feel that this need not always be the right thing to do, because sometimes, even at the lower levels of visual recognition, it is possible to assign attributes that can help distinguish certain classes from others. In the airplane example given by Mike above, while it is true that airplanes don't have real beaks, detecting beak-like structures on an airplane can help distinguish it from some other objects easily. Say we have an attribute called "artificial", for instance, and it fires on both cars and airplanes. Then other attributes that indicate the presence of beak and wing-like structures can help distinguish airplanes from cars, which don't have these. A child might well say that airplanes have beaks - it is just that we outgrow the inclination to describe things in a certain way and learn to disassociate "beaks" from "artificial" objects. I don't think it is easy to take either extreme point of view: completely disassociating language till the last stage of interpretation or tightly binding visual features and language attributes in very early stages.

  3. This paper takes a novel approach to the recognition problem by shifting the focus from identifying object classes to detecting the presence of visual properties or attributes.
    There are plenty of reasons to like this paper:
    1. Moving towards recognizing inter-class "attributes" is overall a great idea. Being able to describe, using attributes, objects that haven't been seen before and the ability to learn from attributes directly without really seeing an example of what needs to be learnt are both significant achievements. Taking this idea to the extreme, I would really like to see a system that can learn from a small number of annotated images and lots of text data (for example, wikipedia pages for each object of interest) with weakly supervised methods.
    2. The idea of using semantic as well as discriminative attributes is good, although I'm not entirely convinced about the way they select discriminative attributes.
    3. On a lighter note: the paper manages to convey a really interesting and useful idea without any mathematical notation at all. I don't remember the last time I saw this happen in a computer vision paper.
    1. The dataset for this paper contains 64 semantic attributes that seem to have been selected manually. It would have been good to see a method with a sound basis for the selection of semantic attributes. Also, does treating all attributes (Shape, Part and Material attributes) equally make sense? Would a structured model for attributes (something like a hierarchy) be better?
    2. Since semantic attributes don't always distinguish classes (cat and dog for instance), the paper proposes creating discriminative attributes. These discriminative attributes are computed by randomly selecting pairs of classes (or pairs of class-groups) and testing to see if they are well separated in the feature space. There is no justification given for why it is good enough to do this selection in a random manner.
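As a rough illustration of the random-split idea described in the last point, the sketch below draws random pairs of class groups and keeps those that a simple classifier can separate. Everything here is an assumption made for illustration: the 2-D features, the class names, the nearest-centroid test standing in for whatever separability check the paper uses, and the 0.9 accuracy cut-off.

```python
import random

# Hypothetical 2-D features, two samples per class.
DATA = {
    "cat": [(0.0, 0.1), (0.1, 0.0)],
    "dog": [(0.2, 0.1), (0.1, 0.2)],
    "car": [(1.0, 0.9), (0.9, 1.0)],
    "bus": [(1.1, 1.0), (1.0, 1.1)],
}

def centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split_accuracy(data, left, right):
    """Nearest-centroid accuracy when separating the `left` classes from the `right` ones."""
    left_pts = [p for c in left for p in data[c]]
    right_pts = [p for c in right for p in data[c]]
    lc, rc = centroid(left_pts), centroid(right_pts)
    correct = sum(sq_dist(p, lc) < sq_dist(p, rc) for p in left_pts)
    correct += sum(sq_dist(p, rc) < sq_dist(p, lc) for p in right_pts)
    return correct / (len(left_pts) + len(right_pts))

def random_discriminative_splits(data, n_trials=20, min_acc=0.9, seed=0):
    """Draw random class-group pairs and keep those that separate well."""
    rng = random.Random(seed)
    classes = sorted(data)
    kept = []
    for _ in range(n_trials):
        k = rng.randint(1, len(classes) // 2)   # group size on each side
        chosen = rng.sample(classes, 2 * k)
        left, right = frozenset(chosen[:k]), frozenset(chosen[k:])
        acc = split_accuracy(data, left, right)
        if acc >= min_acc:
            kept.append((left, right, acc))     # duplicates are possible
    return kept

for left, right, acc in random_discriminative_splits(DATA):
    print(sorted(left), "vs", sorted(right), acc)
```

The sketch makes the reviewer's concern visible: which splits survive depends entirely on the random draws and on the separability threshold, and nothing in the procedure says how many trials are enough.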

    1. This comment has been removed by the author.

    2. I agree that a structured model would be better. For example, if "legs" and "wheels" are easy to detect, then having the top level distinguish between "wheels" and "legs" first to distinguish between animals and vehicles. Then we can detect attributes like "snout" and "head" for those objects with "legs" and attributes like "doors" and "mirrors" for those objects with "wheels" to further disambiguate between similar categories.
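A structured model of this kind could look roughly like the two-level lookup below. This is only a sketch of the suggestion in this comment: the tree, the group names and the leaf labels ("car", "bicycle", "dog", "bird") are hypothetical, and a real system would use classifier scores rather than a hard set of detected attribute names.

```python
# Hypothetical two-level attribute tree: an easy top-level attribute picks the
# branch, then branch-specific attributes pick the category.
TREE = {
    "wheels": ("vehicle", {"doors": "car", "pedals": "bicycle"}),
    "legs": ("animal", {"snout": "dog", "beak": "bird"}),
}

def classify(attrs):
    """attrs: set of detected attribute names. Returns (group, label)."""
    for top_attr, (group, leaves) in TREE.items():
        if top_attr in attrs:
            for attr, label in leaves.items():
                if attr in attrs:
                    return group, label
            return group, None   # branch known, category undecided
    return None, None            # no top-level attribute fired

print(classify({"legs", "snout", "head"}))
print(classify({"wheels"}))
```

One design consequence is that "snout" is never even consulted for objects with wheels, so second-level attribute classifiers only need to work within their own branch.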

  4. Why are attributes useful? I have two views about it:

    1. This one is actually from Abhinav: for attributes, we usually have a lot more data to train with. For example, we might have 1000 images of cars and 1000 images of yellow objects, yet only 10 images of yellow cars. Since data is essential for most machine learning algorithms, we can train two separate classifiers for "yellow" and "car" independently, and whenever they co-fire, we say that is a yellow car.

    2. Attributes often correspond to modifiers (adjectives, adverbs, etc.) in a language, while traditional categories are usually nouns (scenes, objects, etc.) or verbs (actions, etc.). The truth is, humans do not have a compact, semantically meaningful word to describe every object in the world, and whenever they don't have a proper phrase, they often resort to attributes: modifiers that can discriminate this particular thing from the others. So attributes are in fact very good at classification; they are born for that purpose. It is like a hash map built into the language as it evolves.

    However, I am not convinced that attributes should correspond to specific words in a language. I like the idea of unsupervised attribute discovery a lot, and I don't think language has exhausted the discriminative bits in our visual world.
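The first view above (train "yellow" and "car" separately, then say "yellow car" when both fire) can be sketched as a simple composition of scorers. The scoring functions, the dictionary keys and the 0.5 threshold are placeholders invented for this example; in practice each scorer would be a classifier trained on its own, larger dataset.

```python
def compose(*scorers, threshold=0.5):
    """Build a detector that fires only when every scorer fires."""
    def combined(x):
        return all(score(x) >= threshold for score in scorers)
    return combined

# Placeholder scorers; each stands in for an independently trained classifier.
def yellow_score(x):
    return x["yellowness"]

def car_score(x):
    return x["car_likeness"]

is_yellow_car = compose(yellow_score, car_score)

print(is_yellow_car({"yellowness": 0.9, "car_likeness": 0.8}))
print(is_yellow_car({"yellowness": 0.9, "car_likeness": 0.2}))
```

Note that this treats the two decisions as independent, which is exactly what lets each classifier use its 1000 images instead of the 10 joint examples.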

    1. I agree that the idea of unsupervised attributes is interesting and it might even be better for classification tasks. But for describing an object in a way that makes sense to humans, there is no guarantee that an attribute discovered through unsupervised means will make semantic sense. At least in this paper, I think the purpose is to understand objects and describe them as humans do, for which language is important.

    2. I really like this approach. Using attributes to describe objects moves away from a simple paradigm of putting objects in simple conceptual boxes and towards the Wittgensteinian notion of "Family Resemblances," where humans don't describe things in terms of categories per se but actually describe a set of objects with overlapping, non-universal characteristics (attributes). Consider the concept of "Game"; Basketball, Crysis, and Chess are all "games", but one would struggle to find a common feature underlying all of them.

    3. Interesting comment, Jacob. It appears that because it is so difficult to delineate exactly the boundary of categories (e.g. what are the support vectors for calling something a "game" and "not a game"), that perhaps humans approach the problem by formulating categories as sets of sets across attribute instances, and then categorical membership might look like a weighted set metric (I'm "soft", "small", and "have four legs", so I'm probably a "cat" or "dog", but you can say almost certainly I'm not a car)

      If so, reasoning about attributes appears generalizable to learning new categories as combinations of attributes, given that you can reason about categorical nonmembership with high accuracy. A semi-supervised version of this paper would be very cool.
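One way to read the "weighted set metric" suggested in this comment is as a weighted overlap between the observed attributes and each category's attribute set. The sketch below is just that reading, with invented categories and weights; it is not something the paper proposes.

```python
# Hypothetical per-category attribute weights (how strongly each attribute
# is associated with the category).
CATEGORY_ATTRS = {
    "cat": {"soft": 1.0, "small": 0.8, "four legs": 1.0},
    "dog": {"soft": 0.8, "small": 0.5, "four legs": 1.0},
    "car": {"metal": 1.0, "wheels": 1.0, "shiny": 0.6},
}

def membership(observed, weights):
    """Weight of the category's attributes that were observed, over total weight."""
    total = sum(weights.values())
    return sum(w for attr, w in weights.items() if attr in observed) / total

observed = {"soft", "small", "four legs"}
for category, weights in CATEGORY_ATTRS.items():
    print(category, membership(observed, weights))
```

With these toy numbers the observation scores equally for "cat" and "dog" and zero for "car", matching the comment's point that nonmembership can be much easier to decide than membership.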

    4. I agree with Jacob that this approach is very interesting. I think that it can bring new tasks to the area. I don't really see the point in the task of better recognizing a dozen objects (besides writing your thesis). But here the authors actually invent problems that can be solved, like finding non-typical attributes in an instance of a category. I think this alone is worth noticing :)

    5. The benefit of attributes is that they generalize easily across categories, while what is learned in traditional ways is often confined to a certain category. Each attribute partitions the whole feature space using a rather weak classifier, which simplifies the scenario and makes more data available. Along with other emerging techniques, it shows that the traditional category-based way of approaching object recognition may be hard.

    6. @Divya

      But what is the purpose of describing objects in a way that humans do? Is it just to make the output of the algorithm more comprehensible to humans? I am much more in favor of Nick's thinking (bit below) - that if the output of the algorithm is useful, then it's good.

      I like the idea of taking inspiration from how humans categorize objects (attributes), but maybe the specifics should be tuned to be most usable by machines.

      My impression was the purpose of this paper was to make object recognition more generalizable across categories.

  5. Describing Objects by their Attributes
    Ali Farhadi, Ian Endres, Derek Hoiem, David Forsyth

    This paper proposes an attribute-based object recognition framework. Objects are characterized by semantic and discriminative attributes, and a classifier built upon those attributes then names the object.

    Key points:
    1. Propose to use the attribute as an intermediate level representation for vision tasks.
    2. Across category generalization via within category prediction. Use feature selection to reduce variance during learning which enables learning with much less training data.
    3. Show some new visual functions provided by this mid-level representation.
    4. Provide insights about the dataset bias issue.

    For the above contributions, I give some comments point by point:
    I like the idea of constructing a mid-level representation connecting low-level features like color and gradient to the category labels of visual data, since it could possibly provide a compact yet informative representation which reduces the amount of training data we need to learn vision models. Using semantics of images as features seems to be an intuitive thing to do. However, the authors didn't mention how they learn to predict the attributes (whether through a linear or non-linear machine learning algorithm). If they learn this in a linear way, it seems a little bit problematic to me (here I refer to their method without feature selection, aka the "whole feature" version), since they are using a linear SVM, and building a linear classifier over another linear classifier does not make any sense. And this pretty much explains why they cannot get better performance (Table 1 "whole features"). The authors propose many new visual functions using this mid-level representation, even though some of them seem to be not that useful (personal view...). For example, instead of saying "there is some probability of a face occurring in a bus picture" using this method, why don't we just run both a face and a bus detector?
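The concern about stacking linear classifiers can be written out in one line. As a sketch, assume the attribute scores are passed on as raw linear outputs, with no thresholding or other non-linearity in between (an assumption; the paper may do otherwise). The symbols are introduced here for illustration only: $x$ is the base feature vector, $W, b$ the attribute classifiers, and $v, c$ the category classifier on top.

```latex
a = W x + b, \qquad
s = v^\top a + c = (W^\top v)^\top x + (v^\top b + c)
```

Under that assumption the category score $s$ is itself a linear function of $x$, so the stacked model cannot represent anything a single linear classifier on $x$ could not. Any difference in accuracy would then come from how the two stages are trained and regularized, or from a non-linearity applied to $a$, not from added expressive power.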

    Across category generalization via within category prediction is a nice idea since it helps to reduce the number of factors which correlate with the target attribute we want to learn, as shown with a statistical experiment. Feature selection via sparsity regularized regression is something people in the ML field have always been doing (e.g. Lasso). Doing this here makes a lot of sense since it greatly reduces the amount of training data we need by discarding some confusing features. In fact, this can be regarded as a denoising process in learning the models.

    Another very nice point about this paper is that the authors discuss the dataset bias issue at the end of the paper, which people usually ignore unintentionally.

    Generally, I like the basic idea presented by this paper, but I have concerns about the results. It seems that what really makes sense is to do feature selection in this mid-level representation space. The authors did not prove the superiority of the representation itself in terms of performance on the recognition task.

    1. I really agree with Fanyi that using a smaller but cleaner dataset is a kind of denoising process. To me, using such clean data makes more sense even when the dataset is small. When humans start to learn something, we generally learn from a picture or from looking at one or two objects. Although one could say that we capture thousands of pictures this way, the pictures captured from one or two objects are so similar to each other that they only serve to generalize the geometric information. So the big data that humans use may not be like the big data that computers use right now.

  6. The paper presents a very interesting and novel way to describe objects using semantic attributes. It is very intuitive because I'm guessing that is how people identify objects. But one point is very confusing to me. The authors say they are learning a "wheel" classifier. So I'm assuming they are learning a classifier for each semantic attribute. If that is the case, it is not very clear from the paper how they handle cases where there is no (or very little) data without a semantic attribute (say, a person without a head?).

    1. I think unsupervised discovery of useful attributes is more important. The attributes should also come from the data themselves, rather than necessarily being carefully designed from semantics. See the following work by the authors: Mohammad Rastegari, Ali Farhadi, and David Forsyth. Attribute discovery via predictable discriminative binary codes. In ECCV 2012.

  7. This paper proposes an interesting idea of using semantic
    representations as features. They pose the problem as that of
    inferring attributes for test data, rather than "class label
    association". This attribute association is challenging also because
    the number of attributes is greater than number of classes (hence more

    Ideas I liked
    - The way of learning the discriminative attributes via random
    splits. In cases where you don't know what to do, random helps!

    Would have liked to see
    - The authors don't say why their feature selection performs poorly
    across datasets (Fig 4). I am certain this is a dataset bias issue,
    which weakens the authors' claim of "selection of features" actually
    learning the semantics as opposed to some correlation (the wheel
    - How does this approach scale with number of attributes (# of
    - How were the reliable attribute classifiers determined? This would
    make sense if it was done via cross-validation.
    - Feature vs. attribute correlation. A simple experiment showing that
    say texture features help these attributes. It would be a nice
    insight into the way features work themselves.

    1. Also, I would have liked to see a confusion matrix. It helps me analyze things better.

  8. This paper presented a novel task and interesting approach for solving it. Attributes are very intuitive for humans, and thus an algorithm that reasons about them in images seems appealing. However, I wish the authors would've substantiated their claim that attributes are 'essential' for object detection. Perhaps attributes are how humans perform object recognition, but also, perhaps not, and even if so, why do we want our systems to do so as well? If the final output from an algorithm is useful (either to humans or to another algorithm), then why bother?

    The main benefit I see to this approach is the ability to name unknown classes. I think attributes can be fairly well characterized by features, and thus their underlying nature "has " or "is " aren't relevant for the reasoning within the algorithm itself. It would be interesting to try a lifelong learning approach that introspects every so often to attempt to iteratively cluster attributes as data is encountered, adding additional attribute categories as they correspond to new clusters. This would require very robust features, such that features of truly novel "attributes" are sufficiently discriminable from previous attributes.

    I think their idea can be much improved upon, the marginal benefit coming from (in order):

    1. segmented ground truth labeling (not bounding boxes)
    2. much more data
    3 (or maybe, 0?). a process of information sharing and iterative prediction across attribute classifiers

  9. I think this paper definitely presents an interesting new way of representing objects other than directly using features. It partially answers the problem I raised in the last class about the uneven importance of features in an object for classification.

    But the point I want to address here is this: for object categorization, the concept of "attribute" doesn't matter. The important thing about this paper is that it actually addresses the problem of "a set of different object classes sharing highly correlated features". These features on one hand confuse the discrimination between these classes, but on the other hand may help to distinguish these object classes from other object classes. The paper clearly recognizes this problem and emphasizes the problem of "correlation" many times.

    Given this fact, the strategy of using a "flat" classification scheme based on single features clearly would fail under many circumstances. The biggest problem with this kind of strategy is that it overlooks the inherent hierarchy and the potentially unbalanced levels of appearance/feature differences required to distinguish objects. These factors are essentially what make object recognition tasks subtle. A good example: a sheep actually looks very similar to a cow, as they both have four legs, which correspond to large vertical HOG responses at the bottom of the objects. It is relatively easy to distinguish them from non-animal categories such as buildings, but to further distinguish them more subtle features are needed.

    A single classification process clearly cannot achieve good performance in the above situation. A better strategy might be: 1. Accept the fact that certain object classes do share features. 2. Find discriminative features that generalize well in one or several categories (e.g., separate 4-legged animals from buildings and natural scenes). 3. Find discriminative features that further separate these categories (e.g., distinguish cows and sheep, or cats and dogs based on more subtle features).

    1. I like this paper and recognize the usefulness of having the intermediate representation of an attribute, especially when it comes to finding novel examples or examples with poor discrimination. However, if you look at the list of attributes in the plots, they include things like metal, shiny, etc, but also things like exhaust, engine, torso. These are things to me, not so much attributes, and they seem more suitable for a compositional representation so we can reason about their location on the object. I agree with Zhiding that something is lost in the way all these "attributes" are lumped together in the classifier.

  10. Another interesting perspective of attributes is that they are in a sense similar to dictionary words created by supervised learning.

    From a machine learning view, these attributes are like sparse dictionary components that represent each object in a very compact way. They are also like principal components, which reduce the dimension of the feature space and help to overcome the "curse of dimensionality".

    1. It is very much like a sparse dictionary. The feature selection process does some of that by using the sparsity regularizer.

    2. I agree, the problem with this dictionary (as shown in the experiments in the paper) is that it is hand-crafted and therefore not designed for maximizing the discrimination among different categories (especially for similar categories as in the example of sheep and cow).

  11. This paper classifies objects using an attribute feature set rather than the traditional image feature set. Using attributes helps the paper generalize to new categories with fewer visual examples. Attributes describe the shape, parts, or material properties of the object. The authors also include many discriminative attributes that help distinguish between similar classes.


    Seems like you can get an arbitrarily descriptive classification using this approach. For example, has wings + has beak indicates a bird; has wings + has beak + is blue indicates a bluejay.
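The bird / bluejay example can be sketched as picking the most specific class whose listed attributes are all present. The definitions below simply restate the example in this comment; the "most attributes wins" rule is an assumption made for the sketch, not how the paper assigns class names.

```python
# Class definitions as attribute sets, taken from the example above.
DEFINITIONS = {
    "bird": {"has wings", "has beak"},
    "bluejay": {"has wings", "has beak", "is blue"},
}

def most_specific(observed):
    """Return the matching class with the longest attribute list, or None."""
    matches = [name for name, required in DEFINITIONS.items() if required <= observed]
    return max(matches, key=lambda name: len(DEFINITIONS[name]), default=None)

print(most_specific({"has wings", "has beak"}))
print(most_specific({"has wings", "has beak", "is blue"}))
print(most_specific({"is blue"}))
```

Adding a more descriptive class is then just one more entry in the table, with no new images needed, which is the appeal of learning from textual descriptions.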

    They take a linguistic approach to classification, which is how humans describe their world, so it learns very much the same way a human would learn.

    Attempt to correct/generalize over dataset bias.


    I am not convinced by their argument for generalizing attributes across all categories. Human legs look completely different from bird legs and cat legs. Does it even make sense to group all of these together?

    The discriminative attributes seem like the major driving force for this algorithm, especially since there are 1000 of these. It's difficult to really comment on this since they never describe how they enumerate the semantic attributes or how many semantic attributes there are.

    1. My first thought was that discriminative attributes were going to be the major forces as well, but Table 1 argues against this. Actually it seems like a common trend in these papers: any one thing you do achieves X accuracy, anything more fancy that you add improves the accuracy by only a tiny fraction.

    2. I agree in that semantic attributes should have been discussed more thoroughly - and it would have been great (as Ishan pointed out) to identify common confusions. At the same time, I think the idea of discriminative attributes has more potential and should, ideally, reflect an increase in performance similar to one seen after feature selection in attribute training. Perhaps, the random split idea, or the size and quality of the training set is not sufficient for training discriminative attributes.

  12. The selection of features is a good idea, but it isn't explored why it was good to do it per class. To recap, they use an L1 regularizer to select which features to use on a per attribute per class basis. This seems like an 'incorrect' way to do it. It "should" be equally good (and arguably more 'correct') to let that selection come from the learning by doing an L1 regularized selection across all classes per attribute. If the argument is that metallic vs nonmetallic wheels show up in the car data, then they should also show up when pooled and the L1 regularizer should deal with that by trying for sparsity in the whole set. Their 'whole features' comparison doesn't tell the whole picture. It would be necessary to show L1 regularized selected features on [all classes] vs [per class + pooling] to prove that their novel feature selection scheme is better at generalization.
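The two protocols being contrasted here, [all classes] vs [per class + pooling], can be set side by side in a few lines. This is a minimal sketch only: the solver is a plain proximal-gradient (ISTA) L1-regularized logistic regression without a bias term, written from scratch for illustration rather than taken from the paper, and the data layout, `lam`, `step` and iteration count are all assumed values.

```python
import math

def soft_threshold(v, t):
    """Proximal operator of t*|v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def l1_logistic(X, y, lam, step=0.5, iters=200):
    """ISTA for mean logistic loss + lam * ||w||_1 (no bias term, for brevity)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad[j] += err * xi[j] / n
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad)]
    return w

def selected(w, eps=1e-9):
    """Indices of features with a non-zero weight."""
    return {j for j, wj in enumerate(w) if abs(wj) > eps}

def select_pooled(data_by_class, lam):
    """One L1 fit on all classes together."""
    X = [x for Xc, _ in data_by_class.values() for x in Xc]
    y = [t for _, yc in data_by_class.values() for t in yc]
    return selected(l1_logistic(X, y, lam))

def select_per_class(data_by_class, lam):
    """One L1 fit per class, then the union of the selected features."""
    chosen = set()
    for Xc, yc in data_by_class.values():
        chosen |= selected(l1_logistic(Xc, yc, lam))
    return chosen

# Toy data: feature 0 tracks the attribute, feature 1 is uncorrelated noise.
X = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
y = [1, 1, 0, 0]
toy = {"car": (X, y), "bus": (X, y)}
print(select_pooled(toy, lam=0.1), select_per_class(toy, lam=0.1))
```

On this toy data both protocols pick the same feature, so the sketch says nothing about which one generalizes better; that is precisely the comparison this comment asks the authors to run on real data.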

  13. I have a very whimsical mental experiment about the comparison between attributes and objects: whether or not attributes should serve as a mid-level representation for objects. The idea is actually stolen from Daniel's stacked hierarchical labeling machine. In each layer we train a classifier for every attribute & object, and then see what happens.

  14. I love the idea of sharing properties between objects using named or un-named attributes. This paper tries a few of the simplest and most intuitive experiments to demonstrate the idea. Even though there were some other works (like [10]) which were along the same lines, I think the simplicity of this paper made it so famous. Also for me, the idea of using attributes (because of their shared nature) for describing unseen objects is very inviting!

    How I see attributes (semantic and non-semantic, named or unnamed) is that they just give you multiple partitions of the same data. For each binary attribute, you can use images from all objects as either positive or negative, which makes them easy to learn. From that standpoint, it can be thought of as asking our recognition/classification system to predict multiple things (object class, scene, attribute etc.) and then combining these results together.. (yes, that reminds me of multi-task learning). Another paper on scene attributes shows this idea of partitioning the data pictorially.

    I like most of the paper, so not listing everything, but here are my few concerns:
    1. I agree with this discussion above -- why did they do just within class category prediction (cars with wheels and no wheels)? If they had labels for all the images, they could have easily trained all wheels and no wheels together. Since (still) there is metallic stuff occurring on the negative side (like cars and buses without visible wheels), the classifier could have de-correlated the wheel and metallic features. I think this should have been included to justify single-class learning.
    2. The discussion/comments on L1 sparsity by Arun..
    3. One likely experiment they could have tried was to prune out false positives of any standard detector (like DPM) by matching lists of attributes. :)
    4. I would have liked to see 3 more analysis experiments (even at the cost of some of the current ones)
    4.1 Which attributes are most confused with others? (Ishan mentioned this)
    4.2 Which features are helpful for which attribute? This would give us insights in choosing/designing features.
    4.3 Which are the easiest and most difficult attributes to learn? By extension, which are the most and least reliable attributes? By further extension, which are the most and least helpful attributes?
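
    Analyses 4.1 and 4.3 need nothing more than the ground-truth attribute labels and the classifier scores. A small sketch of how they could be computed, assuming an images-by-attributes label matrix `Y` and score matrix `S` (both names, and the two helper functions, are hypothetical and not from the paper):

```python
import numpy as np

def auc(y, s):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos, neg = s[y == 1], s[y == 0]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")                # undefined without both label values
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def rank_attributes(Y, S, names):
    """Attributes sorted from easiest to hardest by per-attribute AUC (4.3)."""
    scores = [auc(Y[:, j], S[:, j]) for j in range(Y.shape[1])]
    order = sorted(range(len(names)), key=lambda j: scores[j], reverse=True)
    return [(names[j], scores[j]) for j in order]

def cross_firing(Y, S, thresh=0.0):
    """M[i, j]: among images where attribute i is predicted, the fraction
    that truly have attribute j. Large off-diagonal entries hint at 4.1."""
    P = (S > thresh).astype(float)         # binarized predictions
    fires = np.maximum(P.sum(axis=0), 1)   # avoid dividing by zero
    return (P.T @ Y) / fires[:, None]

# Tiny made-up example with two attributes and four images.
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
S = np.array([[2.0, -1.0], [1.0, -2.0], [-1.0, 1.0], [-2.0, 0.5]])
ranking = rank_attributes(Y, S, ["has_wheel", "is_furry"])
M = cross_firing(Y, S)
```

    For 4.2 one would inspect the learned weights per attribute instead, which depends on how the features are laid out, so it is left out here.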

    Minor comments on writing:
    - To add to the \small{} in figure descriptions, the text in the graphs was too small to be easily read in a print-out.
    - The paper had a lot of typos (but hey, even my papers have those!)
    - I didn't like the closing argument regarding INRIA and PASCAL and pedestrian and person detection. Maybe I'm just being pedantic or nit-picking, but the INRIA dataset is meant just for pedestrians. Papers written for it only claim that they work on pedestrians. It's not a dataset of people in the wild. So yes, it is clear that pedestrian datasets are special: they have people standing/walking in an upright pose, as opposed to the random dancing, jumping and playing people in images from the wild... alright, I'll stop this now.

    Overall, I love the idea of attributes and how the authors presented it. For me, it is indeed a very exciting idea, and very critical to computer vision (recognition).

  15. I like the idea behind attributes and learning which features are useful for each attribute. Compared to something very coarse like full scene categorization, which I think is very poorly defined since scene categories start to overlap one another, I think attributes provide a good way to essentially group things that are similar in an image (and that probably have a fairly consistent feature representation).

    One experiment I thought was interesting was using attributes to learn from textual descriptions of images (since attributes naturally lend themselves to transferring text labels to images). However, I wish there was more discussion about what the attributes are, which features represent each attribute, and the similarity between attributes (does it affect recognition?).

  16. I really like this paper, even though it lacks a lot of technical detail and mathematical formulation and, as a lot of people have said, some of the methods hurt the results or make no sense.

    Using semantic attributes as mid-level features is a really good idea. The contribution of this paper is that the semantic attributes generalize both within and across categories, and can not only recognize known objects but also give a description of unknown objects, which means that the computer is learning the object structures. The feature selection method in the paper is also interesting. As M Aravindh said, it is sensible, and the intuition is really good. It is really hard to select features across all categories at once, which may decrease performance, but selecting features within categories and across some subset of categories really helps. However, the method for predicting attributes is not that clear and may not be right; that deserves further study.

  17. I like the idea of attributes. For the last paper, I mentioned in the blog a concern that the authors' algorithm wouldn't be easily generalizable to solving object recognition as a whole, since we would need to do so much hand labeling. I think this algorithm's ability to learn new objects makes it very promising. However, I really don't understand their experiment for "standard object recognition in new categories." It seems like they trained on classes in a-Pascal and tried to recognize classes in a-Yahoo, but there are no overlapping classes in the datasets, so I don't understand how they manage to recognize a centaur after only having trained on humans and horses. Unless they're simply naming attributes... Or maybe the point was that they could learn the new objects with far fewer training examples?
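
    If the "simply naming attributes" reading is right, a new class could in principle be named without any image of it, just by matching predicted attributes against a per-class attribute list. The sketch below is a guess at that idea, not the paper's procedure: the attribute names, the class signatures, and the nearest-signature rule are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical attribute vocabulary, learned on the training classes only.
ATTRIBUTES = ["has_head", "has_torso", "has_hooves", "has_wheels", "is_metallic"]

# Hypothetical signatures for unseen classes, e.g. written from a text description.
SIGNATURES = {
    "centaur": np.array([1, 1, 1, 0, 0], dtype=float),
    "bicycle": np.array([0, 0, 0, 1, 1], dtype=float),
    "statue":  np.array([1, 1, 0, 0, 1], dtype=float),
}

def name_from_attributes(attr_probs, signatures=SIGNATURES):
    """Return the class whose signature is closest (L2) to the predicted
    attribute probabilities for one image."""
    names = list(signatures)
    dists = [np.linalg.norm(attr_probs - signatures[n]) for n in names]
    return names[int(np.argmin(dists))]

# Attribute classifier outputs for one image (made-up numbers).
predicted = np.array([0.9, 0.8, 0.7, 0.1, 0.2])
label = name_from_attributes(predicted)
```

    Under this reading, the attribute classifiers trained on humans and horses supply "has_torso" and "has_hooves", and the centaur only has to be described, not seen. The few-training-examples reading would instead train a small classifier on top of the attribute predictions.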

    The idea of using attributes for object recognition reminds me a bit of the idea of using object recognition to do scene recognition. I remember there being a somewhat involved discussion here about whether we should do scene recognition through objects or as a whole. My impression was that most people thought we should do it as a whole, since object recognition was not robust, and it would take a lot of computation time to try to recognize all the objects in a scene. Interestingly, here it seems like most people are in support of this attributes-for-object-recognition methodology. Is it because we think attribute detection is more robust? Is it more computationally efficient than object recognition in a scene?