## Tuesday, October 8, 2013

J. Lim, R. Salakhutdinov and A. Torralba, Transfer Learning by Borrowing Examples for Multiclass Object Detection, NIPS, 2011.

and optionally:

K. Saenko, B. Kulis, M. Fritz and T. Darrell, Adapting Visual Category Models to New Domains, ECCV, 2010.

Ian Endres, Vivek Srikumar, Ming-wei Chang, and Derek Hoiem, Learning Shared Body Plans. CVPR, 2012.

1. The goal of this paper is to compensate for the lack of training data in some classes using data from other classes that have similar examples. They do this transfer on an example by example basis (reason about examples individually) but regularize at both the individual and class level. I think this group lasso regularization is helping avoid picking outlier examples that happened to occur on the right side of the decision boundary during the optimization. The paper extends this to include transformed examples and examples from other datasets. I'll post the pros and cons in separate posts to help start threads for more discussion.
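For reference, the kind of objective this summary describes can be sketched as below. The notation is an assumption made for illustration: it reuses the symbols that appear later in this thread ($w_i^c$, $w^* = 1 - w$, $\lambda_1$, $\lambda_2$) and may not match the paper's equations term for term.

$$
\min_{\beta^c,\; w^{*,c}} \;\; \sum_{i=1}^{n} \left(1 - w_i^{*,c}\right)\, \ell\!\left(\langle \beta^c, x_i \rangle,\; y_i^c\right) \;+\; \lambda\, R(\beta^c) \;+\; \lambda_1 \sum_{l} \sqrt{n_l}\, \left\lVert w^{*,c}_{(l)} \right\rVert_2 \;+\; \lambda_2 \left\lVert w^{*,c} \right\rVert_1
$$

Here $\beta^c$ is the detector for class $c$, $\ell$ is the per-example loss, $R$ is the usual regularizer on the detector, $w^{*,c}_{(l)}$ collects the entries of $w^{*,c}$ for the $n_l$ examples of class $l$, and $w_i^c = 1 - w_i^{*,c}$ is how much example $i$ is borrowed by class $c$. The $\ell_2$ term acts at the class level and the $\ell_1$ term at the individual-example level, which is the "regularize at both levels" point made above.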

2. Pros 1. They make minimal assumptions about the type of classifier or image features used. They only need the approach to be expressible in an empirical risk minimization form. The details need to be reworked as far as the transformations are concerned (they use the latent update of the DPM to pick the transformation parameters from a discrete set).

3. Pros 2. They are able to leverage multiple datasets in a way which beats naive concatenation of data. I think that this is really cool as it helps decentralize data collection. It gives further strength to their approach as it helps them argue that their method is of use even if people started collecting more data for all the categories. In other words, people should use their method when adding more data to a dataset rather than claim that adding data alone solves the problem.

1. I disagree that borrowing is the "right" way to aggregate datasets, but only because it's difficult to say what we really want from these detectors. The authors' experiment (Table 4) demonstrates that borrowing from another dataset improves testing performance on the original dataset. This is distinct from improving testing performance on both datasets, as it was mentioned multiple times that PASCAL and SUN09 have very different distributions.

This capability is definitely valuable, but I think it is a form of overfitting. I'm not saying that overfitting is bad, though. If you're training a detector for a tall robot, is there really a need for it to be able to recognize objects from low angles?

2. I think overfitting might look more like this in robots: the robot perfectly recognizes the 20 objects in the room from any angle, but is completely clueless if a very similar object is shown to it in a different room, or if we try to get it to learn a 21st object.
Coming back to the question of whether the algorithm is overfitting: if it is able to learn by borrowing examples from a dataset with a very different distribution and perform better on the original dataset, isn't that really an indication of its ability to generalize well?

3. I think they mentioned somewhere that they specifically want to keep the bias of a dataset, so the mechanism is not really like merging datasets, but more like extending some dataset with similar images of what it already has, that is, keeping the dataset peculiarities.

4. I agree with Srivatsan that the algorithm has the ability to generalize well and it doesn't look like it is overfitting. The authors are adding data in a very structured way that makes more intuitive sense than just throwing in hundreds of images into the dataset and hoping that it will improve the performance.

5. I believe an algorithm generalizes well when you can train it on a dataset and expect it to perform well on other datasets. This seems redundant considering we are using disjoint test and training sets, but the key point is that both sets have similar underlying distributions. The experiment shows that training on dataset A with borrowed examples from B gives a boost when evaluating on A. This doesn't mean that it will boost performance on B.

6. I think that the object detectors themselves are more general because they train on the borrowed images, which might not have the same peculiarities as the images in the original dataset. For the bookshelf example that the paper gives, it borrows from the shelf class, which doesn't have the peculiarity of having books (and therefore a lot of vertical gradients). Because they are merging data from another dataset, I think the borrowed images are going to help generalize the object detector.

7. The authors mention "we also note that borrowing similar examples tends to introduce some confusions between related object categories," and that "setting corresponding regularization parameters $\lambda_1$ and $\lambda_2$ to high enough values ... would amount to borrowing all examples, which would result in learning a 'generic' object detector." Summing this up, it seems like they are just purposefully getting confused between categories, but they do a good job of mixing together the right categories. I guess this is a little like when adults tell kids, "Zebra, it's like a horse, but with stripes."

For some reason though, this borrowing method does not sit well with me. Maybe it's because I'm thinking back to the scene classification paper where they transformed all images to be as similar to the test image as possible, and then picked the 1NN. In that paper, their transformations were more complex than simple translation and affine, and many extra objects were hallucinated into the various scenes to get the images to match. My feeling is that this algorithm might start morphing images like crazy, and then the borrowed images would look like the original class, but slightly different. The "slightly different" part might end up hurting performance in the end.

However, my worries are probably unnecessary, since the authors prove that they have high performance. I wonder what would happen if they allowed the extra transformations that the other paper I mentioned above used. I also wonder how much boost in performance this method offers when even more data is added in. My feeling is that this method, in general, will only offer a small boost in performance. Here, they're adding ~2%, and that seems small to me.

4. Cons 1. They have cited past work which has tried to share data from entire classes (not on a per example basis but an entire class at once). They have not compared with these approaches in their experiments. I think this comparison is important because of the massive increase in the number of parameters when dealing with individual examples. They will terribly overfit to data without heavy regularization.

Looks like there is more to it. They run only 1 iteration of their alternating minimization strategy to prevent adding too many examples. They also initialize the optimization in favor of not sharing. Are these tricks that save them from the perils of such a large parameter space? Is per-example the right way to go, or should we reason about groups of examples at a time? Will the per-example reasoning scale to large datasets with 100,000 categories?

1. I wonder why they limited their test to the top 100 classes of SUN09 by number of examples. This technique seems most useful in improving performance in the long tail of the dataset yet if you look at Figure 4b and 4c, they stop just as things are getting interesting. At what point does this plot get worse as the number of examples in a class continues to decrease? Is this a complexity issue, an overfitting issue on too few examples, what?

2. That also bugged me. Where are the results on the tail end? It's clear that there is still that classic 1/f curve there, and the less common classes still tend to have just a dozen or so examples. I can imagine they get nonsense at a certain point. I wonder where that point is.

-- Matt Klingensmith

5. Cons 2. This approach is adding entire examples into the train set for a given class. This is made worse by them thresholding W at the end. This will reduce the ability to discriminate between these sharing classes (at least in the case where no transformation is involved). Doesn't this defeat the purpose of hard mining? They test their approach and the baseline on a randomly sampled subset of negatives. This partially hides the loss in discrimination power, which might actually be significant.

The authors argue (without empirical results) that the confusion between similar looking classes is better than the existing confusion between totally unrelated objects. But we are solving one problem while aggravating another which we still don't know how to solve. Do we have to copy entire examples? Can't we solve both problems at the same time?

Figure 4(a) says that the problem is far from solved by example transfer alone. Is this the right way forward?

1. I would really have liked to see some statistics on the increase (and decrease) in confusion between classes. From the statistics collected on the PASCAL dataset (with DPM detectors) that we saw in class, it seems like confusion across visually similar looking classes is a more statistically significant problem than confusion between dissimilar classes. I think this point needed to be argued properly rather than mentioned in passing. Without these statistics, I'm not convinced that borrowing is a good idea in general, rather than a short-term hack we should use until we get bigger training sets for these problem classes.

2. I guess the approach was intended to be a short-term hack. Looks like they wanted a method to manage the few-training-images problem while doing their research and came up with this. And as it grew more mature, they added affine transformations and some ideas of where else it could be used :)

3. I think the approach is well formulated (not hacky). A natural dataset will always have few training images for most of the object classes.

But the broader question of example sharing vs parameter sharing or something else exists beyond the details of this paper. Are there reasons to favor example sharing over other approaches?

4. I think the authors' view is valid, in that confusion between similar looking classes is definitely tolerable (usually). Though it might be statistically more significant, I would expect that it isn't functionally important to distinguish between classes that are visually very close. In any case object recognition doesn't have to be solved in one stage - a second layer of fine grained classification can always be used if it is really necessary to distinguish between a large van and a small bus.

5. However, on the point of sharing of entire instances, I agree with Aravindh that it would be better to share mid-level representations. Sharing whole instances of object appearances doesn't seem to be really addressing the root of the problem, and I believe it is somewhere in the middle levels that things start looking similar/dissimilar. Ideally, if feature representations could be modeled as hierarchical layers (I have no clue how), we might expect to see things like 'attributes' pop out at different levels. In that scenario, similar looking objects can share representations at the lower levels, but might start to diverge as we move up the hierarchy of feature representations.

6. Confusion between similar classes seems to be a failing of semantic categories. It really is absurd to attempt to train a computer to differentiate chairs from couches. First, people can't do this consistently. Second, what would be the use of such a system?

7. I think there is value in differentiating similar looking objects. If I confuse a wolf for a dog then things can go terribly wrong. If I confuse a snake for a rope then life is going to be much worse. I'm not arguing in favour of differentiating every noun from every other noun, but some discrimination is required for a real system.

8. @Aravindh - Getting confused between large dogs and wolves or suspicious looking ropes and snakes is something that happens even to humans sometimes. Our perception usually resolves this by being cautious and assuming the worst (not that snakes and wolves can hurt robots though), which translates into associating attributes like 'dangerous' or 'poisonous'. This isn't really a problem of discriminative learning - it can be partially solved by resolving context and partially by associating a higher bias towards objects that are associated with critical attributes such as the ones mentioned above.

9. Whether we want to discriminate between similar categories is really a question of task. The distinction between sofa and armchair might be really important to a furniture-selling robot, whereas the difference between car and truck might not be as important.

10. I have a feeling that their proposed system could be more useful if it were applied to the attribute learning in the paper we read previously. Transferring entire object images from similar categories seems less convincing to me, but if somehow we can transfer similar attributes from other categories, the attribute classifier might perform better and overall help the object recognition. I think this is more similar to the way humans infer things from their experience.

11. I conjecture that humans are able to differentiate similar looking objects through logic and context (explicit reasoning in the prefrontal cortex in place of neurons firing in the ventral stream). But as they see more and more examples they integrate this differentiation capacity into the ventral stream itself. I think that discrimination is important but there are more ways to do it. The paper, I feel, is compromising on discrimination more than required though.

6. Cons 3. The authors have ignored, and therefore not compared with, approaches that use parameter sharing as a way to provide statistical strength for rare classes. The parameter sharing approaches often learn a mid-level representation that uses data from multiple classes, e.g. fully connected two-layer neural networks, deep neural networks, or the attribute-based approach from Tuesday's class. This parameter sharing technique is apparently very different from the regularization-based technique more commonly used in the papers cited by the authors. Both have the benefit of reducing the sample complexity: while this paper copies examples, the mid-level features untangle the object manifold and make the feature space more separable, leading to a reduced sample complexity.

1. Yes, but you cannot use a transformed image of category A to train models for category B. Something similar to your idea is mentioned in this paper (http://pub.ist.ac.at/~chl/papers/tommasi-accv2012.pdf), where they try to learn a shared parameter space and a dataset-specific parameter space, though this is about transferring examples across datasets to handle bias and not across different categories. But you can imagine doing something like that.

7. Comment 1. The authors fix w_i^c to be 1 for all examples i \in c. It will be interesting to see what the system does if these are allowed to change. The initialization can have these as 1 and the optimization could be allowed to change them. Can this discard bad training examples, or learn a fine-grained concept which is a subset of the training data? Can this be used to sequentially learn mixture models?

1. Wouldn't this just give you a degenerate solution where all the positive examples have their weights set to 0? Similarly, if you let the background example weights get set by the optimization, the risk over an empty set of examples is exactly 0.

2. The group lasso constraint on w* is trying to force everything to share from everything. The risk term, on the other hand, is trying to not share from anything at all. If we set w to 0, then w* becomes 1 and the regularization penalty will become huge. That won't be the optimum after removing the non-degeneracy constraints.
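The trade-off argued over in this exchange can be checked on a toy problem. The sketch below is only an illustration under stated assumptions: the classifier is held fixed (so the per-example losses are constants), the penalty is a sparse group lasso on w* = 1 - w, and the loss values, group sizes and λ values are made up for the example rather than taken from the paper.

```python
import numpy as np

def sparse_group_lasso(w_star, groups, lam1, lam2):
    """lam1 * sum_l sqrt(n_l) * ||w*_(l)||_2  +  lam2 * ||w*||_1."""
    group_term = sum(np.sqrt(len(idx)) * np.linalg.norm(w_star[idx]) for idx in groups)
    return lam1 * group_term + lam2 * np.abs(w_star).sum()

def objective(w, losses, groups, lam1=1.0, lam2=1.0):
    """Weighted risk plus the penalty on w* = 1 - w (classifier held fixed)."""
    w = np.asarray(w, dtype=float)
    risk = float(np.sum(w * losses))
    return risk + float(sparse_group_lasso(1.0 - w, groups, lam1, lam2))

# Hypothetical setup: examples 0-2 belong to the target class (low loss),
# examples 3-5 belong to another class (high loss under the target detector).
losses = np.array([0.2, 0.2, 0.2, 2.5, 2.5, 2.5])
groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]

degenerate = objective(np.zeros(6), losses, groups)                  # all weights 0
own_only = objective(np.array([1, 1, 1, 0, 0, 0]), losses, groups)   # no borrowing
borrow_all = objective(np.ones(6), losses, groups)                   # borrow everything

print(degenerate, own_only, borrow_all)
```

With these made-up numbers the three objectives come out at roughly 12.0, 6.6 and 8.1, so the all-zero weights have zero risk but the largest total cost. This depends on the scale of the losses relative to λ1 and λ2: if the losses were much larger than the penalties, the degenerate solution could win, which is the worry raised in the first reply.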

8. In this paper, the authors present a novel technique of *borrowing* examples from one class and using them in another class for the purpose of detection. To do this, they learn a set of weights for each class on each other class which determines how well features from one class transfer to another. They do this using a simple regression technique. Then, borrowed examples are transformed into the "canonical" 2D poses of the class in question, and a classifier is trained on these images instead.

Their results show much improvement over simpler detectors, especially for rare classes. They even show that they can borrow images from one dataset and use them in another.

I think the biggest strength of this paper is the idea of *scoring* individual examples of a class based on how similar they are to another. I think this leads to a deeper understanding of the underlying visual structure of the classes, and how classes relate to one another. The fact that they get more useful classification rates is just a happy byproduct of this knowledge.

I have some doubts about their transformations. Is aspect ratio + scale really good enough to represent useful transformations? Perhaps the technique could be extended to include the "label transformations" we looked at in class earlier.

-- Matt Klingensmith

9. The authors present their methodology for transfer learning between object classes. Its main difference from previous methods is that they learn both which classes and examples to borrow from. They formulate this learning process as learning the individual weights corresponding to "how much" to borrow an example.

The intuition and predictions of their methods are that information from other classes can help inform knowledge about the classes intended to be predicted. This doesn't strike me as particularly novel, and is more compensating for the fact that object detectors take a very narrow view of the visual world. They show augmentation of performance of object detectors using information from other classes.

It seems like this boils down to instead of training a "car" detector, they train e.g. a "hopefully car, maybe bus, maybe van" detector. They mix information from other classes, despite the fact that other classes are "names". I think a more principled approach of information sharing utilizes attributes or parts. Again, their approach is compensating for the fact that object detectors alone often don't quite make sense in the first place, as there is so much more inter-class reasoning and information sharing needed to perceive a scene with any modicum of success. The only compelling argument for sharing classes between object detectors (in a world with much larger CV datasets) is that from certain viewpoints, objects from different classes can look nearly identical. (side view of couch can look like a side view of a chair)

Their argument could've been supported if they included a matrix of weights across all classes. They write a verbal story that inevitably conforms to their own biased assumptions of what their method does, whereas a less biased full quantitative analysis would've been more convincing.

10. I tend to think the term "borrowing" in this paper could actually be called "semi-supervised learning". Regarding what I discussed for the last paper about different classes sharing visual similarity and the unbalanced inter-object-class similarity levels, this paper is one example that emphasizes and makes use of this point.

The pro of this paper is that it presents an object detection method that is able to exploit the power of other classes or even other datasets. In addition, they also introduce a deformation scheme for transformations to handle visual differences caused by viewpoint in an organized (parametric) way.

But my biggest concern about this paper is it is likely to further aggravate the already existing confusion between similar object classes. This is exactly the point that makes this paper much less convincing, at least to me.

1. I agree with Zhiding's concern. This paper only gives the average precision but doesn't show the recall or the confusion matrix. I strongly suspect that the recall is really bad, and if we compare confusion matrices, we might find that they get more confused between similar classes like sofa and chair. However, for most applications in robots this confusion is not that bad, because these objects serve the same function.

2. I second Zhiding. I think that this would destroy the distinction between animals such as cats and dogs. Also, how many similar objects really differ visually by something as simple as an affine transformation?

3. I think the main question to be answered here is what we are doing this for. Is it just for object detection in a functional sense, say for robots, in which case it might not be too important to discriminate between a sofa and a chair, or is it to actually understand the objects fully and name them? And as mentioned in previous comments, we can always have a second stage of classification for discrimination between similar classes. The proposed algorithm might be a good starting point for multi-class object detection by adding more data from similar classes. However, I do agree that the results section does not give all the details needed to trust this algorithm completely in terms of both performance and scalability.

4. I agree with Diva: why do we want to have a perfect detector if the data we have actually do not suffice to handle fine-grained classification? I believe the concern is important for some tasks, yet for other tasks, like finding a surface to sit on, why do we bother to do that kind of disambiguation? When people are jaywalking their only concern is whether a "moving object" is coming; who cares whether it is a Dodge Viper or a Toyota Corolla?

5. I would assume the transformation helps a bit there. But until we see the confusion matrix between classes with and without transformation, it is hard to say.

11. I invite everyone to also take a look at their previous CVPR 2011 paper "Learning to Share Visual Appearance for Multiclass Object Detection" (http://people.csail.mit.edu/torralba/publications/sharingCVPR2011.pdf). This is a somewhat related work where they introduced the idea of sharing across rigid classifier templates. More importantly, they learn a tree to organize hundreds of object categories. The tree structure defines how the sharing is carried out: the root node is global and is shared across all categories, the mid-level nodes are super-categories (animal, vehicle...) and the leaves are object categories. They also use a CRP (Chinese Restaurant Process) to learn a tree without having to specify the number of super-categories.

1. That is an example of sharing model parameters instead of sharing training examples across different classes. Actually there is another paper about sharing parameters by learning a discriminative basis over all model parameters with sparsity constraints:
http://www.cs.berkeley.edu/~rbg/papers/dsparselets.pdf

12. In this paper, the authors present a state-of-the-art algorithm for multi-class object detection based on a novel idea: "borrowing" examples. The authors not only present learning with borrowed examples but also give a method for transforming the borrowed examples. However, the transformation method, which I think is really interesting, is not clear to me.

The experiments look very good. But I think they should also show and compare the recall and confusion matrix, because it is necessary to see how much this algorithm confuses similar classes. Also, when they compare borrowing examples from other classes, it would be better for them to compare the algorithm with and without transformation as well.

13. This paper is quite interesting, in the sense that it tries to borrow examples from other categories to boost the performance. It actually confirms the idea of sharing between object categories: the ontology (list of categories we can recognize) is not flat, the classes out there are not independent. The classes can have a lot to share with each other. For example if you want to distinguish between cat and dog, between tiger and dog, and between cat and tiger, the weight it learned should be very different. And intuitively the weights learned from separating cats from dogs can be quite similar (or useful) for separating tigers from dogs. This is because animals have a tree-like taxonomy. The other interesting stuff is about the functionality: armchairs and sofa look similar because they want to serve a similar function, they utilized this fact, too. Neat.

My question about this paper is, at the end of the day, is it actually still training to detect object classes, or is it detecting something else? Like the armchair vs. sofa thing: isn't the detector detecting the physical property "a flat surface people can sit on"? Plus, the sharing across categories looks very familiar to me as "attributes". In this case, we do not have a very good name for the detector trained, but it is actually a good way to start an attribute discovery step given the object categories we already have. Maybe we can get a "furry" detector by starting with dog and borrowing examples from all the other animals with fur? How about starting with sheep and learning a "white and furry" detector?

2. I think the key difference between attributes and this method is that for identifying attributes, we need different sets of features (shape, color, texture, parts, etc.) whereas in this case, the focus is getting as many similar images as possible (with the same set of features). I agree that attributes (in some sense) try to generalize over multiple classes. But the attribute "red" need not correspond to similar looking objects (like the example of car and wine that Abhinav gave). And the focus of this paper is to try to get more examples from classes whose "shapes" are similar (since they use HOG features).

3. Well if it comes down to features, then I believe everything can be added to the classification problem here... Attributes can be also defined as shapes, like 'heart-shaped', 'round-shaped', 'star-shaped'.

14. When we mine for hard examples, we're looking for things that are misclassified (e.g. a chair wrongly called "couch") and then placing more weight on these examples and training again. We thought mining for hard examples was part of what worked in the DPM paper. This paper does what seems to be the opposite: when we misclassify a chair as "couch," we relabel that chair as a positive example of "couch" and train again. Can these two strategies be used in tandem? If they cannot be combined, it seems to say something disturbing: do X, improve performance; do NOT X, improve performance.

1. I think the approach proposed in this paper is sort of dual to hard mining of negative examples.

15. Object recognition has come a long way with the development of sophisticated features, but the features still don't seem to be good enough. I think the philosophy of this paper is more inclined towards accepting that your feature space might not allow a clean separation of highly similar classes. Given that, the best thing to do would be to recognize that borrowing examples from visually similar classes tends to make object detectors better, but to do it in a principled way rather than just clubbing classes together. Mining for hard negatives is useful only when negative examples occur very close to the (prospective) decision boundary, but can still be separated from the positive instances. It is really not the algorithm's fault that an example labeled 'chair' appears bang in the middle of a cluster of couches in the feature space. I believe that really smart algorithms are the ones which have some ability to reinterpret the human-given labels in some way so as to improve performance on some counts.

16. This paper proposes an interesting method which aims at borrowing training examples from neighbour classes. Here are things I like about this paper:
1. I like the idea of borrowing training examples from other classes for multi-class object detection, as the authors point out, there are few examples for certain classes due to the long-tail distribution.
2. The formulation seems to be intuitive and captures the trade-off between sharing and discriminativeness.

My concerns are two-fold, one for the sharing method and one for the experiments:
For the method itself, even though the proposed formulation seems intuitive, why did the authors terminate the optimization procedure after one iteration? I would like to see how performance changes with the number of iterations when optimizing this criterion. Also, the step of post-pruning the sharing weights seems crucial to me; a comparison with the version without pruning would be interesting as well. In the experiments section, the authors say that they binarize all weights obtained from the learning procedure without giving any reason, which confuses me a little: why not stick with the continuous weight values? From the perspective of experiments, it would be great to have some figures showing the confusion between the shared classes before and after the sharing, which I think would give us more insight into this sharing mechanism.

1. I agree with you on the binarization part. The entire formulation uses soft indicator variables, and then all of a sudden we have a hard binarization above threshold 0.6.
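To make the binarization concern concrete, here is a minimal sketch of what thresholding soft borrowing weights at 0.6 does. The function name, the example weights, and the choice of `>=` rather than `>` at the boundary are assumptions for illustration; only the 0.6 threshold comes from the discussion above.

```python
import numpy as np

def binarize_weights(w, threshold=0.6):
    """Map soft borrowing weights in [0, 1] to hard 0/1 indicators.

    Assumption: weights at or above the threshold are kept (set to 1),
    everything else is dropped (set to 0).
    """
    w = np.asarray(w, dtype=float)
    return (w >= threshold).astype(float)

# Hypothetical soft weights learned for four candidate examples.
soft = np.array([0.05, 0.59, 0.61, 0.95])
hard = binarize_weights(soft)
print(hard)
```

After thresholding, the examples at 0.59 and 0.61 end up on opposite sides even though the optimization scored them almost identically, and the example at 0.61 counts exactly as much as the one at 0.95. That is the information the continuous weights would have preserved.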

I would like to see the examples, especially from the truck, van, car category (+9% mAP). Is it an artifact of the dataset, or is their approach really that good?

17. I believe the method presented in this paper is an intuitive and interesting way to make the learning process more directed toward a specific task, where this task is represented by the original dataset.

An interesting implication of the results in this paper is that there is a tradeoff between generalizability and performance, perhaps more obviously stated as generalizability vs. specificity. If our problem was truly well-defined, why would using more data be less effective than using a selected set of data? There are probably some effects due to bad examples, but is that enough to create the strong trend shown in the paper?

18. This paper proposes an interesting idea of augmenting existing data from similar data, which could be very useful for helping to deal with the inherent dataset bias of almost all datasets out there. Unfortunately, as many other comments mentioned, they only evaluate on the top 100 most well represented categories in the SUN dataset, and while they show some improvement, it would be nice if they could also show some improvement on some of the poorly represented classes as well.

On a more philosophical note, I feel that this paper is fixing a problem that really is an artifact of poor representation of images (namely assigning discrete language-based labels to images for categorization). This is of course useful since it's hard for humans to interpret images without this type of labeling (but it still feels like it's fixing an artifact of discrete labeling).

19. I think this paper has a really intuitive way to go about the problem of object detection. The algorithm seems to perform well on the subset of 100 classes. But the scalability is definitely questionable.

As a lot of people have stated above, this paper does seem to contradict the concept of hard mining. But is that the goal of the method? To be able to identify every single object in the real world? If the goal is to overfit the real world data, then all that we have to do is take all possible images from the world, keep adding more as we keep seeing new objects and just use k-NN. But if the goal is to get some level of understanding about the real world with the limited data that we have, I think this is a very good method to try to choose what category you want to learn, what examples you want to choose and how much weight you should give to these examples. Since we don't have strong enough features to discriminate everything in the real world, it is okay to borrow examples and get one level of classification done, and this paper has given us a step forward towards this approach.

1. There was a previous paper on scene classification, where training scenes were transformed to look like the test image (some objects were hallucinated and whatnot), and then the authors used 1NN. I wonder how this algorithm would perform for object detection - morphing the images (including the bounding box), picking 1NN, and then getting a new estimate of the bounding box in the current image by looking at how the bounding box was morphed with the training image.

Side note: I think it's cool that the authors get the algorithm to pick which classes to borrow from, and that the classes it ends up borrowing from are similar to what we would intuit.

20. I like the idea of transfer learning for example transfer. I see that many people have concerns over transferring entire examples as opposed to more "useful parts of the example" like a mid-level representation. While that certainly seems like a good and more extensible idea, I think that for many simple cases, the entire example transfer is much simpler and more intuitive.

I particularly like Eqn(4). It shows that this approach tries to tighten all the class parameters by regularizing on an averaged model.

Things I would like to see
- I had to flip the paper once more to make sure that this was
correct. There are no baselines!!
- I would have liked to see an iterative borrowing approach. The
authors could justify the choice of not using one, by showing that