16-824: Learning-based Methods in Vision (F'13): Reading for 10/10

Tuesday, October 8, 2013

Reading for 10/10

J. Lim, R. Salakhutdinov and A. Torralba, Transfer Learning by Borrowing Examples for Multiclass Object Detection, NIPS, 2011.

and optionally:

K. Saenko, B. Kulis, M. Fritz and T. Darrell, Adapting Visual Category Models to New Domains, ECCV, 2010.

I. Endres, V. Srikumar, M.-W. Chang and D. Hoiem, Learning Shared Body Plans, CVPR, 2012.

55 comments:

M Aravindh, October 8, 2013 at 8:55 PM
The goal of this paper is to compensate for the lack of training data in some classes using data from other classes that have similar examples. They do this transfer on an example-by-example basis (reasoning about examples individually) but regularize at both the individual and class level. I think this group lasso regularization helps avoid picking outlier examples that happened to fall on the right side of the decision boundary during the optimization. The paper extends this to include transformed examples and examples from other datasets. I'll post the pros and cons in separate posts to help start threads for more discussion.
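To make the two-level regularization concrete, here is a minimal sketch of a sparse group lasso penalty on borrowing weights. It assumes one weight per borrowed example, grouped by source class; the function name, the dictionary layout and the two lambda values are hypothetical, and the paper's exact parameterization may differ.

```python
import math

def sparse_group_lasso(weights_by_class, lam_group, lam_example):
    """Penalty on the borrowing weights of ONE target class.

    weights_by_class maps each source class to the list of per-example
    borrowing weights taken from it (hypothetical layout, not the paper's).
    """
    # Class level: one L2 norm per source class (the group lasso term).
    group_term = sum(math.sqrt(sum(w * w for w in ws))
                     for ws in weights_by_class.values())
    # Example level: L1 norm over every individual weight.
    example_term = sum(abs(w) for ws in weights_by_class.values() for w in ws)
    return lam_group * group_term + lam_example * example_term

# Same total borrowed weight, arranged two ways (toy numbers).
concentrated = {"bus": [1.0, 1.0], "chair": [0.0, 0.0]}
spread = {"bus": [1.0, 0.0], "chair": [1.0, 0.0]}

cost_concentrated = sparse_group_lasso(concentrated, lam_group=1.0, lam_example=1.0)
cost_spread = sparse_group_lasso(spread, lam_group=1.0, lam_example=1.0)
print(cost_concentrated, cost_spread)
```

With these toy numbers, borrowing two examples from one class is cheaper than borrowing one example from each of two classes. That is the sense in which a lone example from an otherwise unused class is discouraged.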
M Aravindh, October 8, 2013 at 8:56 PM
Pros 1. They make minimal assumptions about the type of classifier or image features used. They only require that the approach can be written in an empirical risk minimization form. The details need to be reworked as far as the transformations are concerned (they use the latent update of the DPM to pick the transformation parameters from a discrete set).
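As an illustration of what "an empirical risk minimization form" buys, the sketch below computes a weighted regularized risk in which each example's loss is scaled by a borrowing weight. It assumes a linear scorer, a hinge loss and an L2 regularizer; these choices and all names are assumptions for illustration, since any loss of this shape would fit.

```python
def hinge(score, label):
    """Hinge loss for a label in {-1, +1}."""
    return max(0.0, 1.0 - label * score)

def weighted_risk(beta, examples, weights, lam):
    """Weighted ERM objective for one class (illustrative, not the paper's code).

    examples: list of (feature_vector, label) pairs
    weights:  per-example weight in [0, 1]; 1 for the class's own examples,
              smaller (or 0) for examples borrowed from other classes
    """
    data_term = 0.0
    for (features, label), w in zip(examples, weights):
        score = sum(b * x for b, x in zip(beta, features))
        data_term += w * hinge(score, label)
    reg_term = lam * sum(b * b for b in beta)
    return data_term + reg_term

beta = [1.0, 0.0]
examples = [([2.0, 0.0], +1),   # own-class example, correctly scored
            ([0.5, 0.0], +1)]   # borrowed example inside the margin
print(weighted_risk(beta, examples, weights=[1.0, 0.5], lam=0.0))
```

Setting a borrowed example's weight to 0 removes it from the objective entirely, which is why the classifier and features can be swapped without changing the borrowing machinery.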
M Aravindh, October 8, 2013 at 8:57 PM
Pros 2. They are able to leverage multiple datasets in a way which beats naive concatenation of data. I think that this is really cool as it helps decentralize data collection. It gives further strength to their approach as it helps them argue that their method is of use even if people started collecting more data for all the categories. In other words, people should use their method when adding more data to a dataset rather than claim that adding data alone solves the problem.
M Aravindh, October 8, 2013 at 8:58 PM
Cons 1. They have cited past work which has tried to share data from entire classes (not on a per-example basis but an entire class at once). They have not compared with these approaches in their experiments. I think this comparison is important because of the massive increase in the number of parameters when dealing with individual examples. They will terribly overfit to data without heavy regularization.

Looks like there is more to it. They run only 1 iteration of their alternating minimization strategy to prevent adding too many examples. They also initialize the optimization in favor of not sharing. Are these tricks that save them from the perils of such a large parameter space? Is per-example the right way to go, or should we reason about groups of examples at a time? Will the per-example reasoning scale to large datasets with 100,000 categories?
M Aravindh, October 8, 2013 at 8:59 PM
Cons 2. This approach is adding entire examples into the train set for a given class. This is made worse by them thresholding W at the end. This will reduce the ability to discriminate between these sharing classes (at least in the case where no transformation is involved). Doesn't this defeat the purpose of hard mining? They test their approach and the baseline on a randomly sampled subset of negatives. This partially hides the loss in discrimination power, which might actually be significant.

The authors argue (without empirical results) that the confusion between similar looking classes is better than the existing confusion between totally unrelated objects. But we are solving one problem while aggravating another which we still don't know how to solve. Do we have to copy entire examples? Can't we solve both problems at the same time?
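A tiny sketch of what thresholding the learned weights does, assuming the weights are continuous values in [0, 1]. The cutoff of 0.5 and the example weights are made up for illustration; the paper's actual threshold is not given here.

```python
def binarize(weights, threshold=0.5):
    """Turn continuous borrowing weights into hard borrow / don't-borrow flags.

    The threshold is a hypothetical value chosen for this example.
    """
    return [1 if w >= threshold else 0 for w in weights]

# Toy borrowing weights for four chair examples considered for "sofa".
learned = [0.51, 0.99, 0.49, 0.02]
hard = binarize(learned)
print(hard)
```

After thresholding, a barely accepted example (0.51) counts as a full positive exactly like a confidently accepted one (0.99), which is the loss of graded information the comment above is worried about.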

Figure 4(a) says that the problem is far from solved by example transfer alone. Is this the right way forward?
M Aravindh, October 8, 2013 at 9:00 PM
Cons 3. The authors have ignored and therefore not compared with approaches that use parameter sharing as a way to provide statistical strength for rare classes. The parameter sharing approaches often learn a mid-level representation that uses data from multiple classes, e.g., fully connected two-layer neural networks, deep neural networks, and the attribute-based approach from Tuesday's class. This parameter sharing technique is apparently very different from the regularization-based technique more commonly used in the papers cited by the authors. Both have the benefit of reducing the sample complexity: while this paper copies examples, the mid-level features untangle the object manifold and make the feature space more separable, leading to a reduced sample complexity.
M Aravindh, October 8, 2013 at 9:00 PM
Comment 1. The authors fix w_i^c to be 1 for all examples i \in c. It would be interesting to see what the system does if these are allowed to change. The initialization could set these to 1 and the optimization could then be allowed to change them. Could this discard bad training examples, or learn a fine-grained concept which is a subset of the training data? Could this be used to sequentially learn mixture models?
Anonymous, October 9, 2013 at 4:50 PM
In this paper, the authors present a novel technique of *borrowing* examples from one class and using them in another class for the purpose of detection. To do this, they learn a set of weights for each class on each other class which determines how well features from one class transfer to another. They do this using a simple regression technique. Then, borrowed examples are transformed into the "canonical" 2D poses of the class in question, and a classifier is trained on these images instead.

Their results show much improvement over simpler detectors, especially for rare classes. They even show that they can borrow images from one dataset and use them in another.

I think the biggest strength of this paper is the idea of *scoring* individual examples of a class based on how similar they are to another. I think this leads to a deeper understanding of the underlying visual structure of the classes, and how classes relate to one another. The fact that they get more useful classification rates is just a happy byproduct of this knowledge.

I have some doubts about their transformations. Is aspect ratio + scale really good enough to represent useful transformations? Perhaps the technique could be extended to include the "label transformations" we looked at in class earlier.
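For reference, picking a transformation from a discrete set of scale and aspect-ratio changes can be sketched as a simple search. The scoring function below is a hypothetical stand-in for a detector's response (it prefers a 2:1 box of height 50), and the candidate grid is made up; neither comes from the paper.

```python
from itertools import product

def transform_box(width, height, scale, aspect_change):
    """Rescale a box uniformly, then stretch its width by aspect_change."""
    return width * scale * aspect_change, height * scale

def best_transform(box, score_fn, scales, aspect_changes):
    """Return the (scale, aspect_change) pair whose transformed box scores highest."""
    width, height = box
    return max(product(scales, aspect_changes),
               key=lambda t: score_fn(*transform_box(width, height, *t)))

def toy_score(width, height):
    # Hypothetical detector stand-in: prefers aspect ratio 2.0 and height 50.
    return -abs(width / height - 2.0) - abs(height - 50.0) / 50.0

choice = best_transform((100.0, 100.0), toy_score,
                        scales=(0.5, 1.0), aspect_changes=(1.0, 1.5, 2.0))
print(choice)
```

Anything that cannot be expressed in the candidate grid (rotation, articulation, viewpoint change) is simply out of reach for this kind of search, which is the limitation the comment is pointing at.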

-- Matt Klingensmith
Unknown, October 9, 2013 at 7:53 PM
The authors present their methodology for transfer learning between object classes. Its main difference from previous methods is that they learn both which classes and which examples to borrow from. They formulate this learning process as learning the individual weights corresponding to "how much" to borrow an example.

The intuition behind their method is that information from other classes can inform knowledge about the classes being predicted. This doesn't strike me as particularly novel, and is more compensating for the fact that object detectors take a very narrow view of the visual world. They show that object detector performance improves when using information from other classes.

It seems like this boils down to, instead of training a "car" detector, training e.g. a "hopefully car, maybe bus, maybe van" detector. They mix information from other classes, despite the fact that other classes are just "names". I think a more principled approach to information sharing would use attributes or parts. Again, their approach is compensating for the fact that object detectors alone often don't quite make sense in the first place, as there is so much more inter-class reasoning and information sharing needed to perceive a scene with any modicum of success. The only compelling argument for sharing classes between object detectors (in a world with much larger CV datasets) is that from certain viewpoints, objects from different classes can look nearly identical (a side view of a couch can look like a side view of a chair).

Their argument could've been supported if they included a matrix of weights across all classes. They write a verbal story that inevitably conforms to their own biased assumptions of what their method does, whereas a less biased full quantitative analysis would've been more convincing.
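The class-by-class summary asked for here is cheap to compute once per-example weights exist. A minimal sketch, assuming the weights are available as (target class, source class, weight) records; that record layout and the toy values are hypothetical.

```python
from collections import defaultdict

def class_borrow_matrix(records):
    """Average per-example borrowing weight for every (target, source) class pair.

    records: iterable of (target_class, source_class, weight) tuples
             (hypothetical layout, not the paper's data format).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for target, source, weight in records:
        sums[(target, source)] += weight
        counts[(target, source)] += 1
    return {pair: sums[pair] / counts[pair] for pair in sums}

# Toy records, made up for illustration only.
records = [("sofa", "armchair", 1.0),
           ("sofa", "armchair", 0.0),
           ("sofa", "car", 0.0)]
matrix = class_borrow_matrix(records)
print(matrix)
```

Printing this matrix for all class pairs would show whether borrowing really follows visual similarity, instead of relying on a handful of hand-picked examples.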
Unknown, October 9, 2013 at 9:04 PM
I tend to think the "borrowing" in this paper could actually be called "semi-supervised learning". Regarding what I discussed for the last paper about different classes sharing visual similarity and the unbalanced levels of similarity between object classes, this paper is one example that emphasizes and makes use of this point.

The pro of this paper is that it presents an object detection method that is able to exploit the power of other classes or even other datasets. In addition, they introduce a deformation scheme that handles visual differences caused by viewpoint in an organized (parametric) way.

But my biggest concern about this paper is that it is likely to further aggravate the already existing confusion between similar object classes. This is exactly the point that makes this paper much less convincing, at least to me.
Unknown, October 9, 2013 at 10:32 PM
I invite everyone to also take a look at their previous CVPR 2011 paper "Learning to Share Visual Appearance for Multiclass Object Detection" (http://people.csail.mit.edu/torralba/publications/sharingCVPR2011.pdf). This is a somewhat related work where they introduced the idea of sharing across rigid classifier templates. More importantly, they learn a tree to organize hundreds of object categories. The tree structure defines how the sharing is carried out: the root node is global and shared across all categories, the mid-level nodes are super-categories (animal, vehicle...) and the leaves are object categories. They also use a CRP (Chinese Restaurant Process) to learn the tree without having to specify the number of super-categories.
Unknown, October 10, 2013 at 12:07 AM
In this paper, the authors present a state-of-the-art algorithm based on the novel idea of "borrowing" examples for multi-class object detection. The authors present not only learning with borrowed examples but also a method for borrowing transformed examples. However, the transformation method, which I think is really interesting, is not clear to me.

The experiments look very good. But I think they should also show and compare the recall and confusion matrix, because it is necessary to see how much this algorithm confuses similar classes. Also, when they evaluate borrowing examples from other classes, they should compare the algorithm with and without transformation as well.

Unknown, October 10, 2013 at 12:17 AM
This paper is quite interesting, in the sense that it tries to borrow examples from other categories to boost performance. It actually confirms the idea of sharing between object categories: the ontology (the list of categories we can recognize) is not flat, and the classes out there are not independent. The classes can have a lot to share with each other. For example, if you want to distinguish between cat and dog, between tiger and dog, and between cat and tiger, the weights learned should be very different. And intuitively, the weights learned from separating cats from dogs can be quite similar (or useful) for separating tigers from dogs. This is because animals have a tree-like taxonomy. The other interesting part is about functionality: armchairs and sofas look similar because they serve a similar function, and the authors exploit this fact, too. Neat.

My question about this paper is: at the end of the day, is it actually training to detect object classes any longer, or is it detecting something else? As in the armchair vs. sofa case, isn't the detector detecting the physical property "a flat surface people can sit on"? Plus, the sharing across categories looks very familiar to me as "attributes". In this case, we do not have a very good name for the detector being trained, but it is actually a good way to start an attribute discovery step given the object categories we already have. Maybe we can get a "furry" detector by starting with dog and borrowing examples from all the other animals with fur? How about starting with sheep and learning a "white and furry" detector?
Mike McCann, October 10, 2013 at 5:18 AM
When we mine for hard examples, we're looking for things that are misclassified (e.g. a chair wrongly called "couch") and then placing more weight on these examples and training again. We thought mining for hard examples was part of what worked in the DPM paper. This paper does what seems to be the opposite: when we misclassify a chair as "couch," we relabel that chair as a positive example of "couch" and train again. Can these two strategies be used in tandem? If they cannot be combined, it seems to say something disturbing: do X, improve performance; do NOT X, improve performance.
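One way to picture the two strategies side by side is to route each high-scoring non-target example to exactly one of two pools. The sketch below is only an illustration of that bookkeeping, not something the paper proposes: the borrow threshold is made up, and the margin of -1 is the usual SVM convention for a violating negative.

```python
def split_for_retraining(candidates, borrow_weight, margin=-1.0, borrow_threshold=0.5):
    """Route non-target examples to 'borrowed positives' or 'hard negatives'.

    candidates:    list of (example_id, detector_score) from other classes/background
    borrow_weight: dict example_id -> learned borrowing weight (missing means 0)
    All thresholds are hypothetical values for illustration.
    """
    borrowed, hard_negatives = [], []
    for example_id, score in candidates:
        if borrow_weight.get(example_id, 0.0) >= borrow_threshold:
            borrowed.append(example_id)          # relabeled as a positive
        elif score > margin:
            hard_negatives.append(example_id)    # violates the margin, stays negative
    return borrowed, hard_negatives

candidates = [("chair_1", 0.8), ("tree_4", 0.3), ("sky_2", -2.0)]
borrowed, hard_negatives = split_for_retraining(candidates, {"chair_1": 0.9})
print(borrowed, hard_negatives)
```

Under this bookkeeping the two strategies do not contradict each other: they apply to disjoint sets of examples, and the open question is how the borrowing weights decide which set a confusing chair falls into.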
Srivatsan Varadharajan, October 10, 2013 at 7:26 AM
Object recognition has come a long way with the development of sophisticated features, but the features still don't seem to be good enough. I think the philosophy of this paper is more inclined towards accepting that your feature space might not allow a clean separation of highly similar classes. Given that, the best thing to do would be to recognize that borrowing examples from visually similar classes tends to make object detectors better, and to do so in a principled way rather than just clubbing classes together. Mining for hard negatives is useful only when negative examples occur very close to the (prospective) decision boundary, but can still be separated from the positive instances. It is really not the algorithm's fault that an example labeled 'chair' appears bang in the middle of a cluster of couches in the feature space. I believe that really smart algorithms are the ones which have some ability to reinterpret the human-given labels in some way so as to improve performance on some counts.
Unknown, October 10, 2013 at 7:55 AM
This paper proposes an interesting method which aims at borrowing training examples from neighbour classes. Here are things I like about this paper:
1. I like the idea of borrowing training examples from other classes for multi-class object detection. As the authors point out, there are few examples for certain classes due to the long-tail distribution.
2. The formulation seems to be intuitive and captures the trade-off between sharing and discriminativeness.

My concerns are two-fold, one for the sharing method and one for the experiments:
For the method itself, even though the proposed formulation seems intuitive, why did the authors terminate the optimization procedure after one iteration? I would like to see how performance changes with the number of iterations when optimizing this criterion. Also, the step of post-pruning the sharing weights seems crucial to me, so a comparison with the version without pruning would be interesting as well. In the experiments section, the authors say that they binarize all weights obtained from the learning procedure without giving any reason, which confuses me a little: why not stick with the continuous weight values? From the perspective of experiments, it would be great to have some figures showing the confusion between the shared classes before and after the sharing, which I think would give us more insight into this sharing mechanism.
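The before/after confusion comparison requested here could be computed along these lines. The labels below are toy values invented for the example and are not results from the paper; the function names are likewise hypothetical.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) label pairs."""
    return Counter(zip(true_labels, predicted_labels))

def pair_confusion(counts, class_a, class_b):
    """Number of times class_a and class_b are mistaken for each other."""
    return counts[(class_a, class_b)] + counts[(class_b, class_a)]

# Toy labels, invented for illustration only.
truth = ["sofa", "sofa", "armchair", "armchair"]
pred_without_sharing = ["sofa", "sofa", "armchair", "sofa"]
pred_with_sharing = ["armchair", "sofa", "sofa", "sofa"]

before = pair_confusion(confusion_counts(truth, pred_without_sharing), "sofa", "armchair")
after = pair_confusion(confusion_counts(truth, pred_with_sharing), "sofa", "armchair")
print(before, after)
```

Reporting this number for every pair of classes that share examples, with and without borrowing, would directly quantify how much discrimination between similar classes is traded away.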
Humphrey Hu, October 10, 2013 at 8:23 AM
I believe the method presented in this paper is an intuitive and interesting way to make the learning process more directed toward a specific task, where this task is represented by the original dataset.

An interesting implication of the results in this paper is that there is a tradeoff between generalizability and performance, perhaps more obviously stated as generalizability vs. specificity. If our problem was truly well-defined, why would using more data be less effective than using a selected set of data? There are probably some effects due to bad examples, but is that enough to create the strong trend shown in the paper?
Unknown, October 10, 2013 at 8:54 AM
This paper proposes an interesting idea of augmenting existing data with similar data, which could be very useful for dealing with the inherent dataset bias of almost all datasets out there. Unfortunately, as many other comments mentioned, they only evaluate on the 100 best-represented categories in the SUN dataset, and while they show some improvement, it would be nice if they could also show some improvement on some of the poorly represented classes as well.

On a more philosophical note, I feel that this paper is fixing a problem that really is an artifact of a poor representation of images (namely, assigning discrete language-based labels to images for categorization). This is of course useful, since it's hard for humans to interpret images without this type of labeling (but it still feels like it's fixing an artifact of discrete labeling).
Divya Hariharan, October 10, 2013 at 9:06 AM
I think this paper has a really intuitive way to go about the problem of object detection. The algorithm seems to perform well on the subset of 100 classes. But the scalability is definitely questionable.

As a lot of people have stated above, this paper does seem to contradict the concept of hard mining. But what is the goal of the method? To be able to identify every single object in the real world? If the goal is to overfit the real-world data, then all we have to do is take all possible images from the world, keep adding more as we keep seeing new objects, and just use k-NN. But if the goal is to get some level of understanding about the real world with the limited data that we have, I think this is a very good method for choosing what category you want to learn, what examples you want to use and how much weight you should give to these examples. Since we don't have strong enough features to discriminate everything in the real world, it is okay to borrow examples and get one level of classification done, and this paper has given us a step forward in this direction.
Ishan, October 10, 2013 at 9:40 AM
I like the idea of transfer learning for example transfer. I see that
many people have concerns over transferring entire examples as opposed
to more "useful parts of the example" like a mid-level
representation. While that certainly seems like a good and more
extensible idea, I think that for many simple cases, transferring
entire examples is much simpler and more intuitive.

I particularly like Eqn(4). It shows that this approach tries to
tighten all the class parameters by regularizing on an averaged model.

Things I would like to see
- I had to flip the paper once more to make sure that this was
correct. There are no baselines!!
- I would have liked to see an iterative borrowing approach. The
authors could justify the choice of not using one, by showing that
later iterations do not add significantly (instead of a footnote on
Pg 4)
- I think transfer learning works best for fine-grained categories
which are visually very similar, but "categorized" due to linguistic
representation. If this is correct, then the authors should have
focused on more fine-grained categories.
- I also wonder how this approach would scale to something like
ImageNet. Right now they deal with rigid objects. If we were to
extend this to say species of dogs, then we could see if we can have
an intra-species borrowing.