16-824: Learning-based Methods in Vision (F'13): Reading for 11/19

Friday, November 15, 2013

Reading for 11/19

Krizhevsky, A., Sutskever, I. and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012

And optinoally:

Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, K. Chen, G.S. Corrado, J. Dean, A.Y. Ng. Building high-level features using large scale unsupervised learning.. ICML, 2012.

We will also cover the following paper, but we cannot yet post links for download (ask us if you would like a copy):

C. Szegedy, A. Toshev, D. Erhan Deep Neural Networks for Object Detection, NIPS 2013 (in press)

37 comments:

UnknownNovember 17, 2013 at 8:41 PM
The paper presents a neural network that is trained to classify 1.2 million ImageNet images into 1000 classes. The presented algorithm performs extremely well ILSVRC-2012 competition by far passing the nearest competitor. In ILSVRC-2010 the method shows top-5 error 17.5% (vs 28.2% by the 2nd result). In ILSVRC-2012 - 15.3% (vs 26.2 by the 2nd result).
Besides describing the general architecture (of 5 convolution layers + 3 fully connected layers) the authors describe several “tricks” that enable them to enhance the performance. For example, in order to reduce overfitting they introducing multiple images, similar to every one provided. In particular, every image is translated, horizontally flipped, and its brightness and color histogram are changed.
Overall, the created neural network definitely makes a great contribution, reducing the recognition error almost on half. Moreover, as pointed out in the paper, some images may have multiple labels, so zero-error is not achievable by definition.
The things I liked about the paper:
- overfitting reduction techniques. Though not they are not new, I didn’t read about them before
- evaluation of contribution of every trick into the result
- class named “dead man’s fingers”, before I looked it up in Wikipedia (Fig.4, left, 2nd image in 2nd row)
The things to clarify:
- How was general architecture chosen? Yes, reducing the number of convolution layers affects the performance. What about adding layers? What is the rationale behind choosing the number of kernels for every layer? That is, if I want to build my own neural network how should I start?
- They say that for reducing overfitting the size of training set is multiplied by the factor of 2048 via random image translation. Why do they care about every factor of 2 in other sections.
- Failure cases are not really failure – they only illustrate “almost hits”
Besides, I don’t know if it is good or bad - I feel that this paper is moving away from computer vision. There are not many decisions and tricks specific to images. Many of the tricks concern about computational time or splitting work into 2 GPU units. In general, I would describe this paper as presenting building and tuning a neural network, with the example application in computer vision.
ReplyDelete
Replies
UnknownNovember 18, 2013 at 1:45 PM
Deep learning is a really hot topic these days. The most important message that it delivers is that "it works", although it is still yet to know "why it works". This paper talks about motivation (since current feature is manually designed but not learned from the data) quite clearly, but only mentions how this architecture is selected and why it is selected in a slippery manner. It seems like according to a paper from Berkeley, deep learning is also finding the discriminative regions like the what the mid-level patches are doing. That might be one intuitive explanation. On the other hand, I believe the patches discovered with Carl's paper are more explainable and discriminative, so can we also build fully connected network (possibly with max-polling) as an easy experiment to demonstrate its performance?
ReplyDelete
Replies
JackieNovember 18, 2013 at 3:28 PM
In this paper, the authors present a deep convolutional neural network (CNN) for large-scale image classification problem. They target to classify 1.2M images in the ImageNet LSVRC-2010 contest into 1000 classes. Specifically, they propose an eight layers CNN with five convolutional and three fully-connected layers. The proposed CNN is designed to perform on 2 GPU system where is neurons after the second layer are splitted and running independently on each GPU. To mimic human perception system, the proposed CNN incorporates local response normalization and overlapping pooling. In addition, the authors also proposed several tricks to reduce overfitting problem. They demonstrated significantly improvement over the testing dataset of LSVRC dataset.

Deep learning was getting lots attentions recently on various applications. The central focuses in this field is to build intelligent systems that can extract high-level representations from sensory data. Such system require multiple layers design and non-linear processing. I like this paper that it proposed a good architecture in which the authors tried to design processes that mimic human perception system.

Concerns:

1. Size of the input images. In the proposed method, they normalized the high-resolution images from the dataset into smaller one of 256x256, and if the image is not square, only the central 256x256 patch is used). This normalization undoubtedly discard some information contained in original images. How about using zero padding? or even adding a summarizing layer which dynamically fit the size of the input image and generate fixed size of summary for the following layers.

2. The proposed system worked directly on the RGB color space. Although they augmented the data by appending the result from the PCA on the set of RGB pixel values, it is known that human visual system doesn’t really working directly on RGB system. For example, under the low light condition, human vision only processes grayscale images and maintained similar performance as with color. It would be very interesting to see how this could be added into the system and how it would perform.
ReplyDelete
Replies
IshanNovember 18, 2013 at 8:44 PM
I like the idea of using deep nets for feature learning. I think these data-driven features are the way to intuitively discriminatively summarize images.

However, I am always concerned by the fact that no clear analysis of the parameters are presented. How do you choose the architecture? How do you choose the number of kernels at each layer? I feel deep nets have so many non-intuitive parameters - normalization constants, non-linearity hyperparameters, etc.
So is the way to "learn features" to move from hand-tuned features (some parameters of which we understand) to non-intuitive parameters?

Recent papers, "Segmentation Driven Object Detection" show a shift in object detection to using a plethora of features on "proposal" windows. Is this the way to go?
ReplyDelete
Replies
Humphrey HuNovember 18, 2013 at 9:14 PM
Many folks seem to be concerned about the mystical nature of neural net engineering. Indeed, it does seem like a dark art, but given that the net takes 5-6 days to train, I think this is forgivable. The authors do mention that removing layers decreases their performance, but without some way of making this faster (perhaps with specialized hardware), they can't be expected to generate enough data to do any meaningful analysis. Another way to look at it is, despite our limited understanding of neural nets, it is still possible to produce results that rival better understood methods, suggesting that this is worth further investigation.
ReplyDelete
Replies
UnknownNovember 18, 2013 at 9:39 PM
In this paper, the author reported their recent work about large scale object recognition on Image-Net, using convolutional neural network.

In a sense, this paper is mostly focused on system and architecture level. Although generally people still have a lot of questions about the underlying reasons why deep learning work so well, we could still find some evidence.

I believe one of the reasons is the way raw features are regularized and max-pooled layer by layer. It is essentially a joint discriminative feature learning and dimensionality reduction process. This makes the learned features highly effective.

On top of this, I saw an interesting point of view from Kai Yu's Weibo. During his discussion with Geoff, Geoff mentioned that what finally will be left by Deep Neural Net is the idea of layered feature construction. And Kai generalized this information as: The specific form of Deep Learning is not important, what ultimately matters is the idea of deep models with layered structures. This will be the long lasting value of deep neural nets.
ReplyDelete
Replies
ArunNovember 18, 2013 at 9:45 PM
The paper presented good results (state of the art), but did not seem to contribute much theoretically regarding neural networks. However, the paper plays an important role in exhibiting the power of modern computation in regards to neural networks. The authors successfully create a system using two ~$500 GPUs (not too expensive).
The authors also detail how their engineering decisions improved test time results, which gives us a little bit of insight into what helped. Some comments on these:
- Anyone who has naively implemented an neural net has likely been bitten by the saturation of output=tanh(x) ( I think 5 hrs debugging was my personal record on this ). The ReLU (just a hinge function / rectifier ) does a good job on this
- The dropout technique is cool. I think I recently heard about it and it's nice to hear that it works well.

Mostly, this was a systems paper that showed good results, but it would have been awesome to see them try something crazy on their system that would really teach us something new about neural nets (whether you hate or love them)!
ReplyDelete
Replies
Divya HariharanNovember 18, 2013 at 10:14 PM
The paper showed nice and interesting results of deep nets in object recognition. Personally, I really do not get the intuition behind how deep nets work. It is certainly disturbing that non-intuitive methods for feature learning perform better than the ones that make logical sense (to the human brain). We do not know what the deep nets are actually learning. Since it seems to perform well on most of the datasets that we currently have, a lot of people are interested in it. But without having atleast a vague idea of why it is behaving the way it does, I'm not too enthusiastic about using it. And the paper certainly does not address the intuition behind deep nets.
ReplyDelete
Replies
UnknownNovember 18, 2013 at 11:22 PM
I thought this paper was nicely written and explained the general concepts quite well. Like everyone else, I have zero intuition for what's going on in the neural net itself, and how the authors decided on the structure, parameters, etc. It seemed to me that there were quite a bit of "engineering" (hacking?) involved to make it work - data augmentation, dropout, local response normalization, overlapping pooling, number of convolution layers and fully connected layers, number of nodes...

I'm also unclear about what a "convolutional" layer is - I assume this will be explained tomorrow.

If performance decreases when removing layers, does that mean it would increase when adding layers? Can we try even bigger neural nets? Is there something so terrible about a neural net taking one month to train if it works really well? Do we need more data? Can it handle weakly labeled web images?
ReplyDelete
Replies
Priya DeoNovember 18, 2013 at 11:59 PM
I was really unhappy about the fact that they took the center crop of the image. Websites like Facebook have worked to make slightly more intelligent cropping algorithms without being computationally expensive. And even if center-cropping was the only reasonable option, the authors later go on to use 224x244 patches sampled from the cropped center image. I'm not entirely sure why they didnt just use 256x256 patches sampled from the original image, so they did not lose out on this information completely.

I'm not sure if I just missed this, but how did they figure out which kernels go on which GPU? Seems like magic.

Also, a minor nitpick, but I wish the authors would quantify how much overfitting was occurring. They frequently mention that such and such change (e.g. Dropout and pooling) reduce overfitting, but never quantify how much. It would be nice, especially for something like Dropout, if you wanted to change the tradeoff between overfitting and training time.
ReplyDelete
Replies
Priya DeoNovember 19, 2013 at 12:02 AM
This comment has been removed by the author.
ReplyDelete
Replies
Srivatsan VaradharajanNovember 19, 2013 at 5:32 AM
The deep network this paper talks about seems to have been designed extremely carefully - there are so many details in the design! As people have pointed out already, there are a lot of parameters to be set and many of the design decisions taken by the authors have reasoning along the lines of "we found that .... works best", "we saw that ... is necessary" or "we followed .... heuristic". It is difficult not to wonder if deep learning is just about moving complex hand-tuning from the feature design to the system design - I guess this is what Ishan has also pointed out. However, in spite of skepticism about deep learning, it does seem to be actually used in production in some places very recently. Most notably the new photo search feature on Google+ uses Prof. Hinton's work (http://techcrunch.com/2013/06/12/how-googles-acquisition-of-dnnresearch-allowed-it-to-build-its-impressive-google-photo-search-in-6-months/).
ReplyDelete
Replies
UnknownNovember 19, 2013 at 5:56 AM
This paper was very interesting, as the scale of the problem and the scale and the complexity of the solution are all quite large. The authors present many different techniques for combating overfitting, yet it's unfortunate that they did not compare results of the entire procedure in an ablative manner. This data would probably help us understand a bit more about what's important to the training process. A minor concern of mine is 10 region softmax at test time.

My overall feeling is that the results are impressive for this problem formulation . However, I think for slightly more complex tasks (scene parsing, multi object detection, etc.) it's not immediately clear how this approach can be adapted. Again I would point out that images are collections of full and partial views of objects, and the assumption of this task is that these collections are of size 1
ReplyDelete
Replies
Yuxiong WangNovember 19, 2013 at 5:57 AM
I think the magic of deep networks lies in that it incorporates the human high-level prior into the model itself. Unlike the previous feature extraction schemes widely accepted in vision community such as SIFT and HOG which enforce the priors directly into the design of features, deep learning allows the model to learn useful features automatically. They demonstrate different thinking of dealing with prior informations.
ReplyDelete
Replies
GauravNovember 19, 2013 at 8:47 AM
Leveraging large datasets using deep learning seems to be a hot topic and this paper provides a good view of a complete deep learning system. The authors basically combine work done in different sub-areas like ReLU non-linearity, Data augmentation and dropout for their particular task. I was impressed by the detail with which they describe their CNN.

Also an interesting engineering challenge was the use of two GPUs and the authors address this successfully. I like how the authors use simple techniques like augmentation and dropout to prevent overfitting.

The qualitative results look good though I'm disappointed that the authors don't show images where their system fails completely and their failure cases are all near misses.
ReplyDelete
Replies
AnonymousNovember 19, 2013 at 9:52 AM
This seminal work on deep learning explores the use of extremely complex convolutional neural networks to classify large datasets. Because of the computational complexity of training large neural networks, the main challenge is efficiently implementing the network in a way which can be easily parallelized on GPUs. To that end, this paper is largely about implementation, and contains very little theory. The authors provide reasonable techniques for representing the (60 million parameter) network on a GPU, and detail several other implementation techniques which further reduce the computational overhead of training the network.

Throughout this class, the main idea was that even "simple" algorithms, combined with "big data," result in very high performance. Generally, the underlying classifiers and features are very simple, and the data is what drives the performance of the algorithm. This paper bucks the trend by creating an extraordinarily complex model which is learned from the bottom up. The main concern is that as the model complexity increases, paradoxically the performance of the algorithm tends to go down due to overfitting. Another big concern is that as the complexity of the model increases, the ease of *testing* and *re-using* the model decreases dramatically.

The authors address the problem of overfitting through a series of what I would consider, quite frankly, "hacks". They force scale, translation, and mirror invariance by feeding in synthetic data to the network which contains those transformations.

Nevertheless, they manage to perform much better than state of the art image classifiers. Their qualatative results are also compelling.

-- Matt K
ReplyDelete
Replies
AnonymousNovember 19, 2013 at 9:53 AM
It was also incredibly interesting that their first layer learned what essentially resembles texton features!

-- Matt K
ReplyDelete
Replies
UnknownNovember 19, 2013 at 10:36 AM
It is a really interesting and famous paper. The deep learning is a really hot area and state-of-art method on computer vision. However, there are so many parameters in deep learning. It is really necessary to study the parameter systematicly.
ReplyDelete
Replies

Add comment