Krizhevsky, A., Sutskever, I. and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012
And optionally:
Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, K. Chen, G.S. Corrado, J. Dean, A.Y. Ng. Building high-level features using large scale unsupervised learning. ICML, 2012.
We will also cover the following paper, but we cannot yet post links for download (ask us if you would like a copy):
C. Szegedy, A. Toshev, D. Erhan. Deep Neural Networks for Object Detection. NIPS 2013 (in press)
The paper presents a neural network trained to classify 1.2 million ImageNet images into 1000 classes. The algorithm performs extremely well in the ILSVRC-2012 competition, far surpassing the nearest competitor. On ILSVRC-2010 the method achieves a top-5 error of 17.0% (vs 28.2% for the 2nd-best result); on ILSVRC-2012, 15.3% (vs 26.2% for the 2nd-best result).
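For anyone unsure what "top-5 error" means here, a minimal sketch: a prediction counts as correct if the true label is among the five highest-scoring classes. The function name and the array layout (one row of class scores per image) are my own assumptions for illustration, not anything taken from the paper's code.

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is NOT among the k highest scores.

    scores: (num_examples, num_classes) array of class scores (hypothetical layout).
    labels: (num_examples,) array of integer class indices.
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # Indices of the k largest scores in each row; order within the k is irrelevant.
    top_k = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    hit = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()
```

With k=1 this reduces to the usual classification error.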
Besides describing the general architecture (5 convolutional layers + 3 fully connected layers), the authors describe several “tricks” that improve performance. For example, to reduce overfitting they generate multiple images similar to each one provided. In particular, every image is translated and horizontally flipped, and its brightness and color histogram are changed.
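A rough sketch of these augmentations, assuming images are NumPy arrays of shape (H, W, 3) with float values. The function names are made up for illustration; the color step follows the paper's description of adding multiples of the RGB principal components, with the 0.1 standard deviation taken from the paper.

```python
import numpy as np

def random_crop_and_flip(image, crop=224, rng=None):
    """Random translation (via a random crop) plus a random horizontal flip."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop + 1)    # upper bound is exclusive
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]             # mirror left-right
    return patch

def rgb_pca(pixels):
    """pixels: (N, 3) RGB values. Returns eigenvalues and eigenvectors (as columns)."""
    cov = np.cov(pixels, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvals, eigvecs

def pca_color_jitter(image, eigvecs, eigvals, sigma=0.1, rng=None):
    """Add the same random RGB offset, drawn along the principal components, to every pixel."""
    rng = np.random.default_rng() if rng is None else rng
    alpha = rng.normal(0.0, sigma, size=3)
    shift = eigvecs @ (alpha * eigvals)    # (3,) offset
    return image + shift
```

In the paper the PCA is computed once over the training set's RGB values; computing it per image, as a quick test might do, is only a convenience.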
Overall, the resulting neural network definitely makes a great contribution, reducing the recognition error by almost half. Moreover, as pointed out in the paper, some images may have multiple labels, so zero error is not achievable by definition.
The things I liked about the paper:
- overfitting reduction techniques. Though they are not new, I hadn’t read about them before
- evaluation of the contribution of each trick to the result
- the class named “dead man’s fingers”, which I had to look up in Wikipedia (Fig.4, left, 2nd image in 2nd row)
The things to clarify:
- How was the general architecture chosen? Yes, reducing the number of convolutional layers affects the performance. What about adding layers? What is the rationale behind choosing the number of kernels for every layer? That is, if I want to build my own neural network, how should I start?
- They say that, to reduce overfitting, the size of the training set is multiplied by a factor of 2048 via random image translation. Why, then, do they care about every factor of 2 in other sections?
- The failure cases are not really failures; they only illustrate “almost hits”
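On the factor of 2048: the paper does not spell out the count, but one reading that reproduces it, assuming 224x224 patches taken from 256x256 images with 32 offsets counted per axis and a horizontal reflection of each, is:

```latex
\[
(256 - 224)^2 \times 2 \;=\; 32 \times 32 \times 2 \;=\; 2048
\]
```

Counting offsets inclusively would give 33 per axis instead, so treat this as a plausible reconstruction rather than the authors' stated derivation.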
Besides, I don’t know if it is good or bad, but I feel that this paper is moving away from computer vision. There are not many decisions and tricks specific to images. Many of the tricks concern computation time or splitting work across 2 GPUs. In general, I would describe this paper as presenting the building and tuning of a neural network, with an example application in computer vision.
While the paper certainly does read like a pure neural net paper, the data generation/overfitting reduction techniques are domain specific. On the other hand, what else is there really to say? It will probably be a while before we can even start studying the effects of the neural net structure, much less specialize to the domain.
I agree that the paper did not have much insight into the vision aspects of the object recognition problem. However, that does not bother me too much because it is a machine learning venue and you would expect to see more learning related stuff. Having said that, the paper just says "throw in a bunch of images and the deep net will do what it is supposed to do" and no more than that. The intuition behind the learning component (all the parameters and the method itself) is missing.
As for designing the architecture, that is one frustrating point. Why x kernels at layer y? They allude to the fact that they would want a larger net if they had more computational resources/time. Although they do not explicitly talk about it, I suspect that the first third of the layers builds and selects for good initial features, the second third focuses on building complex representations of these features, and the last third (the fully connected part) is simply a non-linear classifier over these complex features.
Is there a method for pruning kernels at each level? Perhaps these values were hand tuned by training for a day, checking performance, and repeating.
I think they mention that the number of layers was limited by the size of their GPU. So yes, adding a layer would probably perform better, but then they would need better hardware. Though at a certain point adding too many layers will start to perform worse for a fixed training time because the algorithm can't learn fast enough.
Deep learning is a really hot topic these days. The most important message it delivers is that "it works", although we do not yet know "why it works". This paper states its motivation quite clearly (current features are manually designed rather than learned from data), but it is vague about how this architecture was selected and why. According to a paper from Berkeley, deep learning also seems to find discriminative regions, much like what the mid-level patches do. That might be one intuitive explanation. On the other hand, I believe the patches discovered in Carl's paper are more explainable and discriminative, so could we also build a fully connected network (possibly with max-pooling) as an easy experiment to demonstrate its performance?
I agree with you on this. I guess this was the paper you were referring to - http://arxiv.org/abs/1311.2524 ?
Yes, that is it...
Ah, an interesting paper, a paper from DPM people beating DPM on PASCAL dataset...
In this paper, the authors present a deep convolutional neural network (CNN) for the large-scale image classification problem. They aim to classify the 1.2M images of the ImageNet LSVRC-2010 contest into 1000 classes. Specifically, they propose an eight-layer CNN with five convolutional and three fully-connected layers. The proposed CNN is designed to run on a 2-GPU system, where the neurons after the second layer are split and run independently on each GPU. To mimic the human perception system, the proposed CNN incorporates local response normalization and overlapping pooling. In addition, the authors propose several tricks to reduce overfitting. They demonstrate significant improvement on the LSVRC test set.
Deep learning has been getting a lot of attention recently in various applications. The central focus in this field is to build intelligent systems that can extract high-level representations from sensory data. Such systems require multi-layer designs and non-linear processing. I like that this paper proposes a good architecture in which the authors try to design processes that mimic the human perception system.
Concerns:
1. Size of the input images. In the proposed method, they downsample the high-resolution images from the dataset to 256x256 (if the image is not square, only the central 256x256 patch is used). This normalization undoubtedly discards some information contained in the original images. How about using zero padding? Or even adding a summarizing layer that dynamically fits the size of the input image and generates a fixed-size summary for the following layers?
2. The proposed system works directly in the RGB color space. Although they augment the data by adding the result of PCA on the set of RGB pixel values, it is known that the human visual system doesn’t really work directly in RGB. For example, under low-light conditions, human vision only processes grayscale images and maintains similar performance as with color. It would be very interesting to see how this could be added to the system and how it would perform.
Not understanding the specific details of CNNs vs traditional NNs and having studied NNs a long time ago, I recall that a simple linear transform of the data, like RGB to grayscale, can be learned by a NN. In other words, it will learn to transform RGB using whatever linear transform is most useful to it along with any other features it learns during training. How to separate this from the final result to see what it learned is not clear, which is always a problem with NNs.
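To make the "a linear transform like RGB to grayscale can be learned" point concrete, here is a toy sketch. It assumes synthetic random pixels and the standard BT.601 luma weights as the target, and it fits a single linear unit by least squares rather than by gradient descent, so it only illustrates that the mapping is within reach of one linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
rgb = rng.random((1000, 3))                      # synthetic pixels in [0, 1]
luma_weights = np.array([0.299, 0.587, 0.114])   # ITU-R BT.601 luma coefficients
gray = rgb @ luma_weights                        # target grayscale values

# One linear unit with no bias and no non-linearity, fit in closed form.
learned, *_ = np.linalg.lstsq(rgb, gray, rcond=None)
print(learned)  # close to luma_weights, up to floating-point error
```

The harder part raised in the comment, separating such a transform from everything else a trained net has learned, is not addressed by this toy.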
I agree that human visual perception does provide a lot of high-level insight into how we can construct algorithms. But I'm not completely convinced that mimicking the human visual system is how computer vision can be solved. Understanding the former in enough detail to 'mimic' it seems to be much harder than designing vision algorithms to perform specific tasks with good accuracy. Even in this particular paper, I think the performance comes out of the meticulous design of the network for the task of image classification and the measures taken by the authors to avoid over-fitting. Maybe this will be a good thing to discuss in class.
I like the idea of using deep nets for feature learning. I think these data-driven features are the way to summarize images both intuitively and discriminatively.
However, I am always concerned by the fact that no clear analysis of the parameters is presented. How do you choose the architecture? How do you choose the number of kernels at each layer? I feel deep nets have so many non-intuitive parameters - normalization constants, non-linearity hyperparameters, etc.
So is the way to "learn features" to move from hand-tuned features (some parameters of which we understand) to non-intuitive parameters?
Recent papers such as "Segmentation Driven Object Detection" show a shift in object detection toward using a plethora of features on "proposal" windows. Is this the way to go?
It seems it can be very hard to get the tuning right to get good results from neural nets. In fact, the authors mention that the learning rate was "adjusted *manually* throughout the training". I'm sure all the other parameters take a lot of human learning before the final system can be trained!
Yet, there is something attractive about a neural net. It's an attempt to have a system learn a 'memory'. This paper only shows the Gabor filters, but the cat paper shows all the higher-level nodes that seem to have meaning.
But we seem to be hand tuning to get Gabor-filter-like kernels. Couldn't we just start with Gabor filters then?
They sort of do use multiple proposal windows, but they are all fixed at test time, do not respect segmentation boundaries, and are softmaxed to form a single distribution of labels, as opposed to multiple object hypotheses. It's not immediately clear how they could extend their approach to train it to behave in a more complex manner, as it would require more descriptive ground truth labelling or some kind of clustering and guessing on ground truth to build a training set.
Many folks seem to be concerned about the mystical nature of neural net engineering. Indeed, it does seem like a dark art, but given that the net takes 5-6 days to train, I think this is forgivable. The authors do mention that removing layers decreases their performance, but without some way of making this faster (perhaps with specialized hardware), they can't be expected to generate enough data to do any meaningful analysis. Another way to look at it is, despite our limited understanding of neural nets, it is still possible to produce results that rival better understood methods, suggesting that this is worth further investigation.
In this paper, the authors report their recent work on large-scale object recognition on ImageNet using a convolutional neural network.
In a sense, this paper is mostly focused on the system and architecture level. Although people generally still have a lot of questions about the underlying reasons why deep learning works so well, we can still find some evidence.
I believe one of the reasons is the way raw features are regularized and max-pooled layer by layer. It is essentially a joint discriminative feature learning and dimensionality reduction process. This makes the learned features highly effective.
On top of this, I saw an interesting point of view on Kai Yu's Weibo. During his discussion with Geoff, Geoff mentioned that what will ultimately remain of deep neural nets is the idea of layered feature construction. Kai generalized this as: the specific form of deep learning is not important; what ultimately matters is the idea of deep models with layered structures. This will be the long-lasting value of deep neural nets.
The paper presented good results (state of the art), but did not seem to contribute much theoretically regarding neural networks. However, the paper plays an important role in exhibiting the power of modern computation in regards to neural networks. The authors successfully create a system using two ~$500 GPUs (not too expensive).
The authors also detail how their engineering decisions improved test time results, which gives us a little bit of insight into what helped. Some comments on these:
- Anyone who has naively implemented a neural net has likely been bitten by the saturation of output=tanh(x) (I think 5 hrs debugging was my personal record on this). The ReLU (just a hinge function / rectifier) does a good job on this.
- The dropout technique is cool. I think I recently heard about it and it's nice to hear that it works well.
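A small sketch of both points, with function names of my own choosing. The gradient helpers show why tanh stalls for large inputs while ReLU does not; the dropout function follows the paper's description (drop each unit with probability 0.5 during training, multiply outputs by 0.5 at test time) rather than the "inverted" variant some libraries use.

```python
import numpy as np

def tanh_grad(x):
    """Derivative of tanh; approaches 0 as |x| grows (saturation)."""
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU; stays at 1 for any positive input."""
    return (np.asarray(x) > 0).astype(float)

def dropout(activations, p_drop=0.5, rng=None, train=True):
    """Training: zero each unit with probability p_drop.
    Test: keep every unit but scale outputs by (1 - p_drop)."""
    if not train:
        return activations * (1.0 - p_drop)
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask
```

For an input of 10, the tanh gradient is on the order of 1e-8 while the ReLU gradient is still 1, which is the saturation problem in a nutshell.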
Mostly, this was a systems paper that showed good results, but it would have been awesome to see them try something crazy on their system that would really teach us something new about neural nets (whether you hate or love them)!
The paper showed nice and interesting results of deep nets in object recognition. Personally, I really do not get the intuition behind how deep nets work. It is certainly disturbing that non-intuitive methods for feature learning perform better than the ones that make logical sense (to the human brain). We do not know what the deep nets are actually learning. Since it seems to perform well on most of the datasets that we currently have, a lot of people are interested in it. But without having at least a vague idea of why it is behaving the way it does, I'm not too enthusiastic about using it. And the paper certainly does not address the intuition behind deep nets.
Personally, I think they are doing what HOG does but now in a hierarchical fashion (over subsequently more and more complex features). Given gradients, they make an invariant representation of a collection of them, and then given this representation they do it over again. Features in the higher parts of the network might be shapes or parts of a class of objects. There was a VASC seminar talk about this earlier this year:
http://www.ri.cmu.edu/event_detail.html?event_id=763&&menu_id=427&event_type=seminars
I thought this paper was nicely written and explained the general concepts quite well. Like everyone else, I have zero intuition for what's going on in the neural net itself, and how the authors decided on the structure, parameters, etc. It seemed to me that there was quite a bit of "engineering" (hacking?) involved to make it work - data augmentation, dropout, local response normalization, overlapping pooling, number of convolution layers and fully connected layers, number of nodes...
I'm also unclear about what a "convolutional" layer is - I assume this will be explained tomorrow.
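In case it helps before class, a bare-bones sketch of what one kernel in a convolutional layer computes: the same small weight window is slid over the image, and each output value is the weighted sum of the pixels under it. This is a simplification (single channel, one kernel, no bias or non-linearity, plain loops), and the function name is made up.

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    """Slide one 2-D kernel over a 2-D image without padding.

    Like most deep learning code, this is technically cross-correlation
    (the kernel is not flipped).
    """
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            window = image[r:r + kh, c:c + kw]
            out[i, j] = np.sum(window * kernel)   # same weights at every position
    return out
```

A real layer applies many such kernels across all input channels, which is where the "x kernels at layer y" numbers come from.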
If performance decreases when removing layers, does that mean it would increase when adding layers? Can we try even bigger neural nets? Is there something so terrible about a neural net taking one month to train if it works really well? Do we need more data? Can it handle weakly labeled web images?
Same feeling here: deep nets always look like a black box to me. What the intermediate nodes learn is not clear, and the connectivity/structure is chosen ad hoc. It may help if I could see what the intermediate weight vectors look like, so we know what it is trying to learn. I think max-pooling makes sense; it is biologically inspired.
Deep nets are just like a magic box for me.....There are so many parameters here and it seems that it is too hard to analyze the influence of each parameter mathematically.
Last time Rob Fergus gave a talk at CMU, he mentioned there was a PhD student who spent almost his whole PhD tuning and optimizing a deep neural net. In general, it takes a lot of effort to optimize. It appears that 7 layers is just the right number of layers, not one more, not one less :)
I was really unhappy about the fact that they took the center crop of the image. Websites like Facebook have worked to make slightly more intelligent cropping algorithms without being computationally expensive. And even if center-cropping was the only reasonable option, the authors later go on to use 224x224 patches sampled from the cropped center image. I'm not entirely sure why they didn't just use 256x256 patches sampled from the original image, so that they would not lose this information completely.
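To show what gets thrown away, here is a sketch of just the center-crop step, assuming the image has already been rescaled so its shorter side is 256 (the rescaling itself is omitted, and the function name is hypothetical).

```python
import numpy as np

def center_crop(image, size=256):
    """Keep the central size x size region; everything outside it is discarded."""
    h, w = image.shape[:2]
    if h < size or w < size:
        raise ValueError("image is smaller than the requested crop")
    top = (h - size) // 2
    left = (w - size) // 2
    return image[top:top + size, left:left + size]
```

For a 256x384 image, for example, this drops 64 columns on each side, which is a third of the pixels.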
I'm not sure if I just missed this, but how did they figure out which kernels go on which GPU? Seems like magic.
Also, a minor nitpick, but I wish the authors would quantify how much overfitting was occurring. They frequently mention that such and such change (e.g. dropout and pooling) reduces overfitting, but never quantify how much. It would be nice, especially for something like dropout, if you wanted to change the tradeoff between overfitting and training time.
I was also really curious why the performance is so different between ILSVRC (15%) and ImageNet (67%) since one is a subset of the other. Are the images in ILSVRC specifically chosen to be the easier images of ImageNet? Or is this some sort of random sampling bias?
Could it be that their system fails on small datasets? They did keep mentioning that they either need a large dataset, or make their neural net smaller.
The deep network this paper talks about seems to have been designed extremely carefully - there are so many details in the design! As people have pointed out already, there are a lot of parameters to be set and many of the design decisions taken by the authors have reasoning along the lines of "we found that .... works best", "we saw that ... is necessary" or "we followed .... heuristic". It is difficult not to wonder if deep learning is just about moving complex hand-tuning from the feature design to the system design - I guess this is what Ishan has also pointed out. However, in spite of skepticism about deep learning, it does seem to be actually used in production in some places very recently. Most notably the new photo search feature on Google+ uses Prof. Hinton's work (http://techcrunch.com/2013/06/12/how-googles-acquisition-of-dnnresearch-allowed-it-to-build-its-impressive-google-photo-search-in-6-months/).
I think the large number of parameters and the complex architecture design might not be that big an issue. Nearly every vision and machine learning system in practice needs parameter tuning, so 60 million parameters and 650,000 neurons do not necessarily become an unaffordable burden. The application of deep learning to vision is just a recent event, and the knowledge transfer from the machine learning community to the vision community is not yet well established.
An interesting argument about feature design versus system design. Personally I think that with feature design it is easier for humans to understand what is going on, so it is more intuitive. But for system design, only people like Geoffrey Hinton can find a good way to proceed, maybe still after a lot of trial and error. I think people may need to have a better representation of the system itself in order to design it.
This paper was very interesting, as the scale of the problem and the scale and complexity of the solution are all quite large. The authors present many different techniques for combating overfitting, yet it's unfortunate that they did not compare results of the entire procedure in an ablative manner. This data would probably help us understand a bit more about what's important to the training process. A minor concern of mine is the 10-region softmax at test time.
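For reference, a sketch of that 10-region test-time procedure as I read it from the paper: take the four corner patches and the center patch, add their horizontal reflections, and average the softmax outputs. Here `score_fn` is a hypothetical stand-in for the trained network (any callable mapping a batch of patches to class scores), and the other names are mine too.

```python
import numpy as np

def ten_crops(image, crop=224):
    """Four corner patches and the center patch, plus their horizontal reflections."""
    h, w = image.shape[:2]
    offsets = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
               ((h - crop) // 2, (w - crop) // 2)]
    patches = [image[t:t + crop, l:l + crop] for t, l in offsets]
    patches += [p[:, ::-1] for p in patches]
    return np.stack(patches)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_ten_crop(image, score_fn, crop=224):
    """Average the per-patch softmax distributions into one label distribution."""
    probs = softmax(score_fn(ten_crops(image, crop)))
    return probs.mean(axis=0)
```

Note that the average is still a single distribution over labels, which is exactly why this does not give multiple object hypotheses.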
My overall feeling is that the results are impressive for this problem formulation. However, I think for slightly more complex tasks (scene parsing, multi object detection, etc.) it's not immediately clear how this approach can be adapted. Again I would point out that images are collections of full and partial views of objects, and the assumption of this task is that these collections are of size 1.
I think the magic of deep networks lies in that they incorporate the human high-level prior into the model itself. Unlike previous feature extraction schemes widely accepted in the vision community, such as SIFT and HOG, which enforce the priors directly in the design of the features, deep learning allows the model to learn useful features automatically. The two demonstrate different ways of dealing with prior information.
Leveraging large datasets using deep learning seems to be a hot topic and this paper provides a good view of a complete deep learning system. The authors basically combine work done in different sub-areas like ReLU non-linearity, data augmentation and dropout for their particular task. I was impressed by the detail with which they describe their CNN.
Also, an interesting engineering challenge was the use of two GPUs, and the authors address this successfully. I like how the authors use simple techniques like augmentation and dropout to prevent overfitting.
The qualitative results look good though I'm disappointed that the authors don't show images where their system fails completely and their failure cases are all near misses.
This seminal work on deep learning explores the use of extremely complex convolutional neural networks to classify large datasets. Because of the computational complexity of training large neural networks, the main challenge is efficiently implementing the network in a way which can be easily parallelized on GPUs. To that end, this paper is largely about implementation, and contains very little theory. The authors provide reasonable techniques for representing the (60 million parameter) network on a GPU, and detail several other implementation techniques which further reduce the computational overhead of training the network.
Throughout this class, the main idea was that even "simple" algorithms, combined with "big data," result in very high performance. Generally, the underlying classifiers and features are very simple, and the data is what drives the performance of the algorithm. This paper bucks the trend by creating an extraordinarily complex model which is learned from the bottom up. The main concern is that as the model complexity increases, paradoxically the performance of the algorithm tends to go down due to overfitting. Another big concern is that as the complexity of the model increases, the ease of *testing* and *re-using* the model decreases dramatically.
The authors address the problem of overfitting through a series of what I would consider, quite frankly, "hacks". They force scale, translation, and mirror invariance by feeding in synthetic data to the network which contains those transformations.
Nevertheless, they manage to perform much better than state-of-the-art image classifiers. Their qualitative results are also compelling.
-- Matt K
It was also incredibly interesting that their first layer learned what essentially resembles texton features!
-- Matt K
It is a really interesting and famous paper. Deep learning is a really hot area and the state-of-the-art method in computer vision. However, there are so many parameters in deep learning. It is really necessary to study the parameters systematically.