Slide 19 of 41

These images were taken from the very useful paper Visualizing and Understanding Convolutional Networks by Zeiler and Fergus (ECCV 2014).


Section 6.4.1 of the Deep Learning textbook has a nice explanation of the design choices behind deep networks. Quoting one of the arguments directly from the textbook (page 200):

"Choosing a deep model encodes a very general belief that the function we want to learn should involve composition of several simpler functions. This can be interpreted from a representation learning point of view as saying that we believe the learning problem consists of discovering a set of underlying factors of variation that can in turn be described in terms of other, simpler underlying factors of variation".


Does this indicate that, to be able to train a good model, we should provide diverse training data so that it can see the patterns?


@KnightsLanding: I guess the choice of a particular architecture depends on the problem (and the corresponding dataset) at hand. In the paper referred to in this slide, one of the datasets the authors are dealing with is ImageNet, which has over 1.3M images categorized into 1000 classes. So, in order to learn well (in other words, to generalize well), specific architecture design choices (such as a fairly deep network) are made before training.

On the other hand, if the problem at hand is digit classification on the MNIST dataset, a very deep network would lead to overfitting, since the number of parameters to tune (knobs to turn) would be far larger than both the number of available training instances and what is needed to capture the underlying variability of the data.

So my point is that rather than trying to get more varied data after fixing a particular architecture (which might not be feasible), it is generally the case that the architecture is chosen based on the dataset at hand (a common technique is to use cross-validation to determine the number of layers, learning rate, etc.).
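To make the cross-validation point concrete, here is a small numpy-only sketch. I use polynomial degree as a stand-in for model capacity (picking the number of layers or the learning rate follows the same loop structure; the dataset and names here are made up):

```python
import numpy as np

# Hedged sketch: selecting model capacity by k-fold cross-validation.
# "Degree" stands in for capacity choices like depth; same loop applies.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 60)
y = np.sin(3 * x) + 0.1 * rng.normal(size=x.size)

def cv_error(degree, k=5):
    idx = np.arange(x.size)
    folds = np.array_split(idx, k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)          # hold one fold out
        coeffs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coeffs, x[fold])
        errs.append(np.mean((pred - y[fold]) ** 2))
    return np.mean(errs)                          # average held-out error

# Pick the capacity with the lowest average validation error.
best = min(range(1, 10), key=cv_error)
print(best)
```

The point is that the data itself (via held-out error) picks the capacity, rather than fixing an architecture first and hoping for more data.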


@randomized: Thank you! I was wondering whether the model can recognize things it hasn't seen before. But then again, maybe humans can't recognize things they have never seen either.


I am a little confused: if we are applying gradient filters in the initial layers, then aren't we losing a lot of information in the first few layers themselves? I mean, in the next few layers we are finding patterns only in very specific directions (because that is all that is left). Or maybe being specific in later layers is a good thing?


Another thing: does "the function we want to learn should involve composition of several simpler functions" mean that at each layer we are trying to learn the weights of these simpler functions?


@sasthana: We are not applying gradient filters in the initial layers. The grey set of images (in each of the layer-i figures) shows the top-9 activations in a random subset of feature maps (over the validation data). These activations are projected down to pixel space using a deconvnet algorithm. The non-grey images (on the right-hand side of each layer-i figure) are the corresponding patches from the validation set that cause high activations in a particular feature map.

Just to recall: the weights corresponding to these feature maps are adjusted by backpropagation, and it happens to be the case that when we analyse the final weights (post training) at the individual layers, we obtain a hierarchical set of features (starting from low-level features in the initial layers, and so on).

For your second query, I think the answer is yes. (However, our main objective is not to learn the simpler functions in a standalone manner.)

EDIT: Changed the description according to the paper (section 4).
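To make the "top-9 activations per feature map" selection concrete, here is a rough numpy sketch. The shapes, names, and random activations are illustrative assumptions, not the paper's actual pipeline (which also includes the deconvnet projection):

```python
import numpy as np

# Hedged sketch: picking the 9 validation images that most strongly
# activate a given feature map, as described for Zeiler & Fergus's figures.
rng = np.random.default_rng(0)
n_images, n_maps, h, w = 100, 16, 8, 8
# Pretend these are one layer's activations over the validation set.
activations = rng.normal(size=(n_images, n_maps, h, w))

def top9_for_map(acts, map_idx):
    # Score each image by its strongest spatial response in this map,
    # then keep the 9 highest-scoring images.
    scores = acts[:, map_idx].reshape(acts.shape[0], -1).max(axis=1)
    return np.argsort(scores)[::-1][:9]

top = top9_for_map(activations, map_idx=3)
print(len(top))  # 9
```

In the paper, each of these 9 images is then projected back to pixel space with the deconvnet to produce the grey reconstructions.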


Is the gray picture a visualization of the kernel function, and the color ones the convolution result? Why do even the gray ones in layer 3 have some structure from the input pictures?


@yangwu: The input images shown are the images that generated the biggest response from a unit (a neuron). It would be reasonable to say that the unit responds to images that look like those. The gray image is a visualization of which pixels in the source pictures contributed to triggering a high response from this unit. So if the images on the right indicate what the neuron responds to, the gray images give you some sense of why the neuron responds to those images.
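A crude way to see the "which pixels contributed" idea, using a single linear unit instead of the paper's actual deconvnet (everything here is a simplified analogue I made up for illustration):

```python
import numpy as np

# Hedged sketch: for a linear "unit", the per-pixel contribution to its
# response is just weight * pixel -- a crude analogue of the gray
# deconvnet projections, not the actual method from the paper.
rng = np.random.default_rng(0)
image = rng.uniform(size=(8, 8))
weights = rng.normal(size=(8, 8))  # hypothetical receptive-field weights

response = float(np.sum(weights * image))
contribution = weights * image  # which pixels drove the response

# The contribution map decomposes the unit's response pixel by pixel.
print(np.isclose(contribution.sum(), response))  # True
```

The real deconvnet has to undo pooling and ReLUs layer by layer, but the intuition is the same: attribute the unit's response back onto input pixels.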


Note that these images are the top 9 activations on each layer -- there are definitely also a LOT of images that trigger really poor or ambiguous responses/clusters too.

Is showing the images that generate the best response a standard thing to do in tasks like this? Is there a more pessimistic measure of the performance of the network?