I am having trouble understanding what fundamentally is going on. I understand that each layer has a different convolution filter, and convolves the output from the previous layer. But what is the end result? Is it just matching images that are similar to the final product of the filter?
Yeah, basically. Each convolution produces a response map: high values mean the input matched that filter strongly. After several layers of filtering, you can look at which filters produced the strongest responses, and then, based on your training data, decide which category the image belongs to from that pattern of responses.
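To make "high responses" concrete, here's a minimal sketch (numpy, with a made-up patch and filter) of what one convolution step computes: the filter slides over the image, and the output is large exactly where the image locally looks like the filter.

```python
import numpy as np

# Hypothetical 5x5 grayscale patch with a vertical edge down the middle.
patch = np.array([
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=float)

# A 3x3 vertical-edge filter: responds where intensity rises left to right.
vertical_edge = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

def conv2d_valid(img, kernel):
    """Slide the kernel over the image (cross-correlation, 'valid' mode)."""
    kh, kw = kernel.shape
    out_h = img.shape[0] - kh + 1
    out_w = img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

response = conv2d_valid(patch, vertical_edge)
# Windows that straddle the edge give the largest response;
# windows entirely inside a flat region give zero.
print(response)
```

A real CNN learns these filter values instead of hand-picking them, but the response computation is the same.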
I am still not clear about the process.
Will the order of the filters affect the final result? For example, will the result of applying filter A and then filter B be the same as applying filter B and then filter A?
How do we decide which category an input image belongs to? Do we find the filter with the strongest response at each layer and combine those results, or do we look for the single filter with the strongest response across all layers?
@Fantasy, yeah, the order of the filters will affect the final result. Interestingly, pure convolutions by themselves do commute, but a CNN layer isn't just a convolution: each layer also applies a nonlinearity (like ReLU) and often pooling, and composing nonlinear operations is order-dependent.
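Here's a quick sketch of why order matters (toy 1-D signal, hypothetical filters): two plain convolutions commute, but once each "layer" includes a nonlinearity like ReLU, swapping the filters changes the result.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Toy 1-D signal and two made-up filters.
signal = np.array([1.0, -2.0, 3.0, -1.0, 2.0])
filt_a = np.array([1.0, -1.0])   # difference (edge-like) filter
filt_b = np.array([0.5, 0.5])    # smoothing filter

# Pure convolutions DO commute: (s * a) * b == (s * b) * a.
linear_ab = np.convolve(np.convolve(signal, filt_a), filt_b)
linear_ba = np.convolve(np.convolve(signal, filt_b), filt_a)
print(np.allclose(linear_ab, linear_ba))  # True

# But a CNN layer is convolution *plus* a nonlinearity, which breaks it.
ab = np.convolve(relu(np.convolve(signal, filt_a)), filt_b)
ba = np.convolve(relu(np.convolve(signal, filt_b)), filt_a)
print(np.allclose(ab, ba))  # False
```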
As for the other question, krillfish answered it already.
It depends on the model used. There are models that just look for the filter with the strongest response, but usually there is an 'output' layer that takes the responses of the entire last inner layer, combines them, and produces the 'answer'. Exactly how they are combined depends on the model.
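As a toy illustration of that kind of output layer (all numbers made up): a fully connected layer mixes the last layer's responses into one score per category, and a softmax turns the scores into probabilities.

```python
import numpy as np

# Hypothetical responses from the last inner layer (4 filters).
responses = np.array([2.0, 0.5, 1.0, 3.0])

# Hypothetical learned weights mapping 4 responses to 3 categories.
weights = np.array([
    [ 0.9, -0.2,  0.1],
    [ 0.0,  0.8, -0.1],
    [-0.3,  0.1,  0.7],
    [ 0.5,  0.2, -0.4],
])
bias = np.zeros(3)

# One score ("logit") per category: a weighted combination of ALL responses,
# not just the single strongest one.
logits = responses @ weights + bias

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

probs = softmax(logits)
predicted = int(np.argmax(probs))
print(predicted, probs)
```

The network's 'answer' is then just the category with the highest probability.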
Could someone explain how the layers relate to each other? I'm confused because the images in the corresponding sections of the photo grid don't seem to relate to each other between layers. I guess I'm expecting it to be narrowing down all the images to photos of a particular object, but maybe each layer is supposed to be giving completely different results?
@temmie they don't. It's usually nearly impossible to tell what information a given node in a layer actually encodes; the images above are an approximation of what each element responds to, not exactly what it encodes. Which node ends up holding which information is largely arbitrary: in many popular training schemes the initial weights are randomized, and then training (error propagation and weight readjustment) is allowed to happen. The result is something that works and produces the outputs you want when you read the output layer. What the individual layers do internally? You can only guess.
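To give a feel for "randomized weights plus readjustment", here's a minimal sketch (toy data, a single weight, plain gradient descent standing in for full backpropagation): the weight starts at a random value, but training pulls it toward something that produces the right outputs.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy task: learn y = 2*x with a single randomly initialized weight.
x = rng.random(20)
y = 2.0 * x

w = rng.normal()          # random initial weight
lr = 0.5                  # learning rate
for _ in range(200):      # "error propagation and readjustment"
    pred = w * x
    grad = np.mean(2.0 * (pred - y) * x)   # d(MSE)/dw
    w -= lr * grad

print(w)  # ends up near 2.0 regardless of the random start
```

In a deep network the same thing happens across millions of weights at once, which is why the role any individual node ends up playing is so hard to pin down.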
The deeper layers encode the previous layers' results, so why does the visualization try to connect filters in the deeper layers with the input images? Shouldn't those filters operate on their actual input, i.e. the convolution outputs of the layer above?
@RX it's an approximation. Yes, they really should be operating on the results of the previous layer. Perhaps these approximations were made by 'zeroing' that one cell and seeing what effect it had on the overall output; by doing this, they may have been able to approximate which cells are essential to recognizing that particular object.
The other explanation I can think of is that it wasn't random training but manually encoded (highly supervised) training, where they picked the weights by hand in a way that would recognize that particular object.
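The 'zeroing' trick mentioned above is essentially occlusion analysis; a rough sketch (made-up image and filter, numpy) looks like this: knock out one patch of the input at a time and record how much the unit's response drops.

```python
import numpy as np

# Hypothetical input and a "unit" to probe: here, the total ReLU'd
# activation of a 3x3 edge filter over the whole image.
rng = np.random.default_rng(0)
image = rng.random((8, 8))

kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

def unit_response(img):
    kh, kw = kernel.shape
    total = 0.0
    for i in range(img.shape[0] - kh + 1):
        for j in range(img.shape[1] - kw + 1):
            total += max(0.0, np.sum(img[i:i+kh, j:j+kw] * kernel))
    return total

baseline = unit_response(image)

# Occlusion map: zero each 2x2 patch in turn and record the response drop.
importance = np.zeros((4, 4))
for bi in range(4):
    for bj in range(4):
        occluded = image.copy()
        occluded[2*bi:2*bi+2, 2*bj:2*bj+2] = 0.0
        importance[bi, bj] = baseline - unit_response(occluded)

# Large values mark the patches this unit depends on most.
print(importance)
```

Mapping those drops back onto the input is one way to get pictures like the grey boxes in the article, without ever interpreting the intermediate layers directly.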
Layer 3, top-right region seems to be responding to bird heads or green backgrounds. The grey boxes visualize which pixels in the image are most responsible for triggering the response; e.g. in the top-left region, the responses seem to be triggered by grid-like patterns.