axiao

Another term that is often added to the total loss is a regularization loss, which captures the complexity of the network parameters. For example, the regularization loss could be the sum of the squared weights in the network (an L2 penalty). The intuition is that penalizing the complexity of the network leads to better generalization beyond the training set, since minimizing loss on the training data alone might just lead to rote memorization of that data.
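
For concreteness, a minimal numpy sketch of what that could look like (the list of weight matrices and the regularization strength lam are hypothetical, just for illustration):

```python
import numpy as np

def total_loss(data_loss, weights, lam=1e-4):
    # Regularization loss: sum of squared entries over all weight matrices.
    reg_loss = sum(np.sum(W ** 2) for W in weights)
    # lam trades off fitting the data against keeping the weights small.
    return data_loss + lam * reg_loss

# e.g. total = total_loss(cross_entropy_loss, [W1, W2])
```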

Adding terms like this to the total loss also fits very naturally into the backpropagation method described later: we simply compute gradients with respect to the new loss function.
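
In gradient terms, the extra penalty just adds a contribution to each weight's gradient. A hedged sketch, continuing the hypothetical lam and weight list from above:

```python
def add_reg_gradient(weight_grads, weights, lam=1e-4):
    # The derivative of lam * sum(W ** 2) with respect to W is 2 * lam * W,
    # so backprop just adds this term to the gradient the data loss
    # already produced for each weight matrix.
    return [dW + 2 * lam * W for dW, W in zip(weight_grads, weights)]
```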

shhhh

Do we have to consider the case of overfitting? Driving the loss down on every training example might not be the best idea, because there may be outliers that end up causing higher error on a test set.

emt

To rephrase the intuition of the slide: if the network gets the answer roughly right for over a million cases (the loss is low), then presumably the parameters have been set so that they generalize to new examples as well, as opposed to just memorizing the data at hand.

yes

@emt Is that something that is provably true? Or only something that can be tested and empirically reasoned about?

ggm8

As @shhhh pointed out, we want to be wary of overfitting from an unnecessarily complex model; how much of a risk that is depends on our domain and how much data we have. To get a better handle on generalization, we can partition the data into multiple sets serving different purposes: a training set used on every iteration to adjust the weights, a validation set used to monitor generalization at a finer granularity (e.g. after every iteration or epoch, which also lets us decide when to stop), and a held-out test set used to measure overall generalization once our stopping conditions have been met.
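
As a rough sketch of that kind of split (the 80/10/10 fractions and the usage comments are just illustrative, not something specified in the lecture):

```python
import numpy as np

def split_data(X, y, seed=0):
    # Shuffle, then carve out 80% train / 10% validation / 10% test.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train, n_val = int(0.8 * len(X)), int(0.1 * len(X))
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

# During training: update weights on the train split each iteration,
# check loss on the val split to monitor generalization and decide when to stop,
# and evaluate on the test split only once, after training has finished.
```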