Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2016

vincom2

please use partial derivative symbols :(((((

Calloc

Given the current information, the idea would be to increase p1 and decrease p2. I think the amount change would need to be determined by trial and error since we don't know if the function is linear in p1/p2.

PID_1

@calloc. Typically we make some fixed step size times the optimal direction vector rather than trying to make a step that gets us all the way to zero in one go. This way we can break up the calculation into the minibatches and make steps in the general direction of the optimal point. If we took big steps with those, it might not converge, since each minibatch only takes into consideration some of the parameters.

jrgallag

This strategy would only find a local minimum rather than a global one, correct? For example, there could be a pocket of much more accurate values that would require you to go "uphill" a bit before reaching it. Are there any strategies that could find a global maximum? Perhaps starting with many random initial sets of weights and taking the best final one.

hofstee

@jrgallag yes, there is a possibility that this would get stuck in a local minimum. Randomization and variation of step size can help to alleviate this, you could also run a few training attempts with randomized starting parameters. You would want to be finding the global minimum but finding the global minimum for an arbitrary function is impossible with an algorithm.

Maybe someone with better domain knowledge of ML could chime in with more info.

jhibshma

So, I think I understand how one would compute the gradients for a single-layered network, but how do you compute the gradient for a multi-layered network? It seems that the gradient of a given parameter is affected by what all the subsequent layers do with it's output.

randomized

@jhibshma: To compute gradients for a multi-layered network, chain rule is used. Please refer to the 4 slides starting from slide 14.

stl

We step in the direction opposite the gradient because gradient gives us the direction the function is increasing the fastest. The step size needs to be chosen carefully because if it's too big we may overstep the minimum and if it's too small it may take too long to find the minimum. As mentioned above, we are not guaranteed to find a global minimum, because once we've reached some local minimum, the gradient is 0 and we won't move any more.

grarawr

The into to machine learning on coursera by Andrew Ng gives a nice explanation about gradient descent

tdecker

I think it should be pointed out that many times in gradient decent the field is nonlinear, which means that just following the gradient will get you a local min/max. To do gradient decent on those systems you need to have a way to sample in multiple places possible in parallel.