Just approximate the gradient of the loss function over the whole dataset by the mean gradient over a random subset (a mini-batch).

Input: learning rate lr, initial guess for params θ
while not stop criterion:
    compute gradient estimate g on a mini-batch
    apply update θ ← θ − lr · g
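A minimal sketch of the loop above in Python/NumPy, on a toy least-squares problem (the dataset, learning rate, batch size, and fixed-step stopping rule are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # toy dataset (assumed)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)

lr = 0.1                                      # learning rate
w = np.zeros(3)                               # initial guess for params
for step in range(500):                       # stop criterion: fixed step count
    idx = rng.choice(len(X), size=32, replace=False)   # sample a mini-batch
    grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / 32   # gradient estimate
    w = w - lr * grad                                  # apply update
```

The mini-batch gradient is an unbiased estimate of the full-dataset gradient, so on average each step still moves downhill while costing only a fraction of a full pass over the data.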
Momentum
Inspired by physics: the update also depends on the previous "velocity", an exponentially decaying average of past gradients.

Input: learning rate lr, initial guess for params θ, momentum factor α, initial velocity v
while not stop criterion:
    compute gradient estimate g on a mini-batch
    update velocity v ← α · v − lr · g
    apply update θ ← θ + v
There is a variant called Nesterov momentum; it resembles an implicit Euler method: we compute the gradient at the position already updated by the previous velocity:

Input: learning rate lr, initial guess for params θ, momentum factor α, initial velocity v
while not stop criterion:
    compute approximate next position θ̃ ← θ + α · v
    compute gradient estimate g at θ̃
    update velocity v ← α · v − lr · g
    apply update θ ← θ + v
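Only one line changes relative to standard momentum: the gradient is evaluated at the look-ahead point instead of the current parameters. A sketch on the same assumed toy problem:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # toy dataset (assumed)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)

lr, alpha = 0.05, 0.9                         # learning rate, momentum factor
w = np.zeros(3)                               # initial guess for params
v = np.zeros(3)                               # initial velocity
for step in range(500):                       # stop criterion: fixed step count
    idx = rng.choice(len(X), size=32, replace=False)
    w_next = w + alpha * v                    # approximate next position
    grad = 2 * X[idx].T @ (X[idx] @ w_next - y[idx]) / 32  # gradient at look-ahead
    v = alpha * v - lr * grad                 # update velocity
    w = w + v                                 # apply update
```

The look-ahead acts as a correction: if the velocity is about to overshoot, the gradient at the predicted position already points back, so the update is damped before the overshoot happens.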