Just approximate the gradient of the loss function over the whole dataset by the mean gradient over a random subset (a mini-batch).

Input: learning rate lr, initial guess for params θ
while not stop criterion:
    compute gradient estimate g on a mini-batch
    apply update θ ← θ − lr · g
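A minimal sketch of the loop above in Python/NumPy, on a toy least-squares problem (the dataset, learning rate, batch size, and fixed-step stopping rule are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # toy dataset (assumed)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)

lr = 0.1                                      # learning rate
w = np.zeros(3)                               # initial guess for params
for step in range(500):                       # stop criterion: fixed step count
    idx = rng.choice(len(X), size=32, replace=False)   # sample a mini-batch
    grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / 32   # gradient estimate
    w = w - lr * grad                                  # apply update
```

The mini-batch gradient is an unbiased estimate of the full-dataset gradient, so on average each step still moves downhill while costing only a fraction of a full pass over the data.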
Momentum
Inspired by physics: the update also depends on the previous "velocity", an exponentially decaying average of past gradients.

Input: learning rate lr, initial guess for params θ, momentum factor α, initial velocity v
while not stop criterion:
    compute gradient estimate g on a mini-batch
    update velocity v ← α · v − lr · g
    apply update θ ← θ + v
There is a variant called Nesterov momentum; it resembles an implicit Euler method: we compute the gradient at the position already updated by the previous velocity:

Input: learning rate lr, initial guess for params θ, momentum factor α, initial velocity v
while not stop criterion:
    compute approximate next position θ̃ ← θ + α · v
    compute gradient estimate g at θ̃
    update velocity v ← α · v − lr · g
    apply update θ ← θ + v
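Only one line changes relative to standard momentum: the gradient is evaluated at the look-ahead point instead of the current parameters. A sketch on the same assumed toy problem:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # toy dataset (assumed)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)

lr, alpha = 0.05, 0.9                         # learning rate, momentum factor
w = np.zeros(3)                               # initial guess for params
v = np.zeros(3)                               # initial velocity
for step in range(500):                       # stop criterion: fixed step count
    idx = rng.choice(len(X), size=32, replace=False)
    w_next = w + alpha * v                    # approximate next position
    grad = 2 * X[idx].T @ (X[idx] @ w_next - y[idx]) / 32  # gradient at look-ahead
    v = alpha * v - lr * grad                 # update velocity
    w = w + v                                 # apply update
```

The look-ahead acts as a correction: if the velocity is about to overshoot, the gradient at the predicted position already points back, so the update is damped before the overshoot happens.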