AdaGrad

Individually adapts the learning rate of each parameter, scaling it inversely with the square root of the accumulated sum of past squared gradients.

Input: global learning rate $\epsilon$, initial guess for the parameters $\theta$, small constant $\delta$ for numerical stability. Initialize the gradient accumulation variable $r = 0$.

While the stopping criterion is not met:
- compute the gradient estimate $g$
- accumulate the squared gradient: $r \leftarrow r + g \odot g$
- compute the update: $\Delta\theta = -\dfrac{\epsilon}{\delta + \sqrt{r}} \odot g$
- apply the update: $\theta \leftarrow \theta + \Delta\theta$

With $\odot$ we mean that the product is taken element-wise; the square root and division above are likewise applied element-wise.
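
A minimal NumPy sketch of the loop above; the function signature, default hyperparameter values, and the toy quadratic objective are illustrative assumptions, not part of the original text.

```python
import numpy as np

def adagrad(grad_fn, theta0, lr=0.1, delta=1e-7, n_steps=100):
    """AdaGrad sketch: r accumulates element-wise squared gradients,
    so coordinates with large past gradients take smaller steps."""
    theta = np.asarray(theta0, dtype=float).copy()
    r = np.zeros_like(theta)                    # gradient accumulation variable
    for _ in range(n_steps):
        g = grad_fn(theta)                      # gradient estimate
        r += g * g                              # accumulate squared gradient (element-wise)
        theta -= lr / (delta + np.sqrt(r)) * g  # compute and apply the update
    return theta

# Toy usage: minimize f(x, y) = x^2 + 10*y^2 starting from (1, 1).
grad = lambda th: np.array([2.0 * th[0], 20.0 * th[1]])
print(adagrad(grad, [1.0, 1.0]))
```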

RMSProp

The same as AdaGrad, but it introduces a decay parameter $\rho$ to exponentially forget the previous squared gradients:

$r \leftarrow \rho\, r + (1 - \rho)\, g \odot g$

The update itself is then computed exactly as in AdaGrad.
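
A sketch of the corresponding loop, under the same illustrative assumptions as the AdaGrad example (only the accumulation line changes):

```python
import numpy as np

def rmsprop(grad_fn, theta0, lr=0.01, rho=0.9, delta=1e-6, n_steps=500):
    """RMSProp sketch: identical to AdaGrad except that the squared-gradient
    accumulator is an exponential moving average controlled by rho."""
    theta = np.asarray(theta0, dtype=float).copy()
    r = np.zeros_like(theta)                    # decaying squared-gradient average
    for _ in range(n_steps):
        g = grad_fn(theta)                      # gradient estimate
        r = rho * r + (1.0 - rho) * g * g       # exponentially forget old squared gradients
        theta -= lr / (delta + np.sqrt(r)) * g  # same per-parameter scaling as AdaGrad
    return theta
```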

Adaptive Moments: Adam

Adam is a 2014 update to RMSProp that combines it with the main feature of the momentum method: it keeps running averages, with exponential forgetting, of both the gradients and the second moments (element-wise squares) of the gradients.

Input: global learning rate $\epsilon$, initial guess for the parameters $\theta$, small constant $\delta$ for numerical stability, forgetting rates $\rho_1$ and $\rho_2$. Initialize the velocity $s = 0$, the gradient accumulation variable $r = 0$, and the time step $t = 0$.

While the stopping criterion is not met:
- compute the gradient estimate $g$ and set $t \leftarrow t + 1$
- accumulate the squared gradient: $r \leftarrow \rho_2\, r + (1 - \rho_2)\, g \odot g$
- update the velocity: $s \leftarrow \rho_1\, s + (1 - \rho_1)\, g$
- correct bias: $\hat{s} = \dfrac{s}{1 - \rho_1^t}$
- correct bias: $\hat{r} = \dfrac{r}{1 - \rho_2^t}$
- compute the update: $\Delta\theta = -\epsilon\, \dfrac{\hat{s}}{\sqrt{\hat{r}} + \delta}$
- apply the update: $\theta \leftarrow \theta + \Delta\theta$
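
A sketch of the Adam loop, again with assumed function names and the commonly used default hyperparameters ($\rho_1 = 0.9$, $\rho_2 = 0.999$) standing in for whatever values the original text used:

```python
import numpy as np

def adam(grad_fn, theta0, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8, n_steps=1000):
    """Adam sketch: exponential moving averages of the gradient (s, the
    velocity) and of its element-wise square (r), with bias correction."""
    theta = np.asarray(theta0, dtype=float).copy()
    s = np.zeros_like(theta)                        # first-moment estimate (velocity)
    r = np.zeros_like(theta)                        # second-moment estimate
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)                          # gradient estimate
        r = rho2 * r + (1.0 - rho2) * g * g         # accumulate squared gradient
        s = rho1 * s + (1.0 - rho1) * g             # update velocity
        s_hat = s / (1.0 - rho1 ** t)               # bias-corrected first moment
        r_hat = r / (1.0 - rho2 ** t)               # bias-corrected second moment
        theta -= lr * s_hat / (np.sqrt(r_hat) + delta)  # compute and apply the update
    return theta
```

The bias correction matters early on: since $s$ and $r$ start at zero, the uncorrected averages are biased toward zero for small $t$, and dividing by $1 - \rho^t$ compensates for that.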