In deep learning we stack a considerable number of layers, so the gradient depends on more and more factors. A natural problem therefore arises: the gradient can tend to zero or to infinity exponentially fast.
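To make this concrete, here is a sketch of the chain-rule product behind that statement; the notation $h^{(l)}$, $W^{(l)}$, $\sigma$ and $L$ is ours, not from the text. For a network with layers $h^{(l)} = \sigma\big(W^{(l)} h^{(l-1)}\big)$, $l = 1, \dots, L$, backpropagation gives

\[
\frac{\partial \mathcal{L}}{\partial h^{(0)}}
= \frac{\partial \mathcal{L}}{\partial h^{(L)}}
\prod_{l=L}^{1} \frac{\partial h^{(l)}}{\partial h^{(l-1)}},
\]

a product of $L$ per-layer factors: if these factors are typically smaller than $1$ in norm the product shrinks exponentially in $L$ (vanishing gradient), and if they are typically larger than $1$ it grows exponentially (exploding gradient).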
Consider for example the sigmoid activation function. For inputs with large $|x|$ the function is almost flat, so its derivative is almost zero (it is always a value in $(0, 1/4]$). If we stack many layers, the gradient becomes very small, it vanishes, and therefore the network is hard to train.
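The following is a minimal numpy sketch (not from the original text) that illustrates the effect numerically: it multiplies sigmoid derivatives along a chain of layers and shows the product shrinking exponentially with depth. The depth of 50 and the distribution of the pre-activations are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # at most 1/4, attained at x = 0

# Toy illustration: the backpropagated gradient through a chain of
# sigmoid units is (roughly) a product of per-layer derivatives.
# Since every factor is at most 1/4, the product shrinks
# exponentially with depth.
rng = np.random.default_rng(0)
pre_activations = rng.normal(0.0, 2.0, size=50)  # hypothetical pre-activations

grad = 1.0
for depth, x in enumerate(pre_activations, start=1):
    grad *= sigmoid_prime(x)
    if depth % 10 == 0:
        print(f"depth {depth:2d}: gradient factor ~ {grad:.3e}")
```

Running it, the factor already drops by many orders of magnitude after a few tens of layers, which is the vanishing-gradient behaviour described above.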