In deep learning, since we stack a considerable amount of layers, the gradient depends on more and more factor. So a natural problem arises: the gradient can tendo to zero or infinity exponetially fast.

Consider for example the sigmoid activation function. For values with big the function is almost flat, so the gradient is almost zero (it’s always a value in ). If we stack a lot of layers, the gradient becomes very small, it vanishes, therefore the network is hard to train.