A simple solution to the proble of vanishing gradient, since the problem is that chaining a lot of function leads to a a gradient of lots of factors, we use that differentiation is linear:

It’s called residual since the layers now learns the residual, the difference beetween the input and the (desired) output.