ReLU is the most popular and most widely used activation function:
Pros:
- Computationally less expensive than other activation functions
- Fewer neurons are activated, leading to network sparsity and thus computational efficiency
- Avoids the vanishing gradient: since its derivative is always 1 in the positive input range, gradients flow well along the active paths of neurons (see the sketch after this list).
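As a quick illustration of the sparsity and constant-gradient points above, here is a minimal NumPy sketch of ReLU and its derivative (the names `relu` and `relu_grad` are just for this example):

```python
import numpy as np

def relu(x):
    # max(0, x): negative inputs are clamped to 0 (inactive / sparse units)
    return np.maximum(0.0, x)

def relu_grad(x):
    # derivative is 1 for x > 0 and 0 otherwise, so gradients pass
    # through active units unchanged
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]  -> sparse activations
print(relu_grad(x))  # [0.  0.  0.  1.  1. ]  -> gradient 1 on active paths
```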
Cons:
- The dying ReLU: if too many pre-activations are negative, most of the network is inactive and the network is unable to learn further
Possible causes of the dying ReLU:
- A high learning rate: a large update can push the weights far into negative values, so the scalar product (pre-activation) becomes negative for most inputs.
- A large negative bias: it shifts the pre-activation below zero regardless of the input, as in the sketch below.
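A hedged sketch of the dying-ReLU effect under the large-negative-bias cause (the variables `W`, `b`, `x` and the bias value are illustrative, not from the notes above): every pre-activation is negative, the layer outputs only zeros, and the gradient reaching the weights is zero, so no further learning can occur.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 10))            # a batch of 32 inputs
W = rng.normal(scale=0.1, size=(10, 5))  # layer weights
b = np.full(5, -100.0)                   # large negative bias (illustrative)

z = x @ W + b           # pre-activations: all strongly negative
a = np.maximum(0.0, z)  # ReLU output: all zeros -> "dead" units

# Backprop through ReLU: the upstream gradient is masked by (z > 0)
upstream = np.ones_like(a)
grad_z = upstream * (z > 0)
grad_W = x.T @ grad_z

print(a.sum())               # 0.0 -> every unit is inactive
print(np.abs(grad_W).sum())  # 0.0 -> weights get no gradient, learning stops
```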