The most popular and most widely used activation function:

Pros:

  • Computationally less expensive than other activation functions
  • Fewer neurons are activated, leading to network sparsity and thus computational efficiency
  • Avoids the vanishing gradient: since its derivative is always 1 in the positive input range, gradients flow well on the active paths of neurons (see the sketch after this list)

Cons:

  • The dying ReLU: if too many pre-activations are negative, most of the network is inactive and the network is unable to learn further

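A minimal NumPy sketch of these points (function and variable names are illustrative, not from any particular framework): ReLU and its derivative, the sparsity of the activations, and the constant gradient of 1 on active paths.

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x): a single elementwise threshold, so it is cheap
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 for negative inputs
    return (x > 0).astype(float)

# Standard-normal pre-activations: roughly half are negative, so roughly
# half of the units output exactly zero -> sparse activations.
z = np.random.randn(10)
a = relu(z)
print("activations:       ", a)
print("fraction inactive: ", np.mean(a == 0.0))

# On active paths the local gradient is exactly 1, so backpropagated
# gradients pass through unchanged instead of shrinking layer after
# layer as with sigmoid or tanh.
print("local gradients:   ", relu_grad(z))
```
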
Possible causes of the dying ReLU:

  • A high learning rate: a large update can push the weights to negative values, making the overall scalar product (the pre-activation) negative.
  • A large negative bias: this obviously drives the pre-activation below zero (illustrated in the sketch below).
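
A rough illustration of the bias case, using a single hypothetical ReLU unit on toy data: with a large negative bias the pre-activation is negative for every input, the unit outputs zero, its local gradient is zero, and no update can ever flow back through it again.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))       # toy inputs (hypothetical data)
w = rng.normal(size=3)              # weights of a single ReLU unit
b = -10.0                           # assumed large negative bias

z = X @ w + b                       # pre-activation
a = np.maximum(0.0, z)              # ReLU output

# With this bias the pre-activation is negative for (almost) every input,
# so the unit outputs 0 everywhere:
print("fraction of inputs with zero output:", np.mean(z <= 0))

# Since dReLU/dz = 0 wherever z <= 0, no gradient flows back to w or b,
# so gradient descent can never move the unit out of this region:
# the neuron has "died". A too-large learning-rate step that pushes
# w and b into this regime has the same effect.
local_grad = (z > 0).astype(float)
print("samples with nonzero gradient:", int(local_grad.sum()))
```

The same mechanism explains the learning-rate case: one oversized update that lands the weights and bias in the all-negative regime leaves the unit with zero gradient, so it cannot recover.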