Normalize to zero mean and one variance each feature across the batch. It’s a method of adaptive reparametrization.

The normalzation operations are considered in the computational graph of the network! for each feature in the of size :

is a small value (e.g. ) added for numerical stability (at zero is non differentiable). So the rescaled features are

During inference time, the mean and variance may be replaced by the values obtained one the whole training dataset.