In a supervised setting, the goal is to produce an embedding such that a fixed distance function is small for examples in the same class and large for examples in different classes.
Contrastive methods sometimes don't work well on their own: for example, you can end up with high variance between samples of the same class. A simple idea is to penalize this directly in the loss, which gives the center loss:

$$\mathcal{L}_{\text{center}} = \frac{1}{2} \sum_{i=1}^{m} \left\lVert f(x_i) - c_{y_i} \right\rVert_2^2$$

where $c_{y_i}$ is the center of class $y_i$, computed by averaging the other examples of that class in the batch.
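Below is a minimal sketch of this loss in PyTorch, assuming the embeddings and integer class labels for a batch are already available. For simplicity the class center is estimated as the per-class mean over the whole batch (including the sample itself), rather than a leave-one-out average or the learnable running centers used in the original center-loss formulation.

```python
import torch

def center_loss(embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Penalize the squared distance of each embedding to its class center.

    embeddings: (batch_size, dim) tensor of embedding vectors f(x_i)
    labels:     (batch_size,) tensor of integer class labels y_i
    """
    loss = torch.tensor(0.0, device=embeddings.device)
    total = 0
    for c in labels.unique():
        mask = labels == c
        class_embs = embeddings[mask]
        # Batch estimate of the class center c_y
        center = class_embs.mean(dim=0)
        loss = loss + ((class_embs - center) ** 2).sum()
        total += int(mask.sum())
    # 1/2 factor as in the usual center-loss formulation
    return loss / (2 * total)
```

In practice this term is added to the main classification or contrastive objective with a small weight, so the embedding stays discriminative between classes while the intra-class variance is pulled down.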