A computer program is said to learn from experience with respect to some class of tasks and performance measure , if its performance at tasks , as measured by , improves with experience .

Classification task

The objective of the model is to learn the mapping function that maps inputs to a set of labels . The expirience here is a collection of data, namly the pairs . A performance measure is some kind of distance .

The classical statistical perspective

The main assumption is that our data follows a (true) probability distribution. Let’s imagine a family of statistical models, where each member is specified by a set of real parameters

Our assumption is that our data is i.i.d like

our task is then to find the (true) , that is find the correct set of parameters. To do this, we define an estimator based on the data

given a distribution with parameters , we can of course compute the expected value conditioned on (imagine is a picture, a label)

then we can define an empirical cost to measure the performance of our model

so we the average distance (averaging on all our dataset of samples) a choice of distance. It’s clear that using the true parameters , this empirical cost would tend to zero in the limi .

It makes sense to choose as our best guess for the parametes

Technical difficulties of this approach

  1. Family of model not known!
  2. The minimization task is usually hard ( non convex)
  3. We need big enough to be in the SLLN, i.e empirical cost need to be close to the (true) population cost.