A computer program is said to learn from experience with respect to some class of tasks and performance measure , if its performance at tasks , as measured by , improves with experience .
Classification task
The objective of the model is to learn the mapping function that maps inputs to a set of labels . The expirience here is a collection of data, namly the pairs . A performance measure is some kind of distance .
The classical statistical perspective
The main assumption is that our data follows a (true) probability distribution. Let’s imagine a family of statistical models, where each member is specified by a set of real parameters
Our assumption is that our data is i.i.d like
our task is then to find the (true) , that is find the correct set of parameters. To do this, we define an estimator based on the data
given a distribution with parameters , we can of course compute the expected value conditioned on (imagine is a picture, a label)
then we can define an empirical cost to measure the performance of our model
so we the average distance (averaging on all our dataset of samples) a choice of distance. It’s clear that using the true parameters , this empirical cost would tend to zero in the limi .
It makes sense to choose as our best guess for the parametes
Technical difficulties of this approach
- Family of model not known!
- The minimization task is usually hard ( non convex)
- We need big enough to be in the SLLN, i.e empirical cost need to be close to the (true) population cost.