The Bayesian setting

As always in Bayesian statistics, one exploits Bayes' theorem.

The task is to find the probability of the parameters $\theta$, given a finite number of observations.
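
In symbols, with $\mathcal{D}$ denoting the finite set of observations (the notation here is an assumption, kept consistent with the rest of these notes):

$$
p(\theta \mid \mathcal{D}) \;=\; \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})},
$$

i.e. the posterior is the likelihood times the prior, divided by the evidence $p(\mathcal{D})$.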

The maximum likelihood method (MLE)

This is a frequentist approach (it implicitly assumes a flat prior). The task is the estimation of a pdf based on a finite number of observations (data) of a random variable $x$.

$\mathcal{D} = \{x_1, \dots, x_N\}$ is our data, sampled from an unknown pdf $p(x)$.

The likelihood of the data given a parameter $\theta$ for our family of distributions $q(x \mid \theta)$ is

$$
L(\theta) \;=\; q(\mathcal{D} \mid \theta) \;=\; \prod_{n=1}^{N} q(x_n \mid \theta),
$$

since the data are i.i.d. We now use the standard trick of taking the logarithm,

$$
\log L(\theta) \;=\; \sum_{n=1}^{N} \log q(x_n \mid \theta).
$$

Obviously, maximizing the likelihood is equivalent to maximizing the function just introduced. Now we rescale it by a factor $-1/N$ and add a constant that does not depend on $\theta$ (minus the empirical entropy of the true distribution $p$):

$$
-\frac{1}{N} \sum_{n=1}^{N} \log q(x_n \mid \theta) \;+\; \frac{1}{N} \sum_{n=1}^{N} \log p(x_n)
\;=\; \frac{1}{N} \sum_{n=1}^{N} \log \frac{p(x_n)}{q(x_n \mid \theta)}
\;\equiv\; \hat D_{\mathrm{KL}}\big( p \,\|\, q(\cdot \mid \theta) \big);
$$

the empirical KL divergence has appeared. In conclusion, maximizing the likelihood is equivalent to minimizing the empirical Kullback–Leibler divergence.
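
As a concrete illustration, here is a minimal numerical sketch of this equivalence, assuming a Gaussian model family $q(x \mid \theta) = \mathcal N(x; \theta, 1)$ (the family, sample size and seed are illustrative choices, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=500)      # samples from the unknown p(x)

# Empirical NLL of the family q(x | theta) = N(x; theta, 1):
# -(1/N) sum_n log q(x_n | theta) = 0.5 * mean((x_n - theta)^2) + 0.5 * log(2 pi).
thetas = np.linspace(-5.0, 5.0, 1001)
nll = np.array([0.5 * np.mean((data - t) ** 2) + 0.5 * np.log(2 * np.pi) for t in thetas])

# Minimizing the empirical NLL (= empirical KL up to a theta-independent constant)
# recovers the sample mean, i.e. the well-known Gaussian MLE.
theta_mle = thetas[np.argmin(nll)]
print(theta_mle, data.mean())
```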

Bayesian learning (MAP)

Again, suppose we have a set of noisy data $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, generated by an unknown source.

Here we also need a prior distribution $p(w)$ for the weights $w$; this reflects prior knowledge/prejudice.

The goal is to find the posterior distribution of the weights $p(w \mid \mathcal{D})$, that is, our new knowledge about the weights given the data $\mathcal{D}$.

The posterior is computed using Bayes' theorem:

$$
p(w \mid \mathcal{D}) \;=\; \frac{p(\mathcal{D} \mid w)\, p(w)}{p(\mathcal{D})},
$$

but we don't know the true data distribution $p(\mathcal{D})$ (this is in fact our goal). However, since the evidence can be written as

$$
p(\mathcal{D}) \;=\; \int p(\mathcal{D} \mid w)\, p(w)\, dw,
$$

using this we get

$$
p(w \mid \mathcal{D}) \;=\; \frac{p(\mathcal{D} \mid w)\, p(w)}{\int p(\mathcal{D} \mid w')\, p(w')\, dw'}.
$$

Remark Given a family of pdfs $p(y \mid x, w)$ and a choice of the parameters $w$, we can compute the data likelihood $p(\mathcal{D} \mid w)$.

We can schematize the procedure in 3 steps:

  1. Definitions: define the parametrized model $y = f(x; w)$ and the prior distribution $p(w)$.
  2. Model translation: convert the model into a standard probabilistic form, that is, specify the likelihood of finding the output $y$ upon input $x$, given the parameters $w$, i.e. $p(y \mid x, w)$.
  3. Posterior distribution: compute the data likelihood $p(\mathcal{D} \mid w) = \prod_{n=1}^{N} p(y_n \mid x_n, w)$ as a function of $w$ and plug it into Bayes' theorem to obtain the posterior $p(w \mid \mathcal{D})$ (a minimal numerical sketch follows the list).
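
The sketch below runs the three steps on a grid of parameter values, assuming a linear model $f(x; w) = w x$, Gaussian observation noise of known standard deviation and a standard normal prior; all of these specific choices are illustrative assumptions, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(1)
noise_std = 0.3

# Step 1: parametrized model f(x; w) = w * x and prior p(w) = N(0, 1).
def f(x, w):
    return w * x

def log_prior(w):
    return -0.5 * w**2 - 0.5 * np.log(2 * np.pi)

# Noisy data from an "unknown" source (here simulated with true w = 1.5).
x = rng.uniform(-1.0, 1.0, size=20)
y = f(x, 1.5) + rng.normal(scale=noise_std, size=20)

# Step 2: probabilistic form, likelihood p(y | x, w) = N(y; f(x; w), noise_std^2).
def log_likelihood(w):
    r = y - f(x, w)
    return np.sum(-0.5 * (r / noise_std) ** 2 - np.log(noise_std * np.sqrt(2 * np.pi)))

# Step 3: posterior p(w | D) proportional to p(D | w) p(w), normalized on a grid.
w_grid = np.linspace(-3.0, 3.0, 2001)
dw = w_grid[1] - w_grid[0]
log_post = np.array([log_likelihood(w) for w in w_grid]) + log_prior(w_grid)
post = np.exp(log_post - log_post.max())
post /= post.sum() * dw

print("MAP estimate:", w_grid[np.argmax(post)])
```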

Remarks

  • No need for cross-validation
  • Returns an error measure (a distribution over $w$, namely the posterior)
  • This gives an error bar on the estimate (see the continuation of the sketch below)
  • Traditional learning via gradient descent is recovered (see below).
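
Continuing the illustrative grid sketch above (it reuses `w_grid`, `post` and `dw` defined there), the posterior itself provides the error measure mentioned in these remarks, with no held-out data or cross-validation involved:

```python
# Posterior mean and standard deviation: point estimate plus an error bar,
# obtained from the posterior alone (no cross-validation).
w_mean = np.sum(w_grid * post) * dw
w_std = np.sqrt(np.sum((w_grid - w_mean) ** 2 * post) * dw)
print(f"w = {w_mean:.3f} +/- {w_std:.3f}")
```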

Example Let's define a family of deterministic models $y = f(x; w)$ with just one parameter $w$. Let's choose a normal prior $w \sim \mathcal N(\mu_0, \sigma_0^2)$. We have only one data point $(x_1, y_1)$.

Convert the model into probabilistic form: since the model is deterministic, the likelihood is a Dirac delta,

$$
p(y \mid x, w) \;=\; \delta\big( y - f(x; w) \big);
$$

compute the posterior:

$$
p(w \mid \mathcal{D}) \;\propto\; p(y_1 \mid x_1, w)\, p(w) \;=\; \delta\big( y_1 - f(x_1; w) \big)\, \mathcal N(w; \mu_0, \sigma_0^2);
$$

carrying out the computations (normalizing the product above), one finds that the posterior is concentrated on the set $\{ w : f(x_1; w) = y_1 \}$,

so the posterior is a delta, and the value of $w$ that maximizes it (the only possible choice) is the $w^*$ solving $f(x_1; w^*) = y_1$.
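
One way to make the delta likelihood precise (a sketch, under the assumption that the observation noise is Gaussian with a variance that is then sent to zero):

$$
p_\sigma(y \mid x, w) \;=\; \mathcal N\big( y;\, f(x; w),\, \sigma^2 \big) \;\xrightarrow{\;\sigma \to 0\;}\; \delta\big( y - f(x; w) \big),
$$

so for small $\sigma$ the posterior $p_\sigma(w \mid \mathcal{D}) \propto \exp\!\big( -(y_1 - f(x_1; w))^2 / 2\sigma^2 \big)\, \mathcal N(w; \mu_0, \sigma_0^2)$ concentrates, as $\sigma \to 0$, on the solutions of $f(x_1; w) = y_1$.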

In traditional statistical learning we minimize the empirical cost plus some regularization term that prevents overfitting,

$$
\hat w \;=\; \arg\min_w \Big[\, C(w) + \lambda\, \|w\|^2 \,\Big],
$$

where the empirical cost is

$$
C(w) \;=\; \frac{1}{N} \sum_{n=1}^{N} \big( y_n - f(x_n; w) \big)^2.
$$

On the other hand, in Bayesian learning the most probable weights (the MAP estimate) are found by minimizing the negative log-posterior,

$$
-\log p(w \mid \mathcal{D}) \;=\; -\log p(\mathcal{D} \mid w) \;-\; \log p(w) \;+\; \text{const},
$$

where $\text{const}$ collects terms that do not depend on $w$. If we assume that our data are perturbed by a noise $\varepsilon$ with distribution $p_\varepsilon$, i.e. $y = f(x; w) + \varepsilon$, then

$$
p(\mathcal{D} \mid w) \;=\; \prod_{n=1}^{N} p_\varepsilon\big( y_n - f(x_n; w) \big).
$$

Let's see what we are trying to minimize:

$$
-\frac{1}{N} \log p(w \mid \mathcal{D}) \;=\; -\frac{1}{N} \sum_{n=1}^{N} \log p_\varepsilon\big( y_n - f(x_n; w) \big) \;-\; \frac{1}{N} \log p(w) \;+\; \text{const};
$$

we have added the factor $1/N$ just to see clearly the relationship with classical learning. Then, if we choose a Gaussian noise distribution $p_\varepsilon(\varepsilon) \propto e^{-\varepsilon^2 / 2\sigma^2}$ and a Gaussian prior on the weights $p(w) \propto e^{-\|w\|^2 / 2\sigma_w^2}$, we are back in the setting of classical learning.

Remark This suggests how to choose the regularization constant: with the conventions above, minimizing the MAP objective is equivalent to minimizing $C(w) + \lambda \|w\|^2$ with $\lambda = \sigma^2 / (N \sigma_w^2)$, the ratio between the noise variance and the prior variance (scaled by the dataset size).
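
A small numerical check of this correspondence, under the same illustrative assumptions as before (linear model $f(x; w) = w x$, Gaussian noise, Gaussian prior): the MAP estimate found by grid search should coincide with the closed-form ridge-regression solution for $\lambda = \sigma^2 / (N \sigma_w^2)$.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, sigma_w, N = 0.3, 1.0, 50          # noise std, prior std, dataset size

x = rng.uniform(-1.0, 1.0, size=N)
y = 1.5 * x + rng.normal(scale=sigma, size=N)

# MAP objective: (1 / 2 sigma^2) sum_n (y_n - w x_n)^2 + w^2 / (2 sigma_w^2), on a grid.
w_grid = np.linspace(-3.0, 3.0, 20001)
resid = y[None, :] - w_grid[:, None] * x[None, :]
map_obj = (resid**2).sum(axis=1) / (2 * sigma**2) + w_grid**2 / (2 * sigma_w**2)
w_map = w_grid[np.argmin(map_obj)]

# Classical ridge regression: minimize (1/N) sum_n (y_n - w x_n)^2 + lam * w^2
# with lam = sigma^2 / (N sigma_w^2); closed-form solution below.
lam = sigma**2 / (N * sigma_w**2)
w_ridge = np.sum(x * y) / (np.sum(x**2) + N * lam)

print(w_map, w_ridge)   # the two estimates agree up to the grid resolution
```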

Gradient descent to minimize D_KL

The first algorithm that comes to mind is gradient descent: we fix a learning rate $\eta > 0$ and update the parameter vector as

$$
\theta_{t+1} \;=\; \theta_t \;-\; \eta\, \nabla_\theta D_{\mathrm{KL}}\big( p \,\|\, q(\cdot \mid \theta) \big) \Big|_{\theta = \theta_t};
$$

for sufficiently small $\eta$ the divergence cannot increase; indeed, writing $D_{\mathrm{KL}}(\theta)$ for $D_{\mathrm{KL}}\big( p \,\|\, q(\cdot \mid \theta) \big)$, to first order

$$
D_{\mathrm{KL}}(\theta_{t+1}) \;\approx\; D_{\mathrm{KL}}(\theta_t) + \nabla_\theta D_{\mathrm{KL}}(\theta_t) \cdot (\theta_{t+1} - \theta_t) \;=\; D_{\mathrm{KL}}(\theta_t) - \eta\, \big\| \nabla_\theta D_{\mathrm{KL}}(\theta_t) \big\|^2,
$$

from which it follows that

$$
D_{\mathrm{KL}}(\theta_{t+1}) \;\le\; D_{\mathrm{KL}}(\theta_t).
$$

Remark The true distribution $p$ is unknown, but it doesn't appear in the gradient of the KL divergence:

$$
\nabla_\theta D_{\mathrm{KL}}\big( p \,\|\, q(\cdot \mid \theta) \big) \;=\; -\, \mathbb{E}_p\big[ \nabla_\theta \log q(x \mid \theta) \big] \;\approx\; -\frac{1}{N} \sum_{n=1}^{N} \nabla_\theta \log q(x_n \mid \theta),
$$

since the entropy term $\mathbb{E}_p[\log p]$ does not depend on $\theta$, and the remaining expectation can be estimated from the samples alone.
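
A minimal sketch of this update rule, reusing the illustrative Gaussian family $q(x \mid \theta) = \mathcal N(x; \theta, 1)$ from the MLE section; note that the gradient only needs the samples, never the true $p$:

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(loc=2.0, scale=1.0, size=500)   # samples from the unknown p(x)

# For q(x | theta) = N(x; theta, 1): grad_theta log q(x | theta) = x - theta,
# so the gradient of the empirical KL is -(1/N) sum_n (x_n - theta) = theta - mean(x).
eta = 0.1                        # learning rate
theta = -4.0                     # arbitrary starting point
for _ in range(200):
    grad_kl = theta - data.mean()
    theta -= eta * grad_kl       # theta_{t+1} = theta_t - eta * grad D_KL

print(theta, data.mean())        # gradient descent converges to the MLE (sample mean)
```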

The maximum entropy principle

Idea: to find the unknown pdf $p(x)$, look for the distribution that maximizes the entropy $H[p] = -\int p(x) \log p(x)\, dx$, given information (constraints) on the mean values of some functions $f_k(x)$, i.e. $\mathbb{E}_p[ f_k(x) ] = \mu_k$.

To solve this constrained optimization problem we can use Lagrange multipliers.
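
A brief sketch of how the multipliers enter, using the notation introduced above (the names $\mu_k$, $\lambda_k$ and $Z$ are assumptions made here for concreteness):

$$
\mathcal{L}[p] \;=\; -\int p(x) \log p(x)\, dx \;+\; \lambda_0 \Big( \int p(x)\, dx - 1 \Big) \;+\; \sum_k \lambda_k \Big( \int f_k(x)\, p(x)\, dx - \mu_k \Big),
$$

and setting the functional derivative $\delta \mathcal{L} / \delta p(x)$ to zero yields an exponential-family solution,

$$
p(x) \;=\; \frac{1}{Z(\lambda)} \exp\!\Big( \sum_k \lambda_k f_k(x) \Big), \qquad Z(\lambda) \;=\; \int \exp\!\Big( \sum_k \lambda_k f_k(x) \Big)\, dx,
$$

whose multipliers $\lambda_k$ are then fixed by imposing the constraints $\mathbb{E}_p[ f_k(x) ] = \mu_k$.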