The Bayesian setting
As always in Bayesian statistics, one exploits Bayes' theorem.
The task is to find the probability of the unknown quantity, given a finite number of observations.
The maximum likelihood method (MLE)
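This is a frequentist approach (from the Bayesian viewpoint it coincides with MAP estimation under a flat prior). The goal is to estimate a pdf from a finite number of observations (data) of a random variable $X$.
$D = \{x_1, \dots, x_N\}$ is our data, sampled from an unknown pdf $p(x)$.
The likelihood of the data given a parameter $\theta$ for our family of distributions $q_\theta(x)$ is
$$L(\theta) = \prod_{i=1}^{N} q_\theta(x_i),$$
since the data are i.i.d. We now use the standard trick of taking the logarithm:
$$\log L(\theta) = \sum_{i=1}^{N} \log q_\theta(x_i).$$
Clearly, maximizing the likelihood is equivalent to maximizing the function just introduced. We now rescale by a factor $1/N$ and add a constant, the empirical entropy of the true distribution, $\hat{H}[p] = -\frac{1}{N}\sum_{i=1}^{N} \log p(x_i)$:
$$\frac{1}{N}\log L(\theta) + \hat{H}[p] = -\frac{1}{N}\sum_{i=1}^{N} \log\frac{p(x_i)}{q_\theta(x_i)} = -\hat{D}_{KL}(p \,\|\, q_\theta).$$
The empirical KL divergence has appeared. In conclusion, maximizing the likelihood is equivalent to minimizing the empirical Kullback-Leibler divergence.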
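The equivalence can be checked numerically. The following is a minimal sketch under assumptions that are not in the notes: the true pdf is taken to be $N(1, 1)$, the model family is $N(\theta, 1)$, and the names `log_gauss`, `thetas`, `avg_loglik` and `emp_kl` are illustrative. Since the data are simulated, the true $p$ is available and the empirical KL divergence can be evaluated directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: the "unknown" true pdf p is N(1, 1); the model family
# q_theta is N(theta, 1) with a single parameter theta (the mean).
x = rng.normal(loc=1.0, scale=1.0, size=500)

def log_gauss(x, mean):
    """Log-density of a unit-variance normal distribution."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - mean) ** 2

thetas = np.linspace(-1.0, 3.0, 401)

# Average log-likelihood (1/N) log L(theta) for each candidate theta.
avg_loglik = np.array([log_gauss(x, t).mean() for t in thetas])

# Empirical KL divergence (1/N) sum_i log[p(x_i) / q_theta(x_i)].
# It uses the true p, which is only available because the data are simulated.
log_p = log_gauss(x, 1.0)
emp_kl = np.array([(log_p - log_gauss(x, t)).mean() for t in thetas])

theta_mle = thetas[np.argmax(avg_loglik)]   # maximizer of the likelihood
theta_kl = thetas[np.argmin(emp_kl)]        # minimizer of the empirical KL
print(theta_mle, theta_kl)
```

The two curves differ only by a sign and a constant that does not depend on `theta`, so the maximizer of one is the minimizer of the other.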
Bayesian learning (MAP)
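Again, suppose we have a set of noisy data $D = \{(x_i, y_i)\}_{i=1}^{N}$, generated by an unknown source.
Here we also need a prior distribution $p(w)$ for the weights $w$; this reflects prior knowledge/prejudice.
The goal is to find the posterior distribution of the weights $p(w \mid D)$, that is, our new knowledge about the weights given data $D$.
The posterior is computed using Bayes' theorem:
$$p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)}.$$
We do not know the true data distribution $p(D)$ (finding it is in fact our goal), but since
$$p(D) = \int p(D \mid w)\, p(w)\, dw,$$
using this we get
$$p(w \mid D) = \frac{p(D \mid w)\, p(w)}{\int p(D \mid w')\, p(w')\, dw'}.$$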
Remark: Given a family of pdfs and a value of the parameters $w$, we can compute the likelihood $p(D \mid w)$.
The procedure can be summarized in three steps:
- Definitions: define the parametrized model $y = f(x; w)$ and the prior distribution $p(w)$.
- Model translation: convert the model to a standard probabilistic form, that is, specify the likelihood of finding $y$ upon input $x$, given parameters $w$, i.e. $p(y \mid x, w)$.
- Posterior distribution: compute the data likelihood as a function of $w$, $p(D \mid w) = \prod_{i} p(y_i \mid x_i, w)$, and from it the posterior via Bayes' theorem.
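The three steps above can be sketched numerically on a grid. This is only an illustration under assumptions that are not in the notes: a linear model $f(x; w) = w x$, a standard normal prior, Gaussian noise with known standard deviation `sigma`, and simulated data. All names (`f`, `w_grid`, `log_lik_point`, ...) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Step 1 (definitions): hypothetical model y = w * x and prior p(w) = N(0, 1).
def f(x, w):
    return w * x

w_grid = np.linspace(-3.0, 3.0, 601)
dw = w_grid[1] - w_grid[0]
log_prior = -0.5 * w_grid**2          # log N(0, 1) up to an additive constant

# Simulated data from w_true = 1.5 with Gaussian noise (assumed sigma = 0.5).
w_true, sigma = 1.5, 0.5
x = rng.uniform(-1.0, 1.0, size=20)
y = f(x, w_true) + sigma * rng.normal(size=x.size)

# Step 2 (model translation): log p(y | x, w) for Gaussian noise,
# up to an additive constant that does not depend on w.
def log_lik_point(y, x, w):
    return -0.5 * ((y - f(x, w)) / sigma) ** 2

# Step 3 (posterior): log p(D | w) = sum_i log p(y_i | x_i, w), then Bayes' theorem.
log_lik = np.array([log_lik_point(y, x, w).sum() for w in w_grid])
log_post = log_lik + log_prior
post = np.exp(log_post - log_post.max())   # subtract the max for numerical stability
post /= post.sum() * dw                    # normalize on the grid (plays the role of p(D))

w_map = w_grid[np.argmax(post)]                                # most probable weight
w_mean = (w_grid * post).sum() * dw                            # posterior mean
w_std = np.sqrt(((w_grid - w_mean) ** 2 * post).sum() * dw)    # posterior spread
print(w_map, w_mean, w_std)
```

The posterior standard deviation `w_std` is the kind of error measure on the estimate that the Bayesian procedure provides in addition to the point estimate.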
Remarks
- No need for cross-validation
- Returns an error measure (a distribution over $w$)
- This gives an error on the estimate
- Traditional learning via gradient descent is recovered.
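Example: Let's define a family of deterministic models $y = f(x; w)$ with just one parameter $w$. Let's choose a normal prior $p(w)$. We have only one data point $D = \{(x_1, y_1)\}$.
Convert the model to probabilistic form:
$$p(y \mid x, w) = \delta\big(y - f(x; w)\big),$$
compute the posterior:
$$p(w \mid D) = \frac{\delta\big(y_1 - f(x_1; w)\big)\, p(w)}{\int \delta\big(y_1 - f(x_1; w')\big)\, p(w')\, dw'}.$$
Carrying out the computations, one gets
$$p(w \mid D) = \delta(w - w^*), \qquad f(x_1; w^*) = y_1,$$
so the posterior is a delta, and the value of $w$ that maximizes it (the only possible choice) is $w^*$, the weight that reproduces the data point exactly (assuming the equation $f(x_1; w) = y_1$ has a unique solution).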
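As a concrete instance of the example above, here is the computation worked out for one specific choice of model. The linear form $f(x; w) = w x$ with $x_1 \neq 0$ is an assumption made for illustration and is not taken from the notes.

```latex
% Assumption (not from the notes): the model is linear, f(x; w) = w x, and x_1 \neq 0.
\begin{align*}
p(y \mid x, w) &= \delta(y - w x), \\
p(w \mid D) &= \frac{\delta(y_1 - w x_1)\, p(w)}{\int \delta(y_1 - w' x_1)\, p(w')\, dw'}
  = \frac{\tfrac{1}{|x_1|}\,\delta\!\left(w - \tfrac{y_1}{x_1}\right) p(w)}
         {\tfrac{1}{|x_1|}\, p\!\left(\tfrac{y_1}{x_1}\right)}
  = \delta\!\left(w - \frac{y_1}{x_1}\right), \\
w_{\mathrm{MAP}} &= \frac{y_1}{x_1}.
\end{align*}
```

The second equality uses $\delta(y_1 - w x_1) = \delta(w - y_1/x_1)/|x_1|$. The normal prior cancels because it is strictly positive everywhere, so with a noiseless model a single data point fixes the weight regardless of the prior.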
Link with traditional learning
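In traditional statistical learning we minimize the empirical cost plus a regularization term that prevents overfitting,
$$C(w) = E_D(w) + \lambda R(w),$$
where the empirical cost is
$$E_D(w) = \frac{1}{N}\sum_{i=1}^{N} \ell\big(y_i - f(x_i; w)\big).$$
On the other hand, in Bayesian learning the most probable weights are found by minimizing
$$-\log p(w \mid D) = -\log p(D \mid w) - \log p(w) + c,$$
where $c$ collects constant terms that do not depend on $w$. If we assume that our data are perturbed by additive noise $\nu$ with distribution $p_\nu$, i.e. $y_i = f(x_i; w) + \nu_i$, then
$$p(D \mid w) = \prod_{i=1}^{N} p_\nu\big(y_i - f(x_i; w)\big).$$
Let's see what we are trying to minimize:
$$\frac{1}{N}\Big[-\log p(D \mid w) - \log p(w)\Big] = -\frac{1}{N}\sum_{i=1}^{N} \log p_\nu\big(y_i - f(x_i; w)\big) - \frac{1}{N}\log p(w).$$
We have added the factor $1/N$ just to see clearly the relationship with classical learning. If we choose a noise distribution $p_\nu(\nu) \propto \exp(-\ell(\nu))$ and a weight prior $p(w) \propto \exp(-N \lambda R(w))$, we are back in the setting of classical learning, up to constants that do not depend on $w$.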
Remark: this suggests choosing the regularizer $R(w)$ proportional to $-\log p(w)$.
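The correspondence can be illustrated for the most familiar case: squared loss with an L2 penalty (ridge regression) against Gaussian noise with a Gaussian prior. The sketch below is an illustration under assumptions that are not in the notes: a linear model, noise standard deviation `sigma`, prior standard deviation `tau`, and simulated data. The ridge objective is written without the $1/N$ factor, so `ridge_strength` is the penalty for the unnormalized sum of squares.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical linear model y = X @ w + noise; all numbers are illustrative.
n, d = 50, 3
sigma, tau = 0.3, 1.0            # assumed noise std and prior std
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + sigma * rng.normal(size=n)

def neg_log_posterior(w):
    """-log p(D|w) - log p(w), dropping terms that do not depend on w."""
    residual = y - X @ w
    return 0.5 * residual @ residual / sigma**2 + 0.5 * w @ w / tau**2

def grad_neg_log_posterior(w):
    """Gradient of neg_log_posterior with respect to w."""
    return -X.T @ (y - X @ w) / sigma**2 + w / tau**2

# Classical route: ridge regression, min ||y - Xw||^2 + ridge_strength * ||w||^2,
# with the regularization strength set by the ratio of the two variances.
ridge_strength = sigma**2 / tau**2
w_ridge = np.linalg.solve(X.T @ X + ridge_strength * np.eye(d), X.T @ y)

# Bayesian route: minimize the negative log-posterior by plain gradient descent.
# The step is 1 / (largest Hessian eigenvalue), which is safe for this quadratic.
hessian = X.T @ X / sigma**2 + np.eye(d) / tau**2
step = 1.0 / np.linalg.eigvalsh(hessian).max()
w_map = np.zeros(d)
for _ in range(2000):
    w_map = w_map - step * grad_neg_log_posterior(w_map)

print(w_ridge, w_map)
```

In this setting the two estimates agree because the ridge objective equals the negative log-posterior multiplied by $2\sigma^2$, so both have the same minimizer.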
Gradient descent to minimize D_KL
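The first algorithm that comes to mind is gradient descent: we fix a learning rate $\eta > 0$ and update the parameter vector $\theta$ of the model pdf $q_\theta$ as
$$\theta_{t+1} = \theta_t - \eta\, \nabla_\theta \hat{D}_{KL}(p \,\|\, q_{\theta_t}).$$
For $\eta$ small enough the divergence cannot grow; indeed, expanding to first order in $\eta$,
$$\hat{D}_{KL}(p \,\|\, q_{\theta_{t+1}}) = \hat{D}_{KL}(p \,\|\, q_{\theta_t}) - \eta\, \big\|\nabla_\theta \hat{D}_{KL}(p \,\|\, q_{\theta_t})\big\|^2 + O(\eta^2).$$
From this it follows that
$$\hat{D}_{KL}(p \,\|\, q_{\theta_{t+1}}) \le \hat{D}_{KL}(p \,\|\, q_{\theta_t}).$$
Remark: $p$ is unknown, but it does not appear in the gradient of the empirical KL divergence, which only involves the data points $x_i$:
$$\nabla_\theta \hat{D}_{KL}(p \,\|\, q_\theta) = -\frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \log q_\theta(x_i).$$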
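A minimal sketch of this update rule, under assumptions that are not in the notes: the model family is $N(\theta, 1)$, so that $\nabla_\theta \log q_\theta(x) = x - \theta$, the data are simulated, and the learning rate `eta` and starting point are arbitrary illustrative values. The quantity tracked is the negative average log-likelihood, which equals the empirical KL divergence up to a constant that does not depend on $\theta$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: data from an "unknown" p (here N(2, 1)); model family
# q_theta = N(theta, 1). For this family, grad_theta log q_theta(x) = x - theta.
x = rng.normal(loc=2.0, scale=1.0, size=1000)

def neg_avg_loglik(theta):
    """-(1/N) sum_i log q_theta(x_i): the empirical KL up to a theta-independent constant."""
    return 0.5 * np.log(2 * np.pi) + 0.5 * np.mean((x - theta) ** 2)

def grad(theta):
    """Gradient of the empirical KL; only the samples x_i enter, not p itself."""
    return -np.mean(x - theta)

eta = 0.1            # learning rate (assumed value)
theta = -1.0         # arbitrary starting point
history = [neg_avg_loglik(theta)]
for _ in range(200):
    theta = theta - eta * grad(theta)
    history.append(neg_avg_loglik(theta))

print(theta, x.mean())
```

For this family the minimizer is the sample mean, so `theta` can be compared against `x.mean()`, and `history` can be inspected to check that the objective does not increase along the iterations.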
The maximum entropy principle
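Idea: to estimate the unknown pdf, find the distribution $q$ that maximizes the entropy $H[q] = -\int q(x) \log q(x)\, dx$ subject to the given information, expressed as constraints on the mean values of some functions $f_k(x)$:
$$\max_q H[q] \quad \text{subject to} \quad \int q(x)\, f_k(x)\, dx = F_k, \qquad \int q(x)\, dx = 1.$$
To solve this problem we can use Lagrange multipliers.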
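The Lagrange-multiplier computation can be sketched as follows. The notation is an assumption made here for illustration: constraints $\langle f_k \rangle_q = F_k$ for $k = 1, \dots, K$, a multiplier $\lambda_0$ for normalization and multipliers $\lambda_k$ for the mean values.

```latex
% Notation assumed here: constraints \langle f_k \rangle_q = F_k for k = 1, ..., K,
% multipliers \lambda_0 (normalization) and \lambda_k (mean values).
\begin{align*}
\mathcal{L}[q] &= -\int q(x)\log q(x)\,dx
  + \lambda_0\left(\int q(x)\,dx - 1\right)
  + \sum_{k=1}^{K}\lambda_k\left(\int q(x)\, f_k(x)\,dx - F_k\right), \\
0 = \frac{\delta\mathcal{L}}{\delta q(x)} &= -\log q(x) - 1 + \lambda_0 + \sum_{k=1}^{K}\lambda_k f_k(x), \\
q(x) &= \frac{1}{Z(\lambda)}\exp\!\left(\sum_{k=1}^{K}\lambda_k f_k(x)\right),
\qquad Z(\lambda) = \int \exp\!\left(\sum_{k=1}^{K}\lambda_k f_k(x)\right)dx .
\end{align*}
```

The normalization multiplier is absorbed into $Z(\lambda)$, and the remaining multipliers are fixed by the constraints, which can be written as $\partial \log Z / \partial \lambda_k = F_k$. The maximum entropy solution is therefore a member of an exponential family built from the constrained functions.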