The Bayesian setting

As always in Bayesian statistics, one exploits Bayes' theorem.

The task is to find the probability of the parameters $\theta$, given a finite number of observations.
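
In symbols, with $\mathcal{D}$ denoting the finite set of observations (the notation here is an assumption, kept consistent with the rest of these notes):

$$
p(\theta \mid \mathcal{D}) \;=\; \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})},
$$

i.e. the posterior is the likelihood times the prior, divided by the evidence $p(\mathcal{D})$.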

The maximum likelihood method (MLE)

This is a frequentist approach (it implicitly assumes a flat prior). The task is the estimation of a pdf based on a finite number of observations (data) of a random variable $x$.

$\mathcal{D} = \{x_1, \dots, x_N\}$ is our data, sampled from an unknown pdf $p(x)$.

The likelihood of the data given a parameter $\theta$ for our family of distributions $q(x \mid \theta)$ is

$$
L(\theta) \;=\; q(\mathcal{D} \mid \theta) \;=\; \prod_{n=1}^{N} q(x_n \mid \theta),
$$

since the data are i.i.d. We now use the standard trick of taking the logarithm,

$$
\log L(\theta) \;=\; \sum_{n=1}^{N} \log q(x_n \mid \theta).
$$

Obviously, maximizing the likelihood is equivalent to maximizing the function just introduced. Now we rescale it by a factor $-1/N$ and add a constant that does not depend on $\theta$ (minus the empirical entropy of the true distribution $p$):

$$
-\frac{1}{N} \sum_{n=1}^{N} \log q(x_n \mid \theta) \;+\; \frac{1}{N} \sum_{n=1}^{N} \log p(x_n)
\;=\; \frac{1}{N} \sum_{n=1}^{N} \log \frac{p(x_n)}{q(x_n \mid \theta)}
\;\equiv\; \hat D_{\mathrm{KL}}\big( p \,\|\, q(\cdot \mid \theta) \big);
$$

the empirical KL divergence has appeared. In conclusion, maximizing the likelihood is equivalent to minimizing the empirical Kullback–Leibler divergence.
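
As a concrete illustration, here is a minimal numerical sketch of this equivalence, assuming a Gaussian model family $q(x \mid \theta) = \mathcal N(x; \theta, 1)$ (the family, sample size and seed are illustrative choices, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=500)      # samples from the unknown p(x)

# Empirical NLL of the family q(x | theta) = N(x; theta, 1):
# -(1/N) sum_n log q(x_n | theta) = 0.5 * mean((x_n - theta)^2) + 0.5 * log(2 pi).
thetas = np.linspace(-5.0, 5.0, 1001)
nll = np.array([0.5 * np.mean((data - t) ** 2) + 0.5 * np.log(2 * np.pi) for t in thetas])

# Minimizing the empirical NLL (= empirical KL up to a theta-independent constant)
# recovers the sample mean, i.e. the well-known Gaussian MLE.
theta_mle = thetas[np.argmin(nll)]
print(theta_mle, data.mean())
```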

Bayesian learning (MAP)

Again, suppose we have a set of noisy data $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, generated by an unknown source.

Here we also need a prior distribution $p(w)$ for the weights $w$; this reflects prior knowledge/prejudice.

The goal is to find the posterior distribution of the weights $p(w \mid \mathcal{D})$, that is, our new knowledge about the weights given the data $\mathcal{D}$.

The posterior is computed using Bayes' theorem:

$$
p(w \mid \mathcal{D}) \;=\; \frac{p(\mathcal{D} \mid w)\, p(w)}{p(\mathcal{D})},
$$

but we don't know the true data distribution $p(\mathcal{D})$ (this is in fact our goal). However, since the evidence can be written as

$$
p(\mathcal{D}) \;=\; \int p(\mathcal{D} \mid w)\, p(w)\, dw,
$$

using this we get

$$
p(w \mid \mathcal{D}) \;=\; \frac{p(\mathcal{D} \mid w)\, p(w)}{\int p(\mathcal{D} \mid w')\, p(w')\, dw'}.
$$

Remark Given a family of pdfs $p(y \mid x, w)$ and a choice of the parameters $w$, we can compute the data likelihood $p(\mathcal{D} \mid w)$.

We can schematize the procedure in 3 steps:

  1. Definitions: define the parametrized model $y = f(x; w)$ and the prior distribution $p(w)$.
  2. Model translation: convert the model into a standard probabilistic form, that is, specify the likelihood of finding the output $y$ upon input $x$, given the parameters $w$, i.e. $p(y \mid x, w)$.
  3. Posterior distribution: compute the data likelihood $p(\mathcal{D} \mid w) = \prod_{n=1}^{N} p(y_n \mid x_n, w)$ as a function of $w$ and plug it into Bayes' theorem to obtain the posterior $p(w \mid \mathcal{D})$ (a minimal numerical sketch follows the list).
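
The sketch below runs the three steps on a grid of parameter values, assuming a linear model $f(x; w) = w x$, Gaussian observation noise of known standard deviation and a standard normal prior; all of these specific choices are illustrative assumptions, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(1)
noise_std = 0.3

# Step 1: parametrized model f(x; w) = w * x and prior p(w) = N(0, 1).
def f(x, w):
    return w * x

def log_prior(w):
    return -0.5 * w**2 - 0.5 * np.log(2 * np.pi)

# Noisy data from an "unknown" source (here simulated with true w = 1.5).
x = rng.uniform(-1.0, 1.0, size=20)
y = f(x, 1.5) + rng.normal(scale=noise_std, size=20)

# Step 2: probabilistic form, likelihood p(y | x, w) = N(y; f(x; w), noise_std^2).
def log_likelihood(w):
    r = y - f(x, w)
    return np.sum(-0.5 * (r / noise_std) ** 2 - np.log(noise_std * np.sqrt(2 * np.pi)))

# Step 3: posterior p(w | D) proportional to p(D | w) p(w), normalized on a grid.
w_grid = np.linspace(-3.0, 3.0, 2001)
dw = w_grid[1] - w_grid[0]
log_post = np.array([log_likelihood(w) for w in w_grid]) + log_prior(w_grid)
post = np.exp(log_post - log_post.max())
post /= post.sum() * dw

print("MAP estimate:", w_grid[np.argmax(post)])
```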

Remarks

  • No need for cross-validation
  • Returns an error measure (a distribution over $w$, namely the posterior)
  • This gives an error bar on the estimate (see the continuation of the sketch below)
  • Traditional learning via gradient descent is recovered (see below).
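
Continuing the illustrative grid sketch above (it reuses `w_grid`, `post` and `dw` defined there), the posterior itself provides the error measure mentioned in these remarks, with no held-out data or cross-validation involved:

```python
# Posterior mean and standard deviation: point estimate plus an error bar,
# obtained from the posterior alone (no cross-validation).
w_mean = np.sum(w_grid * post) * dw
w_std = np.sqrt(np.sum((w_grid - w_mean) ** 2 * post) * dw)
print(f"w = {w_mean:.3f} +/- {w_std:.3f}")
```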

Example Let's define a family of deterministic models $y = f(x; w)$ with just one parameter $w$. Let's choose a normal prior $w \sim \mathcal N(\mu_0, \sigma_0^2)$. We have only one data point $(x_1, y_1)$.

Convert the model into probabilistic form: since the model is deterministic, the likelihood is a Dirac delta,

$$
p(y \mid x, w) \;=\; \delta\big( y - f(x; w) \big);
$$

compute the posterior:

$$
p(w \mid \mathcal{D}) \;\propto\; p(y_1 \mid x_1, w)\, p(w) \;=\; \delta\big( y_1 - f(x_1; w) \big)\, \mathcal N(w; \mu_0, \sigma_0^2);
$$

carrying out the computations (normalizing the product above), one finds that the posterior is concentrated on the set $\{ w : f(x_1; w) = y_1 \}$,

so the posterior is a delta, and the value of $w$ that maximizes it (the only possible choice) is the $w^*$ solving $f(x_1; w^*) = y_1$.
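
One way to make the delta likelihood precise (a sketch, under the assumption that the observation noise is Gaussian with a variance that is then sent to zero):

$$
p_\sigma(y \mid x, w) \;=\; \mathcal N\big( y;\, f(x; w),\, \sigma^2 \big) \;\xrightarrow{\;\sigma \to 0\;}\; \delta\big( y - f(x; w) \big),
$$

so for small $\sigma$ the posterior $p_\sigma(w \mid \mathcal{D}) \propto \exp\!\big( -(y_1 - f(x_1; w))^2 / 2\sigma^2 \big)\, \mathcal N(w; \mu_0, \sigma_0^2)$ concentrates, as $\sigma \to 0$, on the solutions of $f(x_1; w) = y_1$.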

In traditional statistical learning we minimize the empirical cost plus some regularization term that prevents overfitting,

$$
\hat w \;=\; \arg\min_w \Big[\, C(w) + \lambda\, \|w\|^2 \,\Big],
$$

where the empirical cost is

$$
C(w) \;=\; \frac{1}{N} \sum_{n=1}^{N} \big( y_n - f(x_n; w) \big)^2.
$$

On the other hand, in Bayesian learning the most probable weights (the MAP estimate) are found by minimizing the negative log-posterior,

$$
-\log p(w \mid \mathcal{D}) \;=\; -\log p(\mathcal{D} \mid w) \;-\; \log p(w) \;+\; \text{const},
$$

where $\text{const}$ collects terms that do not depend on $w$. If we assume that our data are perturbed by a noise $\varepsilon$ with distribution $p_\varepsilon$, i.e. $y = f(x; w) + \varepsilon$, then

$$
p(\mathcal{D} \mid w) \;=\; \prod_{n=1}^{N} p_\varepsilon\big( y_n - f(x_n; w) \big).
$$

Let's see what we are trying to minimize:

$$
-\frac{1}{N} \log p(w \mid \mathcal{D}) \;=\; -\frac{1}{N} \sum_{n=1}^{N} \log p_\varepsilon\big( y_n - f(x_n; w) \big) \;-\; \frac{1}{N} \log p(w) \;+\; \text{const};
$$

we have added the factor $1/N$ just to see clearly the relationship with classical learning. Then, if we choose a Gaussian noise distribution $p_\varepsilon(\varepsilon) \propto e^{-\varepsilon^2 / 2\sigma^2}$ and a Gaussian prior on the weights $p(w) \propto e^{-\|w\|^2 / 2\sigma_w^2}$, we are back in the setting of classical learning.

Remark This suggests how to choose the regularization constant: with the conventions above, minimizing the MAP objective is equivalent to minimizing $C(w) + \lambda \|w\|^2$ with $\lambda = \sigma^2 / (N \sigma_w^2)$, the ratio between the noise variance and the prior variance (scaled by the dataset size).
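
A small numerical check of this correspondence, under the same illustrative assumptions as before (linear model $f(x; w) = w x$, Gaussian noise, Gaussian prior): the MAP estimate found by grid search should coincide with the closed-form ridge-regression solution for $\lambda = \sigma^2 / (N \sigma_w^2)$.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, sigma_w, N = 0.3, 1.0, 50          # noise std, prior std, dataset size

x = rng.uniform(-1.0, 1.0, size=N)
y = 1.5 * x + rng.normal(scale=sigma, size=N)

# MAP objective: (1 / 2 sigma^2) sum_n (y_n - w x_n)^2 + w^2 / (2 sigma_w^2), on a grid.
w_grid = np.linspace(-3.0, 3.0, 20001)
resid = y[None, :] - w_grid[:, None] * x[None, :]
map_obj = (resid**2).sum(axis=1) / (2 * sigma**2) + w_grid**2 / (2 * sigma_w**2)
w_map = w_grid[np.argmin(map_obj)]

# Classical ridge regression: minimize (1/N) sum_n (y_n - w x_n)^2 + lam * w^2
# with lam = sigma^2 / (N sigma_w^2); closed-form solution below.
lam = sigma**2 / (N * sigma_w**2)
w_ridge = np.sum(x * y) / (np.sum(x**2) + N * lam)

print(w_map, w_ridge)   # the two estimates agree up to the grid resolution
```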

Gradient descent to minimize D_KL

The first algorithm that comes to mind is gradient descent: we fix a learning rate $\eta > 0$ and update the parameter vector as

$$
\theta_{t+1} \;=\; \theta_t \;-\; \eta\, \nabla_\theta D_{\mathrm{KL}}\big( p \,\|\, q(\cdot \mid \theta) \big) \Big|_{\theta = \theta_t};
$$

for sufficiently small $\eta$ the divergence cannot increase; indeed, writing $D_{\mathrm{KL}}(\theta)$ for $D_{\mathrm{KL}}\big( p \,\|\, q(\cdot \mid \theta) \big)$, to first order

$$
D_{\mathrm{KL}}(\theta_{t+1}) \;\approx\; D_{\mathrm{KL}}(\theta_t) + \nabla_\theta D_{\mathrm{KL}}(\theta_t) \cdot (\theta_{t+1} - \theta_t) \;=\; D_{\mathrm{KL}}(\theta_t) - \eta\, \big\| \nabla_\theta D_{\mathrm{KL}}(\theta_t) \big\|^2,
$$

from which it follows that

$$
D_{\mathrm{KL}}(\theta_{t+1}) \;\le\; D_{\mathrm{KL}}(\theta_t).
$$

Remark The true distribution $p$ is unknown, but it doesn't appear in the gradient of the KL divergence:

$$
\nabla_\theta D_{\mathrm{KL}}\big( p \,\|\, q(\cdot \mid \theta) \big) \;=\; -\, \mathbb{E}_p\big[ \nabla_\theta \log q(x \mid \theta) \big] \;\approx\; -\frac{1}{N} \sum_{n=1}^{N} \nabla_\theta \log q(x_n \mid \theta),
$$

since the entropy term $\mathbb{E}_p[\log p]$ does not depend on $\theta$, and the remaining expectation can be estimated from the samples alone.
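
A minimal sketch of this update rule, reusing the illustrative Gaussian family $q(x \mid \theta) = \mathcal N(x; \theta, 1)$ from the MLE section; note that the gradient only needs the samples, never the true $p$:

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(loc=2.0, scale=1.0, size=500)   # samples from the unknown p(x)

# For q(x | theta) = N(x; theta, 1): grad_theta log q(x | theta) = x - theta,
# so the gradient of the empirical KL is -(1/N) sum_n (x_n - theta) = theta - mean(x).
eta = 0.1                        # learning rate
theta = -4.0                     # arbitrary starting point
for _ in range(200):
    grad_kl = theta - data.mean()
    theta -= eta * grad_kl       # theta_{t+1} = theta_t - eta * grad D_KL

print(theta, data.mean())        # gradient descent converges to the MLE (sample mean)
```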

The maximum entropy principle

Idea: to find the unknown pdf $p(x)$, look for the distribution that maximizes the entropy $H[p] = -\int p(x) \log p(x)\, dx$, given information (constraints) on the mean values of some functions $f_k(x)$, i.e. $\mathbb{E}_p[ f_k(x) ] = \mu_k$.

To solve this constrained optimization problem we can use Lagrange multipliers.
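
A brief sketch of how the multipliers enter, using the notation introduced above (the names $\mu_k$, $\lambda_k$ and $Z$ are assumptions made here for concreteness):

$$
\mathcal{L}[p] \;=\; -\int p(x) \log p(x)\, dx \;+\; \lambda_0 \Big( \int p(x)\, dx - 1 \Big) \;+\; \sum_k \lambda_k \Big( \int f_k(x)\, p(x)\, dx - \mu_k \Big),
$$

and setting the functional derivative $\delta \mathcal{L} / \delta p(x)$ to zero yields an exponential-family solution,

$$
p(x) \;=\; \frac{1}{Z(\lambda)} \exp\!\Big( \sum_k \lambda_k f_k(x) \Big), \qquad Z(\lambda) \;=\; \int \exp\!\Big( \sum_k \lambda_k f_k(x) \Big)\, dx,
$$

whose multipliers $\lambda_k$ are then fixed by imposing the constraints $\mathbb{E}_p[ f_k(x) ] = \mu_k$.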