Suppose that we want to learn a binary classification function $f \colon \mathbb{R}^d \to \{0, 1\}$. A simple hypothesis would be a step function:

$$
H(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{if } z < 0 \end{cases}
$$
but this is not differentiable! Another simple choice that is smooth is the logistic (sigmoid) function:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$
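As a quick numerical illustration (a minimal sketch, not part of the original notes), we can compare the two functions at a few points and see that the sigmoid smoothly interpolates between the two plateaus of the step:

```python
import math

def step(z):
    # Step function: outputs 1 for z >= 0, else 0; not differentiable at z = 0
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    # Logistic function: a smooth, differentiable approximation of the step
    return 1.0 / (1.0 + math.exp(-z))

for z in (-4.0, 0.0, 4.0):
    print(f"z={z:+.1f}  step={step(z)}  sigmoid={sigmoid(z):.4f}")
```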
so that our model is defined as

  • Input: $x \in \mathbb{R}^d$
  • Output: $y \in \{0, 1\}$
  • Hypothesis: $h_\theta(x) = \sigma(\theta^\top x)$
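The hypothesis above can be sketched directly in code (a minimal illustration; the parameter vector `theta` and input `x` below are hypothetical values, not from the notes):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    # Hypothesis: sigmoid of the dot product theta^T x
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

theta = [0.5, -1.0]   # hypothetical parameters
x = [2.0, 1.0]        # hypothetical input; here theta^T x = 0
print(h(theta, x))    # always a value in (0, 1)
```

Because the sigmoid maps all of $\mathbb{R}$ into $(0, 1)$, the output can be read as a probability, which is exactly the interpretation used next.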

Since $h_\theta(x) \in (0, 1)$, we find that it can be interpreted as a probability:

$$
h_\theta(x) = P(y = 1 \mid x; \theta), \qquad 1 - h_\theta(x) = P(y = 0 \mid x; \theta)
$$
We call $y$ the ‘true’ probability. Our goal is to make $h_\theta(x)$ as similar as possible to $y$; to do so we introduce

  • Performance metric: binary cross-entropy

$$
\ell\bigl(h_\theta(x), y\bigr) = -\,y \log h_\theta(x) - (1 - y) \log\bigl(1 - h_\theta(x)\bigr)
$$
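A small sketch of the pointwise loss (illustrative values only): the binary cross-entropy is near zero when a confident prediction is correct and grows large when a confident prediction is wrong.

```python
import math

def bce(p, y):
    # Binary cross-entropy of a predicted probability p against a label y in {0, 1}
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

print(bce(0.9, 1))  # confident and correct -> small loss (~0.105)
print(bce(0.1, 1))  # confident and wrong   -> large loss (~2.303)
```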
we minimize the average binary cross-entropy over the $n$ training examples:

$$
J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \Bigl[ -\,y^{(i)} \log h_\theta(x^{(i)}) - (1 - y^{(i)}) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \Bigr]
$$
note that the sigmoid satisfies

$$
\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr)
$$
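This derivative identity is what makes the gradient come out so cleanly, and it is easy to sanity-check numerically against a central finite difference (a quick sketch, not from the notes):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Identity: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

eps = 1e-6
for z in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    print(z, sigmoid_prime(z), numeric)  # the two columns should agree
```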
let’s compute the gradient! Using the identity above, the terms simplify and we obtain

$$
\nabla_\theta J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)\, x^{(i)}
$$
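The gradient formula translates directly into code. The sketch below (with hypothetical toy data) accumulates the per-example error $h_\theta(x^{(i)}) - y^{(i)}$ weighted by the input:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def grad_J(theta, X, Y):
    # Gradient of the average BCE: (1/n) * sum_i (h(x_i) - y_i) * x_i
    n = len(X)
    g = [0.0] * len(theta)
    for x, y in zip(X, Y):
        err = h(theta, x) - y
        for j, xj in enumerate(x):
            g[j] += err * xj / n
    return g

# Hypothetical toy data
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
Y = [1, 0, 1]
print(grad_J([0.0, 0.0], X, Y))
```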
(In the original computation a stray $+1$ appeared that should not be there; in the end the result is very clean.) Imposing that the gradient equals zero,

$$
\frac{1}{n} \sum_{i=1}^{n} \bigl( \sigma(\theta^\top x^{(i)}) - y^{(i)} \bigr)\, x^{(i)} = 0
$$
which is

$$
\sum_{i=1}^{n} \sigma(\theta^\top x^{(i)})\, x^{(i)} = \sum_{i=1}^{n} y^{(i)} x^{(i)}
$$
This is not analytically tractable (the sigmoid makes the system nonlinear in $\theta$)! We need to use numerical techniques like gradient methods.
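A minimal sketch of such a gradient method, plain gradient descent on the average cross-entropy (toy data and learning rate are hypothetical, chosen only for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def grad_J(theta, X, Y):
    # (1/n) * sum_i (h(x_i) - y_i) * x_i
    n = len(X)
    g = [0.0] * len(theta)
    for x, y in zip(X, Y):
        err = h(theta, x) - y
        for j, xj in enumerate(x):
            g[j] += err * xj / n
    return g

# Hypothetical toy data: first feature is a constant 1 (bias term),
# and the label is 1 exactly when the second feature is positive
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
Y = [0, 0, 1, 1]

theta = [0.0, 0.0]
lr = 0.5  # learning rate (assumed)
for _ in range(2000):
    g = grad_J(theta, X, Y)
    theta = [t - lr * gj for t, gj in zip(theta, g)]

for x, y in zip(X, Y):
    print(x, y, round(h(theta, x), 3))
```

After training, the predicted probabilities should be above $0.5$ for the positive examples and below $0.5$ for the negative ones, even though no closed-form solution for $\theta$ exists.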