RBM

Architecture

An RBM is a neural network architecture on a complete bipartite graph. The two layers are called visible and hidden; with $N_v$ visible and $N_h$ hidden units the network has $N = N_v + N_h$ neurons in total.

We denote the state of the visible neurons by $v_i$ ($i = 1, \dots, N_v$) and the state of the hidden neurons by $h_\mu$ ($\mu = 1, \dots, N_h$).

The Hamiltonian is the usual one of neural-network models,

$$H(v, h) = -\sum_i a_i v_i - \sum_\mu b_\mu h_\mu - \sum_{i,\mu} v_i \, w_{i\mu} \, h_\mu .$$
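As a concrete illustration, here is a minimal sketch of this energy function in numpy, assuming binary units; the names `W`, `a`, `b` are illustrative, not taken from the notes.

```python
# Minimal sketch of the RBM energy above for binary units; W, a, b are
# illustrative names for the couplings and the visible/hidden biases.
import numpy as np

def rbm_energy(v, h, W, a, b):
    """E(v, h) = -a.v - b.h - v.W.h for a single configuration."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

# Tiny usage example: 3 visible and 2 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
a, b = np.zeros(3), np.zeros(2)
v = np.array([1.0, 0.0, 1.0])
h = np.array([0.0, 1.0])
print(rbm_energy(v, h, W, a, b))
```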

Supervised learning

The goal is to learn the joint distribution $p(x, y)$, where $x$ is the input and $y$ the label.

The idea is to use the visible neurons to encode $x$, and the hidden neurons to encode the label (class) $y$.

We need to find the parameters such that the equilibrium distribution of the network is as close as possible to the data distribution.

Then we can use the network to predict the correct label of unseen data; the best guess is $\hat{y} = \arg\max_y p(y \mid x)$.
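As an illustration of this prediction step, here is a hedged sketch that assumes the simplest possible encoding, one hidden unit per class (an assumption for the example, not something the notes specify); `W` and `b` are illustrative names for the weights and hidden biases.

```python
# Hedged sketch of label prediction: clamp the input on the visible layer and
# pick the class whose hidden unit is most likely to be active.
# Assumes one hidden unit per class; binary visible units.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_label(x, W, b):
    p_hidden = sigmoid(b + x @ W)    # P(h_mu = 1 | v = x) for every hidden unit
    return int(np.argmax(p_hidden))  # best-guess class
```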

Unsupervised learning

Input: some data points $x$. Task: learn the data distribution $p_{\text{data}}(x)$.

Again we encode our data in the visible-layer neurons, $v = x$. So we need to find the best parameters such that the marginal equilibrium distribution of the visible layer satisfies $p(v) \approx p_{\text{data}}(v)$.

Once the network is trained, it can be used to generate new data, to denoise, and to reconstruct corrupted inputs.
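As a sketch of reconstruction/denoising with a trained network (assuming binary units and the same illustrative parameter names as above), one can run a single Gibbs step $v \to h \to v'$:

```python
# Sketch of reconstruction/denoising with a trained RBM: a single Gibbs step
# v -> h -> v', using the standard conditional probabilities for binary units.
# W, a, b are the (assumed) trained couplings and biases; names are illustrative.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruct(v, W, a, b, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    p_h = sigmoid(b + v @ W)                   # P(h_mu = 1 | v)
    h = (rng.random(p_h.shape) < p_h) * 1.0    # sample the hidden layer
    p_v = sigmoid(a + W @ h)                   # P(v_i = 1 | h)
    return p_v                                 # reconstructed visible layer
```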

Learning

The task is to learn a distribution. The obvious choice is to define a metric (loss) between the network distribution and the target distribution, and to minimize it with gradient descent.

We will use the KL divergence $D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p) = \sum_v p_{\text{data}}(v) \log \frac{p_{\text{data}}(v)}{p(v)}$. Using gradient descent we obtain update rules for the parameters,

$$\theta \;\leftarrow\; \theta - \eta \, \frac{\partial D_{\mathrm{KL}}}{\partial \theta},$$

thus we need to compute these derivatives. Let us consider the RBM in the unsupervised learning setting.

Since we only care about the visible neurons, we have to marginalize over the hidden ones:

$$p(v) = \sum_h p(v, h) = \frac{\sum_h e^{-H(v, h)}}{Z}, \qquad Z = \sum_{v, h} e^{-H(v, h)}.$$

Let us compute the derivatives that appear in the numerator.

Putting everything back together…
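The standard result of this calculation, sketched here with the notation introduced above, is a contrast between a data term and a model term:

$$
-\frac{\partial D_{\mathrm{KL}}}{\partial w_{i\mu}} \;\propto\; \langle v_i h_\mu \rangle_{\text{data}} - \langle v_i h_\mu \rangle_{\text{model}}, \qquad
-\frac{\partial D_{\mathrm{KL}}}{\partial a_i} \;\propto\; \langle v_i \rangle_{\text{data}} - \langle v_i \rangle_{\text{model}}, \qquad
-\frac{\partial D_{\mathrm{KL}}}{\partial b_\mu} \;\propto\; \langle h_\mu \rangle_{\text{data}} - \langle h_\mu \rangle_{\text{model}},
$$

where $\langle \cdot \rangle_{\text{data}}$ averages over the data with the hidden units drawn from $p(h \mid v)$, and $\langle \cdot \rangle_{\text{model}}$ is the average over the equilibrium (Boltzmann) distribution of the network.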

For the training procedure we need to let the network thermalize in order to estimate the model averages… This is computationally intractable for anything but tiny networks.

Techniques to approximate this:

  • MCMC (Markov chain Monte Carlo): the more steps you take, the better (empirically even a single step is enough!); a sketch of this single-step variant follows the list.
  • Mean-field approximation, using the self-consistency equations.
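Here is a minimal sketch of the single-step MCMC variant (known in the literature as contrastive divergence, CD-1), assuming binary units; the parameter names `W`, `a`, `b` and the learning rate `lr` are illustrative, not taken from the notes.

```python
# Minimal sketch of a single CD-1 parameter update for a binary RBM.
# All names (W, a, b, lr) are illustrative.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, a, b, lr=0.01, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # Positive phase: data clamped on the visible layer.
    p_h = sigmoid(b + v_data @ W)
    h = (rng.random(p_h.shape) < p_h) * 1.0
    # Negative phase: one Gibbs step v -> h -> v' -> h' replaces full thermalization.
    p_v = sigmoid(a + W @ h)
    v_model = (rng.random(p_v.shape) < p_v) * 1.0
    p_h_model = sigmoid(b + v_model @ W)
    # Gradient step on <v h>_data - <v h>_model and the bias analogues.
    W += lr * (np.outer(v_data, p_h) - np.outer(v_model, p_h_model))
    a += lr * (v_data - v_model)
    b += lr * (p_h - p_h_model)
    return W, a, b
```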