Modelling Sequence Data

Some tasks where the input is sequential in nature:

  • speech recognition
  • machine translation
  • language modeling
  • sentiment analysis
  • video data

How should we choose our model? Again, fully connected networks (FCNs) are universal approximators, but the size of the model can grow too much.

If our sequence is long, the FCN would become prohibitively large. Idea: feed the input one element at a time and re-use the same weights! With this idea we don’t care about the sequence length.

To memorize previously seen information, we employ a recurrent mechanism:

  $h_t = \sigma(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$

We can unfold this formula $t$ times.

A simple RNN model combines this recurrence with an output layer, $\hat{y}_t = W_{hy} h_t + b_y$.
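The recurrence above can be sketched in NumPy; the sizes, initialization scale, and variable names here are illustrative assumptions, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4                              # input and hidden dimensions

W_hh = rng.normal(size=(n_hid, n_hid)) * 0.1    # hidden-to-hidden weights
W_xh = rng.normal(size=(n_hid, n_in)) * 0.1     # input-to-hidden weights
b_h = np.zeros(n_hid)

def rnn_step(h_prev, x):
    """One recurrence step: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x + b_h)

# The same weights are reused at every step, so the number of
# parameters does not depend on the sequence length.
xs = rng.normal(size=(10, n_in))                # a sequence of 10 inputs
h = np.zeros(n_hid)
for x in xs:
    h = rnn_step(h, x)
```

Note that only `W_hh`, `W_xh`, and `b_h` are learned, regardless of how many time steps the loop runs for.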

Training an RNN

It’s like we are backpropagating through time: just unfold the network! The loss is just the sum over time steps:

  $L = \sum_{t=1}^{T} L_t$

For simplicity let’s assume that there is no bias and that the activation function is the identity. Then

  $h_t = W_{hh} h_{t-1} + W_{xh} x_t.$

Let $W = W_{hh}$; then, by the chain rule,

  $\dfrac{\partial L_t}{\partial h_k} = \dfrac{\partial L_t}{\partial h_t} \prod_{i=k+1}^{t} \dfrac{\partial h_i}{\partial h_{i-1}},$

but in our simple case

  $\dfrac{\partial h_i}{\partial h_{i-1}} = W,$

so that

  $\dfrac{\partial L_t}{\partial h_k} = \dfrac{\partial L_t}{\partial h_t}\, W^{\,t-k}.$

Things could get ugly if $t-k$ is big: if the greatest eigenvalue norm of $W$ is greater than 1 we get exploding gradients, if it’s less than 1 vanishing gradients.

LSTM

Long Short-Term Memory.
Past inputs contribute less and less to the loss. We need to add memory!

Main ideas

  • a new hidden state called cell state, with the ability to store long-term information.
  • LSTM can read, write and erase information from a cell.
  • Gates are defined to get the ability to select information; they are also vectors of length $n$ (the hidden dimension).
  • At each time step the gates can be open (1) or closed (0), or somewhere in between.

Forget Gate

Controls what is kept vs. forgotten from the previous cell state:

  $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$

Input Gate

Controls what parts of the new input are written to the cell:

  $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$

Output Gate

Controls what parts of the cell state are transferred to the hidden state:

  $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$

New Cell content

We update the content of the cell and the hidden state using the gates. The new candidate content is

  $\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c),$

then we forget and write

  $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,$

and the new hidden state uses the output gate:

  $h_t = o_t \odot \tanh(c_t).$
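A single LSTM step can be sketched directly from these equations; the dimensions, initialization, and the choice of stacking each gate's weights on the concatenation $[h_{t-1}, x_t]$ are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
# One weight matrix per gate, acting on the concatenation [h_{t-1}, x_t].
W_f, W_i, W_o, W_c = (rng.normal(size=(n_hid, n_hid + n_in)) * 0.1
                      for _ in range(4))
b_f = b_i = b_o = b_c = np.zeros(n_hid)

def lstm_step(h_prev, c_prev, x):
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z + b_f)         # forget gate: keep vs. forget c_{t-1}
    i = sigmoid(W_i @ z + b_i)         # input gate: what to write
    o = sigmoid(W_o @ z + b_o)         # output gate: what to expose
    c_tilde = np.tanh(W_c @ z + b_c)   # new candidate cell content
    c = f * c_prev + i * c_tilde       # forget and write
    h = o * np.tanh(c)                 # hidden state reads the cell
    return h, c
```

Because `c` is updated additively (`f * c_prev + i * c_tilde`), gradients can flow through many steps without repeatedly passing through a squashing nonlinearity, which is what lets the cell store long-term information.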

GRU

Gated Recurrent Units. A GRU doesn’t use a cell state, but it has gates!

Reset Gate

Controls what parts of the previous hidden state are used to compute the new content:

  $r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)$

Update Gate

Controls how much of the previous hidden state is kept versus overwritten with the new content:

  $z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)$

We use the gates very similarly to create a new hidden state:

  $\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1},\, x_t] + b_h)$

  $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
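A GRU step follows the same pattern but with a single state vector. As above, sizes, names, and the convention for the update gate (here $z_t$ weights the new content) are illustrative assumptions; some libraries flip the roles of $z_t$ and $1 - z_t$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n_in, n_hid = 3, 4
W_r, W_z, W_h = (rng.normal(size=(n_hid, n_hid + n_in)) * 0.1
                 for _ in range(3))
b_r = b_z = b_h = np.zeros(n_hid)

def gru_step(h_prev, x):
    r = sigmoid(W_r @ np.concatenate([h_prev, x]) + b_r)  # reset gate
    z = sigmoid(W_z @ np.concatenate([h_prev, x]) + b_z)  # update gate
    # Reset gate selects which parts of h_prev feed the candidate content.
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]) + b_h)
    # Update gate interpolates between the old state and the candidate.
    return (1 - z) * h_prev + z * h_tilde
```

With no separate cell state, the GRU has fewer parameters than an LSTM (three weight matrices instead of four) while keeping the same additive, gated update that eases gradient flow.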