D2L Chapter 8-9 note

RNN

Basic structure

  • Hidden layer: \mathbf{H}_t = \phi(\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh} + \mathbf{b}_h)
    • X: (n, d), n is batch_size and d is dimension
    • W_xh: (d, h): how to transform input
    • H_t-1: (n, h): hidden layer from prev time step
    • W_hh: (h, h): how to use hidden layer from prev time step
    • b_h: (1, h): bias
    • \phi: Activation (ReLU, …)
  • Output layer: \mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q
    • H_t: (n, h)
    • W_hq: (h, q)
    • b_q: (1, q)
    • O: (n, q): output logits; apply softmax(O) to get class probabilities
  • Where
    • n = batch_size
    • d = input_dimension
    • h = hidden_size
    • q = output_features_size, basically how many things we want the model to spit out (a shape-check sketch in code follows after this list)
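A minimal shape-check of one RNN time step, assuming PyTorch (tensor names are my own, chosen to match the symbols above; this is a sketch, not D2L's implementation):

```python
import torch

# n = batch_size, d = input_dimension, h = hidden_size, q = output_features_size
n, d, h, q = 2, 5, 4, 3

X_t = torch.randn(n, d)       # input at time step t
H_prev = torch.zeros(n, h)    # H_{t-1}, hidden state from the previous step

W_xh = torch.randn(d, h)      # transforms the input
W_hh = torch.randn(h, h)      # transforms the previous hidden state
b_h = torch.zeros(1, h)
W_hq = torch.randn(h, q)      # maps hidden state to output
b_q = torch.zeros(1, q)

# Hidden layer: H_t = phi(X_t W_xh + H_{t-1} W_hh + b_h), with phi = tanh here
H_t = torch.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)   # shape (n, h)
# Output layer: O_t = H_t W_hq + b_q, logits over q classes
O_t = H_t @ W_hq + b_q                               # shape (n, q)
probs = torch.softmax(O_t, dim=1)                    # class probabilities
```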

How is it trained

  • Perplexity
    • It is a metric
    • For a language model, it is roughly the average number of guesses the model needs for the next word, given the words that came before it (a numeric sketch follows after this list)
    • \exp\left(-\frac{1}{n} \sum_{t=1}^n \log P(x_t \mid x_{t-1}, \ldots, x_1)\right).
      • Based on cross entropy
      • example
        • In the best case, those probabilities are all 1, and the perplexity would be \exp(0) = 1
        • In the worst case, those probabilities approach 0 and the perplexity blows up toward infinity
  • By the way some notes about cross entropy because I forget everything in 2 fucking seconds
    • Entropy quantifies the “surprise” of an outcome, like the surprise of drawing an SSR
    • Cross entropy’s “cross” part means “crossing over” to another distribution
      • This is used when we are comparing 2 distributions
      • H(P, Q) = -\sum_i P(i) \log Q(i), where P is the true distribution and Q is the predicted distribution
      • If P=Q, then H is just the entropy of P, the prediction progress created no added uncertainty
      • If Q deviates from P, then H(P, Q) will be larger than the entropy of P; the larger the gap, the more extra uncertainty Q’s predicted distribution adds
  • Back propagation through time (BPTT): unroll the RNN across time steps, then run ordinary backprop through the unrolled graph (long products of \mathbf W_{hh} terms are what make gradients vanish or explode)
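A toy numeric sketch of perplexity as exponentiated cross-entropy, assuming PyTorch (the probabilities below are made up for illustration):

```python
import torch

# Probabilities the model assigned to the token that actually came next,
# at each of n = 4 time steps (made-up numbers).
p_next = torch.tensor([0.9, 0.5, 0.25, 0.8])

cross_entropy = -torch.log(p_next).mean()   # -(1/n) * sum_t log P(x_t | x_{t-1}, ..., x_1)
perplexity = torch.exp(cross_entropy)

print(cross_entropy.item(), perplexity.item())
# If every probability were 1, cross_entropy = 0 and perplexity = exp(0) = 1 (the best case).
```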

Fancier RNNs

\odot (element-wise / Hadamard product)

  • A = \begin{pmatrix} a_1 & a_2 & a_3 \\ a_4 & a_5 & a_6 \\ a_7 & a_8 & a_9 \\ \end{pmatrix} , B = \begin{pmatrix} b_1 & b_2 & b_3 \\ b_4 & b_5 & b_6 \\ b_7 & b_8 & b_9 \\ \end{pmatrix}
  • A \odot B = \begin{pmatrix} a_1b_1 & a_2b_2 & a_3b_3 \\ a_4b_4 & a_5b_5 & a_6b_6 \\ a_7b_7 & a_8b_8 & a_9b_9 \\ \end{pmatrix}
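In code, the same element-wise product is just `*` on tensors of matching shape; a toy example assuming PyTorch, with my own values:

```python
import torch

A = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])
B = torch.tensor([[10., 0., 1.],
                  [0.5, 2., 3.]])

print(A * B)   # element-wise (Hadamard) product, same as torch.mul(A, B)
# tensor([[10.,  0.,  3.],
#         [ 2., 10., 18.]])
```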

GRU

  • Gated Recurrent Unit
  • Gimmicks
    • Reset gate \mathbf R_t
      • An FC layer with sigmoid activation, so its values are in [0, 1]
      • For the value of each cell:
        • Approaches 0 -> the value of corresponding cell on H_{t-1} should be reset
        • Approaches 1 -> …… should not reset
      • The effect of resetting H_{t-1} is manifested by using \odot
        • multiplying something by a value close to 0 weakens its effect
        • ….. close to 1 means almost no weakening of its effect
    • Update gate \mathbf Z_t
      • An FC layer with sigmoid activation, so its values are in [0, 1]
      • For the value of each cell:
        • Approaches 0 -> more of \mathbf H_t comes from the candidate \tilde{\mathbf{H}}_t
        • Approaches 1 -> more of \mathbf H_t is carried over from \mathbf H_{t-1}
  • Process (for the overall structure, see the GRU diagram in D2L; a code sketch follows after this list)
    1. \mathbf X_t, \mathbf H_{t-1} enters the unit and passes through Reset and Update gate
      • In reset gate:
        • \mathbf{R}_t = \sigma(\mathbf{X}_t \mathbf{W}_{xr} + \mathbf{H}_{t-1} \mathbf{W}_{hr} + \mathbf{b}_r)
        • \mathbf W_{xr}, \mathbf W_{hr}, \mathbf b_r are params associated with Reset gate
      • In update gate:
        • \mathbf{Z}_t = \sigma(\mathbf{X}_t \mathbf{W}_{xz} + \mathbf{H}_{t-1} \mathbf{W}_{hz} + \mathbf{b}_z)
        • \mathbf W_{xz}, \mathbf W_{hz}, \mathbf b_z are params associated with the Update gate
    2. Use \mathbf {R_t, H_{t-1}, X_t} to generate candidate hidden state \tilde{\mathbf{H}}_t
      • \tilde{\mathbf{H}}_t = \tanh(\mathbf{X}_t \mathbf{W}_{xh} + \left(\mathbf{R}_t \odot \mathbf{H}_{t-1}\right) \mathbf{W}_{hh} + \mathbf{b}_h)
        • \mathbf W_{xh} is used on \mathbf X_t
        • The \odot operator enables the model to select which cells in \mathbf H_{t-1} should be reset
        • tanh is used as the activation so the values fall into [-1, 1]
    3. Use \mathbf Z_t to calculate the final version of \mathbf H_t
      • \mathbf{H}_t = \mathbf{Z}_t \odot \mathbf{H}_{t-1} + (1 - \mathbf{Z}_t) \odot \tilde{\mathbf{H}}_t
      • Basically, for each element, how much should come from \tilde{\mathbf{H}}_t and \mathbf H_{t-1} respectively
        • One cell in \mathbf Z_t approaches 0 -> more of that cell in \mathbf H_t comes from \tilde{\mathbf{H}}_t
        • One cell in \mathbf Z_t approaches 1 -> more of it is carried over from \mathbf H_{t-1}
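Putting the three steps together, a minimal sketch of one GRU time step, assuming PyTorch (parameter names mirror the formulas above; this is my own sketch, not d2l's implementation):

```python
import torch

def gru_step(X_t, H_prev, params):
    """One GRU time step: reset gate, update gate, candidate state, final state."""
    W_xr, W_hr, b_r, W_xz, W_hz, b_z, W_xh, W_hh, b_h = params
    R_t = torch.sigmoid(X_t @ W_xr + H_prev @ W_hr + b_r)           # reset gate
    Z_t = torch.sigmoid(X_t @ W_xz + H_prev @ W_hz + b_z)           # update gate
    H_tilde = torch.tanh(X_t @ W_xh + (R_t * H_prev) @ W_hh + b_h)  # candidate hidden state
    H_t = Z_t * H_prev + (1 - Z_t) * H_tilde                        # blend old state and candidate
    return H_t

n, d, h = 2, 5, 4
params = (
    torch.randn(d, h), torch.randn(h, h), torch.zeros(h),   # reset gate params
    torch.randn(d, h), torch.randn(h, h), torch.zeros(h),   # update gate params
    torch.randn(d, h), torch.randn(h, h), torch.zeros(h),   # candidate state params
)
H_t = gru_step(torch.randn(n, d), torch.zeros(n, h), params)
```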

LSTM

  • Long Short Term Memory
  • GRU is a watered down version of this
  • Gimmicks:
    • Forget gate \mathbf F_t
      • FC layer with sigmoid activation
      • Operates on the memories from prev step. Uses cell-wise multiplication (\odot)
        • \mathbf F_t \rightarrow 0: More memories from the past (\mathbf C_{t-1}) are forgotten
        • \mathbf F_t \rightarrow 1: … saved and used in the formation of new memories (\mathbf C_t)
    • Input gate \mathbf I_t
      • FC layer with sigmoid activation
      • Operates on candidate memory (\tilde{\mathbf C_t}) that is formed by tanh-ing the combination of \mathbf H_{t-1} and \mathbf X_t
        • \mathbf I_t \rightarrow 0: More memories from candidate memory are forgotten
        • \mathbf I_t \rightarrow 1: … saved and used in the formation of new memories (\mathbf C_t)
    • Output gate \mathbf O_t
      • FC layer with sigmoid activation
      • Operates on the new memory \mathbf C_t = \mathbf F_t \odot \mathbf C_{t-1} + \mathbf I_t \odot \tilde{\mathbf C}_t, deciding how much of \tanh(\mathbf C_t) is exposed as the hidden state \mathbf H_t = \mathbf O_t \odot \tanh(\mathbf C_t)
        • \mathbf O_t \rightarrow 0: Little of the memory reaches \mathbf H_t (it stays inside the cell only)
        • \mathbf O_t \rightarrow 1: \tanh(\mathbf C_t) is passed through to \mathbf H_t almost unchanged
  • Process
    • See the LSTM flow chart in D2L (a code sketch follows below)
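A matching sketch of one LSTM step, assuming PyTorch (parameter names are mine, chosen to mirror the gates above; not d2l's implementation):

```python
import torch

def lstm_step(X_t, H_prev, C_prev, params):
    """One LSTM time step: forget, input, output gates plus candidate memory."""
    (W_xf, W_hf, b_f, W_xi, W_hi, b_i,
     W_xo, W_ho, b_o, W_xc, W_hc, b_c) = params
    F_t = torch.sigmoid(X_t @ W_xf + H_prev @ W_hf + b_f)   # forget gate: keep vs drop C_{t-1}
    I_t = torch.sigmoid(X_t @ W_xi + H_prev @ W_hi + b_i)   # input gate: admit candidate memory
    O_t = torch.sigmoid(X_t @ W_xo + H_prev @ W_ho + b_o)   # output gate: expose memory to H_t
    C_tilde = torch.tanh(X_t @ W_xc + H_prev @ W_hc + b_c)  # candidate memory
    C_t = F_t * C_prev + I_t * C_tilde                      # new memory cell
    H_t = O_t * torch.tanh(C_t)                             # new hidden state
    return H_t, C_t
```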

Encoder-Decoder & seq2seq

Steps of traditional seq2seq:

  1. Each word in a sentence gets its embedding vector
  2. At each time step (for each word in the sentence), a hidden state h is generated and recorded
  3. The shape of h depends only on the batch size and the hidden size; with multiple layers, the recorded h is the output of the last layer. It has NOTHING TO DO with the embedding size.
  4. In the final time step, the h produced there is called the “context vector” and given a new letter, c. For multiple layers, c is the final hidden state of all layers.
  5. c acts as the initial hidden state (h_0) for the decoder and decoder use it to initiate the RNN
  6. Each generated output is fed back as the x for the next RNN time step, until the output becomes “<eos>”, indicating the end of the sentence
  7. The initial x is “<bos>”, indicating the beginning of a sentence
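A minimal encoder-decoder sketch of the steps above, assuming PyTorch's nn.Embedding and nn.GRU (all sizes and variable names here are made up for illustration, not taken from D2L):

```python
import torch
from torch import nn

vocab_size, embed_size, hidden_size, num_layers = 100, 8, 16, 2
embed = nn.Embedding(vocab_size, embed_size)
encoder = nn.GRU(embed_size, hidden_size, num_layers)
decoder = nn.GRU(embed_size, hidden_size, num_layers)
out_proj = nn.Linear(hidden_size, vocab_size)   # maps decoder h to vocabulary logits

src = torch.randint(0, vocab_size, (7, 2))      # (src_len, batch) of source token ids
# enc_outputs: per-step h of the LAST layer, shape (src_len, batch, hidden_size)
# context: final hidden state of ALL layers, shape (num_layers, batch, hidden_size)
enc_outputs, context = encoder(embed(src))

# The context vector c initializes the decoder's hidden state; decoding starts
# from <bos>, and each generated token is fed back in until <eos> appears.
bos = torch.zeros(1, 2, dtype=torch.long)       # pretend token id 0 is <bos>
dec_output, dec_state = decoder(embed(bos), context)
logits = out_proj(dec_output)                   # (1, batch, vocab_size)
```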
