D2L Chapter 8-9 note

RNN

Basic structure

  • Hidden layer: \mathbf{H}_t = \phi(\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh} + \mathbf{b}_h)
    • X: (n, d), n is batch_size and d is dimension
    • W_xh: (d, h): how to transform input
    • H_t-1: (n, h): hidden layer from prev time step
    • W_hh: (h, h): how to use hidden layer from prev time step
    • b_h: (1, h): bias
    • \phi: Activation (ReLU, …)
  • Output layer: \mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q
    • H_t: (n, h)
    • W_hq: (h, q)
    • b_q: (1, q)
    • O: (n, q): output logits; apply softmax(O) to get class probabilities
  • Where
    • n = batch_size
    • d = input_dimension
    • h = hidden_size
    • q = output_features_size, basically how many things we want the model to spit out (a shape-check sketch in code follows after this list)
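A minimal shape-check of one RNN time step, assuming PyTorch (tensor names are my own, chosen to match the symbols above; this is a sketch, not D2L's implementation):

```python
import torch

# n = batch_size, d = input_dimension, h = hidden_size, q = output_features_size
n, d, h, q = 2, 5, 4, 3

X_t = torch.randn(n, d)       # input at time step t
H_prev = torch.zeros(n, h)    # H_{t-1}, hidden state from the previous step

W_xh = torch.randn(d, h)      # transforms the input
W_hh = torch.randn(h, h)      # transforms the previous hidden state
b_h = torch.zeros(1, h)
W_hq = torch.randn(h, q)      # maps hidden state to output
b_q = torch.zeros(1, q)

# Hidden layer: H_t = phi(X_t W_xh + H_{t-1} W_hh + b_h), with phi = tanh here
H_t = torch.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)   # shape (n, h)
# Output layer: O_t = H_t W_hq + b_q, logits over q classes
O_t = H_t @ W_hq + b_q                               # shape (n, q)
probs = torch.softmax(O_t, dim=1)                    # class probabilities
```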

How is it trained

  • Perplexity
    • It is a metric
    • For a language model, it is roughly the average number of guesses the model needs for the next word, given the words that came before it (a numeric sketch follows after this list)
    • \exp\left(-\frac{1}{n} \sum_{t=1}^n \log P(x_t \mid x_{t-1}, \ldots, x_1)\right).
      • Based on cross entropy
      • example
        • In the best case, those probabilities are all 1, and the perplexity would be \exp(0) = 1
        • In the worst case, those probabilities approach 0 and the perplexity blows up toward infinity
  • By the way some notes about cross entropy because I forget everything in 2 fucking seconds
    • Entropy quantifies the “surprise” of an outcome, like the surprise of drawing an SSR
    • Cross entropy’s “cross” part means “crossing over” to another distribution
      • This is used when we are comparing 2 distributions
      • H(P, Q) = -\sum_i P(i) \log Q(i), where P is the true distribution and Q is the predicted distribution
      • If P=Q, then H is just the entropy of P, the prediction progress created no added uncertainty
      • If Q deviates from P, then H(P, Q) will be larger than the entropy of P; the larger the gap, the more extra uncertainty Q’s predicted distribution adds
  • Back propagation through time (BPTT): unroll the RNN across time steps, then run ordinary backprop through the unrolled graph (long products of \mathbf W_{hh} terms are what make gradients vanish or explode)
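A toy numeric sketch of perplexity as exponentiated cross-entropy, assuming PyTorch (the probabilities below are made up for illustration):

```python
import torch

# Probabilities the model assigned to the token that actually came next,
# at each of n = 4 time steps (made-up numbers).
p_next = torch.tensor([0.9, 0.5, 0.25, 0.8])

cross_entropy = -torch.log(p_next).mean()   # -(1/n) * sum_t log P(x_t | x_{t-1}, ..., x_1)
perplexity = torch.exp(cross_entropy)

print(cross_entropy.item(), perplexity.item())
# If every probability were 1, cross_entropy = 0 and perplexity = exp(0) = 1 (the best case).
```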

Fancier RNNs

\odot (element-wise / Hadamard product)

  • A = \begin{pmatrix} a_1 & a_2 & a_3 \\ a_4 & a_5 & a_6 \\ a_7 & a_8 & a_9 \\ \end{pmatrix} , B = \begin{pmatrix} b_1 & b_2 & b_3 \\ b_4 & b_5 & b_6 \\ b_7 & b_8 & b_9 \\ \end{pmatrix}
  • A \odot B = \begin{pmatrix} a_1b_1 & a_2b_2 & a_3b_3 \\ a_4b_4 & a_5b_5 & a_6b_6 \\ a_7b_7 & a_8b_8 & a_9b_9 \\ \end{pmatrix}
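In code, the same element-wise product is just `*` on tensors of matching shape; a toy example assuming PyTorch, with my own values:

```python
import torch

A = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])
B = torch.tensor([[10., 0., 1.],
                  [0.5, 2., 3.]])

print(A * B)   # element-wise (Hadamard) product, same as torch.mul(A, B)
# tensor([[10.,  0.,  3.],
#         [ 2., 10., 18.]])
```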

GRU

  • Gated Recurrent Unit
  • Gimmicks
    • Reset gate \mathbf R_t
      • An FC layer with sigmoid activation, so its values are in [0, 1]
      • For the value of each cell:
        • Approaches 0 -> the value of corresponding cell on H_{t-1} should be reset
        • Approaches 1 -> …… should not reset
      • The effect of resetting H_{t-1} is manifested by using \odot
        • multiplying something by a value close to 0 weakens its effect
        • ….. close to 1 means almost no weakening of its effect
    • Update gate \mathbf Z_t
      • An FC layer with sigmoid activation, so its values are in [0, 1]
      • For the value of each cell:
        • Approaches 0 -> more of \mathbf H_t comes from the candidate \tilde{\mathbf{H}}_t
        • Approaches 1 -> more of \mathbf H_t is carried over from \mathbf H_{t-1}
  • Process (for the overall structure, see the GRU diagram in D2L; a code sketch follows after this list)
    1. \mathbf X_t, \mathbf H_{t-1} enters the unit and passes through Reset and Update gate
      • In reset gate:
        • \mathbf{R}_t = \sigma(\mathbf{X}_t \mathbf{W}_{xr} + \mathbf{H}_{t-1} \mathbf{W}_{hr} + \mathbf{b}_r)
        • \mathbf W_{xr}, \mathbf W_{hr}, \mathbf b_r are params associated with Reset gate
      • In update gate:
        • \mathbf{Z}_t = \sigma(\mathbf{X}_t \mathbf{W}_{xz} + \mathbf{H}_{t-1} \mathbf{W}_{hz} + \mathbf{b}_z)
        • \mathbf W_{xz}, \mathbf W_{hz}, \mathbf b_z are params associated with the Update gate
    2. Use \mathbf {R_t, H_{t-1}, X_t} to generate candidate hidden state \tilde{\mathbf{H}}_t
      • \tilde{\mathbf{H}}_t = \tanh(\mathbf{X}_t \mathbf{W}_{xh} + \left(\mathbf{R}_t \odot \mathbf{H}_{t-1}\right) \mathbf{W}_{hh} + \mathbf{b}_h)
        • \mathbf W_{xh} is used on \mathbf X_t
        • The \odot operator enables the model to select which cells in \mathbf H_{t-1} should be reset
        • tanh is used as the activation so the values fall into [-1, 1]
    3. Use \mathbf Z_t to calculate the final version of \mathbf H_t
      • \mathbf{H}_t = \mathbf{Z}_t \odot \mathbf{H}_{t-1} + (1 - \mathbf{Z}_t) \odot \tilde{\mathbf{H}}_t
      • Basically, for each element, how much should come from \tilde{\mathbf{H}}_t and \mathbf H_{t-1} respectively
        • One cell in \mathbf Z_t approaches 0 -> more of that cell in \mathbf H_t comes from \tilde{\mathbf{H}}_t
        • One cell in \mathbf Z_t approaches 1 -> more of it is carried over from \mathbf H_{t-1}
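Putting the three steps together, a minimal sketch of one GRU time step, assuming PyTorch (parameter names mirror the formulas above; this is my own sketch, not d2l's implementation):

```python
import torch

def gru_step(X_t, H_prev, params):
    """One GRU time step: reset gate, update gate, candidate state, final state."""
    W_xr, W_hr, b_r, W_xz, W_hz, b_z, W_xh, W_hh, b_h = params
    R_t = torch.sigmoid(X_t @ W_xr + H_prev @ W_hr + b_r)           # reset gate
    Z_t = torch.sigmoid(X_t @ W_xz + H_prev @ W_hz + b_z)           # update gate
    H_tilde = torch.tanh(X_t @ W_xh + (R_t * H_prev) @ W_hh + b_h)  # candidate hidden state
    H_t = Z_t * H_prev + (1 - Z_t) * H_tilde                        # blend old state and candidate
    return H_t

n, d, h = 2, 5, 4
params = (
    torch.randn(d, h), torch.randn(h, h), torch.zeros(h),   # reset gate params
    torch.randn(d, h), torch.randn(h, h), torch.zeros(h),   # update gate params
    torch.randn(d, h), torch.randn(h, h), torch.zeros(h),   # candidate state params
)
H_t = gru_step(torch.randn(n, d), torch.zeros(n, h), params)
```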

LSTM

  • Long Short Term Memory
  • GRU is a watered down version of this
  • Gimmicks:
    • Forget gate \mathbf F_t
      • FC layer with sigmoid activation
      • Operates on the memories from prev step. Uses cell-wise multiplication (\odot)
        • \mathbf F_t \rightarrow 0: More memories from the past (\mathbf C_{t-1}) are forgotten
        • \mathbf F_t \rightarrow 1: … saved and used in the formation of new memories (\mathbf C_t)
    • Input gate \mathbf I_t
      • FC layer with sigmoid activation
      • Operates on candidate memory (\tilde{\mathbf C_t}) that is formed by tanh-ing the combination of \mathbf H_{t-1} and \mathbf X_t
        • \mathbf I_t \rightarrow 0: More memories from candidate memory are forgotten
        • \mathbf I_t \rightarrow 1: … saved and used in the formation of new memories (\mathbf C_t)
    • Output gate \mathbf O_t
      • FC layer with sigmoid activation
      • Operates on the new memory \mathbf C_t = \mathbf F_t \odot \mathbf C_{t-1} + \mathbf I_t \odot \tilde{\mathbf C}_t, deciding how much of \tanh(\mathbf C_t) is exposed as the hidden state \mathbf H_t = \mathbf O_t \odot \tanh(\mathbf C_t)
        • \mathbf O_t \rightarrow 0: Little of the memory reaches \mathbf H_t (it stays inside the cell only)
        • \mathbf O_t \rightarrow 1: \tanh(\mathbf C_t) is passed through to \mathbf H_t almost unchanged
  • Process
    • See the LSTM flow chart in D2L (a code sketch follows below)
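A matching sketch of one LSTM step, assuming PyTorch (parameter names are mine, chosen to mirror the gates above; not d2l's implementation):

```python
import torch

def lstm_step(X_t, H_prev, C_prev, params):
    """One LSTM time step: forget, input, output gates plus candidate memory."""
    (W_xf, W_hf, b_f, W_xi, W_hi, b_i,
     W_xo, W_ho, b_o, W_xc, W_hc, b_c) = params
    F_t = torch.sigmoid(X_t @ W_xf + H_prev @ W_hf + b_f)   # forget gate: keep vs drop C_{t-1}
    I_t = torch.sigmoid(X_t @ W_xi + H_prev @ W_hi + b_i)   # input gate: admit candidate memory
    O_t = torch.sigmoid(X_t @ W_xo + H_prev @ W_ho + b_o)   # output gate: expose memory to H_t
    C_tilde = torch.tanh(X_t @ W_xc + H_prev @ W_hc + b_c)  # candidate memory
    C_t = F_t * C_prev + I_t * C_tilde                      # new memory cell
    H_t = O_t * torch.tanh(C_t)                             # new hidden state
    return H_t, C_t
```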

Encoder-Decoder & seq2seq

Steps of traditional seq2seq:

  1. Each word in a sentence gets its embedding vector
  2. At each time step (for each word in the sentence), a hidden state h is generated and recorded
  3. The shape of h depends only on the batch size and the hidden size; with multiple layers, the recorded h is the output of the last layer. It has NOTHING TO DO with the embedding size.
  4. In the final time step, the h produced there is called the “context vector” and given a new letter, c. For multiple layers, c is the final hidden state of all layers.
  5. c acts as the initial hidden state (h_0) for the decoder and decoder use it to initiate the RNN
  6. Each generated output is fed back as the x for the next RNN time step, until the output becomes “<eos>”, indicating the end of the sentence
  7. The initial x is “<bos>”, indicating the beginning of a sentence
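A minimal encoder-decoder sketch of the steps above, assuming PyTorch's nn.Embedding and nn.GRU (all sizes and variable names here are made up for illustration, not taken from D2L):

```python
import torch
from torch import nn

vocab_size, embed_size, hidden_size, num_layers = 100, 8, 16, 2
embed = nn.Embedding(vocab_size, embed_size)
encoder = nn.GRU(embed_size, hidden_size, num_layers)
decoder = nn.GRU(embed_size, hidden_size, num_layers)
out_proj = nn.Linear(hidden_size, vocab_size)   # maps decoder h to vocabulary logits

src = torch.randint(0, vocab_size, (7, 2))      # (src_len, batch) of source token ids
# enc_outputs: per-step h of the LAST layer, shape (src_len, batch, hidden_size)
# context: final hidden state of ALL layers, shape (num_layers, batch, hidden_size)
enc_outputs, context = encoder(embed(src))

# The context vector c initializes the decoder's hidden state; decoding starts
# from <bos>, and each generated token is fed back in until <eos> appears.
bos = torch.zeros(1, 2, dtype=torch.long)       # pretend token id 0 is <bos>
dec_output, dec_state = decoder(embed(bos), context)
logits = out_proj(dec_output)                   # (1, batch, vocab_size)
```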
