Deep Learning regularization

4-13 Activation functions

  • Why:
    • Introduce nonlinearity so the network can represent more complex functions than a purely linear model
  • Types:
    • none (identity): $y = x$
    • sigmoid:
    • $\large y = \frac{1}{1+e^{-x}}$
    • Used when output is binary
    • tanh:
    • $\large y = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
    • ReLU:
    • $y = \max(x, 0)$
    • Advantages & Disadvantages:
    • Sigmoid / tanh
      • Sigmoid outputs values in $(0,1)$, so it is better suited to probability-like outputs
      • tanh outputs values in $(-1,1)$, so it is better suited when negative outputs are needed
      • Both saturate when the input is very negative or very positive, which makes gradients very small
    • ReLU
      • Fast
      • Output is $0$ when $x < 0$, so those units receive no gradient and can "die"
  • Softmax:
    • Used when output is discrete
    • $\Large y_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$
    • [1.3, 5.2, 2.2, 0.7, 1.1] -> softmax -> [0.02, 0.91, 0.05, 0.01, 0.02], and the class with the highest probability is taken as the output (see the NumPy sketch at the end of this section)
  • Issues of nonlinearity:
    • Causes loss functions to be nonconvex
    • No guarantee of global convergence; the loss surface can have multiple local minima
    • Sensitive to initial parameters
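
A minimal NumPy sketch of these activations (the function names and the example scores are illustrative; the softmax call reproduces the example above):

```python
import numpy as np

def sigmoid(x):
    # Squashes any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes any real input into (-1, 1); saturates for large |x|.
    return np.tanh(x)

def relu(x):
    # Identity for positive inputs, zero for negative inputs.
    return np.maximum(x, 0.0)

def softmax(z):
    # Subtract the max score for numerical stability; the outputs sum to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([1.3, 5.2, 2.2, 0.7, 1.1])
probs = softmax(scores)
print(np.round(probs, 2))  # [0.02 0.91 0.05 0.01 0.02]
print(np.argmax(probs))    # 1 -> the class with the highest probability
```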

15-31 Regularization

  • Why:
    • Control overfitting
    • Reduce the model's generalization error, which comes from bias and variance (regularization mainly reduces variance)
    • The goal is to make an overfitted model simpler, i.e. reduce its effective capacity
  • Signs of overfitting:
    • When the training loss/accuracy curve begins to flatten out
    • When the validation loss/accuracy begins to fluctuate (or get worse)
    • That is the point at which we start applying regularization
  • Regularization term:
    • The penalty term added to the loss to keep the model simple is written $R(W,b)$
    • The total loss function then becomes $L_{data} + \lambda R(W,b)$
    • The model then has to balance fitting the data against keeping the complexity of $W, b$ low
    • $\lambda$ is a hyperparameter, and the same value is used on all layers
  • Methods:
    • L1 regularization
    • $R = \sum_{i,j} |W_{ij}|$
    • L2 regularization
    • $R = \sum_{i,j} W_{ij}^2$
    • Data augmentation
    • Flips, scaling, cropping, color changes (see the augmentation sketch after this list)
    • Dropout
    • Every time we do a forward pass, each neuron is dropped with probability $p$
    • $p$ is a hyperparameter
    • Dropout is applied only in the training phase; if the number of training iterations is large enough, every neuron is still guaranteed to be trained
    • No dropout in the testing phase (L2 regularization and dropout are shown in the Keras sketch after this list)
    • Advantages:
      • computationally cheap
      • more effective than other regularizers of comparable computational cost
      • no restriction on the type of model or training procedure
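
A minimal Keras sketch of L2 regularization and dropout (assumes TensorFlow 2.x; the layer sizes, $\lambda$ value, and dropout rate below are illustrative, not values from the lecture):

```python
import tensorflow as tf

lam = 1e-4  # lambda: regularization strength, shared across layers (hyperparameter)
p = 0.5     # dropout probability (hyperparameter)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    # kernel_regularizer adds lam * sum(W**2) for this layer to the data loss,
    # so the total loss becomes L_data + lambda * R(W).
    tf.keras.layers.Dense(128, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(lam)),
    # Dropout zeroes each activation with probability p during training only;
    # Keras disables it automatically at test/inference time.
    tf.keras.layers.Dropout(p),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```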
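
Data augmentation can be sketched with Keras preprocessing layers (also assumes a recent TensorFlow 2.x; the chosen layers and parameters are illustrative):

```python
import tensorflow as tf

# Random flips, scaling, cropping, and color changes applied on the fly to each
# training batch; like dropout, these layers are inactive at inference time.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),  # flips
    tf.keras.layers.RandomZoom(0.1),           # scale
    tf.keras.layers.RandomCrop(28, 28),        # crop (output size is illustrative)
    tf.keras.layers.RandomContrast(0.2),       # color change
])

# Usage (hypothetical image batch): x_aug = augment(x_batch, training=True)
```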

33-36 Feedforward network in Python demo

38-45 Python Keras basics
