Deep Learning regularization

4-13 Activation functions

  • Why:
    • Introduce nonlinearity so the network can represent more complex functions than a purely linear model
  • Types:
    • none (identity): $y = x$
    • sigmoid:
    • $\large y = \frac{1}{1+e^{-x}}$
    • Used when output is binary
    • tanh:
    • $\large y = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
    • ReLU:
    • $y = \max(x, 0)$
    • Advantages & Disadvantages:
    • Sigmoid / tanh
      • Sigmoid outputs values in $(0,1)$, so it is better suited to probability-like outputs
      • tanh outputs values in $(-1,1)$, so it is better suited when negative outputs are needed
      • Both saturate when the input is very negative or very positive, which makes gradients very small
    • ReLU
      • Fast
      • Output is $0$ when $x < 0$, so those units receive no gradient and can "die"
  • Softmax:
    • Used when output is discrete
    • $\Large y_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$
    • [1.3, 5.2, 2.2, 0.7, 1.1] -> softmax -> [0.02, 0.91, 0.05, 0.01, 0.02], and the class with the highest probability is taken as the output (see the NumPy sketch at the end of this section)
  • Issues of nonlinearity:
    • Causes loss functions to be nonconvex
    • No guarantee of global convergence; the loss surface can have multiple local minima
    • Sensitive to initial parameters
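
A minimal NumPy sketch of these activations (the function names and the example scores are illustrative; the softmax call reproduces the example above):

```python
import numpy as np

def sigmoid(x):
    # Squashes any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes any real input into (-1, 1); saturates for large |x|.
    return np.tanh(x)

def relu(x):
    # Identity for positive inputs, zero for negative inputs.
    return np.maximum(x, 0.0)

def softmax(z):
    # Subtract the max score for numerical stability; the outputs sum to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([1.3, 5.2, 2.2, 0.7, 1.1])
probs = softmax(scores)
print(np.round(probs, 2))  # [0.02 0.91 0.05 0.01 0.02]
print(np.argmax(probs))    # 1 -> the class with the highest probability
```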

15-31 Regularization

  • Why:
    • Control overfitting
    • Reduce the model's generalization error, which comes from bias and variance (regularization mainly reduces variance)
    • The goal is to make an overfitted model simpler, i.e. reduce its effective capacity
  • Signs of overfitting:
    • When the training loss/accuracy curve begins to flatten out
    • When the validation loss/accuracy begins to fluctuate (or get worse)
    • That is the point at which we start applying regularization
  • Regularization term:
    • The penalty term added to the loss to keep the model simple is written $R(W,b)$
    • The total loss function then becomes $L_{data} + \lambda R(W,b)$
    • The model then has to balance fitting the data against keeping the complexity of $W, b$ low
    • $\lambda$ is a hyperparameter, and the same value is used on all layers
  • Methods:
    • L1 regularization
    • $R = \sum_{i,j} |W_{ij}|$
    • L2 regularization
    • $R = \sum_{i,j} W_{ij}^2$
    • Data augmentation
    • Flips, scaling, cropping, color changes (see the augmentation sketch after this list)
    • Dropout
    • Every time we do a forward pass, each neuron is dropped with probability $p$
    • $p$ is a hyperparameter
    • Dropout is applied only in the training phase; if the number of training iterations is large enough, every neuron is still guaranteed to be trained
    • No dropout in the testing phase (L2 regularization and dropout are shown in the Keras sketch after this list)
    • Advantages:
      • computationally cheap
      • more effective than other regularizers of comparable computational cost
      • no restriction on the type of model or training procedure
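
A minimal Keras sketch of L2 regularization and dropout (assumes TensorFlow 2.x; the layer sizes, $\lambda$ value, and dropout rate below are illustrative, not values from the lecture):

```python
import tensorflow as tf

lam = 1e-4  # lambda: regularization strength, shared across layers (hyperparameter)
p = 0.5     # dropout probability (hyperparameter)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    # kernel_regularizer adds lam * sum(W**2) for this layer to the data loss,
    # so the total loss becomes L_data + lambda * R(W).
    tf.keras.layers.Dense(128, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(lam)),
    # Dropout zeroes each activation with probability p during training only;
    # Keras disables it automatically at test/inference time.
    tf.keras.layers.Dropout(p),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```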
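
Data augmentation can be sketched with Keras preprocessing layers (also assumes a recent TensorFlow 2.x; the chosen layers and parameters are illustrative):

```python
import tensorflow as tf

# Random flips, scaling, cropping, and color changes applied on the fly to each
# training batch; like dropout, these layers are inactive at inference time.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),  # flips
    tf.keras.layers.RandomZoom(0.1),           # scale
    tf.keras.layers.RandomCrop(28, 28),        # crop (output size is illustrative)
    tf.keras.layers.RandomContrast(0.2),       # color change
])

# Usage (hypothetical image batch): x_aug = augment(x_batch, training=True)
```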

33-36 Feedforward network in Python demo

38-45 Python Keras basics
