D2L Chapter 3 Note

Comments on 3.1

Gradient descent for linear regression:

\[\begin{aligned} \mathbf{w} & \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right), \\ b & \leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_b l^{(i)}(\mathbf{w}, b) = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right). \end{aligned}\]

Here,

\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) = \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.

Partial derivatives are first taken of \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2; by the chain rule, the outer derivative yields \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right).

For the \mathbf{w} update, the inner term's derivative with respect to \mathbf{w} is \mathbf{x}^{(i)}, so that factor multiplies the error.

For the b update, the inner term's derivative with respect to b is 1, so no extra factor appears.
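
A minimal sketch of one such update step on a synthetic minibatch (plain PyTorch; the names w, b, eta, X_batch and y_batch are mine, not the book's):

import torch

torch.manual_seed(0)
X_batch = torch.randn(10, 2)   # |B| = 10 examples, 2 features
y_batch = torch.randn(10)      # targets
w = torch.zeros(2)
b = torch.zeros(1)
eta = 0.03                     # learning rate

err = X_batch @ w + b - y_batch                                    # (w^T x^(i) + b - y^(i)) for each i in B
w = w - eta / len(X_batch) * (X_batch * err[:, None]).sum(dim=0)   # error multiplied by x^(i)
b = b - eta / len(X_batch) * err.sum()                             # error alone, no extra factor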


Maximum Likelihood Estimation:

For each individual point, the likelihood is

P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (y - \mathbf{w}^\top \mathbf{x} - b)^2\right)

This is obtained because we set y = \mathbf{w}^\top \mathbf{x} + b + \epsilon, where \epsilon \sim \mathcal{N}(0, \sigma^2).

Rearranging gives \epsilon = y - \mathbf{w}^\top \mathbf{x} - b. Since \epsilon follows that normal distribution, we substitute this expression for \epsilon into the Gaussian p.d.f., which turns it into a likelihood of y given \mathbf{x} (with parameters \mathbf{w} and b).

Then, assuming the examples are independent, the likelihood of the whole dataset is the product over all points: P(\mathbf y \mid \mathbf X) = \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)})

The ultimate goal is to find the \mathbf{w} and b that maximize this value. The optimization is simpler after a log transformation, giving the log-likelihood (equivalently, we minimize the negative log-likelihood).
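
Writing out the negative log-likelihood makes the connection to squared loss explicit:

-\log P(\mathbf y \mid \mathbf X) = \sum_{i=1}^n \left( \frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2 \sigma^2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2 \right)

Since \sigma is a fixed constant, the first term does not depend on \mathbf{w} or b, so minimizing the negative log-likelihood is the same as minimizing the sum of squared errors above.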


Implementation of linear regression

Minibatch SGD is the same gradient update rule described above, performed once per minibatch rather than once over the full dataset.

trainer = torch.optim.SGD(net.parameters(), lr=0.03)
num_epochs = 3

for epoch in range(num_epochs):
    for X, y in data_iter:
        l = loss(net(X), y)
        trainer.zero_grad()
        l.backward()
        trainer.step()
    l = loss(net(features), labels)
    print(f'epoch {epoch + 1}, loss {l:f}')

loss = nn.MSELoss()

trainer = torch.optim.SGD(net.parameters(), lr=0.03), where net = nn.Sequential(nn.Linear(2, 1))

Forward pass: l = loss(net(X), y) computes the loss for the current batch of data.

Zero gradients: trainer.zero_grad() clears the old gradients.

Backward pass: l.backward() computes the new gradients for the loss with respect to the model parameters.

Update parameters: trainer.step() updates the model parameters using the computed gradients.
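
For reference, a self-contained version of this concise implementation might look like the following (synthetic data; true_w, true_b and the noise scale are made-up values, roughly following the book's setup):

import torch
from torch import nn
from torch.utils import data

# Synthetic regression data: y = Xw + b + noise (values here are illustrative)
true_w = torch.tensor([2.0, -3.4])
true_b = 4.2
features = torch.randn(1000, 2)
labels = (features @ true_w + true_b + 0.01 * torch.randn(1000)).reshape(-1, 1)

# Wrap the tensors in a DataLoader to get minibatches
dataset = data.TensorDataset(features, labels)
data_iter = data.DataLoader(dataset, batch_size=10, shuffle=True)

net = nn.Sequential(nn.Linear(2, 1))
loss = nn.MSELoss()
trainer = torch.optim.SGD(net.parameters(), lr=0.03)

for epoch in range(3):
    for X, y in data_iter:
        l = loss(net(X), y)   # forward pass on the minibatch
        trainer.zero_grad()   # clear old gradients
        l.backward()          # backward pass
        trainer.step()        # parameter update
    print(f'epoch {epoch + 1}, loss {loss(net(features), labels):f}')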


Softmax

logit = “unnormalized prediction”

Overflow: in the original softmax \hat y_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}, if any o_k is too large, \exp(o_k) may exceed the largest number the data type can represent, so the division can come out as 0, inf, or nan.

To solve this,

    \[\begin{aligned} \hat y_j & = \frac{\exp(o_j - \max(o_k))\exp(\max(o_k))}{\sum_k \exp(o_k - \max(o_k))\exp(\max(o_k))} \\ & = \frac{\exp(o_j - \max(o_k))}{\sum_k \exp(o_k - \max(o_k))} \end{aligned}\]

Underflow: after the shift, (o_j - \max(o_k)) can be a very large negative number, so \exp(\cdot) underflows to 0 and \log(\hat y_j) becomes -inf, which gives bad results (nan) when backpropagated. This is solved by working with the log of the softmax directly and simplifying, as shown:

    \[\begin{aligned} \log{(\hat y_j)} & = \log\left( \frac{\exp(o_j - \max(o_k))}{\sum_k \exp(o_k - \max(o_k))}\right) \\ & = \log{(\exp(o_j - \max(o_k)))}-\log{\left( \sum_k \exp(o_k - \max(o_k)) \right)} \\ & = o_j - \max(o_k) -\log{\left( \sum_k \exp(o_k - \max(o_k)) \right)}. \end{aligned}\]
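
A quick sketch of this trick in code (plain PyTorch; stable_log_softmax is my own name, not a library function):

import torch

def stable_log_softmax(O):
    # Subtract the row-wise max before exponentiating (prevents overflow),
    # and stay in log space so underflow never reaches a bare log(0).
    O_shifted = O - O.max(dim=1, keepdim=True).values
    return O_shifted - O_shifted.exp().sum(dim=1, keepdim=True).log()

logits = torch.tensor([[1000.0, 1001.0, 1002.0]])
naive = logits.exp() / logits.exp().sum(dim=1, keepdim=True)  # exp overflows -> nan
stable = stable_log_softmax(logits).exp()                     # ~[0.0900, 0.2447, 0.6652]

In practice, PyTorch's nn.CrossEntropyLoss takes the raw logits and applies this log-sum-exp trick internally, so the softmax never has to be computed explicitly.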

Cross-entropy loss: L(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{i=1}^{C} y_i \log(\hat y_i)

Example: 5 classes, and the true class is no. 2, so the one-hot label is \mathbf{y} = (0, 1, 0, 0, 0).
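
With a one-hot \mathbf{y}, every term except the true class vanishes. Using made-up probabilities \hat{\mathbf{y}} = (0.1, 0.6, 0.1, 0.1, 0.1):

L(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{i=1}^{5} y_i \log(\hat y_i) = -\log(\hat y_2) = -\log(0.6) \approx 0.51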

def train_epoch_ch3(net, train_iter, loss, updater):
    # Put the model into training mode if it is an nn.Module
    if isinstance(net, torch.nn.Module):
        net.train()
    # Accumulates: total training loss, number of correct predictions, number of examples
    metric = Accumulator(3)
    for X, y in train_iter:
        y_hat = net(X)
        l = loss(y_hat, y)
        if isinstance(updater, torch.optim.Optimizer):
            # Built-in PyTorch optimizer and loss
            updater.zero_grad()
            l.mean().backward()
            updater.step()
        else:
            # Custom-built optimizer and loss
            l.sum().backward()
            updater(X.shape[0])
        metric.add(float(l.sum()), accuracy(y_hat, y), y.numel())
    # Return average training loss and training accuracy
    return metric[0] / metric[2], metric[1] / metric[2]

For the first isinstance check: if net is an instance of torch.nn.Module, the function puts it into “training mode” with net.train().

In “training mode”, dropout layers are active where applicable, and batchnorm layers use per-batch statistics instead of the running statistics they use in “eval mode”.
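
A toy check of the difference (my own example, using a bare dropout layer):

import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 6)

drop.train()    # training mode: roughly half the entries are zeroed, the rest scaled by 1/(1-p)
print(drop(x))  # e.g. tensor([[2., 0., 2., 2., 0., 0.]]) -- varies between runs

drop.eval()     # eval mode: dropout is the identity
print(drop(x))  # tensor([[1., 1., 1., 1., 1., 1.]])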

For the second isinstance check: if updater is not an instance of torch.optim.Optimizer, the function assumes a custom updater function is being used and performs a different set of operations:

  • Computes the gradient of the loss with respect to the parameters (l.sum().backward()).
  • Calls the custom updater function with the batch size as an argument (updater(X.shape[0])); a sketch of such an updater follows below.
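
For contrast, a minimal custom updater in the style of the book's from-scratch sgd might look like this (the loss is summed, so the gradient is divided by the batch size; W, b and lr in the usage comment are assumed to be defined elsewhere):

import torch

def sgd(params, lr, batch_size):
    # Minibatch SGD for a summed loss: dividing the gradient by the batch size
    # recovers the averaged update rule from the linear regression section.
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()

# usage with train_epoch_ch3 would be roughly:
# updater = lambda batch_size: sgd([W, b], lr, batch_size)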
