Comments on 3.1
Gradient descent for linear regression:
Here, the minibatch stochastic gradient descent update is

$$(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w}, b)}\, l^{(i)}(\mathbf{w}, b), \qquad l^{(i)}(\mathbf{w}, b) = \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$

Partial derivatives are first taken on the squared term, yielding the residual $\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}$.
Then for the $\mathbf{w}$ part, the inner derivative extracts a factor of $\mathbf{x}^{(i)}$, which multiplies the residual:

$$\mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right).$$

No extra factor appears for the $b$ part, since its inner derivative is 1:

$$b \leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right).$$
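A quick autograd check of this gradient on a single made-up data point (the numbers and variable names here are arbitrary, just a sketch):

import torch

# One made-up data point and made-up parameters
x = torch.tensor([1.0, 2.0])
y = torch.tensor(3.0)
w = torch.tensor([0.5, -0.5], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

# Squared loss for this single point
l = 0.5 * (torch.dot(w, x) + b - y) ** 2
l.backward()

residual = (torch.dot(w, x) + b - y).item()   # the inner term after differentiating the square
print(w.grad, residual * x)                   # gradient w.r.t. w is residual * x
print(b.grad, residual)                       # gradient w.r.t. b is just the residual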
Maximum Likelihood Estimation:
For each individual point, the likelihood is

$$P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}\left(y - \mathbf{w}^\top \mathbf{x} - b\right)^2\right).$$

This is obtained because we set $y = \mathbf{w}^\top \mathbf{x} + b + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
Rearranging gives $\epsilon = y - \mathbf{w}^\top \mathbf{x} - b$. Since $\epsilon$ has that normal distribution, we substitute this right-hand side (which contains $y$, $b$, $\mathbf{x}$ and $\mathbf{w}$) into its p.d.f., turning it into a function of $\mathbf{x}$ and $y$.
Then for all points, the likelihood of the whole dataset is the product

$$P(\mathbf{y} \mid \mathbf{X}) = \prod_{i=1}^{n} p\left(y^{(i)} \mid \mathbf{x}^{(i)}\right).$$

The ultimate goal is to find the $\mathbf{w}$ and $b$ that maximize this value. The optimization can be simplified with a log transformation, i.e. maximizing the log likelihood instead.
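Writing out the negative log likelihood makes the connection to squared loss explicit (a standard reduction under the Gaussian noise assumption above, nothing extra assumed):

$$-\log P(\mathbf{y} \mid \mathbf{X}) = \sum_{i=1}^{n} \left( \frac{1}{2}\log\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2}\left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2 \right).$$

With $\sigma$ fixed, the first term is a constant, so maximizing the likelihood over $\mathbf{w}$ and $b$ is the same as minimizing the sum of squared errors used by the gradient descent above.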
Implementation of linear regression
SGD is basically the gradient update rule from the previous chapter, performed on each minibatch instead of once per epoch.
trainer = torch.optim.SGD(net.parameters(), lr=0.03)
num_epochs = 3
for epoch in range(num_epochs):
    for X, y in data_iter:
        l = loss(net(X), y)
        trainer.zero_grad()
        l.backward()
        trainer.step()
    l = loss(net(features), labels)
    print(f'epoch {epoch + 1}, loss {l:f}')
Here loss = nn.MSELoss() and trainer = torch.optim.SGD(net.parameters(), lr=0.03), where net = nn.Sequential(nn.Linear(2, 1)).
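For context, a minimal self-contained version of this setup; the synthetic data generation and the DataLoader plumbing here are my own filler (true_w, true_b and the noise scale are made up), not from the notes above:

import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

# Synthetic data: y = Xw + b + noise (made-up "true" parameters for illustration)
true_w = torch.tensor([2.0, -3.4])
true_b = 4.2
features = torch.randn(1000, 2)
labels = (features @ true_w + true_b + 0.01 * torch.randn(1000)).reshape(-1, 1)

# Model, loss, and optimizer as described above
net = nn.Sequential(nn.Linear(2, 1))
loss = nn.MSELoss()
trainer = torch.optim.SGD(net.parameters(), lr=0.03)

# Minibatch iterator
data_iter = DataLoader(TensorDataset(features, labels), batch_size=10, shuffle=True)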
- Forward pass: l = loss(net(X), y) computes the loss for the current batch of data.
- Zero gradients: trainer.zero_grad() clears the old gradients.
- Backward pass: l.backward() computes the new gradients of the loss with respect to the model parameters.
- Update parameters: trainer.step() updates the model parameters using the computed gradients.
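To make these four steps concrete, here is a hand-rolled equivalent for plain SGD, continuing from the setup sketched above (a sketch only, not the actual torch.optim internals):

X, y = next(iter(data_iter))        # one minibatch
lr = 0.03

# zero_grad: clear gradients accumulated from the previous step
for param in net.parameters():
    if param.grad is not None:
        param.grad.zero_()

l = loss(net(X), y)                 # forward pass: loss for the current batch
l.backward()                        # backward pass: fills param.grad for every parameter

# step: plain SGD update, w <- w - lr * grad
with torch.no_grad():
    for param in net.parameters():
        param -= lr * param.grad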
Softmax
logit = "unnormalized prediction"
Overflow: in the original softmax $\hat{y}_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}$, if any $o_k$ is too big, $\exp(o_k)$ might exceed the maximum number that the data type can handle, and $\hat{y}_j$ comes out as 0, inf, or nan.
To solve this, subtract the largest logit from every logit before exponentiating, which leaves the softmax value unchanged:

$$\hat{y}_j = \frac{\exp\left(o_j - \max_k o_k\right)}{\sum_k \exp\left(o_k - \max_k o_k\right)}.$$
Underflow: after this shift, $o_j - \max_k o_k$ can still be very negative, so its exponential underflows to 0 and $\log(\hat{y}_j)$ becomes -inf, which gives bad results when being backpropagated. This is solved by taking the log of the softmax directly, so the exponential and the log cancel:

$$\log\left(\hat{y}_j\right) = o_j - \max_k o_k - \log\left(\sum_k \exp\left(o_k - \max_k o_k\right)\right).$$
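A quick numerical check of the two failure modes and the stable version (the logit values below are made up for illustration):

import torch

o = torch.tensor([1000.0, 0.0, -1000.0])    # extreme made-up logits

# Naive softmax: exp(1000.) overflows to inf, giving nan after the division
naive = torch.exp(o) / torch.exp(o).sum()
print(naive)                                 # tensor([nan, 0., 0.])

# Shifted softmax avoids overflow, but the log of an underflowed 0 is still -inf
shifted = torch.exp(o - o.max()) / torch.exp(o - o.max()).sum()
print(torch.log(shifted))                    # -inf for the smaller logits

# log-softmax computed directly via the log-sum-exp identity stays finite
log_probs = o - o.max() - torch.log(torch.exp(o - o.max()).sum())
print(log_probs)

In practice PyTorch's F.log_softmax and nn.CrossEntropyLoss apply this trick internally.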
Cross entropy loss: $l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j} y_j \log \hat{y}_j$.
Example: with 5 classes and the real class being no. 2, the one-hot label is $\mathbf{y} = (0, 1, 0, 0, 0)$, so the loss reduces to $-\log \hat{y}_2$, the negative log of the probability the model assigns to the true class.
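The same example checked with PyTorch (the logit values are arbitrary; class no. 2 is index 1 because indices are 0-based):

import torch
import torch.nn.functional as F

logits = torch.tensor([[0.2, 2.0, 0.1, -0.5, 0.3]])   # arbitrary logits for 5 classes
target = torch.tensor([1])                             # true class is no. 2 -> index 1

# cross entropy = -log softmax(logits)[true class]
manual = -F.log_softmax(logits, dim=1)[0, 1]
builtin = F.cross_entropy(logits, target)
print(manual.item(), builtin.item())                   # the two values match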
def train_epoch_ch3(net, train_iter, loss, updater):
    if isinstance(net, torch.nn.Module):
        net.train()
    metric = Accumulator(3)
    for X, y in train_iter:
        y_hat = net(X)
        l = loss(y_hat, y)
        if isinstance(updater, torch.optim.Optimizer):
            updater.zero_grad()
            l.mean().backward()
            updater.step()
        else:
            l.sum().backward()
            updater(X.shape[0])
        metric.add(float(l.sum()), accuracy(y_hat, y), y.numel())
    return metric[0] / metric[2], metric[1] / metric[2]
For the first isinstance: if net is an instance of torch.nn.Module, it puts the net into "training mode".
In training mode, dropout layers are active where applicable, and batchnorm layers behave differently than in "eval mode".
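A small illustration of the train/eval switch using a standalone dropout layer (not part of any model above, just a sketch):

import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()          # training mode: roughly half the inputs are zeroed, the rest scaled by 2
print(drop(x))

drop.eval()           # eval mode: dropout is a no-op, the input passes through unchanged
print(drop(x))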
For the second isinstance: if updater is not an instance of torch.optim.Optimizer, it assumes a custom updater function is being used and performs a different set of operations (a sketch of such a custom updater follows below):
- Computes the gradient of the loss with respect to the parameters (l.sum().backward()).
- Calls the custom updater function with the batch size as an argument (updater(X.shape[0])).
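A minimal sketch of what such a custom updater could look like, modeled on the hand-written minibatch SGD from the scratch implementation (the params, W, b, and lr names are assumptions here):

import torch

def sgd(params, lr, batch_size):
    """Minibatch SGD: update each parameter in place using its gradient, then clear it."""
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()

# Used as the non-Optimizer branch of train_epoch_ch3, e.g.:
# updater = lambda batch_size: sgd([W, b], lr, batch_size)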