Suppose the vector \theta holds all the parameters of a neural network. I wonder how to compute the Hessian matrix for \theta in PyTorch.
Suppose the network is as follows:
import torch
from torch.nn import Module

class Net(Module):
    def __init__(self, h, w):
        super(Net, self).__init__()
        self.c1 = torch.nn.Conv2d(1, 32, 3, 1, 1)
        self.f2 = torch.nn.Linear(32 * h * w, 5)

    def forward(self, x):
        x = self.c1(x)
        x = x.view(x.size(0), -1)
        x = self.f2(x)
        return x
I know the second derivative can be calculated by calling torch.autograd.grad() twice, but the parameters in PyTorch are organized by net.parameters(), and I don't know how to compute the Hessian for all of the parameters at once.
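For a single flat parameter tensor, what I mean is something like this toy sketch (not the network above, just a stand-in scalar function):

import torch

p = torch.randn(4, requires_grad=True)   # stand-in for one flat parameter vector
loss = (p ** 3).sum()                    # stand-in for a scalar loss

# first call: create_graph=True so the gradient is itself differentiable
grad, = torch.autograd.grad(loss, p, create_graph=True)

# second call: differentiate each entry of the gradient to get one Hessian row
rows = [torch.autograd.grad(g, p, retain_graph=True)[0] for g in grad]
hessian = torch.stack(rows)              # shape [4, 4]; here equals diag(6 * p)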
I have tried to use torch.autograd.functional.hessian() in PyTorch 1.5 as follows:
import torch
from torch.nn import Module
import torch.nn.functional as F

class Net(Module):
    def __init__(self, h, w):
        super(Net, self).__init__()
        self.c1 = torch.nn.Conv2d(1, 32, 3, 1, 1)
        self.f2 = torch.nn.Linear(32 * h * w, 5)

    def forward(self, x):
        x = self.c1(x)
        x = x.view(x.size(0), -1)
        x = self.f2(x)
        return x

def func_(a, b, c, d):
    # a, b, c, d are the four parameter tensors of Net,
    # in the order returned by net.parameters()
    p = [a, b, c, d]
    x = torch.randn(size=[8, 1, 12, 12], dtype=torch.float32)
    y = torch.randint(0, 5, [8])
    x = F.conv2d(x, p[0], p[1], 1, 1)
    x = x.view(x.size(0), -1)
    x = F.linear(x, p[2], p[3])
    loss = F.cross_entropy(x, y)
    return loss

if __name__ == '__main__':
    net = Net(12, 12)
    h = torch.autograd.functional.hessian(func_, tuple(net.parameters()))
    print(type(h), len(h))
h is a tuple, and the results come in strange shapes. For example, the shape of \frac{\partial^2 Loss}{\partial c1.weight^2} is [32,1,3,3,32,1,3,3]. It seems like I could combine them into a complete H, but I don't know which part of the whole Hessian matrix each block corresponds to, or in what order.
Here is one solution. I think it's a little too complex, but it could be instructive.
Consider this point: the first argument of torch.autograd.functional.hessian() must be a function, and the second argument should be a tuple or list of tensors. That means we cannot directly pass a scalar loss to it. (I don't know why, because I think there is no large difference between a scalar loss and a function that returns a scalar.)

So here is the solution: I build a new function haha that works in the same way as the neural network Net. Notice that the arguments a, b, c, d are all expanded into one-dimensional vectors, so that the tensors in h are all two-dimensional, in good order and easy to combine into one large Hessian matrix.
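Here is a sketch of haha, assuming Net(12, 12) as above, so each flat vector is reshaped back to its parameter shape inside the function:

import torch
import torch.nn.functional as F

def haha(a, b, c, d):
    # a, b, c, d are 1-D vectors holding c1.weight, c1.bias,
    # f2.weight and f2.bias; restore their shapes before the forward pass
    w1 = a.view(32, 1, 3, 3)       # c1.weight, 288 values
    b1 = b.view(32)                # c1.bias, 32 values
    w2 = c.view(5, 32 * 12 * 12)   # f2.weight, 23040 values
    b2 = d.view(5)                 # f2.bias, 5 values
    x = torch.randn(size=[8, 1, 12, 12], dtype=torch.float32)
    y = torch.randint(0, 5, [8])
    x = F.conv2d(x, w1, b1, 1, 1)
    x = x.view(x.size(0), -1)
    x = F.linear(x, w2, b2)
    return F.cross_entropy(x, y)

net = Net(12, 12)
flat_params = tuple(p.detach().reshape(-1) for p in net.parameters())
h = torch.autograd.functional.hessian(haha, flat_params)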
In my example, the tensors in h have shapes (n_i, n_j), where (n_0, n_1, n_2, n_3) = (288, 32, 23040, 5) are the flattened sizes of a, b, c, d. So it is easy to see the meaning of each tensor and which part of the Hessian it is. All we need to do is allocate a (288+32+23040+5) * (288+32+23040+5) matrix and fill the tensors in h into the corresponding locations.

I think the solution could still be improved: we shouldn't need to build a function that works the same way as the neural network, or transform the shape of the parameters twice. But for now I don't have a better idea; if there is a better solution, please let me know.
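For reference, the allocate-and-fill step can look like this (a sketch, assuming h was computed from the flattened parameters as in haha above):

import torch

sizes = [288, 32, 23040, 5]          # flattened sizes of a, b, c, d
offsets = [0]
for s in sizes:
    offsets.append(offsets[-1] + s)  # block boundaries: 0, 288, 320, 23360, 23365

# full Hessian: 23365 x 23365, roughly 2 GB in float32
H = torch.zeros(offsets[-1], offsets[-1])
for i in range(len(sizes)):
    for j in range(len(sizes)):
        # h[i][j] has shape (sizes[i], sizes[j]): the second derivatives
        # of the loss w.r.t. parameter groups i and j
        H[offsets[i]:offsets[i + 1], offsets[j]:offsets[j + 1]] = h[i][j]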