Callback and helper function to add hooks in models
from nbdev.showdoc import *
from fastai2.test_utils import *

What are hooks?

Hooks are functions you can attach to a particular layer in your model; they will be executed in the forward pass (for forward hooks) or backward pass (for backward hooks). Here we begin with an introduction to hooks, but you should jump to HookCallback if you want to quickly implement one (and read the following example, ActivationStats).

Forward hooks are functions that take three arguments: the layer they're applied to, the input of that layer, and the output of that layer.

tst_model = nn.Linear(5,3)
def example_forward_hook(m,i,o): print(m,i,o)
    
x = torch.randn(4,5)
hook = tst_model.register_forward_hook(example_forward_hook)
y = tst_model(x)
hook.remove()
Linear(in_features=5, out_features=3, bias=True) (tensor([[-2.0659,  0.0029, -0.9297, -1.6083, -1.7315],
        [-1.6195, -1.5001, -0.8875, -0.3388, -0.3652],
        [ 0.2002,  0.2435,  1.2172,  0.8630,  1.2036],
        [ 0.6391,  1.2876, -1.9677,  0.1012, -0.4137]]),) tensor([[-0.1301,  0.0155,  0.2497],
        [ 0.2340, -0.3740, -0.4380],
        [-0.3792, -0.4760, -0.1643],
        [ 0.4125,  0.9391, -1.2156]], grad_fn=<AddmmBackward>)

Backward hooks are functions that take three arguments: the layer they're applied to, the gradients of the loss with respect to the input, and the gradients with respect to the output.

def example_backward_hook(m,gi,go): print(m,gi,go)
hook = tst_model.register_backward_hook(example_backward_hook)

x = torch.randn(4,5)
y = tst_model(x)
loss = y.pow(2).mean()
loss.backward()
hook.remove()
Linear(in_features=5, out_features=3, bias=True) (tensor([-0.0889, -0.1311, -0.3239]), None, tensor([[-0.1187,  0.2066, -0.1778],
        [-0.0607,  0.3248, -0.0937],
        [ 0.0405, -0.0819,  0.1595],
        [ 0.0120, -0.1240,  0.0143],
        [-0.0533, -0.2217, -0.2708]])) (tensor([[-0.0366,  0.0250, -0.1122],
        [-0.0503, -0.0623, -0.1142],
        [ 0.0212, -0.1468, -0.0359],
        [-0.0231,  0.0530, -0.0617]]),)

Hooks can change the input/output of a layer or the gradients, and print values or shapes. If you want to store something related to these inputs/outputs, it's best to have your hook associated with a class so that it can put it in the state of an instance of that class.
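For instance, here is a minimal sketch of that pattern in plain PyTorch (hypothetical `StoringHook` name; fastai's `Hook` class below does this more generally):

```python
import torch
import torch.nn as nn

class StoringHook:
    "Minimal sketch of a stateful hook: stores every output it sees."
    def __init__(self, m):
        self.outputs = []
        self.handle = m.register_forward_hook(self.hook_fn)
    def hook_fn(self, m, i, o): self.outputs.append(o.detach())
    def remove(self): self.handle.remove()

tst = nn.Linear(5, 3)
h = StoringHook(tst)
y = tst(torch.randn(4, 5))
h.remove()
assert len(h.outputs) == 1 and tuple(h.outputs[0].shape) == (4, 3)
```

Keeping the handle on the instance also lets the class expose a `remove` method, which is exactly the shape of the `Hook` class documented next.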

class Hook[source]

Hook(m, hook_func, is_forward=True, detach=True, cpu=False, gather=False)

Create a hook on m with hook_func.

This will be called during the forward pass if is_forward=True, or the backward pass otherwise, and will optionally detach, gather, and put on the CPU the (gradient of the) input/output of the model before passing them to hook_func. The result of hook_func will be stored in the stored attribute of the Hook.

tst_model = nn.Linear(5,3)
hook = Hook(tst_model, lambda m,i,o: o)
y = tst_model(x)
test_eq(hook.stored, y)

Hook.hook_fn[source]

Hook.hook_fn(module, input, output)

Applies hook_func to module, input, output.

Hook.remove[source]

Hook.remove()

Remove the hook from the model.

tst_model = nn.Linear(5,10)
x = torch.randn(4,5)
y = tst_model(x)
hook = Hook(tst_model, example_forward_hook)
test_stdout(lambda: tst_model(x), f"{tst_model} ({x},) {y.detach()}")
hook.remove()
test_stdout(lambda: tst_model(x), "")

Context Manager

Since it's very important to remove your Hook even if your code is interrupted by some bug, a Hook can be used as a context manager.
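In plain PyTorch terms, the context manager packages the usual try/finally pattern; a sketch with a raw hook handle:

```python
import torch
import torch.nn as nn

model = nn.Linear(5, 3)
stored = []
handle = model.register_forward_hook(lambda m, i, o: stored.append(o.detach()))
try:
    model(torch.randn(2, 5))
finally:
    handle.remove()  # runs even if the forward pass raises
assert len(stored) == 1
```

`Hook.__enter__`/`__exit__` below do the registering and removing for you.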

Hook.__enter__[source]

Hook.__enter__(*args)

Register the hook

Hook.__exit__[source]

Hook.__exit__(*args)

Remove the hook

tst_model = nn.Linear(5,10)
x = torch.randn(4,5)
y = tst_model(x)
with Hook(tst_model, example_forward_hook) as h:
    test_stdout(lambda: tst_model(x), f"{tst_model} ({x},) {y.detach()}")
test_stdout(lambda: tst_model(x), "")

hook_output[source]

hook_output(module, detach=True, cpu=False, grad=False)

Return a Hook that stores activations of module in self.stored

The activations stored are the gradients if grad=True, otherwise the output of module. If detach=True they are detached from their history, and if cpu=True, they're put on the CPU.

tst_model = nn.Linear(5,10)
x = torch.randn(4,5)
with hook_output(tst_model) as h:
    y = tst_model(x)
    test_eq(y, h.stored)
    assert not h.stored.requires_grad
    
with hook_output(tst_model, grad=True) as h:
    y = tst_model(x)
    loss = y.pow(2).mean()
    loss.backward()
    test_close(2*y / y.numel(), h.stored[0])
#cuda
with hook_output(tst_model, cpu=True) as h:
    y = tst_model.cuda()(x.cuda())
    test_eq(h.stored.device, torch.device('cpu'))

class Hooks[source]

Hooks(ms, hook_func, is_forward=True, detach=True, cpu=False)

Create several hooks on the modules in ms with hook_func.

layers = [nn.Linear(5,10), nn.ReLU(), nn.Linear(10,3)]
tst_model = nn.Sequential(*layers)
hooks = Hooks(tst_model, lambda m,i,o: o)
y = tst_model(x)
test_eq(hooks.stored[0], layers[0](x))
test_eq(hooks.stored[1], F.relu(layers[0](x)))
test_eq(hooks.stored[2], y)
hooks.remove()

Hooks.stored[source]

The states saved in each hook.

Hooks.remove[source]

Hooks.remove()

Remove the hooks from the model.

Context Manager

Like Hook, you can use Hooks as context managers.

Hooks.__enter__[source]

Hooks.__enter__(*args)

Register the hooks

Hooks.__exit__[source]

Hooks.__exit__(*args)

Remove the hooks

layers = [nn.Linear(5,10), nn.ReLU(), nn.Linear(10,3)]
tst_model = nn.Sequential(*layers)
with Hooks(layers, lambda m,i,o: o) as h:
    y = tst_model(x)
    test_eq(h.stored[0], layers[0](x))
    test_eq(h.stored[1], F.relu(layers[0](x)))
    test_eq(h.stored[2], y)

hook_outputs[source]

hook_outputs(modules, detach=True, cpu=False, grad=False)

Return Hooks that store activations of all modules in self.stored

The activations stored are the gradients if grad=True, otherwise the output of modules. If detach=True they are detached from their history, and if cpu=True, they're put on the CPU.

layers = [nn.Linear(5,10), nn.ReLU(), nn.Linear(10,3)]
tst_model = nn.Sequential(*layers)
x = torch.randn(4,5)
with hook_outputs(layers) as h:
    y = tst_model(x)
    test_eq(h.stored[0], layers[0](x))
    test_eq(h.stored[1], F.relu(layers[0](x)))
    test_eq(h.stored[2], y)
    for s in h.stored: assert not s.requires_grad
    
with hook_outputs(layers, grad=True) as h:
    y = tst_model(x)
    loss = y.pow(2).mean()
    loss.backward()
    g = 2*y / y.numel()
    test_close(g, h.stored[2][0])
    g = g @ layers[2].weight.data
    test_close(g, h.stored[1][0])
    g = g * (layers[0](x) > 0).float()
    test_close(g, h.stored[0][0])
#cuda
with hook_outputs(tst_model, cpu=True) as h:
    y = tst_model.cuda()(x.cuda())
    for s in h.stored: test_eq(s.device, torch.device('cpu'))

dummy_eval[source]

dummy_eval(m, size=(64, 64))

Evaluate m on a dummy input of a certain size
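Conceptually (a sketch under assumptions, not the exact implementation), this amounts to running the model in eval mode on a single fake 3-channel input of the given spatial size:

```python
import torch
import torch.nn as nn

def dummy_eval_sketch(m, size=(64, 64)):
    "Sketch of dummy_eval: run m in eval mode on one fake 3-channel input."
    m.eval()
    with torch.no_grad():
        return m(torch.zeros(1, 3, *size))

m = nn.Conv2d(3, 16, 3, padding=1)
assert tuple(dummy_eval_sketch(m).shape) == (1, 16, 64, 64)
```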

model_sizes[source]

model_sizes(m, size=(64, 64))

Pass a dummy input through the model m to get the various sizes of activations.

m = nn.Sequential(ConvLayer(3, 16), ConvLayer(16, 32, stride=2), ConvLayer(32, 32))
test_eq(model_sizes(m), [[1, 16, 64, 64], [1, 32, 32, 32], [1, 32, 32, 32]])

num_features_model[source]

num_features_model(m)

Return the number of output features for m.

m = nn.Sequential(nn.Conv2d(5,4,3), nn.Conv2d(4,3,3))
test_eq(num_features_model(m), 3)
m = nn.Sequential(ConvLayer(3, 16), ConvLayer(16, 32, stride=2), ConvLayer(32, 32))
test_eq(num_features_model(m), 32)

To make hooks easy to use, we wrapped a version in a Callback where you only have to implement a hook function (plus any other element you might need).

has_params[source]

has_params(m)

Check if m has at least one parameter

assert has_params(nn.Linear(3,4))
assert has_params(nn.LSTM(4,5,2))
assert not has_params(nn.ReLU())

class HookCallback[source]

HookCallback(modules=None, every=None, remove_end=True, is_forward=True, detach=True, cpu=True, hook=None) :: Callback

Callback that can be used to register hooks on modules

You can either subclass it and implement a hook function (along with any event you want), or pass it a hook function when initializing. Such a function needs to take three arguments: a layer, input and output (for a backward hook, input means gradient with respect to the inputs, output, gradient with respect to the output) and can either modify them or update the state according to them.

If not provided, modules will default to the layers of self.model that have a weight attribute. Depending on remove_end, the hooks will be properly removed at the end of training (or in case of error). is_forward, detach and cpu are passed to Hooks.

The function called at each forward (or backward) pass is self.hook and must be implemented when subclassing this callback.
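The default module selection can be sketched as filtering submodules for a weight attribute (a hypothetical one-liner; the real logic lives inside HookCallback):

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(5, 10), nn.ReLU(), nn.Linear(10, 3))
# Keep only submodules that have a weight attribute (skips Sequential, ReLU)
mods = [m for m in model.modules() if hasattr(m, 'weight')]
assert [type(m).__name__ for m in mods] == ['Linear', 'Linear']
```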

class TstCallback(HookCallback):
    def hook(self, m, i, o): return o
    def after_batch(self): test_eq(self.hooks.stored[0], self.pred)
        
learn = synth_learner(n_trn=5, cbs = TstCallback())
learn.fit(1)
(#4) [0,17.76603126525879,23.31198501586914,'00:00']
class TstCallback(HookCallback):
    def __init__(self, modules=None, remove_end=True, detach=True, cpu=False):
        super().__init__(modules, None, remove_end, False, detach, cpu)
    def hook(self, m, i, o): return o
    def after_batch(self):
        if self.training:
            test_eq(self.hooks.stored[0][0], 2*(self.pred-self.y)/self.pred.shape[0])
        
learn = synth_learner(n_trn=5, cbs = TstCallback())
learn.fit(1)
(#4) [0,3.439974069595337,2.1897826194763184,'00:00']

HookCallback.begin_fit[source]

HookCallback.begin_fit()

Register the Hooks on self.modules.

HookCallback.after_fit[source]

HookCallback.after_fit()

Remove the Hooks.

Model summary

total_params[source]

total_params(m)

Give the number of parameters of a module and whether it's trainable.

test_eq(total_params(nn.Linear(10,32)), (32*10+32,True))
test_eq(total_params(nn.Linear(10,32, bias=False)), (32*10,True))
test_eq(total_params(nn.BatchNorm2d(20)), (20*2, True))
test_eq(total_params(nn.BatchNorm2d(20, affine=False)), (0,False))
test_eq(total_params(nn.Conv2d(16, 32, 3)), (16*32*3*3 + 32, True))
test_eq(total_params(nn.Conv2d(16, 32, 3, bias=False)), (16*32*3*3, True))
#First ih layer 20--10, all else 10--10. *4 for the four gates
test_eq(total_params(nn.LSTM(20, 10, 2)), (4 * (20*10 + 10) + 3 * 4 * (10*10 + 10), True))
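As a cross-check, the same LSTM count can be obtained by summing parameter numels directly:

```python
import torch.nn as nn

lstm = nn.LSTM(20, 10, 2)
n = sum(p.numel() for p in lstm.parameters())
# Each of the 4 gates has one 20x10 matrix (first-layer input->hidden) plus
# three 10x10 matrices (first-layer hidden->hidden, second-layer input->hidden
# and hidden->hidden), each group with its own bias of size 10.
assert n == 4*(20*10 + 10) + 3*4*(10*10 + 10)  # 2160
```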

layer_info[source]

layer_info(model, *xb)

Return layer information for model on xb (only supports batch-first inputs).

m = nn.Sequential(nn.Linear(1,50), nn.ReLU(), nn.BatchNorm1d(50), nn.Linear(50, 1))
sample_input = torch.randn((16, 1))
test_eq(layer_info(m, sample_input)[1], [
    ('Linear', 100, True, [1, 50]),
    ('ReLU', 0, False, [1, 50]),
    ('BatchNorm1d', 100, True, [1, 50]),
    ('Linear', 51, True, [1, 1])
])
# Test for multiple inputs model
class _2InpModel(Module):
    def __init__(self):
        super().__init__()
        self.seq = nn.Sequential(nn.Linear(2,50), nn.ReLU(), nn.BatchNorm1d(50), nn.Linear(50, 1))
    def forward(self, *inps):
        outputs = torch.cat(inps, dim=-1)
        return self.seq(outputs)


m = _2InpModel()
sample_inputs = (torch.randn(16, 1), torch.randn(16, 1))
test_eq(layer_info(m, *sample_inputs)[1], [
    ('Linear', 150, True, [1, 50]),
    ('ReLU', 0, False, [1, 50]),
    ('BatchNorm1d', 100, True, [1, 50]),
    ('Linear', 51, True, [1, 1])
])

Module.summary[source]

Module.summary(*xb)

Print a summary of self using xb

m = nn.Sequential(nn.Linear(1,50), nn.ReLU(), nn.BatchNorm1d(50), nn.Linear(50, 1))
for p in m[0].parameters(): p.requires_grad_(False)
sample_input = torch.randn((16, 1))
m.summary(sample_input)
Sequential (Input shape: ['16 x 1'])
================================================================
Layer (type)         Output Shape         Param #    Trainable 
================================================================
Linear               16 x 50              100        False     
________________________________________________________________
ReLU                 16 x 50              0          False     
________________________________________________________________
BatchNorm1d          16 x 50              100        True      
________________________________________________________________
Linear               16 x 1               51         True      
________________________________________________________________

Total params: 251
Total trainable params: 151
Total non-trainable params: 100

Learner.summary[source]

Learner.summary()

Print a summary of the model, optimizer and loss function.

m = nn.Sequential(nn.Linear(1,50), nn.ReLU(), nn.BatchNorm1d(50), nn.Linear(50, 1))
for p in m[0].parameters(): p.requires_grad_(False)
learn = synth_learner()
learn.create_opt()
learn.model=m
learn.summary()
Sequential (Input shape: ['16 x 1'])
================================================================
Layer (type)         Output Shape         Param #    Trainable 
================================================================
Linear               16 x 50              100        False     
________________________________________________________________
ReLU                 16 x 50              0          False     
________________________________________________________________
BatchNorm1d          16 x 50              100        True      
________________________________________________________________
Linear               16 x 1               51         True      
________________________________________________________________

Total params: 251
Total trainable params: 151
Total non-trainable params: 100

Optimizer used: functools.partial(<function SGD at 0x7f6dd8ae4f80>, mom=0.9)
Loss function: FlattenedLoss of MSELoss()

Model unfrozen

Callbacks:
  - TrainEvalCallback
  - Recorder
# Test for multiple output
class _NOutModel(nn.Module):
    def forward(self, x1):
        seq_len, bs, hid_size = 50, 16, 256
        num_layer = 1
        return torch.randn((seq_len, bs, hid_size)), torch.randn((num_layer, bs, hid_size))
m = _NOutModel()
learn = synth_learner()
learn.model = m
learn.summary() # Output Shape should be (50, 16, 256), (1, 16, 256)
_NOutModel (Input shape: ['16 x 1'])
================================================================
Layer (type)         Output Shape         Param #    Trainable 
================================================================
_NOutModel           ['16 x 16 x 256', '  0          False     
________________________________________________________________

Total params: 0
Total trainable params: 0
Total non-trainable params: 0

Optimizer used: functools.partial(<function SGD at 0x7f6dd8ae4f80>, mom=0.9)
Loss function: FlattenedLoss of MSELoss()

Callbacks:
  - TrainEvalCallback
  - Recorder

Activation graphs

This is an example of a HookCallback that stores the means, standard deviations and histograms of the activations that go through the network.

@delegates()
class ActivationStats(HookCallback):
    "Callback that records the mean and std of activations."
    run_before=TrainEvalCallback
    def __init__(self, with_hist=False, **kwargs):
        super().__init__(**kwargs)
        self.with_hist = with_hist

    def begin_fit(self):
        "Initialize stats."
        super().begin_fit()
        self.stats = L()

    def hook(self, m, i, o):
        o = o.float()
        res = {'mean': o.mean().item(), 'std': o.std().item(),
               'near_zero': (o<=0.05).long().sum().item()/o.numel()}
        if self.with_hist: res['hist'] = o.histc(40,0,10)
        return res

    def after_batch(self):
        "Take the stored results and put them in `self.stats`"
        if self.training and (self.every is None or self.train_iter%self.every == 0):
            self.stats.append(self.hooks.stored)
        super().after_batch()
        
    def layer_stats(self, idx):
        lstats = self.stats.itemgot(idx)
        return L(lstats.itemgot(o) for o in ('mean','std','near_zero'))
    
    def hist(self, idx):
        res = self.stats.itemgot(idx).itemgot('hist')
        return torch.stack(tuple(res)).t().float().log1p()

    def color_dim(self, idx, figsize=(10,5), ax=None):
        "The 'colorful dimension' plot"
        res = self.hist(idx)
        if ax is None: ax = subplots(figsize=figsize)[1][0]
        ax.imshow(res, origin='lower')
        ax.axis('off')

    def plot_layer_stats(self, idx):
        _,axs = subplots(1, 3, figsize=(12,3))
        for o,ax,title in zip(self.layer_stats(idx),axs,('mean','std','% near zero')):
            ax.plot(o)
            ax.set_title(title)

class ActivationStats[source]

ActivationStats(with_hist=False, modules=None, every=None, remove_end=True, is_forward=True, detach=True, cpu=True, hook=None) :: HookCallback

Callback that records the mean and std of activations.

learn = synth_learner(n_trn=5, cbs = ActivationStats(every=4))
learn.fit(1)
(#4) [0,15.9533109664917,16.78474998474121,'00:00']
learn.activation_stats.stats
(#2) [(#1) [{'mean': 0.7498316168785095, 'std': 1.0590876340866089, 'near_zero': 0.375}],(#1) [{'mean': 0.6023440361022949, 'std': 0.915452241897583, 'near_zero': 0.1875}]]

stats contains one entry per recorded batch (here every fourth batch of training); each entry holds, for every hooked module, a dict with the mean, standard deviation and fraction of near-zero activations of its output.

import math

def test_every(n_tr, every):
    "create a learner, fit, then check number of stats collected"
    learn = synth_learner(n_trn=n_tr, cbs=ActivationStats(every=every))
    learn.fit(1)
    expected_stats_len = math.ceil(n_tr / every)
    test_eq(expected_stats_len, len(learn.activation_stats.stats))
    
for n_tr in [11, 12, 13]:
    test_every(n_tr, 4)
    test_every(n_tr, 1)
(#4) [0,12.596071243286133,10.469954490661621,'00:00']
(#4) [0,31.34967803955078,23.612323760986328,'00:00']
(#4) [0,14.076311111450195,11.747014999389648,'00:00']
(#4) [0,16.8385066986084,9.499192237854004,'00:00']
(#4) [0,18.776268005371094,14.976844787597656,'00:00']
(#4) [0,11.286646842956543,8.42125415802002,'00:00']