AWD LSTM from Smerity et al.

Basic NLP modules

On top of the PyTorch and fastai layers, the language models use some custom layers specific to NLP.

dropout_mask[source]

dropout_mask(x, sz, p)

Return a dropout mask of the same type as x, size sz, with probability p to cancel an element.

t = dropout_mask(torch.randn(3,4), [4,3], 0.25)
test_eq(t.shape, [4,3])
assert ((t == 4/3) + (t==0)).all()
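
A minimal sketch of how such a mask is typically used (reusing the imports above): because the mask can be created with a reduced size, broadcasting it against x drops the same elements along a whole dimension, which is the pattern the dropouts below rely on.

x = torch.randn(3, 4, 7)
mask = dropout_mask(x, (3, 1, 7), 0.25)  # no separate mask entries along dim 1
masked = x * mask                        # the same mask is reused at every position of dim 1
test_eq(masked.shape, x.shape)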

class RNNDropout[source]

RNNDropout(p=0.5) :: Module

Dropout with probability p that is consistent on the seq_len dimension.

dp = RNNDropout(0.3)
tst_inp = torch.randn(4,3,7)
tst_out = dp(tst_inp)
for i in range(4):
    for j in range(7):
        if tst_out[i,0,j] == 0: assert (tst_out[i,:,j] == 0).all()
        else: test_close(tst_out[i,:,j], tst_inp[i,:,j]/(1-0.3))
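
Like any dropout layer, it should do nothing at evaluation time; a quick check of that behavior, reusing dp and tst_inp from above:

dp.eval()
test_eq(dp(tst_inp), tst_inp)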

class WeightDropout[source]

WeightDropout(module, weight_p, layer_names='weight_hh_l0') :: Module

A module that wraps another layer, replacing some of its weights with 0 during training.

module = nn.LSTM(5,7).cuda()
dp_module = WeightDropout(module, 0.4)
wgts = getattr(dp_module.module, 'weight_hh_l0')
tst_inp = torch.randn(10,20,5).cuda()
h = torch.zeros(1,20,7).cuda(), torch.zeros(1,20,7).cuda()
dp_module.reset()
x,h = dp_module(tst_inp,h)
new_wgts = getattr(dp_module.module, 'weight_hh_l0')
test_eq(wgts, getattr(dp_module, 'weight_hh_l0_raw'))
assert 0.2 <= (new_wgts==0).sum().float()/new_wgts.numel() <= 0.6
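
At evaluation time the raw weights should be used unchanged; a quick sketch of that check, reusing the example above (and, like it, assuming a GPU is available):

dp_module.eval()
x,h = dp_module(tst_inp, (torch.zeros(1,20,7).cuda(), torch.zeros(1,20,7).cuda()))
test_eq(getattr(dp_module.module, 'weight_hh_l0'), getattr(dp_module, 'weight_hh_l0_raw'))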

class EmbeddingDropout[source]

EmbeddingDropout(emb, embed_p) :: Module

Apply dropout with probability embed_p to an embedding layer emb.

enc = nn.Embedding(10, 7, padding_idx=1)
enc_dp = EmbeddingDropout(enc, 0.5)
tst_inp = torch.randint(0,10,(8,))
tst_out = enc_dp(tst_inp)
for i in range(8):
    assert (tst_out[i]==0).all() or torch.allclose(tst_out[i], 2*enc.weight[tst_inp[i]])
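
Embedding dropout zeroes whole rows of the embedding matrix, so every occurrence of a dropped token in a batch is zeroed together. A small check of that behavior, feeding the same token several times (the index 3 is arbitrary):

tst_inp = torch.full((6,), 3, dtype=torch.long)
tst_out = enc_dp(tst_inp)
assert (tst_out == 0).all() or torch.allclose(tst_out, (2*enc.weight[3]).expand(6,7))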

class AWD_LSTM[source]

AWD_LSTM(vocab_sz, emb_sz, n_hid, n_layers, pad_token=1, hidden_p=0.2, input_p=0.6, embed_p=0.1, weight_p=0.5, bidir=False) :: Module

AWD-LSTM inspired by https://arxiv.org/abs/1708.02182

This is the core of an AWD-LSTM model, with an embedding layer of vocab_sz by emb_sz and n_layers LSTMs (potentially bidirectional) stacked on top of each other: the first one goes from emb_sz to n_hid, the last one from n_hid to emb_sz, and all the inner ones from n_hid to n_hid. pad_token is passed to the PyTorch embedding layer. The dropouts are applied as follows:

  • the embeddings are wrapped in EmbeddingDropout of probability embed_p;
  • the result of this embedding layer goes through an RNNDropout of probability input_p;
  • each LSTM has WeightDropout applied with probability weight_p;
  • between two of the inner LSTMs, an RNNDropout is applied with probability hidden_p.

The module returns the output of the last LSTM. Since no hidden_p dropout is applied to that last output, it is the tensor that should be fed to a decoder (in the case of a language model). The hidden states of each layer are kept in the hidden attribute, as the tests below show.

tst = AWD_LSTM(100, 20, 10, 2)
x = torch.randint(0, 100, (10,5))
r = tst(x)
test_eq(tst.bs, 10)
test_eq(len(tst.hidden), 2)
test_eq([h_.shape for h_ in tst.hidden[0]], [[1,10,10], [1,10,10]])
test_eq([h_.shape for h_ in tst.hidden[1]], [[1,10,20], [1,10,20]])

test_eq(r.shape, [10,5,20])
test_eq(r[:,-1], tst.hidden[-1][0][0]) #the hidden state of the last layer is the last timestep of the output
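
As a minimal sketch of the decoder mentioned above (a hypothetical linear head, not fastai's LinearDecoder), the final output can be projected back to the vocabulary, optionally tying the decoder weight to the encoder embedding (this assumes the embedding layer is stored in the encoder attribute):

decoder = nn.Linear(20, 100)         # emb_sz -> vocab_sz
decoder.weight = tst.encoder.weight  # weight tying: both have shape (100, 20)
logits = decoder(r)
test_eq(logits.shape, [10, 5, 100])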

awd_lstm_lm_split[source]

awd_lstm_lm_split(model)

Split an RNN model into groups for differential learning rates.

splits = awd_lstm_lm_split

awd_lstm_clas_split[source]

awd_lstm_clas_split(model)

Split an RNN model into groups for differential learning rates.
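
Both splitters return one list of parameters per layer group, so that a Learner can apply discriminative learning rates to them. As a hedged, manual illustration on the AWD_LSTM built above (not a call to the library functions, which expect a full language model or classifier):

groups = [list(rnn.parameters()) for rnn in tst.rnns] + [list(tst.encoder.parameters())]
test_eq(len(groups), 3)  # one group per LSTM layer, plus one for the embeddings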

QRNN

class AWD_QRNN[source]

AWD_QRNN(vocab_sz, emb_sz, n_hid, n_layers, pad_token=1, hidden_p=0.2, input_p=0.6, embed_p=0.1, weight_p=0.5, bidir=False) :: AWD_LSTM

Same as an AWD-LSTM, but using QRNNs instead of LSTMs

model = AWD_QRNN(vocab_sz=10, emb_sz=20, n_hid=16, n_layers=2, bidir=False)
x = torch.randint(0, 10, (7,5))
y = model(x)
test_eq(y.shape, (7, 5, 20))