Functions and transforms to help gather text data in a `Datasets`
from nbdev.showdoc import *

Numericalizing

make_vocab[source]

make_vocab(count, min_freq=3, max_vocab=60000)

Create a vocab of at most max_vocab size from Counter count, keeping items that appear at least min_freq times

count = Counter(['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'd'])
test_eq(set([x for x in make_vocab(count) if not x.startswith('xxfake')]), 
        set(defaults.text_spec_tok + 'a'.split()))
test_eq(len(make_vocab(count))%8, 0)
test_eq(set([x for x in make_vocab(count, min_freq=1) if not x.startswith('xxfake')]), 
        set(defaults.text_spec_tok + 'a b c d'.split()))
test_eq(set([x for x in make_vocab(count,max_vocab=12, min_freq=1) if not x.startswith('xxfake')]), 
        set(defaults.text_spec_tok + 'a b c'.split()))

class TensorText[source]

TensorText(x, **kwargs) :: TensorBase

class LMTensorText[source]

LMTensorText(x, **kwargs) :: TensorText

class Numericalize[source]

Numericalize(vocab=None, min_freq=3, max_vocab=60000, sep=' ') :: Transform

Reversible transform of tokenized texts to numericalized ids

num = Numericalize(min_freq=1, sep=' ')
num.setup(L('This is an example of text'.split(), 'this is another text'.split()))
test_eq(set([x for x in num.vocab if not x.startswith('xxfake')]), 
        set(defaults.text_spec_tok + 'This is an example of text this another'.split()))
test_eq(len(num.vocab)%8, 0)
start = 'This is an example of text'
t = num(start.split())
test_eq(t, tensor([11, 9, 12, 13, 14, 10]))
test_eq(num.decode(t), start)
num = Numericalize(min_freq=2, sep=' ')
num.setup(L('This is an example of text'.split(), 'this is another text'.split()))
test_eq(set([x for x in num.vocab if not x.startswith('xxfake')]), 
        set(defaults.text_spec_tok + 'is text'.split()))
test_eq(len(num.vocab)%8, 0)
t = num(start.split())
test_eq(t, tensor([0, 9, 0, 0, 0, 10]))
test_eq(num.decode(t), f'{UNK} is {UNK} {UNK} {UNK} text')
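At its core, Numericalize is a reversible token-to-id lookup where out-of-vocab tokens fall back to the unknown token, as the min_freq=2 test above shows. A minimal list-based sketch (not the fastai implementation; the vocab here is hand-built rather than produced by setup):

```python
class NumericalizeSketch:
    """Minimal sketch of Numericalize: reversible token <-> id mapping."""
    def __init__(self, vocab, unk='xxunk', sep=' '):
        self.vocab, self.unk, self.sep = vocab, unk, sep
        self.o2i = {tok: i for i, tok in enumerate(vocab)}

    def encode(self, toks):
        # Tokens outside the vocab map to the id of the unknown token
        return [self.o2i.get(t, self.o2i[self.unk]) for t in toks]

    def decode(self, ids):
        return self.sep.join(self.vocab[i] for i in ids)
```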
ReindexCollection
fastcore.utils.ReindexCollection

class LMDataLoader[source]

LMDataLoader(dataset, lens=None, cache=2, bs=64, seq_len=72, num_workers=0, shuffle=False, verbose=False, do_setup=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, wif=None, before_iter=None, after_item=None, before_batch=None, after_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None) :: TfmdDL

Transformed DataLoader

bs,sl = 4,3
ints = L([0,1,2,3,4],[5,6,7,8,9,10],[11,12,13,14,15,16,17,18],[19,20],[21,22,23],[24]).map(tensor)
dl = LMDataLoader(ints, bs=bs, seq_len=sl)
test_eq(list(dl),
    [[tensor([[0, 1, 2], [6, 7, 8], [12, 13, 14], [18, 19, 20]]),
      tensor([[1, 2, 3], [7, 8, 9], [13, 14, 15], [19, 20, 21]])],
     [tensor([[3, 4, 5], [ 9, 10, 11], [15, 16, 17], [21, 22, 23]]),
      tensor([[4, 5, 6], [10, 11, 12], [16, 17, 18], [22, 23, 24]])]])
dl = LMDataLoader(ints, bs=bs, seq_len=sl, shuffle=True)
for x,y in dl: test_eq(x[:,1:], y[:,:-1])
((x0,y0), (x1,y1)) = tuple(dl)
#Second batch begins where first batch ended
test_eq(y0[:,-1], x1[:,0]) 
test_eq(type(x0), LMTensorText)
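The batches tested above follow the usual language-model layout: all the texts are concatenated into one stream, the stream is split into bs parallel rows, and the targets are the inputs shifted one token to the right, so the last target of one batch is the first input of the next. A minimal sketch with plain lists (not the fastai implementation, which also handles shuffling, length caching, and tensor types):

```python
def lm_batches(stream, bs, seq_len):
    """Yield (x, y) language-model batches from one long token stream."""
    # Usable tokens per row: the largest multiple of seq_len such that
    # every row can also read one lookahead token for its targets
    n = (len(stream) - 1) // (bs * seq_len) * seq_len
    # Each row reads a contiguous slice, overlapping its neighbor by one
    # token so the shifted targets line up across rows
    rows = [stream[i*n:(i+1)*n + 1] for i in range(bs)]
    for start in range(0, n, seq_len):
        x = [r[start:start+seq_len] for r in rows]
        y = [r[start+1:start+seq_len+1] for r in rows]
        yield x, y
```

With the 25 integers 0..24, bs=4 and seq_len=3 this reproduces the two batches tested above.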

Showing

TitledStr.truncate[source]

TitledStr.truncate(n)

Integration example

path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path/'texts.csv')
df.head(2)
label text is_valid
0 negative Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff! False
1 positive This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som... False
splits = ColSplitter()(df)
tfms = [attrgetter('text'), Tokenizer.from_df('text'), Numericalize()]
dsets = Datasets(df, [tfms], splits=splits, dl_type=LMDataLoader)
dls = dsets.dataloaders(bs=16, seq_len=72)
dls.show_batch(max_n=6)
text text_
0 xxbos " look , i know this may suck right now , but pain is xxunk , film is forever . xxmaj whatever you do right now is burned into celluloid for all time and for thousands of years to come . " – xxmaj robert xxmaj de xxmaj niro \n\n xxmaj this was initially a film for xxmaj steven xxmaj spielberg , the director hiring several screenwriters to xxunk the screenplay " look , i know this may suck right now , but pain is xxunk , film is forever . xxmaj whatever you do right now is burned into celluloid for all time and for thousands of years to come . " – xxmaj robert xxmaj de xxmaj niro \n\n xxmaj this was initially a film for xxmaj steven xxmaj spielberg , the director hiring several screenwriters to xxunk the screenplay so
1 to kill another human being , as xxmaj hitchcock had demonstrated in xxmaj torn xxmaj xxunk . xxmaj but that scene leads to no place of any importance . \n\n xxmaj some people might enjoy this , especially those young enough to think that pain and death are things that happen only in movies . xxmaj some xxunk stuff on screen here . xxbos xxmaj well , what can i say . kill another human being , as xxmaj hitchcock had demonstrated in xxmaj torn xxmaj xxunk . xxmaj but that scene leads to no place of any importance . \n\n xxmaj some people might enjoy this , especially those young enough to think that pain and death are things that happen only in movies . xxmaj some xxunk stuff on screen here . xxbos xxmaj well , what can i say . \n\n
2 of the most striking xxunk women xxunk ; her acting xxunk her beauty . ( the love scenes between these two xxunk better than words how little the age difference matters to them ! ) xxmaj each of the supporting characters is sharply drawn and excellently portrayed as well . xxmaj the mix of xxunk dialog and passionate excess makes this a delightful xxunk . xxmaj as xxmaj russell xxmaj baker notes the most striking xxunk women xxunk ; her acting xxunk her beauty . ( the love scenes between these two xxunk better than words how little the age difference matters to them ! ) xxmaj each of the supporting characters is sharply drawn and excellently portrayed as well . xxmaj the mix of xxunk dialog and passionate excess makes this a delightful xxunk . xxmaj as xxmaj russell xxmaj baker notes in
3 came out and having served in xxmaj xxunk he had great admiration for the man . xxmaj the disappointing thing about this film is that it only concentrate on a short period of the man 's life - interestingly enough the man 's entire life would have made such an epic bio - xxunk that it is staggering to imagine the cost for production . \n\n xxmaj some posters xxunk to the out and having served in xxmaj xxunk he had great admiration for the man . xxmaj the disappointing thing about this film is that it only concentrate on a short period of the man 's life - interestingly enough the man 's entire life would have made such an epic bio - xxunk that it is staggering to imagine the cost for production . \n\n xxmaj some posters xxunk to the flawed
4 the xxmaj blob is one of those films that probably sounds good on paper & is well known as being a ' classic ' but is in actual fact a huge disappointment when finally seen . xxmaj this is one case when the remake xxmaj the xxmaj blob ( xxunk ) is definitely better than the original . xxmaj the original xxmaj blob is slow & boring & the remake is n't xxmaj blob is one of those films that probably sounds good on paper & is well known as being a ' classic ' but is in actual fact a huge disappointment when finally seen . xxmaj this is one case when the remake xxmaj the xxmaj blob ( xxunk ) is definitely better than the original . xxmaj the original xxmaj blob is slow & boring & the remake is n't ,
5 rocks ! \n\n xxmaj now if you still have n't gotten it . xxmaj i 'm being xxunk . 1 / 10 xxbos i grew up xxmaj baptist and i know the story this movie is trying to tell , although i no longer believe the story . xxmaj i 'll give the movie kudos for being as good as the average xxmaj lifetime xxmaj movie of the xxmaj week . xxmaj ! \n\n xxmaj now if you still have n't gotten it . xxmaj i 'm being xxunk . 1 / 10 xxbos i grew up xxmaj baptist and i know the story this movie is trying to tell , although i no longer believe the story . xxmaj i 'll give the movie kudos for being as good as the average xxmaj lifetime xxmaj movie of the xxmaj week . xxmaj mildly
x,y = dls.one_batch()
test_eq(type(x), LMTensorText)
test_eq(len(dls.valid_ds[0][0]), dls.valid.lens[0])

Classification

pad_input[source]

pad_input(samples, pad_idx=1, pad_fields=0, pad_first=False, backwards=False)

Function that collects samples and adds padding. Flips token order if needed

test_eq(pad_input([(tensor([1,2,3]),1), (tensor([4,5]), 2), (tensor([6]), 3)], pad_idx=0), 
        [(tensor([1,2,3]),1), (tensor([4,5,0]),2), (tensor([6,0,0]), 3)])
test_eq(pad_input([(tensor([1,2,3]), (tensor([6]))), (tensor([4,5]), tensor([4,5])), (tensor([6]), (tensor([1,2,3])))], pad_idx=0, pad_fields=1), 
        [(tensor([1,2,3]),(tensor([6,0,0]))), (tensor([4,5]),tensor([4,5,0])), ((tensor([6]),tensor([1, 2, 3])))])
test_eq(pad_input([(tensor([1,2,3]),1), (tensor([4,5]), 2), (tensor([6]), 3)], pad_idx=0, pad_first=True), 
        [(tensor([1,2,3]),1), (tensor([0,4,5]),2), (tensor([0,0,6]), 3)])
test_eq(pad_input([(tensor([1,2,3]),1), (tensor([4,5]), 2), (tensor([6]), 3)], pad_idx=0, backwards=True), 
        [(tensor([3,2,1]),1), (tensor([5,4,0]),2), (tensor([6,0,0]), 3)])
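The tests above fully describe pad_input's behavior: pad the chosen field of each sample to the length of the longest one, in front or at the back, optionally reversing the tokens first. A minimal list-based sketch (not the fastai implementation; it pads only the first field):

```python
def pad_input_sketch(samples, pad_idx=1, pad_first=False, backwards=False):
    """Minimal sketch of pad_input on the first field of each sample."""
    max_len = max(len(s[0]) for s in samples)
    out = []
    for x, *rest in samples:
        x = list(x)
        if backwards:                     # flip token order if requested
            x = x[::-1]
        pad = [pad_idx] * (max_len - len(x))
        x = pad + x if pad_first else x + pad
        out.append((x, *rest))
    return out
```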

pad_input_chunk[source]

pad_input_chunk(samples, pad_idx=1, pad_first=True, seq_len=72)

Pad samples by adding padding by chunks of size seq_len

test_eq(pad_input_chunk([(tensor([1,2,3,4,5,6]),1), (tensor([1,2,3]), 2), (tensor([1,2]), 3)], pad_idx=0, seq_len=2), 
        [(tensor([1,2,3,4,5,6]),1), (tensor([0,0,1,2,3,0]),2), (tensor([0,0,0,0,1,2]), 3)])
test_eq(pad_input_chunk([(tensor([1,2,3,4,5,6]),), (tensor([1,2,3]),), (tensor([1,2]),)], pad_idx=0, seq_len=2), 
        [(tensor([1,2,3,4,5,6]),), (tensor([0,0,1,2,3,0]),), (tensor([0,0,0,0,1,2]),)])
test_eq(pad_input_chunk([(tensor([1,2,3,4,5,6]),), (tensor([1,2,3]),), (tensor([1,2]),)], pad_idx=0, seq_len=2, pad_first=False), 
        [(tensor([1,2,3,4,5,6]),), (tensor([1,2,3,0,0,0]),), (tensor([1,2,0,0,0,0]),)])
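As the tests above show, when pad_first=True the front padding comes in whole seq_len-sized chunks and any remainder goes at the end (so [1,2,3] padded to length 6 with seq_len=2 becomes [0,0,1,2,3,0]); with pad_first=False everything is padded at the back. A minimal sketch of that rule, not the fastai implementation:

```python
def pad_input_chunk_sketch(samples, pad_idx=1, pad_first=True, seq_len=72):
    """Minimal sketch of pad_input_chunk: chunk-aligned front padding."""
    max_len = max(len(s[0]) for s in samples)

    def pad_one(x):
        x = list(x)
        if not pad_first:
            return x + [pad_idx] * (max_len - len(x))
        # The remainder that doesn't fill a whole seq_len chunk goes at
        # the end; the rest of the padding goes in front
        end = (max_len - len(x)) % seq_len
        front = max_len - len(x) - end
        return [pad_idx] * front + x + [pad_idx] * end

    return [(pad_one(s[0]), *s[1:]) for s in samples]
```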

class SortedDL[source]

SortedDL(dataset, sort_func=None, res=None, bs=64, shuffle=False, num_workers=None, verbose=False, do_setup=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, wif=None, before_iter=None, after_item=None, before_batch=None, after_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None) :: TfmdDL

Transformed DataLoader

ds = [(tensor([1,2]),1), (tensor([3,4,5,6]),2), (tensor([7]),3), (tensor([8,9,10]),4)]
dl = SortedDL(ds, bs=2, before_batch=partial(pad_input, pad_idx=0))
test_eq(list(dl), [(tensor([[ 3,  4,  5,  6], [ 8,  9, 10,  0]]), tensor([2, 4])), 
                   (tensor([[1, 2], [7, 0]]), tensor([1, 3]))])
ds = [(tensor(range(random.randint(1,10))),i) for i in range(101)]
dl = SortedDL(ds, bs=2, before_batch=partial(pad_input, pad_idx=-1), shuffle=True, num_workers=0)
batches = list(dl)
max_len = len(batches[0][0])
for b in batches: 
    assert len(b[0]) <= max_len
    test_ne(b[0][-1], -1)
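The point of SortedDL is that grouping samples of similar length into the same batch (longest first) minimizes how much padding pad_input has to add per batch. The grouping idea can be sketched in a few lines of plain Python (a sketch only; the real SortedDL also shuffles within length buckets and streams batches lazily):

```python
def sorted_batches(ds, bs):
    """Group samples into batches of similar length, longest first."""
    # Sort sample indices by the length of the first field, descending
    order = sorted(range(len(ds)), key=lambda i: len(ds[i][0]), reverse=True)
    return [[ds[i] for i in order[b:b + bs]] for b in range(0, len(order), bs)]
```

On the first dataset above this groups the length-4 and length-3 samples together, matching the batch contents in the first SortedDL test.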
splits = RandomSplitter()(range_of(df))
dsets = Datasets(df, splits=splits, tfms=[tfms, [attrgetter("label"), Categorize()]], dl_type=SortedDL)
dls = dsets.dataloaders(before_batch=pad_input)
dls.show_batch(max_n=2)
text category
0 xxbos xxup the xxup shop xxup around xxup the xxup corner is one of the xxunk and most feel - good romantic comedies ever made . xxmaj there 's just no getting around that , and it 's hard to actually put one 's feeling for this film into words . xxmaj it 's not one of those films that tries too hard , nor does it come up with the xxunk possible scenarios to get the two protagonists together in the end . xxmaj in fact , all its charm is xxunk , contained within the characters and the setting and the plot … which is highly believable to xxunk . xxmaj it 's easy to think that such a love story , as beautiful as any other ever told , * could * happen to you … a feeling you do n't often get from other romantic comedies positive
1 xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj xxunk , after xxunk ) , i can xxunk join both xxunk of " at xxmaj the xxmaj movies " in taking xxmaj steven xxmaj soderbergh to task . \n\n xxmaj it 's usually satisfying to watch a film director change his style / subject , but xxmaj soderbergh 's most recent stinker , xxmaj the xxmaj girlfriend xxmaj xxunk ) , was also missing a story , so narrative ( and editing ? ) seem to suddenly be xxmaj soderbergh 's main challenge . xxmaj strange , after xxunk years in the business . xxmaj he was probably never much good at narrative , just xxunk it well inside " edgy " projects . \n\n xxmaj none of this excuses him this present , almost diabolical negative

TransformBlock for text

class TextBlock[source]

TextBlock(tok_tfm, vocab=None, is_lm=False, seq_len=72, min_freq=3, max_vocab=60000, sep=' ') :: TransformBlock

A basic wrapper that links default transforms for the data block API

class TextDataLoaders[source]

TextDataLoaders(*loaders, path='.', device=None) :: DataLoaders

Basic wrapper around several DataLoaders.