Functions and transforms to help gather text data in a Datasets object

## Numericalizing

Numericalization is the step in which we convert tokens to integers. The first step is to build a token-to-index mapping, called a vocab.

#### make_vocab

make_vocab(count, min_freq=3, max_vocab=60000)

Create a vocab of max_vocab size from Counter count, keeping the items that appear at least min_freq times

If there are more than max_vocab tokens, the ones kept are the most frequent.

count = Counter(['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'd'])
test_eq(set([x for x in make_vocab(count) if not x.startswith('xxfake')]),
set(defaults.text_spec_tok + 'a'.split()))
test_eq(len(make_vocab(count))%8, 0)
test_eq(set([x for x in make_vocab(count, min_freq=1) if not x.startswith('xxfake')]),
set(defaults.text_spec_tok + 'a b c d'.split()))
test_eq(set([x for x in make_vocab(count,max_vocab=12, min_freq=1) if not x.startswith('xxfake')]),
set(defaults.text_spec_tok + 'a b c'.split()))
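
The selection rule exercised by these tests can be sketched in plain Python. This is a minimal illustration, not the library's implementation: make_vocab_sketch and SPECIAL_TOKENS are hypothetical names, the token list is an assumed stand-in for defaults.text_spec_tok, and the padding loop is only one way to reach a size that is a multiple of 8.

```python
from collections import Counter

# Assumed stand-in for defaults.text_spec_tok (hypothetical list for this sketch).
SPECIAL_TOKENS = ["xxunk", "xxpad", "xxbos", "xxeos", "xxfld",
                  "xxrep", "xxwrep", "xxup", "xxmaj"]

def make_vocab_sketch(count, min_freq=3, max_vocab=60000):
    # Keep the most frequent tokens that occur at least min_freq times.
    vocab = [tok for tok, c in count.most_common(max_vocab) if c >= min_freq]
    # Special tokens always come first, without duplicates.
    for tok in reversed(SPECIAL_TOKENS):
        if tok in vocab:
            vocab.remove(tok)
        vocab.insert(0, tok)
    # Truncate to max_vocab, so the least frequent tokens are dropped first.
    vocab = vocab[:max_vocab]
    # Pad with placeholder tokens until the size is a multiple of 8.
    while len(vocab) % 8 != 0:
        vocab.append("xxfake")
    return vocab

count = Counter(["a", "a", "a", "a", "b", "b", "c", "c", "d"])
vocab = make_vocab_sketch(count, min_freq=1, max_vocab=12)
print([tok for tok in vocab if tok != "xxfake"])
```

With max_vocab=12 and nine special tokens, only three regular tokens fit, so 'd' (the least frequent) is the one left out.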

### class TensorText

TensorText(x, **kwargs) :: TensorBase

Semantic type for a tensor representing text

### class LMTensorText

LMTensorText(x, **kwargs) :: TensorText

Semantic type for a tensor representing text in language modeling

### class Numericalize

Numericalize(vocab=None, min_freq=3, max_vocab=60000) :: Transform

Reversible transform of tokenized texts to numericalized ids

If no vocab is passed, one is created at setup from the data, using make_vocab with min_freq and max_vocab.

start = 'This is an example of text'
num = Numericalize(min_freq=1)
num.setup(L(start.split(), 'this is another text'.split()))
test_eq(set([x for x in num.vocab if not x.startswith('xxfake')]),
set(defaults.text_spec_tok + 'This is an example of text this another'.split()))
test_eq(len(num.vocab)%8, 0)
t = num(start.split())

test_eq(t, tensor([11, 9, 12, 13, 14, 10]))
test_eq(num.decode(t), start.split())
num = Numericalize(min_freq=2)
num.setup(L('This is an example of text'.split(), 'this is another text'.split()))
test_eq(set([x for x in num.vocab if not x.startswith('xxfake')]),
set(defaults.text_spec_tok + 'is text'.split()))
test_eq(len(num.vocab)%8, 0)
t = num(start.split())
test_eq(t, tensor([0, 9, 0, 0, 0, 10]))
test_eq(num.decode(t), f'{UNK} is {UNK} {UNK} {UNK} text'.split())
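
The round trip shown in these tests boils down to a token-to-index lookup with a fallback for unknown tokens. The following is a minimal sketch using plain lists instead of tensors; NumericalizeSketch and the toy vocab are hypothetical and only illustrate the idea that out-of-vocab tokens map to index 0.

```python
class NumericalizeSketch:
    """Hypothetical, list-based stand-in for a reversible token <-> id mapping."""

    def __init__(self, vocab):
        self.vocab = list(vocab)
        # Map each token to its position; the first entry wins for repeated tokens.
        self.o2i = {}
        for i, tok in enumerate(self.vocab):
            self.o2i.setdefault(tok, i)

    def encode(self, tokens):
        # Tokens missing from the vocab fall back to index 0 (the unknown token).
        return [self.o2i.get(tok, 0) for tok in tokens]

    def decode(self, ids):
        return [self.vocab[i] for i in ids]

vocab = ["xxunk", "xxpad", "is", "text"]  # toy vocab, assumed for illustration
num = NumericalizeSketch(vocab)
ids = num.encode("This is an example of text".split())
print(ids)              # [0, 2, 0, 0, 0, 3]
print(num.decode(ids))  # ['xxunk', 'is', 'xxunk', 'xxunk', 'xxunk', 'text']
```

Decoding is only lossless for tokens that are in the vocab; everything else comes back as the unknown token.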

## class LMDataLoader

LMDataLoader(dataset, lens=None, cache=2, bs=64, seq_len=72, num_workers=0, shuffle=False, verbose=False, do_setup=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, wif=None, before_iter=None, after_item=None, before_batch=None, after_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None) :: TfmdDL

A DataLoader suitable for language modeling

dataset should be a collection of numericalized texts for this to work. lens can be passed to speed up creation; otherwise, the LMDataLoader will do a full pass over the dataset to compute them. cache is used to avoid reloading items unnecessarily.

The LMDataLoader will concatenate all texts (maybe shuffled) into one big stream, split it into bs contiguous sequences, then go through those seq_len tokens at a time.

bs,sl = 4,3
ints = L([0,1,2,3,4],[5,6,7,8,9,10],[11,12,13,14,15,16,17,18],[19,20],[21,22,23],[24]).map(tensor)
dl = LMDataLoader(ints, bs=bs, seq_len=sl)
test_eq(list(dl),
[[tensor([[0, 1, 2], [6, 7, 8], [12, 13, 14], [18, 19, 20]]),
tensor([[1, 2, 3], [7, 8, 9], [13, 14, 15], [19, 20, 21]])],
[tensor([[3, 4, 5], [ 9, 10, 11], [15, 16, 17], [21, 22, 23]]),
tensor([[4, 5, 6], [10, 11, 12], [16, 17, 18], [22, 23, 24]])]])
dl = LMDataLoader(ints, bs=bs, seq_len=sl, shuffle=True)
for x,y in dl: test_eq(x[:,1:], y[:,:-1])
((x0,y0), (x1,y1)) = tuple(dl)
#Second batch begins where first batch ended
test_eq(y0[:,-1], x1[:,0])
test_eq(type(x0), LMTensorText)
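
The batching scheme can be sketched without tensors. This is a simplified illustration under stated assumptions: lm_batches_sketch is a hypothetical helper, it works on plain lists, and it ignores shuffling and caching. It reproduces the unshuffled batches from the first test above.

```python
def lm_batches_sketch(texts, bs, seq_len):
    # Concatenate all texts into one stream of token ids.
    stream = [tok for text in texts for tok in text]
    # Targets are the inputs shifted by one, so one token is held back;
    # the remainder is trimmed to a multiple of bs.
    usable = (len(stream) - 1) // bs * bs
    row_len = usable // bs  # number of tokens each of the bs rows walks through
    batches = []
    for start in range(0, row_len, seq_len):
        end = min(start + seq_len, row_len)
        x = [stream[r * row_len + start: r * row_len + end] for r in range(bs)]
        y = [stream[r * row_len + start + 1: r * row_len + end + 1] for r in range(bs)]
        batches.append((x, y))
    return batches

texts = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9, 10], [11, 12, 13, 14, 15, 16, 17, 18],
         [19, 20], [21, 22, 23], [24]]
for x, y in lm_batches_sketch(texts, bs=4, seq_len=3):
    print(x, y)
```

Each row continues in the next batch exactly where it stopped, which is what lets a recurrent model carry its hidden state from one batch to the next.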

## Classification

For classification, we deal with the fact that texts don't all have the same length by using padding.

#### pad_input

pad_input(samples, pad_idx=1, pad_fields=0, pad_first=False, backwards=False)

Function that collects samples and adds padding

pad_idx is the value used for padding, and the padding is applied to the pad_fields of the samples. The padding is applied at the beginning if pad_first is True, and if backwards is True, the tensors are flipped.

test_eq(pad_input([(tensor([1,2,3]),1), (tensor([4,5]), 2), (tensor([6]), 3)], pad_idx=0),
[(tensor([1,2,3]),1), (tensor([4,5,0]),2), (tensor([6,0,0]), 3)])
test_eq(pad_input([(tensor([1,2,3]), (tensor([6]))), (tensor([4,5]), tensor([4,5])), (tensor([6]), (tensor([1,2,3])))], pad_idx=0, pad_fields=1),
[(tensor([1,2,3]),(tensor([6,0,0]))), (tensor([4,5]),tensor([4,5,0])), ((tensor([6]),tensor([1, 2, 3])))])
test_eq(pad_input([(tensor([1,2,3]),1), (tensor([4,5]), 2), (tensor([6]), 3)], pad_idx=0, pad_first=True),
[(tensor([1,2,3]),1), (tensor([0,4,5]),2), (tensor([0,0,6]), 3)])
test_eq(pad_input([(tensor([1,2,3]),1), (tensor([4,5]), 2), (tensor([6]), 3)], pad_idx=0, backwards=True),
[(tensor([3,2,1]),1), (tensor([5,4,0]),2), (tensor([6,0,0]), 3)])
x = test_eq(pad_input([(tensor([1,2,3]),1), (tensor([4,5]), 2), (tensor([6]), 3)], pad_idx=0, backwards=True),
[(tensor([3,2,1]),1), (tensor([5,4,0]),2), (tensor([6,0,0]), 3)])
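
The padding behaviour in these tests can be mimicked on plain lists. This is a minimal sketch, assuming a single field of token lists; pad_sequences_sketch is a hypothetical helper and does not handle pad_fields or tuple samples.

```python
def pad_sequences_sketch(seqs, pad_idx=1, pad_first=False, backwards=False):
    # Every sequence is padded to the length of the longest one.
    max_len = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        # Flip the tokens first, so the padding stays on the requested side.
        s = list(reversed(s)) if backwards else list(s)
        pad = [pad_idx] * (max_len - len(s))
        padded.append(pad + s if pad_first else s + pad)
    return padded

seqs = [[1, 2, 3], [4, 5], [6]]
print(pad_sequences_sketch(seqs, pad_idx=0))                  # [[1, 2, 3], [4, 5, 0], [6, 0, 0]]
print(pad_sequences_sketch(seqs, pad_idx=0, pad_first=True))  # [[1, 2, 3], [0, 4, 5], [0, 0, 6]]
print(pad_sequences_sketch(seqs, pad_idx=0, backwards=True))  # [[3, 2, 1], [5, 4, 0], [6, 0, 0]]
```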

#### pad_input_chunk

pad_input_chunk(samples, pad_idx=1, pad_first=True, seq_len=72)

Pad samples by adding padding by chunks of size seq_len

The difference with the base pad_input is that most of the padding is applied at the beginning (if pad_first=True) or at the end (if pad_first=False), but only in a round multiple of seq_len. The rest of the padding is applied at the end in both cases, as the pad_first=False example below shows. This is to work with SequenceEncoder with recurrent models.

test_eq(pad_input_chunk([(tensor([1,2,3,4,5,6]),1), (tensor([1,2,3]), 2), (tensor([1,2]), 3)], pad_idx=0, seq_len=2),
[(tensor([1,2,3,4,5,6]),1), (tensor([0,0,1,2,3,0]),2), (tensor([0,0,0,0,1,2]), 3)])
test_eq(pad_input_chunk([(tensor([1,2,3,4,5,6]),), (tensor([1,2,3]),), (tensor([1,2]),)], pad_idx=0, seq_len=2),
[(tensor([1,2,3,4,5,6]),), (tensor([0,0,1,2,3,0]),), (tensor([0,0,0,0,1,2]),)])
test_eq(pad_input_chunk([(tensor([1,2,3,4,5,6]),), (tensor([1,2,3]),), (tensor([1,2]),)], pad_idx=0, seq_len=2, pad_first=False),
[(tensor([1,2,3,4,5,6]),), (tensor([1,2,3,0,0,0]),), (tensor([1,2,0,0,0,0]),)])
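
The chunked padding of a single sequence can be sketched as follows. This is an illustration on plain lists, not the library's code: pad_chunk_sketch and its pad_len argument (the target length, here the longest sequence in the batch) are hypothetical names.

```python
def pad_chunk_sketch(seq, pad_len, pad_idx=1, pad_first=True, seq_len=72):
    missing = pad_len - len(seq)
    # Largest multiple of seq_len that fits in the missing length.
    chunk = [pad_idx] * (missing // seq_len * seq_len)
    # Whatever is left over always goes at the end.
    rest = [pad_idx] * (missing % seq_len)
    return chunk + list(seq) + rest if pad_first else list(seq) + chunk + rest

seqs = [[1, 2, 3, 4, 5, 6], [1, 2, 3], [1, 2]]
print([pad_chunk_sketch(s, pad_len=6, pad_idx=0, seq_len=2) for s in seqs])
# [[1, 2, 3, 4, 5, 6], [0, 0, 1, 2, 3, 0], [0, 0, 0, 0, 1, 2]]
print([pad_chunk_sketch(s, pad_len=6, pad_idx=0, seq_len=2, pad_first=False) for s in seqs])
# [[1, 2, 3, 4, 5, 6], [1, 2, 3, 0, 0, 0], [1, 2, 0, 0, 0, 0]]
```

Padding in whole multiples of seq_len means that, when the batch is later processed seq_len tokens at a time, the leading chunks of a short text consist of padding only.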

### class SortedDL

SortedDL(dataset, sort_func=None, res=None, bs=64, shuffle=False, num_workers=None, verbose=False, do_setup=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, wif=None, before_iter=None, after_item=None, before_batch=None, after_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None) :: TfmdDL

A DataLoader that goes through the items in the order given by sort_func

res is the result of sort_func applied on all elements of the dataset. You can pass it, if available, to make the init faster by avoiding an initial pass over the whole dataset. If shuffle is True, the sorted results are shuffled slightly so that batches contain items of roughly the same size, but not in the exact sorted order.

ds = [(tensor([1,2]),1), (tensor([3,4,5,6]),2), (tensor([7]),3), (tensor([8,9,10]),4)]
dl = SortedDL(ds, bs=2, before_batch=partial(pad_input, pad_idx=0))
test_eq(list(dl), [(tensor([[ 3,  4,  5,  6], [ 8,  9, 10,  0]]), tensor([2, 4])),
(tensor([[1, 2], [7, 0]]), tensor([1, 3]))])
ds = [(tensor(range(random.randint(1,10))),i) for i in range(101)]
dl = SortedDL(ds, bs=2, create_batch=partial(pad_input, pad_idx=-1), shuffle=True, num_workers=0)
batches = list(dl)
max_len = len(batches[0][0])
for b in batches:
    assert(len(b[0])) <= max_len
    test_ne(b[0][-1], -1)
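
The reason for sorting is that items of similar length end up in the same batch, so little padding is needed. Here is a minimal sketch of the unshuffled case, assuming items are sorted by length in descending order as in the first example above; sorted_batches_sketch is a hypothetical helper working on plain lists.

```python
def sorted_batches_sketch(items, bs, sort_func=len):
    # Indices of the items, longest first.
    order = sorted(range(len(items)), key=lambda i: sort_func(items[i]), reverse=True)
    # Consecutive slices of the sorted order form the batches.
    return [[items[i] for i in order[s:s + bs]] for s in range(0, len(order), bs)]

items = [[1, 2], [3, 4, 5, 6], [7], [8, 9, 10]]
print(sorted_batches_sketch(items, bs=2))
# [[[3, 4, 5, 6], [8, 9, 10]], [[1, 2], [7]]]
```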

## TransformBlock for text

To use the data block API, you will need this building block for texts.

### class TextBlock

TextBlock(tok_tfm, vocab=None, is_lm=False, seq_len=72, min_freq=3, max_vocab=60000) :: TransformBlock

A TransformBlock for texts

For efficient tokenization, you probably want to use one of the factory methods. Otherwise, you can pass a custom tok_tfm that will deal with tokenization (if your texts are already tokenized, you can pass noop), and either pass a vocab or leave it to be inferred from the texts using min_freq and max_vocab.

is_lm indicates whether the texts are used for language modeling or for another task. seq_len only needs tuning if is_lm=False, in which case it is passed along to pad_input_chunk.

#### TextBlock.from_df

TextBlock.from_df(text_cols, vocab=None, is_lm=False, seq_len=72, min_freq=3, max_vocab=60000, tok_func='SpacyTokenizer', rules=None, sep=' ', n_workers=8, mark_fields=None, res_col_name='text', **kwargs)

Build a TextBlock from a dataframe using text_cols

vocab, is_lm, seq_len, min_freq and max_vocab are passed to the main init; the other arguments are passed to Tokenizer.from_df.

#### TextBlock.from_folder

TextBlock.from_folder(path, vocab=None, is_lm=False, seq_len=72, min_freq=3, max_vocab=60000, tok_func='SpacyTokenizer', rules=None, extensions=None, folders=None, output_dir=None, n_workers=8, encoding='utf8', **kwargs)

Build a TextBlock from a path

vocab, is_lm, seq_len, min_freq and max_vocab are passed to the main init; the other arguments are passed to Tokenizer.from_folder.

## class TextDataLoaders

TextDataLoaders(*loaders, path='.', device=None) :: DataLoaders

Basic wrapper around several DataLoaders with factory methods for NLP problems

You should not use the init directly but one of the following factory methods. All those factory methods accept as arguments:

• text_vocab: the vocabulary used for numericalizing texts (if not passed, it's inferred from the data)
• tok_tfm: if passed, uses this tok_tfm instead of the default
• seq_len: the sequence length used for batches
• bs: the batch size
• val_bs: the batch size for the validation DataLoader (defaults to bs)
• shuffle_train: whether to shuffle the training DataLoader
• device: the PyTorch device to use (defaults to default_device())

#### TextDataLoaders.from_folder

TextDataLoaders.from_folder(path, train='train', valid='valid', valid_pct=None, seed=None, vocab=None, text_vocab=None, is_lm=False, tok_tfm=None, seq_len=72, bs=64, val_bs=None, shuffle_train=True, device=None)

Create from an ImageNet-style dataset in path with train and valid subfolders (or provide valid_pct)

If valid_pct is provided, a random split is performed (with an optional seed) by setting aside that percentage of the data for the validation set (instead of looking at the grandparent folder names). If a vocab is passed, only the folders with names in vocab are kept.
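
The idea of a seeded random split can be sketched as follows. This is an illustration in plain Python, not the library's splitter: random_split_sketch and the file names are hypothetical.

```python
import random

def random_split_sketch(items, valid_pct=0.2, seed=None):
    # A dedicated generator makes the split reproducible for a given seed.
    rng = random.Random(seed)
    idxs = list(range(len(items)))
    rng.shuffle(idxs)
    # The first valid_pct of the shuffled indices form the validation set.
    cut = int(valid_pct * len(items))
    valid = [items[i] for i in idxs[:cut]]
    train = [items[i] for i in idxs[cut:]]
    return train, valid

files = [f"review_{i}.txt" for i in range(10)]  # hypothetical file names
train, valid = random_split_sketch(files, valid_pct=0.2, seed=42)
print(len(train), len(valid))  # 8 2
```

Passing the same seed again yields the same split, which is what makes results comparable across runs.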

Here is an example on a sample of the IMDB movie review dataset:

#slow
path = untar_data(URLs.IMDB)
dls = TextDataLoaders.from_folder(path)
dls.show_batch(max_n=3)

| | text | category |
|---|---|---|
| 0 | xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero | pos |
| 1 | xxbos xxmaj some have praised _ xxunk _ as a xxmaj disney adventure for adults . i do n't think so -- at least not for thinking adults . \n\n xxmaj this script suggests a beginning as a live - action movie , that struck someone as the type of crap you can not sell to adults anymore . xxmaj the " crack staff " of many older adventure movies has been done well before , ( think _ the xxmaj dirty xxmaj dozen _ ) but _ atlantis _ represents one of the worse films in that motif . xxmaj the characters are weak . xxmaj even the background that each member trots out seems stock and awkward at best . xxmaj an xxup md / xxmaj medicine xxmaj man , a tomboy mechanic whose father always wanted sons , if we have not at least seen these before | neg |
| 2 | xxbos i thought that xxup rotj was clearly the best out of the three xxmaj star xxmaj wars movies . i find it surprising that xxup rotj is considered the weakest installment in the xxmaj trilogy by many who have voted . xxmaj to me it seemed like xxup rotj was the best because it had the most profound plot , the most suspense , surprises , most xxunk the ending ) and definitely the most episodic movie . i personally like the xxmaj empire xxmaj strikes xxmaj back a lot also but i think it is slightly less good than than xxup rotj since it was slower - moving , was not as episodic , and i just did not feel as much suspense or emotion as i did with the third movie . \n\n xxmaj it also seems like to me that after reading these surprising reviews that | pos |

#### TextDataLoaders.from_df

TextDataLoaders.from_df(df, path='.', valid_pct=0.2, seed=None, text_col=0, label_col=1, label_delim=None, y_block=None, text_vocab=None, is_lm=False, valid_col=None, tok_tfm=None, seq_len=72, bs=64, val_bs=None, shuffle_train=True, device=None)

Create from df in path with valid_pct

seed can optionally be passed for reproducibility. text_col, label_col and optionally valid_col are indices or names of columns for texts/labels and the validation flag. label_delim can be passed for a multi-label problem if your labels are in one column, separated by a particular character. y_block should be passed to indicate your type of targets, in case the library did not infer it properly.

Here are examples on subsets of IMDB:

path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path/'texts.csv')
df.head(2)

| | label | text | is_valid |
|---|---|---|---|
| 0 | negative | Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff! | False |
| 1 | positive | This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som... | False |

dls = TextDataLoaders.from_df(df, path=path, text_col='text', label_col='label', valid_col='is_valid')
dls.show_batch(max_n=3)

| | text | category |
|---|---|---|
| 0 | xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review \n\n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , xxunk bowl of xxunk . xxmaj it 's warm and gooey , but you 're not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj victor xxmaj vargas became i was always aware that something did n't quite feel right . xxmaj victor xxmaj vargas suffers from a certain xxunk on the director 's part . xxmaj apparently , the director thought that the ethnic backdrop of a xxmaj latino family on the lower east side , and an xxunk storyline would make the film critic proof . xxmaj he was right , but it did n't fool me . xxmaj raising xxmaj victor xxmaj vargas is | negative |
| 1 | xxbos xxup the xxup shop xxup around xxup the xxup corner is one of the xxunk and most feel - good romantic comedies ever made . xxmaj there 's just no getting around that , and it 's hard to actually put one 's feeling for this film into words . xxmaj it 's not one of those films that tries too hard , nor does it come up with the xxunk possible scenarios to get the two protagonists together in the end . xxmaj in fact , all its charm is xxunk , contained within the characters and the setting and the plot … which is highly believable to xxunk . xxmaj it 's easy to think that such a love story , as beautiful as any other ever told , * could * happen to you … a feeling you do n't often get from other romantic comedies | positive |
| 2 | xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj xxunk , after xxunk ) , i can xxunk join both xxunk of " at xxmaj the xxmaj movies " in taking xxmaj steven xxmaj soderbergh to task . \n\n xxmaj it 's usually satisfying to watch a film director change his style / subject , but xxmaj soderbergh 's most recent stinker , xxmaj the xxmaj girlfriend xxmaj xxunk ) , was also missing a story , so narrative ( and editing ? ) seem to suddenly be xxmaj soderbergh 's main challenge . xxmaj strange , after xxunk years in the business . xxmaj he was probably never much good at narrative , just xxunk it well inside " edgy " projects . \n\n xxmaj none of this excuses him this present , almost diabolical | negative |

dls = TextDataLoaders.from_df(df, path=path, text_col='text', is_lm=True, valid_col='is_valid')
dls.show_batch(max_n=3)

| | text | text_ |
|---|---|---|
| 0 | xxbos i remember when i first saw this movie , i was in xxunk grade when it happened . xxmaj before i saw this , i had xxunk to the original xxmaj broadway recording of it , and i really loved it ! xxmaj but when i saw this , i was like , what the heck ? ! xxmaj this movie is missing a lot of the songs from the musical | i remember when i first saw this movie , i was in xxunk grade when it happened . xxmaj before i saw this , i had xxunk to the original xxmaj broadway recording of it , and i really loved it ! xxmaj but when i saw this , i was like , what the heck ? ! xxmaj this movie is missing a lot of the songs from the musical for |
| 1 | to see when it would be shown . i would usually find it playing on a xxmaj saturday afternoon . i only found the movie in xxmaj english which took something special away from the film and have xxunk to find a copy of it in xxmaj spanish . i hold this film dear to my heart and have never suffered from nightmares as others might suggest . xxmaj yes , it | see when it would be shown . i would usually find it playing on a xxmaj saturday afternoon . i only found the movie in xxmaj english which took something special away from the film and have xxunk to find a copy of it in xxmaj spanish . i hold this film dear to my heart and have never suffered from nightmares as others might suggest . xxmaj yes , it is |
| 2 | scott . xxbos i saw this movie on t.v . this afternoon and i ca n't see how anyone can sit through this piece of trash . xxmaj it 's not funny at all and it takes your xxup i.q . down a few xxunk . i know this movie is for kids , but that does n't mean the writers should take their intelligence for granted . i bet that writers | . xxbos i saw this movie on t.v . this afternoon and i ca n't see how anyone can sit through this piece of trash . xxmaj it 's not funny at all and it takes your xxup i.q . down a few xxunk . i know this movie is for kids , but that does n't mean the writers should take their intelligence for granted . i bet that writers were |

#### TextDataLoaders.from_csv

TextDataLoaders.from_csv(path, csv_fname='labels.csv', header='infer', delimiter=None, valid_pct=0.2, seed=None, text_col=0, label_col=1, label_delim=None, y_block=None, text_vocab=None, is_lm=False, valid_col=None, tok_tfm=None, seq_len=72, bs=64, val_bs=None, shuffle_train=True, device=None)

Create from csv file in path/csv_fname

Opens the csv file with header and delimiter, then passes all the other arguments to TextDataLoaders.from_df.

dls = TextDataLoaders.from_csv(path=path, csv_fname='texts.csv', text_col='text', label_col='label', valid_col='is_valid')
dls.show_batch(max_n=3)

| | text | category |
|---|---|---|
| 0 | xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review \n\n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , xxunk bowl of xxunk . xxmaj it 's warm and gooey , but you 're not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj victor xxmaj vargas became i was always aware that something did n't quite feel right . xxmaj victor xxmaj vargas suffers from a certain xxunk on the director 's part . xxmaj apparently , the director thought that the ethnic backdrop of a xxmaj latino family on the lower east side , and an xxunk storyline would make the film critic proof . xxmaj he was right , but it did n't fool me . xxmaj raising xxmaj victor xxmaj vargas is | negative |
| 1 | xxbos xxup the xxup shop xxup around xxup the xxup corner is one of the xxunk and most feel - good romantic comedies ever made . xxmaj there 's just no getting around that , and it 's hard to actually put one 's feeling for this film into words . xxmaj it 's not one of those films that tries too hard , nor does it come up with the xxunk possible scenarios to get the two protagonists together in the end . xxmaj in fact , all its charm is xxunk , contained within the characters and the setting and the plot … which is highly believable to xxunk . xxmaj it 's easy to think that such a love story , as beautiful as any other ever told , * could * happen to you … a feeling you do n't often get from other romantic comedies | positive |
| 2 | xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj xxunk , after xxunk ) , i can xxunk join both xxunk of " at xxmaj the xxmaj movies " in taking xxmaj steven xxmaj soderbergh to task . \n\n xxmaj it 's usually satisfying to watch a film director change his style / subject , but xxmaj soderbergh 's most recent stinker , xxmaj the xxmaj girlfriend xxmaj xxunk ) , was also missing a story , so narrative ( and editing ? ) seem to suddenly be xxmaj soderbergh 's main challenge . xxmaj strange , after xxunk years in the business . xxmaj he was probably never much good at narrative , just xxunk it well inside " edgy " projects . \n\n xxmaj none of this excuses him this present , almost diabolical | negative |