Basic functions to preprocess text before assembling it in a `DataLoaders`.

Preprocessing rules

The following rules are applied to texts before or after they're tokenized.

spec_add_spaces[source]

spec_add_spaces(t)

Add spaces around / and #

test_eq(spec_add_spaces('#fastai'), ' # fastai')
test_eq(spec_add_spaces('/fastai'), ' / fastai')
test_eq(spec_add_spaces('\\fastai'), ' \\ fastai')

rm_useless_spaces[source]

rm_useless_spaces(t)

Remove multiple spaces

test_eq(rm_useless_spaces('a  b   c'), 'a b c')

replace_rep[source]

replace_rep(t)

Replace repetitions at the character level: cccc -- TK_REP 4 c

It starts replacing at 3 repetitions of the same character or more.

test_eq(replace_rep('aa'), 'aa')
test_eq(replace_rep('aaaa'), f' {TK_REP} 4 a ')

replace_wrep[source]

replace_wrep(t)

Replace word repetitions: word word word word -- TK_WREP 4 word

It starts replacing at 3 repetitions of the same word or more.

test_eq(replace_wrep('ah ah'), 'ah ah')
test_eq(replace_wrep('ah ah ah'), f' {TK_WREP} 3 ah ')
test_eq(replace_wrep('ah ah   ah  ah'), f' {TK_WREP} 4 ah ')
test_eq(replace_wrep('ah ah ah ah '), f' {TK_WREP} 4 ah  ')
test_eq(replace_wrep('ah ah ah ah.'), f' {TK_WREP} 4 ah .')
test_eq(replace_wrep('ah ah ahi'), f'ah ah ahi')

fix_html[source]

fix_html(x)

Various messy things we've seen in documents

test_eq(fix_html('#39;bli#146;'), "'bli'")
test_eq(fix_html('Sarah amp; Duck...'), 'Sarah & Duck …')
test_eq(fix_html('a nbsp; #36;'), 'a   $')
test_eq(fix_html('\\" <unk>'), f'" {UNK}')
test_eq(fix_html('quot;  @.@  @-@ '), "' .-")
test_eq(fix_html('<br />text\\n'), '\ntext\n')

replace_all_caps[source]

replace_all_caps(t)

Replace tokens in ALL CAPS by their lower version and add TK_UP before.

test_eq(replace_all_caps("I'M SHOUTING"), f"{TK_UP} i'm {TK_UP} shouting")
test_eq(replace_all_caps("I'm speaking normally"), "I'm speaking normally")
test_eq(replace_all_caps("I am speaking normally"), "i am speaking normally")

replace_maj[source]

replace_maj(t)

Replace tokens in Sentence Case by their lower version and add TK_MAJ before.

test_eq(replace_maj("Jeremy Howard"), f'{TK_MAJ} jeremy {TK_MAJ} howard')
test_eq(replace_maj("I don't think there is any maj here"), "i don't think there is any maj here")

lowercase[source]

lowercase(t, add_bos=True, add_eos=False)

Converts t to lowercase, optionally adding BOS/EOS tokens according to add_bos and add_eos
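No test cell accompanies this rule, so here is a minimal sketch of the behavior; the helper name is illustrative, and the special-token strings are assumed to be the 'xxbos'/'xxeos' values visible in the outputs further down the page:

```python
BOS, EOS = 'xxbos', 'xxeos'  # assumed special-token values

def lowercase_sketch(t, add_bos=True, add_eos=False):
    # Lowercase the text and optionally wrap it with begin/end-of-stream markers.
    return ((f'{BOS} ' if add_bos else '')
            + t.lower().strip()
            + (f' {EOS}' if add_eos else ''))
```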

replace_space[source]

replace_space(t)

Replace embedded spaces in a token with unicode line char to allow for split/join
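A sketch of what this does, assuming the replacement character is the '▁' (U+2581) visible in the TokenizeBatch tests below; the helper name is illustrative:

```python
def replace_space_sketch(t):
    # '▁' stands in for spaces inside a token, so the token stream can
    # later be joined with plain spaces and split back without ambiguity.
    return t.replace(' ', '▁')
```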

Tokenizing

A tokenizer is a class that must implement `__call__`. This method receives a generator of texts and must return a generator with their tokenized versions. Here is the most basic example:

class BaseTokenizer[source]

BaseTokenizer(split_char=' ', **kwargs)

Basic tokenizer that just splits on spaces

tok = BaseTokenizer()
for t in tok(["This is a text"]): test_eq(t, ["This", "is", "a", "text"])
tok = BaseTokenizer('x')
for t in tok(["This is a text"]): test_eq(t, ["This is a te", "t"])
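Any class following the same protocol can be used in its place: called on an iterable of texts, it must yield their token lists. A hypothetical comma-splitting tokenizer, for instance:

```python
class CommaTokenizer:
    "Hypothetical tokenizer following the protocol: texts in, token lists out."
    def __call__(self, items):
        for t in items:
            # Split on commas and drop surrounding whitespace from each piece.
            yield [tok.strip() for tok in t.split(',')]
```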

class SpacyTokenizer[source]

SpacyTokenizer(lang='en', special_toks=None, buf_sz=5000)

Spacy tokenizer for lang

tok = SpacyTokenizer()
inp,exp = "This isn't the easiest text.",["This", "is", "n't", "the", "easiest", "text", "."]
test_eq(L(tok([inp]*5)), [exp]*5)

class TokenizeBatch[source]

TokenizeBatch(tok_func='SpacyTokenizer', rules=None, post_rules=None, **tok_kwargs)

A wrapper around tok_func to apply rules and tokenize in parallel

f = TokenizeBatch()
test_eq(f(["This isn't a problem"]), [[BOS, TK_MAJ, 'this', 'is', "n't", 'a', 'problem']])
f = TokenizeBatch(BaseTokenizer, rules=[], split_char="'")
test_eq(f(["This isn't a problem"]), [['This▁isn', 't▁a▁problem']])

The main function that will be called during one of the processes handling tokenization. It creates an instance of a tokenizer with tok_func and tok_kwargs at init, then iterates through the batch of texts, applying the rules to them before tokenizing.

texts = ["this is a text", "this is another text"]
tok = TokenizeBatch(BaseTokenizer, texts.__getitem__)
test_eq([t for t in tok([0,1])],[['this', 'is', 'a', 'text'], ['this', 'is', 'another', 'text']])

tokenize1[source]

tokenize1(text, tok_func='SpacyTokenizer', rules=None, post_rules=None, **tok_kwargs)

Tokenize one text with an instance of tok_func and some rules

test_eq(tokenize1("This isn't a problem"),
        [BOS, TK_MAJ, 'this', 'is', "n't", 'a', 'problem'])
test_eq(tokenize1("This isn't a problem", BaseTokenizer, rules=[], split_char="'"),
        ['This▁isn', 't▁a▁problem'])

parallel_tokenize[source]

parallel_tokenize(items, tok_func, rules, as_gen=False, n_workers=56, **tok_kwargs)

Calls the optional setup on tok_func before launching TokenizeBatch in parallel
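The docstring above implies a two-step pattern: run the tokenizer's optional setup once (e.g. to train a subword model on the corpus), then fan the actual tokenization out to workers. A minimal sketch of that pattern, with all names hypothetical:

```python
def run_setup_if_any(tok, items):
    # Some tokenizers need one pass over the corpus before workers start;
    # call `setup` only if the tokenizer defines it.
    if hasattr(tok, 'setup'):
        tok.setup(items)
    return tok

class _Counting:
    "Hypothetical tokenizer whose setup counts the documents it was shown."
    def setup(self, items): self.n_docs = len(list(items))
    def __call__(self, items): return (t.split() for t in items)

tok = run_setup_if_any(_Counting(), ['one doc', 'two docs'])
```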

Tokenize texts in files

Preprocessing function for texts in filenames. Tokenized texts will be saved in a similar fashion in a directory suffixed with _tok in the parent folder of path (override with output_dir).
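The default destination described above can be sketched as a pure-path computation (the helper name is illustrative):

```python
from pathlib import Path

def default_output_dir(path):
    # Tokenized files go to a sibling directory suffixed with "_tok",
    # in the parent folder of `path` (overridable with `output_dir`).
    path = Path(path)
    return path.parent / f'{path.name}_tok'
```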

tokenize_folder[source]

tokenize_folder(path, extensions=None, folders=None, output_dir=None, n_workers=56, rules=None, tok_func='SpacyTokenizer', encoding='utf8', **tok_kwargs)

Tokenize text files in path in parallel using n_workers

read_tokenized_file[source]

read_tokenized_file(f)

The result will be in output_dir (defaults to a folder in the same parent directory as path, with _tok added to path.name) with the same structure as in path. Tokenized texts for a given file will be in the file having the same name in output_dir. Additionally, a file with a .len suffix contains the number of tokens and the count of all words is stored in output_dir/counter.pkl.

extensions defaults to ['.txt'] and all text files in path are processed unless you specify a list of subfolders in folders. tok_func is instantiated in each process with tok_kwargs, and rules (which default to defaults.text_proc_rules) are applied to each text before it goes in the tokenizer.

tokenize_files[source]

tokenize_files(files, path, output_dir, output_names=None, n_workers=56, rules=None, tok_func='SpacyTokenizer', encoding='utf8', **tok_kwargs)

Tokenize text files in parallel using n_workers

Tokenize texts in a dataframe

tokenize_texts[source]

tokenize_texts(texts, n_workers=56, rules=None, tok_func='SpacyTokenizer', **tok_kwargs)

Tokenize texts in parallel using n_workers

tokenize_df[source]

tokenize_df(df, text_cols, n_workers=56, rules=None, mark_fields=None, tok_func='SpacyTokenizer', res_col_name='text', **tok_kwargs)

Tokenize texts in df[text_cols] in parallel using n_workers

This function returns a new dataframe with the same non-text columns, a column named text that contains the tokenized texts and a column named text_length that contains their respective lengths. It also returns a counter of all seen words to quickly build a vocabulary afterward.

tok_func is instantiated in each process with tok_kwargs, and rules (which default to defaults.text_proc_rules) are applied to each text before it goes in the tokenizer. If mark_fields isn't specified, it defaults to False when there is a single text column, True when there are several. In that case, the texts in each of those columns are joined with FLD markers followed by the number of the field.
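The field joining can be sketched as follows; the 'xxfld' value of FLD and the helper name are assumptions for illustration:

```python
FLD = 'xxfld'  # assumed field-marker token

def join_fields(texts, mark_fields=True):
    # Join the texts of several columns, optionally prefixing each with
    # "xxfld <field number>" so the model can tell the fields apart.
    if not mark_fields:
        return ' '.join(texts)
    return ' '.join(f'{FLD} {i+1} {t}' for i, t in enumerate(texts))
```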

tokenize_csv[source]

tokenize_csv(fname, text_cols, outname=None, n_workers=4, rules=None, mark_fields=None, tok_func='SpacyTokenizer', header='infer', chunksize=50000, **tok_kwargs)

Tokenize texts in the text_cols of the csv fname in parallel using n_workers

load_tokenized_csv[source]

load_tokenized_csv(fname)

Utility function to quickly load a tokenized csv and the corresponding counter

The result will be written in a new csv file in outname (defaults to the same as fname with the suffix _tok.csv) and will have the same header as the original file, the same non-text columns, and a text and a text_length column as described in tokenize_df.

tok_func is instantiated in each process with tok_kwargs, and rules (which default to defaults.text_proc_rules) are applied to each text before it goes in the tokenizer. If mark_fields isn't specified, it defaults to False when there is a single text column, True when there are several. In that case, the texts in each of those columns are joined with FLD markers followed by the number of the field.

The csv file is opened with header and optionally read in blocks of chunksize at a time. If this argument is passed, each chunk is processed independently and saved in the output file, to reduce memory usage.
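The memory saving relies on pandas' chunked reading: with chunksize, read_csv returns an iterator of dataframes instead of one big frame. A self-contained sketch (the data is made up):

```python
import io
import pandas as pd

csv_data = io.StringIO("text,label\nhello world,0\nanother line,1\nlast one,2\n")

# Each yielded chunk can be tokenized and appended to the output file
# while the rest of the csv stays on disk.
sizes = [len(chunk) for chunk in pd.read_csv(csv_data, header='infer', chunksize=2)]
```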

def _prepare_texts(tmp_d):
    "Prepare texts in a folder struct in tmp_d, a csv file and returns a dataframe"
    path = Path(tmp_d)/'tmp'
    path.mkdir()
    for d in ['a', 'b', 'c']: 
        (path/d).mkdir()
        for i in range(5):
            with open(path/d/f'text{i}.txt', 'w') as f: f.write(f"This is an example of text {d} {i}")
    
    texts = [f"This is an example of text {d} {i}" for i in range(5) for d in ['a', 'b', 'c']]
    df = pd.DataFrame({'text': texts, 'label': list(range(15))}, columns=['text', 'label'])
    csv_fname = tmp_d/'input.csv'
    df.to_csv(csv_fname, index=False)
    return path,df,csv_fname
with tempfile.TemporaryDirectory() as tmp_d:
    path,df,csv_fname = _prepare_texts(Path(tmp_d))
    #Tokenize as folders
    tokenize_folder(path)
    outp = Path(tmp_d)/'tmp_tok'
    for d in ['a', 'b', 'c']: 
        p = outp/d
        for i in range(5):
            test_eq((p/f'text{i}.txt').read(), ' '.join([
                BOS, TK_MAJ, 'this', 'is', 'an', 'example', 'of', 'text', d, str(i) ]))
    cnt_a = (outp/fn_counter_pkl).load()
    test_eq(cnt_a['this'], 15)
    test_eq(cnt_a['a'], 5)
    test_eq(cnt_a['0'], 3)
    
    #Tokenize as files
    files = get_text_files(path)
    tokenize_files(files, path, path/'d')
    for f in files: 
        test_eq((path/'d'/f.relative_to(path)).read(), ' '.join([
                BOS, TK_MAJ, 'this', 'is', 'an', 'example', 'of', 'text', f.parent.name, f.name[4]]))
    
    #Tokenize as individual texts
    out = tokenize_texts(df['text'].values)
    test_eq(out, [(outp/d/f'text{i}.txt').read().split(' ') for i in range(5) for d in ['a', 'b', 'c']])
    
    #Tokenize as a dataframe
    out,cnt_b = tokenize_df(df, text_cols='text')
    test_eq(list(out.columns), ['label', 'text', 'text_length'])
    test_eq(out['label'].values, df['label'].values)
    test_eq(out['text'], [(outp/d/f'text{i}.txt').read().split(' ') for i in range(5) for d in ['a', 'b', 'c']])
    test_eq(cnt_a, cnt_b)
    
    #Tokenize as a csv 
    out_fname = Path(tmp_d)/'output.csv'
    tokenize_csv(csv_fname, text_cols='text', outname=out_fname)
    test_eq((out,cnt_b), load_tokenized_csv(out_fname))

get_tokenizer[source]

get_tokenizer(tok_func='SpacyTokenizer', **kwargs)

class Tokenizer[source]

Tokenizer(tokenizer, rules=None, counter=None, lengths=None, mode=None, sep=' ') :: Transform

Delegates (__call__,decode,setup) to (encodes,decodes,setups) if split_idx matches

with tempfile.TemporaryDirectory() as tmp_d:
    path,df,csv_fname = _prepare_texts(Path(tmp_d))
    items = get_text_files(path)
    splits = RandomSplitter()(items)
    dsets = Datasets(items, [Tokenizer.from_folder(path)], splits=splits)
    print(dsets.train[0])
    
    dsets = Datasets(df, [[attrgetter('text'), Tokenizer.from_df('text')]], splits=splits)
    print(dsets.train[0])
((#10) ['xxbos','xxmaj','this','is','an','example','of','text','a','1'],)
(((#2) ['xxbos','xxbos'], (#2) ['xxbos','xxmaj'], (#2) ['xxbos','this'], (#2) ['xxbos','is'], (#2) ['xxbos','an'], (#2) ['xxbos','example'], (#2) ['xxbos','of'], (#2) ['xxbos','text'], (#2) ['xxbos','b'], (#2) ['xxbos','2']),)
tst = test_set(dsets, ['This is a test', 'this is another test'])
test_eq(tst, [(['xxbos', 'xxmaj', 'this','is','a','test'],), 
              (['xxbos','this','is','another','test'],)])

Sentencepiece

class SentencePieceTokenizer[source]

SentencePieceTokenizer(lang='en', special_toks=None, sp_model=None, vocab_sz=None, max_vocab_sz=30000, model_type='unigram', char_coverage=None, cache_dir='tmp')

SentencePiece tokenizer for lang

texts = [f"This is an example of text {i}" for i in range(10)]
df = pd.DataFrame({'text': texts, 'label': list(range(10))}, columns=['text', 'label'])
out,cnt = tokenize_df(df, text_cols='text', tok_func=SentencePieceTokenizer, vocab_sz=34, n_workers=1)