Basic functions to preprocess tabular data before assembling it in a `DataLoaders`.

Initial preprocessing

make_date[source]

make_date(df, date_field)

Make sure df[date_field] is of the right date type.

df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']})
make_date(df, 'date')
test_eq(df['date'].dtype, np.dtype('datetime64[ns]'))

add_datepart[source]

add_datepart(df, field_name, prefix=None, drop=True, time=False)

Helper function that adds columns relevant to a date in the column field_name of df.

df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']})
df = add_datepart(df, 'date')
test_eq(df.columns, ['Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear', 'Is_month_end', 'Is_month_start', 
            'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start', 'Elapsed'])
df.head()
Year Month Week Day Dayofweek Dayofyear Is_month_end Is_month_start Is_quarter_end Is_quarter_start Is_year_end Is_year_start Elapsed
0 2019 12 49 4 2 338 False False False False False False 1575417600
1 2019 11 48 29 4 333 False False False False False False 1574985600
2 2019 11 46 15 4 319 False False False False False False 1573776000
3 2019 10 43 24 3 297 False False False False False False 1571875200
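
To see where the numbers in this table come from, the sketch below recomputes some of the columns for a single date using only the Python standard library. It is an illustration, not fastai's implementation: the helper name `date_parts` is hypothetical, and treating `Week` as the ISO week number is an assumption that happens to match the output above.

```python
import calendar
from datetime import date

def date_parts(d):
    """Return a few date-derived fields for a datetime.date (hypothetical helper)."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    return {
        "Year": d.year,
        "Month": d.month,
        "Week": d.isocalendar()[1],           # ISO week number (assumption)
        "Day": d.day,
        "Dayofweek": d.weekday(),             # Monday == 0
        "Dayofyear": d.timetuple().tm_yday,
        "Is_month_end": d.day == last_day,
        "Is_month_start": d.day == 1,
        # Seconds between 1970-01-01 and midnight UTC of the given date
        "Elapsed": calendar.timegm(d.timetuple()),
    }

parts = date_parts(date(2019, 12, 4))
print(parts)
```

For `2019-12-04` this reproduces the first row above: week 49, day of week 2, day of year 338 and an elapsed value of 1575417600.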

add_elapsed_times[source]

add_elapsed_times(df, field_names, date_field, base_field)

For each event in field_names, add to df the elapsed time according to date_field, grouped by base_field.

df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24'],
                   'event': [False, True, False, True], 'base': [1,1,2,2]})
df = add_elapsed_times(df, ['event'], 'date', 'base')
df
date event base Afterevent Beforeevent event_bw event_fw
0 2019-12-04 False 1 5 0 1.0 0.0
1 2019-11-29 True 1 0 0 1.0 1.0
2 2019-11-15 False 2 22 0 1.0 0.0
3 2019-10-24 True 2 0 0 1.0 1.0
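
The `Afterevent` column above can be read as "days since the most recent event within the same `base` group". The following is a minimal standard-library sketch of that idea, not the library's code: the function name `days_since_last_event` is hypothetical, and returning 0 when no earlier event exists is an assumption.

```python
from datetime import date

def days_since_last_event(rows):
    """rows: list of (day, event, group) tuples.

    Returns, for each row in its original position, the number of days since
    the most recent event in the same group (0 if there is none yet).
    """
    result = [0] * len(rows)
    last_seen = {}  # group -> date of the latest event seen so far
    # Visit rows in chronological order while remembering original positions
    for idx in sorted(range(len(rows)), key=lambda i: rows[i][0]):
        day, event, group = rows[idx]
        if event:
            last_seen[group] = day
        if group in last_seen:
            result[idx] = (day - last_seen[group]).days
    return result

rows = [
    (date(2019, 12, 4), False, 1),
    (date(2019, 11, 29), True, 1),
    (date(2019, 11, 15), False, 2),
    (date(2019, 10, 24), True, 2),
]
after_event = days_since_last_event(rows)
print(after_event)  # [5, 0, 22, 0]
```

The result matches the `Afterevent` values in the table: 5 days after the event in group 1 and 22 days after the event in group 2.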

cont_cat_split[source]

cont_cat_split(df, max_card=20, dep_var=None)

Helper function that returns column names of cont and cat variables from given df.
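
As a rough illustration of how `max_card` and `dep_var` could drive such a split, here is a library-free sketch. The rule used (floats are continuous, integers are continuous only when they have more than `max_card` distinct values, everything else is categorical) is an assumption made for this example, and `split_cont_cat` is a hypothetical name, not the fastai function.

```python
def split_cont_cat(columns, max_card=20, dep_var=None):
    """Split column names into (continuous, categorical) lists.

    columns: dict mapping a column name to a list of values.
    """
    cont_names, cat_names = [], []
    for name, values in columns.items():
        if name == dep_var:
            continue  # the dependent variable is neither cont nor cat
        is_float = all(isinstance(v, float) for v in values)
        # bool is a subclass of int in Python, so exclude it explicitly
        is_int = all(isinstance(v, int) and not isinstance(v, bool) for v in values)
        if is_float or (is_int and len(set(values)) > max_card):
            cont_names.append(name)
        else:
            cat_names.append(name)
    return cont_names, cat_names

columns = {
    "age": list(range(18, 48)),                 # 30 distinct integers
    "n_children": [i % 3 for i in range(30)],   # 3 distinct integers
    "income": [1000.0 + i for i in range(30)],  # floats
    "city": ["a", "b"] * 15,                    # strings
    "label": [i % 2 for i in range(30)],        # dependent variable
}
cont, cat = split_cont_cat(columns, max_card=20, dep_var="label")
print(cont, cat)  # ['age', 'income'] ['n_children', 'city']
```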

df_shrink_dtypes[source]

df_shrink_dtypes(df, skip=[], obj2cat=True, int2uint=False)

Return any possible smaller data types for DataFrame columns. Allows object->category, int->uint, and exclusion.

df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'e': [True, False, True],
                   'date':['2019-12-04','2019-11-29','2019-11-15',]})
dt = df_shrink_dtypes(df)
test_eq(df['i'].dtype, 'int64')
test_eq(dt['i'], 'int8')

test_eq(df['f'].dtype, 'float64')
test_eq(dt['f'], 'float32')

# By default, 'object' columns are converted to 'category'
test_eq(df['date'].dtype, 'object')
test_eq(dt['date'], 'category')

# With obj2cat=False, 'object' columns are left out of the returned dtypes
dt2 = df_shrink_dtypes(df, obj2cat=False)
test_eq('date' not in dt2, True)

df_shrink[source]

df_shrink(df, skip=[], obj2cat=True, int2uint=False)

Reduce DataFrame memory usage, by casting to smaller types returned by df_shrink_dtypes().

df_shrink(df) attempts to make a DataFrame use less memory by fitting numeric columns into the smallest datatypes that can hold their values. In addition:

  • boolean, category, datetime64[ns] dtype columns are ignored.
  • 'object' type columns are categorified, which can save a lot of memory in large datasets. This can be turned off with obj2cat=False.
  • int2uint=True fits int types to uint types if all data in the column is >= 0.
  • columns can be excluded by name using skip=['col1','col2'].

To get only the new column data types without actually casting a DataFrame, use df_shrink_dtypes() with the same parameters as df_shrink().
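
The core idea of picking the smallest integer type that can hold a column can be sketched without pandas. This is only an illustration under stated assumptions: `smallest_int_type` is a hypothetical helper, the type names mirror NumPy's, and the selection rule is simply "first type whose range covers the column's minimum and maximum".

```python
# (name, minimum, maximum) for each candidate type, smallest first
INT_TYPES = [(f"int{b}", -2 ** (b - 1), 2 ** (b - 1) - 1) for b in (8, 16, 32, 64)]
UINT_TYPES = [(f"uint{b}", 0, 2 ** b - 1) for b in (8, 16, 32, 64)]

def smallest_int_type(values, int2uint=False):
    """Return the name of the smallest integer type that can hold all values."""
    lo, hi = min(values), max(values)
    # Unsigned types are only considered when requested and no value is negative
    candidates = UINT_TYPES if (int2uint and lo >= 0) else INT_TYPES
    for name, type_min, type_max in candidates:
        if type_min <= lo and hi <= type_max:
            return name
    raise ValueError("values do not fit in a 64-bit integer")

print(smallest_int_type([-100, 0, 100]))               # int8
print(smallest_int_type([0, 10, 254]))                 # int16
print(smallest_int_type([0, 10, 254], int2uint=True))  # uint8
```

The value 254 does not fit in a signed 8-bit integer (maximum 127), which is why the signed choice is 16 bits while the unsigned choice is 8 bits.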

df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'u':[0, 10,254],
                  'date':['2019-12-04','2019-11-29','2019-11-15']})
df2 = df_shrink(df, skip=['date'])

test_eq(df['i'].dtype=='int64' and df2['i'].dtype=='int8', True)
test_eq(df['f'].dtype=='float64' and df2['f'].dtype=='float32', True)
test_eq(df['u'].dtype=='int64' and df2['u'].dtype=='int16', True)
test_eq(df2['date'].dtype, 'object')

test_eq(df2.memory_usage().sum() < df.memory_usage().sum(), True)

# Test int => uint (when col.min() >= 0)
df3 = df_shrink(df, int2uint=True)
test_eq(df3['u'].dtype, 'uint8')  # int64 -> uint8 instead of int16

# Test excluding columns
df4 = df_shrink(df, skip=['i','u'])
test_eq(df['i'].dtype, df4['i'].dtype)
test_eq(df4['u'].dtype, 'int64')

Here's an example using the ADULT_SAMPLE dataset:

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
new_df = df_shrink(df, int2uint=True)
print(f"Memory usage: {df.memory_usage().sum()} --> {new_df.memory_usage().sum()}")
Memory usage: 3907448 --> 818665

class Tabular[source]

Tabular(df, procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, splits=None, do_setup=True, device=None, inplace=False, reduce_memory=True) :: CollBase

A DataFrame wrapper that knows which cols are cont/cat/y, and returns rows in __getitem__

  • df: A DataFrame of your data
  • cat_names: Your categorical x variables
  • cont_names: Your continuous x variables
  • y_names: Your dependent y variables
    • Note: Mixing target types (e.g. regression and classification) is not currently supported; multiple regression outputs or multiple classification outputs are.
  • y_block: How to sub-categorize the type of y_names (CategoryBlock or RegressionBlock)
  • splits: How to split your data
  • do_setup: Whether Tabular runs the data through the procs upon initialization
  • device: cuda or cpu
  • inplace: If True, Tabular will not keep a separate copy of your original DataFrame in memory. You should ensure pd.options.mode.chained_assignment is None before setting this.
  • reduce_memory: If True, fastai attempts to reduce the memory used by the input DataFrame with df_shrink
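
Elsewhere on this page `splits` is passed as a pair of index lists, for example `[[0,1,2],[3,4]]` for training and validation rows. The sketch below shows one way to build such a pair at random with the standard library; `random_split`, its `valid_pct` default of 0.2 and the fixed seed are assumptions made for illustration, not fastai's splitter.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=42):
    """Return (train_idxs, valid_idxs): two disjoint lists covering range(n_items)."""
    idxs = list(range(n_items))
    random.Random(seed).shuffle(idxs)  # local generator, global state untouched
    cut = int(n_items * valid_pct)
    return idxs[cut:], idxs[:cut]

train_idxs, valid_idxs = random_split(10)
print(len(train_idxs), len(valid_idxs))  # 8 2
```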

class TabularPandas[source]

TabularPandas(df, procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, splits=None, do_setup=True, device=None, inplace=False, reduce_memory=True) :: Tabular

A Tabular object with transforms

df = pd.DataFrame({'a':[0,1,2,0,2], 'b':[0,0,0,0,1]})
to = TabularPandas(df, cat_names='a')
t = pickle.loads(pickle.dumps(to))
test_eq(t.items,to.items)
test_eq(to.all_cols,to[['a']])

class TabularProc[source]

TabularProc(enc=None, dec=None, split_idx=None, order=None) :: InplaceTransform

Base class to write a non-lazy tabular processor for dataframes

setups[source]

setups(to:Tabular)

encodes[source]

encodes(to:Tabular)

decodes[source]

decodes(to:Tabular)

class Categorify[source]

Categorify(enc=None, dec=None, split_idx=None, order=None) :: TabularProc

Transform the categorical variables to something similar to pd.Categorical

df = pd.DataFrame({'a':[0,1,2,0,2]})
to = TabularPandas(df, Categorify, 'a')
cat = to.procs.categorify
test_eq(cat['a'], ['#na#',0,1,2])
test_eq(to['a'], [1,2,3,1,3])
to.show()
a
0 0
1 1
2 2
3 0
4 2
df1 = pd.DataFrame({'a':[1,0,3,-1,2]})
to1 = to.new(df1)
to1.process()
#Values that weren't in the training df are sent to 0 (na)
test_eq(to1['a'], [2,1,0,0,3])
to2 = cat.decode(to1)
test_eq(to2['a'], [1,0,'#na#','#na#',2])
cat = Categorify()
df = pd.DataFrame({'a':[0,1,2,3,2]})
to = TabularPandas(df, cat, 'a', splits=[[0,1,2],[3,4]])
test_eq(cat['a'], ['#na#',0,1,2])
test_eq(to['a'], [1,2,3,0,3])
df = pd.DataFrame({'a':pd.Categorical(['M','H','L','M'], categories=['H','M','L'], ordered=True)})
to = TabularPandas(df, Categorify, 'a')
cat = to.procs.categorify
test_eq(cat['a'], ['#na#','H','M','L'])
test_eq(to.items.a, [2,1,3,2])
to2 = cat.decode(to)
test_eq(to2['a'], ['M','H','L','M'])
cat = Categorify()
df = pd.DataFrame({'a':[0,1,2,3,2], 'b': ['a', 'b', 'a', 'b', 'b']})
to = TabularPandas(df, cat, 'a', splits=[[0,1,2],[3,4]], y_names='b')
test_eq(to.vocab, ['a', 'b'])
test_eq(to['b'], [0,1,0,1,1])
to2 = to.procs.decode(to)
test_eq(to2['b'], ['a', 'b', 'a', 'b', 'b'])
cat = Categorify()
df = pd.DataFrame({'a':[0,1,2,3,2], 'b': ['a', 'b', 'a', 'c', 'b']})
to = TabularPandas(df, cat, 'a', splits=[[0,1,2],[3,4]], y_names='b')
test_eq(to.vocab, ['a', 'b'])
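
The encoding behaviour exercised by the tests above (index 0 reserved for `#na#`, unseen values mapped to 0) can be sketched in plain Python. This is a simplified illustration rather than the library's implementation: `fit_vocab`, `encode` and `decode` are hypothetical helpers, and sorting the observed values is an assumption that holds for plain, non-ordered categories.

```python
NA_TOKEN = "#na#"

def fit_vocab(train_values):
    """Build a vocab with the missing-value token at index 0."""
    return [NA_TOKEN] + sorted(set(train_values))

def encode(values, vocab):
    """Map each value to its vocab index; unseen values fall back to 0."""
    index = {v: i for i, v in enumerate(vocab)}
    return [index.get(v, 0) for v in values]

def decode(codes, vocab):
    """Map vocab indices back to their values."""
    return [vocab[c] for c in codes]

vocab = fit_vocab([0, 1, 2, 0, 2])
train_codes = encode([0, 1, 2, 0, 2], vocab)
new_codes = encode([1, 0, 3, -1, 2], vocab)   # 3 and -1 were never seen
decoded = decode(new_codes, vocab)
print(vocab)        # ['#na#', 0, 1, 2]
print(train_codes)  # [1, 2, 3, 1, 3]
print(new_codes)    # [2, 1, 0, 0, 3]
print(decoded)      # [1, 0, '#na#', '#na#', 2]
```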

class NormalizeTab[source]

NormalizeTab(enc=None, dec=None, split_idx=None, order=None) :: TabularProc

Normalize the continuous variables

setups[source]

setups(to:Tabular)

encodes[source]

encodes(to:Tabular)

decodes[source]

decodes(to:Tabular)

norm = Normalize()
df = pd.DataFrame({'a':[0,1,2,3,4]})
to = TabularPandas(df, norm, cont_names='a')
x = np.array([0,1,2,3,4])
m,s = x.mean(),x.std()
test_eq(norm.means['a'], m)
test_close(norm.stds['a'], s)
test_close(to['a'].values, (x-m)/s)
df1 = pd.DataFrame({'a':[5,6,7]})
to1 = to.new(df1)
to1.process()
test_close(to1['a'].values, (np.array([5,6,7])-m)/s)
to2 = norm.decode(to1)
test_close(to2['a'].values, [5,6,7])
norm = Normalize()
df = pd.DataFrame({'a':[0,1,2,3,4]})
to = TabularPandas(df, norm, cont_names='a', splits=[[0,1,2],[3,4]])
x = np.array([0,1,2])
m,s = x.mean(),x.std()
test_eq(norm.means['a'], m)
test_close(norm.stds['a'], s)
test_close(to['a'].values, (np.array([0,1,2,3,4])-m)/s)
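
The last test above relies on the statistics being computed on the training split only and then applied to every row. A minimal standard-library sketch of that pattern follows; the helper names are hypothetical, and the population standard deviation is used because it matches the default of `x.std()` on a NumPy array in the tests. Any small epsilon a real implementation might add to the standard deviation is omitted here.

```python
from statistics import fmean, pstdev

def fit_stats(train_values):
    """Mean and population standard deviation of the training values."""
    return fmean(train_values), pstdev(train_values)

def normalize(values, mean, std):
    return [(v - mean) / std for v in values]

def denormalize(values, mean, std):
    return [v * std + mean for v in values]

data = [0, 1, 2, 3, 4]
train = data[:3]                      # rows 0..2 play the role of the training split
mean, std = fit_stats(train)          # statistics come from the training rows only
normed = normalize(data, mean, std)   # but are applied to every row
restored = denormalize(normed, mean, std)
print(mean, std)
print(normed)
```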

class FillStrategy[source]

FillStrategy()

Namespace containing the various filling strategies.

Currently, filling with the median, a constant, and the mode are supported.

class FillMissing[source]

FillMissing(fill_strategy='median', add_col=True, fill_vals=None) :: TabularProc

Fill the missing values in continuous columns.

fill1,fill2,fill3 = (FillMissing(fill_strategy=s) 
                     for s in [FillStrategy.median, FillStrategy.constant, FillStrategy.mode])
df = pd.DataFrame({'a':[0,1,np.nan,1,2,3,4]})
df1 = df.copy(); df2 = df.copy()
tos = (TabularPandas(df, fill1, cont_names='a'),
       TabularPandas(df1, fill2, cont_names='a'),
       TabularPandas(df2, fill3, cont_names='a'))
test_eq(fill1.na_dict, {'a': 1.5})
test_eq(fill2.na_dict, {'a': 0})
test_eq(fill3.na_dict, {'a': 1.0})

for t in tos: test_eq(t.cat_names, ['a_na'])

for to_,v in zip(tos, [1.5, 0., 1.]):
    test_eq(to_['a'].values, np.array([0, 1, v, 1, 2, 3, 4]))
    test_eq(to_['a_na'].values, np.array([0, 0, 1, 0, 0, 0, 0]))
fill = FillMissing() 
df = pd.DataFrame({'a':[0,1,np.nan,1,2,3,4], 'b': [0,1,2,3,4,5,6]})
to = TabularPandas(df, fill, cont_names=['a', 'b'])
test_eq(fill.na_dict, {'a': 1.5})
test_eq(to.cat_names, ['a_na'])
test_eq(to['a'].values, np.array([0, 1, 1.5, 1, 2, 3, 4]))
test_eq(to['a_na'].values, np.array([0, 0, 1, 0, 0, 0, 0]))
test_eq(to['b'].values, np.array([0,1,2,3,4,5,6]))
procs = [Normalize, Categorify, FillMissing, noop]
df = pd.DataFrame({'a':[0,1,2,1,1,2,0], 'b':[0,1,np.nan,1,2,3,4]})
to = TabularPandas(df, procs, cat_names='a', cont_names='b')

#Test setup and apply on df_main
test_eq(to.cat_names, ['a', 'b_na'])
test_eq(to['a'], [1,2,3,2,2,3,1])
test_eq(to['b_na'], [1,1,2,1,1,1,1])
x = np.array([0,1,1.5,1,2,3,4])
m,s = x.mean(),x.std()
test_close(to['b'].values, (x-m)/s)
test_eq(to.classes, {'a': ['#na#',0,1,2], 'b_na': ['#na#',False,True]})
df = pd.DataFrame({'a':[0,1,2,1,1,2,0], 'b':[0,1,np.nan,1,2,3,4], 'c': ['b','a','b','a','a','b','a']})
to = TabularPandas(df, procs, 'a', 'b', y_names='c')

test_eq(to.cat_names, ['a', 'b_na'])
test_eq(to['a'], [1,2,3,2,2,3,1])
test_eq(to['b_na'], [1,1,2,1,1,1,1])
test_eq(to['c'], [1,0,1,0,0,1,0])
x = np.array([0,1,1.5,1,2,3,4])
m,s = x.mean(),x.std()
test_close(to['b'].values, (x-m)/s)
test_eq(to.classes, {'a': ['#na#',0,1,2], 'b_na': ['#na#',False,True]})
test_eq(to.vocab, ['a','b'])
df = pd.DataFrame({'a':[0,1,2,1,1,2,0], 'b':[0,1,np.nan,1,2,3,4], 'c': ['b','a','b','a','a','b','a']})
to = TabularPandas(df, procs, 'a', 'b', y_names='c')

test_eq(to.cat_names, ['a', 'b_na'])
test_eq(to['a'], [1,2,3,2,2,3,1])
test_eq(df.a.dtype,int)
test_eq(to['b_na'], [1,1,2,1,1,1,1])
test_eq(to['c'], [1,0,1,0,0,1,0])
df = pd.DataFrame({'a':[0,1,2,1,1,2,0], 'b':[0,np.nan,1,1,2,3,4], 'c': ['b','a','b','a','a','b','a']})
to = TabularPandas(df, procs, cat_names='a', cont_names='b', y_names='c', splits=[[0,1,4,6], [2,3,5]])

test_eq(to.cat_names, ['a', 'b_na'])
test_eq(to['a'], [1,2,2,1,0,2,0])
test_eq(df.a.dtype,int)
test_eq(to['b_na'], [1,2,1,1,1,1,1])
test_eq(to['c'], [1,0,0,0,1,0,1])
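
The three filling strategies and the extra `_na` indicator column can be illustrated on the same toy column without pandas. This is a sketch under stated assumptions: `fill_missing` is a hypothetical helper, and the constant strategy is assumed to default to 0, as the `na_dict` values in the tests above suggest.

```python
import math
from statistics import median, mode

def fill_missing(values, strategy="median", fill_value=0.0):
    """Return (filled_values, was_missing) for a list of floats containing NaNs."""
    present = [v for v in values if not math.isnan(v)]
    if strategy == "median":
        fill = median(present)
    elif strategy == "mode":
        fill = mode(present)
    elif strategy == "constant":
        fill = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # The indicator records which rows were originally missing
    was_missing = [math.isnan(v) for v in values]
    filled = [fill if missing else v for v, missing in zip(values, was_missing)]
    return filled, was_missing

col = [0.0, 1.0, math.nan, 1.0, 2.0, 3.0, 4.0]
for strategy in ("median", "constant", "mode"):
    filled, was_missing = fill_missing(col, strategy)
    print(strategy, filled[2])
```

The missing entry becomes 1.5 with the median, 0.0 with the constant and 1.0 with the mode, mirroring the `na_dict` values checked above.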

class ReadTabBatch[source]

ReadTabBatch(to) :: ItemTransform

A transform that always takes tuples as items

from torch.utils.data.dataloader import _MultiProcessingDataLoaderIter,_SingleProcessDataLoaderIter,_DatasetKind
_loaders = (_MultiProcessingDataLoaderIter,_SingleProcessDataLoaderIter)

class TabDataLoader[source]

TabDataLoader(dataset, bs=16, shuffle=False, after_batch=None, num_workers=0, verbose=False, do_setup=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, wif=None, before_iter=None, after_item=None, before_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None) :: TfmdDL

A transformed DataLoader for Tabular data

Integration example

For a more in-depth explanation, see the tabular tutorial.

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_test.drop('salary', axis=1, inplace=True)
df_main.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse NaN Wife White Female 0 1902 40 United-States >=50k
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States >=50k
2 38 Private 96185 HS-grad NaN Divorced NaN Unmarried Black Female 0 0 32 United-States <50k
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States >=50k
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States <50k
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="salary", splits=splits)
dls = to.dataloaders()
dls.valid.show_batch()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary
0 Private HS-grad Married-civ-spouse Craft-repair Husband White False 57.000001 124771.002845 9.0 <50k
1 Private Some-college Separated Other-service Unmarried Black False 34.000000 232855.000031 10.0 <50k
2 ? HS-grad Married-civ-spouse #na# Husband White True 60.000000 174073.000337 10.0 <50k
3 Local-gov Masters Divorced Prof-specialty Unmarried White False 53.000000 176058.999402 14.0 <50k
4 Local-gov Bachelors Never-married Prof-specialty Other-relative White False 28.000000 356088.995090 13.0 <50k
5 Private HS-grad Married-civ-spouse Transport-moving Husband White False 39.000000 33885.998610 9.0 <50k
6 Private HS-grad Never-married Craft-repair Own-child White False 45.000000 67715.996922 9.0 >=50k
7 Self-emp-inc Bachelors Married-civ-spouse Sales Husband White False 66.000001 249042.997726 13.0 >=50k
8 State-gov Some-college Never-married Adm-clerical Own-child White False 18.999999 194259.999915 10.0 <50k
9 Private Assoc-voc Divorced Exec-managerial Not-in-family White False 44.000000 247879.998460 11.0 >=50k
to.show()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary
6170 Self-emp-not-inc Bachelors Married-civ-spouse Exec-managerial Husband White False 42.0 101593.0 13.0 >=50k
9510 Private Bachelors Married-civ-spouse Prof-specialty Husband Amer-Indian-Eskimo False 51.0 215404.0 13.0 >=50k
6619 Private HS-grad Divorced Adm-clerical Not-in-family White False 53.0 195638.0 9.0 <50k
4119 Private HS-grad Married-civ-spouse Craft-repair Husband White False 43.0 50646.0 9.0 <50k
6359 Self-emp-not-inc 9th Married-civ-spouse Craft-repair Husband White False 65.0 144822.0 5.0 <50k
773 Local-gov Assoc-voc Married-civ-spouse Protective-serv Husband White False 40.0 141649.0 11.0 >=50k
2413 Self-emp-not-inc HS-grad Married-civ-spouse Exec-managerial Husband White False 41.0 408498.0 9.0 <50k
3052 Private HS-grad Divorced Craft-repair Not-in-family White False 59.0 87510.0 9.0 <50k
7317 Private Bachelors Never-married Exec-managerial Not-in-family White False 39.0 202950.0 13.0 <50k
1006 Self-emp-not-inc Assoc-acdm Married-civ-spouse Sales Husband White False 29.0 179008.0 12.0 <50k

We can decode any row of transformed data by passing it to to.decode_row:

row = to.items.iloc[0]
to.decode_row(row)
age                                  42
workclass              Self-emp-not-inc
fnlwgt                           101593
education                     Bachelors
education-num                        13
marital-status       Married-civ-spouse
occupation              Exec-managerial
relationship                    Husband
race                              White
sex                                Male
capital-gain                          0
capital-loss                          0
hours-per-week                       60
native-country            United-States
salary                            >=50k
education-num_na                  False
Name: 6170, dtype: object
to_tst = to.new(df_test)
to_tst.process()
to_tst.items.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country education-num_na
10000 0.468572 5 1.333707 10 1.174040 3 2 1 2 Male 0 0 40 Philippines 1
10001 -0.924597 5 1.247241 12 -0.425197 3 15 1 4 Male 0 0 40 United-States 1
10002 1.055170 5 0.151604 2 -1.224815 1 9 2 5 Female 0 0 37 United-States 1
10003 0.541897 5 -0.278734 12 -0.425197 7 2 5 5 Female 0 0 43 United-States 1
10004 0.761871 6 1.435587 9 0.374421 3 5 1 5 Male 0 0 60 United-States 1
tst_dl = dls.valid.new(to_tst)
tst_dl.show_batch()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num
0 Private Bachelors Married-civ-spouse Adm-clerical Husband Asian-Pac-Islander False 45.000000 338104.995209 13.0
1 Private HS-grad Married-civ-spouse Transport-moving Husband Other False 26.000000 328663.003774 9.0
2 Private 11th Divorced Other-service Not-in-family White False 53.000000 209022.000780 7.0
3 Private HS-grad Widowed Adm-clerical Unmarried White False 46.000000 162029.998572 9.0
4 Self-emp-inc Assoc-voc Married-civ-spouse Exec-managerial Husband White False 49.000000 349230.004432 11.0
5 Local-gov Some-college Married-civ-spouse Exec-managerial Husband White False 34.000000 124826.997130 10.0
6 Self-emp-inc Some-college Married-civ-spouse Sales Husband White False 53.000000 290640.001413 10.0
7 Private Some-college Never-married Sales Own-child White False 18.999999 106272.998476 10.0
8 Private Some-college Married-civ-spouse Protective-serv Husband Black False 71.999999 53683.997212 10.0
9 Private Some-college Never-married Sales Own-child White False 19.999999 505979.998937 10.0

Other target types

Multi-label categories

One-hot encoded label

def _mock_multi_label(df):
    sal,sex,white = [],[],[]
    for row in df.itertuples():
        sal.append(row.salary == '>=50k')
        sex.append(row.sex == ' Male')
        white.append(row.race == ' White')
    df['salary'] = np.array(sal)
    df['male']   = np.array(sex)
    df['white']  = np.array(white)
    return df
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
df_main.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary male white
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse NaN Wife White Female 0 1902 40 United-States True False True
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States True True True
2 38 Private 96185 HS-grad NaN Divorced NaN Unmarried Black Female 0 0 32 United-States False False False
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States True True False
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States False False False

setups[source]

setups(to:Tabular)

encodes[source]

encodes(to:Tabular)

decodes[source]

decodes(to:Tabular)

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
y_names=["salary", "male", "white"]
%time to = TabularPandas(df_main, procs, cat_names, cont_names, y_names=y_names, y_block=MultiCategoryBlock(encoded=True, vocab=y_names), splits=splits)
CPU times: user 84.9 ms, sys: 0 ns, total: 84.9 ms
Wall time: 83.4 ms
dls = to.dataloaders()
dls.valid.show_batch()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary male white
0 Private HS-grad Never-married Adm-clerical Not-in-family White False 27.000000 204788.000254 9.0 False False True
1 Private Some-college Never-married Sales Unmarried White False 21.000000 119704.000225 10.0 False False True
2 Without-pay HS-grad Married-civ-spouse Adm-clerical Wife White False 52.000000 198262.000006 9.0 False False True
3 ? HS-grad Married-civ-spouse ? Husband White False 32.000000 647882.018627 9.0 False True True
4 Private HS-grad Never-married Sales Not-in-family White False 21.000000 234107.999244 9.0 False True True
5 Private Some-college Never-married Other-service Not-in-family White False 21.000000 83032.993524 10.0 False False True
6 Private HS-grad Never-married #na# Other-relative White False 21.000000 265147.997727 9.0 False True True
7 Private Bachelors Married-civ-spouse Prof-specialty Husband White False 41.000000 32877.996387 13.0 True True True
8 Private Some-college Never-married Other-service Not-in-family White False 23.000000 210053.000696 10.0 False False True
9 Local-gov Bachelors Widowed Prof-specialty Unmarried White False 70.999999 365996.001912 13.0 False False True

Not one-hot encoded

def _mock_multi_label(df):
    targ = []
    for row in df.itertuples():
        labels = []
        if row.salary == '>=50k': labels.append('>50k')
        if row.sex == ' Male':   labels.append('male')
        if row.race == ' White': labels.append('white')
        targ.append(' '.join(labels))
    df['target'] = np.array(targ)
    return df
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
df_main.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary target
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse NaN Wife White Female 0 1902 40 United-States >=50k >50k white
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States >=50k >50k male white
2 38 Private 96185 HS-grad NaN Divorced NaN Unmarried Black Female 0 0 32 United-States <50k
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States >=50k >50k male
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States <50k
@MultiCategorize
def encodes(self, to:Tabular): 
    #to.transform(to.y_names, partial(_apply_cats, {n: self.vocab for n in to.y_names}, 0))
    return to
  
@MultiCategorize
def decodes(self, to:Tabular): 
    #to.transform(to.y_names, partial(_decode_cats, {n: self.vocab for n in to.y_names}))
    return to
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
%time to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="target", y_block=MultiCategoryBlock(), splits=splits)
CPU times: user 89.8 ms, sys: 0 ns, total: 89.8 ms
Wall time: 88.1 ms
to.procs[2].vocab
(#24) ['-','_','a','c','d','e','f','g','h','i'...]
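
The `target` column built by `_mock_multi_label` in this section joins the labels of each row with spaces. As a fastai-free illustration of how such strings relate to the boolean layout of the one-hot example, the sketch below splits each string and checks membership against a fixed vocab. The helper `multi_hot` is hypothetical, and splitting on whitespace is an assumption.

```python
def multi_hot(target, vocab):
    """Turn a space-delimited label string into one boolean per vocab entry."""
    labels = set(target.split())  # an empty string yields no labels
    return [label in labels for label in vocab]

vocab = [">50k", "male", "white"]
# Target strings of the first five rows shown above
targets = [">50k white", ">50k male white", "", ">50k male", ""]
encoded = [multi_hot(t, vocab) for t in targets]
for row in encoded:
    print(row)
```

The five rows come out as `[True, False, True]`, `[True, True, True]`, `[False, False, False]`, `[True, True, False]` and `[False, False, False]`, the same values as the `salary`, `male` and `white` columns in the one-hot example.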

Regression

setups[source]

setups(to:Tabular)

encodes[source]

encodes(to:Tabular)

decodes[source]

decodes(to:Tabular)

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
%time to = TabularPandas(df_main, procs, cat_names, cont_names, y_names='age', splits=splits)
CPU times: user 92.5 ms, sys: 0 ns, total: 92.5 ms
Wall time: 90.6 ms
to.procs[-1].means
fnlwgt           192279.640875
education-num        10.060125
dtype: float64
dls = to.dataloaders()
dls.valid.show_batch()
workclass education marital-status occupation relationship race education-num_na fnlwgt education-num age
0 ? 7th-8th Widowed ? Not-in-family White False 191287.999959 4.0 68.0
1 ? Some-college Never-married ? Own-child Asian-Pac-Islander False 144685.000584 10.0 20.0
2 Private Some-college Married-civ-spouse Craft-repair Husband White False 135162.002373 10.0 35.0
3 Private Bachelors Married-civ-spouse Adm-clerical Wife White False 221436.000153 13.0 27.0
4 Private HS-grad Separated Other-service Not-in-family Black False 265954.000517 9.0 35.0
5 Private HS-grad Married-civ-spouse Machine-op-inspct Husband White False 165936.999578 9.0 47.0
6 Private HS-grad Married-civ-spouse Craft-repair Husband White False 115040.003049 9.0 34.0
7 Private 10th Never-married Transport-moving Own-child White False 230574.000507 6.0 22.0
8 Private Some-college Married-civ-spouse Craft-repair Husband White False 221947.000033 10.0 41.0
9 Self-emp-inc Bachelors Married-civ-spouse Craft-repair Husband White False 126569.001634 13.0 37.0

Not being used now - for multi-modal

class TensorTabular(Tuple):
    def get_ctxs(self, max_n=10, **kwargs):
        n_samples = min(self[0].shape[0], max_n)
        df = pd.DataFrame(index = range(n_samples))
        return [df.iloc[i] for i in range(n_samples)]

    def display(self, ctxs): display_df(pd.DataFrame(ctxs))

class TabularLine(pd.Series):
    "A line of a dataframe that knows how to show itself"
    def show(self, ctx=None, **kwargs): return self if ctx is None else ctx.append(self)

class ReadTabLine(ItemTransform):
    def __init__(self, proc): self.proc = proc

    def encodes(self, row):
        cats,conts = (o.map(row.__getitem__) for o in (self.proc.cat_names,self.proc.cont_names))
        return TensorTabular(tensor(cats).long(),tensor(conts).float())

    def decodes(self, o):
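        # Pass names by keyword: the second positional parameter of TabularPandas is procs
        to = TabularPandas(o, cat_names=self.proc.cat_names, cont_names=self.proc.cont_names, y_names=self.proc.y_names)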
        to = self.proc.decode(to)
        return TabularLine(pd.Series({c: v for v,c in zip(to.items[0]+to.items[1], self.proc.cat_names+self.proc.cont_names)}))

class ReadTabTarget(ItemTransform):
    def __init__(self, proc): self.proc = proc
    def encodes(self, row): return row[self.proc.y_names].astype(np.int64)
    def decodes(self, o): return Category(self.proc.classes[self.proc.y_names][o])