How to use the tabular application in fastai

To illustrate the tabular application, we will use the Adult dataset, where the task is to predict whether a person earns more or less than $50k per year from some general data about them.

from fastai.tabular.all import *

We can download a sample of this dataset with the usual command:

path = untar_data(URLs.ADULT_SAMPLE)
path.ls()
(#3) [Path('/home/sgugger/.fastai/data/adult_sample/models'),Path('/home/sgugger/.fastai/data/adult_sample/adult.csv'),Path('/home/sgugger/.fastai/data/adult_sample/export.pkl')]

Then we can have a look at how the data is structured:

df = pd.read_csv(path/'adult.csv')
df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse NaN Wife White Female 0 1902 40 United-States >=50k
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States >=50k
2 38 Private 96185 HS-grad NaN Divorced NaN Unmarried Black Female 0 0 32 United-States <50k
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States >=50k
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States <50k

Some of the columns are continuous (like age) and we will treat them as float numbers that we can feed to our model directly. Others are categorical (like workclass or education) and we will convert them to a unique index that we will feed to embedding layers. We can specify the categorical and continuous column names, as well as the name of the dependent variable, in the TabularDataLoaders factory methods:

dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [Categorify, FillMissing, Normalize])

The last part is the list of pre-processors we apply to our data:

  • Categorify takes every categorical variable, builds a map from integers to its unique categories, then replaces each value with the corresponding index.
  • FillMissing fills the missing values in the continuous variables with the median of the existing values (you can choose a specific value if you prefer).
  • Normalize normalizes the continuous variables (subtracts the mean and divides by the standard deviation).
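
To make the three steps concrete, here is a minimal pure-Python sketch of what they do to two toy columns. It is an illustration only, not fastai's implementation: the data is made up, reserving index 0 under the name "#na#" for unknown categories is an assumption of this sketch, and so is the use of the population standard deviation.

```python
from statistics import mean, median, pstdev

# Toy columns (hypothetical data); None stands for a missing value.
workclass = ["Private", "Self-emp-inc", "Private", "Local-gov"]
education_num = [12.0, None, 15.0, 9.0]

# Categorify: map each unique category to an integer index.
# Index 0 is reserved for unknown/missing categories (assumption of this sketch).
vocab = ["#na#"] + sorted(set(workclass))
cat_to_idx = {cat: i for i, cat in enumerate(vocab)}
workclass_idx = [cat_to_idx[c] for c in workclass]

# FillMissing: replace missing values with the median of the existing ones,
# and keep a boolean column recording which rows were filled.
fill_value = median(v for v in education_num if v is not None)
education_num_na = [v is None for v in education_num]
education_num_filled = [fill_value if v is None else v for v in education_num]

# Normalize: subtract the mean and divide by the standard deviation.
mu = mean(education_num_filled)
sigma = pstdev(education_num_filled)
education_num_norm = [(v - mu) / sigma for v in education_num_filled]

print(workclass_idx)         # integer codes instead of strings
print(education_num_na)      # which rows had a missing value
print(education_num_norm)    # zero mean, unit standard deviation
```

The boolean column in this sketch plays the same role as the education-num_na column that appears in the batches shown in this tutorial.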

The show_batch method works as it does for every other application:

dls.show_batch()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary
0 ? Some-college Never-married ? Own-child White False 22.000000 32731.996436 10.0 <50k
1 Private 7th-8th Married-civ-spouse Machine-op-inspct Husband White False 44.000000 99202.998578 4.0 <50k
2 Private HS-grad Divorced Farming-fishing Not-in-family White False 63.000001 117680.996997 9.0 <50k
3 Private HS-grad Married-civ-spouse Machine-op-inspct Husband White False 33.000000 194141.000170 9.0 <50k
4 Private Assoc-voc Divorced Transport-moving Not-in-family White False 35.000000 172570.999732 11.0 <50k
5 Local-gov HS-grad Divorced Exec-managerial Unmarried Amer-Indian-Eskimo False 43.000000 196308.000036 9.0 <50k
6 Private HS-grad Never-married Exec-managerial Not-in-family White False 43.000000 336642.996235 9.0 <50k
7 Private HS-grad Never-married Other-service Not-in-family White False 27.000000 158156.001081 9.0 <50k
8 ? Bachelors Never-married ? Unmarried White False 26.000000 130832.001756 13.0 <50k
9 Private Assoc-voc Married-civ-spouse Tech-support Husband White False 27.000000 62737.003461 11.0 <50k

We can define a model using the tabular_learner function. When we define our model, fastai will try to infer the loss function from the y_names we specified earlier.

Note: Sometimes with tabular data, your y's may be integer-encoded (such as 0 and 1). In that case you should explicitly pass y_block=CategoryBlock() when creating your DataLoaders so fastai won't presume you are doing regression.

learn = tabular_learner(dls, metrics=accuracy)

And we can train that model with the fit_one_cycle method (the fine_tune method won't be useful here since we don't have a pretrained model):

learn.fit_one_cycle(1)
epoch train_loss valid_loss accuracy time
0 0.366727 0.351524 0.835842 00:05

We can then have a look at some predictions:

learn.show_results()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary salary_pred
0 5.0 12.0 3.0 15.0 1.0 5.0 1.0 -0.333356 -0.900977 -0.419934 1.0 0.0
1 7.0 12.0 5.0 6.0 5.0 5.0 1.0 0.916167 -1.457755 -0.419934 0.0 0.0
2 5.0 10.0 3.0 2.0 1.0 5.0 1.0 -0.774364 -0.030944 1.150726 0.0 0.0
3 5.0 13.0 3.0 5.0 1.0 5.0 1.0 -0.259855 -0.668491 1.543390 0.0 1.0
4 5.0 13.0 1.0 13.0 2.0 5.0 1.0 0.622161 0.409060 1.543390 1.0 0.0
5 3.0 16.0 3.0 4.0 1.0 5.0 1.0 0.254654 -0.870132 -0.027269 1.0 0.0
6 5.0 12.0 5.0 13.0 2.0 5.0 1.0 -0.259855 -0.464552 -0.419934 0.0 0.0
7 5.0 9.0 3.0 4.0 1.0 5.0 1.0 0.989668 -0.430562 0.365396 1.0 1.0
8 6.0 16.0 3.0 4.0 1.0 3.0 1.0 -0.627362 -0.110140 -0.027269 0.0 0.0
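
The last two columns of this table can be compared directly to see how the model did on these particular rows. The short sketch below does so on the (salary, salary_pred) pairs copied from the output above. It only illustrates what an accuracy metric measures. It is not how fastai computes the metric internally, and nine rows are far too few to stand in for the validation accuracy reported during training.

```python
# (salary, salary_pred) pairs copied from the show_results output above.
rows = [(1, 0), (0, 0), (0, 0), (0, 1), (1, 0), (1, 0), (0, 0), (1, 1), (0, 0)]

# Accuracy is the fraction of rows where the prediction matches the target.
correct = sum(target == pred for target, pred in rows)
batch_accuracy = correct / len(rows)

print(f"{correct}/{len(rows)} correct, accuracy = {batch_accuracy:.3f}")
# prints: 5/9 correct, accuracy = 0.556
```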

Or use the predict method on a row:

learn.predict(df.iloc[0])
(   workclass  education  marital-status  occupation  relationship  race  \
 0        5.0        8.0             3.0         0.0           6.0   5.0   
 
    education-num_na       age    fnlwgt  education-num  salary  
 0               1.0  0.769164 -0.835926       0.758061     0.0  ,
 tensor(0),
 tensor([0.5200, 0.4800]))

To get predictions on a new dataframe, you can use the test_dl method of the DataLoaders. That dataframe does not need to have the dependent variable among its columns.

test_df = df.copy()
test_df.drop(['salary'], axis=1, inplace=True)
dl = learn.dls.test_dl(test_df)

Then Learner.get_preds will give you the predictions:

learn.get_preds(dl=dl)
(tensor([[0.5200, 0.4800],
         [0.5536, 0.4464],
         [0.9767, 0.0233],
         ...,
         [0.6025, 0.3975],
         [0.7228, 0.2772],
         [0.5157, 0.4843]]), None)
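
The first element of the returned tuple holds one row of class probabilities per input row. The second is None because the test dataframe has no targets. As a rough sketch of how to turn those probabilities into labels, the snippet below takes the index of the largest probability in each row, using plain Python lists copied from the output above. The class order in vocab is an assumption made for this example; in a real session it should be read from the DataLoaders rather than hard-coded.

```python
# Probability rows copied from the output above (the elided middle rows are left out).
probs = [
    [0.5200, 0.4800],
    [0.5536, 0.4464],
    [0.9767, 0.0233],
    [0.6025, 0.3975],
    [0.7228, 0.2772],
    [0.5157, 0.4843],
]

# Assumed class order for this sketch (hypothetical, not read from the model).
vocab = ["<50k", ">=50k"]

# The predicted class is the index of the largest probability in each row.
pred_idx = [max(range(len(row)), key=row.__getitem__) for row in probs]
pred_labels = [vocab[i] for i in pred_idx]

print(pred_idx)
print(pred_labels)
```

With the actual tensor returned by get_preds, calling argmax(dim=1) on it gives the same indices.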