Using the data block across all applications

In this tutorial, we'll see how to use the data block API on a variety of tasks and how to debug data blocks. The data block API takes its name from the way it's designed: every bit needed to build the DataLoaders object (type of inputs, targets, how to label, split...) is encapsulated in a block, and you can mix and match those blocks.

Building a DataBlock from scratch

The rest of this tutorial will give many examples, but let's first build a DataBlock from scratch on the dogs versus cats problem we saw in the vision tutorial. First we import everything needed in vision.

from fastai2.vision.all import *

The first step is to download and decompress our data (if it's not already done) and get its location:

path = untar_data(URLs.PETS)
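
If you are curious what the archive contains, you can list it; fastai patches pathlib's Path with an ls method:

path.ls()   # shows the 'annotations' and 'images' folders of this dataset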

And as we saw, all the filenames are in the "images" folder. The get_image_files function helps get all the images in subfolders:

fnames = get_image_files(path/"images")

Let's begin with an empty DataBlock.

dblock = DataBlock()

By itself, a DataBlock is just a blueprint for how to assemble your data. It does not do anything until you pass it a source. You can then choose to convert that source into a Datasets or a DataLoaders by using the DataBlock.datasets or DataBlock.dataloaders method. Since we haven't done anything to get our data ready for batches, the dataloaders method will fail here, but we can have a look at how things get converted into a Datasets. This is where we pass the source of our data, here all our filenames:

dsets = dblock.datasets(fnames)
dsets.train[0]
(Path('/home/sgugger/.fastai/data/oxford-iiit-pet/images/staffordshire_bull_terrier_26.jpg'),
 Path('/home/sgugger/.fastai/data/oxford-iiit-pet/images/staffordshire_bull_terrier_26.jpg'))

By default, the data block API assumes we have an input and a target, which is why we see our filename repeated twice.
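
The validation set is held in the same Datasets object and is accessed the same way:

dsets.valid[0]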

The first thing we can do is use a get_items function to actually assemble our items inside the data block:

dblock = DataBlock(get_items = get_image_files)

The difference is that you then pass the folder containing the images as the source, not the list of filenames:

dsets = dblock.datasets(path/"images")
dsets.train[0]
(Path('/home/sgugger/.fastai/data/oxford-iiit-pet/images/Russian_Blue_99.jpg'),
 Path('/home/sgugger/.fastai/data/oxford-iiit-pet/images/Russian_Blue_99.jpg'))

Our inputs are ready to be processed as images (since images can be built from filenames), but our target is not. Since this is a cats versus dogs problem, we need to convert each filename into "cat" or "dog" (or True vs False). Let's build a function for this:

def label_func(fname):
    return "cat" if fname.name[0].isupper() else "dog"

We can then tell our data block to use it to label our target by passing it as get_y:

dblock = DataBlock(get_items = get_image_files,
                   get_y     = label_func)

dsets = dblock.datasets(path/"images")
dsets.train[0]
(Path('/home/sgugger/.fastai/data/oxford-iiit-pet/images/shiba_inu_88.jpg'),
 'dog')

Now that our inputs and targets are ready, we can specify types to tell the data block API that our inputs are images and our targets are categories. Types are represented by blocks in the data block API; here we use ImageBlock and CategoryBlock:

dblock = DataBlock(blocks    = (ImageBlock, CategoryBlock),
                   get_items = get_image_files,
                   get_y     = label_func)

dsets = dblock.datasets(path/"images")
dsets.train[0]
(PILImage mode=RGB size=500x400, TensorCategory(1))

We can see how the DataBlock automatically added the transforms necessary to open the image, and how it changed the category name into an index (with a special tensor type). To do this, it created a mapping from categories to indices called the "vocab", which we can access this way:

dsets.vocab
(#2) ['cat','dog']
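
The mapping also works in reverse: assuming the vocab is fastai's CategoryMap, its o2i attribute goes from category name to index:

dsets.vocab.o2i['dog']   # -> 1, the index we saw in TensorCategory(1)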

Note that you can mix and match any blocks for inputs and targets, which is why the API is named the data block API. You can also have more than two blocks (if you have multiple inputs and/or targets); you would just need to pass n_inp to the DataBlock to tell the library how many inputs there are (the rest will be targets), and pass a list of functions to get_x and/or get_y (to explain how to process each item so it's ready for its type). See the object detection example below for such a case.
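
As a hedged sketch (not part of this tutorial's dataset), a block with two image inputs and one category target could look like this; the lambdas are placeholder accessors for whatever structure your items have:

dblock = DataBlock(blocks = (ImageBlock, ImageBlock, CategoryBlock),
                   n_inp  = 2,                    # the first two blocks are inputs
                   get_x  = [lambda o: o[0], lambda o: o[1]],
                   get_y  = lambda o: o[2])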

The next step is to control how our validation set is created. We do this by passing a splitter to the DataBlock. For instance, here is how to do a random split:

dblock = DataBlock(blocks    = (ImageBlock, CategoryBlock),
                   get_items = get_image_files,
                   get_y     = label_func,
                   splitter  = RandomSplitter())

dsets = dblock.datasets(path/"images")
dsets.train[0]
(PILImage mode=RGB size=500x375, TensorCategory(1))
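
If you want the split to be reproducible across runs, RandomSplitter accepts valid_pct and seed arguments:

splitter = RandomSplitter(valid_pct=0.2, seed=42)   # hold out 20% of the data with a fixed seed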

The last step is to specify item transforms and batch transforms (the same way we do it in ImageDataLoaders factory methods):

dblock = DataBlock(blocks    = (ImageBlock, CategoryBlock),
                   get_items = get_image_files,
                   get_y     = label_func,
                   splitter  = RandomSplitter(),
                   item_tfms = Resize(224))

With that resize, we can now batch items together and finally call dataloaders to convert our DataBlock into a DataLoaders object:

dls = dblock.dataloaders(path/"images")
dls.show_batch()
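
To double-check the result, you can grab one batch and look at the shapes; with the resize above and the default batch size of 64, you would expect something like this:

xb, yb = dls.one_batch()
xb.shape, yb.shape   # expected: (torch.Size([64, 3, 224, 224]), torch.Size([64]))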

The way we usually build the data block in one go is by answering a list of questions:

  • what are the types of your inputs/targets? Here images and categories
  • where is your data? Here in filenames in subfolders
  • does something need to be applied to the inputs? Here no
  • does something need to be applied to the target? Here the label_func function
  • how do we split the data? Here randomly
  • do we need to apply something to the formed items? Here a resize
  • do we need to apply something to the formed batches? Here no

This gives us this design:

dblock = DataBlock(blocks    = (ImageBlock, CategoryBlock),
                   get_items = get_image_files,
                   get_y     = label_func,
                   splitter  = RandomSplitter(),
                   item_tfms = Resize(224))

For the two questions that got a no, the corresponding arguments we would pass if the answer had been different are get_x and batch_tfms.
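
As a hedged sketch of what that would look like (the get_x here is just an identity placeholder, and aug_transforms provides fastai's default augmentations):

dblock = DataBlock(blocks     = (ImageBlock, CategoryBlock),
                   get_items  = get_image_files,
                   get_x      = lambda o: o,          # placeholder: preprocess the raw item here if needed
                   get_y      = label_func,
                   splitter   = RandomSplitter(),
                   item_tfms  = Resize(224),
                   batch_tfms = aug_transforms())     # applied on whole batches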

Image classification

Let's begin with examples of image classification problems. There are two kinds of image classification problems: single-label (each image has exactly one label) and multi-label (each image can have multiple labels, or none at all). We will cover both kinds here.

from fastai2.vision.all import *

MNIST (single label)

MNIST is a dataset of hand-written digits from 0 to 9. We can very easily load it in the data block API by answering the following questions:

  • what are the types of our inputs and targets? Black and white images and labels.
  • where is the data? In subfolders.
  • how do we know if a sample is in the training or the validation set? By looking at the grandparent folder.
  • how do we know the label of an image? By looking at the parent folder.

In terms of the API, those answers translate like this:

mnist = DataBlock(blocks=(ImageBlock(cls=PILImageBW), CategoryBlock), 
                  get_items=get_image_files, 
                  splitter=GrandparentSplitter(),
                  get_y=parent_label)

Our types become blocks: one for images (using the black and white PILImageBW class) and one for categories. Searching subfolders for all image filenames is done by the get_image_files function. The training/validation split is done with a GrandparentSplitter. And the function to get our targets (often called y) is parent_label.

To get an idea of the objects the fastai library provides for reading, labelling or splitting, check the data.transforms module.
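
For instance, GrandparentSplitter lets you name the two folders explicitly; the values below are its defaults:

splitter = GrandparentSplitter(train_name='train', valid_name='valid')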

In itself, a data block is just a blueprint. It does not do anything and does not check for errors. You have to feed it the source of the data to actually gather something. This is done with the .dataloaders method:

dls = mnist.dataloaders(untar_data(URLs.MNIST_TINY))
dls.show_batch(max_n=9, figsize=(4,4))
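
You can inspect the validation set the same way, since dls.valid is a regular DataLoader:

dls.valid.show_batch(max_n=9, figsize=(4,4))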

If something went wrong in the previous step, or if you're just curious about what happened under the hood, use the summary method. It verbosely goes through every step, so you can see at which point the process failed.

mnist.summary(untar_data(URLs.MNIST_TINY))
Setting-up type transforms pipelines
Collecting items from /home/sgugger/.fastai/data/mnist_tiny
Found 1428 items
2 datasets of sizes 709,699
Setting up Pipeline: PILBase.create
Setting up Pipeline: parent_label -> Categorize

Building one sample
  Pipeline: PILBase.create
    starting from
      /home/sgugger/.fastai/data/mnist_tiny/train/3/7861.png
    applying PILBase.create gives
      PILImageBW mode=L size=28x28
  Pipeline: parent_label -> Categorize
    starting from
      /home/sgugger/.fastai/data/mnist_tiny/train/3/7861.png
    applying parent_label gives
      3
    applying Categorize gives
      TensorCategory(0)

Final sample: (PILImageBW mode=L size=28x28, TensorCategory(0))


Setting up after_item: Pipeline: ToTensor
Setting up before_batch: Pipeline: 
Setting up after_batch: Pipeline: IntToFloatTensor

Building one batch
Applying item_tfms to the first sample:
  Pipeline: ToTensor
    starting from
      (PILImageBW mode=L size=28x28, TensorCategory(0))
    applying ToTensor gives
      (TensorImageBW of size 1x28x28, TensorCategory(0))

Adding the next 3 samples

No before_batch transform to apply

Collating items in a batch

Applying batch_tfms to the batch built
  Pipeline: IntToFloatTensor
    starting from
      (TensorImageBW of size 4x1x28x28, TensorCategory([0, 0, 0, 0], device='cuda:0'))
    applying IntToFloatTensor gives
      (TensorImageBW of size 4x1x28x28, TensorCategory([0, 0, 0, 0], device='cuda:0'))

Let's go over another example!

Pets (single label)

The Oxford IIIT Pets dataset is a dataset of pictures of dogs and cats, with 37 different breeds. A slight but very important difference from MNIST is that the images are not all the same size: in MNIST they were all 28 by 28 pixels, but here they have different aspect ratios and dimensions. Therefore, we will need to add something to make them all the same size so that we can assemble them in a batch. We will also see how to add data augmentation.

So let's go over the same questions as before and add two more:

  • what are the types of our inputs and targets? Images and labels.
  • where is the data? In subfolders.
  • how do we know if a sample is in the training or the validation set? We'll take a random split.
  • how do we know the label of an image? By looking at the parent folder.
  • do we want to apply a function to a given sample? Yes, we need to resize everything to a given size.
  • do we want to apply a function to a batch after it's created? Yes, we want data augmentation.

pets = DataBlock(blocks=(ImageBlock, CategoryBlock), 
                 get_items=get_image_files, 
                 splitter=RandomSplitter(),
                 get_y=Pipeline([attrgetter("name"), RegexLabeller(pat = r'^(.*)_\d+.jpg$')]),
                 item_tfms=Resize(128),
                 batch_tfms=aug_transforms())

And like for MNIST, we can see how the answers to those questions directly translate into the API. Our types become blocks: one for images and one for categories. Searching subfolders for all image filenames is done by the get_image_files function. The training/validation split is done with a RandomSplitter. The function to get our targets (often called y) is a composition of two transforms: we get the name attribute of our Path filenames, then apply a regular expression to extract the class. To compose those two transforms into one, we use a Pipeline.
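
To convince yourself the labelling pipeline does what we want, you can run it on a hypothetical filename (attrgetter is available after the star import; it is re-exported from Python's operator module):

lbl = Pipeline([attrgetter("name"), RegexLabeller(pat = r'^(.*)_\d+.jpg$')])
lbl(Path("images/great_pyrenees_123.jpg"))   # -> 'great_pyrenees' (hypothetical path)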

Finally, we apply a resize at the item level and aug_transforms() at the batch level.

dls = pets.dataloaders(untar_data(URLs.PETS)/"images")
dls.show_batch(max_n=9)

Now let's see how we can use the same API for a multi-label problem.

Pascal (multi-label)

The Pascal dataset is originally an object detection dataset (we have to predict where some objects are in pictures). But it contains lots of pictures with various objects in them, so it makes a great example for a multi-label problem. Let's download it and have a look at the data:

pascal_source = untar_data(URLs.PASCAL_2007)
df = pd.read_csv(pascal_source/"train.csv")
df.head()
fname labels is_valid
0 000005.jpg chair True
1 000007.jpg car True
2 000009.jpg horse person True
3 000012.jpg car False
4 000016.jpg bicycle True

So it looks like we have one column with the filenames, one column with the labels (space-separated) and one column that tells us whether the filename should go in the validation set or not.
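
Since the labels are space-separated, splitting one of the entries shows the list of labels for that image:

df['labels'][2].split(' ')   # -> ['horse', 'person'], from the row shown above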

There are multiple ways to put this in a DataBlock; let's go over them. But first, let's answer our usual questionnaire:

  • what are the types of our inputs and targets? Images and multiple labels.
  • where is the data? In a dataframe.
  • how do we know if a sample is in the training or the validation set? A column of our dataframe.
  • how do we get an image? By looking at the column fname.
  • how do we know the label of an image? By looking at the column labels.
  • do we want to apply a function to a given sample? Yes, we need to resize everything to a given size.
  • do we want to apply a function to a batch after it's created? Yes, we want data augmentation.

Notice how there is one more question compared to before: we won't need to use a get_items function here because we already have all our data in one place. But we will need to do something to the raw dataframe to get our inputs: read the first column and add the proper folder before the filename. This is what we pass as get_x.

pascal = DataBlock(blocks=(