Helper functions to download the fastai datasets

The complete list of datasets available by default inside the library is:

Main datasets:

  • ADULT_SAMPLE: A small sample of the adults dataset to predict whether income exceeds $50K/yr based on census data.
  • BIWI_SAMPLE: A BIWI kinect headpose database. The dataset contains over 15K images of 20 people (6 females and 14 males - 4 people were recorded twice). For each frame, a depth image, the corresponding rgb image (both 640x480 pixels), and the annotation is provided. The head pose range covers about +-75 degrees yaw and +-60 degrees pitch.
  • CIFAR: The famous cifar-10 dataset which consists of 60000 32x32 colour images in 10 classes, with 6000 images per class.
  • COCO_SAMPLE: A sample of the coco dataset for object detection.
  • COCO_TINY: A tiny version of the coco dataset for object detection.
  • HUMAN_NUMBERS: A synthetic dataset consisting of human number counts in text such as one, two, three, four, and so on. Useful for experimenting with language models.
  • IMDB: The full IMDB sentiment analysis dataset.
  • IMDB_SAMPLE: A sample of the full IMDB sentiment analysis dataset.
  • ML_SAMPLE: A movielens sample dataset for recommendation engines to recommend movies to users.
  • ML_100k: The movielens 100k dataset for recommendation engines to recommend movies to users.
  • MNIST_SAMPLE: A sample of the famous MNIST dataset consisting of handwritten digits.
  • MNIST_TINY: A tiny version of the famous MNIST dataset consisting of handwritten digits.
  • MNIST_VAR_SIZE_TINY:
  • PLANET_SAMPLE: A sample of the planets dataset from the Kaggle competition Planet: Understanding the Amazon from Space.
  • PLANET_TINY: A tiny version of the planets dataset from the Kaggle competition Planet: Understanding the Amazon from Space for faster experimentation and prototyping.
  • IMAGENETTE: A smaller version of the imagenet dataset pronounced just like 'Imagenet', except with a corny inauthentic French accent.
  • IMAGENETTE_160: The 160px version of the Imagenette dataset.
  • IMAGENETTE_320: The 320px version of the Imagenette dataset.
  • IMAGEWOOF: Imagewoof is a subset of 10 classes from Imagenet that aren't so easy to classify, since they're all dog breeds.
  • IMAGEWOOF_160: 160px version of the ImageWoof dataset.
  • IMAGEWOOF_320: 320px version of the ImageWoof dataset.
  • IMAGEWANG: Imagewang contains Imagenette and Imagewoof combined, but with some twists that make it into a tricky semi-supervised unbalanced classification problem.
  • IMAGEWANG_160: 160px version of Imagewang.
  • IMAGEWANG_320: 320px version of Imagewang.

Kaggle competition datasets:

  1. DOGS: Image dataset consisting of dog and cat images from the Dogs vs Cats Kaggle competition.

Image Classification datasets:

  1. CALTECH_101: Pictures of objects belonging to 101 categories. About 40 to 800 images per category. Most categories have about 50 images. Collected in September 2003 by Fei-Fei Li, Marco Andreetto, and Marc'Aurelio Ranzato.
  2. CARS: The Cars dataset contains 16,185 images of 196 classes of cars.
  3. CIFAR_100: The CIFAR-100 dataset consists of 60000 32x32 colour images in 100 classes, with 600 images per class.
  4. CUB_200_2011: Caltech-UCSD Birds-200-2011 (CUB-200-2011) is an extended version of the CUB-200 dataset, with roughly double the number of images per class and new part location annotations.
  5. FLOWERS: A 17-category flower dataset built by gathering images from various websites.
  6. FOOD:
  7. MNIST: MNIST dataset consisting of handwritten digits.
  8. PETS: A 37 category pet dataset with roughly 200 images for each class.

NLP datasets:

  1. AG_NEWS: The AG News corpus consists of news articles from the AG’s corpus of news articles on the web pertaining to the 4 largest classes. The dataset contains 30,000 training and 1,900 testing examples for each class.
  2. AMAZON_REVIEWS: This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.
  3. AMAZON_REVIEWS_POLARITY: Amazon reviews dataset for sentiment analysis.
  4. DBPEDIA: The DBpedia ontology dataset contains 560,000 training samples and 70,000 testing samples for each of 14 nonoverlapping classes from DBpedia.
  5. MT_ENG_FRA: Machine translation dataset from English to French.
  6. SOGOU_NEWS: The Sogou-SRR (Search Result Relevance) dataset was constructed to support research on search engine relevance estimation and ranking tasks.
  7. WIKITEXT: The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.
  8. WIKITEXT_TINY: A tiny version of the WIKITEXT dataset.
  9. YAHOO_ANSWERS: YAHOO's question answers dataset.
  10. YELP_REVIEWS: The Yelp dataset is a subset of YELP businesses, reviews, and user data for use in personal, educational, and academic purposes.
  11. YELP_REVIEWS_POLARITY: For sentiment classification on YELP reviews.

Image localization datasets:

  1. BIWI_HEAD_POSE: A BIWI kinect headpose database. The dataset contains over 15K images of 20 people (6 females and 14 males - 4 people were recorded twice). For each frame, a depth image, the corresponding rgb image (both 640x480 pixels), and the annotation is provided. The head pose range covers about +-75 degrees yaw and +-60 degrees pitch.
  2. CAMVID: Consists of driving labelled dataset for segmentation type models.
  3. CAMVID_TINY: A tiny camvid dataset for segmentation type models.
  4. LSUN_BEDROOMS: Large-scale Image Dataset using Deep Learning with Humans in the Loop
  5. PASCAL_2007: Pascal 2007 dataset to recognize objects from a number of visual object classes in realistic scenes.
  6. PASCAL_2012: Pascal 2012 dataset to recognize objects from a number of visual object classes in realistic scenes.

Audio classification:

  1. MACAQUES: 7285 macaque coo calls across 8 individuals from Distributed acoustic cues for caller identity in macaque vocalization.
  2. ZEBRA_FINCH: 3405 zebra finch calls classified across 11 call types. Additional labels include the name of the individual making the vocalization and its age.

Medical Imaging datasets:

  1. SIIM_SMALL: A smaller version of the SIIM dataset where the objective is to classify pneumothorax from a set of chest radiographic images.

Pretrained models:

  1. OPENAI_TRANSFORMER: The GPT2 Transformer pretrained weights.
  2. WT103_FWD: The WikiText-103 forward language model weights.
  3. WT103_BWD: The WikiText-103 backward language model weights.

To download any of the datasets or pretrained weights, simply run untar_data by passing any dataset name mentioned above like so:

path = untar_data(URLs.PETS)
path.ls()
> > (#7393) [Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/keeshond_34.jpg'),Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Siamese_178.jpg'),Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/german_shorthaired_94.jpg'),Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Abyssinian_92.jpg'),Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/basset_hound_111.jpg'),Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Russian_Blue_194.jpg'),Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/staffordshire_bull_terrier_91.jpg'),Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Persian_69.jpg'),Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/english_setter_33.jpg'),Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Russian_Blue_155.jpg')...]

To download model pretrained weights:

```python
path = untar_data(URLs.WT103_BWD)
path.ls()

(#2) [Path('/home/ubuntu/.fastai/data/wt103-bwd/itos_wt103.pkl'),Path('/home/ubuntu/.fastai/data/wt103-bwd/lstm_bwd.pth')]
```

class Config[source]

Config()

Setup config at ~/.fastai unless it exists already.

If a config file doesn't exist already, it is always created at ~/.fastai/config.yml location by default whenever an instance of the Config class is created. Here is a quick example to explain:

config_file = Path("~/.fastai/config.yml").expanduser()
if config_file.exists(): os.remove(config_file)
assert not config_file.exists()

config = Config()
assert config_file.exists()

The config is now available as config.d:

config.d
{'archive_path': '/home/sgugger/.fastai/archive',
 'data_path': '/home/sgugger/.fastai/data',
 'model_path': '/home/sgugger/.fastai/models',
 'storage_path': '/home/sgugger/.fastai/data',
 'version': 2}

As can be seen, this is a basic config file that consists of data_path, model_path, storage_path and archive_path. All future downloads go to the paths defined in the config file, based on the type of download. For example, all future fastai datasets are downloaded to the data_path while all pretrained model weights are downloaded to model_path, unless the default download location is updated.
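The create-if-missing behaviour and the per-type download folders described above can be sketched with the standard library alone. This is a minimal illustration, not fastai's implementation: the function names `load_or_create_config` and `download_dir` are hypothetical, and the sketch stores JSON instead of YAML to stay dependency-free.

```python
import json
import tempfile
from pathlib import Path

def load_or_create_config(root: Path) -> dict:
    """Return the config stored under `root`, writing defaults if none exists."""
    cfg_file = root / "config.json"  # JSON stand-in for config.yml (assumption)
    if cfg_file.exists():
        return json.loads(cfg_file.read_text())
    root.mkdir(parents=True, exist_ok=True)
    # Default keys modelled on the dictionary shown in the text above.
    cfg = {
        "archive_path": str(root / "archive"),
        "data_path": str(root / "data"),
        "model_path": str(root / "models"),
        "storage_path": str(root / "data"),
        "version": 2,
    }
    cfg_file.write_text(json.dumps(cfg, indent=2))
    return cfg

def download_dir(cfg: dict, kind: str) -> Path:
    """Map a download type such as 'data' or 'model' to its configured folder."""
    return Path(cfg[f"{kind}_path"])

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "fastai_demo"
    first = load_or_create_config(root)   # no file yet, so defaults are written
    second = load_or_create_config(root)  # file exists, so it is read back
    assert first == second
    print(download_dir(first, "model").name)  # models
```

The second call finds the file written by the first one, which mirrors how creating a Config instance leaves an existing config untouched.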

Note that the default path locations in the config file can be updated. Let's first back up the config file, then change archive_path to show the effect, and finally restore the original config from the backup.

config_bak = Path("~/.fastai/config.yml.bak").expanduser()  # backup location (assumed; any free path works)
if config_file.exists(): shutil.move(config_file, config_bak)
config['archive_path'] = Path(".")
config.save()
config = Config()
config.d
{'archive_path': '.',
 'data_archive_path': '/home/sgugger/.fastai/data',
 'data_path': '/home/sgugger/.fastai/data',
 'model_path': '/home/sgugger/.fastai/models',
 'storage_path': '/home/sgugger/.fastai/data',
 'version': 2}

The archive_path has been updated to ".". Now let's undo the changes we made to the config file for the purpose of this example.

if config_bak.exists(): shutil.move(config_bak, config_file)
config = Config()
config.d
{'archive_path': '/home/sgugger/.fastai/archive',
 'data_archive_path': '/home/sgugger/.fastai/data',
 'data_path': '/home/sgugger/.fastai/data',
 'model_path': '/home/sgugger/.fastai/models',
 'storage_path': '/home/sgugger/.fastai/data',
 'version': 2}
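The backup, modify, restore sequence above can also be wrapped in a context manager so the original file comes back even if the code in between fails. The sketch below is a generic standard-library pattern, not part of fastai; the name `preserved` and the `.bak` suffix are assumptions made for illustration.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def preserved(file: Path):
    """Back up `file` on entry and restore it on exit, even if the body raises."""
    backup = file.with_name(file.name + ".bak")  # hypothetical backup name
    existed = file.exists()
    if existed:
        shutil.copy2(file, backup)
    try:
        yield file
    finally:
        if existed:
            backup.replace(file)  # overwrite the modified file with the backup
        elif file.exists():
            file.unlink()         # the file did not exist before, so remove it

with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "config.yml"
    cfg.write_text("archive_path: /original\n")
    with preserved(cfg):
        cfg.write_text("archive_path: .\n")  # temporary change
    print(cfg.read_text().strip())  # archive_path: /original
```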

class URLs[source]

URLs()

Global constants for dataset and model URLs.

The default local path is at ~/.fastai/archive/ but this can be updated by passing a different c_key. Note: c_key should be one of 'archive_path', 'data_archive_path', 'data_path', 'model_path', 'storage_path'.

url = URLs.PETS
local_path = URLs.path(url)
test_eq(local_path.parent, Config()['archive'])
local_path
Path('/home/sgugger/.fastai/archive/oxford-iiit-pet.tgz')
local_path = URLs.path(url, c_key='model')
test_eq(local_path.parent, Config()['model'])
local_path
Path('/home/sgugger/.fastai/models/oxford-iiit-pet.tgz')
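The mapping from a URL and a c_key to a local file can be pictured as "configured folder plus the last segment of the URL". The following is a rough sketch under that assumption; `FOLDERS` and `local_path_for` are hypothetical names, the folder values are placeholders, and query strings are not handled.

```python
from pathlib import PurePosixPath

# Hypothetical folder table standing in for the config dictionary.
FOLDERS = {
    "archive": PurePosixPath("/home/user/.fastai/archive"),
    "data": PurePosixPath("/home/user/.fastai/data"),
    "model": PurePosixPath("/home/user/.fastai/models"),
}

def local_path_for(url: str, c_key: str = "archive") -> PurePosixPath:
    """Join the folder selected by `c_key` with the last segment of `url`."""
    fname = url.rstrip("/").split("/")[-1]
    return FOLDERS[c_key] / fname

url = "http://files.fast.ai/data/examples/mnist_sample.tgz"
print(local_path_for(url))                 # /home/user/.fastai/archive/mnist_sample.tgz
print(local_path_for(url, c_key="model"))  # /home/user/.fastai/models/mnist_sample.tgz
```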

Downloading

download_url[source]

download_url(url, dest, overwrite=False, pbar=None, show_progress=True, chunk_size=1048576, timeout=4, retries=5)

Download url to dest unless it exists and not overwrite

download_url is a very handy function inside fastai. It can download any file from the internet to the location passed in the dest argument. It is a common misconception that this function only works for fastai files; that couldn't be further from the truth. As an example, let's download a picture of a dog from an external website:

fname = Path("./dog.jpg")
if fname.exists(): os.remove(fname)
url = "https://i.insider.com/569fdd9ac08a80bd448b7138?width=1100&format=jpeg&auto=webp"
download_url(url, fname)
assert fname.exists()

Let's confirm that the file was indeed downloaded correctly.

from PIL import Image
import matplotlib.pyplot as plt

im = Image.open(fname)
plt.imshow(im);

As can be seen, the file has been downloaded to the local path provided in dest argument. Calling the function again doesn't trigger a download since the file is already there. This can be confirmed by checking that the last modified time of the file that is downloaded doesn't get updated.

if fname.exists(): last_modified_time = os.path.getmtime(fname)
download_url(url, fname)
test_eq(os.path.getmtime(fname), last_modified_time)
if fname.exists(): os.remove(fname)

We can also use the download_url function to download the pets dataset straight from the source by simply passing https://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz as url.
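The skip-if-present behaviour can be reproduced with the standard library. This is only a sketch of the idea, not fastai's code: `fetch` is a hypothetical name, and retries and progress bars are left out. A `file://` URL is used so the example runs without network access.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path

def fetch(url: str, dest: Path, overwrite: bool = False, timeout: float = 4) -> bool:
    """Copy `url` to `dest`; return True only if a transfer actually happened."""
    dest = Path(dest)
    if dest.exists() and not overwrite:
        return False  # file already present, nothing to do
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out, length=1024 * 1024)  # stream in 1 MiB chunks
    return True

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "source.bin"
    src.write_bytes(b"example payload")
    dest = Path(tmp) / "copy.bin"
    print(fetch(src.as_uri(), dest))                  # True: first download
    print(fetch(src.as_uri(), dest))                  # False: file already there
    print(fetch(src.as_uri(), dest, overwrite=True))  # True: forced re-download
```

Returning a flag instead of checking modification times makes the "no second download" case easy to assert in tests.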

download_data[source]

download_data(url, fname=None, c_key='archive', force_download=False)

Download url to fname.

download_data is a convenience function, a wrapper around download_url, that downloads fastai files to the appropriate local path based on the c_key.

If fname is None, it will default to the archive folder you have in your config file (or data, model if you specify a different c_key) followed by the last part of the url: for instance URLs.MNIST_SAMPLE is http://files.fast.ai/data/examples/mnist_sample.tgz and the default value for fname will be ~/.fastai/archive/mnist_sample.tgz.

If force_download=True, the file is always downloaded. Otherwise, the download is triggered only when the file doesn't exist.

_get_check(URLs.PASCAL_2007),_get_check(URLs.PASCAL_2012)
([1637796771, '433b4706eb7c42bd74e7f784e3fdf244'],
 [2618908000, 'd90e29e54a4c76c0c6fba8355dcbaca5'])

Extract

file_extract[source]

file_extract(fname, dest=None)

Extract fname to dest using tarfile or zipfile.

file_extract is used by default in untar_data to decompress the downloaded file.
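Choosing between tarfile and zipfile can be done by inspecting the archive itself. The sketch below shows one possible way to do it with the standard library; `extract` is a hypothetical name, the dispatch rule is an assumption rather than fastai's actual logic, and archives should only be extracted when they come from a trusted source.

```python
import tarfile
import tempfile
import zipfile
from pathlib import Path

def extract(fname: Path, dest: Path = None) -> None:
    """Extract a tar (optionally compressed) or zip archive into `dest`."""
    fname = Path(fname)
    dest = Path(dest) if dest is not None else fname.parent
    if tarfile.is_tarfile(fname):
        with tarfile.open(fname) as tf:  # compression is detected automatically
            tf.extractall(dest)
    elif zipfile.is_zipfile(fname):
        with zipfile.ZipFile(fname) as zf:
            zf.extractall(dest)
    else:
        raise ValueError(f"Unsupported archive: {fname}")

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "hello.txt").write_text("hi")
    with tarfile.open(tmp / "a.tgz", "w:gz") as tf:
        tf.add(tmp / "hello.txt", arcname="from_tar/hello.txt")
    with zipfile.ZipFile(tmp / "b.zip", "w") as zf:
        zf.write(tmp / "hello.txt", arcname="from_zip/hello.txt")
    extract(tmp / "a.tgz", tmp / "out")
    extract(tmp / "b.zip", tmp / "out")
    print(sorted(p.name for p in (tmp / "out").iterdir()))  # ['from_tar', 'from_zip']
```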

newest_folder[source]

newest_folder(path)

Return newest folder on path

rename_extracted[source]

rename_extracted(dest)

Rename file if different from dest

rename_extracted renames the extracted (untarred or unzipped) folder when its name differs from dest.
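The two helpers above fit together naturally: find the most recently modified folder, then rename it if it is not already called dest. The following sketch is written from the docstrings alone, so the details are assumptions; `newest_dir` and `rename_to` are hypothetical stand-ins, not the library functions.

```python
import os
import tempfile
from pathlib import Path

def newest_dir(path: Path) -> Path:
    """Return the most recently modified directory directly under `path`."""
    dirs = [p for p in Path(path).iterdir() if p.is_dir()]
    return max(dirs, key=lambda p: p.stat().st_mtime)

def rename_to(dest: Path) -> Path:
    """Rename the newest folder next to `dest` to `dest` if its name differs."""
    dest = Path(dest)
    extracted = newest_dir(dest.parent)
    if extracted != dest:
        extracted.rename(dest)
    return dest

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "older").mkdir()
    os.utime(tmp / "older", (1, 1))  # force an old modification time
    (tmp / "mnist_tiny").mkdir()     # plays the role of the freshly extracted folder
    rename_to(tmp / "nims_tini")
    print(sorted(p.name for p in tmp.iterdir()))  # ['nims_tini', 'older']
```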

untar_data[source]

untar_data(url, fname=None, dest=None, c_key='data', force_download=False, extract_func='file_extract')

Download url to fname if dest doesn't exist, and un-tgz or unzip to folder dest.

untar_data is a very powerful convenience function to download files from url to dest. The url can be a default url from the URLs class or a custom url. If dest is not passed, files are downloaded at the default_dest which defaults to ~/.fastai/data/.

This convenience function extracts the downloaded files to dest by default. To simply download the files without extracting them, pass the noop function as extract_func.

Note that it is also possible to pass a custom extract_func to untar_data if the file name doesn't end with .tgz or .zip. Gzipped tar and zip files are supported by default, so there is no need to pass a custom extract_func for these types of files.

Internally, if the archive is not already available at the fname location, which defaults to ~/.fastai/archive/, it is downloaded there and then extracted to the dest location. If no dest is passed, the default_dest for the extracted files is ~/.fastai/data. If files are already available at the fname location but not at dest, a symbolic link is created for each file from the fname location to dest.

Also, if force_download is set to True, files are re-downloaded even if they already exist.
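The overall flow (download the archive unless it is present, extract it unless the destination is present, optionally force a fresh download, optionally skip extraction) can be sketched as follows. This is a simplified standard-library illustration, not fastai's implementation: `fetch_and_extract`, `untar` and this `noop` are hypothetical, the symbolic-link step is omitted, and a `file://` URL stands in for a real download.

```python
import shutil
import tarfile
import tempfile
import urllib.request
from pathlib import Path

def noop(*args, **kwargs):
    """Extraction stand-in that does nothing, for download-only use."""

def untar(fname, dest):
    with tarfile.open(fname) as tf:
        tf.extractall(dest)

def fetch_and_extract(url, archive_dir, data_dir, force_download=False, extract_func=untar):
    archive_dir, data_dir = Path(archive_dir), Path(data_dir)
    fname = archive_dir / url.rstrip("/").split("/")[-1]  # where the archive is stored
    dest = data_dir / fname.name.split(".")[0]            # where the contents end up
    if force_download:  # discard cached copies so everything is redone
        if fname.exists():
            fname.unlink()
        if dest.exists():
            shutil.rmtree(dest)
    if not dest.exists():
        if not fname.exists():
            archive_dir.mkdir(parents=True, exist_ok=True)
            with urllib.request.urlopen(url) as resp, open(fname, "wb") as out:
                shutil.copyfileobj(resp, out)
        extract_func(fname, data_dir)
    return dest

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "src" / "sample").mkdir(parents=True)
    (tmp / "src" / "sample" / "a.txt").write_text("x")
    with tarfile.open(tmp / "sample.tgz", "w:gz") as tf:
        tf.add(tmp / "src" / "sample", arcname="sample")
    url = (tmp / "sample.tgz").as_uri()
    dest = fetch_and_extract(url, tmp / "archive", tmp / "data")
    print(dest.name, (dest / "a.txt").exists())  # sample True
```

Passing `extract_func=noop` leaves only the archive on disk, which corresponds to the download-without-extracting case described above.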

test_eq(untar_data(URLs.MNIST_SAMPLE), config.data/'mnist_sample')

#Test specific fname
untar_data(URLs.MNIST_TINY, fname='mnist_tiny.tgz', force_download=True)
p = Path('mnist_tiny.tgz')
assert p.exists()
p.unlink()
    
#Test specific dest
test_eq(untar_data(URLs.MNIST_TINY, dest='.'), Path('mnist_tiny'))
assert Path('mnist_tiny').exists()
shutil.rmtree(Path('mnist_tiny'))

#Test c_key
tst_model = config.model/'mnist_sample'
test_eq(untar_data(URLs.MNIST_SAMPLE, c_key='model'), tst_model)
assert not tst_model.with_suffix('.tgz').exists() #Archive wasn't downloaded in the models path
assert (config.archive/'mnist_sample.tgz').exists() #Archive was downloaded there
shutil.rmtree(tst_model)

Sometimes the extracted folder does not have the same name as the downloaded file.

untar_data(URLs.MNIST_TINY, fname='mnist_tiny.tgz', force_download=True)
Path('mnist_tiny.tgz').rename('nims_tini.tgz')
p = Path('nims_tini.tgz')
dest = Path('nims_tini')
assert p.exists()
file_extract(p, dest.parent)
rename_extracted(dest)
p.unlink()
shutil.rmtree(dest)