Callbacks and helper functions to train in parallel or use distributed training

Parallel

Patch the parallel models so they work with RNNs

DataParallel.reset[source]

DataParallel.reset()
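
This patch simply forwards reset to the wrapped module, so an RNN's reset (which clears its hidden state) keeps working once the model is wrapped. Below is a minimal sketch of such a patch, assuming fastcore's patch decorator; the same idea applies to DistributedDataParallel in the Distributed section below.

from fastcore.all import patch
from torch.nn.parallel import DataParallel

@patch
def reset(self: DataParallel):
    # Forward `reset` to the wrapped module so RNN hidden state can still
    # be cleared when the model is wrapped for multi-GPU training.
    if hasattr(self.module, 'reset'): self.module.reset()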

class ParallelTrainer[source]

ParallelTrainer(device_ids) :: Callback

Basic class handling tweaks of the training loop by changing a Learner in various events

Learner.to_parallel[source]

Learner.to_parallel(device_ids=None)

Add ParallelTrainer callback to a Learner.

Learner.detach_parallel[source]

Learner.detach_parallel()

Remove ParallelTrainer callback from Learner.

Learner.parallel_ctx[source]

Learner.parallel_ctx(device_ids=None)

A context manager to adapt a learner to train in data parallel mode.
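
For instance, assuming an existing Learner called learn and at least two visible GPUs, a usage sketch looks like:

# Train in data parallel mode on GPUs 0 and 1 only for the duration of the context.
with learn.parallel_ctx(device_ids=[0, 1]):
    learn.fit_one_cycle(1)

# Or attach and detach the callback explicitly.
learn.to_parallel(device_ids=[0, 1])
learn.fit_one_cycle(1)
learn.detach_parallel()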

Distributed

Patch the parallel models so they work with RNNs

DistributedDataParallel.reset[source]

DistributedDataParallel.reset()

Convenience functions to set up/tear down torch distributed data parallel mode.

setup_distrib[source]

setup_distrib(gpu=None)

teardown_distrib[source]

teardown_distrib()
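
For example, a per-process training script might look like the sketch below. It assumes the script is started by fastai2.launch (one process per GPU, with the torch.distributed environment variables already set); the RANK-based GPU choice and the make_learner helper are illustrative assumptions, not part of the library.

import os
# assumes the fastai2 distributed helpers are imported, e.g. from fastai2.distributed import *
gpu = int(os.environ.get('RANK', 0))   # assumption: one process per GPU, indexed by rank
setup_distrib(gpu)                     # pick the CUDA device and init the process group
try:
    learn = make_learner()             # hypothetical helper that builds your Learner
    learn.to_distributed(gpu)
    learn.fit(1)
finally:
    teardown_distrib()                 # tear down any locally created process group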

DataLoader

We need to change the dataloaders so that each process only gets one part of each batch (otherwise there is no point in using distributed training).

class DistributedDL[source]

DistributedDL(dataset, rank, world_size, bs=64, shuffle=False, num_workers=None, verbose=False, do_setup=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, wif=None, before_iter=None, after_item=None, before_batch=None, after_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None) :: TfmdDL

Transformed DataLoader

# Each of the 4 ranks sees every 4th item; the 50 items are padded to 52 so all
# ranks get the same number (13) of items, hence the wrap-around modulo 50.
dl = TfmdDL(list(range(50)), bs=16, num_workers=2)
for i in range(4):
    dl1 = DistributedDL.from_dl(dl, i, 4)
    test_eq(list(dl1)[0], torch.arange(i, 52, 4)%50)

dl = TfmdDL(list(range(50)), bs=16, num_workers=2, shuffle=True)
res = []
for i in range(4):
    dl1 = DistributedDL.from_dl(dl, i, 4)
    dl1.set_epoch(0)
    res += list(dl1)[0].tolist()
# With the seeded shuffle, every item should be accessed exactly once
# (except 0 and 1, which pad out the final cycle).
test_eq(sorted(res), [0,0,1,1] + list(range(2, 50)))

class DistributedTrainer[source]

DistributedTrainer(cuda_id=0) :: Callback

Basic class handling tweaks of the training loop by changing a Learner in various events
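
Schematically, the callback's job is to wrap the model in DistributedDataParallel and to swap each dataloader for a DistributedDL while fitting, then restore both afterwards. The sketch below only illustrates that idea and is not the library's implementation; the event names (begin_fit/after_fit) and the dls attribute follow this version's conventions and may differ in later releases.

import torch
from torch.nn.parallel import DistributedDataParallel
# assumes the usual fastai2 imports, e.g. from fastai2.basics import *; from fastai2.distributed import *

class DistributedTrainerSketch(Callback):
    def __init__(self, cuda_id=0): self.cuda_id = cuda_id

    def begin_fit(self):
        # Wrap the model so gradients are synchronized across processes.
        self.learn.model = DistributedDataParallel(
            self.learn.model, device_ids=[self.cuda_id], output_device=self.cuda_id)
        # Give each process its own shard of every batch.
        self.old_dls = list(self.learn.dls.loaders)
        self.learn.dls.loaders = [DistributedDL.from_dl(dl, rank_distrib(), num_distrib())
                                  for dl in self.old_dls]

    def after_fit(self):
        # Restore the unwrapped model and the original dataloaders.
        self.learn.model = self.learn.model.module
        self.learn.dls.loaders = self.old_dls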

Attach or remove a callback which adapts the model to use DistributedDL to train in distributed data parallel mode.

Learner.to_distributed[source]

Learner.to_distributed(cuda_id)

Learner.detach_distributed[source]

Learner.detach_distributed()

Learner.distrib_ctx[source]

Learner.distrib_ctx(cuda_id=None)

A context manager to adapt a learner to train in distributed data parallel mode.

distrib_ctx(cuda_id) prepares a single learner in a single process to participate in distributed training among a group of processes, using the GPU cuda_id. It assumes the environment variables in this process have all been set up properly (see the example setup in fastai2.launch).

To illustrate what distrib_ctx() does, consider the usage:

with learn.distrib_ctx():
    learn.fit(.....)

Upon entering the context (see the sketch after this list):

  • If cuda_id is not provided, as in the example above, it defaults to the value returned by rank_distrib().
  • If there isn't already a PyTorch distributed process group (aka 'dpg'), it creates one.

    Note 1. A dpg is merely a data abstraction, not a real pool of operating system processes -- those should have been created by now. Note 2. The initialization of the dpg waits until all processes have joined.

  • If more than one GPU is participating in the group, it adds a DistributedTrainer callback to the learner object.

  • It yields the learner object.

Then the line learn.fit(.....) is executed.

Upon exiting the context:

  • The DistributedTrainer callback is removed from the learner's callback list.
  • Any locally created PyTorch 'dpg' will be destroyed.
  • The process stays attached to the same GPU, though.
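
The following schematic re-implements the enter/exit steps above with the helpers documented on this page, for illustration only (distrib_ctx_sketch is not a library function):

import torch
from contextlib import contextmanager
# assumes the usual fastai2 imports, e.g. from fastai2.basics import *; from fastai2.distributed import *

@contextmanager
def distrib_ctx_sketch(learn, cuda_id=None):
    if cuda_id is None: cuda_id = rank_distrib()       # default to this process's rank
    created_dpg = added_cb = False
    if not torch.distributed.is_initialized():
        setup_distrib(cuda_id)                         # pick the GPU and create a dpg if needed
        created_dpg = True
    if num_distrib() > 1:                              # only useful with more than one process
        learn.add_cb(DistributedTrainer(cuda_id)); added_cb = True
    try:
        yield learn                                    # learn.fit(...) runs here
    finally:
        if added_cb: learn.detach_distributed()        # remove the DistributedTrainer callback
        if created_dpg: teardown_distrib()             # destroy a locally created dpg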