Pad each batch, not the whole dataset #30

Open: sshleifer wants to merge 16 commits into master

Conversation

@sshleifer (Contributor) commented Sep 23, 2019

Previously, each sequence was padded to the length of the longest sequence in the dataset.
In this PR, each batch is padded to the length of the longest sequence in the batch. This results in a 30% speedup with negligible impact on metrics.

Code Changes

  • ChatDataset yields example dicts like {'input_ids': [[hist + cand_1], ..., [hist + cand_n]], ...} for the PADDED_INPUTS, with mc_token_ids and mc_labels in the same format as before.
  • ChatDataset().collate_fn(examples: list) turns a list of example dicts into the list of 5 tensors by batching and padding them (see the sketch after this list).
  • As a result, get_dataloaders does much less
  • To facilitate this, the data format changes in the part of the pipeline where we build lists of examples.
  • convai_evaluation.py still calls the old pad_dataset
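
For illustration, here is a minimal sketch of the per-batch padding idea (not the PR's exact code: only three of the five model inputs are shown, field names follow the description above, and using pad_id for every padded field is an assumption):

import torch

def collate_fn(examples, pad_id):
    # examples: list of dicts like
    #   {'input_ids': [[hist + cand_1], ..., [hist + cand_n]],
    #    'mc_token_ids': [...], 'mc_labels': int}
    # Pad to the longest sequence in THIS batch, not in the whole dataset.
    max_len = max(len(seq) for ex in examples for seq in ex['input_ids'])

    def pad(seq, value):
        return seq + [value] * (max_len - len(seq))

    input_ids = torch.tensor(
        [[pad(seq, pad_id) for seq in ex['input_ids']] for ex in examples]
    )  # (batch, n_candidates, max_len)
    mc_token_ids = torch.tensor([ex['mc_token_ids'] for ex in examples])
    mc_labels = torch.tensor([ex['mc_labels'] for ex in examples])
    return input_ids, mc_token_ids, mc_labels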

1 Epoch Sanity Check

Before Change: 85 minutes
Validation: {'accuracy': 0.7483655941545956,
'average_accuracy': 0.7483655941545956,
'average_nll': 2.6815188920676687,
'average_ppl': 14.607263311061963,
'nll': 2.6815188920676687}

After Change: 60 minutes
Validation: {'accuracy': 0.7466991411357519,
'average_accuracy': 0.7466991411357519,
'average_nll': 2.6821035040007972,
'average_ppl': 14.615805388160778,
'nll': 2.6821035040007972}

Command:

python train.py --model_checkpoint openai-gpt --dataset_cache dataset_cache --fp16 O1 --n_epochs 1 --train_batch_size 4

@sshleifer changed the title from "(WIP) Pad each batch, not the whole dataset" to "Pad each batch, not the whole dataset" on Sep 29, 2019
return train_loader, valid_loader, train_sampler, valid_sampler


def make_data_lists(args, personachat, tokenizer):
@sshleifer (author) commented:

Needs a docstring.

@@ -86,36 +139,20 @@ def get_data_loaders(args, tokenizer):
     persona = dialog["personality"].copy()
     for _ in range(args.personality_permutations):
         for utterance in dialog["utterances"]:
-            history = utterance["history"][-(2*args.max_history+1):]
+            candidate_instances = defaultdict(list)
+            history = utterance["history"][-(2 * args.max_history + 1):]
@sshleifer (author) commented:

could add assert len(utterance['candidates']) >= num_candidates
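
For instance (hypothetical placement, guarding the [-num_candidates:] slice used further down):

assert len(utterance["candidates"]) >= num_candidates, (
    "expected at least %d candidates, got %d"
    % (num_candidates, len(utterance["candidates"]))
)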

@@ -72,11 +73,63 @@ def build_input_from_segments(persona, history, reply, tokenizer, lm_labels=Fals
return instance, sequence # TODO: second arg is never used, delete it


def pad_and_tensorize(batch_dict, padding):
@sshleifer (author) commented:

this and ChatDataset should be easy to unit test
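
A sketch of what such a test could look like (pytest style; the return contract of pad_and_tensorize is assumed here, not taken from the PR):

from train import pad_and_tensorize  # assuming it lives in train.py

def test_pad_and_tensorize_pads_to_batch_max():
    batch_dict = {'input_ids': [[1, 2, 3], [4, 5]]}
    # Assumed contract: one tensor per key, each sequence padded to the
    # longest sequence in the batch with the given padding value.
    (input_ids,) = pad_and_tensorize(batch_dict, padding=0)
    assert input_ids.tolist() == [[1, 2, 3], [4, 5, 0]]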

valid_dataset = ChatDataset(datasets['valid'], pad_id)

logger.info("Build train and validation dataloaders")
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset) if args.distributed else None
@sshleifer (author) commented:

(maybe) put this in ChatDataset.to_loader(self, args, shuffle) -> sampler, loader
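
A sketch of that suggestion (hypothetical helper; the batch-size argument name is assumed):

from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler

class ChatDataset(Dataset):
    # ... __init__ / __len__ / __getitem__ / collate_fn as in the PR ...

    def to_loader(self, args, shuffle):
        sampler = DistributedSampler(self) if args.distributed else None
        loader = DataLoader(self, sampler=sampler,
                            batch_size=args.train_batch_size,  # name assumed
                            shuffle=(sampler is None and shuffle),
                            collate_fn=self.collate_fn)
        return sampler, loader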

@sshleifer (author) commented:

at some point might also want to document which tensors are 3D
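
For example, a shape note along these lines (shapes inferred from the (batch, n_candidates, seq_len) layout the batching implies; worth verifying before committing it to a docstring):

# input_ids, token_type_ids, lm_labels: 3D (batch, n_candidates, seq_len)
# mc_token_ids:                         2D (batch, n_candidates)
# mc_labels:                            1D (batch,)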

                 for input_name, input_array in instance.items():
-                    datasets[dataset_name][input_name].append(input_array)
+                    candidate_instances[input_name].append(input_array)
+                for k in candidate_instances.keys():
@sshleifer (author) commented:

.items() will save some chars
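
i.e. something like this (the loop body is outside this hunk, so the exact rewrite is a guess):

for k, v in candidate_instances.items():
    ...  # use v directly instead of candidate_instances[k]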

train.py (outdated diff)
             for j, candidate in enumerate(utterance["candidates"][-num_candidates:]):
-                lm_labels = bool(j == num_candidates-1)
+                lm_labels = bool(j == num_candidates - 1)
                 instance, _ = build_input_from_segments(persona, history, candidate, tokenizer, lm_labels)
@sshleifer (author) commented:

better varname?
