
Batchwise padding dataset #121

Open
mrghofrani opened this issue Jul 11, 2022 · 0 comments

@mrghofrani

Hello,
I'm pretty new to PyTorch, so sorry if this question is too simple. Because of memory limits, I can't pad my dataset as a whole. I was wondering what the simplest way is to move the `pad_dataset` function into the training process; that is, how can I pad the dataset per batch? For ease of reference, I've included `pad_dataset` below.
Thanks.

def pad_dataset(dataset, padding=0):
    """Pad every sequence in the dataset to the global maximum length.

    This could be optimized by defining a Dataset class and padding at
    the batch level, but this is simpler.
    """
    # PADDED_INPUTS lists the dataset fields that need padding;
    # it is defined elsewhere in the repository.
    max_l = max(len(x) for x in dataset["input_ids"])
    for name in PADDED_INPUTS:
        # lm_labels are padded with -100 so the loss function ignores them
        dataset[name] = [x + [padding if name != "lm_labels" else -100] * (max_l - len(x))
                         for x in dataset[name]]
    return dataset
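
One common way to do what the question asks (not taken from this repository) is to move the padding into a `collate_fn` passed to `DataLoader`, so each batch is padded only to its own longest sequence. A minimal sketch, assuming the dataset yields dicts of Python lists and that `PADDED_INPUTS` (shown here with hypothetical field names) lists the fields to pad:

```python
import torch

# Assumed field names for illustration; use the repo's actual PADDED_INPUTS.
PADDED_INPUTS = ["input_ids", "token_type_ids", "lm_labels"]

def collate_fn(batch, padding=0):
    """Pad each field to the longest sequence in this batch only."""
    max_l = max(len(example["input_ids"]) for example in batch)
    padded = {}
    for name in PADDED_INPUTS:
        # lm_labels use -100 so the loss function ignores padded positions
        pad_value = -100 if name == "lm_labels" else padding
        padded[name] = torch.tensor(
            [example[name] + [pad_value] * (max_l - len(example[name]))
             for example in batch]
        )
    return padded
```

You would then pass it to the loader, e.g. `DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)`, and drop the global `pad_dataset` call entirely.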