Working with larger datasets #35

Open
mainpyp opened this issue Mar 22, 2023 · 2 comments

Comments


mainpyp commented Mar 22, 2023

Hi,
thank you for this awesome project.
I want to apply DiffuSeq to a larger dataset (~17M sentences), but tokenization keeps blowing up my RAM, even though I have 200 GB available! Is there a functionality that I am missing that uses cached tokens, or is this work in progress?

Thanks again & best!

@summmeer
Collaborator

Hi,
Maybe you can try adding keep_in_memory = True in the tokenized_datasets = raw_datasets.map(...) call.
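For illustration, the modified call could look roughly like this (tokenize_function stands in for whatever function the code already passes to map; only the keep_in_memory line is the actual change):

tokenized_datasets = raw_datasets.map(
    tokenize_function,    # the tokenization function already used by the code
    batched=True,
    keep_in_memory=True,  # keep the mapped result in memory instead of writing an Arrow cache file
)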

If that doesn't work, you can try splitting your dataset into separate folds and loading each one in a different training stage.
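As a sketch of the fold idea (the file names and the number of folds here are made up for illustration):

n_folds = 4
fold_files = [open(f'train.fold{i}.jsonl', 'w') for i in range(n_folds)]
with open('train.jsonl', 'r') as f_reader:      # hypothetical path to the full ~17M-line jsonl file
    for i, row in enumerate(f_reader):
        fold_files[i % n_folds].write(row)      # distribute lines round-robin across the folds
for f in fold_files:
    f.close()

Each fold file can then be loaded on its own in a separate training stage instead of tokenizing all ~17M sentences at once.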


mainpyp commented Mar 23, 2023

It's still not working properly, but I think that has something to do with padding and my sequence lengths. I have to investigate that further, but thank you for your help! :)
I found a small thing that accelerated the data loading time a lot:

with open(path, 'r') as f_reader:
    for row in f_reader:
        sentence_lst['src'].append(json.loads(row)['src'].strip())
        sentence_lst['trg'].append(json.loads(row)['trg'].strip())

Here each line is parsed twice with json.loads. By parsing it once and then accessing src and trg from the result, I saved a lot of time:

with open(path, 'r') as f_reader:
    for row in f_reader:
        line = json.loads(row)
        sentence_lst['src'].append(line['src'].strip())
        sentence_lst['trg'].append(line['trg'].strip())
