Working with larger datasets #35

Open
mainpyp opened this issue Mar 22, 2023 · 2 comments

Comments


mainpyp commented Mar 22, 2023

Hi,
thank you for this awesome project.
I want to apply DiffuSeq to a larger dataset (~17M sentences), but tokenization keeps blowing up my RAM, even though I have 200 GB available! Is there a functionality that I am missing that uses cached tokens, or is this work in progress?

Thanks again & best!

@summmeer
Collaborator

Hi,
Maybe you can try adding keep_in_memory = True in the tokenized_datasets = raw_datasets.map(...) call.
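For illustration, the modified call could look roughly like this (tokenize_function stands in for whatever function the code already passes to map; only the keep_in_memory line is the actual change):

tokenized_datasets = raw_datasets.map(
    tokenize_function,    # the tokenization function already used by the code
    batched=True,
    keep_in_memory=True,  # keep the mapped result in memory instead of writing an Arrow cache file
)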

If that doesn't work, you can try splitting your dataset into separate folds and loading each one in a different training stage.
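As a sketch of the fold idea (the file names and the number of folds here are made up for illustration):

n_folds = 4
fold_files = [open(f'train.fold{i}.jsonl', 'w') for i in range(n_folds)]
with open('train.jsonl', 'r') as f_reader:      # hypothetical path to the full ~17M-line jsonl file
    for i, row in enumerate(f_reader):
        fold_files[i % n_folds].write(row)      # distribute lines round-robin across the folds
for f in fold_files:
    f.close()

Each fold file can then be loaded on its own in a separate training stage instead of tokenizing all ~17M sentences at once.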


mainpyp commented Mar 23, 2023

It's still not working properly, but I think that has something to do with padding and my sequence lengths. I have to investigate that further, but thank you for your help! :)
I found a small thing that accelerated the data loading time a lot:

with open(path, 'r') as f_reader:
    for row in f_reader:
        sentence_lst['src'].append(json.loads(row)['src'].strip())
        sentence_lst['trg'].append(json.loads(row)['trg'].strip())

Here each line is parsed twice with json.loads. By parsing it once and then accessing src and trg from the result, I saved a lot of time:

with open(path, 'r') as f_reader:
    for row in f_reader:
        line = json.loads(row)
        sentence_lst['src'].append(line['src'].strip())
        sentence_lst['trg'].append(line['trg'].strip())
