
[FEATURE] Transform custom dataset to deeplake dataset/database/vectorstore conveniently using DDP #2602

Open
ChawDoe opened this issue Sep 20, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

ChawDoe commented Sep 20, 2023

Description

Here is my use case:
I have 4 GPU nodes on AWS for training (which includes computing tensors).
I want to save the pre-computed tensors to Deep Lake (Dataset/database/vector store) so that the next training run does not have to recompute them.
I use Accelerate as my distributed parallel framework.
So my workflow looks like this:

import deeplake
import torch

# one dataset per process, keyed by the process index
deeplake_path = 'dataset_{}'.format(current_process_index)
ds = deeplake.dataset(deeplake_path, overwrite=False)
for index, data_dict in enumerate(my_pytorch_dataloader):
    with torch.no_grad():
        a = net_a_frozen(data_dict['a'])
        b = net_b_frozen(data_dict['b'])
    # loss = net_c_training(a, b)  -- the loss is only used during training
    save_dict = {'data_dict': data_dict, 'a': a.detach().cpu().numpy(), 'b': b.detach().cpu().numpy()}
    append_to_deeplake(deeplake_path, save_dict)
    if index % 100 == 0:
        commit_to_deeplake(deeplake_path)

Note that after the Deep Lake dataset is constructed, the next training run can read from it instead of recomputing the tensors I need.
The problems are:

  1. I have to assign a different Deep Lake dataset to each process, but I then need to merge them into a single dataset.
  2. I need to design a proper for-loop/parallel workflow for the dataset construction.
  3. The frequent append and commit calls take a lot of time.
  4. The detach() and .cpu() calls take a lot of time.
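
Problems 3 and 4 above could plausibly be reduced by buffering records and flushing them in batches instead of appending and committing on every step. Below is a minimal, framework-agnostic sketch of that pattern; `flush_fn` is a hypothetical stand-in for the real Deep Lake bulk-append-and-commit logic, not an actual Deep Lake API:

```python
class BatchedWriter:
    """Buffer per-step records and flush them in batches.

    flush_fn is a stand-in for the real Deep Lake append + commit
    calls (hypothetical; the actual Deep Lake API may differ).
    """
    def __init__(self, flush_fn, batch_size=100):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.buffer = []
        self.flushes = 0

    def append(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)  # one bulk append + commit
            self.flushes += 1
            self.buffer = []

# Usage: 250 records with batch_size=100 trigger only 3 flushes.
written = []
writer = BatchedWriter(lambda batch: written.extend(batch), batch_size=100)
for i in range(250):
    writer.append({"step": i})
writer.flush()  # flush the trailing partial batch
print(writer.flushes, len(written))  # 3 250
```

The same idea would also let the `.detach().cpu()` transfer happen once per batch rather than once per step.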

So, is there any feature to transform a custom dataset into a Deep Lake dataset?
If we have a function which works like this:

ds.distributed_append_gpu_tensor_and_auto_commit(data_tensor)
ds.auto_transform_pytorch_dataset(my_pytorch_dataloader)

Or could you give me a standard workflow for this? I don't know which method is best for this scenario.
The documentation does not cover this problem; #2596 also points to it.

Use Cases

Distributed parallel computing and saving to deeplake.

@ChawDoe ChawDoe added the enhancement New feature or request label Sep 20, 2023

ChawDoe commented Sep 20, 2023

@davidbuniat Thanks. It's really urgent for me.

FayazRahman (Contributor) commented:

Hey @ChawDoe! Thanks for opening the issue. Let us look into whether any of our current workflows will satisfy your use case and we'll get back to you in a few days.


ChawDoe commented Sep 20, 2023

> Hey @ChawDoe! Thanks for opening the issue. Let us look into whether any of our current workflows will satisfy your use case and we'll get back to you in a few days.

Thanks! I hope I have explained my use case clearly.
Maybe I need functions like this:

ds = deeplake.distributed_dataset('xxx')
ds.distributed_append(xxx)
ds.distributed_commit(xxx)
ds.distributed_append_auto_commit(xxx)

where auto-commit finds the best memory-time trade-off within a for loop.
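
One way to read the "memory-time trade-off" above is a commit policy that triggers on whichever comes first: too many uncommitted items (memory pressure) or too much time since the last commit (risk of losing work). Here is a small sketch of such a policy; `AutoCommitPolicy` and its thresholds are hypothetical illustrations, not a Deep Lake API:

```python
import time

class AutoCommitPolicy:
    """Commit on whichever comes first: max_items buffered appends
    (memory bound) or max_seconds since the last commit (time bound).
    A sketch of the trade-off described above, not a Deep Lake API.
    """
    def __init__(self, max_items=100, max_seconds=30.0, clock=time.monotonic):
        self.max_items = max_items
        self.max_seconds = max_seconds
        self.clock = clock  # injectable for testing
        self.pending = 0
        self.last_commit = clock()

    def record_append(self):
        self.pending += 1

    def should_commit(self):
        return (self.pending >= self.max_items
                or self.clock() - self.last_commit >= self.max_seconds)

    def committed(self):
        # call after a real commit succeeds
        self.pending = 0
        self.last_commit = self.clock()

# Simulate with a fake clock instead of real time.
t = [0.0]
policy = AutoCommitPolicy(max_items=3, max_seconds=10.0, clock=lambda: t[0])
policy.record_append(); policy.record_append()
print(policy.should_commit())  # False: 2 < 3 items, 0.0 s elapsed
t[0] = 11.0
print(policy.should_commit())  # True: time threshold exceeded
```

In the training loop, `record_append()` would run after each append and a real `ds.commit()` would run whenever `should_commit()` returns True, followed by `committed()`.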


ChawDoe commented Oct 7, 2023

@FayazRahman Hi, do you have any updates on this?

FayazRahman (Contributor) commented:

Sorry @ChawDoe, I haven't been able to work on this yet. I will update here as soon as I make any headway on it.
