
[FEATURE] Transform custom dataset to deeplake dataset/database/vectorstore conveniently using DDP #2602

Open
ChawDoe opened this issue Sep 20, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

ChawDoe commented Sep 20, 2023

Description

Here is my use case:
I have 4 GPU nodes on AWS for training (which includes computing tensors).
I want to save the pre-computed tensors to Deep Lake (Dataset/database/vector store) so that the next training run does not have to recompute them.
I use Accelerate as my distributed parallel framework.
So my workflow looks like this:

import deeplake
import torch

# one dataset per process, keyed by the process index
deeplake_path = 'dataset_{}'.format(current_process_index)
ds = deeplake.dataset(deeplake_path, overwrite=False)
for index, data_dict in enumerate(my_pytorch_dataloader):
    with torch.no_grad():
        a = net_a_frozen(data_dict['a'])
        b = net_b_frozen(data_dict['b'])
    # loss = net_c_training(a, b)  -- the loss is only used during training
    save_dict = {'data_dict': data_dict, 'a': a.detach().cpu().numpy(), 'b': b.detach().cpu().numpy()}
    append_to_deeplake(deeplake_path, save_dict)
    if index % 100 == 0:
        commit_to_deeplake(deeplake_path)

Note that after the Deep Lake dataset is constructed, the next training run can read from it instead of recomputing the tensors I need.
The problems are:

  1. I have to assign a different Deep Lake dataset to each process, but I then need to merge them into a single dataset.
  2. I need to design a proper for-loop/parallel workflow for the dataset construction.
  3. The frequent append and commit calls take a lot of time.
  4. The detach() and .cpu() calls take a lot of time.
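
Problems 3 and 4 above could plausibly be reduced by buffering records and flushing them in batches instead of appending and committing on every step. Below is a minimal, framework-agnostic sketch of that pattern; `flush_fn` is a hypothetical stand-in for the real Deep Lake bulk-append-and-commit logic, not an actual Deep Lake API:

```python
class BatchedWriter:
    """Buffer per-step records and flush them in batches.

    flush_fn is a stand-in for the real Deep Lake append + commit
    calls (hypothetical; the actual Deep Lake API may differ).
    """
    def __init__(self, flush_fn, batch_size=100):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.buffer = []
        self.flushes = 0

    def append(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)  # one bulk append + commit
            self.flushes += 1
            self.buffer = []

# Usage: 250 records with batch_size=100 trigger only 3 flushes.
written = []
writer = BatchedWriter(lambda batch: written.extend(batch), batch_size=100)
for i in range(250):
    writer.append({"step": i})
writer.flush()  # flush the trailing partial batch
print(writer.flushes, len(written))  # 3 250
```

The same idea would also let the `.detach().cpu()` transfer happen once per batch rather than once per step.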

So, is there any feature to transform a custom dataset into a Deep Lake dataset?
If we have a function which works like this:

ds.distributed_append_gpu_tensor_and_auto_commit(data_tensor)
ds.auto_transform_pytorch_dataset(my_pytorch_dataloader)

Or could you give me a standard workflow for this? I don't know which method is best for this scenario.
The documentation does not cover this problem; #2596 also points to it.

Use Cases

Distributed parallel computing and saving to deeplake.

@ChawDoe ChawDoe added the enhancement New feature or request label Sep 20, 2023

ChawDoe commented Sep 20, 2023

@davidbuniat Thanks. It's really urgent for me.

FayazRahman (Contributor) commented:

Hey @ChawDoe! Thanks for opening the issue. Let us look into whether any of our current workflows will satisfy your use case and we'll get back to you in a few days.


ChawDoe commented Sep 20, 2023

> Hey @ChawDoe! Thanks for opening the issue. Let us look into whether any of our current workflows will satisfy your use case and we'll get back to you in a few days.

Thanks! I hope I have explained my use case clearly.
Maybe I need functions like this:

ds = deeplake.distributed_dataset('xxx')
ds.distributed_append(xxx)
ds.distributed_commit(xxx)
ds.distributed_append_auto_commit(xxx)

where auto-commit finds the best memory-time trade-off within a for loop.
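
One way to read the "memory-time trade-off" above is a commit policy that triggers on whichever comes first: too many uncommitted items (memory pressure) or too much time since the last commit (risk of losing work). Here is a small sketch of such a policy; `AutoCommitPolicy` and its thresholds are hypothetical illustrations, not a Deep Lake API:

```python
import time

class AutoCommitPolicy:
    """Commit on whichever comes first: max_items buffered appends
    (memory bound) or max_seconds since the last commit (time bound).
    A sketch of the trade-off described above, not a Deep Lake API.
    """
    def __init__(self, max_items=100, max_seconds=30.0, clock=time.monotonic):
        self.max_items = max_items
        self.max_seconds = max_seconds
        self.clock = clock  # injectable for testing
        self.pending = 0
        self.last_commit = clock()

    def record_append(self):
        self.pending += 1

    def should_commit(self):
        return (self.pending >= self.max_items
                or self.clock() - self.last_commit >= self.max_seconds)

    def committed(self):
        # call after a real commit succeeds
        self.pending = 0
        self.last_commit = self.clock()

# Simulate with a fake clock instead of real time.
t = [0.0]
policy = AutoCommitPolicy(max_items=3, max_seconds=10.0, clock=lambda: t[0])
policy.record_append(); policy.record_append()
print(policy.should_commit())  # False: 2 < 3 items, 0.0 s elapsed
t[0] = 11.0
print(policy.should_commit())  # True: time threshold exceeded
```

In the training loop, `record_append()` would run after each append and a real `ds.commit()` would run whenever `should_commit()` returns True, followed by `committed()`.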


ChawDoe commented Oct 7, 2023

@FayazRahman Hi, do you have any updates on this?

FayazRahman (Contributor) commented:

Sorry @ChawDoe, I haven't been able to work on this yet. I will update here as soon as I make any headway on it.
