
[Feature Request] Array of dicts of tensors structure #27

Open
vadimkantorov opened this issue Nov 7, 2022 · 3 comments
@vadimkantorov

Storing arrays of dicts of tensors in "columnar format" can be more compact in some circumstances: for example, they become copy-on-write safe in a multiprocessing context, since the whole structure is stored as a very small number of tensors whose count does not depend on the "dataset" size: https://gist.github.com/vadimkantorov/86c3a46bf25bed3ad45d043ae86fff57
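For reference, a minimal sketch of what "columnar format" means here (hypothetical code, not taken from the linked gist): each key is stored as one concatenated tensor plus an offsets tensor, so the dataset is only a handful of tensor objects regardless of how many items it holds.

```python
import torch

# per-item dicts of tensors (e.g. detection targets), ragged across items
items = [
    {"boxes": torch.rand(3, 4), "labels": torch.tensor([1, 2, 3])},
    {"boxes": torch.rand(5, 4), "labels": torch.tensor([0, 1, 0, 2, 2])},
]

# columnar layout: one concatenated tensor per key plus item offsets,
# so the number of tensor objects does not grow with the number of items
columnar = {
    "boxes": torch.cat([it["boxes"] for it in items]),
    "labels": torch.cat([it["labels"] for it in items]),
    "offsets": torch.tensor([0] + [len(it["labels"]) for it in items]).cumsum(0),
}

print({k: tuple(v.shape) for k, v in columnar.items()})
# {'boxes': (8, 4), 'labels': (8,), 'offsets': (3,)}
```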

vadimkantorov added the enhancement (New feature or request) label Nov 7, 2022
vadimkantorov changed the title from "[Feature Request] Array of dicts of tensors" to "[Feature Request] Array of dicts of tensors structure" Nov 7, 2022
vmoens (Contributor) commented Nov 8, 2022

Thanks for this @vadimkantorov.
How do you see this interacting with TensorDict?
Should arrays of dicts be a possible data type stored by TensorDict? Do you have a typical use case in mind?

vadimkantorov (Author) commented Nov 8, 2022

I don't know much about the TensorDict project. I just wanted to share a use case I had for dicts of tensors: representing a dataset in a way that avoids copy-on-write problems: pytorch/pytorch#13246

I represented this array of dicts of tensors as a columnar dict of tensors: each key maps to a tensor that concatenates all per-item tensors related to that key.

One way it could integrate with TensorDict: provide a constructor/util function and an indexing/getitem method/util that slices all keys in the TensorDict and returns a new, "per-item" TensorDict. These could be just recipes in the docs, or util functions plus tests verifying that no copy-on-write/memory expansion actually happens and that such a structure is safely shared in multiprocessing/dataloading without any copies. A sketch of what such a getitem util could look like follows below.
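A hedged sketch of such a per-item getitem over the columnar layout above; the helper name and keys are hypothetical, not an existing TensorDict API:

```python
import torch

# a tiny columnar dict of tensors: one concatenated tensor per key plus item offsets
columnar = {
    "labels": torch.tensor([1, 2, 3, 0, 1, 0, 2, 2]),
    "offsets": torch.tensor([0, 3, 8]),
}

def getitem_columnar(col, i):
    """Hypothetical per-item getitem: slice every concatenated key at item i."""
    lo, hi = int(col["offsets"][i]), int(col["offsets"][i + 1])
    return {k: v[lo:hi] for k, v in col.items() if k != "offsets"}

print(getitem_columnar(columnar, 1)["labels"])  # tensor([0, 1, 0, 2, 2])
```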

A similar use case is collecting partial results from a validation loop. Usually one would store them in a list of dicts of tensors and then analyze it somehow. If such a structure were implemented in an extendable way (as proposed in pytorch/pytorch#64359), it could be useful.

vmoens (Contributor) commented Nov 8, 2022

> A similar use case is collecting partial results from a validation loop. Usually one would store them in a list of dicts of tensors and then analyze it somehow. If such a structure were implemented in an extendable way (as proposed in pytorch/pytorch#64359), it could be useful.

That is something we have, I think.

Here's an example:

>>> import torch
>>> from tensordict import TensorDict
>>> tensordict1 = TensorDict({"a": torch.zeros(1, 1)}, [1])
>>> tensordict2 = TensorDict({"a": torch.ones(1, 1)}, [1])
>>> tensordict = torch.stack([tensordict1, tensordict2], 0)
>>> 
>>> tensordict
LazyStackedTensorDict(
    fields={
        a: Tensor(torch.Size([2, 1, 1]), dtype=torch.float32)},
    batch_size=torch.Size([2, 1]),
    device=None,
    is_shared=False)
>>>
>>> tensordict[0] is tensordict1
True
>>> tensordict["a"]
tensor([[[0.]],

        [[1.]]])

The LazyStackedTensorDict does not currently support appending, but we might consider that.
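For the validation-loop use case mentioned above, a minimal usage sketch: accumulate per-batch TensorDicts in a plain Python list and stack them once at the end. Only torch.stack over TensorDicts, as shown in the example above, is taken from this thread; the metric names are hypothetical.

```python
import torch
from tensordict import TensorDict

results = []
for step in range(4):
    # pretend these are per-sample metrics for a batch of 8 validation samples
    batch_metrics = TensorDict(
        {"loss": torch.rand(8, 1), "pred": torch.randint(0, 10, (8,))}, [8]
    )
    results.append(batch_metrics)

# stack the per-step TensorDicts into a single [4, 8] structure, as in the example above
stacked = torch.stack(results, 0)
print(stacked["loss"].shape)  # torch.Size([4, 8, 1])
```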
