Make IterToMap loading more lazily #454

Open
ejguan opened this issue May 24, 2022 · 7 comments · May be fixed by #1016
Labels
help wanted Extra attention is needed

Comments

@ejguan
Contributor

ejguan commented May 24, 2022

🚀 The feature

Currently, IterToMap loads all of the data from the prior IterDataPipe when the first __getitem__ is invoked here.

Instead, we could stop loading from the prior IterDataPipe as soon as the requested index is found. We might also need a flag to prevent loading the data multiple times.
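A minimal sketch of what such a lazier lookup could look like (hypothetical code, not the actual torchdata implementation; it assumes the source pipe yields (key, value) pairs, as IterToMap expects):

from torch.utils.data import IterDataPipe, MapDataPipe

class LazyIterToMap(MapDataPipe):
    # Sketch: only consume the source until the requested key is found.
    def __init__(self, source_dp: IterDataPipe):
        self.source_dp = source_dp
        self._map = {}            # items loaded so far
        self._source_iter = None  # partially consumed iterator over source_dp
        self._exhausted = False   # flag so we never re-load the source

    def __getitem__(self, key):
        if key in self._map:
            return self._map[key]
        if not self._exhausted:
            if self._source_iter is None:
                self._source_iter = iter(self.source_dp)
            # Pull items from the source only until the requested key shows up.
            for k, v in self._source_iter:
                self._map[k] = v
                if k == key:
                    return v
            self._exhausted = True
        raise KeyError(f"Index {key} is invalid.")

    def __len__(self):
        # Computing the length still requires a full load; omitted from this sketch.
        raise NotImplementedError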

Motivation, pitch

This would improve performance when users simply iterate over the MapDataPipe, since we would no longer need to pre-load everything at the beginning of the iteration; effectively, it simulates the behavior of an IterDataPipe.

Alternatives

No response

Additional context

No response

@ejguan ejguan changed the title IterToMap can be more lazily Make IterToMap loading more lazily May 24, 2022
@NivekT NivekT added the help wanted Extra attention is needed label May 24, 2022
@linminhtoo

linminhtoo commented May 31, 2022

This would be very much appreciated for my use-case. I also would like to ask a related question:
I start with a .csv/.parquet file which contains the unique ID of each data sample as well as some other metadata. I then apply a series of maps & filters as needed by my use-case. Finally, I need to generate the tensors needed by my model for training. The last step is expensive in both compute & memory.

I am using IterDataPipe for all the operations until the last step. For the last step, I have tried 3 cases:

1. define my own IterDataPipe
2. define my own MapDataPipe
3. use my defined IterDataPipe, but run .to_map_datapipe() to convert it to a MapDataPipe AFTER generating the tensors

I found, somewhat to my surprise, that 3) gives the fastest training speed (in terms of number of samples per second) compared to 1) and even 2). In fact, 3) could be 2-4 times faster than 1). However, 3) is very memory intensive and simply infeasible even for a relatively small dataset (100k samples), since we need to load the whole IterDataPipe into memory in self._load_map().

I also found that 2) is slower than 3), but still faster than 1). The main problem with 2) is that handling errors during tensor generation in __getitem__ is much more complicated than simply not yielding in __iter__ for 1). For example, I could return None if an error happens, but then I have to handle that None in collate_fn, and I may also end up with empty batches if every sample in a batch is None.
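For reference, a minimal sketch of the kind of None-tolerant collate_fn described above (a hypothetical helper, assuming torch.utils.data.default_collate is available):

from torch.utils.data import default_collate

def collate_skip_none(batch):
    # Drop samples whose feature generation failed and returned None.
    batch = [sample for sample in batch if sample is not None]
    if not batch:
        # The whole batch failed; the training loop has to skip this step.
        return None
    return default_collate(batch)

The DataLoader would then be constructed with collate_fn=collate_skip_none, and the training loop still has to check for None batches.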

Ideally, I would have expected 1) to be about as fast as 3), or perhaps just a little slower, e.g. collating an iterator into batches may add some overhead. Please help me understand if I am missing anything here. I have tried num_workers = 0 and > 0, but neither changed my conclusions.

Also, should I open a new Issue for this?

@ejguan
Contributor Author

ejguan commented May 31, 2022

@linminhtoo Thanks for providing feedback on your use case.
Could you please share the operation you are using for the last step? I am trying to understand why generating Tensors would require a custom DataPipe.

In terms of speed, could you please also share how you measure it? I think the overall time for 3) should be the same as for 1). If we are talking about the speed per iteration, 3) should be faster, since all of the loading logic happens at the beginning of the epoch, while 1) loads data lazily on each iteration.

@linminhtoo
Copy link

linminhtoo commented Jun 1, 2022

Thanks for your very prompt reply. Before the last step, the samples contain only the identity and metadata, such as paths. I'm in the cheminformatics space, where we generate various features for a given molecule or protein. We may, for example, generate Morgan fingerprints of dimension 2048, or a large 3D surface mesh of the protein. These can be expensive in both compute and memory. I define a new class myself because there are many different features we want to generate for each sample.

As for the timing, I'm simply measuring the time taken for an epoch of training a very simple model (so that the bottleneck is more likely to shift to the data loading) over about 100k samples. 3) can be as fast as 30 seconds for the 1st epoch, and even faster at 20 seconds from the 2nd epoch onwards, but it nearly maxes out my 32 GB of RAM and does not scale. 1) keeps my RAM usage nice and low, but can take 2 to 3 minutes per epoch. I realized I have not tried 2) on 100k samples, only on a smaller set of 3k samples, where it was faster than 1); I can double-check that.

I may be misunderstanding something. If I define the computationally expensive step as 1) or 2), there should be little difference in speed and memory, right? 3) is not really usable in practice, but its sheer speedup relative to 1) is a little concerning.

@linminhtoo

linminhtoo commented Jun 1, 2022

Hi, I have run 2) on the larger 100k-sample dataset and discovered the root cause of why it is faster than 1).

For 2): MapDataPipe

  • num_workers = 8 --> 1 min/epoch
  • num_workers = 0 --> 2.5 min/epoch

For 1): IterDataPipe

  • num_workers = 8 --> 4 min/epoch
  • num_workers = 0 --> 2.5 min/epoch

As you can see, the IterDataPipe is clearly not utilizing the num_workers properly; in fact, it is slowed down by multiprocessing. In contrast, the MapDataPipe can leverage multiprocessing. With num_workers = 0, both take about the same time per epoch, which is in line with expectations.

To be clear, I am doing exactly the same last step (the expensive step) in both cases; the only difference is whether it is an __iter__ or a __getitem__, and whether the class inherits from IterDataPipe or MapDataPipe, respectively.

For sharding the IterDataPipe, I am simply following the instructions in the latest docs. The worker_init_fn is taken from #458.
This is a summary of 1); everything remains an IterDataPipe:

# load data & parse as dict
dp = dp.parse_csv_as_dict()
# my maps & filters
...
# shuffle
dp = dp.shuffle()
# my expensive step, still as an IterDataPipe
dp = dp.get_features_as_iter(my_args)
# prepare for sharding
dp = dp.sharding_filter()
# make dataloader with custom worker_init_fn
dataloader = DataLoader(dp, num_workers=8, batch_size=48, pin_memory=True, worker_init_fn=worker_init_fn)
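The worker_init_fn itself is not reproduced above; a typical sharding worker_init_fn (a sketch assuming apply_sharding from torch.utils.data.graph_settings, not necessarily the exact code from #458) looks roughly like this:

from torch.utils.data import get_worker_info
from torch.utils.data.graph_settings import apply_sharding

def worker_init_fn(worker_id):
    # Shard the datapipe graph so each DataLoader worker sees a disjoint
    # slice of the data; the split takes effect at the sharding_filter() step.
    info = get_worker_info()
    apply_sharding(info.dataset, info.num_workers, info.id)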

For 2), as a MapDataPipe, the difference is here:

# dp is still IterDataPipe here
dp = dp.shuffle()
# i define custom IterDataPipe to assign integer as key for .to_map_datapipe(), similar to dp.add_index()
dp = dp.assign_integer_as_key()
# no RAM blow-up as I have not generated any feature tensors yet
dp = dp.to_map_datapipe()
# my expensive step, as a MapDataPipe
dp = dp.get_features_as_map(my_args)
# make dataloader
dataloader = DataLoader(dp, num_workers=8, batch_size=48, pin_memory=True)

@linminhtoo
Copy link

linminhtoo commented Jun 1, 2022

Sorry for another comment. I had a weird intuition to modify the order of operations for 1).
I changed from

# shuffle
dp = dp.shuffle()
# my expensive step, still as an IterDataPipe
dp = dp.get_features_as_iter(my_args)
# prepare for sharding
dp = dp.sharding_filter()  # <--------
# make dataloader with custom worker_init_fn
dataloader = DataLoader(dp, num_workers=8, batch_size=48, pin_memory=True, worker_init_fn=worker_init_fn)

to

# shuffle
dp = dp.shuffle()
# prepare for sharding 
dp = dp.sharding_filter()  # <--------
# my expensive step, still as an IterDataPipe
dp = dp.get_features_as_iter(my_args)
# make dataloader with custom worker_init_fn
dataloader = DataLoader(dp, num_workers=8, batch_size=48, pin_memory=True, worker_init_fn=worker_init_fn)

Now,
For 2): MapDataPipe

  • num_workers = 8 --> 1 min/epoch
  • num_workers = 0 --> 2.5 min/epoch

For 1): IterDataPipe

  • num_workers = 8 --> ~~4 min/epoch~~ 1.5 min/epoch
  • num_workers = 0 --> 2.5 min/epoch

By placing the .sharding_filter() op BEFORE the expensive step, the IterDataPipe can finally utilize the multiple cores properly.
However, it is still slower than the MapDataPipe (1.5 min vs 1 min).

@ejguan
Contributor Author

ejguan commented Jun 1, 2022

> By placing the .sharding_filter() op BEFORE the expensive step, the IterDataPipe can finally utilize the multiple cores properly.

Placing shuffle and sharding as early as possible is definitely the right way to avoid redundant work. We should add a few words about this to our tutorial: https://pytorch.org/data/beta/tutorial.html#working-with-dataloader

> However, it is still slower than the MapDataPipe (1.5 min vs 1 min).

When you measure the total time per epoch, do you include the time spent creating the DataLoader iterator?
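For concreteness, a hypothetical way to time an epoch so that iterator creation is counted (a sketch, not a prescribed benchmark):

import time

start = time.perf_counter()
it = iter(dataloader)  # worker startup and any eager pre-loading begin here
for batch in it:
    pass               # training step placeholder
print(f"epoch took {time.perf_counter() - start:.1f} s, iterator creation included")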

@linminhtoo

> By placing the .sharding_filter() op BEFORE the expensive step, the IterDataPipe can finally utilize the multiple cores properly.
>
> Placing shuffle and sharding as early as possible is definitely the right way to avoid redundant work. We should add a few words about this to our tutorial: https://pytorch.org/data/beta/tutorial.html#working-with-dataloader
>
> However, it is still slower than the MapDataPipe (1.5 min vs 1 min).
>
> When you measure the total time per epoch, do you include the time spent creating the DataLoader iterator?

I went back to my code to run more extensive and rigorous speed benchmarks. Since this thread is getting a little long, I decided to create a new issue and link it to this one.

facebook-github-bot pushed a commit that referenced this issue Jun 2, 2022
Summary:
See the feedback from a user: #454 (comment)

We should explicitly ask users to place `sharding_filter` as early as possible.

Pull Request resolved: #487

Reviewed By: wenleix

Differential Revision: D36812259

Pulled By: ejguan

fbshipit-source-id: 4c983f3216a80be398f85b20871e65b0e41627e0
ejguan added a commit to ejguan/data that referenced this issue Jun 2, 2022
ejguan added a commit that referenced this issue Jun 2, 2022
bushshrub pushed a commit to bushshrub/data that referenced this issue Jun 2, 2022
@SvenDS9 SvenDS9 linked a pull request Feb 15, 2023 that will close this issue