
Enhancing the Performance of flashlight using cudnnFind, data-loader optimization, and control flow optimization #631

Open · wants to merge 3 commits into main
Conversation

@mtmd (Contributor) commented Jun 8, 2021

Original Issue: #630

Summary

This commit improves the performance of flashlight using the following optimizations:

  1. cudnnFind is used instead of flashlight's built-in benchmarking to select the convolution algorithms, improving performance (see the sketch after this list).

  2. A new data structure, Sample, is added. Sample transfers the data it contains to GPU memory asynchronously. Moreover, transformations are performed after prefetch so that a single CPU thread handles all GPU interactions, reducing context-switching and lock-acquisition overhead.

  3. Checking for invalid values in gradients is now being performed on the GPU side.
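For reference, the cuDNN-side selection in item 1 boils down to a call to cudnnFindConvolutionForwardAlgorithm, which actually times the candidate algorithms on the real tensor shapes. The snippet below is only an illustrative sketch (descriptor setup, error handling, and the hypothetical helper name pickForwardAlgo are not the exact flashlight code):

```cpp
#include <cudnn.h>

// Pick the fastest forward-convolution algorithm for the given descriptors.
// All descriptors are assumed to be created and configured by the caller.
cudnnConvolutionFwdAlgo_t pickForwardAlgo(
    cudnnHandle_t handle,
    cudnnTensorDescriptor_t xDesc,
    cudnnFilterDescriptor_t wDesc,
    cudnnConvolutionDescriptor_t convDesc,
    cudnnTensorDescriptor_t yDesc) {
  const int kRequested = 8;
  int returned = 0;
  cudnnConvolutionFwdAlgoPerf_t perf[kRequested];
  // cudnnFind* runs and times each candidate algorithm on the actual shapes
  // and returns the results sorted by execution time.
  cudnnFindConvolutionForwardAlgorithm(
      handle, xDesc, wDesc, convDesc, yDesc, kRequested, &returned, perf);
  return perf[0].algo;  // fastest algorithm for these exact dimensions
}
```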

Test Plan (required)

This pull request is expected to improve the performance of ResNet-34 on a V100 (16 GB), using a batch size of 128, from 1143 fps to 1507 fps. To test, please run:

bin/imgclass/fl_img_imagenet_resnet34 \
--data_dir=/path/to/your/ImageNet/folder \
--distributed_enable=false \
--exp_checkpoint_path=/tmp/test \
--logtostderr \
--fl_amp_use_mixed_precision=false \
--data_batch_size=128

@facebook-github-bot added the "CLA Signed" label on Jun 8, 2021
@facebook-github-bot: @xuqiantong has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@xuqiantong (Contributor) commented Jun 15, 2021

Hi @mtmd, thank you so much for your awesome work! We really appreciate your insights and your implementation in this PR! 👍

Below are some quick comments:

  1. Can you please include only the changes to Conv2D and DynamicScaler in this first version of the PR? Because the changes to the dataset pipeline seem a bit hacky and will break all the other applications. We may redesign it a bit on our side later to fit your asynchronous optimizations. Specifically, please postpone the changes related to the Sample class and the Datasets, but I think it is worthwhile and safe to keep "transformations are performed after prefetch".

  2. Regarding DynamicScaler (see the sketch after this list):

  • We use fl::kAmpMinimumScaleFactorValue to avoid an infinite loop when scaleFactor keeps decreasing, as well as maxScaleFactor_ to limit its maximal value. Can you please bring them back in your logic?
  • Would it be better to change flag_ in DynamicScaler into a boolean array? Also, can we give it a more meaningful name, like isInvalidArray?
  • Can you make adjustScaleFactor() return false when flag_ is true?

  3. Regarding configs.h:

  • Does it make more sense to place it in fl/autograd?
  • Why do you only index inputX and batchSize, rather than all 4 dimensions?
  • Since you removed DynamicBenchmark, can we 100% trust the algo selected by the one-time cudnnFind?

  4. Please add the copyright header to all the newly added files: https://github.com/flashlight/flashlight/blob/master/flashlight/app/benchmark/ModelBenchmarker.h#L1-L6
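To make the behavior requested in item 2 concrete, here is a simplified sketch of the intended control flow. It is not flashlight's actual DynamicScaler: the class name, constants, and update interval are illustrative, and the invalid-gradient flag, which the real implementation keeps on the device, is abstracted as a plain bool.

```cpp
#include <algorithm>

// Illustrative stand-in for fl::kAmpMinimumScaleFactorValue; the real value
// lives in flashlight's configuration.
constexpr double kAmpMinimumScaleFactorValue = 0.25;

class DynamicScalerSketch {
 public:
  // Returns false (so the caller skips the optimizer step) when the last
  // backward pass produced non-finite gradients; otherwise it may grow the
  // scale factor after a streak of successful iterations.
  bool adjustScaleFactor() {
    if (isInvalid_) {
      isInvalid_ = false;
      // Halve the scale, but never below the minimum, to avoid an infinite
      // loop of ever-shrinking scale factors.
      scaleFactor_ = std::max(scaleFactor_ / 2.0, kAmpMinimumScaleFactorValue);
      return false;
    }
    if (++successCount_ >= updateInterval_) {
      successCount_ = 0;
      // Grow the scale, but cap it at maxScaleFactor_.
      scaleFactor_ = std::min(scaleFactor_ * 2.0, maxScaleFactor_);
    }
    return true;
  }

  void flagInvalid() { isInvalid_ = true; }  // set by the GPU-side check

 private:
  bool isInvalid_{false};
  double scaleFactor_{65536.0};
  double maxScaleFactor_{65536.0};
  unsigned successCount_{0};
  unsigned updateInterval_{2000};  // illustrative value
};
```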

mtmd added 2 commits June 16, 2021 14:45
…g optimizations:

1. cudnnFind is used instead of the flashlight benchmark to improve the performance.

2. A new data structure, Sample, is added. Sample transfers the data it contains to GPU memory asynchronously.
Moreover, transformations are performed after prefetch to ensure a single CPU thread handles GPU interactions, reducing context-switching and lock-acquisition overhead.

3. Checking for invalid values in gradients is now being performed on the GPU side.
…iew:

Limiting the first version of the PR to optimizations that do not alter the data loader.
Min and max scale factors are considered while scaling.
Revised adjustScaleFactor() so that it returns false when gradients are invalid.
configs.h moved under autograd.
@facebook-github-bot: @mtmd has updated the pull request. You must reimport the pull request before landing.

@mtmd (Contributor, Author) commented Jun 16, 2021

Thank you @xuqiantong.

Can you please include only the changes to Conv2D and DynamicScaler in this first version of PR?

Done. This decreases the training performance of ResNet-34 from 1507 fps (V100, 16 GB) to 1371 fps.

Because the changes to the dataset pipeline seem a bit hacky and will break all the other applications.

The changes didn't appear to be breaking. Can you please point out what they broke, so that I can be mindful of it in case we work more on the asynchronous aspects in the future?

We may make some redesign to it from our side later to fit your asynchronous optimizations.

Sounds great! Happy to be part of that effort.

We use fl::kAmpMinimumScaleFactorValue to avoid an infinite loop when scaleFactor keeps decreasing, as well as maxScaleFactor_ to limit its maximal value. Can you please bring them back in your logic?

Done.

Would it be better to change flag_ in DynamicScaler into a boolean array?

It seems that ArrayFire doesn't support a boolean device pointer (undefined reference to `bool* af::array::device() const'). Nonetheless, this doesn't have a measurable impact on performance, since loads operate at sector granularity.
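For illustration, a device-side invalid-value check along the lines of item 3 in the summary could look like the ArrayFire sketch below. This is an assumption about the approach rather than the actual flashlight code (anyInvalidGradient is a hypothetical helper); the accumulator stays a float array precisely because of the missing bool specialization mentioned above, and the host reads only a single scalar per iteration.

```cpp
#include <arrayfire.h>
#include <vector>

// Returns true if any gradient contains a NaN or an Inf. Everything up to
// the final scalar read stays on the device.
bool anyInvalidGradient(const std::vector<af::array>& grads) {
  af::array flag = af::constant(0.0f, 1, f32);
  for (const af::array& g : grads) {
    // isNaN/isInf produce boolean (b8) arrays; cast to f32 before summing.
    flag = flag + af::sum(af::flat(af::isNaN(g) || af::isInf(g)).as(f32));
  }
  return flag.scalar<float>() > 0.0f;  // one small device-to-host read per step
}
```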

Also, can we give it a more meaningful name, like isInvalidArray?

Done.

Can you make adjustScaleFactor() return false when flag_ is true?

Done. However, as we discussed in our last meeting, it decreases the performance by 2 fps (ResNet-34, V100, 16 GB).

Does it make more sense to place it in fl/autograd?

Sure. Done.

Why do you only index inputX and batchSize, rather than all 4 dimensions?

I want to keep the cache size and the lookup steps as small as possible to reduce the imposed lookup overhead. As a result, I only index the parameters that can change across iterations for each layer of a given neural network, namely inputX and batchSize.
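As an illustration of that caching scheme (the names ConvAlgoCache, lookup, and insert are hypothetical; this is not the actual configs.h code), the selected algorithm can be memoized per (inputX, batchSize) pair, since those are the only dimensions expected to vary across iterations for a fixed layer:

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical per-layer cache keyed only on the two dimensions that can
// change between iterations: the input width and the batch size.
struct ConvAlgoCache {
  static uint64_t key(int inputX, int batchSize) {
    // Pack both dimensions into a single 64-bit key for a cheap lookup.
    return (static_cast<uint64_t>(static_cast<uint32_t>(inputX)) << 32) |
           static_cast<uint32_t>(batchSize);
  }

  bool lookup(int inputX, int batchSize, int* algoOut) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.find(key(inputX, batchSize));
    if (it == cache_.end()) {
      return false;  // caller falls back to a one-time cudnnFind call
    }
    *algoOut = it->second;
    return true;
  }

  void insert(int inputX, int batchSize, int algo) {
    std::lock_guard<std::mutex> lock(mutex_);
    cache_[key(inputX, batchSize)] = algo;
  }

 private:
  std::unordered_map<uint64_t, int> cache_;
  std::mutex mutex_;
};
```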

Since you removed DynamicBenchmark, can we 100% trust the algo selected by the one-time cudnnFind?

DynamicBenchmark and cudnnFind operate analogously, hence the answer is yes. Nevertheless, please feel free to try other models of interest to verify that the latter yields satisfactory performance.

Please add the copyright header to all the newly added files.

I'm not sure if I'm authorized to do this. If you need me to work on it, let's talk about it on Workplace.

@facebook-github-bot: @xuqiantong has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@xuqiantong (Contributor) commented Jun 17, 2021

Hi @mtmd, thanks for the update! We are ready to commit it! @tlikhomanenko @jacobkahn can you accept the diff I imported and land it?

To answer your questions:

The changes didn't appear to be breaking.

For example, you preload two things asynchronously -- data and label -- in the PrefetchDataset. That only works for this image-classification task, where we merge them with MergeDataset, as here:

MergeDataset({imageDataset, labelDataset})

Overall, it's not trivial to add the asynchronous component, the Sample class, into the current dataset pipeline. We need to redesign it a bit.

from 1507 fps (V100, 16 GB) to 1371 fps.

Without changing the FL dataset pipeline, I think it is still worthwhile to keep your changes to the DistributedDataset, where transformations are performed after prefetch. Do you think we can get some improvement with that alone?

@mtmd (Contributor, Author) commented Jun 21, 2021

Thank you @xuqiantong! Sounds good.

Without changing the FL dataset pipeline, I think it is still worthwhile to keep your changes to the DistributedDataset, where transformations are performed after prefetch. Do you think we can get some improvement with that alone?

I think, given that you are considering redesigning the loader, we can skip this for now and incorporate it into the new design. The changes to the DistributedDataset are mainly beneficial if we refrain from accessing the GPU from individual threads. That, in turn, requires us to use Sample (which we decided not to include in the current pull request).
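For context, the idea behind Sample described in the original summary can be sketched roughly as follows: stage a batch in pinned host memory and copy it to the device asynchronously on a dedicated stream, so that a single CPU thread owns all GPU interactions. This is only an assumption-level sketch (SampleSketch and its methods are hypothetical names), not the actual implementation.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Minimal sketch of an asynchronously transferred batch.
struct SampleSketch {
  void* hostPinned{nullptr};
  void* device{nullptr};
  size_t bytes{0};
  cudaStream_t stream{nullptr};

  void init(size_t nbytes) {
    bytes = nbytes;
    cudaMallocHost(&hostPinned, bytes);  // pinned memory enables async copies
    cudaMalloc(&device, bytes);
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
  }

  // Called by the single prefetch thread after filling hostPinned;
  // returns immediately while the copy proceeds in the background.
  void copyToDeviceAsync() {
    cudaMemcpyAsync(device, hostPinned, bytes, cudaMemcpyHostToDevice, stream);
  }

  // Called right before the batch is consumed by the training loop.
  void wait() { cudaStreamSynchronize(stream); }

  void release() {
    cudaStreamDestroy(stream);
    cudaFree(device);
    cudaFreeHost(hostPinned);
  }
};
```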

@facebook-github-bot: @mtmd has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot: @xuqiantong has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@jacobkahn (Member) commented:
@mtmd, we'll get this merged in pretty soon; there are some broader changes to abstractions that will be helpful here to clean this up.

Would you also be able to submit a PR that contains the additions that aren't in this one (including the Sample abstractions, etc.)? I can begin to think about the best way to add those once that PR is up as well.

@mtmd (Contributor, Author) commented Jul 6, 2021

@mtmd, we'll get this merged in pretty soon; there are some broader changes to abstractions that will be helpful here to clean this up.

Would you also be able to submit a PR that contains the additions that aren't in this one (including the Sample abstractions, etc.)? I can begin to think about the best way to add those once that PR is up as well.

Sounds good @jacobkahn. Sure, will do :).

mtmd added a commit to mtmd/flashlight that referenced this pull request Aug 4, 2021
…data structure: Sample. Sample transfers the data it contains to GPU memory asynchronously.

Moreover, transformations are performed after prefetch to ensure a single CPU thread handles GPU interactions, reducing context-switching and lock-acquisition overhead.
This is a WIP. Please see flashlight#631 (comment)