
Enhancing the Performance of flashlight using cudnnFind, data-loader optimization, and control flow optimization #631

Open · wants to merge 3 commits into main
Conversation

@mtmd (Contributor) commented Jun 8, 2021

Original Issue: #630

Summary

This commit improves the performance of flashlight using the following optimizations:

  1. cudnnFind is used instead of flashlight's built-in benchmarking to select the convolution algorithms, improving performance (see the sketch after this list).

  2. A new data structure, Sample, is added. Sample transfers the data it contains to GPU memory asynchronously. Moreover, transformations are performed after prefetch so that a single CPU thread handles all GPU interactions, reducing context-switching and lock-acquisition overhead.

  3. Checking for invalid values in gradients is now being performed on the GPU side.
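For reference, the cuDNN-side selection in item 1 boils down to a call to cudnnFindConvolutionForwardAlgorithm, which actually times the candidate algorithms on the real tensor shapes. The snippet below is only an illustrative sketch (descriptor setup, error handling, and the hypothetical helper name pickForwardAlgo are not the exact flashlight code):

```cpp
#include <cudnn.h>

// Pick the fastest forward-convolution algorithm for the given descriptors.
// All descriptors are assumed to be created and configured by the caller.
cudnnConvolutionFwdAlgo_t pickForwardAlgo(
    cudnnHandle_t handle,
    cudnnTensorDescriptor_t xDesc,
    cudnnFilterDescriptor_t wDesc,
    cudnnConvolutionDescriptor_t convDesc,
    cudnnTensorDescriptor_t yDesc) {
  const int kRequested = 8;
  int returned = 0;
  cudnnConvolutionFwdAlgoPerf_t perf[kRequested];
  // cudnnFind* runs and times each candidate algorithm on the actual shapes
  // and returns the results sorted by execution time.
  cudnnFindConvolutionForwardAlgorithm(
      handle, xDesc, wDesc, convDesc, yDesc, kRequested, &returned, perf);
  return perf[0].algo;  // fastest algorithm for these exact dimensions
}
```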

Test Plan (required)

This pull request is expected to improve the performance of ResNet-34 on a V100 (16 GB), using a batch size of 128, from 1143 fps to 1507 fps. To test, please run:

bin/imgclass/fl_img_imagenet_resnet34 \
--data_dir=/path/to/your/ImageNet/folder \
--distributed_enable=false \
--exp_checkpoint_path=/tmp/test \
--logtostderr \
--fl_amp_use_mixed_precision=false \
--data_batch_size=128

@facebook-github-bot added the "CLA Signed" label on Jun 8, 2021
@facebook-github-bot: @xuqiantong has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@xuqiantong (Contributor) commented Jun 15, 2021

Hi @mtmd, thank you so much for your awesome work! We really appreciate your insights and your implementation in this PR! 👍

Below are some quick comments:

  1. Can you please include only the changes to Conv2D and DynamicScaler in this first version of the PR? Because the changes to the dataset pipeline seem a bit hacky and will break all the other applications. We may redesign it a bit on our side later to fit your asynchronous optimizations. Specifically, please postpone the changes related to the Sample class and the Datasets, but I think it is worthwhile and safe to keep "transformations are performed after prefetch".

  2. Regarding DynamicScaler (see the sketch after this list):

  • We use fl::kAmpMinimumScaleFactorValue to avoid an infinite loop when scaleFactor keeps decreasing, as well as maxScaleFactor_ to limit its maximal value. Can you please bring them back in your logic?
  • Would it be better to change flag_ in DynamicScaler into a boolean array? Also, can we give it a more meaningful name, like isInvalidArray?
  • Can you make adjustScaleFactor() return false when flag_ is true?

  3. Regarding configs.h:

  • Does it make more sense to place it in fl/autograd?
  • Why do you only index inputX and batchSize, rather than all 4 dimensions?
  • Since you removed DynamicBenchmark, can we 100% trust the algo selected by the one-time cudnnFind?

  4. Please add the copyright header to all the newly added files: https://github.com/flashlight/flashlight/blob/master/flashlight/app/benchmark/ModelBenchmarker.h#L1-L6
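To make the behavior requested in item 2 concrete, here is a simplified sketch of the intended control flow. It is not flashlight's actual DynamicScaler: the class name, constants, and update interval are illustrative, and the invalid-gradient flag, which the real implementation keeps on the device, is abstracted as a plain bool.

```cpp
#include <algorithm>

// Illustrative stand-in for fl::kAmpMinimumScaleFactorValue; the real value
// lives in flashlight's configuration.
constexpr double kAmpMinimumScaleFactorValue = 0.25;

class DynamicScalerSketch {
 public:
  // Returns false (so the caller skips the optimizer step) when the last
  // backward pass produced non-finite gradients; otherwise it may grow the
  // scale factor after a streak of successful iterations.
  bool adjustScaleFactor() {
    if (isInvalid_) {
      isInvalid_ = false;
      // Halve the scale, but never below the minimum, to avoid an infinite
      // loop of ever-shrinking scale factors.
      scaleFactor_ = std::max(scaleFactor_ / 2.0, kAmpMinimumScaleFactorValue);
      return false;
    }
    if (++successCount_ >= updateInterval_) {
      successCount_ = 0;
      // Grow the scale, but cap it at maxScaleFactor_.
      scaleFactor_ = std::min(scaleFactor_ * 2.0, maxScaleFactor_);
    }
    return true;
  }

  void flagInvalid() { isInvalid_ = true; }  // set by the GPU-side check

 private:
  bool isInvalid_{false};
  double scaleFactor_{65536.0};
  double maxScaleFactor_{65536.0};
  unsigned successCount_{0};
  unsigned updateInterval_{2000};  // illustrative value
};
```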

mtmd added 2 commits June 16, 2021 14:45
…g optimizations:

1. cudnnFind is used instead of the flashlight benchmark to improve the performance.

2. A new data structure, Sample, is added. Sample transfers the data it contains to GPU memory asynchronously.
Moreover, transformations are performed after prefetch to ensure a single CPU thread handles GPU interactions, reducing context-switching and lock-acquisition overhead.

3. Checking for invalid values in gradients is now being performed on the GPU side.
…iew:

Limiting the first version of the PR to optimizations that do not alter the data loader.
Min and max scale factors are considered while scaling.
Revised adjustScaleFactor() so that it returns false when gradients are invalid.
configs.h moved under autograd.
@facebook-github-bot: @mtmd has updated the pull request. You must reimport the pull request before landing.

@mtmd (Contributor, Author) commented Jun 16, 2021

Thank you @xuqiantong.

Can you please include only the changes to Conv2D and DynamicScaler in this first version of PR?

Done. This decreases the training performance of ResNet-34 from 1507 fps (V100, 16 GB) to 1371 fps.

Because the changes to the dataset pipeline seem a bit hacky and will break all the other applications.

The changes didn't appear to be breaking. Can you please point out what they broke, so that I can be mindful of it in case we work more on the asynchronous aspects in the future?

We may make some redesign to it from our side later to fit your asynchronous optimizations.

Sounds great! Happy to be part of that effort.

We use fl::kAmpMinimumScaleFactorValue to avoid an infinite loop when scaleFactor keeps decreasing, as well as maxScaleFactor_ to limit its maximal value. Can you please bring them back in your logic?

Done.

Would it be better to change flag_ in DynamicScaler into a boolean array?

It seems that ArrayFire doesn't support a boolean device pointer (undefined reference to `bool* af::array::device() const'). Nonetheless, this doesn't have a measurable impact on performance, since loads operate at sector granularity.
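For illustration, a device-side invalid-value check along the lines of item 3 in the summary could look like the ArrayFire sketch below. This is an assumption about the approach rather than the actual flashlight code (anyInvalidGradient is a hypothetical helper); the accumulator stays a float array precisely because of the missing bool specialization mentioned above, and the host reads only a single scalar per iteration.

```cpp
#include <arrayfire.h>
#include <vector>

// Returns true if any gradient contains a NaN or an Inf. Everything up to
// the final scalar read stays on the device.
bool anyInvalidGradient(const std::vector<af::array>& grads) {
  af::array flag = af::constant(0.0f, 1, f32);
  for (const af::array& g : grads) {
    // isNaN/isInf produce boolean (b8) arrays; cast to f32 before summing.
    flag = flag + af::sum(af::flat(af::isNaN(g) || af::isInf(g)).as(f32));
  }
  return flag.scalar<float>() > 0.0f;  // one small device-to-host read per step
}
```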

Also, can we give it a more meaningful name, like isInvalidArray?

Done.

Can you make adjustScaleFactor() return false when flag_ is true?

Done. However, as we discussed in our last meeting, it decreases the performance by 2 fps (ResNet-34, V100, 16 GB).

Does it make more sense to place it in fl/autograd?

Sure. Done.

Why do you only index inputX and batchSize, rather than all 4 dimensions?

I want to keep the cache size and the lookup steps as small as possible to reduce the imposed lookup overhead. As a result, I only index the parameters that can change across iterations for each layer of a given neural network, namely inputX and batchSize.
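As an illustration of that caching scheme (the names ConvAlgoCache, lookup, and insert are hypothetical; this is not the actual configs.h code), the selected algorithm can be memoized per (inputX, batchSize) pair, since those are the only dimensions expected to vary across iterations for a fixed layer:

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical per-layer cache keyed only on the two dimensions that can
// change between iterations: the input width and the batch size.
struct ConvAlgoCache {
  static uint64_t key(int inputX, int batchSize) {
    // Pack both dimensions into a single 64-bit key for a cheap lookup.
    return (static_cast<uint64_t>(static_cast<uint32_t>(inputX)) << 32) |
           static_cast<uint32_t>(batchSize);
  }

  bool lookup(int inputX, int batchSize, int* algoOut) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.find(key(inputX, batchSize));
    if (it == cache_.end()) {
      return false;  // caller falls back to a one-time cudnnFind call
    }
    *algoOut = it->second;
    return true;
  }

  void insert(int inputX, int batchSize, int algo) {
    std::lock_guard<std::mutex> lock(mutex_);
    cache_[key(inputX, batchSize)] = algo;
  }

 private:
  std::unordered_map<uint64_t, int> cache_;
  std::mutex mutex_;
};
```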

Since you removed DynamicBenchmark, can we 100% trust the algo selected by the one-time cudnnFind?

DynamicBenchmark and cudnnFind operate analogously, hence the answer is yes. Nevertheless, please feel free to try other models of interest to verify that the latter yields satisfactory performance.

Please add the copyright header to all the newly added files.

I'm not sure if I'm authorized to do this. If you need me to work on it, let's talk about it on Workplace.

@facebook-github-bot: @xuqiantong has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@xuqiantong (Contributor) commented Jun 17, 2021

Hi @mtmd, thanks for the update! We are ready to commit it! @tlikhomanenko @jacobkahn can you accept the diff I imported and land it?

To answer your questions:

The changes didn't appear to be breaking.

For example, you preload two things asynchronously -- data and label -- in the PrefetchDataset. That only works for this image-classification task, where we merge them with MergeDataset, as here:

MergeDataset({imageDataset, labelDataset})

Overall, it's not trivial to add the asynchronous component, the Sample class, into the current dataset pipeline. We need to redesign it a bit.

from 1507 fps (V100, 16 GB) to 1371 fps.

Without changing the FL dataset pipeline, I think it is still worthwhile to keep your changes to the DistributedDataset, where transformations are performed after prefetch. Do you think we can get some improvement with that alone?

@mtmd (Contributor, Author) commented Jun 21, 2021

Thank you @xuqiantong! Sounds good.

Without changing the FL dataset pipeline, I think it is still worthwhile to keep your changes to the DistributedDataset, where transformations are performed after prefetch. Do you think we can get some improvement with that alone?

I think, given that you are considering redesigning the loader, we can skip this for now and incorporate it into the new design. The changes to the DistributedDataset are mainly beneficial if we refrain from accessing the GPU from individual threads. That, in turn, requires us to use Sample (which we decided not to include in the current pull request).
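For context, the idea behind Sample described in the original summary can be sketched roughly as follows: stage a batch in pinned host memory and copy it to the device asynchronously on a dedicated stream, so that a single CPU thread owns all GPU interactions. This is only an assumption-level sketch (SampleSketch and its methods are hypothetical names), not the actual implementation.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Minimal sketch of an asynchronously transferred batch.
struct SampleSketch {
  void* hostPinned{nullptr};
  void* device{nullptr};
  size_t bytes{0};
  cudaStream_t stream{nullptr};

  void init(size_t nbytes) {
    bytes = nbytes;
    cudaMallocHost(&hostPinned, bytes);  // pinned memory enables async copies
    cudaMalloc(&device, bytes);
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
  }

  // Called by the single prefetch thread after filling hostPinned;
  // returns immediately while the copy proceeds in the background.
  void copyToDeviceAsync() {
    cudaMemcpyAsync(device, hostPinned, bytes, cudaMemcpyHostToDevice, stream);
  }

  // Called right before the batch is consumed by the training loop.
  void wait() { cudaStreamSynchronize(stream); }

  void release() {
    cudaStreamDestroy(stream);
    cudaFree(device);
    cudaFreeHost(hostPinned);
  }
};
```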

@facebook-github-bot: @mtmd has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot: @xuqiantong has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@jacobkahn (Member) commented:
@mtmd, we'll get this merged in pretty soon; there are some broader changes to abstractions that will be helpful here to clean this up.

Would you also be able to submit a PR that contains the additions that aren't in this one (including the Sample abstractions, etc.)? I can begin to think about the best way to add those once that PR is up as well.

@mtmd (Contributor, Author) commented Jul 6, 2021

@mtmd, we'll get this merged in pretty soon; there are some broader changes to abstractions that will be helpful here to clean this up.

Would you also be able to submit a PR that contains the additions that aren't in this one (including the Sample abstractions, etc.)? I can begin to think about the best way to add those once that PR is up as well.

Sounds good @jacobkahn. Sure, will do :).

mtmd added a commit to mtmd/flashlight that referenced this pull request Aug 4, 2021
…data structure: Sample. Sample transfers the data it contains to GPU memory asynchronously.

Moreover, transformations are performed after prefetch to ensure a single CPU thread handles GPU interactions, reducing context-switching and lock-acquisition overhead.
This is a WIP. Please see flashlight#631 (comment)