Enhancing the Performance of flashlight using cudnnFind, data-loader optimization, and control flow optimization #631
base: main
Conversation
@xuqiantong has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Hi @mtmd, thank you so much for your awesome work! We really appreciate your insights and your implementation in this PR! 👍 Below are some quick comments:
…g optimizations: 1. cudnnFind is used instead of the flashlight benchmark to improve performance. 2. A new data structure, Sample, is added. Sample transfers the data that it contains to GPU memory asynchronously. Moreover, transformations are performed after prefetch to ensure a single CPU thread deals with GPU interactions, reducing context-switching and lock-acquisition overhead. 3. Checking for invalid values in gradients is now performed on the GPU side.
…iew: Limiting the first version of the PR to optimizations that do not alter the data loader. Min and max scale factors are considered while scaling. Revised adjustScaleFactor() so that it returns false when gradients are invalid. configs.h moved under autograd.
06847a2 to ff35109
@mtmd has updated the pull request. You must reimport the pull request before landing.
Thank you @xuqiantong.
Done. This decreases the performance of training ResNet-34 from 1507 fps (V100, 16 GB) to 1371 fps.
The changes didn't appear to be breaking. Can you please point out what they broke so that I can be mindful of it in case we work more on asynchronous aspects in future?
Sounds great! Happy to be part of that effort.
Done.
It seems like ArrayFire doesn't support a boolean device pointer (undefined reference to `bool* af::array::device() const`). Nonetheless, this doesn't have a measurable impact on performance, since loads operate at sector granularity.
Done.
Done. However, as we discussed in our last meeting, it decreases the performance by 2 fps (ResNet-34, V100, 16 GB).
Sure. Done.
I want to keep the size of the cache and the lookup steps as small as possible to reduce the imposed lookup overhead. As a result, I only index parameters that can change in each layer of a given neural network across different iterations. That includes
I'm not sure if I'm authorized to do this. If you need me to work on it, let's talk about it on Workplace.
@xuqiantong has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Hi @mtmd, thanks for the update! We are ready to commit it! @tlikhomanenko @jacobkahn can you accept the diff I imported and land it? To answer your questions:
For example, you preload two things asynchronously --
Overall, it's not trivial to add in the asynchronous component, class "Sample", in the current dataset pipeline. We need to redesign it a bit.
Without changing the FL dataset pipeline, I think it is still worth keeping your changes to the …
Thank you @xuqiantong! Sounds good.
I think, given that you are considering redesigning the loader, we can skip this for now and incorporate it in the new design. Changes in the …
@mtmd has updated the pull request. You must reimport the pull request before landing.
@xuqiantong has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@mtmd: we'll get this merged pretty soon. There are some broader changes to abstractions that will be helpful here to clean this up. Would you be able to also submit a PR that contains the additions that aren't in this PR (including …)?
Sounds good @jacobkahn. Sure, will do :).
…data structure: Sample. Sample transfers the data that it contains to the GPU memory in an asynchronous fashion. Moreover, transformations are performed after prefetch to ensure a single CPU thread deals with GPU interactions, hence, reducing the context switching and lock acquisition overhead. This is a WIP. Please see flashlight#631 (comment)
Original Issue: #630
Summary
This commit improves the performance of flashlight using the following optimizations:
cudnnFind is used instead of the flashlight benchmark to improve the performance.
A new data structure, Sample, is added. Sample transfers the data that it contains to GPU memory asynchronously. Moreover, transformations are performed after prefetch to ensure a single CPU thread deals with GPU interactions, reducing context-switching and lock-acquisition overhead.
Checking for invalid values in gradients is now performed on the GPU side.
Test Plan (required)
This pull request is expected to improve the performance of ResNet-34 (V100, 16 GB, batch size 128) from 1143 fps to 1507 fps. To test, please run: