First run slow with CUDA #262

Open
kai-lan opened this issue Aug 24, 2023 · 3 comments

kai-lan commented Aug 24, 2023

I am using the CUDA backend with pretty much the same settings as the Poisson tutorial, and I tested on the same example: https://sparse.tamu.edu/FEMLAB/poisson3Db. I set up a profiler that times the setup and solve phases. However, setup is very slow on the first run and fast from then on, while solve time is consistent across all runs.

But in the Poisson tutorial, the profiled time is always consistent.

Also, what is "self" in the profiler output?

First time:


[AMGCL solver:     1.539 s] (100.00%)
[ self:            0.280 s] ( 18.16%)
[  read:           1.141 s] ( 74.13%)
[  setup:          0.111 s] (  7.22%)
[  solve:          0.007 s] (  0.48%)

From then on:

[AMGCL solver:     1.200 s] (100.00%)
[ self:            0.055 s] (  4.54%)
[  read:           1.039 s] ( 86.60%)
[  setup:          0.099 s] (  8.25%)
[  solve:          0.007 s] (  0.60%)

ddemidov commented Aug 25, 2023

That is a known issue/normal behavior. The first run is "warm up", when the driver does things like kernel compilation for your specific device, caching etc. The numbers in the tutorial are all from the second or later runs.
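A minimal sketch of how such a warm-up could be kept out of the reported numbers: run the workload a few times untimed, then time the remaining runs. The helper name `time_after_warmup` and its parameters are hypothetical and not part of AMGCL; the sketch uses only the C++ standard library.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical helper (not an AMGCL API): calls `work` `warmup` times
// without timing, then returns the wall-clock seconds of `runs` timed calls.
std::vector<double> time_after_warmup(const std::function<void()>& work,
                                      int warmup, int runs) {
    assert(warmup >= 0 && runs >= 0);
    using Clock = std::chrono::steady_clock;

    // Untimed calls absorb one-time costs such as driver initialization,
    // kernel compilation for the device, and caching.
    for (int i = 0; i < warmup; ++i) work();

    std::vector<double> seconds;
    seconds.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto start = Clock::now();
        work();
        const std::chrono::duration<double> elapsed = Clock::now() - start;
        seconds.push_back(elapsed.count());
    }
    return seconds;
}
```

When the timed work launches GPU kernels, the callable would presumably also need to wait for the device to finish (for example with `cudaDeviceSynchronize()`) before returning, since kernel launches are asynchronous.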

The "self" portion of the profile is anything that belongs to the outer item, but is not enclosed by any of the inner items. Here is a possible example:

prof.tic("outer");
foo(); // will be recorded as "outer.self"
prof.tic("inner");
bar();
prof.toc("inner");
prof.toc("outer");
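The same rule can be checked against the numbers posted above: "self" is the total of an item minus the totals of its direct children. The sketch below is a stand-alone illustration; `ProfileItem`, `self_time`, and `first_run` are hypothetical names, not AMGCL types.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of one profiler entry (not the AMGCL data structure).
struct ProfileItem {
    std::string name;
    double total;                       // seconds between tic and toc
    std::vector<ProfileItem> children;  // nested tic/toc pairs
};

// "self" is the time spent inside the item that no nested item covers.
double self_time(const ProfileItem& item) {
    double nested = 0.0;
    for (const auto& child : item.children) nested += child.total;
    assert(nested <= item.total + 1e-9);
    return item.total - nested;
}

// First-run figures copied from the profile in the original question.
ProfileItem first_run() {
    return {"AMGCL solver", 1.539,
            {{"read", 1.141, {}}, {"setup", 0.111, {}}, {"solve", 0.007, {}}}};
}
```

For the first run this gives 1.539 - (1.141 + 0.111 + 0.007) = 0.280 s, and for later runs 1.200 - (1.039 + 0.099 + 0.007) = 0.055 s, matching the reported "self" lines. In the posted profiles, the largest first-run difference is in "self" rather than in "setup".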


kai-lan commented Aug 25, 2023

Thanks for your reply. I have a follow-up question: how can I access the total runtime from the profiler? I want to use it as the return value of my method.

ddemidov commented

profile.toc() returns the time elapsed since the initiating tic, so you could use that. There is no method to get the total time across all iterations, so you could either accumulate it yourself or possibly make a PR with that functionality.
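A minimal sketch of the "accumulate it yourself" option, assuming only that toc returns the elapsed seconds as described above. `Stopwatch` is a hypothetical stand-in written with the standard library, not the AMGCL profiler, and `total_runtime` is an illustrative name.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for a profiler whose toc() returns the seconds
// elapsed since the matching tic(); this is not the AMGCL class.
class Stopwatch {
public:
    void tic() {
        start_ = Clock::now();
        running_ = true;
    }
    double toc() {
        assert(running_);  // toc() must follow a tic()
        running_ = false;
        const std::chrono::duration<double> elapsed = Clock::now() - start_;
        return elapsed.count();
    }

private:
    using Clock = std::chrono::steady_clock;
    Clock::time_point start_{};
    bool running_ = false;
};

// Runs `solve_once` `iterations` times and returns the summed wall time,
// which a method could then hand back as its return value.
template <class F>
double total_runtime(F&& solve_once, int iterations) {
    Stopwatch watch;
    double total = 0.0;
    for (int i = 0; i < iterations; ++i) {
        watch.tic();
        solve_once();
        total += watch.toc();  // accumulate each tic/toc interval
    }
    return total;
}
```

With the real profiler, the same pattern would presumably amount to adding up the value returned by each toc call in a variable owned by the caller.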
