Significant performance overhead when timing requires a syscall on Linux #270
Comments
I'm a bit confused: do you think the CPython upgrade caused the performance regression? Or was it always there?

Another thing that has changed between these runs is the GitHub Actions runner image. But there could also be other factors, like the specific GHA runner that actually executed the job. If you can give me a local recreation of the performance issue, I can debug it, though. If it's 1.5s when running with
I did some more digging and the culprit seems to be the sympy calls.
The sympy calls are really not that performance-intensive; in the non-instrumented run they take almost no time at all.

I will try to create a simple reproducer. Should not be too hard.
All right, here is a reproducer (sympy 1.12, pyinstrument 4.5.3, Python 3.11.5):

```python
from datetime import datetime

from sympy import sympify, FF, symbols, Poly, Rational


def do_thing():
    # Some elliptic curve crypto stuff that is not important
    field = 0xFFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFF
    params = {
        "a": 0xFFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFC,
        "b": 0x5AC635D8AA3A93E7B3EBBD55769886BC651D06B0CC53B0F63BCE3C3E27D2604B,
    }
    k = FF(field)
    expr = sympify("3*b - b3", evaluate=False)
    for curve_param, value in params.items():
        expr = expr.subs(curve_param, k(value))
    param = str(expr.free_symbols.pop())

    def resolve(expression, k):
        if not expression.args:
            return expression
        args = []
        for arg in expression.args:
            if isinstance(arg, Rational):
                a = arg.p
                b = arg.q
                res = k(a) / k(b)
            else:
                res = resolve(arg, k)
            args.append(res)
        return expression.func(*args)

    expr = resolve(expr, k)
    poly = Poly(expr, symbols(param), domain=k)
    roots = poly.ground_roots()
    for root in roots:
        params[param] = int(root)
        break


if __name__ == "__main__":
    start = datetime.now()
    for _ in range(1000):
        do_thing()
    end = datetime.now()
    print("This took", end - start)
```
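As a side note, for benchmark timing like this, `time.perf_counter` is usually preferred over `datetime.now`, since it is monotonic and has higher resolution. A minimal sketch (the workload here is a hypothetical stand-in, not the sympy code above):

```python
import time


def do_thing():
    # Hypothetical stand-in workload, just to have something to time.
    sum(range(1000))


# perf_counter is a monotonic, high-resolution clock intended for
# measuring elapsed time; datetime.now() is a wall clock and can jump.
start = time.perf_counter()
for _ in range(1000):
    do_thing()
elapsed = time.perf_counter() - start
print(f"This took {elapsed:.6f} s")
```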
Running the reproducer above with a varying number of iterations leads to this runtime graph. My conclusion from that is:
This is highly platform-dependent, I think. Running 1000 iterations on my Mac, I get:

```
$ python examples/sympy_perf.py
This took 0:00:00.403314
$ pyinstrument examples/sympy_perf.py
This took 0:00:00.845693
```

Which admittedly is a 100% overhead, but nothing like the kind of deltas you're getting here. What platform are you running on? We have had reports of stuff like this in Docker containers, for example.
Having this reproducer is really useful, thank you for that. I think there are some low-level things that can improve performance, once I know what platform you're on.
I'm assuming you're running this on Linux? I want to test a theory: could you run this code and tell me the output?

```c
#include <stdio.h>
#include <elf.h>
#include <sys/auxv.h>

int main() {
    unsigned long vdso_start = getauxval(AT_SYSINFO_EHDR);
    if (vdso_start) {
        printf("vDSO is active at address: %lx\n", vdso_start);
    } else {
        printf("vDSO does not seem to be active.\n");
    }
    return 0;
}
```

Save that to a file, then compile and run it. The vDSO is a mechanism where the kernel makes the timing function callable directly from userspace, avoiding the cost of a syscall.
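Roughly the same probe can be done without a C toolchain via `ctypes`, as a sketch. This assumes Linux with glibc 2.16 or later (which provides `getauxval`); `AT_SYSINFO_EHDR` is 33 in the Linux auxv ABI:

```python
import ctypes

# AT_SYSINFO_EHDR == 33 in the Linux auxv ABI (from <elf.h>);
# getauxval() returns the vDSO's base address for this key.
AT_SYSINFO_EHDR = 33

vdso_start = None
libc = ctypes.CDLL(None)
if hasattr(libc, "getauxval"):  # glibc >= 2.16, Linux only
    libc.getauxval.restype = ctypes.c_ulong
    libc.getauxval.argtypes = [ctypes.c_ulong]
    vdso_start = libc.getauxval(AT_SYSINFO_EHDR)

if vdso_start:
    print(f"vDSO is active at address: {vdso_start:x}")
else:
    print("vDSO does not seem to be active (or getauxval is unavailable).")
```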
Interesting, I am running Arch Linux (6.5.5 kernel) on x86_64, but I also saw this in GitHub CI. I am not sure what other details are important. By the way, I think a simpler reproducer is the following:

```python
from datetime import datetime


def func():
    pass


def do_func_loop():
    for _ in range(6000):
        func()


if __name__ == "__main__":
    start = datetime.now()
    for _ in range(1000):
        do_func_loop()
    end = datetime.now()
    print((end - start).total_seconds())
```

I get:
Did this, said I could perhaps try measuring performance of the

Yeah, this would be the next step, I suppose. There are also other ways to get the time on POSIX, like
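The relative cost of the clock APIs that Python itself exposes can be compared with `timeit`, as a rough sketch. `time.clock_gettime` and the `CLOCK_*` constants are POSIX-only, so they are guarded with `hasattr`:

```python
import time
import timeit

N = 100_000

# Time N back-to-back calls of each clock function and report
# the approximate per-call cost in nanoseconds.
clocks = {
    "time.perf_counter": time.perf_counter,
    "time.monotonic": time.monotonic,
    "time.time": time.time,
}

# clock_gettime and the coarse clock are POSIX/Linux-specific.
if hasattr(time, "clock_gettime"):
    clocks["clock_gettime(CLOCK_MONOTONIC)"] = (
        lambda: time.clock_gettime(time.CLOCK_MONOTONIC)
    )
    if hasattr(time, "CLOCK_MONOTONIC_COARSE"):
        clocks["clock_gettime(CLOCK_MONOTONIC_COARSE)"] = (
            lambda: time.clock_gettime(time.CLOCK_MONOTONIC_COARSE)
        )

results = {}
for name, fn in clocks.items():
    per_call_ns = timeit.timeit(fn, number=N) / N * 1e9
    results[name] = per_call_ns
    print(f"{name}: {per_call_ns:.0f} ns/call")
```

On a machine where the vDSO fast path works, all of these should land in the tens of nanoseconds; a syscall-bound clock shows up as hundreds of nanoseconds or more per call.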
Here's a script:

```c
#include <time.h>
#include <stdio.h>
#include <sys/time.h>

typedef enum {
    CT_BASELINE,
    CT_GETTIMEOFDAY,
    CT_MONOTONIC,
    CT_MONOTONIC_RAW,
    CT_CLOCK,
    CT_REALTIME,
    CT_MONOTONIC_COARSE,
} ClockType;

static volatile double dummy_clock = 0.0;

static double
floatclock(ClockType clock_type)
{
    switch (clock_type) {
    case CT_BASELINE:
        return dummy_clock;
    case CT_GETTIMEOFDAY: {
        struct timeval t;
        gettimeofday(&t, (struct timezone *)NULL);
        return (double)t.tv_sec + t.tv_usec*0.000001;
    }
    case CT_MONOTONIC: {
        struct timespec t;
        clock_gettime(CLOCK_MONOTONIC, &t);
        return (double)t.tv_sec + t.tv_nsec*0.000000001;
    }
    case CT_MONOTONIC_RAW: {
        struct timespec t;
        clock_gettime(CLOCK_MONOTONIC_RAW, &t);
        return (double)t.tv_sec + t.tv_nsec*0.000000001;
    }
    case CT_CLOCK:
        return (double)clock() / CLOCKS_PER_SEC;
    case CT_REALTIME: {
        struct timespec t;
        clock_gettime(CLOCK_REALTIME, &t);
        return (double)t.tv_sec + t.tv_nsec*0.000000001;
    }
    case CT_MONOTONIC_COARSE: {
#ifdef CLOCK_MONOTONIC_COARSE
        struct timespec t;
        clock_gettime(CLOCK_MONOTONIC_COARSE, &t);
        return (double)t.tv_sec + t.tv_nsec*0.000000001;
#endif
    }
    }
    return 0.0;
}

static double timing_overhead(ClockType clock_type) {
    int NUM_ITERATIONS = 1000000;
    double counter = 0.0;
    double start = floatclock(CT_MONOTONIC);
    for (int i = 0; i < NUM_ITERATIONS; i++) {
        counter += floatclock(clock_type);
    }
    double end = floatclock(CT_MONOTONIC);
    if (counter == 0.0) {
        printf("!no-op! ");
    }
    return end - start;
}

int main()
{
    printf("Baseline: ");
    printf("%.3f\n", timing_overhead(CT_BASELINE));
    printf("gettimeofday: ");
    printf("%.3f\n", timing_overhead(CT_GETTIMEOFDAY));
    printf("clock_gettime(CLOCK_MONOTONIC): ");
    printf("%.3f\n", timing_overhead(CT_MONOTONIC));
    printf("clock_gettime(CLOCK_MONOTONIC_RAW): ");
    printf("%.3f\n", timing_overhead(CT_MONOTONIC_RAW));
    // clock isn't interesting imo
    // printf("clock(): ");
    // printf("%.3f\n", timing_overhead(CT_CLOCK));
    printf("clock_gettime(CLOCK_REALTIME): ");
    printf("%.3f\n", timing_overhead(CT_REALTIME));
    printf("clock_gettime(CLOCK_MONOTONIC_COARSE): ");
    printf("%.3f\n", timing_overhead(CT_MONOTONIC_COARSE));
    return 0;
}
```

This is what it prints on my system (macOS 13.4, arm64 M1 Max):

```
$ ./a.out
Baseline: !no-op! 0.007
gettimeofday: 0.030
clock_gettime(CLOCK_MONOTONIC): 0.031
clock_gettime(CLOCK_MONOTONIC_RAW): 0.023
clock(): 0.288
clock_gettime(CLOCK_REALTIME): 0.022
clock_gettime(CLOCK_MONOTONIC_COARSE): !no-op! 0.003
```

In an aarch64 ubuntu Docker container on the same machine:
I've tried the above script on a few machines. My conclusion is that it's not as simple as the existence of the vDSO; it always seems to be there in my testing. But the vDSO will make a syscall if it doesn't support the current clock source, and if it makes a syscall, it's really slow. You can inspect the clock source like this:

```
$ # get the clock source
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
arch_sys_counter
$ # get available clock sources on this machine
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
arch_sys_counter
$ # I believe you can set the current clock source using this:
$ # echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
```

In my testing, running Docker containers on arm64 uses a clock source
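The same sysfs files can also be read programmatically, e.g. with a small Python helper (a sketch; these paths exist only on Linux, so the helper returns None elsewhere):

```python
from pathlib import Path
from typing import Optional

# Linux exposes the kernel clock source configuration under sysfs.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(name: str) -> Optional[str]:
    """Read one clocksource sysfs file, returning None if unavailable."""
    try:
        return (CLOCKSOURCE_DIR / name).read_text().strip()
    except OSError:
        return None


current = read_clocksource("current_clocksource")
available = read_clocksource("available_clocksource")
print("current clock source:", current)
print("available clock sources:", available)
```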
I get the following:
I get:
Tested this with just 1000 iterations, I get:
So it seems all except the

I found two nice blogposts discussing the situation; apparently it has affected quite a few projects before. This post has some code for benchmarking the different clock sources. With it I get:
I tried changing the

@J08nY I've started some work to improve this here: #273. It's still very much a draft, but on the off-chance that you have time, I'm currently hitting a problem on Linux that it would be great to get some debugging help on. On macOS I can test bits of the C API from ctypes using
Perhaps

I will have a look at that PR.
Hi, I recently saw a performance regression while using pyinstrument. It seems to be connected to the 4.5.1 -> 4.5.3 update, but it might also be related to the Python version (3.9.17 to 3.9.18).

Update: This is likely related to sympy library use.

I have two GitHub CI runs which use pyinstrument to perf some code that uses sympy.

Good run

https://github.com/J08nY/pyecsca/actions/runs/6039852294/job/16389425413

In the good run, pyinstrument is 4.5.1 and Python is 3.9.17, and the logs look reasonable (look here for example, it takes about 1.2 seconds).

Bad run

https://github.com/J08nY/pyecsca/actions/runs/6247569280/job/16960430761

In the bad run, pyinstrument is 4.5.3 and Python is 3.9.18, and the logs look very long (look here for example, it now takes 13.6 seconds). I understand the logs are quite verbose since I have the `show_all` option enabled, but that does not explain the runtime increase.

Local test

Note that there are some additional calls to sympy added in commits between the two runs. However, I believe these are not the cause of the performance regression, since I can observe the regression locally.

Running the code without pyinstrument, I get about a 5 second runtime:

Running with pyinstrument, I get about 90 seconds:

Note that it seems the pyinstrument version does not matter, as I get roughly 90 seconds even if I use 4.3.0. I am using Python 3.11.5 (main, Sep 2 2023, 14:16:33) [GCC 13.2.1 20230801] locally.