malcpp-example-hello-malcpp giving bl_timeout non-deterministically #6
Comments
I tried increasing
I suppose my system has a higher "warm-up time" to bring down the latency of

Perhaps the library can just be configured to take more samples - maybe 1000?
Thanks for debugging. I don't have time to check right now, but that code seems to be applying some heuristics to detect clock latencies. On a quick test I couldn't reproduce. What seems obvious is that some of the heuristics there are flaky on your system.
This is old code, so even I don't remember what it does :)

From a quick read of the code, it seems that it first tries to calculate the "clock_gettime" latency. Then, with those same measurements, it tries to calculate the time of one tick of the "rdtsc" instruction.

The most common method I have seen is to sleep for an amount of time big enough that it is orders of magnitude bigger than "clock_gettime"'s latency and a context switch, and then to measure how long the sleep was both in monotonic time and in CPU clocks (rdtsc). This has the drawback that the amount of time to sleep is usually in the second range.

To avoid that, this code tries to get two measurements (reused from the clock_gettime latency calculation, to be faster) that are separated as much as possible and that are not suspect of having a context switch between the CPU timer measurements (rdtsc). The criterion used is that anything above "BL_TIME_CONTEXT_SWITCH_FACTOR" (5) times bigger than "timept64_latency" is considered as having a context switch between both CPU clock measurements, so you are probably onto something with the clock_gettime variability link. You could probably look at the variability you get on your machine.

Notice that you should be able to override all these preprocessor variables from the command line without changing the sources: https://mesonbuild.com/Builtin-options.html#compiler-options

So this can probably work (untested):
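The snippet referenced above did not survive extraction, so here is a hedged sketch of the idea (untested; "c_args" is meson's generic built-in compiler-arguments option, and the factor value 10 is just an example, not a recommendation from this project):

```shell
# Untested sketch: raise the context-switch rejection threshold from the
# command line via meson's built-in "c_args" option, without editing sources.
meson setup build -Dc_args="-DBL_TIME_CONTEXT_SWITCH_FACTOR=10"
ninja -C build
```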
I have to document these variables properly if I find some time. Ideas for faster and less flaky methods to do this are welcome.
Interesting! I've been looking deeper into this timing code... Specifically this section in
It looks like the timestamp for each log entry gets set in either

Later, in the above code snippet,
It seems like the current code treats this timestamp (read from

Which I have confirmed on my system, both using that calculation and by grepping the kernel logs to find the same.
This is to say, I think that code should be taking into account the TSC frequency.

There is also the issue of the validity of even using the TSC in the first place to measure time. In older architectures, it seems like the TSC was simply incremented on each tick (which can be impacted by CPUs going into sleep states or stopping altogether), but newer processors come with Invariant TSC, described in the Intel manual here:

This can also be checked with

On my platform these assumptions hold true, but it may be a good idea to validate them in the library as well. This link has a nice overview of the gotchas around using
I haven't dug into how this might best be resolved - I see that

Perhaps the
Notice the "bl_fast_timept_to_nsec" call on "src/malc/serialization.c:510". It is translated afterwards.

If my memory serves me right, on all modern enough Intel processors the clock runs at a constant frequency and is synchronized. If a C library for using the CPU counters existed, it would definitely be much better to use it - akin to what Windows QueryPerformanceCounter/Frequency does. Writing it myself would have been too much of an effort, especially for testing (QEMU could probably have helped, though).

Notice that, to allow fixing drift (if it's problematic) and translating monotonic to calendar time at postprocessing time, each rotated log file contains the EPOCH and the value of the CPU clock in its file name.
I see! I missed that section after the #ifdef - thank you! This makes more sense.
I was looking to patch the library so that the timestamps would automatically be in wall-clock time, to remove the need for further post-processing (I don't need that much performance). I ran the test code below and noticed that the conversion provided by
I needed to modify

I am attaching a patch file that adds a test to show the discrepancy, and also splits up the
The way I'd go about this, if I were to code it myself, would be adding a meson option. I have this project in maintenance-only mode, so I don't code new features, though I accept MRs. You can modify it like the diff below and pass -DBL_NO_CPU_TIMEPT to the compilation, I guess. That will make the "fast" clock fall back to the monotonic clock instead of rdtsc (untested).
Getting this error when I build and run "malcpp/hello-malcpp". The error does not happen every time - it seems to be somewhat random when running directly on the terminal, but doesn't ever seem to show up when running in GDB.
Program Output
Stack trace from the core dump:
Inspecting Frame 9:
Adding some debugging code, the issue seems to be here in "time_extras.c":
Things are running okay when running with GDB. Here are the variable values:
But if I print the same from the executable itself, when it fails it prints:
So for some reason, these "first" and "last" variables are not conforming to the logic when not running through GDB. Not entirely sure where to proceed with debugging here - I am not fully understanding the intention of this section of code anyway.
Any direction is appreciated!
Jay