
malcpp-example-hello-malcpp giving bl_timeout non-deterministically #6

Open
jay-sridharan opened this issue Jun 13, 2023 · 9 comments

Comments

@jay-sridharan

jay-sridharan commented Jun 13, 2023

Getting this error when I build and run malcpp/hello-malcpp.
The error does not happen every time - it seems somewhat random when running directly in the terminal, but it never
seems to show up when running under GDB.

Program Output

jsridhara@RSU-502C9R3:~/Documents/Projects/mini-async-log-c/build (master)$ ./malcpp-example-hello-malcpp
terminate called after throwing an instance of 'malcpp::exception'
  what():  Unable to create malc instance
Aborted (core dumped)

Stack trace from the core dump:

Core was generated by `./malcpp-example-hello-malcpp'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139653185956736) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139653185956736) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=139653185956736) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=139653185956736, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007f038a242476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007f038a2287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007f038a6a2bbe in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f038a6ae24c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f038a6ae2b7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007f038a6ae518 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00005638e58135fa in malcpp::malcpp<true, true, true>::construct_throw_impl (this=0x5638e5832048 <logger>, alloc=0x0)
    at ../include/malcpp/impl/malc_base.hpp:507
#10 0x00005638e5813307 in malcpp::malcpp<true, true, true>::malcpp (this=0x5638e5832048 <logger>, alloc=0x0)
    at ../include/malcpp/impl/malc_base.hpp:468
#11 0x00005638e5812e93 in __static_initialization_and_destruction_0 (__initialize_p=1, __priority=65535) at ../example/src/malcpp/hello-malcpp.cpp:7
#12 0x00005638e5812ff4 in _GLOBAL__sub_I_logger () at ../example/src/malcpp/hello-malcpp.cpp:24
#13 0x00007f038a229ebb in call_init (env=<optimized out>, argv=0x7fffac600468, argc=1) at ../csu/libc-start.c:145
#14 __libc_start_main_impl (main=0x5638e5812a7c <main(int, char const**)>, argc=1, argv=0x7fffac600468, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffac600458) at ../csu/libc-start.c:379
#15 0x00005638e5812805 in _start ()

Inspecting Frame 9:

(gdb) print e
$1 = {own = 4, sys = 0}

Adding some debugging code, the issue seems to be here in time_extras.c:

  if (first == -1 || last == -1 || last - first < BL_TIME_MIN_QSAMPLES) {
    return bl_mkerr (bl_timeout);
  }

Things run okay when running under GDB. Here are the variable values:

Breakpoint 1, bl_time_extras_get_ro_data (s=0x555555576070 <state+16>) at ../subprojects/base_library/src/bl/time_extras/time_extras.c:192
192	  t64_cycles  = meas[last].t[4] - meas[first].t[1];
$1 = 1  <--- first
$2 = 49 <--- last
$3 = 48 <--- (last-first)

But if I print the same from the executable itself, when it fails it prints:

jsridhara@RSU-502C9R3:~/Documents/Projects/mini-async-log-c/build (master)$ ./malcpp-example-hello-malcpp
37 <-- first
49 <-- last
12 <-- (last - first)

So for some reason, these first and last variables fail the check when not running through GDB.
I'm not entirely sure how to proceed with debugging here - I don't fully understand the intention of this section of code anyway.
Any direction is appreciated!

Jay

@jay-sridharan
Author

I tried increasing BL_TIME_QSAMPLES to 200, which significantly reduces the frequency of the error, but still sometimes fails.

jsridhara@RSU-502C9R3:~/Documents/Projects/mini-async-log-c/build (master)$ ./malcpp-example-hello-malcpp
first: 1
last: 199
00000101620.275182859[error]Hello malc, testing 1, 2, 3.0

jsridhara@RSU-502C9R3:~/Documents/Projects/mini-async-log-c/build (master)$ ./malcpp-example-hello-malcpp
first: 42
last: 199
00000101623.368342374[error]Hello malc, testing 1, 2, 3.0

jsridhara@RSU-502C9R3:~/Documents/Projects/mini-async-log-c/build (master)$ ./malcpp-example-hello-malcpp
first: 50
last: 199
00000101629.213781522[error]Hello malc, testing 1, 2, 3.0

jsridhara@RSU-502C9R3:~/Documents/Projects/mini-async-log-c/build (master)$ ./malcpp-example-hello-malcpp
terminate called after throwing an instance of 'malcpp::exception'
  what():  Unable to create malc instance
Aborted (core dumped)

My system seems to have a longer "warm-up time" before the latency of clock_gettime comes down.
https://stackoverflow.com/questions/53252050/why-does-the-call-latency-on-clock-gettimeclock-realtime-vary-so-much

Perhaps the library can just be configured to take more samples - maybe 1000?

@RafaGago
Owner

RafaGago commented Jun 14, 2023

Thanks for debugging.

I don't have time to check right now, but that code seems to be doing some heuristics to detect clock latencies. On a quick test I couldn't reproduce this.

What seems obvious is that some of those heuristics are flaky on your system.

@RafaGago
Owner

RafaGago commented Jun 14, 2023

This is old code, so I don't even myself remember what it does :)

From a quick read of the code, it seems that it first tries to calculate clock_gettime's latency.

Then with those same measurements it tries to calculate the time of one tick of the "rdtsc" instruction.

The most common method I have seen is to sleep for an amount of time that is orders of magnitude bigger than both clock_gettime's latency and a context switch, and then measure how long the sleep was both in monotonic time and in CPU clocks (rdtsc). This has the drawback that the amount of time to sleep is usually in the seconds range.

To avoid that, this code tries to get two measurements (reused from the clock_gettime latency calculation, to be faster) that are separated as much as possible and that are not suspected of having a context switch between the CPU timer measurements (rdtsc).

The criterion used is that anything bigger than "BL_TIME_CONTEXT_SWITCH_FACTOR" (5) times "timept64_latency" is considered to contain a context switch between both CPU clock measurements, so you are probably onto something with the clock_gettime variability link.

You could probably look at the variability you get on your machine.

Notice that you should be able to override all these preprocessor variables from the command line without changing the sources:

https://mesonbuild.com/Builtin-options.html#compiler-options

So this can probably work (untested):

meson -Dc_args="-DBL_TIME_CONTEXT_SWITCH_FACTOR=10" -Dcpp_args="-DBL_TIME_CONTEXT_SWITCH_FACTOR=10" ...

I'll have to document these variables properly if I find some time. Ideas for less flaky and faster methods to do this are welcome.

@jay-sridharan
Author

jay-sridharan commented Jun 14, 2023

Interesting! I've been looking deeper into this timing code... Specifically this section in malc.c

        err = deserializer_execute(
          &l->ds,
          ((u8*) n) + sizeof *n,
          ((u8*) n) + (slots * l->mem.cfg.slot_size),
          n->info.has_timestamp,
          l->alloc
          );
        if (!err.own) {
          log_entry le = deserializer_get_log_entry (&l->ds);
          malc_log_strings strs;
          bl_err entry_err = entry_parser_get_log_strings (&l->ep, &le, &strs);

It looks like the timestamp for each log entry gets set either in serializer_init or in deserializer_execute (depending on whether timestamping happens on the producer or the consumer side). Both of those functions end up calling bl_fast_timept_get_fast, which on my system calls the __rdtsc intrinsic. Everything makes sense so far.

Later, in the code snippet above, deserializer_get_log_entry is called; it copies the result of __rdtsc from the deserializer object (l->ds) to the log_entry object (le), which gets passed to entry_parser_get_log_strings, where the following lines are executed:

  snprintf(
    ep->timestamp,
    TSTAMP_INTEGER + 2,
    "%0" bl_pp_to_str (TSTAMP_INTEGER) FMT_U64 ".",
     e->timestamp / bl_nsec_in_sec
    );
  snprintf(
    &ep->timestamp[TSTAMP_INTEGER + 1],
    TSTAMP_DECIMAL + 1,
    "%0" bl_pp_to_str (TSTAMP_DECIMAL) FMT_U64,
    e->timestamp % bl_nsec_in_sec
    );

It seems like the current code treats this timestamp (read from __rdtsc) as if it were in units of nanoseconds, but that is not the case. For Intel architectures, the manual states the following:

[Intel SDM excerpt: deriving the TSC frequency from CPUID leaf 0x15]

Which I have confirmed on my system, both by doing that calculation and by grepping the kernel logs for the same value.

jsridhara@RSU-502C9R3:~/Documents/Projects/mini-async-log-c (master)$ cpuid -1 | grep -A2 0x15
   Time Stamp Counter/Core Crystal Clock Information (0x15):
      TSC/clock ratio = 166/2
      nominal core crystal clock = 38400000 Hz

38400000 hertz * (166 / 2) = 3.1872 gigahertz

jsridhara@RSU-502C9R3:~/Documents/Projects/mini-async-log-c (master)$ sudo journalctl --boot | grep 'kernel: tsc:' -i | cut -d' ' -f5-
kernel: tsc: Detected 3200.000 MHz processor
kernel: tsc: Detected 3187.200 MHz TSC

This is to say, I think that code should be taking the TSC frequency into account.

================

There is also the issue of whether it is even valid to use the TSC to measure time in the first place. On older architectures, the TSC was simply incremented on each tick (which can be impacted by CPUs entering sleep states or stopping altogether), but newer processors come with Invariant TSC, described in the Intel manual here:

[Intel SDM excerpt describing Invariant TSC]

This can also be checked with /proc/cpuinfo and cpuid - the presence of both the constant_tsc and nonstop_tsc flags implies an invariant TSC. This also assumes that the TSCs of all cores are synchronized at boot, which can be checked in /sys/devices/system/clocksource.

On my platform these assumptions hold, but it may be a good idea to validate them in the library as well. This link has a nice overview of the gotchas around using rdtsc.

@jay-sridharan
Author

I haven't dug into how this might best be resolved - I see that time.h declares a few functions like bl_timept_get_freq, which are called by functions declared in timept64_funcs_arbitrary_base.h.

Perhaps the timepoint_posix.h implementation needs to be based off ARBITRARY_BASE instead of NANOSECOND_BASE? Would really appreciate a pointer here!

@RafaGago
Owner

RafaGago commented Jun 15, 2023

Notice the "bl_fast_timept_to_nsec" call at "src/malc/serialization.c:510". The value is translated afterwards.

If my memory serves me right, on all modern enough Intel processors the clock runs at a constant frequency and is synchronized. If a C library for using the CPU counters existed, it would definitely be much better to use it - akin to what Windows' QueryPerformanceCounter/Frequency does. Writing it myself would have been too much of an effort, especially the testing (QEMU could probably have helped, though).

Notice that, to allow fixing drift (if it's problematic) and translating monotonic time to calendar time at postprocessing time, each rotated log file contains the EPOCH and the value of the CPU clock in its file name.

@jay-sridharan
Author

I see! I missed the section after the #ifdef - thank you! This makes more sense.
I am not sure if drift will be an issue... I don't know how much the TSC drifts over time.

@jay-sridharan
Author

jay-sridharan commented Jun 15, 2023

I was looking to patch the library so that the timestamps would automatically be in wall-clock time, to remove the need for further post processing (I don't need that much performance).

I ran the test code below and noticed that the conversion provided by bl_cpu_timept_to_sysclock64_diff_ns was not quite right. The conversion from the TSC showed a time about 20 seconds in the future.

      struct timespec ts;
      char buffer[256];
      size_t nbytes;
      struct tm* t;
      struct tm* t_tsc;

      printf ("Date Command:\n");
      system ("date");

      /* capture both clocks as close together as possible */
      bl_timept64 tsc_ns = bl_fast_timept_to_nsec (bl_fast_timept_get());
      clock_gettime (CLOCK_REALTIME, &ts);

      printf ("SysClock Conversion:\n");
      t = localtime (&ts.tv_sec);
      nbytes = strftime (buffer, 256, "%a %b %e %I:%M:%S.", t);
      snprintf (buffer + nbytes, 256 - nbytes, "%.9ld", ts.tv_nsec);
      printf ("%s\n", buffer);

      printf ("TSC Conversion:\n");
      bl_timept64 adj_time = tsc_ns + bl_cpu_timept_to_sysclock64_diff_ns();
      time_t epoch = adj_time / bl_nsec_in_sec;
      time_t nanos = adj_time % bl_nsec_in_sec;
      t_tsc = localtime (&epoch);
      nbytes = strftime (buffer, 256, "%a %b %e %I:%M:%S.", t_tsc);
      snprintf (buffer + nbytes, 256 - nbytes, "%.9ld", nanos);
      printf ("%s\n", buffer);

I needed to modify bl_cpu_timept_to_sysclock64_diff_ns to *subtract* state.syncdata.tcpu_to_timept64_ns instead of adding it for it to work correctly. Though, since clock_gettime(MONOTONIC) can have an arbitrary base, perhaps a more robust check on which value is greater is necessary, so that the offset can be properly added or subtracted.

I am attaching a patch file that adds a test showing the discrepancy, and also splits up the bl_get_timept64_to_sysclock64_diff_ns function so that the calibration routine is not executed on every call. Let me know what you think! There are some other minor formatting changes I made for myself that are not relevant to this discussion - please ignore those.

malcpp.patch.txt

@RafaGago
Owner

RafaGago commented Jun 16, 2023

The way I'd go about this, if I were to code it myself, would be adding a meson option. I have this project in maintenance-only mode, so I'm not coding new features, though I accept MRs.

You can modify it like the diff below and pass -DBL_NO_CPU_TIMEPT to the compilation, I guess. That will make the "fast" clock fall back to the monotonic clock instead of rdtsc (untested).

diff --git a/include/bl/time_extras/time_extras.h b/include/bl/time_extras/time_extras.h
index a89874a..72d977e 100644
--- a/include/bl/time_extras/time_extras.h
+++ b/include/bl/time_extras/time_extras.h
@@ -10,8 +10,12 @@
 #define BL_CPU_TIMEPT_UNAVAILABLE 0
 #define BL_CPU_TIMEPT_RDTSC       1
 /*----------------------------------------------------------------------------*/
-#if BL_ARCH_IS_X64_X86_IA64
-  #define BL_CPU_TIMEPT BL_CPU_TIMEPT_RDTSC
+#ifndef BL_NO_CPU_TIMEPT
+  #if BL_ARCH_IS_X64_X86_IA64
+    #define BL_CPU_TIMEPT BL_CPU_TIMEPT_RDTSC
+  #else
+    #define BL_CPU_TIMEPT BL_CPU_TIMEPT_UNAVAILABLE
+  #endif
 #else
   #define BL_CPU_TIMEPT BL_CPU_TIMEPT_UNAVAILABLE
 #endif
