Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Weird time jumps, probably due to icount being disabled in QEMU #1009

Open
j-piecuch opened this issue Feb 18, 2021 · 4 comments
Open

Weird time jumps, probably due to icount being disabled in QEMU #1009

j-piecuch opened this issue Feb 18, 2021 · 4 comments

Comments

@j-piecuch
Copy link
Collaborator

j-piecuch commented Feb 18, 2021

Excerpt from https://github.com/cahirwpz/mimiker/pull/1006/checks?check_run_id=1926763381:

| 8.938410 |   2 | kern/callout.c:101   | KL_CALLOUT | Add callout {0xffff00000010bb20} with wakeup at 8948.                       |
| 8.938421 |   2 | kern/callout.c:101   | KL_CALLOUT | Add callout {0xffff00000010bb50} with wakeup at 8949.                       |
| 8.938431 |   2 | kern/thread.c:179    | KL_THREAD  | Join 2 {0xffff0000000021d0} with 167 {0xffff000000002368}                   |
| 8.938455 |   2 | kern/sleepq.c:119    | KL_SLEEPQ  | Thread 2 goes to sleep on 0xffff000000002370 at pc=0xffff000000213cf4       |
| 8.938499 | 171 | kern/sleepq.c:119    | KL_SLEEPQ  | Thread 171 goes to sleep on 0xffff0000002301e4 at pc=0xffff00000023021c     |
     ^
     |
     v
| 8.963309 |   0 | kern/sleepq.c:157    | KL_SLEEPQ  | Wakeup thread 1 from 0xffff0000002a9090 at pc=0xffff000000228048            |
| 8.966100 |   1 | kern/sleepq.c:157    | KL_SLEEPQ  | Wakeup thread 167 from 0xffff0000002301e4 at pc=0xffff00000023021c          |
| 8.966120 |   1 | kern/sleepq.c:157    | KL_SLEEPQ  | Wakeup thread 168 from 0xffff0000002301e4 at pc=0xffff00000023021c          |
| 8.966134 |   1 | kern/sleepq.c:157    | KL_SLEEPQ  | Wakeup thread 169 from 0xffff0000002301e4 at pc=0xffff00000023021c          |

Excerpt from https://github.com/cahirwpz/mimiker/pull/1006/checks?check_run_id=1926763381:

| 4.025008 | 134 | kern/sleepq.c:119    | KL_SLEEPQ  | Thread 134 goes to sleep on 0xffff0000002301a4 at pc=0xffff0000002301dc     |
| 4.025024 | 135 | kern/sleepq.c:119    | KL_SLEEPQ  | Thread 135 goes to sleep on 0xffff0000002301a4 at pc=0xffff0000002301dc     |
| 4.025040 | 136 | kern/sleepq.c:119    | KL_SLEEPQ  | Thread 136 goes to sleep on 0xffff0000002301a4 at pc=0xffff0000002301dc     |
    ^
    |
    v
| 4.125152 |   0 | kern/sleepq.c:157    | KL_SLEEPQ  | Wakeup thread 1 from 0xffff0000002a9090 at pc=0xffff000000228018            |
| 4.134341 |   1 | kern/sleepq.c:157    | KL_SLEEPQ  | Wakeup thread 132 from 0xffff0000002301a4 at pc=0xffff0000002301dc          |
| 4.134365 |   1 | kern/sleepq.c:157    | KL_SLEEPQ  | Wakeup thread 133 from 0xffff0000002301a4 at pc=0xffff0000002301dc          |

In both cases the test callout_sync fails due to the time jump, which causes callouts that are supposed to run on different ticks to run at the same time.
The AArch64 timer is driven by the QEMU virtual clock, not the host clock, which makes the time jump all the more strange.

UPDATE: the same thing happens on MIPS: https://github.com/cahirwpz/mimiker/runs/1928695030

| 6.950163 | 102 | kern/sleepq.c:119    | KL_SLEEPQ  | Thread 102 goes to sleep on 0xc0162b18 at pc=0xc0162b68                     |
| 6.950205 | 103 | kern/sleepq.c:119    | KL_SLEEPQ  | Thread 103 goes to sleep on 0xc0162b18 at pc=0xc0162b68                     |
| 6.950244 | 104 | kern/sleepq.c:119    | KL_SLEEPQ  | Thread 104 goes to sleep on 0xc0162b18 at pc=0xc0162b68                     |
| 7.063905 |   0 | kern/sleepq.c:157    | KL_SLEEPQ  | Wakeup thread 1 from 0xc01fa420 at pc=0xc01548fc                            |
| 7.064159 |   1 | kern/sleepq.c:157    | KL_SLEEPQ  | Wakeup thread 100 from 0xc0162b18 at pc=0xc0162b68                          |
| 7.064252 |   1 | kern/sleepq.c:157    | KL_SLEEPQ  | Wakeup thread 101 from 0xc0162b18 at pc=0xc0162b68                          |
@j-piecuch j-piecuch added AArch64 all tasks that are related to AArch64 port and removed AArch64 all tasks that are related to AArch64 port labels Feb 18, 2021
@j-piecuch j-piecuch changed the title Weird time jumps on AArch64 Weird time jumps Feb 18, 2021
@j-piecuch
Copy link
Collaborator Author

Another one caught on AArch64, this time not during the callout_sync test: https://github.com/cahirwpz/mimiker/pull/1005/checks?check_run_id=1928670781

| 39.504748 | 600 | aarch64/trap.c:54   | KL_UNDEF   | Last message repeated 1 times.                                              |
| 39.824406 | 600 | aarch64/trap.c:54   | KL_UNDEF   | 25 at $ffff0000002293f0, caused by reference to $410000!                    |

Also, the fact that the time in klog shows 40 seconds before timing out suggests that the timer is in fact driven by host time, since the test timeout on the host is 40 seconds.

@j-piecuch
Copy link
Collaborator Author

Ok... without the icount option, the virtual clock is basically the host's monotonic clock.
I have traced how QEMU emulates the value of the cntpct_el0 register:

  1. Definition of the register: reading the value is accomplished using the gt_cnt_read() function.
  2. gt_cnt_read() returns gt_get_countervalue(), which returns the value of the virtual clock (qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL)) in nanoseconds divided by the counter period in nanoseconds
  3. qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) returns cpus_get_virtual_clock()
  4. cpus_accel->get_virtual_clock is not set, because icount is disabled (see here). Therefore, cpus_get_virtual_clock() returns cpu_get_clock(), which ultimately uses get_clock(), which calls clock_gettime(CLOCK_MONOTONIC).

@j-piecuch j-piecuch changed the title Weird time jumps Weird time jumps, probably due to icount being disabled in QEMU Feb 18, 2021
@j-piecuch
Copy link
Collaborator Author

I think the way to resolve this is to enable icount in QEMU. If I remember correctly, we initally had it enabled, but it was disabled because guest time was elapsing faster than host time.

The shift suboption of icount controls the number of guest nanoseconds that each guest instruction takes. If shift=N, each instruction executed in the guest advances the guest time by 2^N nanoseconds. If the shift value is too high, guest time will pass very fast. If it's too low, guest time will pass slowly. If shift=auto, QEMU will adjust the shift value at runtime so that guest time passes at more or less the same rate as host time. The drawback is that guest's perceived rate of instruction execution will vary.

Another solution is to have shift=N,align=on, with N large enough for guest time to elapse faster than host time. With these options, the guest's perceived instruction execution rate remains constant, and QEMU will periodically check the gap between guest and host time. If it's too large, QEMU will sleep so that host time catches up.

I don't think we rely on the instruction execution rate being constant, so shift=auto seems like a good enough solution to me.

@j-piecuch
Copy link
Collaborator Author

Ok... shift=auto seems to cause test_signal_mask() to fail, probably because 3 calls to sched_yield() aren't enough to make sure the child process runs when shift is very high (the maximum value is 10, so one executed instruction advances guest time by 1024 ns). We can fix this by using a better method to wait for a signal to become pending, e.g. sigpending(), which is implemented in #1012.
Also, it might be a good idea to use a constant shift value for non-interactive tests, and use shift=auto for everything else.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant