Flaky test: cluster/restart #643

Open
gibmat opened this issue Apr 24, 2024 · 5 comments

@gibmat
Contributor

gibmat commented Apr 24, 2024

During a recent rebuild of dqlite 1.16.4 (using bundled libraft) on an arm64 host, one test failed. It's likely this is a flaky test:

> cluster/restart                                             
>   num_records=0, disk_mode=0                                [ OK    ] [ 0.09290296 / 0.01869385 CPU ]
>   num_records=0, disk_mode=1                                [ OK    ] [ 0.11798750 / 0.01738483 CPU ]
>   num_records=1, disk_mode=0                                [ OK    ] [ 0.16095490 / 0.01925361 CPU ]
>   num_records=1, disk_mode=1                                [ OK    ] [ 0.09314319 / 0.01966425 CPU ]
>   num_records=256, disk_mode=0                              [ ERROR ]
> Error: test/lib/server.c:65: assertion failed: rv == 0 (1 == 0)
> Error: child killed by signal 6 (Aborted)
>   num_records=256, disk_mode=1                              [ ERROR ]
> Error: test/lib/server.c:65: assertion failed: rv == 0 (1 == 0)
> Error: child killed by signal 6 (Aborted)
>   num_records=993, disk_mode=0                              [ OK    ] [ 4.22504766 / 0.85921956 CPU ]
>   num_records=993, disk_mode=1                              [ OK    ] [ 4.26175325 / 0.89829095 CPU ]
>   num_records=2200, disk_mode=0                             [ OK    ] [ 7.24100937 / 0.67460259 CPU ]
>   num_records=2200, disk_mode=1                             [ OK    ] [ 7.25931588 / 0.81881875 CPU ]
@cole-miller
Contributor

cole-miller commented Apr 25, 2024

Thanks, we'll keep an eye on it. Unfortunately our tests are not well-isolated, so this kind of flakiness is hard to squash entirely. If it happens again, LIBDQLITE_TRACE=1 output would be useful in figuring out the culprit.

@gibmat
Contributor Author

gibmat commented Apr 27, 2024

Yeah, I haven't been able to reproduce this exact failure myself. Would adding LIBDQLITE_TRACE=1 when running the tests have any adverse side effects? If not, I'll just add it to the normal packaging rules so it's always there to help with future debugging.

@cole-miller
Contributor

cole-miller commented Apr 27, 2024 via email

@freeekanayaka
Contributor

I'd say the problem is that those tests are time-dependent and non-deterministic. I would argue that unit/integration test suites that are run at package build time should contain only deterministic, time-independent tests.

From what I've seen, this test (and others) fails simply because the hard-coded timeouts or timing expectations don't match the capacity of the underlying hardware.

While adding LIBDQLITE_TRACE=1 might help figure out exactly which timing is wrong, I think the most robust solution would be to rewrite those tests, because tweaking timings is intrinsically fragile.
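
To make the distinction concrete, here is a minimal sketch of a time-independent wait, assuming the test can inject its own clock. None of this is dqlite or raft code: `test_clock`, `fake_node`, and `wait_ready` are hypothetical names invented for illustration. The point is only that the outcome depends on the simulated time, not on how fast the build host happens to be.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fake clock: the test owns the notion of "now". */
struct test_clock {
    uint64_t now_ms; /* simulated time in milliseconds */
};

/* Hypothetical stand-in for a server that becomes ready at a given time. */
struct fake_node {
    uint64_t ready_at_ms;
};

/* Advance simulated time; no real sleeping happens. */
static void test_clock_advance(struct test_clock *c, uint64_t ms)
{
    c->now_ms += ms;
}

static bool fake_node_is_ready(const struct fake_node *n,
                               const struct test_clock *c)
{
    return c->now_ms >= n->ready_at_ms;
}

/* Step the fake clock in fixed ticks until the node is ready or the
 * simulated deadline passes. Returns 0 on success, 1 on timeout.
 * The result is identical on fast and slow hardware. */
static int wait_ready(const struct fake_node *n, struct test_clock *c,
                      uint64_t timeout_ms, uint64_t tick_ms)
{
    assert(tick_ms > 0); /* a zero tick would never make progress */
    uint64_t deadline = c->now_ms + timeout_ms;
    while (!fake_node_is_ready(n, c)) {
        if (c->now_ms >= deadline) {
            return 1;
        }
        test_clock_advance(c, tick_ms);
    }
    return 0;
}
```

A test built this way would create a `test_clock` starting at 0, call `wait_ready` with a simulated timeout, and assert on the return value. A real rewrite would also need the code under test to read time through the injected clock, which is the larger part of the work.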

@gibmat
Contributor Author

gibmat commented May 4, 2024

I've enabled LIBDQLITE_TRACE=1 when running the tests, and it will be included whenever the next upload of dqlite is made to unstable.
