
testscr_check_kdb_internal_suite: very slow at least on a7 #3512

Closed
mpranj opened this issue Oct 10, 2020 · 15 comments
Labels
continuous integration, low priority, stale, testing (errors in tests without implication on real usage)

Comments

@mpranj
Member

mpranj commented Oct 10, 2020

The testscr_check_kdb_internal_suite takes more than 30 minutes on a7.

        Start 181: testscr_check_kdb_internal_suite
218/262 Test #181: testscr_check_kdb_internal_suite ..............   Passed  1804.16 sec

We need to investigate why this takes so long and if things can be improved.

@mpranj added the labels continuous integration and testing (errors in tests without implication on real usage) on Oct 10, 2020
@markus2330
Contributor

Thank you for creating the issue! As said in #3510, we should probably reduce the number of tests for normal builds and only run them for special builds (but then really all of them). Ideally these special builds could also be executed from PRs (e.g. if someone modifies a storage plugin).

Maybe these special builds could also include fuzzing (#185), which is also only needed from time to time (or when relevant changes are made).

@markus2330
Contributor

Just an idea: maybe libeatmydata would speed up these tests, as Elektra uses fsync excessively.
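The libeatmydata suggestion rests on fsync being the dominant cost. A rough, self-contained way to see that cost (this is purely illustrative and not part of the Elektra test suite; the directory, file names and counts are made up, and `sync(1)` is used as a stand-in for per-write fsync) is to compare plain writes against writes that each force a flush:

```shell
#!/bin/sh
# Illustrative micro-benchmark: plain writes vs. per-write flushed writes.
# Everything here (directory, file names, counts) is hypothetical.
dir=$(mktemp -d)

plain=0
while [ "$plain" -lt 50 ]; do
  printf 'key=value\n' > "$dir/plain_$plain"
  plain=$((plain + 1))
done

synced=0
while [ "$synced" -lt 50 ]; do
  printf 'key=value\n' > "$dir/synced_$synced"
  sync                      # force a flush, like an fsync after every kdbSet
  synced=$((synced + 1))
done

total=$(ls "$dir" | wc -l | tr -d ' ')
echo "wrote $total files"
rm -rf -- "$dir"
```

On a spinning disk the flushed loop is dramatically slower; libeatmydata works by turning fsync into a no-op via LD_PRELOAD, which is why it can help here without touching the tests.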

@mpranj
Member Author

mpranj commented Oct 12, 2020

Another idea: we could use a ramdisk for some or all of the tests, at least on slow machines.
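A minimal sketch of the ramdisk idea, assuming a Linux host where /dev/shm is tmpfs-backed. The directory name and the `ELEKTRA_TEST_ROOT` variable are purely illustrative; the real test scripts would need to be pointed at the directory in whatever way they actually support:

```shell
#!/bin/sh
# Sketch: keep test data on a RAM-backed filesystem when one is available.
if [ -d /dev/shm ] && [ -w /dev/shm ]; then
  testroot="/dev/shm/elektra-test-$$"   # tmpfs on most Linux systems
else
  testroot=$(mktemp -d)                 # fallback: ordinary temp dir
fi
mkdir -p "$testroot"

# Hypothetical: export whatever variable the test harness would read.
ELEKTRA_TEST_ROOT="$testroot"
export ELEKTRA_TEST_ROOT

touch "$testroot/probe" && echo "test root ready: $testroot"
rm -rf -- "$testroot"
```

Because tmpfs lives in page cache, every read and write stays in RAM, so fsync-heavy workloads stop paying the disk penalty without any change to the tests themselves.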

@markus2330
Contributor

Yes, this might be the easier solution!

@kodebach
Member

We need to investigate why this takes so long

This is a result of #2595. Now we finally test all storage plugins. Maybe fsync is the root cause of this issue, but it might also be the case that some plugin is very slow.

Currently the tests are executed with these plugins:

dini
dump
ini
kconfig
ni
quickdump
simpleini
tcl
toml
xerces
xmltool
yajl

(based on the testdata in tests/shell/shell)

The tcl plugin only runs the basic tests; ini runs basic, string and umlauts; dump and quickdump run basic, string, umlauts, binary and naming; everything else runs basic and string.

The script also runs everything twice: once for the user namespace and once for system.

On my machine the test takes only ~63s (Ryzen 3700X, 32GB RAM, NVMe SSD), but the distribution looks like this:

dump -> 26s
ini -> 9s
quickdump -> 26s

tcl doesn't work on my machine, and the remaining tests take less than 1s combined. The times were obtained by replacing "$KDB" test "$ROOT" $TESTS with time "$KDB" test "$ROOT" $TESTS in the test script.
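The measurement described above can be generalized into a small wrapper. `run_timed` is a made-up helper name, and `$KDB`, `$ROOT`, `$TESTS` are the variables from the actual test script; `date +%s` is used instead of `time(1)` so the sketch stays portable to shells without a `time` builtin:

```shell
#!/bin/sh
# Sketch: time each plugin's test run individually instead of the whole suite.
run_timed() {
  start=$(date +%s)
  "$@"
  rc=$?
  end=$(date +%s)
  echo "# $* took $((end - start))s" >&2   # keep timing out of stdout
  return $rc
}

# In the real script this would be:
#   run_timed "$KDB" test "$ROOT" $TESTS
run_timed true
status=$?
echo "exit status: $status"
```

Printing the timing to stderr keeps the test's own stdout clean, so any output comparisons the suite does are unaffected.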

The fact that dump and quickdump have nearly identical runtimes, combined with looking at the code for kdb test, indicates to me that the problem is kdb test itself. Specifically the umlauts, binary and naming tests. Each of these calls kdbGet and kdbSet more than 250 times, every time with just a single key. It also destroys all KeySets, Keys and the KDB handle for each of these 250+ tests.

kdb test is just poorly written and will always take forever to run. I didn't profile it, but I highly doubt that fsync is the issue and suspect that it's simply the overhead of kdbGet and kdbSet themselves.

PS. the meta test is even worse and that's why it is never used for any of the plugins.


Maybe disabling fsync or using a RAM disk will do something, but I don't think we can avoid changing kdb test itself.

@markus2330
Contributor

The script also runs everything twice. Once for the user namespace and once for system.

Thanks for the hint, this is actually useless, it does not test the storage plugins. Furthermore, it would be a very easy fix.

Everytime just with a single key.

This is on purpose to test such KeySets. Further tests with bigger KeySets can be added (contributions welcome!) but this should not replace the simple tests.

It also destroys all KeySets, Keys and KDB for each of these 250+ tests.

Reusing KDB might make the tests faster but then they would be different tests. Tests that test the behavior of sequences of get/set are in tests/kdb.

Btw. using tmpfs might be positive for all tests, testscr_check_kdb_internal_suite is not the only test that takes some time because of I/O load.

Maybe Robert will look at these things.

@kodebach
Member

This is on purpose to test such KeySets.

What exactly is the difference between testing a KeySet with just a single Key a, then with Key b, etc., compared to testing with a KeySet containing a, b, c, etc. all at once? In the end, either the plugin can deal with all the key names and store everything properly, or it can't.

I can't think of a reason why a plugin would be able to store a KeySet containing a and b, but not a KeySet with just a.

What exactly is the goal of kdb test anyway? It seems to be to verify that a mountpoint configuration works.
But the goal of testscr_check_kdb_internal_suite seems to be to check that a storage plugin can be used as such. This could also be achieved by calling the plugin directly and doing some checks around that. The tests would also be a lot faster, if they didn't use kdbGet/kdbSet but instead called the plugin directly.

In particular, I don't see the point of running the kdbGet/kdbSet sequence for all 256 possible 1 byte key base names for every plugin. It would be much quicker to validate kdbGet/kdbSet and the plugins separately instead of revalidating kdbGet/kdbSet for every plugin.
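For a sense of scale, the per-byte naming check implies on the order of 250+ full get/set round trips per plugin and per namespace. The loop below is purely illustrative and only counts the iterations (it covers the 255 non-zero byte values; the real kdb test would perform a kdbGet/kdbSet round trip in each one):

```shell
#!/bin/sh
# Count the round trips the one-byte naming test implies (illustrative only).
roundtrips=0
byte=1
while [ "$byte" -le 255 ]; do
  # the real kdb test would do a kdbGet + kdbSet round trip here
  roundtrips=$((roundtrips + 1))
  byte=$((byte + 1))
done
namespaces=2   # user and system, per the test script
echo "$((roundtrips * namespaces)) round trips per plugin"
```

With two namespaces this prints 510 round trips per plugin, multiplied again across all the plugins the suite covers.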

Reusing KDB might make the tests faster but then they would be different tests. Tests that test the behavior of sequences of get/set are in tests/kdb.

Yes, reusing the KDB handle would be a different test. But at least the KeySet and Key instances could be reused. I'm not sure this would change a lot in practice, but at least in theory it saves a lot of malloc and free.

Btw. using tmpfs might be positive for all tests, testscr_check_kdb_internal_suite is not the only test that takes some time because of I/O load.

That is certainly true, but it definitely needs further investigation. There are a few testmod_* tests for storage plugins that do file I/O and are still quite fast. I don't think any of them use fsync, but they also don't call kdbGet/kdbSet, and that also has a big influence.


Another thing I just noticed: umlauts may be a subset of naming. Therefore calling both for the same plugin is just a waste of time.

@markus2330
Contributor

markus2330 commented Oct 12, 2020

What exactly is the goal of kdb test anyway?

To test round-trip and other storage properties of a given mountpoint.

What exactly is the difference between testing a KeySet with just a single Key a then with Key b, etc. compared to testing with a KeySet containing a, b, c, etc. all at once?

The difference is for storage plugin developers: you get a very small example of when your plugin fails and not a huge test case where you need to find out yourself what actually the problem is.

It is possible that there is some overlap between the tests, but I would not waste my time (nor would I ask anyone to waste their time) trying to eliminate it.

But at least the KeySet and Key instances could be reused

As long as we do not know for sure that this is a bottleneck (and I doubt it), such optimizations would be a waste of development time.

@kodebach
Member

The difference is for storage plugin developers: you get a very small example of when your plugin fails and not a huge test case where you need to find out yourself what actually the problem is.

Ok, but that should not be the goal of testscr_check_kdb_internal_suite. If that is the goal, then testscr_check_kdb_internal_suite should not be run all the time in our CI.

The tests we run in our CI are there to ensure PRs don't introduce regressions and adhere to some baseline requirements. The CI is not a debugging tool for developers.

We can provide a separate test suite that plugin developers can run themselves to get details about what is going wrong.

@markus2330
Contributor

Exactly, thus the idea to extract the tests which can always run and other tests only to be run for releases (or development). This was my first comment on this issue.

@mpranj
Member Author

mpranj commented Nov 28, 2020

The tests are running much faster on a7 since the ext4 formatted SSD was added:

        Start 178: testscr_check_kdb_internal_suite
215/258 Test #178: testscr_check_kdb_internal_suite ..............   Passed   94.78 sec

I'll change this to low priority until somebody can fix the real underlying problem as @kodebach pointed out.

@mpranj
Member Author

mpranj commented Jul 9, 2021

The testscr_check_kdb_internal_suite test alone now takes about 40 minutes on i7.
Using tmpfs for the config and cache directories (due to heavy I/O) reduces the test time from 40 minutes to about 6 seconds on i7.

@markus2330
Contributor

Thank you very much! Can we close this issue or is separation of some tests still useful?

@stale

stale bot commented Jul 30, 2022

I mark this issue stale as it did not have any activity for one year. I'll close it in two weeks if no further activity occurs. If you want it to be alive again, ping the issue by writing a message here or create a new issue with the remainder of this issue.
Thank you for your contributions 💖

@stale stale bot added the stale label Jul 30, 2022
@stale

stale bot commented Sep 21, 2022

I closed this issue now because it has been inactive for more than one year. If I closed it by mistake, please do not hesitate to reopen it or create a new issue with the remainder of this issue.
Thank you for your contributions 💖

@stale stale bot closed this as completed Sep 21, 2022