
testscr_check_kdb_internal_suite: very slow at least on a7 #3512

Closed
mpranj opened this issue Oct 10, 2020 · 15 comments
Labels
continuous integration, low priority, stale, testing (errors in tests without implication on real usage)

Comments

@mpranj
Member

mpranj commented Oct 10, 2020

The testscr_check_kdb_internal_suite takes more than 30 minutes on a7.

        Start 181: testscr_check_kdb_internal_suite
218/262 Test #181: testscr_check_kdb_internal_suite ..............   Passed  1804.16 sec

We need to investigate why this takes so long and if things can be improved.

@mpranj added the labels continuous integration and testing (errors in tests without implication on real usage) on Oct 10, 2020
@markus2330
Contributor

Thank you for creating the issue! As said in #3510, we should probably reduce the number of tests for normal builds and only run them for special builds (but then really all of them). Ideally these special builds could also be executed from PRs (e.g. if someone modifies a storage plugin).

Maybe these special builds could also include fuzzing (#185), which is also only needed from time to time (or when relevant changes are made).

@markus2330
Contributor

Just an idea: maybe libeatmydata would speed up these tests, as Elektra uses fsync excessively.
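The libeatmydata suggestion rests on fsync being the dominant cost. A rough, self-contained way to see that cost (this is purely illustrative and not part of the Elektra test suite; the directory, file names and counts are made up, and `sync(1)` is used as a stand-in for per-write fsync) is to compare plain writes against writes that each force a flush:

```shell
#!/bin/sh
# Illustrative micro-benchmark: plain writes vs. per-write flushed writes.
# Everything here (directory, file names, counts) is hypothetical.
dir=$(mktemp -d)

plain=0
while [ "$plain" -lt 50 ]; do
  printf 'key=value\n' > "$dir/plain_$plain"
  plain=$((plain + 1))
done

synced=0
while [ "$synced" -lt 50 ]; do
  printf 'key=value\n' > "$dir/synced_$synced"
  sync                      # force a flush, like an fsync after every kdbSet
  synced=$((synced + 1))
done

total=$(ls "$dir" | wc -l | tr -d ' ')
echo "wrote $total files"
rm -rf -- "$dir"
```

On a spinning disk the flushed loop is dramatically slower; libeatmydata works by turning fsync into a no-op via LD_PRELOAD, which is why it can help here without touching the tests.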

@mpranj
Member Author

mpranj commented Oct 12, 2020

Another idea: we could use a ramdisk for some or all of the tests, at least on slow machines.
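A minimal sketch of the ramdisk idea, assuming a Linux host where /dev/shm is tmpfs-backed. The directory name and the `ELEKTRA_TEST_ROOT` variable are purely illustrative; the real test scripts would need to be pointed at the directory in whatever way they actually support:

```shell
#!/bin/sh
# Sketch: keep test data on a RAM-backed filesystem when one is available.
if [ -d /dev/shm ] && [ -w /dev/shm ]; then
  testroot="/dev/shm/elektra-test-$$"   # tmpfs on most Linux systems
else
  testroot=$(mktemp -d)                 # fallback: ordinary temp dir
fi
mkdir -p "$testroot"

# Hypothetical: export whatever variable the test harness would read.
ELEKTRA_TEST_ROOT="$testroot"
export ELEKTRA_TEST_ROOT

touch "$testroot/probe" && echo "test root ready: $testroot"
rm -rf -- "$testroot"
```

Because tmpfs lives in page cache, every read and write stays in RAM, so fsync-heavy workloads stop paying the disk penalty without any change to the tests themselves.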

@markus2330
Contributor

Yes, this might be the easier solution!

@kodebach
Member

We need to investigate why this takes so long

This is a result of #2595. Now we finally test all storage plugins. Maybe fsync is the root cause of this issue, but it might also be the case that some plugin is very slow.

Currently the tests are executed with these plugins:

dini
dump
ini
kconfig
ni
quickdump
simpleini
tcl
toml
xerces
xmltool
yajl

(based on the testdata in tests/shell/shell)

The tcl plugin only runs the basic tests; ini runs basic, string and umlauts; dump and quickdump run basic, string, umlauts, binary and naming; everything else runs basic and string.

The script also runs everything twice: once for the user namespace and once for system.

On my machine the test takes only ~63s (Ryzen 3700X, 32GB RAM, NVMe SSD), but the distribution looks like this:

dump -> 26s
ini -> 9s
quickdump -> 26s

tcl doesn't work on my machine, and the remaining tests take less than 1s combined. The times were obtained by replacing "$KDB" test "$ROOT" $TESTS with time "$KDB" test "$ROOT" $TESTS in the test script.
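The measurement described above can be generalized into a small wrapper. `run_timed` is a made-up helper name, and `$KDB`, `$ROOT`, `$TESTS` are the variables from the actual test script; `date +%s` is used instead of `time(1)` so the sketch stays portable to shells without a `time` builtin:

```shell
#!/bin/sh
# Sketch: time each plugin's test run individually instead of the whole suite.
run_timed() {
  start=$(date +%s)
  "$@"
  rc=$?
  end=$(date +%s)
  echo "# $* took $((end - start))s" >&2   # keep timing out of stdout
  return $rc
}

# In the real script this would be:
#   run_timed "$KDB" test "$ROOT" $TESTS
run_timed true
status=$?
echo "exit status: $status"
```

Printing the timing to stderr keeps the test's own stdout clean, so any output comparisons the suite does are unaffected.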

The fact that dump and quickdump have nearly identical runtimes, combined with looking at the code for kdb test, indicates to me that the problem is kdb test itself. Specifically the umlauts, binary and naming tests. Each of these calls kdbGet and kdbSet more than 250 times, every time with just a single key. It also destroys all KeySets, Keys and the KDB handle for each of these 250+ tests.

kdb test is just poorly written and will always take forever to run. I didn't profile it, but I highly doubt that fsync is the issue and suspect that it's simply the overhead of kdbGet and kdbSet themselves.

PS. the meta test is even worse and that's why it is never used for any of the plugins.


Maybe disabling fsync or using a RAM disk will do something, but I don't think we can avoid changing kdb test itself.

@markus2330
Contributor

The script also runs everything twice. Once for the user namespace and once for system.

Thanks for the hint, this is actually useless, it does not test the storage plugins. Furthermore, it would be a very easy fix.

Everytime just with a single key.

This is on purpose to test such KeySets. Further tests with bigger KeySets can be added (contributions welcome!) but this should not replace the simple tests.

It also destroys all KeySets, Keys and KDB for each of these 250+ tests.

Reusing KDB might make the tests faster but then they would be different tests. Tests that test the behavior of sequences of get/set are in tests/kdb.

Btw. using tmpfs might be positive for all tests, testscr_check_kdb_internal_suite is not the only test that takes some time because of I/O load.

Maybe Robert will look at these things.

@kodebach
Member

This is on purpose to test such KeySets.

What exactly is the difference between testing a KeySet with just a single Key a, then with Key b, etc., compared to testing with a KeySet containing a, b, c, etc. all at once? In the end, either the plugin can deal with all the key names and store everything properly, or it can't.

I can't think of a reason why a plugin would be able to store a KeySet containing a and b, but not a KeySet with just a.

What exactly is the goal of kdb test anyway? It seems to be to verify that a mountpoint configuration works.
But the goal of testscr_check_kdb_internal_suite seems to be to check that a storage plugin can be used as such. This could also be achieved by calling the plugin directly and doing some checks around that. The tests would also be a lot faster, if they didn't use kdbGet/kdbSet but instead called the plugin directly.

In particular, I don't see the point of running the kdbGet/kdbSet sequence for all 256 possible 1 byte key base names for every plugin. It would be much quicker to validate kdbGet/kdbSet and the plugins separately instead of revalidating kdbGet/kdbSet for every plugin.
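For a sense of scale, the per-byte naming check implies on the order of 250+ full get/set round trips per plugin and per namespace. The loop below is purely illustrative and only counts the iterations (it covers the 255 non-zero byte values; the real kdb test would perform a kdbGet/kdbSet round trip in each one):

```shell
#!/bin/sh
# Count the round trips the one-byte naming test implies (illustrative only).
roundtrips=0
byte=1
while [ "$byte" -le 255 ]; do
  # the real kdb test would do a kdbGet + kdbSet round trip here
  roundtrips=$((roundtrips + 1))
  byte=$((byte + 1))
done
namespaces=2   # user and system, per the test script
echo "$((roundtrips * namespaces)) round trips per plugin"
```

With two namespaces this prints 510 round trips per plugin, multiplied again across all the plugins the suite covers.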

Reusing KDB might make the tests faster but then they would be different tests. Tests that test the behavior of sequences of get/set are in tests/kdb.

Yes, reusing the KDB handle would be a different test. But at least the KeySet and Key instances could be reused. I'm not sure this would change a lot in practice, but at least in theory it saves a lot of malloc and free.

Btw. using tmpfs might be positive for all tests, testscr_check_kdb_internal_suite is not the only test that takes some time because of I/O load.

That is certainly true, but it definitely needs further investigation. There are a few testmod_* tests for storage plugins that do file I/O and are still quite fast. I don't think any of them use fsync, but they also don't call kdbGet/kdbSet, and that also has a big influence.


Another thing I just noticed: umlauts may be a subset of naming. Therefore calling both for the same plugin is just a waste of time.

@markus2330
Contributor

markus2330 commented Oct 12, 2020

What exactly is the goal of kdb test anyway?

To test round-trip and other storage properties of a given mountpoint.

What exactly is the difference between testing a KeySet with just a single Key a then with Key b, etc. compared to testing with a KeySet containing a, b, c, etc. all at once?

The difference is for storage plugin developers: you get a very small example of when your plugin fails and not a huge test case where you need to find out yourself what actually the problem is.

It is possible that there is some overlap between the tests, but I would not waste my time (nor would I ask anyone to waste their time) trying to eliminate it.

But at least the KeySet and Key instances could be reused

As long as we do not know for sure that this is a bottleneck (and I doubt it), such optimizations would be a waste of development time.

@kodebach
Member

The difference is for storage plugin developers: you get a very small example of when your plugin fails and not a huge test case where you need to find out yourself what actually the problem is.

Ok, but that should not be the goal of testscr_check_kdb_internal_suite. If that is the goal, then testscr_check_kdb_internal_suite should not be run all the time in our CI.

The tests we run in our CI are there to ensure PRs don't introduce regressions and adhere to some baseline requirements. The CI is not a debugging tool for developers.

We can provide a separate test suite that plugin developers can run themselves to get details about what is going wrong.

@markus2330
Contributor

Exactly, thus the idea to extract the tests which can always run and other tests only to be run for releases (or development). This was my first comment on this issue.

@mpranj
Member Author

mpranj commented Nov 28, 2020

The tests are running much faster on a7 since the ext4 formatted SSD was added:

        Start 178: testscr_check_kdb_internal_suite
215/258 Test #178: testscr_check_kdb_internal_suite ..............   Passed   94.78 sec

I'll change this to low priority until somebody can fix the real underlying problem as @kodebach pointed out.

@mpranj
Member Author

mpranj commented Jul 9, 2021

The testscr_check_kdb_internal_suite test alone now takes about 40 minutes on i7.
Using tmpfs for the config and cache directories (due to heavy I/O) reduces the test time from 40 minutes to about 6 seconds on i7.

@markus2330
Contributor

Thank you very much! Can we close this issue or is separation of some tests still useful?

@stale

stale bot commented Jul 30, 2022

I mark this issue stale as it did not have any activity for one year. I'll close it in two weeks if no further activity occurs. If you want it to be alive again, ping the issue by writing a message here or create a new issue with the remainder of this issue.
Thank you for your contributions 💖

@stale stale bot added the stale label Jul 30, 2022
@stale

stale bot commented Sep 21, 2022

I closed this issue now because it has been inactive for more than one year. If I closed it by mistake, please do not hesitate to reopen it or create a new issue with the remainder of this issue.
Thank you for your contributions 💖

@stale stale bot closed this as completed Sep 21, 2022