Change tests parallelizing mechanism #471
Comments
The way parallelization is implemented here is not present in the other APIs (Spanner, Bigtable, Storage, etc.), so I'd suggest removing it, at least for now.
I'm seeing the same error in #469, which shouldn't have any real test failures. It looks like some test runs may not be cleaning up test instances, which prevents us from creating new instances.
@odeke-em what's up with this path? Was this the source for
There may also be some answers in #413 and 43993b1. It looks like create-instance calls are randomly staggered across workers (43993b1#diff-b65ed1fb7a1fb00036a946fa744fb712R74), presumably because instance creation fails if we call it too frequently; in that case we may have just gotten unlucky with timing. Throttling the tests like this is sure to cause some flakiness, though, and we should probably find a better way to stay under resource limits. It sounds like we have three problems here: (1) kokoro reports that tests pass (instead of just pending) before reporting that they fail, (2) the kokoro tests are flaky or outright broken, and (3) other Spanner repos don't run tests this way. (1) and (2) seem like a high priority, but (3) may not be a problem for a while. If
Looking at the Go code: it creates an instance for every worker and schedules deleting it when the tests end with
But if the instance deletion or the test itself fails, the instance stays in the project. For such cases we usually use pre-test cleanup: list all the instances in the project and delete those older than 24h (example in Go); that is done before starting the tests. We use this practice in Spanner, Storage, PubSub, etc., and I don't see anything like it here.
Looks like the longest test worker can take ~2 hours to finish (see logs). That is the single longest worker, so if half of the workers take more than 0.5h each, running all the tests serially could take several hours and hit the 2h build limit. All of this means we should change the test parallelization method so that it fixes the problems above without hitting any limits. To me the best way seems to be splitting the workers into several kokoro checks (see the example).
Ideally kokoro should not report success until all of its child workers have finished executing. I will look into changing the kokoro return to wait until all workers have either succeeded or failed.
Resolved by making kokoro return its status after all tests are finished.
While working on the last PR I ran into a situation where the kokoro checks passed (green), but after some time their status turned red, because some of the parallelized tests failed (some can take more than 20 min.). This can cause problems with "automerge" (a bot that runs checks and merges PRs automatically if they are green). Automerge is used in the original Spanner repo, which means this API will use it too. And, after all, it's inconvenient when checks are green but can turn red later (~35 min. on).