Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change tests parallelizing mechanism #471

Closed
IlyaFaer opened this issue Aug 26, 2020 · 7 comments
Closed

Change tests parallelizing mechanism #471

IlyaFaer opened this issue Aug 26, 2020 · 7 comments
Assignees
Labels
api: spanner Issues related to the googleapis/python-spanner-django API. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: process A process-related concern. May include testing, release, or the like.

Comments

@IlyaFaer
Copy link
Contributor

IlyaFaer commented Aug 26, 2020

While working on the last PR I've got in the situation when kokoro checks passed (green), but after some time their status became red (because some parallelized tests failed - some can take more than 20 min.). This can cause problems with "automerge" (bot that runs checks and merges PR automatically, if they are green). Automerge is used in the original Spanner repo, which means this API will be using it too. And, after all, it's inconvenient when checks are green, but can become red in future (~35 min.).

@IlyaFaer IlyaFaer added type: process A process-related concern. May include testing, release, or the like. api: spanner Issues related to the googleapis/python-spanner-django API. labels Aug 26, 2020
@mf2199
Copy link
Contributor

mf2199 commented Aug 26, 2020

From a kokoro presubmit log:

...

startIndex: 96 endIndex: 120 totalApps 24
createInstance: throttling by sleeping for 11.098s
Sleeping for 632.969759ms
panic: rpc error: code = ResourceExhausted desc = Project 1065521786570 cannot add 1 nodes in region us-west1.

goroutine 1 [running]:
main.main()
	/Users/emmanuelodeke/Desktop/django-spanner/parallelize_tests.go:81 +0xfa1


[ID: 3692114] Build finished after 150 secs, exit value: 2


Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
[17:14:00] Collecting build artifacts from build VM
Build script failed with exit code: 2

The way parallelization implemented here is not present among other APIs (Spanner, Bigtable, Storage etc.). Hence I'd suggest removing it at least for now.

@mf2199 mf2199 added the priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. label Aug 26, 2020
@c24t
Copy link
Contributor

c24t commented Aug 27, 2020

I'm seeing the same error in #469, which shouldn't have any real test failures. It looks like some test runs may not be cleaning up test instances, which prevents us from creating new instances.

@c24t
Copy link
Contributor

c24t commented Aug 27, 2020

/Users/emmanuelodeke/Desktop/django-spanner/parallelize_tests.go:81 +0xfa1

@odeke-em what's up with this path? Was this the source for bin/parallelize_tests_linux? In any case we probably want to avoid checking in opaque binaries, even if they're only used in CI.

@c24t
Copy link
Contributor

c24t commented Aug 27, 2020

There may also be some answers in #413 and 43993b1. It looks like create instance calls are randomly staggered across workers (43993b1#diff-b65ed1fb7a1fb00036a946fa744fb712R74), presumably because this fails if we call it too frequently? In which case we may have just gotten unlucky with timing. In any case throttling the tests like this is sure to cause some flakiness, and we should probably find a better way to avoid resource limits.

In any case it sounds like we have three problems here: (1) kokoro reports that tests pass (instead of just pending) before reporting that they fail, (2) the kokoro tests are flaky or outright broken, and (3) other spanner repos don't run tests this way.

(1) and (2) seem like a high priority, but (3) may not be a problem for a while. If parallelize_tests is causing (1) or (2) instead of kokoro config issues or the tests just being flaky, then that's good reason to remove it. But I'd like to know why this is causing problems (and tests are failing) now but not in the past.

@IlyaFaer
Copy link
Contributor Author

IlyaFaer commented Aug 27, 2020

(2) the kokoro tests are flaky or outright broken

Looking at the Go code: it creates an instance for every worker and plans deleting it on the tests end with defer:

defer deleteInstance()

But if the instance deletion or the test itself fails, the instance stays in project. For such a cases we usually use pre-test cleanup: list all the instances within project and delete those older then 24h (example in Go) - that is done before starting the test. We use such a practise in Spanner, Storage, PubSub, etc., and I don't see anything like this here.

(1) kokoro reports that tests pass (instead of just pending) before reporting that they fail

Looks like the longest test worker can take ~2 hours to finish (see logs). It's the longest one, meaning if half of the workers are longer then 0.5h then it can take several hours to run all the tests if we'll unparallelize them. This will hit the build limit of 2h.

All of it means that we should change the tests parallelizing method to fix the mentioned problems and not to hit any limit. Seems to me the best way is to split workers onto several kokoro checks (see the example).

@IlyaFaer IlyaFaer changed the title Avoid parallelizing tests Change tests parallelizing mechanism Aug 27, 2020
@c24t c24t added this to the Beta milestone Oct 22, 2020
@c24t c24t removed this from the 11/5 Preview Release milestone Feb 4, 2021
@vi3k6i5
Copy link
Contributor

vi3k6i5 commented May 20, 2021

Ideally kokoro should not report success until all it's child workers have finished executing. I will look into how to change the kokoro return to wait until all workers are either successful or failed.

@vi3k6i5 vi3k6i5 self-assigned this May 20, 2021
@vi3k6i5
Copy link
Contributor

vi3k6i5 commented May 20, 2021

Resolved by making kokoro return status after all tests are finished.

@vi3k6i5 vi3k6i5 closed this as completed May 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: spanner Issues related to the googleapis/python-spanner-django API. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: process A process-related concern. May include testing, release, or the like.
Projects
None yet
Development

No branches or pull requests

4 participants