Change tests parallelizing mechanism #471

IlyaFaer · 2020-08-26T16:43:37Z

While working on the last PR, I ran into a situation where the kokoro checks passed (green), but some time later their status turned red, because some parallelized tests failed (some can take more than 20 min). This can cause problems with "automerge" (a bot that runs checks and merges a PR automatically if they are green). Automerge is used in the original Spanner repo, which means this API will be using it too. And, after all, it's inconvenient when checks are green but can turn red later (~35 min).

mf2199 · 2020-08-26T20:50:49Z

From a kokoro presubmit log:

...

startIndex: 96 endIndex: 120 totalApps 24
createInstance: throttling by sleeping for 11.098s
Sleeping for 632.969759ms
panic: rpc error: code = ResourceExhausted desc = Project 1065521786570 cannot add 1 nodes in region us-west1.

goroutine 1 [running]:
main.main()
	/Users/emmanuelodeke/Desktop/django-spanner/parallelize_tests.go:81 +0xfa1


[ID: 3692114] Build finished after 150 secs, exit value: 2


Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
[17:14:00] Collecting build artifacts from build VM
Build script failed with exit code: 2

The way parallelization is implemented here is not present in other APIs (Spanner, Bigtable, Storage, etc.). Hence I'd suggest removing it, at least for now.

c24t · 2020-08-27T05:02:12Z

I'm seeing the same error in #469, which shouldn't have any real test failures. It looks like some test runs may not be cleaning up test instances, which prevents us from creating new instances.

c24t · 2020-08-27T05:04:47Z

/Users/emmanuelodeke/Desktop/django-spanner/parallelize_tests.go:81 +0xfa1

@odeke-em what's up with this path? Was this the source for bin/parallelize_tests_linux? In any case we probably want to avoid checking in opaque binaries, even if they're only used in CI.

c24t · 2020-08-27T05:28:32Z

There may also be some answers in #413 and 43993b1. It looks like create instance calls are randomly staggered across workers (43993b1#diff-b65ed1fb7a1fb00036a946fa744fb712R74), presumably because this fails if we call it too frequently? In which case we may have just gotten unlucky with timing. In any case throttling the tests like this is sure to cause some flakiness, and we should probably find a better way to avoid resource limits.

It sounds like we have three problems here: (1) kokoro reports that tests pass (instead of just pending) before reporting that they fail, (2) the kokoro tests are flaky or outright broken, and (3) other Spanner repos don't run tests this way.

(1) and (2) seem like a high priority, but (3) may not be a problem for a while. If parallelize_tests is causing (1) or (2), rather than kokoro config issues or the tests just being flaky, then that's a good reason to remove it. But I'd like to know why this is causing problems (and tests are failing) now but not in the past.

IlyaFaer · 2020-08-27T09:24:11Z

(2) the kokoro tests are flaky or outright broken

Looking at the Go code: it creates an instance for every worker and schedules its deletion at the end of the tests with defer:

python-spanner-django/parallelize_tests.go

Line 83 in 5551b58

defer deleteInstance()

But if the instance deletion or the test itself fails, the instance stays in the project. For such cases we usually use pre-test cleanup: list all the instances within the project and delete those older than 24h (example in Go). That is done before starting the tests. We use this practice in Spanner, Storage, PubSub, etc., and I don't see anything like it here.
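A minimal sketch of that pre-test cleanup, to make the idea concrete. It is not the code from the linked example: `testInstance`, `CreatedUnix`, `staleInstances`, and `cleanupStale` are hypothetical names, the creation time is assumed to be available as Unix seconds (for instance from a label set when the instance was created), and the real list and delete calls against the instance admin API are left out and replaced by a stand-in function.

```go
package main

import (
	"fmt"
	"time"
)

// testInstance is a hypothetical, simplified view of a Spanner instance.
// CreatedUnix is assumed to be recorded when the test created the instance.
type testInstance struct {
	Name        string
	CreatedUnix int64
}

// maxInstanceAge mirrors the 24h cutoff mentioned above.
const maxInstanceAge = 24 * time.Hour

// staleInstances returns the names of instances created more than maxAge
// before nowUnix.
func staleInstances(all []testInstance, nowUnix int64, maxAge time.Duration) []string {
	cutoff := nowUnix - int64(maxAge/time.Second)
	var stale []string
	for _, inst := range all {
		if inst.CreatedUnix < cutoff {
			stale = append(stale, inst.Name)
		}
	}
	return stale
}

// cleanupStale calls del for every stale instance. It keeps going on errors
// so that one failed deletion does not block the rest of the cleanup.
func cleanupStale(all []testInstance, nowUnix int64, del func(name string) error) (deleted int, errs []error) {
	for _, name := range staleInstances(all, nowUnix, maxInstanceAge) {
		if err := del(name); err != nil {
			errs = append(errs, fmt.Errorf("deleting %s: %w", name, err))
			continue
		}
		deleted++
	}
	return deleted, errs
}

func main() {
	now := time.Now().Unix()
	instances := []testInstance{
		{Name: "leaked-two-days-ago", CreatedUnix: now - 2*24*60*60},
		{Name: "created-an-hour-ago", CreatedUnix: now - 60*60},
	}
	// Stand-in delete function; a real one would call the instance admin API.
	del := func(name string) error {
		fmt.Println("would delete", name)
		return nil
	}
	deleted, errs := cleanupStale(instances, now, del)
	fmt.Printf("deleted=%d errors=%d\n", deleted, len(errs))
}
```

Running this before the workers create their instances would reclaim nodes leaked by earlier failed runs, while leaving instances that belong to builds still in progress alone.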

(1) kokoro reports that tests pass (instead of just pending) before reporting that they fail

Looks like the longest test worker can take ~2 hours to finish (see logs). Since that is only the longest one, if half of the workers take longer than 0.5h, running all the tests without parallelization could take several hours. This will hit the build limit of 2h.

All of this means that we should change the test parallelization method to fix the problems mentioned without hitting any limits. It seems to me the best way is to split the workers across several kokoro checks (see the example).
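To illustrate the split, here is a rough sketch of how each kokoro check could work out which slice of the test apps it owns. `shardRange` and the shard index are hypothetical; how a check would learn its index (an environment variable, a separate config per check) is an assumption and not something this repo defines. The app and shard counts in `main` are made up.

```go
package main

import "fmt"

// minInt returns the smaller of two ints.
func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// shardRange returns the half-open range [start, end) of test apps that the
// shard with the given 0-based index should run when total apps are split
// across shards checks. Sizes differ by at most one app.
func shardRange(total, shards, index int) (start, end int) {
	if shards <= 0 || index < 0 || index >= shards {
		return 0, 0
	}
	base, extra := total/shards, total%shards
	// The first `extra` shards each take one additional app.
	start = index*base + minInt(index, extra)
	end = start + base
	if index < extra {
		end++
	}
	return start, end
}

func main() {
	const totalApps, shards = 120, 5 // illustrative numbers only
	for i := 0; i < shards; i++ {
		start, end := shardRange(totalApps, shards, i)
		fmt.Printf("check %d runs apps [%d, %d)\n", i, start, end)
	}
}
```

With each range running as its own check, a PR's status stays pending until every check reports, and no single build has to fit all the apps into the 2h limit.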

vi3k6i5 · 2021-05-20T16:31:21Z

Ideally, kokoro should not report success until all its child workers have finished executing. I will look into how to change the kokoro return so that it waits until all workers have either succeeded or failed.

vi3k6i5 · 2021-05-20T18:24:31Z

Resolved by making kokoro return status after all tests are finished.

IlyaFaer added the type: process and api: spanner labels Aug 26, 2020

mf2199 added the priority: p1 label Aug 26, 2020

IlyaFaer changed the title ~~Avoid parallelizing tests~~ Change tests parallelizing mechanism Aug 27, 2020

This was referenced Aug 28, 2020

feat: Stage 2 of nox implementation - adding docs target #473

Merged

feat: cursor must detect if the parent connection is closed #463

Merged

Add nox target for django tests #474

Closed

c24t mentioned this issue Sep 10, 2020

[Cleanup] Don't use django fork in build script #485

Open

c24t added this to the Beta milestone Oct 22, 2020

c24t mentioned this issue Jan 22, 2021

test: use parallel workflows to run Django tests #569

Merged

c24t removed this from the 11/5 Preview Release milestone Feb 4, 2021

vi3k6i5 self-assigned this May 20, 2021

vi3k6i5 closed this as completed May 20, 2021
