[ci-matrix] Distribute unknown packages between shards. #339

Pawka · 2023-06-16T10:04:02Z

Currently, all new packages that are not in the timing files are assigned to a single bucket, the one with the shortest duration. This happens because the duration of "unknown" packages is set to 0, although in reality they may take longer. There are also empty packages with no tests that have a 0s duration, but running them still carries a small overhead.

This patch changes how those packages are distributed between shards so that they are not all added to a single bucket. If a package's run duration is below a threshold, it is assigned using a hash function instead of going to the minimal bucket. The hash function is consistent, meaning the same package is assigned to the same bucket as long as the package's name and the number of buckets do not change. This helps to reuse the Go build cache.

The bucket is selected by calculating the MD5 hash of the package's name, taking part of the hash, converting it to an integer, and taking the modulus by the bucket count. Only part of the hash is converted, to avoid overflow: converting the full MD5 hex digest to an integer could give a result larger than an int64 can hold. I've tested the distribution of this method and it seems to be pretty balanced. The first column shows how many packages were assigned to each bucket.

$ go list ./... | gotestsum tool ci-matrix --timing-files="path_to_files/*.json" --partitions 16 --debug 2>debug.log
$ grep "bucket \d+" -Eo < debug.log  | sort | uniq -c
  76 bucket 0
  64 bucket 1
  70 bucket 10
  78 bucket 11
  75 bucket 12
  78 bucket 13
  55 bucket 14
  60 bucket 15
  59 bucket 2
  71 bucket 3
  59 bucket 4
  67 bucket 5
  78 bucket 6
  65 bucket 7
  67 bucket 8
  74 bucket 9
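
The bucket selection described above can be sketched roughly as follows. This is a minimal standalone illustration, not the exact code from the PR: the package names in `main` are hypothetical, and the guard for `n == 0` is an added assumption.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"strconv"
)

// consistentBucket maps a package name to a bucket index in [0, n).
// Only the first 10 hex digits (40 bits) of the MD5 digest are parsed,
// so the parsed value always fits in a uint64.
func consistentBucket(pkg string, n uint) int {
	if n == 0 {
		return 0 // assumption: treat "no buckets" as bucket 0
	}
	sum := md5.Sum([]byte(pkg))
	prefix := hex.EncodeToString(sum[:])[:10]
	v, err := strconv.ParseUint(prefix, 16, 64)
	if err != nil {
		// Not expected for hex input; fall back to the first bucket.
		return 0
	}
	// Take the modulus before converting to int so the result is never negative.
	return int(v % uint64(n))
}

func main() {
	// Hypothetical package names, used only for illustration.
	pkgs := []string{"example.com/a", "example.com/b", "example.com/c"}
	for _, p := range pkgs {
		fmt.Printf("%s -> bucket %d\n", p, consistentBucket(p, 16))
	}
}
```

The result depends only on the package name and the bucket count, which is what keeps the assignment stable between runs.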

This approach can be used when the sharding strategy is not updated on each test run but is reused for an extended period (e.g. a week), because new tests will still be distributed among shards.

Currently, all new packages that are not in the timing files are assigned to a single bucket, the one with the shortest duration. This happens because the duration of "unknown" packages is set to 0, although in reality they may take longer. Empty packages with no tests also have a 0s duration, but running them still carries a small overhead. To avoid adding all such packages to a single bucket, change how they are distributed between shards. If a package's run duration is below a threshold, assign it using a hash function instead of the minimal bucket.

Pawka · 2023-06-16T11:37:24Z

P.S. We might want to make this feature opt-in by providing the threshold via a flag.

dnephin · 2023-06-17T16:39:37Z

Thank you for the PR! This would be a great thing to improve. It has been a little while since I've looked at this code. I believe the only special case it was handling previously was when no packages had any timing data. In that case it was using round robin to distribute the packages.

Generalizing it so that all packages with 0 timing data are round robin distributed seems like a great improvement. A couple questions for you about this approach. What's the reason for using a threshold of 5ms instead of using 0? Is it common for your packages to start out with very few tests?

What do you think about an approach like this:

- Add a TotalZeroTime field to the bucket struct, which will count the number of packages assigned to that bucket that had 0 elapsed time.
- When placing a package with 0 elapsed time, round robin between all the buckets by looking for the bucket with the smallest TotalZeroTime. That way new packages should be evenly distributed between the buckets, starting with the bucket with the least amount of time.
- As a special case we could avoid the largest bucket when placing packages with 0 elapsed time. Since the largest bucket is likely going to be the bottleneck, we don't want to accidentally assign a new package to that bucket.

Pawka · 2023-06-19T05:50:17Z

Thanks for the reply!

> What's the reason for using a threshold of 5ms instead of using 0?

It is just an arbitrary number I picked to treat a test as negligible. Even 100 tests of 5ms each won't increase the duration of a bucket significantly, so it might make sense to distribute those among buckets. But I feel we might want to make this value configurable via a flag. What do you think?

> As a special case we could avoid the largest bucket when placing packages with 0 elapsed time. <...>

100% to this. It is better to avoid adding additional tests to the "longest" bucket. I was wondering whether it is enough to exclude just one bucket, or some arbitrary share of the buckets, such as 20% but no fewer than 1?

A few comments on round robin vs. consistent assignment, since we need to agree on which path to choose.

I see two ways ci-matrix can be used:

1. Run the command before each CI job, based on recent timing files from previous builds. This way there is a high probability that packages will be distributed to different partitions each time. The Go build cache hit rate might not be high if agents are not long-lived. This approach is fine for codebases that are not huge.
2. The generated ci-matrix distribution file is reused for an extended period. The benefit of this approach is that the same tests are always assigned to the same shards, which might help to achieve a higher Go build cache hit rate.

The cache hit rate for 0s packages is not important, but new packages that are not in the distribution file are also treated as 0s, and in reality those will run longer. If the 2nd approach is used, it is better to stay consistent and assign a specific package to the same bucket so the build cache can be reused. To stay consistent, the round robin approach cannot be used, because a new package would shift all packages after it by one.
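
The shifting effect can be illustrated with a small sketch. It assumes packages are placed round robin in name order, and it uses FNV-1a as a short stand-in for the MD5-based function in this PR; all package names are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// roundRobin assigns packages to n buckets in sorted name order.
func roundRobin(pkgs []string, n int) map[string]int {
	sorted := append([]string(nil), pkgs...)
	sort.Strings(sorted)
	out := make(map[string]int, len(sorted))
	for i, p := range sorted {
		out[p] = i % n
	}
	return out
}

// hashed assigns each package using a hash of its name alone.
func hashed(pkgs []string, n int) map[string]int {
	out := make(map[string]int, len(pkgs))
	for _, p := range pkgs {
		h := fnv.New32a()
		h.Write([]byte(p)) // Write on a hash.Hash never returns an error
		out[p] = int(h.Sum32() % uint32(n))
	}
	return out
}

// moved counts packages present in both maps whose bucket changed.
func moved(before, after map[string]int) int {
	count := 0
	for p, b := range before {
		if a, ok := after[p]; ok && a != b {
			count++
		}
	}
	return count
}

func main() {
	old := []string{"pkg/b", "pkg/c", "pkg/d"}
	withNew := append([]string{"pkg/a"}, old...) // a new package sorts first

	fmt.Println("round robin moved:", moved(roundRobin(old, 3), roundRobin(withNew, 3)))
	fmt.Println("hashed moved:", moved(hashed(old, 3), hashed(withNew, 3)))
}
```

In this example all three existing packages change bucket under round robin, while none move under hashing, because a hashed assignment does not depend on the other packages in the list.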

Let me know if you think this is something you'd like to adopt and I can make changes to this PR.

Pawka · 2023-06-21T08:00:26Z

After some investigation I've found that the best way is just to ignore packages without tests at all. Running go test for a package with no tests still takes time. In my experiment running go test for ~70 empty packages took ~30-120s (with no cache). This might become significant overhead for huge repositories.

@dnephin what do you think about excluding packages from matrix if Elapsed value is 0?

dnephin · 2023-06-21T15:59:35Z

> To stay consistent, the round robin approach cannot be used, because a new package would shift all packages after it by one.

One important detail, which may not be obvious, is that the packages are sorted by elapsed time before attempting to place them into buckets. This means that packages with 0 elapsed time will always be last. I think that makes it safe to round-robin them, because only new packages or packages with no tests could come later. It will never impact existing packages with non-zero elapsed time.

> ignore packages without tests at all

How do we know if the package has no tests, or if the package is new and has no previous runtime? I'm not sure if this is safe, because it sounds like it could accidentally skip tests in new packages.

> In my experiment running go test for ~70 empty packages took ~30-120s

Do you think it is common for projects to have that many packages without tests? I've seen projects with a few small packages with no tests (maybe 10-20 packages), but generally the time to compile those packages is negligible compared to the time required to run all the other tests.

Pawka · 2023-06-21T18:53:52Z

Yeah, I agree with you: a standard project probably won't have many packages without tests.

On the flip side, empty packages can be excluded without any changes to gotestsum by using `go list -json ./... | jq 'select(.TestGoFiles | length > 0) | .ImportPath' -r` instead of a simple `go list ./...`. So it is probably not worth extending gotestsum here. Sorry for jumping back and forth :)

I'll update this PR soon to support the round robin approach.

lasiar · 2023-11-17T02:02:24Z

cmd/tool/matrix/matrix.go

+
+func consistentBucket(pkg string, n uint) int {
+	h := md5.New()
+	io.WriteString(h, pkg)
+	hash := fmt.Sprintf("%x", h.Sum(nil))
+	decimal, _ := strconv.ParseUint(hash[:10], 16, 64)
+	return int(decimal) % int(n)


Maybe use maphash.Hash{} ?

Suggested change

-func consistentBucket(pkg string, n uint) int {
-	h := md5.New()
-	io.WriteString(h, pkg)
-	hash := fmt.Sprintf("%x", h.Sum(nil))
-	decimal, _ := strconv.ParseUint(hash[:10], 16, 64)
-	return int(decimal) % int(n)
+var hashSeed = maphash.MakeSeed()
+
+func consistentBucket(pkg string, n uint) int {
+	return int(maphash.String(hashSeed, pkg) % uint64(n))

maphash was designed specifically for this case.

Pros:

- fastest hash in the standard library
- zero allocations

Thanks!
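
One thing that may be worth checking before adopting the suggestion: a seed from maphash.MakeSeed is random and differs between processes, so the resulting bucket would be stable only within a single run. If assignments need to stay the same across separate ci-matrix invocations (the build cache argument earlier in this thread), a fixed, unseeded hash may be a better fit. A minimal sketch using FNV-1a from hash/fnv, offered as a possible alternative rather than what the PR or the review proposes:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// consistentBucket maps a package name to a bucket index in [0, n)
// using FNV-1a, which gives the same value in every process.
func consistentBucket(pkg string, n uint) int {
	if n == 0 {
		return 0 // assumption: treat "no buckets" as bucket 0
	}
	h := fnv.New64a()
	h.Write([]byte(pkg)) // Write on a hash.Hash never returns an error
	// Reduce in uint64 first so the converted result cannot be negative.
	return int(h.Sum64() % uint64(n))
}

func main() {
	// Hypothetical package names, used only for illustration.
	for _, p := range []string{"example.com/a", "example.com/b"} {
		fmt.Printf("%s -> bucket %d\n", p, consistentBucket(p, 16))
	}
}
```

Whether the speed difference between hash functions matters here is unclear, since the function runs once per package.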
