
Execution segments for partitioning work between k6 instances #997

Closed · na-- opened this issue Apr 9, 2019 · 8 comments

na-- commented Apr 9, 2019

This is a proposal for an implementation approach/algorithm for distributing "work" between multiple semi-independent k6 instances in a distributed/cloud execution scenario. I think I have a clear idea of how it could work, and I'd appreciate any feedback, especially if something seems fishy... I plan to have this supported in the new schedulers, and unit-tested fairly extensively, but still.

Requirements

We need a way to reproducibly, correctly, consistently, and efficiently distribute work between multiple k6 instances with the minimum amount of coordination and communication between them, and with no central scheduler.

There shouldn't be any off-by-one errors, rounding mistakes, VU numbers jumping up and down, or any other scheduling inconsistencies. Basically, launching the same test multiple times should produce identical scheduling outcomes - there shouldn't be any randomness involved with the way we distribute the work (i.e. VUs/iterations/iterations-per-second/etc., or even things like JSON/CSV/XML records) across the different instances, load zones, distributions, etc.

Proposal

So, the basic idea of my proposal is this: we should use rational numbers [1] to divide the test execution work into (from, to] segments [2] for each instance, according to the user-specified distributions and any other placement constraints. from and to are rational numbers between 0 and 1 - they can be specified as percentages, fractions like 1/3, or decimals like 0.25. As an example, if we want to split a single test equally between 3 different k6 instances, here's how that would look:

k6 run --execution-segment "0:1/3" sameScriptSameOptions.tar  # instance 1
k6 run --execution-segment "1/3:2/3" sameScriptSameOptions.tar # instance 2
k6 run --execution-segment "2/3:1" sameScriptSameOptions.tar # instance 3

If each k6 instance received only the percentage/ratio of its share of the work (e.g. 1/3, 25%, etc.), that wouldn't be enough to cover the requirements listed above. To get correct distributions, we'd either need an external scheduler, or each k6 instance would need to know the percentage details of all other instances involved in the test, so it could independently and correctly figure out its own workload portion, accounting for things like rounding and off-by-one errors. And even then, calculating all of that would be somewhat annoying and wasteful...

Instead, if each k6 instance receives a (from, to] segment of the work [3], it should be able to correctly, efficiently and consistently calculate the amount of work it should be doing (e.g. how many VUs it should have running) at any given moment with absolute precision. Each instance won't need to know what the work shares of the other k6 instances are, or even how many other k6 instances are involved - just the size and position of its own slice of the work "pie". And for the whole duration of the test, the number of VUs/iterations-per-second/etc. scheduled among all of the k6 instances should be exactly equal to what it would have been if the test was scheduled to run locally, on a single machine! Magic ✨ (the good kind 😄)

Calculations

Here's how the math works... I think 😅 I'd be very grateful for any corrections here, since it's kind of important that I'm not mistaken 😄

Since the whole interval is from 0 to 1 (i.e. 0% - 100%), if each instance knows its own (from, to] segment, then it can easily calculate the segment that encompasses all of the other instances "before" it, however many those are. Let's call that prevSegment - it will either be empty (if from is 0), or it will be (0, from]. I think that this is all that's necessary for an instance to calculate, with absolute precision, every scaling of its VUs/iterations/iterations-per-second/etc. that it needs to do. Let's use VUs as the example - here's how we can calculate how many VUs an individual instance should have, if TotalVUs is the number of VUs that should be active for the whole test:

floor(
    remainder(TotalVUs * prevSegmentLength)  // rational number less than 1, will be 0 for the first segment
    +
    (TotalVUs * currentSegmentLength) // rational number
) // floor gives us only the integer part of the resulting rational number, the remainder will be taken into account in the calculations for the following segments
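
To make the arithmetic above concrete, here's a minimal Go sketch using math/big's big.Rat (the scaleInt helper and its names are made up for illustration, they're not the actual k6 code):

package main

import (
	"fmt"
	"math/big"
)

// scaleInt is a sketch of the formula above: it returns
// floor( rem(value * from) + value * (to - from) ) using exact rational math.
func scaleInt(value int64, from, to *big.Rat) int64 {
	// "prevSegment" contribution: the fractional part of value * from
	prev := new(big.Rat).Mul(new(big.Rat).SetInt64(value), from)
	prevFloor := new(big.Int).Quo(prev.Num(), prev.Denom()) // floor, since everything is non-negative
	rem := new(big.Rat).Sub(prev, new(big.Rat).SetInt(prevFloor))

	// this segment's "raw" share: value * (to - from)
	share := new(big.Rat).Mul(new(big.Rat).SetInt64(value), new(big.Rat).Sub(to, from))

	sum := new(big.Rat).Add(rem, share)
	return new(big.Int).Quo(sum.Num(), sum.Denom()).Int64() // floor(sum)
}

func main() {
	// 7 VUs split equally between 3 instances
	bounds := []*big.Rat{big.NewRat(0, 1), big.NewRat(1, 3), big.NewRat(2, 3), big.NewRat(1, 1)}
	for i := 0; i < 3; i++ {
		fmt.Println(scaleInt(7, bounds[i], bounds[i+1]))
	}
}

For three equal segments it prints 2, 2 and 3, matching the first example table below.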

Examples

Here's a table with an example. Say, we have 7 VUs that we want to split equally between 3 instances:

| InstID | segment | rem(prevSegment) | segmentLength * VU | sum | floor(sum) |
|---|---|---|---|---|---|
| 1 | (0 - ⅓] | 0 | ⅓ * 7 = 2⅓ | 2⅓ | 2 |
| 2 | (⅓ - ⅔] | rem(⅓ * 7) = ⅓ | ⅓ * 7 = 2⅓ | 2⅔ | 2 |
| 3 | (⅔ - 1] | rem(⅔ * 7) = ⅔ | ⅓ * 7 = 2⅓ | 3 | 3 |

Notice how calculating the end value (last column) for each segment doesn't require any information besides the (from, to] values for that specific segment. And yet, when we sum the values in the last column, we get 7 - precisely the number of VUs we're distributing!

Another example, how to fairly split 7 VUs between 4 instances this time:

| InstID | segment | rem(prevSegment) | segmentLength * VU | sum | floor(sum) |
|---|---|---|---|---|---|
| 1 | (0 - ¼] | 0 | ¼ * 7 = 1¾ | 1¾ | 1 |
| 2 | (¼ - ½] | rem(¼ * 7) = ¾ | ¼ * 7 = 1¾ | 2½ | 2 |
| 3 | (½ - ¾] | rem(½ * 7) = ½ | ¼ * 7 = 1¾ | 2¼ | 2 |
| 4 | (¾ - 1] | rem(¾ * 7) = ¼ | ¼ * 7 = 1¾ | 2 | 2 |

Each k6 instance can precisely scale any arbitrary integer (i.e. indivisible, like VUs or instances or record numbers) or fractional (iterations/second) value in constant time, with just a few simple arithmetic operations! 🎉

Another example, dividing 11 VUs in a 20%, 25%, 45%, 10% distribution:

| InstID | segment | rem(prevSegment) | segmentLength * VU | sum | floor(sum) |
|---|---|---|---|---|---|
| 1 | (0.00 - 0.20] | 0 | 0.20 * 11 = 2.20 | 2.20 | 2 |
| 2 | (0.20 - 0.45] | rem(0.20 * 11) = 0.20 | 0.25 * 11 = 2.75 | 2.95 | 2 |
| 3 | (0.45 - 0.90] | rem(0.45 * 11) = 0.95 | 0.45 * 11 = 4.95 | 5.90 | 5 |
| 4 | (0.90 - 1.00] | rem(0.90 * 11) = 0.90 | 0.10 * 11 = 1.10 | 2.00 | 2 |

🎉 😄

And as a demonstration why the position of the segment is important, let's switch the order of the percentages in the above example, i.e. let's calculate the 11 VUs spread across a 10%, 45%, 20%, 25% distribution:

| InstID | segment | rem(prevSegment) | segmentLength * VU | sum | floor(sum) |
|---|---|---|---|---|---|
| 1 | (0.00 - 0.10] | 0 | 0.10 * 11 = 1.10 | 1.10 | 1 |
| 2 | (0.10 - 0.55] | rem(0.10 * 11) = 0.10 | 0.45 * 11 = 4.95 | 5.05 | 5 |
| 3 | (0.55 - 0.75] | rem(0.55 * 11) = 0.05 | 0.20 * 11 = 2.20 | 2.25 | 2 |
| 4 | (0.75 - 1.00] | rem(0.75 * 11) = 0.25 | 0.25 * 11 = 2.75 | 3.00 | 3 |

Of course, such differences are usually negligible when hundreds or thousands of VUs are involved. The important guarantee this algorithm gives us is that when we partition the workload into static segments, the calculations for each segment will always be consistent - the same inputs will be scaled to the same outputs. There shouldn't be any of the inconsistencies we've seen with VU numbers jumping up and down... And, since the calculations are so simple, they will be fast and will save us the headache of wondering how we should round stuff 😄

This approach is much saner than some we've previously discussed for complicated situations. Imagine the following scenario: linearly scaling up VUs from 0 to 6 over 1 minute and 10 instances ( startVUs=0, stage.target=6, stage.duration=60s, instances=10 ). How many VUs should each instance be running at time=30s? 😄 This is easy to calculate when you have a central scheduling authority, but it becomes quite annoying to do if each k6 instance has to individually do it. On the other hand, with from:to execution segments, it's floor( (3 * from) % 1 + 3 * (to - from) ), with rational math of course 😎
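
As a quick sanity check, here's a rough big.Rat snippet (with made-up names) that evaluates that exact expression for all 10 equal segments at time=30s and verifies that the per-instance VU counts add up to 3:

package main

import (
	"fmt"
	"math/big"
)

// floorRat returns the integer part of a non-negative rational number.
func floorRat(r *big.Rat) int64 {
	return new(big.Int).Quo(r.Num(), r.Denom()).Int64()
}

func main() {
	total := int64(0)
	for k := int64(0); k < 10; k++ {
		from, to := big.NewRat(k, 10), big.NewRat(k+1, 10)
		// floor( (3 * from) % 1 + 3 * (to - from) )
		scaled := new(big.Rat).Mul(big.NewRat(3, 1), from)
		frac := new(big.Rat).Sub(scaled, new(big.Rat).SetInt64(floorRat(scaled)))
		sum := new(big.Rat).Add(frac, new(big.Rat).Mul(big.NewRat(3, 1), new(big.Rat).Sub(to, from)))
		vus := floorRat(sum)
		fmt.Printf("instance %d: %d VUs\n", k+1, vus)
		total += vus
	}
	fmt.Println("total:", total) // prints 3
}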

Data partitioning

I initially thought that we should disable the shared-iterations scheduler in distributed/cloud execution, since we'd have to split the iterations between instances, and different instances may finish at very different times, depending on latency, capabilities or load.

But, given that we can support that execution mode via the execution segments described above with no extra effort, I no longer feel that way 😄 we may emit a warning that "the test will be executed on multiple machines and different iteration groups can finish at different times due to network latency", or something like that, but we should still support shared iterations in a distributed/cloud execution environment 🙂

Besides being useful to some users and easy to do with execution segments, shared iterations can serve another purpose 🙂 I thought some more about how we can implement the use case of "parsing a huge CSV/JSON/XML/etc. file and using each entry from that file exactly once" in k6 - both locally in a single k6 instance and spread across multiple different instances.

I think the answer may be the following combination of things:

  • streaming CSV/JSON/XML/etc. parsers [4] - this way, we can relatively quickly get the number of items in a data file, and we can also start iterating over data entries from some arbitrary point in the middle of the file, while not needing to load the whole file in memory
  • being able to specify when you parse a data file that you should do it in an execution-segment-aware way - then each k6 instance will restrict itself to its own "slice" of the data entries in the file, as determined by its own execution segment 😎
  • a shared-iterations scheduler... thus my desire to support it 😄
  • having a better archive format than the current tar one [5]

And... that's it, I think 😄 🎉 If we have the parts above, we can get all of the rest in a user-friendly, natural and composable way. The initial reading of the file and finding the correct starting place for each k6 instance (according to its own segment) could take some time, even with streaming parsers - if the file is huge, or if we do some processing like XPath/JSONPath/etc. And finding the total count of data entries in that file may also take a few seconds. But that's fine in both cases, since it will happen in the init phase 🙂 And we'll never actually need to load the whole file in memory!

Here's a made-up code demo with made-up CSV APIs:

let myCSVData = parseCSV("data.csv", { segment: true });

export let options = {
   execution: {
       myCoolOncePerDataItemScheduler: {
           type: "shared-iterations",
           vus: 123,
           iterations: myCSVData.GetTotalCount(),
       },
   },
   // ...
};

export default function () {
   let myUniqueDataPiece = myCSVData.ReadRow();
   http.get(myUniqueDataPiece[1]);
   // ...
}

Notes

[1] Rational numbers are pretty straightforward in Go, they even have a cool name - big.Rat 😄

[2] Not sure if "segments" is the best term. Other potential options: intervals, portions, shares

[3] UX bonus - users can use the --execution-segment option as a tool for quick and dirty scaling down of a load test! For example, if they want to run the test with only 10% of the load without fiddling with all of the script options, they can just run it with k6 run --execution-segment 10% script.js (equivalent to --execution-segment="0:10%")

[4] There are Go libraries for those like https://github.com/jszwec/csvutil for streaming CSV parsing. And for JSON and XML, the Go stdlib support may be enough: https://golang.org/pkg/encoding/json/#Decoder and https://golang.org/pkg/encoding/xml/#Decoder. Though, as we've discussed in other issues, XPath/JSONPath may be more problematic.

[5] In my head, the ideal new k6 archive format would be something that:

  • Compressed, so that we can accommodate reasonably large static CSV/XML/JSON data files, while still being able to ship the archive to a server or save it in S3.
  • Indexable, so that we wouldn't have to traverse each file and keep it in memory: .zip files have an actual file system, while .tar files are just a stream of chunks...
  • Streamable, so that we won't even have to extract files from the archive to be able to run it, while at the same time not needing to load the whole archive in memory.

Go has pretty good native support for archives, both in the standard library, and via community libraries, so we should have plenty of good choices here...

Connected issues:


na-- commented Apr 18, 2019

I've mostly implemented this, and it seems that it's working as intended for now, pending more tests... 😄 I've actually thought of a more accurate way to split work between different instances with execution segments, one that has the same nice properties as the approach described above, and I've documented it at the bottom of this post. But more importantly, I've realized that there is one tricky caveat we have to be wary of when partitioning tests between multiple instances, so I'll start with that:

Caveat when splitting work between instances

Let's say that a certain test with multiple different schedulers has x MaxVUs with normal (i.e. 100% - no execution segment, single instance) execution. If we split it into different segments, the execution of each of those segments will have y1, y2, ..., yn MaxVUs. The catch is that y1 + y2 + ... + yn may be more than x, due mostly to the fact that schedulers can operate both concurrently and in sequence. Here's an example:

import { sleep } from "k6";

export let options = {
  execution: {
    per_vu_iters: {
      type: "per-vu-iterations",
      vus: 5,
      iterations: 5,
      maxDuration: "5s",
      gracefulStop: "0s",
    },
    shared_iters: {
      type: "shared-iterations",
      vus: 5,
      iterations: 25,
      maxDuration: "5s",
      gracefulStop: "0s",
    },
    constantrate: {
      type: "constant-arrival-rate",
      rate: 2,
      timeUnit: "1s",
      duration: "5s",
      preAllocatedVUs: 10,
      maxVUs: 10,
      gracefulStop: "0s",
      startTime: "5s",
    },
  },
};

export default function () {
  sleep(0.5 + Math.random() * 0.5);
}

With a normal 100% execution, this test will run with 10 MaxVUs for 10 seconds in total (notice the 5s startTime in constantrate and the gracefulStop: 0s everywhere). But if we run it on two instances, the first with --execution-segment=0:0.5 and the second with --execution-segment=0.5:1, then the first instance will have 6 MaxVUs and the second one 5 MaxVUs... Definitely a "wait, wat?" moment 😄

The reason for this is pretty simple and intuitive once you've grokked things. Basically it's the result of the combination of the following 3 statements: 1) different schedulers can run simultaneously, 2) VUs are indivisible entities, 3) VUs are reused between schedulers when it's safe to do so, i.e. when their execution times are not overlapping.

In the example above, per_vu_iters and shared_iters are running in parallel for the first 5 seconds (each with 5 VUs) and constantrate is running for the next 5 seconds with 10 MaxVUs. So, because there isn't any overlap between the first two and the third scheduler (gracefulStop is 0s in all three), 10 VUs will be enough to run the whole test in a single instance. But let's calculate the MaxVUs when we split the test into two equal segments:

| Segment | Scheduler | MaxVUs |
|---|---|---|
| 0% - 50% | per_vu_iters | 3 |
| 0% - 50% | shared_iters | 3 |
| 0% - 50% | constantrate | 5 |
| 50% - 100% | per_vu_iters | 2 |
| 50% - 100% | shared_iters | 2 |
| 50% - 100% | constantrate | 5 |

As you can see, both per_vu_iters and shared_iters split their VUs 3:2 between the two segments, and since the two run in parallel, the split for the test is 6:4. But constantrate runs after the two iteration-based schedulers and has 10 MaxVUs in total, thus splitting them 5:5 between the segments. And, since VUs are reused between schedulers (when possible, i.e. when there's no overlap), the MaxVUs for the whole first segment is Max( (3+3), 5) = 6, and the MaxVUs for the second segment is Max( (2+2), 5) = 5. Thus, 11 max VUs in total when we split the script across 2 instances, even though 10 VUs would have been enough had we run it on one...

This isn't a huge deal, I'm just mentioning it as something to keep in mind when we're splitting execution between multiple k6 instances. Once we generate an execution plan (i.e. how many max VUs would be needed at any given time in the test execution), we can't correctly scale those aggregated values. We could scale them if we only want a rough estimate, but for precise MaxVU calculations, we should generate an execution plan only after we've applied the execution segment value.

More accurate algorithm

As mentioned in the beginning, I think I have a better algorithm for calculating values in different segments. It's pretty similar to the one described in the first message and has the same nice properties, but it should be slightly more accurate and slightly simpler.

The original algorithm for scaling an indivisible integer value was basically floor( (value * from) % 1 + value * (to - from) ). By using remainders and floor() to get the resulting whole numbers, some unnecessary fractions could accumulate between the segments and slightly skew the results in favor of later segments. In the example of splitting 11 VUs in a 20%, 25%, 45%, 10% distribution, see how both the 20% segment and the 10% segment got 2 VUs. And if we flip the distribution to be 10%, 45%, 25%, 20%:

| InstID | segment | rem(prevSegment) | segmentLength * VU | sum | floor(sum) |
|---|---|---|---|---|---|
| 1 | (0.00 - 0.10] | 0 | 0.10 * 11 = 1.10 | 1.10 | 1 |
| 2 | (0.10 - 0.55] | rem(0.10 * 11) = 0.10 | 0.45 * 11 = 4.95 | 5.05 | 5 |
| 3 | (0.55 - 0.80] | rem(0.55 * 11) = 0.05 | 0.25 * 11 = 2.75 | 2.80 | 2 |
| 4 | (0.80 - 1.00] | rem(0.80 * 11) = 0.80 | 0.20 * 11 = 2.20 | 3 | 3 |

See how suddenly the 20% segment has 3 VUs, while the 25% segment has only 2. It's not the end of the world - these errors won't ever get bigger than 1, simply because they're caused by floor(), i.e. by leaving fractions unused.

But they're also unnecessary 😄! Instead of using floor() and remainders from [0, 1), we can use rounding and have remainders from [0, 0.5) in absolute value. It will still be rounding of rational numbers, not floating point ones, so it will be nice to work with and we won't have to worry about floating point representation and accuracy problems. Here's how we can calculate the scaling of an integer value for a given (from, to] segment:

round( (value * from) - round(value * from) + (value * (to - from)) )

(value * from) - round(value * from) is basically the "error" of the prevSegment's value - the same role rem(prevSegment) played before, only instead of being in [0, 1), its value will be in (-0.5, 0.5), so basically half the "error". The other nice property of the above equation is that once you open the brackets, it reduces to:

round( (value * to) - round(value * from) )
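
Here's a minimal big.Rat sketch of that reduced formula (the helper names are made up; big.Rat has no built-in rounding, so the sketch rounds half-up by adding ½ and taking the integer part):

package main

import (
	"fmt"
	"math/big"
)

// roundRat rounds a non-negative rational half-up, i.e. round(r) = floor(r + ½).
func roundRat(r *big.Rat) int64 {
	sum := new(big.Rat).Add(r, big.NewRat(1, 2))
	return new(big.Int).Quo(sum.Num(), sum.Denom()).Int64()
}

// segmentValue implements round( value*to - round(value*from) ) from above.
func segmentValue(value int64, from, to *big.Rat) int64 {
	v := new(big.Rat).SetInt64(value)
	a := new(big.Rat).Mul(v, to)                                    // value * to
	b := new(big.Rat).SetInt64(roundRat(new(big.Rat).Mul(v, from))) // round(value * from)
	return roundRat(new(big.Rat).Sub(a, b))
}

func main() {
	// 7 VUs split equally between 3 instances
	bounds := []*big.Rat{big.NewRat(0, 1), big.NewRat(1, 3), big.NewRat(2, 3), big.NewRat(1, 1)}
	for i := 0; i < 3; i++ {
		fmt.Println(segmentValue(7, bounds[i], bounds[i+1]))
	}
}

It prints 2, 3 and 2, matching the first table below.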

Here are the same previous examples, but with the new formula:

7 VUs/iterations/whatever-integer split equally between 3 instances:

| InstID | segment | A = to * value | B = round(from * value) | A-B | round(A-B) |
|---|---|---|---|---|---|
| 1 | (0 - ⅓] | ⅓ * 7 = 2⅓ | 0 | 2⅓ | 2 |
| 2 | (⅓ - ⅔] | ⅔ * 7 = 4⅔ | round(2⅓) = 2 | 2⅔ | 3 |
| 3 | (⅔ - 1] | 1 * 7 = 7 | round(4⅔) = 5 | 2 | 2 |

7 split equally between 4 instances:

| InstID | segment | A = to * value | B = round(from * value) | A-B | round(A-B) |
|---|---|---|---|---|---|
| 1 | (0 - ¼] | ¼ * 7 = 1¾ | 0 | 1¾ | 2 |
| 2 | (¼ - ½] | ½ * 7 = 3½ | round(¼ * 7) = 2 | 1½ | 2 |
| 3 | (½ - ¾] | ¾ * 7 = 5¼ | round(½ * 7) = 4 | 1¼ | 1 |
| 4 | (¾ - 1] | 1 * 7 = 7 | round(¾ * 7) = 5 | 2 | 2 |

11 in a 20%, 25%, 45%, 10% distribution:

| InstID | segment | A = to * value | B = round(from * value) | A-B | round(A-B) |
|---|---|---|---|---|---|
| 1 | (0.00 - 0.20] | 0.20 * 11 = 2.20 | 0 | 2.20 | 2 |
| 2 | (0.20 - 0.45] | 0.45 * 11 = 4.95 | round(0.20 * 11) = 2 | 2.95 | 3 |
| 3 | (0.45 - 0.90] | 0.90 * 11 = 9.90 | round(0.45 * 11) = 5 | 4.90 | 5 |
| 4 | (0.90 - 1.00] | 1.00 * 11 = 11 | round(0.90 * 11) = 10 | 1 | 1 |

11 in a 10%, 45%, 20%, 25% distribution:

| InstID | segment | A = to * value | B = round(from * value) | A-B | round(A-B) |
|---|---|---|---|---|---|
| 1 | (0.00 - 0.10] | 0.10 * 11 = 1.10 | 0 | 1.10 | 1 |
| 2 | (0.10 - 0.55] | 0.55 * 11 = 6.05 | round(0.10 * 11) = 1 | 5.05 | 5 |
| 3 | (0.55 - 0.75] | 0.75 * 11 = 8.25 | round(0.55 * 11) = 6 | 2.25 | 2 |
| 4 | (0.75 - 1.00] | 1.00 * 11 = 11 | round(0.75 * 11) = 8 | 3 | 3 |

11 in a 10%, 45%, 25%, 20% distribution:

| InstID | segment | A = to * value | B = round(from * value) | A-B | round(A-B) |
|---|---|---|---|---|---|
| 1 | (0.00 - 0.10] | 0.10 * 11 = 1.10 | 0 | 1.10 | 1 |
| 2 | (0.10 - 0.55] | 0.55 * 11 = 6.05 | round(0.10 * 11) = 1 | 5.05 | 5 |
| 3 | (0.55 - 0.80] | 0.80 * 11 = 8.80 | round(0.55 * 11) = 6 | 2.80 | 3 |
| 4 | (0.80 - 1.00] | 1.00 * 11 = 11 | round(0.80 * 11) = 9 | 2 | 2 |

Same nice properties as the old algorithm, but even better! Each segment's end value is still consistently and correctly calculated based only on the from and to values for that particular segment, and summing the end values of all segments results in exactly the original value. The part that's better is that the segment results are much more stable and don't depend on their position nearly as much as before. That's especially apparent in the examples with 11 VUs - the order of the distribution produced 3 slightly different sets of values with the old algorithm, but it produces consistent values with the new one.


na-- commented May 17, 2019

After this comment yesterday, I thought a bit more about how data could be partitioned between multiple independent k6 instances without synchronization between them.

The data would need to be something with a known discrete size in the init context, so that the scheduler VUs/iterations could be synchronized by it in some way. But other than that, it could be anything - bytes in a file, lines in a text file, lines in a CSV file, JSON objects, JSONPath/XPath query results, etc.

Then there are a few possible ways to distribute (i.e. iterate over) these discrete pieces of data between the iterations. Say we have 10 pieces of data and we want to spread them in a 50%, 30%, 20% distribution:

| Piece # | Sequential | Striped |
|---|---|---|
| 0 | D1 | D1 |
| 1 | D1 | D2 |
| 2 | D1 | D1 |
| 3 | D1 | D3 |
| 4 | D1 | D1 |
| 5 | D2 | D2 |
| 6 | D2 | D1 |
| 7 | D2 | D3 |
| 8 | D3 | D1 |
| 9 | D3 | D2 |

Whether this striping is something that should actually be a responsibility of the data segmentation functions in k6 is still unclear to me, but it seems like maybe it should be... On the one hand, users should be able to achieve something pretty similar by just randomizing the source data. On the other hand, that won't work with streaming data sources like large files. And it's unclear how practical it would be when users want to iterate over the data set multiple times...


na-- commented Jan 20, 2020

Last Friday, while thinking about #1299 (comment), I had the seed of an idea how we can implement something like striped data partitioning. More importantly, this could also probably help us with the distribution of iterations between instances for the arrival-rate executors, so that they aren't grouped closely together (#1007 (comment), #1299). Sorry for the long wall of text... again... but I wanted to describe my thought process, since even though I think this solution should work, I'm hopeful that someone else can spot something simpler that I've missed.

The basic idea is that with the arrival-rate executors, both the constant and the variable one, we can look at the iterations as a sequence of discrete events. Similarly, for data partitioning, we can look at the data entries as a sequence. And we don't really need to know how long these sequences are - they can even be quasi-infinite, giving us the nice option of easily looping through small data files multiple times! Or of not bothering to calculate precisely the exact number of iterations a variable-arrival-rate executor should make... 😅 The important thing is that each instance, with its particular execution segment, just has to reliably, reproducibly, and independently figure out which elements out of that sequence of iterations/data chunks are apportioned to it.

As I've shown above, that's very easy to do if you know the total number of elements and don't mind big sequential blocks of iterations/data. Let's say we have 3 execution segments, ES0=(0 - ⅓], ES1=(⅓ - ½], ES2=(½ - 1], and we need to split a file with 60 data entries... Or, alternatively, we need to execute 6 iterations per second over 10 seconds, i.e. 60 total iterations. Let's name the sequence of iterations i0, i1, ..., i59.

The simplest (non-striped) way of splitting the iterations would cause ES0 to execute iterations i0 - i19, ES1 to execute i20 - i29, and ES2 to execute i30 - i59. We can easily do that by just segmenting the interval, but this partitioning isn't very useful for the arrival-rate execution, since most instances would sit idle until their turn comes, which isn't really the idea... And even though this approach would often work for data partitioning, as I mentioned earlier in the thread, it wouldn't always - sometimes we may need striped data partitioning. And sometimes, when we don't know the total length of the sequence, this approach isn't even possible...

Ideally, we'd like all instances to process data/iterations simultaneously. Additionally, for the arrival-rate executors, we'd like the iterations of each instance to be as spread-apart as possible, so we can use fewer VUs and so that the load is spread more evenly across the load generators. So, in our example above, the last instance (responsible for 50% of the iterations) shouldn't process i3, i4, i5, i9, i10, i11, ..., because if the iterations are too closely clumped together, the load would be uneven... Instead, ideally, the last instance should execute every second iteration, say i1, i3, i5, i7, ... - that way it will have a more consistent load and need fewer VUs.

So, my initial idea was that we can turn the information we have, the execution segment start and finish values, into a quasi-sequence as well. Only this sequence would be finite and made out of booleans, and we would use it as a circular "mine" / "not mine" filter for the iterations/data from the main sequence. Because it's finite and circular, the main sequence of data/iterations can even be infinite. And each instance, using its own execution segment, would generate its own "mine" sequence independently, and have it be disjoint with the "mine" sequences of all other instances, totally reproducibly.

This would have been easy to do if all execution segments had the same denominator... That is, if instead of 1/3, 1/2 and 1, we had 2/6, 3/6 and 6/6 in the above example... Because if all instances reliably knew the lowest common denominator (LCD) of all of the execution segments, we could simply use it as the length of our boolean sequence! If we iterate over the data entries with the variable i, then every element for which segmentStart < (i % LCD)/LCD <= segmentEnd is part of our segment! But because each instance knows only its own execution segment, it can calculate only the total size of all preceding and following segments, but not their exact values or even their number.
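
Just to make this hypothetical LCD variant concrete, the membership check could look roughly like the sketch below (an illustration only - as explained next, we can't actually rely on every instance knowing the LCD):

package main

import (
	"fmt"
	"math/big"
)

func main() {
	// Hypothetical: assume every instance somehow knows the common denominator
	// of all segments (6 here, for segment boundaries 2/6, 3/6 and 6/6).
	const lcd = 6
	// This instance's own segment, e.g. ES1 = (⅓ - ½]
	from, to := big.NewRat(1, 3), big.NewRat(1, 2)

	for i := int64(0); i < 12; i++ {
		// position of iteration i in the circular boolean filter, shifted by one
		// so the positions run over (0, 1] and match the half-open segments
		pos := big.NewRat((i%lcd)+1, lcd)
		mine := pos.Cmp(from) > 0 && pos.Cmp(to) <= 0
		fmt.Printf("i=%2d -> %s -> mine=%v\n", i, pos.RatString(), mine)
	}
}

With this segment, exactly 2 of every 12 iterations come out as "mine", i.e. the expected 1/6 share.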

But since we don't know the LCD of all execution segments, we have no way of guaranteeing disjoint sets between instances, i.e. that there wouldn't be 2 instances that would both consider a particular iteration to "belong" to them... And while we could require our users to use matching denominators for segments, I don't think that's the right approach... It would make execution segments very cumbersome to use and it would also be impossible to validate from any single k6 instance, so we can't even give meaningful feedback to our users if they make a mistake.

Instead, I think I figured out a way to get 95% of the benefits of perfectly striped data partitioning by just adding some deterministic pseudo-randomness in the mix. I remembered that you could use prime numbers to iterate through all cells of an array exactly once in a pseudo-random (but deterministic) manner! This is a short post I found on the topic, though there's probably a better explanation (or alternatives) in some algorithms book. This would have been sufficient if we had a single denominator for execution segments, but we don't, so we have to dig a little deeper...

Essentially, because we don't know the total number of iterations/data entries, and we don't want each instance to know all execution segments, we need to have a way to evenly map any integer number i (from 0 to infinity, i.e. the sequence number of each iteration/piece of data) to a rational number between 0 and 1 (i.e. the execution segment):

  • We probably don't quite need cryptography-level guarantees of evenness and fairness, but we need this function to be reasonably fair, i.e. for any value i, any value in the interval (0 - 1] (or [0 - 1)) should be roughly equally possible.
  • The result space of possible values in the interval (0 - 1] (or [0 - 1)) should be reasonably big, so that even very small segments can be hit. So, a result space of {1/10, 2/10, ..., 9/10, 1} is obviously too small, since an instance with an execution segment of 91/100:99/100 would never match any iteration. So, the number of possible results between 0 and 1, should be at least a few thousand, or, better yet, several million, so that even the smallest reasonable execution segments have a chance to be matched.
  • Each instance should be able to independently calculate the result of this function (for each iteration number/data piece number) and get the same result.
  • Since we're trying to fit a potentially infinite sequence into a finite field, there's no way to have a bijection, rather it would be a surjection, so we'd have some sort of cycles for big enough values of i. But I think that, in order to ensure even distribution, it's important to make sure that before we have any cycles in the results, we should have first returned all possible values exactly once, i.e. to have our algorithm be full cycle.

A simple demo of the concept is to just pick a random prime, and a random big-enough window size (100000 for example), and just do something like this for each iteration i:

  1. calculate X = (i * randomPrime) % window
  2. if X/window (as a rational number) is within the execution segment of our instance, we execute the iteration, otherwise we ignore it

Here's a short demo of the concept:

package main

import "fmt"

const (
	randomPrime = 17
	window      = 20 // size of the result space, i.e. the possible ratios 0/20 .. 19/20
	start       = 1  // arbitrary starting offset
)

func main() {
	for i := 1; i <= 500; i++ {
		if i%window == 1 {
			fmt.Print("\n") // new line at the start of each group of `window` iterations
		}
		// map the iteration number i to a pseudo-random ratio X/window
		fmt.Printf("%d/%d, ", (start+i*randomPrime)%window, window)
	}
}

the result would look like this:

18/20, 15/20, 12/20, 9/20, 6/20, 3/20, 0/20, 17/20, 14/20, 11/20, 8/20, 5/20, 2/20, 19/20, 16/20, 13/20, 10/20, 7/20, 4/20, 1/20,
18/20, 15/20, 12/20, 9/20, 6/20, 3/20, 0/20, 17/20, 14/20, 11/20, 8/20, 5/20, 2/20, 19/20, 16/20, 13/20, 10/20, 7/20, 4/20, 1/20,
...
18/20, 15/20, 12/20, 9/20, 6/20, 3/20, 0/20, 17/20, 14/20, 11/20, 8/20, 5/20, 2/20, 19/20, 16/20, 13/20, 10/20, 7/20, 4/20, 1/20,

This specific algorithm above is probably not the best approach, since it's not very random or even, but it illustrates the point. The next "level", which is what we probably want to use, is something like an LCG (linear congruential generator). It's not going to give us a perfectly even distribution (e.g. a 50% segment executing exactly every second iteration), but assuming the function is fairly random, it should be pretty close. In any case, whoever ends up implementing this would need to read up some more on the topic and thoroughly test things... Here are some potential resources on similar issues, and though I still haven't read them fully, a book or paper on the topic would probably be even better:


na-- commented Jan 20, 2020

As a more concrete example of the type of pseudo-randomness we need, I found a convenient Go implementation of a full-cycle LFSR: https://github.com/taylorza/go-lfsr (some discussion and other related links in here ).

So, this might not be the best algorithm for our use case (though it's probably good enough), and we'd definitely want to use at least the 16-bit version (and who needs more than 64k anyway...), but here's a short demo with the 8-bit LFSR implementation:

package main

import (
	"fmt"

	"github.com/taylorza/go-lfsr"
)

func main() {
	rnd := lfsr.NewLfsr8(13)
	for i := 1; i <= 255*3; i++ {
		next, nl := rnd.Next()
		fmt.Printf("%d,", next)
		if nl {
			fmt.Println()
		}
	}
}

it produces something like

134,195,225,240,248,124,190,223,111,183,219,237,246,123,189,94,175,215,235,117,186,93,46,23,139,69,34,17,8,132,194,97,176,216,108,54,27,141,198,227,241,120,60,158,207,231,115,57,156,206,103,51,25,140,70,163,209,104,180,90,45,150,75,37,18,137,68,162,81,40,148,74,165,82,169,84,42,149,202,229,114,185,220,238,119,187,221,110,55,155,205,230,243,121,188,222,239,247,251,253,126,191,95,47,151,203,101,50,153,204,102,179,89,172,86,43,21,138,197,98,49,24,12,6,131,193,224,112,184,92,174,87,171,85,170,213,234,245,250,125,62,159,79,167,83,41,20,10,133,66,33,144,200,228,242,249,252,254,255,127,63,31,15,135,67,161,208,232,244,122,61,30,143,199,99,177,88,44,22,11,5,2,1,128,64,32,16,136,196,226,113,56,28,142,71,35,145,72,164,210,233,116,58,29,14,7,3,129,192,96,48,152,76,38,147,73,36,146,201,100,178,217,236,118,59,157,78,39,19,9,4,130,65,160,80,168,212,106,181,218,109,182,91,173,214,107,53,154,77,166,211,105,52,26,13,
134,195,225,240,248,124,190,223,111,183,219,237,246,123,189,94,175,215,235,117,186,93,46,23,139,69,34,17,8,132,194,97,176,216,108,54,27,141,198,227,241,120,60,158,207,231,115,57,156,206,103,51,25,140,70,163,209,104,180,90,45,150,75,37,18,137,68,162,81,40,148,74,165,82,169,84,42,149,202,229,114,185,220,238,119,187,221,110,55,155,205,230,243,121,188,222,239,247,251,253,126,191,95,47,151,203,101,50,153,204,102,179,89,172,86,43,21,138,197,98,49,24,12,6,131,193,224,112,184,92,174,87,171,85,170,213,234,245,250,125,62,159,79,167,83,41,20,10,133,66,33,144,200,228,242,249,252,254,255,127,63,31,15,135,67,161,208,232,244,122,61,30,143,199,99,177,88,44,22,11,5,2,1,128,64,32,16,136,196,226,113,56,28,142,71,35,145,72,164,210,233,116,58,29,14,7,3,129,192,96,48,152,76,38,147,73,36,146,201,100,178,217,236,118,59,157,78,39,19,9,4,130,65,160,80,168,212,106,181,218,109,182,91,173,214,107,53,154,77,166,211,105,52,26,13,
134,195,225,240,248,124,190,223,111,183,219,237,246,123,189,94,175,215,235,117,186,93,46,23,139,69,34,17,8,132,194,97,176,216,108,54,27,141,198,227,241,120,60,158,207,231,115,57,156,206,103,51,25,140,70,163,209,104,180,90,45,150,75,37,18,137,68,162,81,40,148,74,165,82,169,84,42,149,202,229,114,185,220,238,119,187,221,110,55,155,205,230,243,121,188,222,239,247,251,253,126,191,95,47,151,203,101,50,153,204,102,179,89,172,86,43,21,138,197,98,49,24,12,6,131,193,224,112,184,92,174,87,171,85,170,213,234,245,250,125,62,159,79,167,83,41,20,10,133,66,33,144,200,228,242,249,252,254,255,127,63,31,15,135,67,161,208,232,244,122,61,30,143,199,99,177,88,44,22,11,5,2,1,128,64,32,16,136,196,226,113,56,28,142,71,35,145,72,164,210,233,116,58,29,14,7,3,129,192,96,48,152,76,38,147,73,36,146,201,100,178,217,236,118,59,157,78,39,19,9,4,130,65,160,80,168,212,106,181,218,109,182,91,173,214,107,53,154,77,166,211,105,52,26,13,


na-- commented Jan 20, 2020

This is a deep rabbit hole 😅 Some benchmarking and a lot of further investigation are definitely needed, but part of the sequence of the previous 8-bit LFSR ( ...,2,1,128,64,32,16,136,196,226,...) wasn't very random-looking, and similar sequences were also present in the 16-bit version... So even though this probably isn't a huge deal for our use case, here's a demo of another (seemingly better) full-cycle random number generator that was linked in the reddit thread:

package main

import (
	"fmt"
	"log"

	"modernc.org/mathutil"
)

func main() {
	rnd, err := mathutil.NewFC32(0, 255, true)
	if err != nil {
		log.Fatal(err)
	}
	rnd.Seed(13)

	for i := 1; i <= 3*256; i++ {
		next := rnd.Next()
		fmt.Printf("%d,", next)
		if i%256 == 0 {
			fmt.Println()
		}
	}
}

which, even with a cycle of only 256, seems fairly random to my monkey brain:

116,11,102,199,129,244,53,94,187,159,223,40,85,174,153,237,26,60,183,133,2,76,213,110,14,100,200,130,252,50,88,190,161,221,41,86,171,147,231,27,62,184,134,10,70,214,113,12,101,208,127,246,44,91,188,162,222,42,83,165,148,234,29,63,185,142,4,71,216,111,13,109,202,121,247,47,89,189,163,230,39,77,168,150,232,30,64,182,136,5,73,217,112,21,103,203,124,249,45,90,197,160,224,33,80,166,151,233,31,61,176,137,7,74,218,120,15,104,205,122,250,46,98,191,154,225,36,78,167,152,241,28,55,179,139,8,75,215,114,253,16,106,206,123,251,54,92,192,157,227,34,79,175,149,235,22,58,177,140,9,72,209,115,18,107,207,131,248,48,93,194,155,228,35,87,169,143,236,25,56,178,141,6,66,212,117,254,19,108,204,125,242,49,95,195,156,229,43,81,170,146,238,23,57,186,138,0,69,210,118,255,20,105,198,126,245,51,96,196,164,226,37,82,172,144,239,24,65,180,132,3,67,211,119,17,99,201,128,243,52,97,193,158,220,38,84,173,145,240,32,59,181,135,1,68,219,
116,11,102,199,129,244,53,94,187,159,223,40,85,174,153,237,26,60,183,133,2,76,213,110,14,100,200,130,252,50,88,190,161,221,41,86,171,147,231,27,62,184,134,10,70,214,113,12,101,208,127,246,44,91,188,162,222,42,83,165,148,234,29,63,185,142,4,71,216,111,13,109,202,121,247,47,89,189,163,230,39,77,168,150,232,30,64,182,136,5,73,217,112,21,103,203,124,249,45,90,197,160,224,33,80,166,151,233,31,61,176,137,7,74,218,120,15,104,205,122,250,46,98,191,154,225,36,78,167,152,241,28,55,179,139,8,75,215,114,253,16,106,206,123,251,54,92,192,157,227,34,79,175,149,235,22,58,177,140,9,72,209,115,18,107,207,131,248,48,93,194,155,228,35,87,169,143,236,25,56,178,141,6,66,212,117,254,19,108,204,125,242,49,95,195,156,229,43,81,170,146,238,23,57,186,138,0,69,210,118,255,20,105,198,126,245,51,96,196,164,226,37,82,172,144,239,24,65,180,132,3,67,211,119,17,99,201,128,243,52,97,193,158,220,38,84,173,145,240,32,59,181,135,1,68,219,
116,11,102,199,129,244,53,94,187,159,223,40,85,174,153,237,26,60,183,133,2,76,213,110,14,100,200,130,252,50,88,190,161,221,41,86,171,147,231,27,62,184,134,10,70,214,113,12,101,208,127,246,44,91,188,162,222,42,83,165,148,234,29,63,185,142,4,71,216,111,13,109,202,121,247,47,89,189,163,230,39,77,168,150,232,30,64,182,136,5,73,217,112,21,103,203,124,249,45,90,197,160,224,33,80,166,151,233,31,61,176,137,7,74,218,120,15,104,205,122,250,46,98,191,154,225,36,78,167,152,241,28,55,179,139,8,75,215,114,253,16,106,206,123,251,54,92,192,157,227,34,79,175,149,235,22,58,177,140,9,72,209,115,18,107,207,131,248,48,93,194,155,228,35,87,169,143,236,25,56,178,141,6,66,212,117,254,19,108,204,125,242,49,95,195,156,229,43,81,170,146,238,23,57,186,138,0,69,210,118,255,20,105,198,126,245,51,96,196,164,226,37,82,172,144,239,24,65,180,132,3,67,211,119,17,99,201,128,243,52,97,193,158,220,38,84,173,145,240,32,59,181,135,1,68,219,


na-- commented Jan 21, 2020

To make things simpler to understand, I'll try to give an example. Let's use the execution segments from the example above, Inst0 having execution segment ES0=(0 - ⅓], Inst1 having ES1=(⅓ - ½], and Inst2 having ES2=(½ - 1], and the pseudo-random numbers from the last example. Let's try to distribute 24 iterations, i0, i1, ..., i22, i23.

The basic idea is that each of the three instances would independently loop from 0 to 23 (i.e. over all possible iterations/data entries), generate the next pseudo-random number in the sequence, turn it into a ratio between 0 and 1, and check whether that ratio falls into its own execution segment. If it does, the instance executes that iteration; if not, it skips it and moves on to the next one.

| Iteration number | Random value | In ES0? (0 - ⅓] | In ES1? (⅓ - ½] | In ES2? (½ - 1] |
|---|---|---|---|---|
| 0 | 116/256 | | ✓ | |
| 1 | 11/256 | ✓ | | |
| 2 | 102/256 | | ✓ | |
| 3 | 199/256 | | | ✓ |
| 4 | 129/256 | | | ✓ |
| 5 | 244/256 | | | ✓ |
| 6 | 53/256 | ✓ | | |
| 7 | 94/256 | | ✓ | |
| 8 | 187/256 | | | ✓ |
| 9 | 159/256 | | | ✓ |
| 10 | 223/256 | | | ✓ |
| 11 | 40/256 | ✓ | | |
| 12 | 85/256 | ✓ | | |
| 13 | 174/256 | | | ✓ |
| 14 | 153/256 | | | ✓ |
| 15 | 237/256 | | | ✓ |
| 16 | 26/256 | ✓ | | |
| 17 | 60/256 | ✓ | | |
| 18 | 183/256 | | | ✓ |
| 19 | 133/256 | | | ✓ |
| 20 | 2/256 | ✓ | | |
| 21 | 76/256 | ✓ | | |
| 22 | 213/256 | | | ✓ |
| 23 | 110/256 | | ✓ | |

If I have calculated everything correctly:

  • the first instance makes 8 iterations, exactly 1/3 of the total 24.
  • the second one makes 4, exactly 1/6
  • the third makes 12, exactly 1/2

This isn't always going to hold exactly - for example, the ratios halfway through the iterations (at i == 11) are a bit skewed - but for large enough numbers, each instance should execute precisely the proportion of iterations that was configured. In this example the iterations for a single instance were more clumped together than I would have liked, but I'm somewhat hopeful that this problem would be ameliorated by some combination of {more instances, a larger result set of the pseudo-random function, a better random function}. Still, this result is definitely better than each instance hammering the SUT at precisely the same times.
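
Here's a rough Go sketch of that per-instance check, reusing the mathutil FC32 generator and seed from the earlier demo (the countIterations helper is made up for illustration); running it for the three segments reproduces the 8 / 4 / 12 split from the table above:

package main

import (
	"fmt"
	"log"
	"math/big"

	"modernc.org/mathutil"
)

// countIterations runs the same full-cycle generator that every instance would
// use and counts how many of the first `total` iterations fall into the
// (from, to] execution segment.
func countIterations(from, to *big.Rat, total int) int {
	rnd, err := mathutil.NewFC32(0, 255, true)
	if err != nil {
		log.Fatal(err)
	}
	rnd.Seed(13) // the same seed on every instance -> the same sequence everywhere

	count := 0
	for i := 0; i < total; i++ {
		// turn the pseudo-random value into a ratio and check it against the segment
		ratio := big.NewRat(int64(rnd.Next()), 256)
		if ratio.Cmp(from) > 0 && ratio.Cmp(to) <= 0 {
			count++ // this iteration is "ours"
		}
	}
	return count
}

func main() {
	bounds := []*big.Rat{big.NewRat(0, 1), big.NewRat(1, 3), big.NewRat(1, 2), big.NewRat(1, 1)}
	for i := 0; i < 3; i++ {
		fmt.Printf("ES%d executes %d of 24 iterations\n", i, countIterations(bounds[i], bounds[i+1], 24))
	}
}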

In any case, I'd very much appreciate if someone can think of some simpler approach to this problem. The basic idea behind partitioning work is that, if I run a load test from a single machine and if I run that same load test spread across multiple machines, I shouldn't notice any difference at the receiving end (i.e. SUT) or in any charts of the active VUs or iterations/s (barring things like network speed/latency/hardware differences, of course).


na-- commented Jan 22, 2020

Hmm thinking some more about this, I'm not sure that we can avoid something like this predictable-pseudo-random approach if we want something like striped data/iteration partitioning, even if each instance had the same denominator for its execution segment (or some other extra synchronization).

I think that, in order to get near-perfect distribution (if that's even possible for all cases, which I think can be done, but I'm still not 100% sure), we'd have to send the list of all execution segments for a test run to each instance, besides its own segment. This is probably the minimum requirement, since it seems to me that we must have some sort of deterministic bin-packing-like algorithm (though that's probably not the correct term in our case) that partitions the segments... That algorithm would probably be relatively easy in our use case (the simplest one I can think of involves an array with size LCD(ex.segments) again), but it requires knowledge of all values (including the numerators) of the execution segments.

To illustrate why, let's imagine 3 instances with the following segments: ES0=(0:¼], ES1=(¼:¾], ES2=(¾:1]. A human, knowing all the segments, would distribute iterations like i0->ES0, i1->ES1, i2->ES2, i3->ES1, i4->ES0, i5->ES1, i6->ES2, i7->ES1, .... But how would the instance with ES2 know that it has to execute i2? If we instead had ES'0=(0:¼], ES'1=(¼:½], ES'2=(½:¾], ES'3=(¾:1], then ES'3, which has the exact same execution segment as ES2, wouldn't have to execute i2...


na-- commented Jul 9, 2020

Since the execution segments were implemented in #1007, I'm closing this issue. The only part not implemented is the data segmentation, and that can now be tracked in #1539

na-- closed this as completed Jul 9, 2020