Avoiding bound checks in setunion code #301

lemire · 2021-04-16T19:44:59Z

This PR would remove most bound checks in the setunion code. Sadly it seems to make no difference to the performance. Marking it as a draft in case someone can make it work better.

The assembly does look leaner to me, at a glance.
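
For readers unfamiliar with the technique, here is a minimal sketch of one common way to let the Go compiler drop bounds checks: reslice once before the loop so that the two lengths are provably equal. It is an illustration only and not the change made in this PR; `sumChecked` and `sumHinted` are hypothetical names.

```go
package main

import "fmt"

// sumChecked indexes both slices inside the loop. Nothing tells the compiler
// that i < len(b), so it is expected to keep a bounds check on b[i].
func sumChecked(a, b []uint16) int {
	total := 0
	for i := range a {
		total += int(a[i]) + int(b[i])
	}
	return total
}

// sumHinted reslices b to len(a) once, before the loop. After that single
// check, len(b) == len(a), which typically lets the compiler remove the
// per-iteration check on b[i].
func sumHinted(a, b []uint16) int {
	b = b[:len(a)] // panics here if len(a) > cap(b)
	total := 0
	for i := range a {
		total += int(a[i]) + int(b[i])
	}
	return total
}

func main() {
	a := []uint16{1, 2, 3}
	b := []uint16{10, 20, 30, 40}
	fmt.Println(sumChecked(a, b), sumHinted(a, b))
}
```

Building with `go build -gcflags="-d=ssa/check_bce/debug=1"` should list the bounds checks the compiler kept, which is one way to compare the two functions.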

coveralls · 2021-04-16T19:57:50Z

Coverage increased (+0.004%) to 86.941% when pulling 2e7b454 on dlemire/trimmingboundchecksinsetunion into 1477e28 on master.

puzpuzpuz · 2021-07-12T19:12:11Z

For me, Go v1.16.5 produces CMPQ + JBE instructions for bounds checks on slices. The branch should never be taken in correct code, so it seems to me that the check should be very cheap.

I've tried to make union2by2's loop fully branchless, except for the loop condition here:
puzpuzpuz@5630617

The results are worse than the branchy implementation, so either I'm doing something wrong (which is quite probable) or, due to some factors (data dependencies between loop iterations? the cost of additional arithmetic operations? something else?), it's just not worth it. I'm under the impression that the proper way to approach union2by2 is to use SIMD instructions, as is done in the C++ roaring bitmap library, but obviously such an implementation in Go assembler would be hard to maintain.

lemire · 2021-07-13T13:15:09Z

@puzpuzpuz

It is damn hard to benchmark branchy code. For example, to see the benefits of a branchless approach, you have to have mispredicted branches to begin with, but most benchmarking techniques minimize mispredicted branches by looping over the same data.

We tend to always underestimate the value of branchless approaches.

> The branch should never be taken in the correct code, so it seems to me that the check should be very cheap.

Very cheap is relative. You still incur the two instructions (maybe fewer with fusion) per access. A union function is basically just that: access, compare, access, compare... so if each access costs even a bit more, it should show up in the performance.

But you are, of course, correct that the real solution is to use assembly.

Your branchless approach is interesting but, given that this is Go, it may not compile to efficient assembly. Have you looked at the assembly?

puzpuzpuz · 2021-07-13T13:45:29Z

> For example, to see the benefits of a branchless approach, you have to have mispredicted branches to begin with, but most benchmarking techniques minimize mispredicted branches by looping over the same data.

Yeah, I'm aware of the problem. That's why I'm using a large array of randomly generated arrays (~1M of numbers in total) here: puzpuzpuz@5630617#diff-c2898d6069ac5db5285b63cc9066e814584908dd34801fde55919838e1ef53bbR176
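
A minimal sketch of that kind of input set, for illustration: many independent pairs of random sorted arrays, so that a benchmark loop can cycle through different data instead of replaying one pair the branch predictor can learn. This is not the code at the linked commit; `randomSortedSet` and `makeInputs` are hypothetical names, and the sizes are arbitrary.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// randomSortedSet returns n distinct uint16 values in increasing order.
// n must not exceed 65536, the number of distinct uint16 values.
func randomSortedSet(rng *rand.Rand, n int) []uint16 {
	seen := make(map[uint16]struct{}, n)
	for len(seen) < n {
		seen[uint16(rng.Intn(1<<16))] = struct{}{}
	}
	out := make([]uint16, 0, n)
	for v := range seen {
		out = append(out, v)
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

// makeInputs builds `pairs` independent pairs of random sorted sets,
// each of size n, from a fixed seed so that runs are repeatable.
func makeInputs(seed int64, pairs, n int) [][2][]uint16 {
	rng := rand.New(rand.NewSource(seed))
	inputs := make([][2][]uint16, pairs)
	for i := range inputs {
		inputs[i] = [2][]uint16{randomSortedSet(rng, n), randomSortedSet(rng, n)}
	}
	return inputs
}

func main() {
	inputs := makeInputs(1, 1000, 500)
	fmt.Println(len(inputs), len(inputs[0][0]), len(inputs[0][1]))
}
```

In a `testing.B` benchmark, the inputs would be built before the timed loop and iteration `i` would use `inputs[i%len(inputs)]`.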

In theory, 1M iterations should be enough to remove the effects of the branch predictor. For instance, if I change the benchmark to an approach similar to the one you used in the shotgun intersection experiment (generating a random array on each iteration), the Linux perf utility reports branch misses at 2.96% of all branches for the union2by2 scenario (the original branchy implementation), whereas with the approach from my branch it reports 5.76% of all branches. To me, that suggests the second approach removes the effects of branch prediction more effectively, but I may be wrong.

> Have you looked at the assembly?

Yes, I did. I'm not sure how good the compiler output is in terms of the "cost" of the emitted instructions, but I didn't notice any branches other than the loop condition. Most likely those instructions are rather expensive, since I see a higher instructions-per-cycle value reported by perf for the union2by2_branchless function (2.23 vs 1.46 insn per cycle).

Update. A higher insn-per-cycle value is in fact a good indicator (see the messages below). Also, it's no surprise that the total number of instructions executed in the benchmark is higher for the union2by2_branchless function (20.5M vs 13.5M instructions).

lemire · 2021-07-13T15:06:10Z

> In theory, 1M of iterations should be enough to get rid of the effects of the branch predictor.

I agree that what you did looks good.

Ideally, you'd want the intersection size to vary unpredictably between 'all' to 'nothing'... I don't expect your approach to be good at that... but that's not likely to be an issue in this instance.

> I see higher instruction per cycle value

We want more instructions per cycle. Being at 1.46 ins per cycle is bad. Keep in mind that recent processors like the Apple M1 can consistently retire 8 instructions per cycle.

But it does explain why trying to get leaner code does not help. If we are at 1.46 ins per cycle, then trying to save instructions will not help us, because we have plenty of room for more instructions. This suggests that we are primarily limited by data dependency issues. And bound checks do nothing to harm data dependency.
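
To make the arithmetic concrete, here is a small sketch that divides the instruction counts quoted earlier in the thread (20.5M vs 13.5M) by the reported instructions per cycle (2.23 vs 1.46). It assumes both figures come from the same perf run and takes the rounded values at face value; `cycles` is a hypothetical helper.

```go
package main

import "fmt"

// cycles estimates a cycle count from an instruction count and an
// instructions-per-cycle (IPC) figure: cycles = instructions / IPC.
func cycles(instructions, ipc float64) float64 {
	return instructions / ipc
}

func main() {
	// Figures as quoted in the thread; both are rounded, so treat the
	// results as rough estimates only.
	branchless := cycles(20.5e6, 2.23)
	branchy := cycles(13.5e6, 1.46)
	fmt.Printf("union2by2_branchless: %.2fM cycles\n", branchless/1e6)
	fmt.Printf("union2by2:            %.2fM cycles\n", branchy/1e6)
}
```

Both estimates land near 9.2M cycles: the branchless version executed about 50% more instructions in roughly the same number of cycles, which fits the point that instruction count is not the limiting factor here.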

This suggests that the move forward would be algorithmic in nature, at least for big superscalar processors.

Note that we have a faster assembly version of this function for ARM:

https://github.com/RoaringBitmap/roaring/blob/master/setutil_arm64.s

It is not algorithmic in nature. But it is possible that the gains are explained by the fact that it is tested on relatively small cores.

puzpuzpuz · 2021-07-13T15:28:12Z

Thanks for your input!

> We want more instructions per cycle. Being at 1.46 ins per cycle is bad. Keep in mind that recent processors like the Apple M1 can consistently retire 8 instructions per cycle.

I misinterpreted the metric, my bad (probably due to the incredibly hot weather here). The higher the insn-per-cycle measurement, the better, since in general it means that the code benefits from ILP.

Thanks for the correction.

> This suggests that we are primarily limited by data dependency issues. And bound checks do nothing to harm data dependency.

I'm also under the same impression so far.

> This suggests that the move forward would be algorithmic in nature, at least for big superscalar processors.

Makes sense. As for the ARM union2by2 version, it's probably faster than the generic one because the Go compiler produces suboptimal code for arm64. But I'm just guessing, since I don't have an ARM machine around to experiment with.

lemire · 2021-07-13T15:43:54Z

> But I'm just guessing since I don't have an ARM machine around to experiment with it.

It may depend on the specific ARM core microarchitecture. There are many small ARM cores that struggle to retire 2 instructions per cycle. But there are also powerful ARM cores.

Historically, x64 cores have been more similar (large superscalar cores)... there are tiny x86 cores used for mobile devices but most of us don't use them very much.

On small cores, saving a few instructions can make a big difference.

Anyhow, I think that the long-term fix might be algorithmic.

(Note: your investigation has been fruitful. Let us consider algorithmic improvements!)

puzpuzpuz · 2021-07-14T09:24:02Z

A minor update on the branchless implementation. I was able to simplify it and got rid of the lookup tables. The main part of the changes in the loop code may be seen here: puzpuzpuz@a2b2510#diff-7916b4e45da125122d637e8abb339ad262f2efc9fde0e2ceb6671ae9e75fd298R36

With this update, union2by2_branchless is now slightly faster than the baseline branchy function in the microbenchmark.

Before:

BenchmarkUnion2by2/union2by2_branchless-8         	      40	  29035424 ns/op
BenchmarkUnion2by2/union2by2-8                    	      48	  24743344 ns/op

After:

BenchmarkUnion2by2/union2by2_branchless-8         	      50	  22842745 ns/op
BenchmarkUnion2by2/union2by2-8                    	      46	  24849983 ns/op

I'm glad to finally see the improvement. On the other hand, even in such a synthetic scenario it's not significant enough to be considered a valuable optimization.
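
For illustration, here is a minimal sketch of a union loop with no comparison-dependent branches in its body, using arithmetic shifts of the sign bit instead of lookup tables. It is an assumption about the general shape of such code, not the implementation at the linked commit; `union2by2Branchless` here is a hypothetical stand-in, and the compiler may still emit bounds-check branches.

```go
package main

import "fmt"

// union2by2Branchless merges two sorted, duplicate-free sets into buffer and
// returns the number of values written. buffer must have room for
// len(set1)+len(set2) values.
func union2by2Branchless(set1, set2, buffer []uint16) int {
	k1, k2, pos := 0, 0, 0
	for k1 < len(set1) && k2 < len(set2) {
		s1 := int32(set1[k1])
		s2 := int32(set2[k2])
		d := s1 - s2             // negative if s1 < s2, positive if s1 > s2
		lt := int(d>>31) & 1     // 1 if s1 < s2, else 0
		gt := int((-d)>>31) & 1  // 1 if s1 > s2, else 0
		// d>>31 is all ones when s1 < s2 and zero otherwise, so this
		// selects s1 when s1 < s2 and s2 in every other case: min(s1, s2).
		buffer[pos] = uint16(s2 + (d & (d >> 31)))
		pos++
		k1 += 1 - gt // advance set1 unless its value was the larger one
		k2 += 1 - lt // advance set2 unless its value was the larger one
	}
	// One input is exhausted; append whatever remains of the other.
	pos += copy(buffer[pos:], set1[k1:])
	pos += copy(buffer[pos:], set2[k2:])
	return pos
}

func main() {
	set1 := []uint16{1, 3, 5, 9}
	set2 := []uint16{1, 2, 5, 7, 10, 11}
	buffer := make([]uint16, len(set1)+len(set2))
	n := union2by2Branchless(set1, set2, buffer)
	fmt.Println(buffer[:n])
}
```

When the two values are equal, both cursors advance and the value is written once, so duplicates across the inputs are dropped.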

lemire · 2021-07-14T13:13:20Z

@puzpuzpuz A consistent 5% speed boost would be valuable.

puzpuzpuz · 2021-07-14T13:19:43Z

> @puzpuzpuz A consistent 5% speed boost would be valuable.

OK, let me run the real-roaring-datasets and synthetic benchmarks and see if the branchless function makes any difference. If it does, I'll submit a PR. Thanks for your input once again.

lemire · 2021-07-14T13:36:18Z

What happens when the two values are equal in your algorithm?

lemire · 2021-07-14T13:41:31Z

I would also like to see how something stupidly simple like the following works...

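while (pos1 < size1 && pos2 < size2) {
    v1 = input1[pos1];
    v2 = input2[pos2];
    // Emit the smaller value; on a tie either one will do.
    if (v1 <= v2) {
        output_buffer[pos++] = v1;
    } else {
        output_buffer[pos++] = v2;
    }
    // Advance whichever cursor held the emitted value (both on a tie).
    if (v1 <= v2) {
        pos1 += 1;
    }
    if (v2 <= v1) {
        pos2 += 1;
    }
}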

I am not convinced that your shifting routine is worth it.
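
As a sketch, the loop suggested above could be written in Go roughly as follows. The function name `unionSimple`, the slice-based signature, and the tail handling after the loop are assumptions added to make it runnable; only the loop body mirrors the suggestion.

```go
package main

import "fmt"

// unionSimple merges two sorted, duplicate-free inputs into output and
// returns the number of values written. output must have room for
// len(input1)+len(input2) values.
func unionSimple(input1, input2, output []uint16) int {
	pos1, pos2, pos := 0, 0, 0
	for pos1 < len(input1) && pos2 < len(input2) {
		v1, v2 := input1[pos1], input2[pos2]
		// Write the smaller value (either one on a tie).
		if v1 <= v2 {
			output[pos] = v1
		} else {
			output[pos] = v2
		}
		pos++
		// Advance each cursor whose value was written; both on a tie.
		if v1 <= v2 {
			pos1++
		}
		if v2 <= v1 {
			pos2++
		}
	}
	// Assumed tail handling: copy what is left of the unfinished input.
	pos += copy(output[pos:], input1[pos1:])
	pos += copy(output[pos:], input2[pos2:])
	return pos
}

func main() {
	a := []uint16{1, 3, 5, 9}
	b := []uint16{1, 2, 5, 7, 10, 11}
	out := make([]uint16, len(a)+len(b))
	n := unionSimple(a, b, out)
	fmt.Println(out[:n])
}
```

The appeal of this shape is that each `if` guards a single simple assignment or increment, which a compiler can in principle turn into conditional moves; whether the Go compiler actually does so would need to be checked in the generated assembly.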

puzpuzpuz · 2021-07-14T15:03:45Z

> What happens when the two values are equal in your algorithm?

Duplicates between arrays are ignored, just like in the original implementation.

> I would also like to see how something stupidly simple like the following works...

I compared it with the following implementation of the union2by2 function (only the loop code is included for the sake of brevity):

	for k1 < len1 && k2 < len2 {
		s1 = set1[k1]
		s2 = set2[k2]

		if s1 < s2 {
			buffer[pos] = s1
			pos++
			k1++
		} else if s1 == s2 {
			buffer[pos] = s1
			pos++
			k1++
			k2++
		} else { // s1 > s2
			buffer[pos] = s2
			pos++
			k2++
		}
	}

Not sure if that's what you asked for, but the benchmark result is the following:

BenchmarkUnion2by2/union2by2_branchless-8         	      51	  22744960 ns/op
BenchmarkUnion2by2/union2by2-8                    	      48	  24475920 ns/op

The branchless function is still a bit faster.

lemire · 2021-07-14T16:42:37Z

@puzpuzpuz Please see Faster sorted array unions by reducing branches https://lemire.me/blog/2021/07/14/faster-sorted-array-unions-by-reducing-branches/

(I credit you in the blog post and in the corresponding code.)

puzpuzpuz · 2021-07-14T18:08:31Z

Thanks for mentioning me in the blog post. As I wrote before, I'll try to conduct more benchmark runs and post an update on #327, or simply open a PR if the results are promising.

lemire added 3 commits April 16, 2021 15:24

This should reduce bound checks in a function.

0572790

Tuning.

9f95696

Removing dead code.

2e7b454

lemire mentioned this pull request Jul 14, 2021

Try a branchless approach for the union of two sorted arrays (array containers) #327

Open
