
Explore speedup techniques #20

Open
foobar2019 opened this issue Jan 18, 2019 · 37 comments
@foobar2019

foobar2019 commented Jan 18, 2019

The vast majority of the computation cost lies in ge_add. mkp uses the same technique as horse: adding an 8-multiple of the base point to a working public point, and bumping the secret scalar by 8 in lockstep.

One step costs 4 field multiplications (ge_add), a very costly field inversion (p3_tobytes), and finally another 4 multiplications (converting back to projective for the next add).

The heaviest part now is the field inversion, so it's time for some moonmath. Turns out there's this thing called the Montgomery trick:
https://github.com/dalek-cryptography/curve25519-dalek/blob/d58b663ea8eab24c3232cc92781e1ff3e24a866d/src/field.rs#L139

You multiply all the numbers in a batch into a tower of partial products, invert the single last (uppermost) one, and then walk the tower backwards, multiplying the saved scratch products by that inverse to get the inverses of the individual numbers in the batch. With a large enough batch, this reduces the amortized cost of an inversion from 20+ muls to just 3.

A further speedup would be getting rid of ge_add and going with point doubling instead, which could possibly provide at least another 2x speedup, but I'm not yet well versed in the explicit ed25519 formulas for the point formats involved.

@foobar2019
Author

Another thing: provided that batching drives the cost of a point add from ~30M (4+~20+4) down to ~11M (3+4+4), the remaining 4+4M now becomes a significant cost instead. Luckily, we need to compute those in separate lanes for the batch, before and after the inversion, which makes them a prime target for SIMD.

@cathugger
Owner

cathugger commented Jan 19, 2019

That batch invert seems pretty simple to implement. Then I guess we'd have to do something like a batch p3_tobytes: for every worker, have an array we ge_add stuff into, then batch_p3_tobytes, then compare each result. How big should these arrays/batches be? I guess I could pick some random number like 64 first, and after implementing, try tweaking and checking what works better. This kind of batching assumes that hits for the correct key are rather rare, which is very probably true for most use cases of this tool.

@cathugger cathugger pinned this issue Jan 19, 2019
@cathugger
Owner

Damn this just got a whole lot faster.
354e777
Gotta do the relevant changes for the other backends too.

@foobar2019
Author

Well done. Forget what I said about SIMD; I tried a few prototypes and they all suck. The 128-bit mulx hogs the ALU completely, and SIMD brings next to no benefit, only massive complexity. The only real speedup is switching from ge_add / p3p3 to ge_dbl / p2. Doing that would involve a massive refactor though, and can only speed things up about twice over the current batch algorithm.

@foobar2019
Author

I guess I could pick some random number like 64 first and after implementing try tweaking and checking what could work better.

Balance L1 cache size (to fit the group elements and thread stacks) against diminishing returns. If you want to get super fancy, make an autotuning option which grows the batch until the best timing per one mega-batch is obtained (guesstimate: anywhere between 64 and 256).

@cathugger
Owner

Only real speedup is switching from ge_add / p3p3 to ge_dbl / p2.

Explain. We need the p3 version converted to bytes for the public key check, but we can use whatever representation for computing as long as we convert to p3 later.
As for point doubling, you mean something like what the ge_p3_dbl function (in ref10) does, except we would first convert to p2, then repeatedly do ge_p2_dbl?
I don't see anything wrong with that other than needing a different way to keep the secret key in sync, and since it's multiplicative, the counter ends up running out sooner and we'd need to do more reseeds. But we don't need to source from randombytes() on every reseed, only after a successful match, so that may not be that bad.

@cathugger
Owner

Actually it'd effectively be a shift of the secret key (multiplication by a power of 2), so uhhh, the only way to do it fast would be starting from a lower-entropy key with the upper bits zeroed.

@cathugger
Owner

Wait, uh. We can't do that. Shifting wouldn't work because we need a certain upper-bit pattern in the secret key. So yeah, without a deeper understanding of ed25519, I'm not sure it's a good idea to touch these either.

@foobar2019
Author

foobar2019 commented Jan 19, 2019

ge_dbl takes a p2 input, and p1 -> p2 is one mul cheaper than p1 -> p3. And ge_dbl is significantly cheaper on its own than ge_add, as it's mostly squarings, not muls.

As for the scalar, it's run-of-the-mill double-exponentiation. Say you get counter=12345 on a match.

Then you compute (secret * ((2^counter) mod M)) mod M, which is somewhat expensive modular math, but not much (the same kind you'd do for a DH key exchange). This is computed only on a key match, so you don't particularly care that it's costly.

Edit: The formula should be mul/exp, not expexp.

@foobar2019
Author

foobar2019 commented Jan 19, 2019

Wait, uh. We can't do that. Shifting wouldn't work because we need a certain upper-bit pattern in the secret key. So yeah, without a deeper understanding of ed25519, I'm not sure it's a good idea to touch these either.

As for what the scalar actually is in ed25519: the bit twiddling djb does simply ensures that it is a multiple of 8 (the cofactor, super important) and that it is less than M (the group order). Note that a scalar which has been raised to a power may no longer look like a "multiple of 8" at first sight, but it is perfectly valid (it still lies in the group). This is why this is checked only on generation, while later operations check only the restricted 2^252 bound. On rare occasions the result falls in between 2^252 and M. Most ed25519 code can't cope with such a scalar, because it uses the bit twiddle as a quick range check, which can't comprehend the real prime M, just something a bit less than that.

The chance of that happening is small, but it must be checked after multiplying the scalar by the counter's power of 2 to remain interoperable with other 25519 code.
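As a sketch, that interoperability check is just re-verifying the clamped bit pattern on the little-endian scalar bytes after the fix-up; `fitsClampedForm` is an illustrative name, not a function from any real codebase.

```go
package main

import "fmt"

// fitsClampedForm reports whether a little-endian 32-byte scalar still has
// the djb clamped bit pattern: low 3 bits zero, top bit zero, bit 254 set.
// If the re-derived scalar fails this, bail and reseed instead of using it.
func fitsClampedForm(sk [32]byte) bool {
	return sk[0]&248 == sk[0] && (sk[31]&63)|64 == sk[31]
}

func main() {
	var sk [32]byte
	sk[0] = 8   // low byte a multiple of 8: ok
	sk[31] = 64 // bit 254 set, bit 255 clear: ok
	fmt.Println(fitsClampedForm(sk)) // true

	sk[0] = 3 // low bits set: no longer a multiple of 8
	fmt.Println(fitsClampedForm(sk)) // false
}
```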

@cathugger
Owner

My main question is whether we can do it.
We have

int crypto_sign_seckey_expand(unsigned char *sk,const unsigned char *seed)
{
  crypto_hash_sha512(sk,seed,32);
  sk[0] &= 248;
  sk[31] &= 63;
  sk[31] |= 64;

  return 0;
}

which zeroes the 2 topmost bits and sets the bit just below the topmost, so we essentially have only 2 bits to shift; anything more and we're truncating.
Would it be safe to start with a number of different structure?

@foobar2019
Author

I'm not sure why you want to shift the scalar at all. That would be horribly inefficient (you'd have to do it every round, AND you'd have to do it mod M). Bit twiddles are ONLY for generation and MUST NOT be applied after further operations, such as addition or multiplication of the scalar. Later on, you can only check for overflows.

                if ((sk[0] & 248) != sk[0] || ((sk[31] & 63) | 64) != sk[31])
                        goto initseed;

This can be done if you don't want to bother with modular reductions - you simply detect overflow and bail.

@foobar2019
Author

The reason why you can't use "bail on overflow" with ge_dbl is, as you noted, that you'd shift out of range pretty quickly, so you'd end up constantly reseeding.

@cathugger
Owner

Heh, my lack of understanding of the primitives involved shows.

Can be done if you don't wan't to bother with modular reductions

I think I'm starting to understand: fe doubling is a modular operation, therefore we need to do the scalar exponentiation in a modular way too?

@foobar2019
Author

I think I'm starting to understand: fe doubling is a modular operation, therefore we need to do the scalar exponentiation in a modular way too?

Indeed. The formula I posted earlier turned out to be incorrect (my confusion, for a change; I deleted my misinformative posts).

@cathugger
Owner

For some reason I kept thinking of ge_dbl as raising to the power of 2, which it is not.

Uhh, it's *2, not ^2, yeah.

Well, I'll explain my understanding.

Technically, you can just add the scalar to itself (aka shift left), modulo M, on every ge_dbl.

I wasn't going to do that. I'd shift left n times, where n is the number of times I did ge_p2_dbl.
Now, since you mentioned that the math is modular, we'd mask some bits after that, and we should be all set, right? I'm unaware of what M is, though; if it's not a power of 2 then this won't work.
There's a limit to how much we can shift this way, as we decrease the amount of entropy every time we do it. The secret key scalar is 256 bits, so we'd need to shift 256 times max, and not even that, because at that point we'd have no entropy left, so we want to stop much earlier.

@cathugger
Owner

This sorta has the issue that we're going to reseed much more often, especially if we want more entropy.

@foobar2019
Author

You have M points on the curve, and the scalar is simply an "index" for each point. All the prime-modulus math which applies on the curve applies to the scalar. Except that for the scalar it's ordinary counting, whereas on the curve you have only the + and *2 operations, from which you can build scalar_multiply.

https://github.com/floodyberry/ed25519-donna/blob/master/modm-donna-64bit.h

Note that ref10 is exceedingly clever wrt scalars: it implements only sc_muladd, which does what it says using primitive-level reductions (very slow, but it hardly matters).

@cathugger
Owner

Uh, so M isn't a power of 2, and that makes my assumptions invalid; thanks for the clarification.

@cathugger
Owner

2^252 + 27742317777372353535851937790883648493 yeah looks pretty non-power-of-2-ish.

@foobar2019
Author

foobar2019 commented Jan 19, 2019

Yes, M is a prime, and all the math in there sucks (see the Barrett reduction horrors in donna for reasonably fast code). The way the formula would go in practice is to compute 2^counter % M using shift-and-multiply exponentiation, which costs N field muls for an N-bit counter, and then just multiply the result into the scalar. The plus side is that the base is just 2, so you can precompute all of its powers modulo M, bringing the exp cost down to only N/2.

@cathugger
Owner

Thanks for the patience explaining; understood.
I guess I should focus on making the Montgomery trick work for implementations other than ref10 now. I kinda have various stuff going on IRL atm, so I'm unsure when I'll be able to get back to this, but now I at least have a clear understanding of what I need to do and how.

@foobar2019
Author

@cathugger With the amount of refactoring that would go into this, I'm not sure it's worth it, indeed. The batching is the most crucial thing, as it has the most dramatic speedup.

I'm considering making a general-purpose ed25519 vanitygen solely for OpenCL, so this is more of an exercise in building a "wishlist" of things to have in there.

@cathugger
Owner

~16x speedup with 256 batch at 9972a83
damn son I need sunglasses for this

@cathugger
Owner

The speedup is a little bit smaller (~14x) when an actual filter is specified, but still pretty good. Doing more batching beyond 1024 helps, but not as sharply; I have a feeling the CPU caching drawbacks aren't yet as big as the gain we get from avoiding multiplications.

@cathugger
Owner

OK, it seems at 4096 it's not faster anymore, kinda even slower, so I guess the CPU caching effects win out at that point.

@cathugger
Owner

Things work reasonably well so far, though I wanna make the donna impl work before enabling this, and I'm going away from this for a while. So whoever wants to test on their own using the ref10 or amd64* impls can just edit the Makefile after configure, add -DBATCHKEYGEN -DBATCHNUM=2048 (or whatever other batchnum value they wanna try) to CFLAGS=, and then make clean -s; make -s -j.

@cathugger
Owner

Oh and pass -B flag to actually activate it.

@foobar2019
Author

foobar2019 commented Jan 20, 2019

OK, here is a ge_dbl prototype in Go to formalize what it is all about:

package main

import (
	"crypto/sha512"
	"encoding/base32"
	"github.com/agl/ed25519/edwards25519"
	"log"
	"math/big"
)

func GetScalar(seed []byte) *[32]byte {
	h := sha512.New()
	h.Write(seed)
	digest := h.Sum(nil)

	digest[0] &= 248
	digest[31] &= 127
	digest[31] |= 64

	var hBytes [32]byte
	copy(hBytes[:], digest)
	return &hBytes
}

// SetBytes uses BE, but ed25519 is LE
func Flip(buf []byte) (res [32]byte) {
	// Bytes() from big can return less than 32bytes. Ensure that we start from right index.
	top := len(buf)-1
	for i,v := range buf {
		res[top-i] = v
	}
	return
}

func main() {
	// The counter we'd find wanted onion at
	counter := int64(12345)
	M, _ := new(big.Int).SetString("7237005577332262213973186563042994240857116359379907606001950938285454250989", 10)

	onion := base32.NewEncoding("abcdefghijklmnopqrstuvwxyz234567").WithPadding(base32.NoPadding)
	var ge edwards25519.ExtendedGroupElement
	// This is just standard address seeded by all-zero (for reference)
	var seed [32]byte
	sc := GetScalar(seed[:])
	edwards25519.GeScalarMultBase(&ge, sc)

	// Print reference
	var buf [32]byte
	ge.ToBytes(&buf)
	log.Println(onion.EncodeToString(buf[:]))

	// Now double it 12345 times
	for i := int64(0); i < counter; i++ {
		var c edwards25519.CompletedGroupElement
		ge.Double(&c)
		c.ToExtended(&ge)
	}

	// Print after 12345 times
	ge.ToBytes(&buf)
	log.Println(onion.EncodeToString(buf[:]))

	// Now take the starting scalar, and multiply it by 2^12345 mod M
	result := new(big.Int)
	result.Exp(big.NewInt(2), big.NewInt(counter), M)
	flipped := Flip(sc[:])
	result.Mul(result, new(big.Int).SetBytes(flipped[:]))
	result.Mod(result, M)

	// Which should be same as public key doubled 12345 times
	sc2 := Flip(result.Bytes())
	var ge2 edwards25519.ExtendedGroupElement
	edwards25519.GeScalarMultBase(&ge2, &sc2)
	ge2.ToBytes(&buf)
	log.Println(onion.EncodeToString(buf[:]))
}

@cathugger
Owner

0befa41
added support to ed25519-donna and set the stuff to compile by default; idk if any issues may arise, so it's not yet activated by default, use the -B flag for that.

regarding point doubling, I don't think the current primitives in the ed25519 implementations support the proper exponent thingie, so I will probably need to look elsewhere (later); otherwise I don't see many other issues with implementing it yet.

@cathugger cathugger unpinned this issue May 18, 2019
@programagor

programagor commented Oct 25, 2019

Is -p PASSWORD incompatible with -B? Because when I specify a seed via the -p flag, the -B flag seems to have no effect on speed. My ./configure options: --enable-amd64-64-24k --enable-intfilter=32

@cathugger
Owner

@programagor yeah, they're currently not compatible (because pass x batch mainloop func isn't written). I should fix that tbh.

@cathugger
Owner

@programagor added to master, will be part of next release

@joshjames

Can I just ask: wouldn't the easiest way to speed this up be to split the load, if possible?
I haven't looked into this at all yet, but I know from other projects that being able to use multiple machines to cluster the workload, like with OCLhash/hashcat, made a huge difference, especially with cloud and container based scale. With a few quick scripts I'm able to access 20 multi-core Xeon machines... for something like this, imagine what 20x the speed would do.

@programagor

@joshjames You can simply run mkp224o on multiple machines with a different starting seed. Individual instances don't need to communicate with each other, they will explore their own key space based on the seed. If you run two instances with the same seed, they will generate the same keys.

@cathugger
Owner

additionally, the yaml output can be combined with something like netcat to deliver generated keys to a single server.
regarding the seed: that's only in password mode, which isn't the default.
it uses a random seed by default, and reseeds upon every successful match, so as long as the worker machines have okay entropy it should just work.

@cathugger
Owner

cathugger commented Mar 4, 2023

https://lib25519.cr.yp.to seems to be a new thing with possibly improved fe25519_mul impl in crypto_mGnP/ed25519/amd64-{maax,mxaa} folders & potentially requiring BMI1/BMI2 instructions for these (i haven't checked in full).
could be a nice thing for next perf bump if/when i actually care enough to try integrating that.

4 participants