Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better default RNG in the future? #27614

Closed
PallHaraldsson opened this issue Jun 16, 2018 · 75 comments
Closed

Better default RNG in the future? #27614

PallHaraldsson opened this issue Jun 16, 2018 · 75 comments
Labels
RNG Random number generation and the Random stdlib

Comments

@PallHaraldsson
Copy link
Contributor

PallHaraldsson commented Jun 16, 2018

EDIT: "August 2018 Update: xoroshiro128+ fails PractRand very badly. Since this article was published, its authors have supplanted it with xoshiro256**. It has essentially the same performance, but better statistical properties. xoshiro256 is now my preferred PRNG**." I previously quoted: "The clear winner is xoroshiro128+"

https://nullprogram.com/blog/2017/09/21/

Nobody ever got fired for choosing Mersenne Twister. It’s the classical choice for simulations, and is still usually recommended to this day. However, Mersenne Twister’s best days are behind it. ..

Ultimately I would never choose Mersenne Twister for anything anymore. This was also not surprising.

@StefanKarpinski
Copy link
Sponsor Member

Too late for this release but the default RNG stream is explicitly documented to not be stable so we can change it in 1.x versions. Another contender is the PCG family of RNGs.

@PallHaraldsson
Copy link
Contributor Author

PallHaraldsson commented Jun 16, 2018

Ok, I thought last change (people assume stable). It's 7 lines of C code.

@PallHaraldsson
Copy link
Contributor Author

I'm on smartphone, I at least can't locate "(un)stable" at

https://docs.julialang.org/en/latest/stdlib/Random/

@PallHaraldsson
Copy link
Contributor Author

PallHaraldsson commented Jun 16, 2018

MT, Julia's current default, is "between 1/4 and 1/5 the speed of xoroshiro128+." and MT has "one statistical failure (dab_filltree2)."

"PCG only generates 32-bit integers despite using 64-bit operations. To properly generate a 64-bit value we’d need 128-bit operations, which would need to be implemented in software."

http://xoshiro.di.unimi.it

64-bit Generators
xoshiro256** (XOR/shift/rotate) is our all-purpose, rock-solid generator (not a cryptographically secure generator, though, like all PRNGs in these pages). It has excellent (sub-ns) speed, a state space (256 bits) that is large enough for any parallel application, and it passes all tests we are aware of.

If, however, one has to generate only 64-bit floating-point numbers (by extracting the upper 53 bits) xoshiro256+ is a slightly (≈15%) faster generator with analogous statistical properties. For general usage, one has to consider that its lowest bits have low linear complexity and will fail linearity tests; however, low linear complexity can have hardly any impact in practice, and certainly has no impact at all if you generate floating-point numbers using the upper bits (we computed a precise estimate of the linear complexity of the lowest bits).

If you are tight on space, xoroshiro128** (XOR/rotate/shift/rotate) and xoroshiro128+ have the same speed and use half of the space; the same comments apply. They are suitable only for low-scale parallel applications; moreover, xoroshiro128+ exhibits a mild dependency in Hamming weights that generates a failure after 8 TB of output in our test. We believe this slight bias cannot affect any application.

Finally, if for any reason (which reason?) you need more state, we provide in the same vein xoshiro512** / xoshiro512+ and xoroshiro1024** / xoroshiro1024* (see the paper).

@andreasnoack
Copy link
Member

andreasnoack commented Jun 16, 2018

It looks like the timings are based on his own implementation of MT. The one we use is probably quite a bit faster.

Update: The Dieharder test suite is also considered outdated

@PallHaraldsson
Copy link
Contributor Author

PallHaraldsson commented Jun 16, 2018

Also intriguing-just use AES:

Speed
The speed reported in this page is the time required to emit 64 random bits, and the number of clock cycles required to generate a byte (thanks to the PAPI library). If a generator is 32-bit in nature, we glue two consecutive outputs. Note that we do not report results using GPUs or SSE instructions: for that to be meaningful, we should have implementations for all generators. Otherwise, with suitable hardware support we could just use AES in counter mode and get 64 secure bits in 1.12ns.

@JeffreySarnoff
Copy link
Contributor

Extensive testing by others (Daniel Lemire, John D. Cook) favors PCG and notes a few problems with the xor__s. One or the other or another -- it will require Julia-centric testing and benchmarking.

@Nosferican
Copy link
Contributor

Random is a stdlib so it could have breaking changes and a new release before the next release of the language, no?

@StefanKarpinski
Copy link
Sponsor Member

Yes, it could have breaking changes before Julia 2.0.

@staticfloat
Copy link
Sponsor Member

staticfloat commented Jun 19, 2018

Just for fun, I coded up a prototype implementation:

# A quick test to test a new RNG for Julia
using BenchmarkTools

function xorshift1024(N, s)
    x = Array{UInt64}(undef, N)
    p = 0
    for i = 1:N
        # One-based indexing strikes again
        s0 = s[p + 1]
        p = (p + 1)%16
        s1 = s[p + 1]

        # Make sure you use xor() instead of ^
        s1 = xor(s1, s1 << 31)
        s1 = xor(s1, s1 >> 11)
        s0 = xor(s0, s0 >> 30)
        s[p + 1] = xor(s0, s1)

        # I love magic constants
        x[i] = s[p + 1] * 1181783497276652981
    end
    return x
end

function xoroshiro128plus(N, s)
    x = Array{UInt64}(undef, N)
    @inbounds for i = 1:N
        # copy state to temp variables
        s1, s2 = (s[1], s[2])
        
        # Calculate result immediately
        x[i] = s1 + s2
        
        # Mash up state
        s2 = xor(s2, s1)
        s1 = xor(((s1 << 55) | (s1 >> 9)), s2, (s2 << 14))
        s2 = (s2 << 36) | (s2 >> 28)
        
        # Save new state
        s[1], s[2] = (s1, s2)
    end
    return x
end

# Collect some performance data
for N in round.(Int64, 2 .^ range(12, stop=20, length=3))
    println("Testing generation of $N 64-bit numbers")
    local s = rand(UInt64, 16)

    # compare xorshift against rand()
    @btime xorshift1024($N, $s)
    @btime xoroshiro128plus($N, $s)
    @btime rand(UInt64, $N)
end

This gives surprisingly good timings, considering MT has been optimized for many a year:

Testing generation of 4096 64-bit numbers
  13.889 μs (2 allocations: 32.08 KiB)
  8.136 μs (2 allocations: 32.08 KiB)
  6.733 μs (2 allocations: 32.08 KiB)
Testing generation of 65536 64-bit numbers
  219.095 μs (2 allocations: 512.08 KiB)
  130.786 μs (2 allocations: 512.08 KiB)
  100.082 μs (2 allocations: 512.08 KiB)
Testing generation of 1048576 64-bit numbers
  4.658 ms (2 allocations: 8.00 MiB)
  3.245 ms (2 allocations: 8.00 MiB)
  3.118 ms (2 allocations: 8.00 MiB)

I note that @inbounds actually makes xorshift1024 slower, which is why that's not included in the code above.

@StefanKarpinski
Copy link
Sponsor Member

I don't think it's all that surprising—MT is pretty complex. One of the huge advantages of xoroshiro and PCG is that they're much simpler and easier to generate really complex code for.

@JeffreySarnoff
Copy link
Contributor

they put the suit in pseutdorandom

@Godisemo
Copy link

Godisemo commented Jun 19, 2018

The package RandomNumbers.jl implements many different RNGs, including xoroshiro128+ which is quite a bit faster than the standard RNG from base according to the benchmarks in the documentation.

@PallHaraldsson
Copy link
Contributor Author

PallHaraldsson commented Jun 19, 2018

Is Float rand more important than Int rand? It seems it may matter:

@JeffreySarnoff thanks for the links; still "favors PCG and notes a few problems with the xor__s." seems not conclusive nor comparing to latest variants.

Java's shiftmix64 is also interesting.

Also Google's
https://github.com/google/randen
as per AES comment above.

@PallHaraldsson
Copy link
Contributor Author

PallHaraldsson commented Jun 19, 2018

https://github.com/sunoru/RandomNumbers.jl/issues/1#issuecomment-233533191

@sunoru "Yes, I have considered accelerating some RNGs with GPU"

I guess AES is not available on GPUs so that may be a factor to NOT choose AES as default on CPUs? I.e. would we want same default on GPU and CPU?

@StefanKarpinski
Copy link
Sponsor Member

In general, it's fine (and often good) to drop some extra output bits. In PCG, IIRC, you split the internal output into external output bits and permutation bits in a way that can be adjusted, raising the possibility that we could use different output steps based on the same state to generate 64 bits of integer output versus 53 bits of floating-point output. Another approach would be to use different RNG streams for different bit sizes, which is appealing for being able to generate random 128-bit values, for example. (A pair of 128-bit streams may not be as good as a single 128-bit stream, for example, unless the two streams have coprime periods which is hard to arrange, so it can be quite helpful to have different sized streams that take advantage of their increased size to produce better streams.)

@PallHaraldsson PallHaraldsson changed the title Better default rand for 1.0. Xoroshiro256? Better default rand for 1.0. Xo(ro)shiro256**? Jun 19, 2018
@StefanKarpinski StefanKarpinski changed the title Better default rand for 1.0. Xo(ro)shiro256**? Better default RNG in the future? Jun 19, 2018
@PallHaraldsson
Copy link
Contributor Author

PallHaraldsson commented Jun 19, 2018

"The clear winner is xoroshiro128+" (seems outdated) @staticfloat thanks for coding it. Actually "xoshiro256**" seems to be their best; note dropped letters, I first thought a typo.

@sunoru
Copy link
Contributor

sunoru commented Jun 19, 2018

I'm sorry that I haven't updated that package for long. But maybe this is worth a look: http://sunoru.github.io/RandomNumbers.jl/latest/man/benchmark/

My personal choice is xoroshiro128+. Since some RNGs in PCG don't pass the big crush as its paper says, I don't know if PCG is as reliable as it sounds.

@StefanKarpinski
Copy link
Sponsor Member

PCG not passing big crush is indeed an issue. That seems like a strange disconnect.

@staticfloat
Copy link
Sponsor Member

@sunoru I was just looking at your RandomNumbers.jl page; that is some really nice work you've put together. IMO, discussion about a better RNG should be informed by your work as you've put in the time and effort to actually compare against quite a number of different PRNGs. And yes, it does appear true that xoroshiro128+ is the winner in all aspects so far; however I am not sure what the test for quality is nowadays; I've heard of BigCrush, DieHarder, TestU01, etc..... I'm sure Andreas can answer which is the test to pay attention to these days. :)

@JeffreySarnoff
Copy link
Contributor

The real issue with PCG is that it is not being well-tended. I don't know anything about why it got crushed -- if there were more effort behind that family, a remedy modification may have been made but there is none. So, going with new information: I still like PCG .. to use it for what it does best .. and with the failure, that's not this.

@chethega
Copy link
Contributor

If AES-CTR turns out to be fast enough on most architectures, then I think this would be absolutely fantastic as the default RNG.

This eradicates an entire class of security bugs and relieves people from the mental burden of considering the quality of the random stream and how to use it. And people who need something faster can still switch this out.

The only big isssue is architectures without hardware support for AES (then it would almost surely be too slow). So the bitter pill to swallow is that the random stream would become architecture dependent (e.g. older raspberry pi don't have aes acceleration and should fall back on something else).

@ViralBShah
Copy link
Member

I would be ok with having a different one on different architectures.

@chethega
Copy link
Contributor

After looking at http://www.cs.tufts.edu/comp/150FP/archive/john-salmon/parallel-random.pdf, I am more convinced that AES would be a really good default on supporting archs: Fast enough (faster than MT on a lot of CPUs, and does not eat as much L1), practically perfect quality, and no possibility of people abusing rand(UInt64) to generate session cookies and getting pwned for it.

@PallHaraldsson
Copy link
Contributor Author

PallHaraldsson commented Dec 27, 2018

@ViralBShah (on xoroshiro128** or you mean on their older RNG?) "I thought those RNG don't pass the RNG testsuite." I googled, and assume you mean:

https://github.com/andreasnoackjensen/RNGTest.jl/ that I found here: #1391

I think the discussion should continue here, not the startup time issue, while I'm not sure the default RNG needs to pass all tests (I'm not sure it does NOT do that), if you can easily opt into a better RNG.

At http://xoshiro.di.unimi.it/

there's both e.g. xoroshiro128** and xoroshiro128+ to look into.

and I only see a minor flaw on the latter:

Quality

This is probably the more elusive property of a PRNG. Here quality is measured using the powerful BigCrush suite of tests. BigCrush is part of TestU01, a monumental framework for testing PRNGs developed by Pierre L'Ecuyer and Richard Simard (“TestU01: A C library for empirical testing of random number generators”, ACM Trans. Math. Softw. 33(4), Article 22, 2007).
[..]
Beside BigCrush, we analyzed our generators using a test for Hamming-weight dependencies described in our paper. As we already remarked, our only generator failing the test (but only after 5 TB of output) is xoroshiro128+. [..]

@jdchn
Copy link

jdchn commented Dec 29, 2018

An architecture-dependent default PRNG might be a little awkward since one of the primary benefits of PRNG is reproducibility.

@mbauman mbauman added the RNG Random number generation and the Random stdlib label Dec 31, 2018
@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Oct 30, 2019

Actually, there is no cost in Julia to convert a UInt64 to Float64.

This is kind of true: the reinterpret itself is a no-op, but in order to do most integer operations on floating-point values in hardware, it's necessary to move the value from a floating-point register to an integer register (and back if you want to continue using as floating-point), which does take a cycle in each direction, so two cycles to go back and forth. So it's not free in practice.

@chethega
Copy link
Contributor

chethega commented Oct 30, 2019

I don't get how you people got that speed between xoshiro256++ and dsfmt is even comparable. With proper simd / avx2, xoshoro256++ is almost 4x faster than dsfmt for integer bulk generation, on my setup. No way in hell is the int->float conversion going to turn that into a loss.

On my machine I get down to ~0.14 cpb:

julia dsfmt 1024 x UInt64
  2.089 μs (0 allocations: 0 bytes)
ref impl, 1 x interleaved, 1024 x UInt64
  1.579 μs (0 allocations: 0 bytes)
ref impl, 4 x interleaved, 1024 x UInt64
  2.150 μs (0 allocations: 0 bytes)
julia 1 x interleaved 1 x SIMD, 1024 x UInt64
  4.336 μs (0 allocations: 0 bytes)
  1.838 μs (0 allocations: 0 bytes)
julia 1 x interleaved 2 x SIMD, 1024 x UInt64
  2.709 μs (0 allocations: 0 bytes)
  1.384 μs (0 allocations: 0 bytes)
julia 1 x interleaved 4 x SIMD, 1024 x UInt64
  1.487 μs (0 allocations: 0 bytes)
  709.241 ns (0 allocations: 0 bytes)
julia 1 x interleaved 8 x SIMD, 1024 x UInt64
  927.750 ns (0 allocations: 0 bytes)
  604.148 ns (0 allocations: 0 bytes)
julia 1 x interleaved 16 x SIMD, 1024 x UInt64

with beautiful code (allmost no branches, everything 256 bit wide; if I had a AVX512 machine, this would be even better).

CF https://gist.github.com/chethega/344a8fe464a4c1cade20ae101c49360a

Can anyone link what xoshiro code you benchmarked? Maybe I just suck at C?

@vigna
Copy link

vigna commented Oct 30, 2019

I don't get how you people got that speed between xoshiro256++ and dsfmt is even comparable. With proper simd / avx2, xoshoro256++ is almost 4x faster than dsfmt for integer bulk generation, on my setup. No way in hell is the int->float conversion going to turn that into a loss.

Wow, now you're making me shiver. 0.14 cycle/B? The vectorization is performed by the compiler or did you handcraft it (sorry, totally not proficient with Julia)?

@chethega
Copy link
Contributor

chethega commented Oct 30, 2019

I'm running an Intel(R) Core(TM) i5-5###U CPU @ 2.00GHz, i.e. a pretty anemic laptop broadwell.

The vectorization was by hand, the code is in the gist, but small enough to post here:

using SIMD, BenchmarkTools
mutable struct xstate{N}
    s0::Vec{N, UInt64}
    s1::Vec{N, UInt64}
    s2::Vec{N, UInt64}
    s3::Vec{N, UInt64}
end
macro new(T)
    return Expr(:new, T)
end

function init(xs::xstate)
    r=rand(UInt8, sizeof(xs))
    GC.@preserve xs, r
        ccall(:memcpy, Cvoid, (Ptr{Nothing}, Ptr{Nothing}, Csize_t), pointer_from_objref(xs), pointer(r), sizeof(xs))
    nothing
end

@inline rotl(a, k) = (a<<k) | (a>>(64-k))

@inline function next_nostore((s0, s1, s2, s3))
    res = rotl(s0 + s3, 23) + s0
    t = s1 << 17
    s2 ⊻= s0
    s3 ⊻= s1
    s1 ⊻= s2
    s0 ⊻= s3
    s2 ⊻= t
    s3 = rotl(s3, 45)
    return (res, (s0, s1, s2, s3))
end

function XfillSimd_nostore(dst, xs::xstate{N}, num) where N
    lane = VecRange{N}(0)
    s = xs.s0, xs.s1, xs.s2, xs.s3
    @inbounds for i=1:N:num
        dst[lane+i], s = next_nostore(s)
    end
    xs.s0, xs.s1, xs.s2, xs.s3 = s
    nothing
end
N=1024; dst=zeros(UInt64, N);
xs = @new xstate{8}; init(xs)
@btime XfillSimd_nostore($dst, $xs, $N)

Note that this uses 8 independent RNGs for bulk generation: advance 8 states, generate 8 outputs, write 8 outputs to destination, repeat.

This is no issue, just needs some extra seeding logic (seed in bulk). In the julia code, this looks like 1 rng whose state consists of four <8 x UInt64> vectors.

PS. I also implemented the 5 lines of xoshiro256** in the gist. It is much slower, presumably due to the giant latency of the multiplication. The generated inner loop for 4 x xoshiro256++ is a beauty:

L144:
	vpsrlq	$41, %ymm4, %ymm6
	vpsllq	$23, %ymm4, %ymm4
	vpor	%ymm4, %ymm6, %ymm4
	vpaddq	%ymm0, %ymm4, %ymm4
	vmovdqa	%ymm5, %ymm0
	movq	(%rdi), %rcx
	vmovdqu	%ymm4, -8(%rcx,%rdx,8)
	cmpq	%rdx, %rax
	je	L266
	addq	$4, %rdx
	vpaddq	%ymm0, %ymm2, %ymm4
	vpsllq	$17, %ymm1, %ymm6
	vpxor	%ymm0, %ymm3, %ymm3
	vpxor	%ymm1, %ymm2, %ymm2
	vpxor	%ymm1, %ymm3, %ymm1
	vpxor	%ymm0, %ymm2, %ymm5
	vpxor	%ymm6, %ymm3, %ymm3
	vpsllq	$45, %ymm2, %ymm6
	vpsrlq	$19, %ymm2, %ymm2
	vpor	%ymm6, %ymm2, %ymm2
	jns	L144

@chethega
Copy link
Contributor

So, after consideration:

The proposed setup, if we wanted to switch to xoshiro256++, would be to reserve 5 cache-lines of RNG state (ach thread has its own completely independent generator). One CL is for sequential generation, and laid out like xstate{2}, with an implementation that only advances the first of the RNGs for 1-8 byte outputs, and advances both for 16-byte outputs. This is used in code that uses rand(some_small_bitstype).

Then, there are 4 cache-lines for bulk generations, laid out like xstate{8}.

These 4 cachelines aren't hot for code that doesn't use bulk generation, so nobody cares about state-size.

There sequential generator uses xstate{2} layout because I have no better idea what to do with the remaining half of the line, so we may well use it to speed up 16byte-wide generation; and there is no reason to compactify the layout of the primary generator (sequential generation of <= 8 byte numbers), since we can't use wide SIMD loads anyway.

The sequential generator only exists because: (1) sequential generation should only fill in a single cache-line of state, and (2) bulk generation wants to use wide vector loads/stores for the state. Code that uses both sequential and bulk generation will therefore fill 5 instead of 4 lines; doesn't matter, who cares.

The sequential generator should be properly inlined (verify that, and possibly add @inline). Unfortunately I see no good way of getting the compiler to optimize out the stores to xstate in tight loops; I assume that this is in order to ensure correct behaviour when the store traps? But we don't deal with segfaults anyway, so maybe there is some magic llvm annotation that tells llvm to not care about correct xstate after trapping? I am not opposed to writing the whole RNG in a giant llvmcall.

There is no reason to implement jump. We only need seed! and output generation.

How do we tell the allocator to give us 64-byte aligned mutable struct?

In total, I think this would be a good move, both for speed and in order to cut out the dependency. One would need to port a variant of my code over to Base-julia (i.e. without SIMD.jl) and verify that the resulting @code_native is OK on all supported archs, and that the 8 x wide SIMD doesn't hurt perf on weaker machines (i.e. check that the compiler doesn't spill registers). Someone with access to AVX512, POWER and various ARM should chime in at some point. Worst case we can stick with 4 x wide simd and save 2 cache lines. Downside is slightly lower perf with AVX2 and presumably massive perf degradation with AVX512. SIMD widths are increasing all the time, so being future-proof without breaking random stream would be nice.

On the other hand, if we change the default RNG at all, I would still prefer a cryptographic choice; if that is rejected and if scientific consensus says that xoshiro gives "good enough" streams, then that's good enough for me.

@rfourquet
Copy link
Member

I don't get how you people got that speed between xoshiro256++ and dsfmt is even comparable.

Hey, I just ported naïvely @vigna's code to Julia, I wanted to have an actualized performance comparison compared to last year and with xoshiro256++ specifically. It was good enough for me, and I currently don't have the skills to write simd-optimized code! :D

What you did is amazing, with your benchmark above, this gets 2.6 times faster than my array generation routine.
If this is what you asked, here is my code:

using Random

mutable struct Xoropp <: AbstractRNG
    s0::UInt64
    s1::UInt64
    s2::UInt64
    s3::UInt64
end

rotl(x::UInt64, k::UInt) = (x << k) | (x >> ((64 % UInt) - k))

function Base.rand(rng::Xoropp, ::Random.SamplerType{UInt64})
    s0, s1, s2, s3 = rng.s0, rng.s1, rng.s2, rng.s3

    result = rotl(s0 + s3, UInt(23)) + s0

    t = s1 << 17

    s2 ⊻= s0
    s3 ⊻= s1
    s1 ⊻= s2
    s0 ⊻= s3

    s2 ⊻= t

    s3 = rotl(s3, UInt(45))

    rng.s0, rng.s1, rng.s2, rng.s3 = s0, s1, s2, s3

    result
end


# array generation code: nothing magic, but somehow gets quite faster than the default
# code for array, which does `rand(rng, UInt64)` in a loop
function Random.rand!(rng::Xoropp, A::AbstractArray{UInt64},
                      ::Random.SamplerType{UInt64})
    s0, s1, s2, s3 = rng.s0, rng.s1, rng.s2, rng.s3

    @inbounds for i = eachindex(A)
        A[i] = rotl(s0 + s3, UInt(23)) + s0

        t = s1 << 17

        s2 ⊻= s0
        s3 ⊻= s1
        s1 ⊻= s2
        s0 ⊻= s3

        s2 ⊻= t

        s3 = rotl(s3, UInt(45))
    end
    rng.s0, rng.s1, rng.s2, rng.s3 = s0, s1, s2, s3
    A
end

# for Float64 generation
Random.rng_native_52(::Xoropp) = UInt64

I can't really comment on your "proposed setup" above, as I have not much ideas above cache lines and such.

@StefanKarpinski
Copy link
Sponsor Member

@chethega, if I'm following your description, it sounds like there would be two separate generator states in that situation—sequential and array. Does that mean that calling rand() would not advance the array generator state and calling rand(1000) would not advance the sequential generator state?

@chethega
Copy link
Contributor

chethega commented Oct 31, 2019

@chethega, if I'm following your description, it sounds like there would be two separate generator states in that situation—sequential and array. Does that mean that calling rand() would not advance the array generator state and calling rand(1000) would not advance the sequential generator state?

That is correct!

In even more detail, each thread would have 10 independent xoshiro instances. 1-64 bit sequential user requests would advance rng_1, 65+ bit sequential user requests would advance rng_1 and rng_2, and all bulk user requests would advance rng_3-rng_10. rng_1 and rng_2 would be laid out together in an xstate{2}, and the bulk generators would be laid out in an xstate{8}.

Hey, I just ported naïvely

Sorry, I didn't want to impugn your code. You do awesome work in keeping julia's random infrastructure together, and your APIs rock!

What you did is amazing

That's too much praise ;) I just read the assembly, noticed that it sucks, had the audacity to propose 10 interleaved xoshiro instances instead of a single one, and passed a Vec{8, UInt64} instead of a single UInt64 into the RNG function. Praise belongs more to @vigna for xoshiro and @eschnett for SIMD.jl. I guess we should also port the bulk generator to C, such that @vigna can use it as well. Would be a fun exercise in intel intrinsics (alternative: copy-paste the generated julia code and make it inline assembly. While at it, we can also clean it up: I see no reason that the hot loop has two conditional branches when we could do with a single one, probably some weird size minimization).

PS:

array generation code: nothing magic, but somehow gets quite faster than the default
code for array, which does rand(rng, UInt64) in a loop

That is because you hoisted writing and reading the state out of the loop. This is a super important optimization, and is the main property of my nostore variant. I feel fine with this for things with a pointer, but am unsure about AbstractArray: If the setindex! calls rand again, then we would generate duplicate random numbers. We don't know the setindex! implementation: Maybe the AbstractArray is really backed by a very complicated sparse array that internally uses randomized algorithms?

@rfourquet
Copy link
Member

rfourquet commented Nov 1, 2019

That is because you hoisted writing and reading the state out of the loop

I assumed so, but was surprised that the compiler doesn't do this seemingly easy optimization by itself. But your point about setindex! being allowed to interleave a call to the same generator helps me understand why this is not trivial.
(The faster version which assumes no weird setindex! should probably be made available to array implementations in some way).

Also, I realized that in my benchmarks above, I compared xoshiro++ to a local instance of MersenneTwister, not to the global RNG. In 1.3, with threading, its performance got a serious hit (by more than 2x), so it would be interesting to test xoshiro++ as the default global RNG.

@vigna
Copy link

vigna commented Nov 1, 2019

So, please help me: why won't this be vectorized by gcc or clang? Here XOSHIRO256_UNROLL=8 and I'm using -O3 -mavx2...

static inline uint64_t next(void) {
    uint64_t t[XOSHIRO256_UNROLL];

    for(int i = 0; i < XOSHIRO256_UNROLL; i++) result[i] = rotl(s0[i] + s3[i], R) + s0[i];

    for(int i = 0; i < XOSHIRO256_UNROLL; i++) t[i] = s1[i] >> A;

    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s2[i] ^= s0[i];
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s3[i] ^= s1[i];
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s1[i] ^= s2[i];
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s0[i] ^= s3[i];

    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s2[i] ^= t[i];

    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s3[i] = rotl(s3[i], B);
    // This is just to avoid dead-code elimination
    return t[0] ^ t[XOSHIRO256_UNROLL-1];
}

@staticfloat
Copy link
Sponsor Member

staticfloat commented Nov 1, 2019

Try running clang with -Rpass-missed=loop-vectorize or -Rpass-analysis=loop-vectorize. That should print out debugging information showing which loops were not vectorized, which statements stopped vectorization, etc...

Reference: https://llvm.org/docs/Vectorizers.html#diagnostics

@vigna
Copy link

vigna commented Nov 1, 2019

I got this from clang:

./harness.c:83:11: remark: loop not vectorized: value that could not be identified as reduction is used outside the loop [-Rpass-analysis=loop-vectorize]
uint64_t e = -1;

But this is the external benchmarking loop. There's never a warning saying it's not optimizing the loops above. :(

From what I've read the problem would be that I'm using memory and not registers (i.e., variables aren't local).

@staticfloat
Copy link
Sponsor Member

Have you tried with -Rpass=loop-vectorize to see if it actually does think it's vectorizing the loop?

@chethega
Copy link
Contributor

chethega commented Nov 2, 2019

@vigna I don't know why your code didn't vectorize, but I am quite happy with clang's output for the following. The generated assembly is even nicer to read than julia's. Feel free to add this to your reference implementation, no attribution required.

#include <stdint.h>

#define XOSHIRO256_UNROLL 4

static inline uint64_t rotl(const uint64_t x, int k) {
	return (x << k) | (x >> (64 - k));
}

typedef struct {uint64_t s[4][XOSHIRO256_UNROLL];} xstate;

xstate global_state;

#define next_macro(result, s) { \
    uint64_t t[XOSHIRO256_UNROLL]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) result[i] = rotl(s[0][i] + s[3][i], 23) + s[0][i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) t[i] = s[1][i] << 17; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[2][i] ^= s[0][i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[3][i] ^= s[1][i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[1][i] ^= s[2][i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[0][i] ^= s[3][i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[2][i] ^= t[i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[3][i] = rotl(s[3][i], 45); \
} \

void rngbulk(uint64_t* restrict dst, xstate* restrict state, unsigned int len){
	xstate slocal = *state;
	uint64_t result[XOSHIRO256_UNROLL];
	unsigned int len_trunc = len - (len % XOSHIRO256_UNROLL);
	for(int j = 0; j < len_trunc; j += XOSHIRO256_UNROLL){
		next_macro( (result) , (slocal.s) );
		for(int k=0; k<XOSHIRO256_UNROLL; k++) { dst[j+k] = result[k]; }
	}
	
	if(len_trunc != len){
		next_macro( (result) , (slocal.s) );
		for(int k = 0; k < len-len_trunc; k++){ dst[len_trunc + k] = result[k];	}
	}
	*state = slocal;
}

void rngbulk_globstate(uint64_t* restrict dst, unsigned int len){ rngbulk(dst, &global_state, len);}

edit: removed memory corruption if len is not divisble by XOSHIRO256_UNROLL.

@vigna
Copy link

vigna commented Nov 2, 2019

Ah OK! Just switching to s[4][XOSHIRO256_UNROLL] instead of s0[XOSHIRO256_UNROLL], etc. made my code vectorize with gcc 💪🏻.

Now I have to understand why clang does yours and not mine, but we're getting there...

@vigna
Copy link

vigna commented Nov 2, 2019

Please note that the >> A should be << A. 🤦🏻‍♂️

@vigna
Copy link

vigna commented Nov 2, 2019

And, most fundamental, it will not happen with -O3. It has to be -O2 -ftree-vectorize. There must be some other optimization that jumps in and prevents vectorization. Weird.

@etale-cohomology
Copy link

For me, GCC 8.3 generates fast(er) code with -O2 -ftree-vectorize -mavx2 and bad code with -O3 -ftree-vectorize -mavx2 for @vigna's XOSHIRO256_UNROLL (the bad code is about 33% slower).


With -O2 -ftree-vectorize -mavx2 I get:

4010ea:	c5 fd 6f 2d 8e 2f 00 	vmovdqa 0x2f8e(%rip),%ymm5        # 404080 <s>
  4010f1:	00 
  4010f2:	48 89 d8             	mov    %rbx,%rax
  4010f5:	c5 fd 6f 0d 43 30 00 	vmovdqa 0x3043(%rip),%ymm1        # 404140 <s+0xc0>
  4010fc:	00 
  4010fd:	c5 fd 6f 25 9b 2f 00 	vmovdqa 0x2f9b(%rip),%ymm4        # 4040a0 <s+0x20>
  401104:	00 
  401105:	c5 fd 6f 05 53 30 00 	vmovdqa 0x3053(%rip),%ymm0        # 404160 <s+0xe0>
  40110c:	00 
  40110d:	48 8d 93 00 00 01 00 	lea    0x10000(%rbx),%rdx
  401114:	c5 fd 6f 1d a4 2f 00 	vmovdqa 0x2fa4(%rip),%ymm3        # 4040c0 <s+0x40>
  40111b:	00 
  40111c:	c5 fd 6f 15 bc 2f 00 	vmovdqa 0x2fbc(%rip),%ymm2        # 4040e0 <s+0x60>
  401123:	00 
  401124:	0f 1f 40 00          	nopl   0x0(%rax)
  401128:	c5 d5 d4 f1          	vpaddq %ymm1,%ymm5,%ymm6
  40112c:	c5 bd 73 f2 11       	vpsllq $0x11,%ymm2,%ymm8
  401131:	c5 e5 ef c9          	vpxor  %ymm1,%ymm3,%ymm1
  401135:	48 83 c0 40          	add    $0x40,%rax
  401139:	c5 fd 7f 70 c0       	vmovdqa %ymm6,-0x40(%rax)
  40113e:	c5 dd d4 f0          	vpaddq %ymm0,%ymm4,%ymm6
  401142:	c5 ed ef c0          	vpxor  %ymm0,%ymm2,%ymm0
  401146:	c5 d5 ef 3d b2 2f 00 	vpxor  0x2fb2(%rip),%ymm5,%ymm7        # 404100 <s+0x80>
  40114d:	00 
  40114e:	c5 fd 7f 70 e0       	vmovdqa %ymm6,-0x20(%rax)
  401153:	c5 f5 ef ed          	vpxor  %ymm5,%ymm1,%ymm5
  401157:	c5 b5 73 f3 11       	vpsllq $0x11,%ymm3,%ymm9
  40115c:	c5 dd ef 35 bc 2f 00 	vpxor  0x2fbc(%rip),%ymm4,%ymm6        # 404120 <s+0xa0>
  401163:	00 
  401164:	c5 fd ef e4          	vpxor  %ymm4,%ymm0,%ymm4
  401168:	c5 c5 ef db          	vpxor  %ymm3,%ymm7,%ymm3
  40116c:	c4 c1 45 ef f9       	vpxor  %ymm9,%ymm7,%ymm7
  401171:	c5 7d 7f 0c 24       	vmovdqa %ymm9,(%rsp)
  401176:	c5 cd ef d2          	vpxor  %ymm2,%ymm6,%ymm2
  40117a:	c4 c1 4d ef f0       	vpxor  %ymm8,%ymm6,%ymm6
  40117f:	c5 7d 7f 44 24 20    	vmovdqa %ymm8,0x20(%rsp)
  401185:	c5 fd 7f 35 93 2f 00 	vmovdqa %ymm6,0x2f93(%rip)        # 404120 <s+0xa0>
  40118c:	00 
  40118d:	c5 cd 73 f1 2d       	vpsllq $0x2d,%ymm1,%ymm6
  401192:	c5 f5 73 d1 13       	vpsrlq $0x13,%ymm1,%ymm1
  401197:	c5 cd eb c9          	vpor   %ymm1,%ymm6,%ymm1
  40119b:	c5 cd 73 f0 2d       	vpsllq $0x2d,%ymm0,%ymm6
  4011a0:	c5 fd 7f 1d 18 2f 00 	vmovdqa %ymm3,0x2f18(%rip)        # 4040c0 <s+0x40>
  4011a7:	00 
  4011a8:	c5 fd 73 d0 13       	vpsrlq $0x13,%ymm0,%ymm0
  4011ad:	c5 fd 7f 15 2b 2f 00 	vmovdqa %ymm2,0x2f2b(%rip)        # 4040e0 <s+0x60>
  4011b4:	00 
  4011b5:	c5 cd eb c0          	vpor   %ymm0,%ymm6,%ymm0
  4011b9:	c5 fd 7f 2d bf 2e 00 	vmovdqa %ymm5,0x2ebf(%rip)        # 404080 <s>
  4011c0:	00 
  4011c1:	c5 fd 7f 25 d7 2e 00 	vmovdqa %ymm4,0x2ed7(%rip)        # 4040a0 <s+0x20>
  4011c8:	00 
  4011c9:	c5 fd 7f 3d 2f 2f 00 	vmovdqa %ymm7,0x2f2f(%rip)        # 404100 <s+0x80>
  4011d0:	00 
  4011d1:	c5 fd 7f 0d 67 2f 00 	vmovdqa %ymm1,0x2f67(%rip)        # 404140 <s+0xc0>
  4011d8:	00 
  4011d9:	c5 fd 7f 05 7f 2f 00 	vmovdqa %ymm0,0x2f7f(%rip)        # 404160 <s+0xe0>

But with -O3 -ftree-vectorize -mavx2 I get:

 4010bf:	48 8b 1d ea 2f 00 00 	mov    0x2fea(%rip),%rbx        # 4040b0 <s+0x30>
  4010c6:	48 8b 05 b3 2f 00 00 	mov    0x2fb3(%rip),%rax        # 404080 <s>
  4010cd:	4c 8b 05 6c 30 00 00 	mov    0x306c(%rip),%r8        # 404140 <s+0xc0>
  4010d4:	48 8b 3d 6d 30 00 00 	mov    0x306d(%rip),%rdi        # 404148 <s+0xc8>
  4010db:	48 89 9c 24 b0 00 00 	mov    %rbx,0xb0(%rsp)
  4010e2:	00 
  4010e3:	48 8b 1d 86 30 00 00 	mov    0x3086(%rip),%rbx        # 404170 <s+0xf0>
  4010ea:	48 89 84 24 b8 00 00 	mov    %rax,0xb8(%rsp)
  4010f1:	00 
  4010f2:	48 8b 05 8f 2f 00 00 	mov    0x2f8f(%rip),%rax        # 404088 <s+0x8>
  4010f9:	48 89 9c 24 a8 00 00 	mov    %rbx,0xa8(%rsp)
  401100:	00 
  401101:	48 8b 1d b0 2f 00 00 	mov    0x2fb0(%rip),%rbx        # 4040b8 <s+0x38>
  401108:	48 89 84 24 80 00 00 	mov    %rax,0x80(%rsp)
  40110f:	00 
  401110:	48 8b 05 79 2f 00 00 	mov    0x2f79(%rip),%rax        # 404090 <s+0x10>
  401117:	48 89 9c 24 a0 00 00 	mov    %rbx,0xa0(%rsp)
  40111e:	00 
  40111f:	48 8b 1d 52 30 00 00 	mov    0x3052(%rip),%rbx        # 404178 <s+0xf8>
  401126:	48 89 44 24 78       	mov    %rax,0x78(%rsp)
  40112b:	48 8b 05 66 2f 00 00 	mov    0x2f66(%rip),%rax        # 404098 <s+0x18>
  401132:	48 89 9c 24 98 00 00 	mov    %rbx,0x98(%rsp)
  401139:	00 
  40113a:	48 8b 1d 7f 2f 00 00 	mov    0x2f7f(%rip),%rbx        # 4040c0 <s+0x40>
  401141:	48 89 44 24 70       	mov    %rax,0x70(%rsp)
  401146:	48 8b 05 53 2f 00 00 	mov    0x2f53(%rip),%rax        # 4040a0 <s+0x20>
  40114d:	48 89 9c 24 f8 00 00 	mov    %rbx,0xf8(%rsp)
  401154:	00 
  401155:	48 8b 1d 6c 2f 00 00 	mov    0x2f6c(%rip),%rbx        # 4040c8 <s+0x48>
  40115c:	48 89 84 24 88 00 00 	mov    %rax,0x88(%rsp)
  401163:	00 
  401164:	48 8b 05 3d 2f 00 00 	mov    0x2f3d(%rip),%rax        # 4040a8 <s+0x28>
  40116b:	48 89 9c 24 c0 00 00 	mov    %rbx,0xc0(%rsp)
  401172:	00 
  401173:	48 8b 1d 56 2f 00 00 	mov    0x2f56(%rip),%rbx        # 4040d0 <s+0x50>
  40117a:	48 89 84 24 90 00 00 	mov    %rax,0x90(%rsp)
  401181:	00 
  401182:	48 8b 35 c7 2f 00 00 	mov    0x2fc7(%rip),%rsi        # 404150 <s+0xd0>
  401189:	48 8b 05 d8 2f 00 00 	mov    0x2fd8(%rip),%rax        # 404168 <s+0xe8>
  401190:	48 8b 0d c1 2f 00 00 	mov    0x2fc1(%rip),%rcx        # 404158 <s+0xd8>
  401197:	48 89 9c 24 c8 00 00 	mov    %rbx,0xc8(%rsp)
  40119e:	00 
  40119f:	48 8b 15 ba 2f 00 00 	mov    0x2fba(%rip),%rdx        # 404160 <s+0xe0>
  4011a6:	48 8b 1d 2b 2f 00 00 	mov    0x2f2b(%rip),%rbx        # 4040d8 <s+0x58>
  4011ad:	c7 44 24 1c 00 04 00 	movl   $0x400,0x1c(%rsp)
  4011b4:	00 
  4011b5:	48 89 44 24 68       	mov    %rax,0x68(%rsp)
  4011ba:	4c 8b 3d 3f 2f 00 00 	mov    0x2f3f(%rip),%r15        # 404100 <s+0x80>
  4011c1:	48 89 9c 24 e8 00 00 	mov    %rbx,0xe8(%rsp)
  4011c8:	00 
  4011c9:	48 8b 1d 10 2f 00 00 	mov    0x2f10(%rip),%rbx        # 4040e0 <s+0x60>
  4011d0:	4c 8b 35 31 2f 00 00 	mov    0x2f31(%rip),%r14        # 404108 <s+0x88>
  4011d7:	4c 8b 2d 32 2f 00 00 	mov    0x2f32(%rip),%r13        # 404110 <s+0x90>
  4011de:	48 89 9c 24 d0 00 00 	mov    %rbx,0xd0(%rsp)
  4011e5:	00 
  4011e6:	48 8b 1d fb 2e 00 00 	mov    0x2efb(%rip),%rbx        # 4040e8 <s+0x68>
  4011ed:	4c 8b 25 24 2f 00 00 	mov    0x2f24(%rip),%r12        # 404118 <s+0x98>
  4011f4:	4c 8b 1d 2d 2f 00 00 	mov    0x2f2d(%rip),%r11        # 404128 <s+0xa8>
  4011fb:	48 89 9c 24 d8 00 00 	mov    %rbx,0xd8(%rsp)
  401202:	00 
  401203:	48 8b 1d e6 2e 00 00 	mov    0x2ee6(%rip),%rbx        # 4040f0 <s+0x70>
  40120a:	4c 8b 15 1f 2f 00 00 	mov    0x2f1f(%rip),%r10        # 404130 <s+0xb0>
  401211:	4c 8b 0d 20 2f 00 00 	mov    0x2f20(%rip),%r9        # 404138 <s+0xb8>
  401218:	48 89 9c 24 e0 00 00 	mov    %rbx,0xe0(%rsp)
  40121f:	00 
  401220:	48 8b 1d d1 2e 00 00 	mov    0x2ed1(%rip),%rbx        # 4040f8 <s+0x78>
  401227:	48 89 9c 24 f0 00 00 	mov    %rbx,0xf0(%rsp)
  40122e:	00 
  40122f:	48 8b 1d ea 2e 00 00 	mov    0x2eea(%rip),%rbx        # 404120 <s+0xa0>
  401236:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  40123d:	00 00 00 
  401240:	48 8b 84 24 f8 00 00 	mov    0xf8(%rsp),%rax
  401247:	00 
  401248:	4c 33 bc 24 b8 00 00 	xor    0xb8(%rsp),%r15
  40124f:	00 
  401250:	4c 33 b4 24 80 00 00 	xor    0x80(%rsp),%r14
  401257:	00 
  401258:	4c 33 6c 24 78       	xor    0x78(%rsp),%r13
  40125d:	48 c1 e0 11          	shl    $0x11,%rax
  401261:	4c 33 64 24 70       	xor    0x70(%rsp),%r12
  401266:	48 33 9c 24 88 00 00 	xor    0x88(%rsp),%rbx
  40126d:	00 
  40126e:	48 89 44 24 58       	mov    %rax,0x58(%rsp)
  401273:	48 8b 84 24 c0 00 00 	mov    0xc0(%rsp),%rax
  40127a:	00 
  40127b:	4c 33 9c 24 90 00 00 	xor    0x90(%rsp),%r11
  401282:	00 
  401283:	4c 33 94 24 b0 00 00 	xor    0xb0(%rsp),%r10
  40128a:	00 
  40128b:	48 c1 e0 11          	shl    $0x11,%rax
  40128f:	4c 33 8c 24 a0 00 00 	xor    0xa0(%rsp),%r9
  401296:	00 
  401297:	48 33 bc 24 c0 00 00 	xor    0xc0(%rsp),%rdi
  40129e:	00 
  40129f:	48 89 44 24 50       	mov    %rax,0x50(%rsp)
  4012a4:	48 8b 84 24 c8 00 00 	mov    0xc8(%rsp),%rax
  4012ab:	00 
  4012ac:	48 33 b4 24 c8 00 00 	xor    0xc8(%rsp),%rsi
  4012b3:	00 
  4012b4:	48 33 8c 24 e8 00 00 	xor    0xe8(%rsp),%rcx
  4012bb:	00 
  4012bc:	48 c1 e0 11          	shl    $0x11,%rax
  4012c0:	4c 33 84 24 f8 00 00 	xor    0xf8(%rsp),%r8
  4012c7:	00 
  4012c8:	48 89 44 24 48       	mov    %rax,0x48(%rsp)
  4012cd:	48 8b 84 24 e8 00 00 	mov    0xe8(%rsp),%rax
  4012d4:	00 
  4012d5:	48 c1 e0 11          	shl    $0x11,%rax
  4012d9:	48 89 44 24 40       	mov    %rax,0x40(%rsp)
  4012de:	48 8b 84 24 d0 00 00 	mov    0xd0(%rsp),%rax
  4012e5:	00 
  4012e6:	48 c1 e0 11          	shl    $0x11,%rax
  4012ea:	48 89 44 24 38       	mov    %rax,0x38(%rsp)
  4012ef:	48 8b 84 24 d8 00 00 	mov    0xd8(%rsp),%rax
  4012f6:	00 
  4012f7:	48 c1 e0 11          	shl    $0x11,%rax
  4012fb:	48 89 44 24 30       	mov    %rax,0x30(%rsp)
  401300:	48 8b 84 24 e0 00 00 	mov    0xe0(%rsp),%rax
  401307:	00 
  401308:	48 c1 e0 11          	shl    $0x11,%rax
  40130c:	48 89 44 24 28       	mov    %rax,0x28(%rsp)
  401311:	48 8b 84 24 f0 00 00 	mov    0xf0(%rsp),%rax
  401318:	00 
  401319:	48 c1 e0 11          	shl    $0x11,%rax
  40131d:	48 89 44 24 20       	mov    %rax,0x20(%rsp)
  401322:	48 8b 84 24 d0 00 00 	mov    0xd0(%rsp),%rax
  401329:	00 
  40132a:	48 31 d0             	xor    %rdx,%rax
  40132d:	48 89 44 24 60       	mov    %rax,0x60(%rsp)
  401332:	48 8b 44 24 68       	mov    0x68(%rsp),%rax
  401337:	48 33 84 24 d8 00 00 	xor    0xd8(%rsp),%rax
  40133e:	00 
  40133f:	48 8b 94 24 a8 00 00 	mov    0xa8(%rsp),%rdx
  401346:	00 
  401347:	48 33 94 24 e0 00 00 	xor    0xe0(%rsp),%rdx
  40134e:	00 
  40134f:	4c 31 bc 24 f8 00 00 	xor    %r15,0xf8(%rsp)
  401356:	00 
  401357:	48 89 94 24 a8 00 00 	mov    %rdx,0xa8(%rsp)
  40135e:	00 
  40135f:	48 8b 94 24 98 00 00 	mov    0x98(%rsp),%rdx
  401366:	00 
  401367:	48 33 94 24 f0 00 00 	xor    0xf0(%rsp),%rdx
  40136e:	00 
  40136f:	4c 31 b4 24 c0 00 00 	xor    %r14,0xc0(%rsp)
  401376:	00 
  401377:	4c 31 ac 24 c8 00 00 	xor    %r13,0xc8(%rsp)
  40137e:	00 
  40137f:	4c 31 a4 24 e8 00 00 	xor    %r12,0xe8(%rsp)
  401386:	00 
  401387:	48 31 9c 24 d0 00 00 	xor    %rbx,0xd0(%rsp)
  40138e:	00 
  40138f:	4c 31 9c 24 d8 00 00 	xor    %r11,0xd8(%rsp)
  401396:	00 
  401397:	4c 31 94 24 e0 00 00 	xor    %r10,0xe0(%rsp)
  40139e:	00 
  40139f:	4c 31 8c 24 f0 00 00 	xor    %r9,0xf0(%rsp)
  4013a6:	00 
  4013a7:	48 89 94 24 98 00 00 	mov    %rdx,0x98(%rsp)
  4013ae:	00 
  4013af:	48 8b 54 24 60       	mov    0x60(%rsp),%rdx
  4013b4:	4c 31 84 24 b8 00 00 	xor    %r8,0xb8(%rsp)
  4013bb:	00 
  4013bc:	49 c1 c8 13          	ror    $0x13,%r8
  4013c0:	48 31 bc 24 80 00 00 	xor    %rdi,0x80(%rsp)
  4013c7:	00 
  4013c8:	48 c1 cf 13          	ror    $0x13,%rdi
  4013cc:	48 31 74 24 78       	xor    %rsi,0x78(%rsp)
  4013d1:	48 c1 ce 13          	ror    $0x13,%rsi
  4013d5:	48 31 4c 24 70       	xor    %rcx,0x70(%rsp)
  4013da:	48 c1 c9 13          	ror    $0x13,%rcx
  4013de:	48 31 94 24 88 00 00 	xor    %rdx,0x88(%rsp)
  4013e5:	00 
  4013e6:	48 8b 94 24 a8 00 00 	mov    0xa8(%rsp),%rdx
  4013ed:	00 
  4013ee:	48 31 84 24 90 00 00 	xor    %rax,0x90(%rsp)
  4013f5:	00 
  4013f6:	48 c1 c8 13          	ror    $0x13,%rax
  4013fa:	4c 33 7c 24 58       	xor    0x58(%rsp),%r15
  4013ff:	4c 33 74 24 50       	xor    0x50(%rsp),%r14
  401404:	4c 33 6c 24 48       	xor    0x48(%rsp),%r13
  401409:	48 89 44 24 68       	mov    %rax,0x68(%rsp)
  40140e:	48 8b 84 24 a8 00 00 	mov    0xa8(%rsp),%rax
  401415:	00 
  401416:	48 31 94 24 b0 00 00 	xor    %rdx,0xb0(%rsp)
  40141d:	00 
  40141e:	48 8b 94 24 98 00 00 	mov    0x98(%rsp),%rdx
  401425:	00 
  401426:	48 31 94 24 a0 00 00 	xor    %rdx,0xa0(%rsp)
  40142d:	00 
  40142e:	48 c1 c8 13          	ror    $0x13,%rax
  401432:	48 8b 54 24 60       	mov    0x60(%rsp),%rdx
  401437:	4c 33 64 24 40       	xor    0x40(%rsp),%r12
  40143c:	48 89 84 24 a8 00 00 	mov    %rax,0xa8(%rsp)
  401443:	00 
  401444:	48 8b 84 24 98 00 00 	mov    0x98(%rsp),%rax
  40144b:	00 
  40144c:	48 33 5c 24 38       	xor    0x38(%rsp),%rbx
  401451:	4c 33 5c 24 30       	xor    0x30(%rsp),%r11
  401456:	48 c1 ca 13          	ror    $0x13,%rdx
  40145a:	48 c1 c8 13          	ror    $0x13,%rax
  40145e:	4c 33 54 24 28       	xor    0x28(%rsp),%r10
  401463:	4c 33 4c 24 20       	xor    0x20(%rsp),%r9
  401468:	ff 4c 24 1c          	decl   0x1c(%rsp)
  40146c:	48 89 84 24 98 00 00 	mov    %rax,0x98(%rsp)
  401473:	00 
  401474:	0f 85 c6 fd ff ff    	jne    401240 <main+0x1d0>
  40147a:	c5 fa 7e 54 24 78    	vmovq  0x78(%rsp),%xmm2
  401480:	48 8b 44 24 68       	mov    0x68(%rsp),%rax
  401485:	c4 e3 e9 22 4c 24 70 	vpinsrq $0x1,0x70(%rsp),%xmm2,%xmm1
  40148c:	01 
  40148d:	c5 fa 7e 9c 24 b8 00 	vmovq  0xb8(%rsp),%xmm3
  401494:	00 00 
  401496:	c4 e3 e1 22 84 24 80 	vpinsrq $0x1,0x80(%rsp),%xmm3,%xmm0
  40149d:	00 00 00 01 
  4014a1:	c5 fa 7e a4 24 b0 00 	vmovq  0xb0(%rsp),%xmm4
  4014a8:	00 00 
  4014aa:	c5 fa 7e ac 24 88 00 	vmovq  0x88(%rsp),%xmm5
  4014b1:	00 00 
  4014b3:	c5 fa 7e b4 24 c8 00 	vmovq  0xc8(%rsp),%xmm6
  4014ba:	00 00 
  4014bc:	c5 fa 7e bc 24 f8 00 	vmovq  0xf8(%rsp),%xmm7
  4014c3:	00 00 
  4014c5:	c4 e3 7d 38 c1 01    	vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  4014cb:	c4 e3 d9 22 8c 24 a0 	vpinsrq $0x1,0xa0(%rsp),%xmm4,%xmm1
  4014d2:	00 00 00 01 
  4014d6:	c4 c1 f9 6e e5       	vmovq  %r13,%xmm4
  4014db:	c5 fd 7f 05 9d 2b 00 	vmovdqa %ymm0,0x2b9d(%rip)        # 404080 <s>
  4014e2:	00 
  4014e3:	c4 e3 d1 22 84 24 90 	vpinsrq $0x1,0x90(%rsp),%xmm5,%xmm0
  4014ea:	00 00 00 01 
  4014ee:	c4 c1 f9 6e ef       	vmovq  %r15,%xmm5
  4014f3:	c5 fa 7e 94 24 e0 00 	vmovq  0xe0(%rsp),%xmm2
  4014fa:	00 00 
  4014fc:	c5 fa 7e 9c 24 d0 00 	vmovq  0xd0(%rsp),%xmm3
  401503:	00 00 
  401505:	c4 e3 7d 38 c1 01    	vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  40150b:	c4 e3 c9 22 8c 24 e8 	vpinsrq $0x1,0xe8(%rsp),%xmm6,%xmm1
  401512:	00 00 00 01 
  401516:	c4 c1 f9 6e f2       	vmovq  %r10,%xmm6
  40151b:	c5 fd 7f 05 7d 2b 00 	vmovdqa %ymm0,0x2b7d(%rip)        # 4040a0 <s+0x20>
  401522:	00 
  401523:	c4 e3 c1 22 84 24 c0 	vpinsrq $0x1,0xc0(%rsp),%xmm7,%xmm0
  40152a:	00 00 00 01 
  40152e:	c4 e1 f9 6e fb       	vmovq  %rbx,%xmm7
  401533:	c4 e3 7d 38 c1 01    	vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  401539:	c4 e3 e9 22 8c 24 f0 	vpinsrq $0x1,0xf0(%rsp),%xmm2,%xmm1
  401540:	00 00 00 01 
  401544:	c4 e1 f9 6e d6       	vmovq  %rsi,%xmm2
  401549:	c5 fd 7f 05 6f 2b 00 	vmovdqa %ymm0,0x2b6f(%rip)        # 4040c0 <s+0x40>
  401550:	00 
  401551:	c4 e3 e1 22 84 24 d8 	vpinsrq $0x1,0xd8(%rsp),%xmm3,%xmm0
  401558:	00 00 00 01 
  40155c:	c4 c1 f9 6e d8       	vmovq  %r8,%xmm3
  401561:	c4 e3 7d 38 c1 01    	vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  401567:	c4 c3 d9 22 cc 01    	vpinsrq $0x1,%r12,%xmm4,%xmm1
  40156d:	c5 fa 7e a4 24 a8 00 	vmovq  0xa8(%rsp),%xmm4
  401574:	00 00 
  401576:	c5 fd 7f 05 62 2b 00 	vmovdqa %ymm0,0x2b62(%rip)        # 4040e0 <s+0x60>
  40157d:	00 
  40157e:	c4 c3 d1 22 c6 01    	vpinsrq $0x1,%r14,%xmm5,%xmm0
  401584:	c4 e1 f9 6e ea       	vmovq  %rdx,%xmm5
  401589:	c4 e3 7d 38 c1 01    	vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  40158f:	c4 c3 c9 22 c9 01    	vpinsrq $0x1,%r9,%xmm6,%xmm1
  401595:	c5 fd 7f 05 63 2b 00 	vmovdqa %ymm0,0x2b63(%rip)        # 404100 <s+0x80>
  40159c:	00 
  40159d:	c4 c3 c1 22 c3 01    	vpinsrq $0x1,%r11,%xmm7,%xmm0
  4015a3:	c4 e3 7d 38 c1 01    	vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  4015a9:	c4 e3 e9 22 c9 01    	vpinsrq $0x1,%rcx,%xmm2,%xmm1
  4015af:	c5 fd 7f 05 69 2b 00 	vmovdqa %ymm0,0x2b69(%rip)        # 404120 <s+0xa0>
  4015b6:	00 
  4015b7:	c4 e3 e1 22 c7 01    	vpinsrq $0x1,%rdi,%xmm3,%xmm0
  4015bd:	c4 e3 7d 38 c1 01    	vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  4015c3:	c4 e3 d9 22 8c 24 98 	vpinsrq $0x1,0x98(%rsp),%xmm4,%xmm1
  4015ca:	00 00 00 01 
  4015ce:	c5 fd 7f 05 6a 2b 00 	vmovdqa %ymm0,0x2b6a(%rip)        # 404140 <s+0xc0>
  4015d5:	00 
  4015d6:	c4 e3 d1 22 c0 01    	vpinsrq $0x1,%rax,%xmm5,%xmm0
  4015dc:	c4 e3 7d 38 c1 01    	vinserti128 $0x1,%xmm1,%ymm0,%ymm0
  4015e2:	c5 fd 7f 05 76 2b 00 	vmovdqa %ymm0,0x2b76(%rip)        # 404160 <s+0xe0>

@chriselrod
Copy link
Contributor

And, most fundamental, it will not happen with -O3. It has to be -O2 -ftree-vectorize. There must be some other optimization that jumps in and prevents vectorization. Weird.

Are you using gcc? Most often when I get vectorization with -O2 -ftree-vectorize but not -O3, I do get it with -O3 -fdisable-tree-cunrolli. This happens with manually unrolled code.


In VectorizedRNG.jl I have a vectorized PCG implementation. It is a lot slower than xoshiro. With AVX2, it is also about 50% slower than the default MersenneTwister, but it is faster with AVX512.
Perhaps also of interest is that I implemented a SIMD random normal generator using the Box-Muller algorithm. It is close to 1.5x and 4x faster than the default with AVX2 and AVX512, respectively. I suspect it can be improved further.

@PallHaraldsson
Copy link
Contributor Author

PallHaraldsson commented Feb 13, 2021

@chriselrod, despite me proposing a new PRNG, at least for some applications, e.g. for neural networks (there QRNG seems to have the best case, and for others strangely making worse?), NO PRNG may be good enough (if accuracy, not just speed, is your concern). It's at least good to have in mind.

On the effects of pseudorandom and quantum-random number generators in soft computing
https://link.springer.com/article/10.1007%2Fs00500-019-04450-0

We use a CPU and a QPU to generate random numbers for multiple machine learning techniques. Random numbers are employed in the random initial weight distributions of dense and convolutional neural networks, in which results show a profound difference in learning patterns for the two.
[..]
In 50 dense neural networks (25 PRNG/25 QRNG), QRNG increases over PRNG [e.g.]
QRNG exceeded PRNG for mental state EEG classification by + 2.82%. [..]
In CIFAR-10, the QRNG outperforms PRNG by + 0.92%. The n-random split of a Random Tree is enhanced towards and new Quantum Random Tree (QRT) model, which has differing classification abilities to its classical counterpart [..]
Finally, the QRT is ensembled into a Quantum Random Forest (QRF), which also has a noticeable effect when compared to the standard Random Forest (RF).
[..]
Although quantum, quantum-inspired and hybrid classical/quantum algorithms are explored, as well as the likewise methods for computing, the use of a Quantum Random Number Generator is rarely explored within a classical machine learning approach in which an RNG is required
[..]
Major issues arise with the possibility of backdoors, notably, for example, Intel’s pseudorandom generator which, after hijacking, allowed for complete control of a computer system for malicious intent Degabriele et al. (2016), Schneier (2007). The Intel issue was far from a lone incident, the RANDU system was cracked by the NSA for unprecedented access to the RSA BSAFE cryptographic library, as well as in 2006 when Debian OpenSSL’s random number generator was also cracked, leading to Debian being compromised for 2 years Markowsky (2014).

The paper is about QRNG (which could be a drop-in-replacemnt in Julia, but also an interesting overview of Quantum Machine Learning in general, with e.g.:

This section outlines some of the state-of-the-art applications of quantum theory in computing.
[..]
Quantum Perceptrons are a theoretical approach to deriving a quantum equivalent of a perceptron unit (neuron) within an Artificial Neural Network Schuld et al. (2014). [..]
A very limited extent of research suggests quantum effects in a network to be the possible source of consciousness Hameroff and Penrose (1996)

Offtopic, but intriguing (and maybe related if QRNG used?):
Quantum Image Processing and Its Application to Edge Detection: Theory and Experiment
https://arxiv.org/abs/1801.01465

@vks
Copy link

vks commented Feb 13, 2021

at least for some applications, e.g. for neural networks (there QRNG seems to have the best case, and for others strangely making worse?), NO PRNG may be good enough

I find this hard to believe, a cryptographically secure PRNG has to be undistinguishable from true randomness. It would be quite a statement to claim that no such PRNG exists.

@JeffBezanson
Copy link
Sponsor Member

fixed by #40546

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
RNG Random number generation and the Random stdlib
Projects
None yet
Development

No branches or pull requests