New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reimplementation of Random using the Xoshiro256++ PRNG #10701
Conversation
The implementation in stdlib/random.ml is going to change.
Also add the ability to split the PRNGs (Random.split, Random.State.split)
Also: add trivial test for Random.split.
- Use binary digits of pi from Wolfram Alpha for the magic constants. - Test against Blackman and Vigna's reference C implementation with known initial state [| 1; 2; 4; 8 |]
Use multiply-and-add rather than rotate-and-xor to mix the integers from the array given to Random.full_init.
Some benchmarks: Times are in seconds for 50_000_000 repetitions, unless indicated. x86 64 bits:
ARM 32 bits:
|
Some implementation remarks: The internal state is an array of four 64-bit integers. (To support splitting, there are two such arrays, but only one changes when drawing numbers.) It is crucial for performance that the integers are unboxed. The current implementation uses a 1D bigarray. It's a bit wasteful of space, but has many qualities:
An alternative that I used in Pringo is an Abstract block or Custom block that carries the four 64-bit integers directly as payload. This saves some space, but raises some delicate alignment issues on 32-bit platforms. Plus marshaling etc. needs to be implemented by hand. I'm not sure which of the two approaches is best for JS-of-OCaml. Does JS-of-OCaml has a reasonably efficient implementation of bigarrays? |
@hhugo is the person to ask. But based on a brief glance it seems that |
I think they are by piggybacking on Int32Array. Note sure why |
It turns out that Memprof is also using Xoshiro, but I don't think this would be a good thing to share the implementations, since Memprof is using a specialized version to make it possible to use SIMD instructions. Apart from that, the current implementation of the |
If by "thread" you mean the current systhreads, the copy-and-jump-forward is supposed to be atomic w.r.t. preemption, so two splits in two different threads should not result in the same state. If by "thread" you mean Multicore OCaml domains, it is a race to use a PRNG state in two domains without synchronization (not just splitting, but also drawing numbers). Did I miss something? |
I was indeed thinking of Multicore domains, since this PR is to be merged in multicore OCaml. What I feel unexpected is that when you call split, I expect the two PRNGs to be completely independent. But actually, they have hidden shared state: the |
In other word, the following program is racy:
I think that this is unexpected. |
OK, I see your point. But I don't see how to implement splitting by jumping forward if we don't have a piece of shared state to advance. How to make sure we don't get the same PRNG if we split twice? |
One naive approach would be to halve the jump distance at each split:
The obvious problem is that it's relatively easy to run out of space by repeatedly splitting in an unbalanced way. (I would expect the Haskell people to have thought about this problem and come up with good solutions, since obviously they care about observable purity of the generator state.) |
As an exercise I tried to make a decently efficient implementation of the Xoshiro module in pure OCaml, and to my own surprise I managed to get less than 10% overhead compared to the C implementation on my laptop. |
Yes please! |
Here it is (sorry for the wall of text): module Xoshiro = struct
type state = Bytes.t
external get64u : Bytes.t -> int -> Int64.t = "%caml_bytes_get64u"
external set64u : Bytes.t -> int -> Int64.t -> unit = "%caml_bytes_set64u"
let[@inline] get s i = get64u s (i * 8)
let[@inline] set s i n = set64u s (i * 8) n
let[@inline] rotl x k =
Int64.(logor (shift_left x k) (shift_right_logical x (64 - k)))
let next (s : state) : Int64.t =
let s0 = ref (get s 0) in
let s1 = ref (get s 1) in
let s2 = ref (get s 2) in
let s3 = ref (get s 3) in
let result = Int64.add (rotl (Int64.add !s0 !s3) 23) !s0 in
let t = Int64.shift_left !s1 17 in
s2 := Int64.logxor !s2 !s0;
s3 := Int64.logxor !s3 !s1;
s1 := Int64.logxor !s1 !s2;
s0 := Int64.logxor !s0 !s3;
s2 := Int64.logxor !s2 t;
s3 := rotl !s3 45;
set s 0 !s0;
set s 1 !s1;
set s 2 !s2;
set s 3 !s3;
result
let long_jump_array = Bytes.create 32
let () =
set long_jump_array 0 0x76e15d3efefdcbbfL;
set long_jump_array 1 0xc5004e441c522fb3L;
set long_jump_array 2 0x77710069854ee241L;
set long_jump_array 3 0x39109bb02acbe635L;
()
let jump_array = Bytes.create 32
let () =
set jump_array 0 0x180ec6d33cfd0abaL;
set jump_array 1 0xd5a61266f0c9392cL;
set jump_array 2 0xa9582618e03fc9aaL;
set jump_array 3 0x39abdc4529b1661cL;
()
let[@inline] jump_gen arr (s : state) =
let s0 = ref 0L in
let s1 = ref 0L in
let s2 = ref 0L in
let s3 = ref 0L in
for i = 0 to ((Bytes.length arr) / 8) - 1 do
for b = 0 to 63 do
if Int64.logand (get arr i) (Int64.shift_left 1L b) <> 0L
then begin
s0 := Int64.logxor !s0 (get s 0);
s1 := Int64.logxor !s1 (get s 1);
s2 := Int64.logxor !s2 (get s 2);
s3 := Int64.logxor !s3 (get s 3);
()
end;
ignore (next s)
done
done;
set s 0 !s0;
set s 1 !s1;
set s 2 !s2;
set s 3 !s3;
()
let jump s = jump_gen jump_array s
let _long_jump s = jump_gen long_jump_array s
let init s vinit =
let mix = 6364136223846793005L in
set s 0 0L;
set s 1 0L;
set s 2 0L;
set s 3 0L;
for i = 0 to Array.length vinit - 1 do
set s (i mod 4) (Int64.add (Int64.mul (get s (i mod 4)) mix) (Int64.of_int vinit.(i)))
done;
for i = 0 to 3 do
if get s i = 0L then set s i (Int64.shift_left 1L i)
done;
()
let create () : state =
Bytes.create 32
let assign (dst: state) (src: state) =
Bytes.blit src 0 dst 0 32
let copy st = Bytes.copy st
end |
Perhaps we could expose the |
The jump-based implementation of Random.State.split doesn't quite match the standard notion of PRNG splitting. Better leave this for future work.
I agree I shot myself in the foot with this jump-based, not-quite-right implementation of splitting. I just pushed a new version of this PR that exposes no |
AFAIK they use the same trick that PRINGO's Chacha-based generator uses: from the current N-bits state, draw N pseudorandom bits and use that as the state for the new PRNG returned by SplitMix is not cryptographically strong, so it takes some extra steps on top of the construction above to avoid producing a weak PRNG by splitting. |
Strange choice. These primitives are very slow on machines that don't support unaligned memory accesses. What about a bigarray-based representation (as in my code) plus unchecked bigarray accesses? That should produce nice code too, while avoiding alignment issues. |
I forgot about that actually. A bigarray-based representation should work too; you would only have to change the By the way, since all of the accesses are actually aligned on 64 bits, I would have liked the compiler to notice it and generate good code even on architectures that don't support unaligned access (at least for the accesses with constant indices), but apparently this optimisation isn't done at the moment. Finally, while this code should be 100% compatible with Js_of_ocaml (as long as we're using the |
Indeed, I think this is better. However, I still have a small improvement to suggest: the current implementation adds an additional internal state, Also, even if jumping might not be useful on arbitrary state, it is neither very costly to provide it (the function already exists...). |
I have mixed feelings about exposing jump directly:
For the Multicore use-case, initially I thought that the spawning domain would split its generator, with the new state going to the spawned domain. With this approach it's not clear how to jump, because it is not the case that only the first domain will spawn new domains, we effectively have an arbitrary tree of spawns (so for example jumping by a fixed amount on domain creation, such as 2^128, means that sometimes two new domains will end up in the same state). |
To clarify: I think that the current API is probably fine, and I'm not worried about the Multicore use-case which I think should be the primary concern. I just think that it's worth making sure that we really want to give up on |
As suggested by @jhjourdan, it's simpler to take a copy of the current default state then jump it.
Nice simplification! Implemented in aa36f91 . |
That was my original plan, using a "long jump" (advancing by 2^192 steps) instead of a "jump" (2^128 steps), so that each domain can still jump to its heart's content.
Not sure a modification is needed in the runtime. I thought that domain-local variables can have an initializer function (in OCaml) that is called when a new domain is created, to initialize the domain-local variable. Here, the state of the "basic generator" would be domain-local, and would be initialized as you describe (lock global generator, copy it, longjump it, unlock), all in OCaml. |
I did not follow all the discussion, and I'm not an expert on this topic, so perhaps my question is really stupid, but : I thought that using PRNG with an explicit state and splitting (rather than re-initializing) should allow to get a deterministic behavior for pseudo-random numbers, even in presence of multi-domain parallelism; but using a global generator introduces a racy semantics (the pseudo-random numbers drawn by a given domain depend on when the domain was started, compared to other domains). Isn't it problematic? |
Yes indeed, I believe @alainfrisch is right: one good reason to avoid global synchronization is to make random draws depend only on the code executed on each domain, and not their scheduling. (The code executed on each domain may itself depend on the scheduling.) |
I'd like to better understand the point @alainfrisch is trying to make. The "(lock global generator, copy it, longjump it, unlock)" proposal is for the default behavior of programs that create new domains but don't do anything special to control the PRNG. It is a better alternative to reseeding the PRNG from system data each time a domain is created, like Multicore OCaml does today. (Better because faster and with stronger independence guarantees.) Neither approach is deterministic, but I don't think determinism can be achieved here at all given that the creation of domains is itself nondeterministic (depends on scheduling) in the general case. A program that wants to control multiple PRNGs deterministically can do it with jumps and the proposed API, there's often no need for splitting. Consider
The only thing you cannot do with jumps is an arbitrarily-deep deterministic tree of PRNGs. Was this the point being made? |
Indeed, I had in mind calling Now, I'm wondering why |
@xavierleroy There is a very natural implementation of PRNG initialization on In practice, the "deterministic trees of PRNGs" used for domains have relatively few nodes, but they may be very unbalanced (for example: the first domain creates all other domains). I would expect approaches based on "random jumps in the state space" to give very good statistical properties. |
There's a high risk of getting the same PRNG state by two apparently-different paths. E.g. if you have state A, jump it to get state B, then jump B to get state C, you end up with A = B. |
Note that Multicore OCaml currently gives no simple way to pass some data (here the split state) from the parent domain (the one that calls If you have such a mechanism, the "lock global generator, copy it, longjump it, unlock" sequence could be run in the parent domain instead of the spawned domain, making PRNG states a bit more deterministic. |
Yes, we would generalize the domain-local-state module to move from a |
I understand the risk, but forcing to use the global state is also risky; any third-party library can think it's a good idea to use that global state as well, making your code (which needs to behave deterministically) unreliable. Btw, just to illustrate that I'm a complete outsider to this discussion: what's the problem of implementing a split with |
Good statistical properties of the random numbers are more important than determinism. Don't force us into an inferior solution because you're obsessed with determinism.
This is essentially the Haskell / Pringo+Chacha solution I mentioned previously. It's rumoured to be OK statistically speaking if the PRNG is cryptographically strong. If it is not, like Xoshiro, there's always a risk that a long sequence of splits will converge to a not-at-all-random state of the PRNG. Speed is not the issue here; randomness is. |
Closing in favor of #10742. |
The Haskell people then thought about this some more and realised that the semi-formal argument wasn't good enough (or rather, assuming that the PRNG being used -- L'Ecuyer 1988 -- was cryptographically strong was too much of a stretch)! The handmade The solution we settled on was to switch to SplitMix: idontgetoutmuch/random#22. (This repo was used to collaborate on A few things to mention here:
|
@curiousleo thanks for the feedback! If you are interested in the OCaml-side discussion, I would recommend looking at RFCs#28 which is somehow the "meta-issue" for this PR and the followup #10742. (The discussion is spread between the three issues, which makes it a bit difficult to follow, but oh well.) Sebastiano Vigna, a co-author of Xoshiro, has a rather negative-sounding page about PCG (The wrap-up on PCG generators). This being said, Melissa O'Neil, the author of PCG generators, has a negative-sounding page about Xoshiro (Exploring Xoshiro's close-repeat flaws). I guess that it is difficult for non-experts to sort out what do believe about those generators :-) (For me the main conclusion of this cross-issue/PRs discussion is that we really want a |
Thank you! I found the RFC and other PR right after posting on this closed PR, where visibility is probably lowest. I intend to catch up on the discussions you linked to over the next few days.
I'm aware -- and I agree that it's tricky to figure out how significant those weaknesses are. Both are in use: Lua 5.4 switched to Xoshiro, numpy uses PCG by default.
Agreed again! If in the end people want to split their PRNGs, I think it's a good idea to put thought and effort into a good |
[ I am not proposing this PR for inclusion in OCaml 4.14. It is intended for Multicore OCaml and OCaml 5.00. ]
Following extensive experimentation (https://github.com/xavierleroy/pringo/) and discussions (ocaml/RFCs#28), this PR reimplements the Random standard library module using the Xoshiro256++ pseudo-random number generator by Blackman and Vigna.
The main benefit of the new PRNG is that it supports "jumping", i.e. advancing quickly very far into the pseudo-random sequence. (The current Random implementation does not support this.) Jumping will come very handy for Multicore OCaml, enabling each domain to maintain its own PRNG state in domain-local storage, with "jumping" being used to give a fresh PRNG state when a new domain is created.
Jumping also provides the ability to split a PRNG state in two independent PRNGs, something that can be quite useful for Quickcheck-style generation of random functions and streams.Other benefits of the Xoshiro256-based PRNG include
Two other PRNGs were tried in Pringo. SplitMix has a small internal state (2^64 bits) which probably requires reseeding after 2^32 draws or splits. Chacha is a lot of code for a PRNG (it's based on a cryptographic cipher) and is oriented towards generating byte streams, requiring even more buffering code to produce integers and FP numbers. In the end, Xoshiro wins.