
The PCG implementation provided by Numpy has significant, dangerous self-correlation #16313

Closed
vigna opened this issue May 20, 2020 · 106 comments · Fixed by #18906

@vigna

vigna commented May 20, 2020

The PCG generator used by Numpy has a significant amount of self-correlation. That is, for each sequence generated from a seed there is a large number of correlated, nonoverlapping sequences starting from other seeds. By "correlated" I mean that if you interleave two such sequences and test the result, you obtain failures that did not appear in each sequence individually.

The probability that two generators out of a large set of terminals get two of those sequences is nonnegligible. Why this happens is well known from a mathematical viewpoint, but it is explained here in detail: http://prng.di.unimi.it/pcg.pgp (see "Subsequences within the same generator").

To show this problem directly, I wrote a simple C program reusing the Numpy code: http://prng.di.unimi.it/intpcgnumpy.c . The program takes the two 128-bit states of two generators (with the same LCG constant, or "stream") in the form of high and low bits, interleaves their outputs and writes the result in binary form. Once we send it through PractRand, we should see no statistical failure, as the two streams should be independent. But if you try to start from two states with the same 64 lower bits, you get:

./intpcgnumpy 0x596d84dfefec2fc7 0x6b79f81ab9f3e37b 0x8d7deae980a64ab0 0x6b79f81ab9f3e37b | stdbuf -oL ~/svn/c/xorshift/practrand/RNG_test stdin -tf 2 -te 1 -tlmaxonly -multithreaded
RNG_test using PractRand version 0.94
RNG = RNG_stdin, seed = unknown
test set = expanded, folding = extra

rng=RNG_stdin, seed=unknown
length= 128 megabytes (2^27 bytes), time= 2.2 seconds
  Test Name                         Raw       Processed     Evaluation
  BCFN(0+0,13-2,T)                  R= +27.6  p =  1.0e-13    FAIL
  BCFN(0+1,13-2,T)                  R= +68.0  p =  2.3e-34    FAIL !!!
  BCFN(0+2,13-3,T)                  R= +90.8  p =  8.8e-43    FAIL !!!
  BCFN(0+3,13-3,T)                  R=+120.6  p =  6.9e-57    FAIL !!!!
  DC6-6x2Bytes-1                    R=  +8.9  p =  4.0e-5   mildly suspicious
  DC6-5x4Bytes-1                    R= +15.7  p =  4.3e-9   very suspicious
  [Low1/8]BCFN(0+0,13-4,T)          R= +11.6  p =  4.9e-5   unusual
  ...and 1074 test result(s) without anomalies
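(For readers who don't want to open the C file: the harness just steps two copies of the Numpy PCG64 state and alternates their outputs. Below is a minimal sketch of that idea, not the actual intpcgnumpy.c source. The multiplier and increment are the pcg-c defaults — Numpy normally entropy-seeds the increment too, but here both generators share it, matching the adversarial setup above — and the two states are the ones from the command line.)

#include <stdint.h>
#include <stdio.h>

/* 128-bit LCG multiplier and (default) increment from pcg-c. */
static const __uint128_t MULT =
    ((__uint128_t)0x2360ed051fc65da4ULL << 64) | 0x4385df649fccf645ULL;
static const __uint128_t INC =
    ((__uint128_t)0x5851f42d4c957f2dULL << 64) | 0x14057b7ef767814fULL;

static uint64_t rotr64(uint64_t v, unsigned r) {
    return (v >> r) | (v << ((64 - r) & 63));
}

/* XSL-RR output: xor the two halves, rotate by the top 6 state bits. */
static uint64_t xsl_rr(__uint128_t state) {
    return rotr64((uint64_t)(state >> 64) ^ (uint64_t)state,
                  (unsigned)(state >> 122));
}

int main(void) {
    /* The two adversarial states used above: same low 64 bits. */
    __uint128_t s0 = ((__uint128_t)0x596d84dfefec2fc7ULL << 64) | 0x6b79f81ab9f3e37bULL;
    __uint128_t s1 = ((__uint128_t)0x8d7deae980a64ab0ULL << 64) | 0x6b79f81ab9f3e37bULL;

    for (;;) { /* interleave the outputs, write raw bytes for PractRand */
        uint64_t buf[2];
        s0 = s0 * MULT + INC; buf[0] = xsl_rr(s0);
        s1 = s1 * MULT + INC; buf[1] = xsl_rr(s1);
        fwrite(buf, sizeof buf, 1, stdout);
    }
}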

You can even go lower—you just need the same 58 lower bits:

./intpcgnumpy 0x596d84dfefec2fc7 0x0579f81ab9f3e37b 0x8d7deae980a64ab0 0x6b79f81ab9f3e37b | stdbuf -oL ~/svn/c/xorshift/practrand/RNG_test stdin -tf 2 -te 1 -tlmaxonly -multithreaded

[...]
rng=RNG_stdin, seed=unknown
length= 32 gigabytes (2^35 bytes), time= 453 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/16]FPF-14+6/32:cross        R= +11.6  p =  4.0e-10   VERY SUSPICIOUS
  [Low1/32]FPF-14+6/32:cross        R= +16.5  p =  3.2e-14    FAIL
  [Low1/32]FPF-14+6/16:cross        R= +12.8  p =  3.8e-11   VERY SUSPICIOUS
  [Low1/64]FPF-14+6/64:cross        R=  +6.8  p =  4.8e-6   mildly suspicious
  [Low1/64]FPF-14+6/32:cross        R=  +6.0  p =  1.9e-5   unusual
  [Low1/64]FPF-14+6/16:cross        R=  +5.5  p =  5.8e-5   unusual
  [Low4/32]FPF-14+6/64:all          R=  +5.8  p =  5.9e-5   unusual
  [Low4/32]FPF-14+6/32:(0,14-0)     R=  +7.7  p =  1.0e-6   unusual
  [Low4/32]FPF-14+6/32:(1,14-0)     R=  +7.7  p =  9.1e-7   unusual
  [Low4/32]FPF-14+6/32:all          R=  +6.5  p =  1.3e-5   unusual
  [Low4/64]FPF-14+6/64:all          R=  +5.9  p =  5.1e-5   unusual
  [Low4/64]FPF-14+6/64:cross        R=  +8.2  p =  3.0e-7   suspicious
  [Low4/64]FPF-14+6/32:(0,14-0)     R=  +7.6  p =  1.0e-6   unusual
  [Low8/64]FPF-14+6/64:(0,14-0)     R= +17.0  p =  2.2e-15    FAIL
  [Low8/64]FPF-14+6/64:(1,14-0)     R=  +9.1  p =  5.1e-8   mildly suspicious
  [Low8/64]FPF-14+6/64:all          R= +12.7  p =  2.1e-11   VERY SUSPICIOUS
  [Low8/64]FPF-14+6/32:(0,14-0)     R= +12.8  p =  1.7e-11   VERY SUSPICIOUS
  [Low8/64]FPF-14+6/32:all          R= +11.0  p =  9.3e-10   VERY SUSPICIOUS
  ...and 1696 test result(s) without anomalies

Note that to get more than 50% probability that two generators start from two correlated seeds (chosen at random) you need just about half a million generators starting at random (birthday paradox). And if you consider the probability that they do not start from exactly the same state, but have significantly overlapping correlated sequences, you need much less.

Any sensible generator from the literature will not behave like that. You can choose adversarially any two starting states of MRG32k3a, SFC64, CMWC, xoshiro256++, etc., and as long as you generate nonoverlapping sequences you will not see the failures above. This is a major drawback that can pop up when a number of devices use the generator and one assumes (as one should) that those sequences are pairwise uncorrelated. The correlation can induce unwanted behavior that is hard to detect.

Please at least document somewhere that the generator should not be used on multiple terminals or in a highly parallel environment.

The same can happen with different "streams", as the sequences generated by an LCG by changing the additive constant are all the same modulo a change of sign and an additive constant. You can see some discussion here: rust-random/rand#907 and a full mathematical discussion of the problem here: https://arxiv.org/abs/2001.05304 .

@mattip
Member

mattip commented May 20, 2020

@imneme, @bashtage, @rkern would be the authorities here, but I think we have gone over this and it is why we preferred the SeedSequence.spawn interface over the jumped one. For instance there was this discussion when we were discussing the API. Please check the advice here https://numpy.org/devdocs/reference/random/parallel.html and suggest improvements as needed.

@rkern
Member

rkern commented May 20, 2020

@mattip This has nothing to do with jumping.

@bashtage
Contributor

I think in practice it is difficult to make wholesale changes, although improved documentation is always a good idea.

I would probably recommend AESCounter for anyone with AES-NI, or SPECK128 for anyone without, in highly parallel settings.

@rkern
Member

rkern commented May 20, 2020

The same can happen with different "streams", as the sequences generated by an LCG by changing the additive constant are all the same modulo a change of sign and an additive constant.

Can you quantify this? I can replicate the failures using the same increment, but we seed the increment as well as the state, and I have not yet observed a failure with two different random increments. If the increments also have to be carefully constructed, then that would affect the practical birthday collision frequency.

https://gist.github.com/rkern/f46552e030e59b5f1ebbd3b3ec045759

❯ ./pcg64_correlations.py --same-increment | stdbuf -oL ./RNG_test stdin64 -tf 2 -te 1 -tlmaxonly -multithreaded
0x56b35656ede2b560587e4251568a8fed
0x93526034ed105e9e587e4251568a8fed
[
    {
        "bit_generator": "PCG64",
        "state": {
            "state": 115244779949650410574112983538102603757,
            "inc": 137507567477557873606783385380908979143
        },
        "has_uint32": 0,
        "uinteger": 0
    },
    {
        "bit_generator": "PCG64",
        "state": {
            "state": 195824235027336627448689568147458133997,
            "inc": 137507567477557873606783385380908979143
        },
        "has_uint32": 0,
        "uinteger": 0
    }
]
RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0x4bf19f7b
test set = expanded, folding = extra

rng=RNG_stdin64, seed=0x4bf19f7b
length= 128 megabytes (2^27 bytes), time= 3.0 seconds
  Test Name                         Raw       Processed     Evaluation
  BCFN_FF(2+0,13-3,T)               R= +59.9  p =  3.8e-28    FAIL !!!       
  BCFN_FF(2+1):freq                 R= +89.0  p~=   6e-18     FAIL !         
  BCFN_FF(2+2):freq                 R= +39.6  p~=   6e-18     FAIL !         
  BCFN_FF(2+3):freq                 R= +14.6  p~=   6e-18     FAIL !         
  BCFN_FF(2+4):freq                 R= +10.3  p~=   5e-11   very suspicious  
  DC6-9x1Bytes-1                    R=  +7.1  p =  5.6e-4   unusual          
  DC6-6x2Bytes-1                    R= +18.9  p =  1.0e-10   VERY SUSPICIOUS 
  DC6-5x4Bytes-1                    R= +11.2  p =  1.4e-6   suspicious       
  [Low4/16]BCFN_FF(2+0):freq        R= +19.5  p~=   6e-18     FAIL !         
  [Low4/16]FPF-14+6/16:all          R=  +5.6  p =  1.0e-4   unusual          
  [Low4/16]FPF-14+6/4:all           R=  +5.9  p =  4.6e-5   unusual          
  [Low4/32]BCFN_FF(2+0):freq        R=  +6.5  p~=   2e-5    unusual          
  [Low8/32]BCFN_FF(2+0):freq        R= +15.1  p~=   6e-18     FAIL !         
  [Low8/32]FPF-14+6/32:all          R=  +8.4  p =  2.5e-7   very suspicious  
  [Low8/32]FPF-14+6/32:all2         R=  +9.0  p =  7.8e-5   unusual          
  [Low8/32]FPF-14+6/16:(0,14-0)     R= +12.4  p =  4.5e-11   VERY SUSPICIOUS 
  [Low8/32]FPF-14+6/16:all          R= +15.5  p =  5.2e-14    FAIL           
  [Low8/32]FPF-14+6/16:all2         R= +41.4  p =  2.6e-16    FAIL !         
  [Low8/32]FPF-14+6/4:(0,14-0)      R=  +6.9  p =  5.9e-6   unusual          
  [Low8/32]FPF-14+6/4:all           R=  +7.9  p =  6.6e-7   suspicious       
  ...and 871 test result(s) without anomalies

@vigna
Author

vigna commented May 20, 2020

OK, I'll try again.

There are no multiple streams in an LCG with a power-of-2 modulus. Many believed there were in the early days, and there are even old papers claiming to do interesting stuff with those "streams", but it has been known for decades that the orbits you obtain by changing the constant are all the same modulo an additive constant and possibly a sign change. The farthest back I can trace this result is

Mark J. Durst, Using linear congruential generators for parallel random number generation,
1989 Winter Simulation Conference Proceedings, IEEE Press, 1989, pp. 462–466.
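The translation between "streams" is easy to exhibit concretely. The following is a minimal sketch (my own illustration, not from the programs linked in this thread): for any shift delta, choosing the second increment as c2 = c1 + delta*(1 - a) makes the second orbit exactly the first orbit translated by delta, term by term. The multiplier is the Numpy PCG64 one; the increment and shift are arbitrary.

#include <assert.h>
#include <stdint.h>

int main(void) {
    const __uint128_t a =          /* Numpy's PCG64 LCG multiplier */
        ((__uint128_t)0x2360ed051fc65da4ULL << 64) | 0x4385df649fccf645ULL;
    const __uint128_t c1 = 0x14057b7ef767814fULL;  /* any odd increment */
    const __uint128_t delta = 0xdeadbeefULL;       /* arbitrary shift   */
    /* Pick the second "stream" so its orbit is the first one shifted:
       if y_n = x_n + delta, then y_{n+1} = a*y_n + c2 = x_{n+1} + delta. */
    const __uint128_t c2 = c1 + delta * (1 - a);

    __uint128_t x = 12345, y = x + delta;
    for (int i = 0; i < 1000000; i++) {
        x = x * a + c1;
        y = y * a + c2;
        assert(y == x + delta);  /* the "different stream" is a pure shift */
    }
    return 0;
}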

So, I wrote another program http://prng.di.unimi.it/corrpcgnumpy.c in which you can set:

  • An initial state for a PRNG.
  • An initial state for another PRNG.
  • An arbitrary "stream constant" for the first PRNG.
  • An arbitrary "stream constant" for the second PRNG (they should be both even or both odd; this restriction can be removed with some additional fiddling).
  • A fixed number of lower bits that we set adversarially in the second PRNG, essentially in such a way that it starts with the same bits as the first PRNG. The rest of the bits are taken from the initial state you provided for the second PRNG.

So this is exactly the setting of the first program, but you can also choose the constants.

./corrpcgnumpy 0x596d84dfefec2fc7 0x6b79f81ab9f3e37b 0xac9c8abfcb89f65f 0xe42e8dff1c46de8b 0x8d7deae9efec2fc7 0x6b79f81ab9f3e37b 0x06e13e5e8c92c843 0xf92e8346feee7a21 56 | stdbuf -oL ~/svn/c/xorshift/practrand/RNG_test stdin -tf 2 -te 1 -tlmaxonly -multithreaded

rng=RNG_stdin, seed=unknown
length= 4 gigabytes (2^32 bytes), time= 113 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/8]BCFN(0+0,13-1,T)          R= +27.2  p =  4.0e-14    FAIL
  [Low1/8]DC6-6x2Bytes-1            R= +10.9  p =  4.4e-6   suspicious
  [Low1/64]DC6-5x4Bytes-1           R=  -6.4  p =1-1.4e-4   unusual
  [Low8/64]FPF-14+6/64:(0,14-0)     R=  +8.4  p =  2.2e-7   mildly suspicious
  [Low8/64]FPF-14+6/64:all          R=  +8.7  p =  1.2e-7   suspicious
  [Low8/64]FPF-14+6/32:(0,14-0)     R= +10.2  p =  5.1e-9   suspicious
  [Low8/64]FPF-14+6/32:all          R=  +9.4  p =  2.7e-8   very suspicious
  [Low8/64]FPF-14+6/16:all          R=  +5.8  p =  6.4e-5   unusual
  ...and 1439 test result(s) without anomalies

So there are at least 2^72 correlated subsequences, no matter how you choose the "stream constants", exactly as in the same-constant case.

And we're giving a ridiculous amount of slack to the generator: even if instead of the exact starting point I'm calculating you used a state a little bit before or after, correlation would show up anyway. You can easily modify the program with an additional parameter to check that.

I repeat, once again: no existing modern generator from the scientific literature has this misbehavior (of course, a power-of-2 LCG has this behavior, but, for God's sake, that's not a modern generator).

@imneme

imneme commented May 20, 2020

Sebastiano's critiques of PCG are addressed in this blog post from 2018.

The short version is that if you're allowed to contrive specific seeds, you can show “bad looking” behavior out of pretty much any PRNG. Notwithstanding his claim that PCG is somehow unique, actually PCG is pretty conventional — PCG's streams are no worse than, say, SplitMix's, which is another widely used PRNG.

@vigna
Author

vigna commented May 20, 2020

That is entirely false. To prove me wrong, show two correlated nonoverlapping sequences from MRG32k3a or xoshiro256++.

@imneme

imneme commented May 20, 2020

I never said non-overlapping. Show me a currently available test for xoshiro256++ that shows that two seeds avoid overlap.

@imneme

imneme commented May 20, 2020

In contrast, I do have a test for PCG that shows that the “correlations” you showed are essentially a form of overlap.

@vigna
Author

vigna commented May 20, 2020

I can't fight FUD like "essentially" and "a form", but I modified http://prng.di.unimi.it/intpcgnumpy.c so that it initially iterates each PRNG 10 billion times and exits with an error message if the generated sequence traverses the initial state of the other PRNG. This guarantees that the first 160GB of data fed into PractRand come from non-overlapping sequences:

./intpcgnumpy 0x596d84dfefec2fc7 0x0579f81ab9f3e37b 0x8d7deae980a64ab0 0x6c79f81ab9f3e37b | stdbuf -oL ~/svn/c/xorshift/practrand/RNG_test stdin -tf 2 -te 1 -tlmaxonly -multithreaded
[...]
rng=RNG_stdin, seed=unknown
length= 64 gigabytes (2^36 bytes), time= 926 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/8]FPF-14+6/64:(0,14-0)      R=  +8.8  p =  8.7e-8   mildly suspicious
  [Low1/8]FPF-14+6/64:all           R=  +6.3  p =  2.1e-5   unusual          
  [Low1/16]FPF-14+6/64:(0,14-0)     R=  +7.6  p =  1.1e-6   unusual          
  [Low1/16]FPF-14+6/64:(1,14-0)     R=  +8.3  p =  2.9e-7   mildly suspicious
  [Low1/16]FPF-14+6/64:all          R=  +8.0  p =  5.8e-7   suspicious       
  [Low1/16]FPF-14+6/32:all          R=  +7.1  p =  3.9e-6   mildly suspicious
  [Low1/64]FPF-14+6/32:cross        R=  +7.1  p =  2.6e-6   mildly suspicious
  [Low4/32]FPF-14+6/64:(0,14-0)     R= +13.5  p =  4.3e-12   VERY SUSPICIOUS 
  [Low4/32]FPF-14+6/64:all          R=  +9.0  p =  5.9e-8   very suspicious  
  [Low4/64]FPF-14+6/64:(0,14-0)     R= +11.4  p =  3.8e-10  very suspicious  
  [Low4/64]FPF-14+6/64:all          R=  +8.0  p =  5.3e-7   suspicious       
  [Low4/64]FPF-14+6/32:(0,14-0)     R= +10.3  p =  3.6e-9   suspicious       
  [Low4/64]FPF-14+6/32:all          R=  +6.1  p =  3.2e-5   unusual          
  [Low8/64]FPF-14+6/64:(0,14-0)     R= +18.6  p =  8.4e-17    FAIL           
  [Low8/64]FPF-14+6/64:(1,14-0)     R= +11.4  p =  3.9e-10  very suspicious  
  [Low8/64]FPF-14+6/64:(2,14-0)     R=  +8.3  p =  2.8e-7   mildly suspicious
  [Low8/64]FPF-14+6/64:all          R= +15.3  p =  6.9e-14    FAIL           
  [Low8/64]FPF-14+6/32:(0,14-0)     R=  +7.8  p =  7.1e-7   unusual          
  [Low8/64]FPF-14+6/32:(1,14-0)     R=  +7.2  p =  2.7e-6   unusual          
  [Low8/64]FPF-14+6/32:all          R=  +5.8  p =  6.9e-5   unusual          
  ...and 1786 test result(s) without anomalies

This particular initialization data has just 56 fixed lower bits, so one can generate 2^72 correlated sequences by flipping the higher bits. The statistical failures happen after just 64GB of data, showing that overlap is not responsible for the correlation. It is possible that with other specific targeted choices overlap happens before 64GB, of course; this is a specific example. But it is now easy to check that overlapping is not the problem: the generator simply has a lot of undesirable internal correlation.

@mattip
Member

mattip commented May 20, 2020

Please respect the code of conduct. Try to keep your comments in tone with the directives to be "empathetic, welcoming, friendly, and patient" and "be careful in the words that we choose. We are careful and respectful in our communication".

@vigna
Author

vigna commented May 27, 2020

I never said non-overlapping. Show me a currently available test for xoshiro256++ that shows that two seeds avoid overlap.

Well, it's trivial: decide the length of the stream, iterate, and check that the two streams do not cross each other's initial state. It's the same code I used to show that the correlated PCG streams in the program http://prng.di.unimi.it/intpcgnumpy.c do not overlap.
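(In sketch form, for a generic generator with an equality-comparable state; the state layout and next() below are placeholders, not any specific API:)

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Generic sketch of the non-overlap check: a 256-bit-state generator
   with an in-place step function. Both are placeholders. */
typedef struct { uint64_t s[4]; } state_t;
void next(state_t *s); /* one step of the generator under test */

/* True if neither stream of n outputs can overlap the other: running
   a for n steps never reaches b's initial state, and vice versa. */
static bool nonoverlapping(state_t a, state_t b, uint64_t n) {
    const state_t a0 = a, b0 = b;
    for (uint64_t i = 0; i < n; i++) {
        next(&a);
        next(&b);
        if (memcmp(&a, &b0, sizeof a) == 0) return false;
        if (memcmp(&b, &a0, sizeof b) == 0) return false;
    }
    return true;
}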

@vigna
Author

vigna commented May 27, 2020

Notwithstanding his claim that PCG is somehow unique, actually PCG is pretty conventional — PCG's streams are no worse than, say, SplitMix's, which is another widely used PRNG.

IMHO, the self-correlation within PCG is much worse. There is no result for the additive generator underlying a SplitMix instance analogous to Durst's dramatic 1989 results about LCGs.

But the very mild problems of SplitMix are known, and JEP 356 will provide a new class of splittable generators, LXM, trying to address those problems. It would be time to move on and replace PCG, too, with something less flawed.

The underlying issue is known for both generators, and it is due to the lack of state mixing. If you change bit k of the state of one of those generators, the change will never propagate below bit k. This does not happen in LCGs with prime modulus, in F₂-linear generators, in CMWC generators, etc. All other generators try to mix their state as quickly and as much as possible.
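(This is easy to verify numerically for the LCG case; a minimal sketch, with the Numpy PCG64 multiplier and an arbitrary bit position:)

#include <assert.h>
#include <stdint.h>

int main(void) {
    const __uint128_t mult =
        ((__uint128_t)0x2360ed051fc65da4ULL << 64) | 0x4385df649fccf645ULL;
    const __uint128_t inc = 0x14057b7ef767814fULL; /* any odd constant */
    const int k = 40;                              /* flip bit 40      */
    __uint128_t x = 0x6b79f81ab9f3e37bULL;
    __uint128_t y = x ^ ((__uint128_t)1 << k);

    for (int i = 0; i < 1000000; i++) {
        x = x * mult + inc;
        y = y * mult + inc;
        /* The difference stays a multiple of 2^k forever, so the
           two states never diverge below bit k. */
        assert(((x ^ y) & (((__uint128_t)1 << k) - 1)) == 0);
    }
    return 0;
}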

Equating the problems of PCG and SplitMix is a red herring. SplitMix has a very simple underlying generator, just additive, but on top of that there is a scrambling function that is very powerful: it is Appleby's 64-bit finalizer from the MurmurHash3 hash function, which has been widely used in a number of contexts and has been improved by Stafford (http://zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html). The constants of the function have been trained to have specific, measurable avalanching properties. Even changes in a small number of bits tend to spread over all the output. In other words, SplitMix stands on the shoulders of giants.

On the contrary, the LCG underlying the PCG generators has the same lack-of-mixing problems, but the scrambling functions are just a simple sequence of arithmetic and logical operations assembled by the author without any theoretical or statistical guarantee. Had they been devised taking into account the fact that all sequences of the underlying LCG are the same modulo an additive constant and possibly a sign change, it would have been possible to address the problem.

But the author had no idea that the sequences were so easily derivable one from another. This can be easily seen from this statement in Section 4.2.3 of the PCG technical report (https://www.pcg-random.org/pdf/hmc-cs-2014-0905.pdf):

"Every choice of c results in a different sequence of numbers that has none of its pairs of successive outputs in common with another sequence."

This is taken as proof that the sequences are different, that is, that the underlying LCG provides multiple streams. Durst's 1989 negative results on this topic do not appear anywhere in the paper. As I remarked earlier, by those results all such sequences are the same, modulo an additive constant and possibly a sign change (for LCGs with power-of-2 modulus of maximum potency, as is the case in PCG).

I'm sure not quoting Durst's results is a bona fide mistake, but the problem is that once you are convinced that the underlying LCG provides "streams" that are "different" in some sense, when they are not, you end up with a generator like PCG, in which for each subsequence there are 2^72 non-overlapping, correlated subsequences, even if you change the "stream".

@rkern
Member

rkern commented May 27, 2020

Thank you all for your input. For the moment, I am not interested in binary judgments like "PCG is good/bad". Please use your own forums for such discussions. What is on-topic here is what numpy will do, and that final judgment belongs to the numpy developers. We do appreciate the expertise that you all bring to this discussion, but I want to focus it on the underlying facts rather than the final judgment. I especially appreciate quantitative statements that give me an idea of the amount of headroom that we have. If my earlier judgments were wrong, it was because I jumped to the judgment too soon, so I would appreciate your assistance in avoiding that again. Thank you.

Note that to get more than 50% probability that two generators start from two correlated seeds (chosen at random) you need just about half a million generators starting at random (birthday paradox).

@vigna Can you walk me through this calculation? The birthday collision calculation that I am familiar with gives the 50% chance of an n-bit collision at 2**(n/2) items (give or take a factor of 2). Half a million is 2**19, so you seem to be claiming that the dangerous correlations start at around a 40-bit collision in the lower bits, but I have not seen evidence that this is practically observable. I have tested a pair sharing the lower 40 bits and got to 16 TiB in PractRand before cancelling the test. If you have observed a failure with a 40-bit collision, how many TiB did you have to test to see it?

I am convinced that changing the increment doesn't affect the probability of collision. Further discussion of the merits of "PCG streams" is off-topic. Using that discussion as an excuse to repeatedly hammer on "the author" is especially unwelcome and treads on our code of conduct. Persisting will mean that we will have to proceed without your input. Thank you.

@imneme This seems to be related to the issues with jumping by a multiple of a large power of 2. When I construct a pair of PCG64 instances with the same increment and sharing the lower n bits of state, the distance that I calculate between the two is a multiple of 1 << n. It does appear that your stronger DXSM output function resolves this manifestation as well. I've tested a pair of PCG64DXSM instances that share an increment and the lower 64 bits of state out to 2 TiB without issue.
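(For reference, the distance between two states of the same LCG can be computed bit by bit; the sketch below follows my reading of the algorithm in pcg-cpp's pcg_extras, so treat the details as an approximation of that code rather than the code itself.)

#include <stdint.h>

/* Distance from state a to state b of the LCG x -> x*mult + inc
   (mod 2^128), with a shared increment. Each iteration decides one bit
   of the distance and then squares the jump: (mult, inc) is updated so
   that one application advances the generator by 2^(bit+1) steps. */
static __uint128_t lcg_distance(__uint128_t a, __uint128_t b,
                                __uint128_t mult, __uint128_t inc) {
    __uint128_t bit = 1, dist = 0;
    while (a != b) {
        if ((a & bit) != (b & bit)) { /* this bit of the distance is 1 */
            a = a * mult + inc;       /* jump a forward by 2^bit steps */
            dist |= bit;
        }
        bit <<= 1;
        inc = (mult + 1) * inc;       /* compose the jump with itself: */
        mult *= mult;                 /* one application = double steps */
    }
    return dist;
}

With two states sharing the low n bits and the same increment, the low n bits of the computed distance come out zero, which is the "multiple of 1 << n" behavior described above.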

@vigna
Author

vigna commented May 27, 2020

OK, this is embarrassing: it was half a billion, not half a million. A single letter can make a big difference. I apologize for the slip.

But, as I said earlier, this is the probability of hitting exactly the same starting state, not the probability of a significant overlap between correlated subsequences. Personally I prefer to use PRNGs without correlated subsequences, since there are plenty of them, but, as you rightly say, the decision is only yours.

Fixing the scrambling function so that it has better mixing properties sounds like a perfectly reasonable solution.

My post was just meant to be a clarification of the structural differences between PCG and SplitMix, since a previous post claimed they had similar problems, and I don't think that is a correct statement. You cannot write a program like http://prng.di.unimi.it/corrpcgnumpy.c for SplitMix.

@imneme

imneme commented May 27, 2020

@rkern, you asked:

@imneme This seems to be related to the issues with jumping by a multiple of a large power of 2. When I construct a pair of PCG64 instances with the same increment and sharing the lower n bits of state, the distance that I calculate between the two is a multiple of 1 << n. It does appear that your stronger DXSM output function resolves this manifestation as well. I've tested a pair of PCG64DXSM instances that share an increment and the lower 64 bits of state out to 2 TiB without issue.

Thanks for finding and linking back to the discussion thread from last year. Yes, as Sebastiano notes in his response,

Fixing the scrambling function so that it has better mixing properties sounds like a perfectly reasonable solution.

XSL-RR is at the weaker end of things. In contrast, both the original RXS-M output function from the PCG paper and the new DXSM output function do more in the way of scrambling, and so don't show these kinds of issues. DXSM (added to the PCG source in this commit last year) was specifically designed to be stronger than XSL-RR but have similar time performance (cf. RXS-M, which is slower). I tested DXSM fairly hard last year; 67 days into the run we had an extended power outage that took down the server (the UPS battery drained) and ended the test run, but at that point it had proven itself pretty well in both normal testing (128 TB of output tested) and jumps of 2^48 (64 TB of output tested, since that runs slower).
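For the curious, the DXSM output step has roughly the following shape (a sketch based on the pcg-cpp commit linked above and NumPy's later implementation; 0xda942042e4dd58b5 is the "cheap" 64-bit multiplier that PCG64DXSM also uses for the LCG state update):

#include <stdint.h>

/* DXSM: a sketch of the output step. hi/lo are the two halves of the
   128-bit LCG state. */
static inline uint64_t pcg_output_dxsm(uint64_t hi, uint64_t lo) {
    lo |= 1;                       /* force the low half odd          */
    hi ^= hi >> 32;                /* xorshift ...                    */
    hi *= 0xda942042e4dd58b5ULL;   /* ... multiply (cheap multiplier) */
    hi ^= hi >> 48;                /* second xorshift                 */
    hi *= lo;                      /* data-dependent multiply by low  */
    return hi;
}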

Given that RXS-M would have taken care of the issue even without designing DXSM, one question is why I ever used the weaker XSL-RR permutation at all: why not always use very strong bit-scrambling in the output function? The answer is that it basically comes down to cycles. Speed matters to people, so you try to avoid doing a lot more scrambling than you need to.

This is an issue Sebastiano is familiar with, because his approach and mine have much in common. We each take a long-established approach that would fail modern statistical tests (LCGs in my case, Marsaglia's XorShift LFSRs in his) and add a scrambling output function to redeem it. We both strive to make that output function cheap, and we've both been caught out a little where deficiencies in the underlying generator that we're trying to mask with our output function nevertheless show through. In his case, it's linearity issues that have shown through.

But in somewhat recent work that pleased me a lot, he's also shown how many LFSR-based designs have hamming-weight issues (reflecting a longstanding concern of my own) that are inadequately masked by their output functions. His own generators pass his test, so that's something, but back in 2018 when I was looking at his new xoshiro PRNG it seemed to me that hamming-weight issues from the underlying generator did make it through his output function. He's since revised xoshiro with a new output function, and I hope that does the trick (he made some other changes too, so perhaps that one also fixes the repeats issue, highlighted by this test program?).

As for his correlation programs, back in 2018 when he put a critique of PCG on his website that included programs he'd written with various issues (like contrived seeding, etc.), I wrote a response that contained a bunch of similar programs for other long-established PRNGs, including corrsplitmix2.c, which creates correlated streams in SplitMix. I'm not quite sure what Sebastiano means when he says it can't be done, but I will admit I haven't had a chance to look closely at his test program to see if his new one is substantively different from the ones he wrote a couple of years ago.

@matthew-brett
Contributor

Please forgive my lack of understanding - but could I ask someone for a summary conclusion at this stage? Are there realistic circumstances in which the default PCG is unsafe? What are the arguments for switching the default?

@bashtage
Contributor

bashtage commented Jun 3, 2020

The easy path is to add some documentation about high- (ultra-high-?) dimensional applications and the probability of correlated sequences.

A harder path would be to replace the output function, which would break the stream. I'm not sure how strong a promise default_rng makes. The docs don't seem to warn that this may change, so it would probably need a deprecation cycle to change. This would require adding the new output function as a standalone bit generator (or making it configurable from PCG64, which would be more sensible) and then warning users that it will be changing after year XXXX/release Y.ZZ.

The hardest path would be to find a new default rng. It wasn't easy the first time and I don't think anything has changed to move the needle in any particular direction over the past 18 months.

@seberg
Member

seberg commented Jun 3, 2020

I opened gh-16493 about the default_rng() modification. I do not think it is related to this issue, and I am not even sure we have to discuss it; we probably set down the rules long ago and I just don't remember.


I do not claim to understand this discussion fully, but it seems there are two things to figure out:

  1. Being certain that we have enough headroom: I have to trust Robert on this, and right now it sounds to me like it should be fine to the best of our knowledge? (I.e., the probability of an actual collision is probably embarrassingly low even in environments magnitudes larger than anything NumPy may be used in? Whether or not it may warrant changing the default in the future is a different issue.)

  2. We state:

    and supports [...] as well as :math:`2^{127}` streams

    which, not knowing exactly where the number comes from, sounds like it may be a slight overstatement on our part, and we could consider adjusting it slightly to be perfectly correct? Or link to some external resource giving additional details?

@rkern
Member

rkern commented Jun 3, 2020

The easiest thing to do right now would be to add a PCG64DXSM BitGenerator, the variant of PCG64 with the "cheap multiplier" and stronger DXSM output function. I think everyone agrees that that's a step up from the XSL-RR output function we have now in our PCG64 implementation, performing better statistically without damage to runtime performance. It's a straightforward upgrade in the niche that PCG64 serves for the BitGenerators that we provide. I think we should add it alongside PCG64.

Incidentally, I prefer that it be a separate named BitGenerator rather than an option to the PCG64 constructor. Such options are great in randomgen, whose purpose is to provide a lot of different algorithms and variants, but for numpy, I think we want our selections to be "grab-and-go" as much as we can.

I don't think we really settled the policy for making changes to what default_rng() provides. When I proposed it, I brought up the notion that one of the reasons I preferred it to just putting its functionality into the Generator() constructor was that we could deprecate and move to a differently-named function if we needed to. However, at that time we were considering that default_rng() might need to expose a lot of details of the underlying BitGenerator, which we subsequently avoided. Because PCG64DXSM exposes the same API (.jumped() in particular) as PCG64, the only consideration we would have is that using it as the new default would change the bitstream. I think it would be reasonable for us to follow the same timeline as any other modification to the stream coming from Generator methods per NEP 19 (i.e. on X.Y.0 feature releases). We can choose to be a little more cautious, if we want, and first expose PCG64DXSM as an available BitGenerator in 1.20.0 and document (but not warn(), which would be a lot of noise to no effect) that default_rng() will change to using it in 1.21.0.

@bashtage
Contributor

bashtage commented Jun 3, 2020

As part of adding the new BG it would be good to update the notes of PCG64 so that they serve as a guide and provide a rationale for preferring the newer variant.

@rkern
Member

rkern commented Jun 3, 2020

  1. Being certain that we have enough headroom: I have to trust Robert on this, and right now it sounds to me like it should be fine to the best of our knowledge? (I.e., the probability of an actual collision is probably embarrassingly low even in environments magnitudes larger than anything NumPy may be used in? Whether or not it may warrant changing the default in the future is a different issue.)

That's probably a little too glib. It depends on how many streams you have, how much data you pull from each stream, and what your risk tolerance for a birthday collision is. I haven't gotten around to crunching that math into a comprehensible paragraph with easy-to-follow recommendations yet, which is why I haven't revisited this Github issue in a while. I don't think it's a hair-on-fire issue that needs to be fixed right now, though.

@imneme

imneme commented Jun 3, 2020

I'll write something longer later, but as I see it, in this thread we've been retreading the ground we went over last year. Nothing has changed other than that Sebastiano discovered that NumPy had shipped PCG. The analysis from the NumPy team last year was more in-depth and considered more plausible scenarios.

@matthew-brett
Contributor

My preference would be to upgrade the default as quickly as reasonably possible - just to reduce confusion. I mean, not wait for a deprecation cycle.

@matthew-brett
Contributor

@imneme - thanks much - I'd find a longer thing very useful.

@imneme

imneme commented Jun 3, 2020

Probably the best blast from the past post about it is this one. I think it's certainly worth a read. You can scroll back up in the thread from there to see us talking about these issues. It was about PCG 32.

@imneme

imneme commented Jun 4, 2020

I had in my head what I wanted to say, but I find looking at the posts from a year ago, I've said it all already, both here and elsewhere (my blog (2017), reddit, my blog again (2018), and in NumPy discussions, etc.)

About streams and their self-similarity (for which @rkern wrote a stream-dependence tester), I wrote last year:

As noted in my blog post that was mentioned earlier, PCG's streams have a lot in common with SplitMix's.

Regarding @mdickinson's graph, for every PRNG that allows you to seed its entire state, including counter-based cryptographic ones, we can contrive seedings where we'd have PRNGs whose outputs were correlated in some way (the easiest way to do so is to make PRNG states that are a short distance apart, but often we can do other things based on an understanding of how they work). And although PRNGs that don't allow full-state seeding can avoid this issue, doing so just introduces a new one, only providing practical access to a tiny fraction of their possible states.

The right way to think of streams is just as more random state that needs to be seeded. Using small values like 1, 2, 3 is generally a bad idea for any seeding purpose for any PRNG (because if everyone favors these seeds, their corresponding initial sequences will be overrepresented).

We can choose not to call it a stream at all and just call it state. That's what Marsaglia did in XorWow. If you look at the code, the Weyl-sequence counter doesn't interact with the rest of the state at all and, as in LCGs, variations in the initial value really just amount to an added constant.

SplitMix's, PCG's and XorWow's streams are what we might call “stupid” streams. They constitute a trivial reparameterization of the generator. There is value in this, however. Suppose that without streams, our PRNG would have an interesting close repeat of 42, where 42 crops up several times in quick succession and does this only for 42 and no other number. With stupid “just an increment” or “just an xor” streams, we actually avoid hardwiring the weird repeat to 42; every number has a stream in which it is the weird repeat. (For this reason, the fix I'd apply to repair the close-repeat problems in xoshiro256 is to mix in a Weyl sequence.)

(I then went into more depth in this comment and in this one.)

I'd say that the key take-away is that for almost any PRNG, with a bit of time and energy you can contrive pathological seedings. PCG is in a somewhat unusual position that it has someone who enjoys working plausible looking seedings for PCG specifically that have a hand-crafted pathology (i.e., Sebastiano). As a result of that work, I've turned around and done the same for both his PRNGs and for other longstanding ones.

In general, you want to initialize the PRNG state with something that looks “kinda random”. That's pretty universal, even if people want to do otherwise. For example, for any LFSR PRNG (XorShift-style, Mersenne Twister, etc.), you must not initialize it with an all-zeros state because it'll just stay stuck there. But even states that are mostly zero are often problematic for LFSRs (a phenomenon known as zeroland, and why the C++ folks made seed_seq). If you want to play the “let's do some contrived seeding” game, it's not hard to create a collection of initializations for 100 LFSRs and have them all be 1K away from hitting zeroland. The contrived initializations will all look innocent enough, but they'll all hit this weird drop in hamming weight at the same time.
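(The all-zeros pathology is one line to check; a minimal illustration with Marsaglia's xorshift64, since any F₂-linear step maps zero to zero:)

#include <assert.h>
#include <stdint.h>

/* One step of Marsaglia's xorshift64. Being F2-linear, it maps the
   all-zeros state to itself: a generator seeded with 0 never escapes. */
static uint64_t xorshift64(uint64_t s) {
    s ^= s << 13;
    s ^= s >> 7;
    s ^= s << 17;
    return s;
}

int main(void) {
    uint64_t s = 0;                 /* the forbidden all-zeros state */
    for (int i = 0; i < 100; i++) {
        s = xorshift64(s);
        assert(s == 0);             /* stuck forever */
    }
    return 0;
}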

If you're using a PCG generator that was initialized with reasonable entropy, it's fine. If you want to initialize it with junk like 1, 2, 3, that's actually problematic with any PRNG. And with PCG, 99.9% of the time, even using something kinda junky would be fine. It doesn't have anything like the kind of issues that LFSRs have.

But DXSM is certainly stronger. I think it's better or I wouldn't have made it, but it's worth having some perspective and realizing that in practice users aren't going to run into problems with the classic PCG64 implementation.

@rkern
Member

rkern commented Jun 4, 2020

I'd like to separate out the criticism/defense of PCG64's streams (via the LCG increment) from the present discussion. While there's a certain duality involved due to the mathematics of the LCG, it's not the core issue that was originally brought up here. Also, there's more detail here to be considered than was present in Sebastiano's original critique, your response, or the old mega-thread. Perhaps the connection is more obvious to the experts who have spent more time on the math, but the practical consequences are now clearer to me, at least.

I'd say that the key take-away is that for almost any PRNG, with a bit of time and energy you can contrive pathological seedings.

Granted, but it's not the binary can/can't that drives the decision in front of me. If I draw too many numbers from a finite PRNG, eventually PractRand will suss it out. That binary fact doesn't invalidate that PRNG algorithm. Moving away from that binary and establishing the concept of headroom was one of the things that I really appreciated about the original PCG paper. Given an adversarially-generated pathology, we can take a look at how often that pathology could arise randomly from good entropy-seeding. I want to quantify that, and turn it into practical advice for users.

Given two states that share the lower 58 bits and the same increment (we'll put a pin in that), interleaving PCG64 XSL-RR instances from those states demonstrates practically observable failures in PractRand at around 32 GiB. I think it's reasonable to want to avoid that. So let's take that as our benchmark and look at how often that arises with good entropy-seeding. Fortunately, this adversarial scheme is amenable to probabilistic analysis (not all are so friendly). For n instances, the probability of any 2 sharing the same lower 58 bits is n**2 / 2**58, give or take a factor of 2 for double-counting. So at half a billion instances, odds are good that there is one such pairing that would fail PractRand if interleaved. Half a billion is a lot! In my judgment, we'll probably never see a numpy program that tries to create that many PCG64 instances. numpy would likely be the wrong tool, then.
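(To make the arithmetic reproducible, here is a small C snippet computing the standard birthday approximation p ≈ 1 − exp(−n(n−1)/2⁵⁹) for a few fleet sizes; the 58-bit collision space is the one from the analysis above, and the fleet sizes are arbitrary examples.)

#include <math.h>
#include <stdio.h>

int main(void) {
    const double space = ldexp(1.0, 58);          /* 2^58 low-bit states */
    const double n[] = {1e6, 1e8, 5e8, 1e9};
    for (int i = 0; i < 4; i++) {
        /* birthday approximation: p = 1 - exp(-n(n-1)/(2*space)) */
        double p = 1.0 - exp(-n[i] * (n[i] - 1.0) / (2.0 * space));
        printf("n = %.0e  ->  p(collision) = %.3g\n", n[i], p);
    }
    return 0;
}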

I think it's also reasonable to want to avoid initial states whose subsequent draws will cross any of the lower-58-bit collision states of the other initial states. I'm still trying to think through the logic on that, but I think the length affects the probability linearly rather than quadratically. If I'm right, and I want to draw 16 GiB from each instance (2**28 draws), which is how much we drew from each of the pair that showed PractRand failures, then I can only work with about 2**15 instances, or about 32k, before it becomes quite likely to observe a crossing. That's still quite a lot of instances for Python! And the total amount of data generated is about half a petabyte, which is a lot! But it's on the horizon of practicality, and if I want to keep the probability low, not just below half, I have to go lower on one of those. I'm not particularly concerned by these numbers; I don't think any real numpy programs are likely to run into problems using PCG64 with the XSL-RR output function. But some applications may start to get close (large distributed reinforcement learning runs, for example).

Let's take that increment pin out and address it. I think it's fair to say that with the XSL-RR output function, entropy-seeding the increment in addition to the state does not change this particular analysis. It seems that for any given pair of entropy-seeded increments, there's the same number of practically-colliding states. The concrete procedure for deliberately constructing those states looks more complicated than bashing in the same lower 58 bits, but since the number of colliding states seems to be the same, the probability calculations remain the same. This isn't intrinsic to the PCG scheme in general. The DXSM output function appears to be strong enough that changing the increment (even by a simple +2) is sufficient to resist even the worst-case state for the underlying LCG (when the distance metric gives 0), at least as far as I've bothered to test with PractRand.

I want to end by reiterating what we all seem to be in perfect agreement about: PCG64DXSM is a good idea. If nothing else, its improved statistical properties simplify the mental models that I feel compelled to document, and anything that means I have to write less documentation is good in my book.

@imneme

imneme commented Jun 4, 2020

Streams are still somewhat relevant because the issue only shows up if we have the generators on the same stream.

But under what circumstances would they have the same lower 58 bits and be on the same stream? Is there a use case where this would happen?

The one somewhat realistic case I know of is the one we talked about last year (when we talked about jumped), and that I talked about in the post I linked to earlier.

@vigna
Author

vigna commented Jun 6, 2020

BTW, to give a more tangible measure of how to improve the quality of the multipliers involved, I computed the spectral scores, from f₂ to f₈, of the current 64-bit multiplier used by PCG DXS and of some alternatives.

Spectral scores are the standard way to judge the goodness of a multiplier: 0 is bad, 1 is excellent. Each score describes how well distributed the pairs, triples, 4-tuples, etc. are in the output.

These seven numbers can be summarized by the classic measure, the minimum, or by a weighted measure (the first score, plus the second divided by two, etc., normalized) that makes the first scores more important, as suggested by Knuth in TAoCP. These are the minimum and the weighted measure for the current multiplier:

0xda942042e4dd58b5  0.633  0.778

There are much better 64-bit constants than that:

0xff37f1f758180525  0.761  0.875

If you go to 65 bits, essentially at the same speed (at least, for LCG128Mix it is the same speed), you get a better weighted measure:

0x1d605bbb58c8abbfd  0.761  0.899 	

The reason is that 64-bit multipliers have an intrinsic limit on their f₂ score (≤0.93), and f₂, as noted by Knuth, is the most relevant score:

0xda942042e4dd58b5  0.795
0xff37f1f758180525  0.928
0x1d605bbb58c8abbfd  0.992

So the first multiplier has a mediocre f₂ score. The second multiplier gets very close to the optimum for a 64-bit multiplier. The 65-bit multiplier does not have these limitations and has a score very close to 1, the best possible in general.

For completeness, here are all the scores:

0xda942042e4dd58b5   0.794572 0.809219 0.911528 0.730396 0.678620 0.632688 0.639625
0xff37f1f758180525   0.927764 0.913983 0.828210 0.864840 0.775314 0.761406 0.763689
0x1d605bbb58c8abbfd  0.991889 0.907938 0.830964 0.837980 0.780378 0.797464 0.761493

You can recompute these scores or look for your own multipliers with the code Guy Steele and I distributed: https://github.com/vigna/CPRNG . The better multipliers above are taken from the associated paper.

@tylov

tylov commented Jun 10, 2020

PCG will probably be a fine default PRNG for NumPy, but I don't think it will stand the test of time, as there are more promising, though less tested, ways to do this. I propose one in the following.

The half-chaotic SFC64 is one of the fastest statistically sound generators with a reasonably large minimum period. SFC64 has no jump functions, but it can be extended, without speed overhead, to support 2^63 guaranteed unique streams: simply add a Weyl sequence with a user-chosen additive constant k (which must be odd), instead of just incrementing the counter by one. Each odd k produces a unique full period. It requires an additional 64 bits of state to hold the Weyl constant:

typedef struct {uint64_t a, b, c, w, k;} sfcw64_t; // k = stream

static inline uint64_t sfcw64_next(sfcw64_t* s) {
    enum {LROT = 24, RSHIFT = 11, LSHIFT = 3};
    const uint64_t out = s->a + s->b + (s->w += s->k);
    s->a = s->b ^ (s->b >> RSHIFT);
    s->b = s->c + (s->c << LSHIFT);
    s->c = ((s->c << LROT) | (s->c >> (64 - LROT))) + out;
    return out;
}
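For parallel use, each instance would get its own odd Weyl constant; a hypothetical seeding helper (the name, warm-up count, and seeding scheme are mine, mirroring the tylo64_seed shown later in this thread):

/* Hypothetical seeding helper: derive an odd Weyl constant from a
   stream id so that every stream gets a distinct full-period cycle. */
static inline sfcw64_t sfcw64_seed(uint64_t seed, uint64_t stream) {
    sfcw64_t s = {seed, seed, seed, 0, (stream << 1) | 1}; /* k odd */
    for (int i = 0; i < 12; ++i) sfcw64_next(&s);          /* warm up */
    return s;
}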

A 320-bit state is sometimes undesirable, so I have tried to squeeze it down to 256 bits again. Note the changed output function too, which better utilizes the Weyl sequence for bit mixing. It uses 128/128 bits of chaotic/structured state, which seems to strike a good balance:
/EDIT: removed rotl64() from output func + cleanup, Aug. 6:

typedef struct {uint64_t a, b, w, k;} tylo64_t;

static inline uint64_t tylo64_next(tylo64_t* s) {
    enum {LROT = 24, RSHIFT = 11, LSHIFT = 3};
    const uint64_t b = s->b, out = s->a ^ (s->w += s->k);
    s->a = (b + (b << LSHIFT)) ^ (b >> RSHIFT);
    s->b = ((b << LROT) | (b >> (64 - LROT))) + out;
    return out;
}

This has currently passed 4 TB of PractRand testing without anomalies, and I briefly ran Vigna's Hamming-weight test without issues so far (although passing these tests is no guarantee of near-true-random output; rather, they test whether the PRNG is flawed or not).

Note: it is supposedly a statistical advantage to use a (unique) random Weyl constant with roughly 50% of the bits set, but only further testing or analysis will reveal how significant this is.

/Edits: cleanups.

@charris
Member

charris commented Jun 10, 2020

@tylo-work SFC64 is already in NumPy, along with Philox; this issue is about the default generator.

@tylov

tylov commented Jun 10, 2020

Ok, I didn't know exactly which generators were implemented, so this is only about selecting the most suitable overall from those? Fair enough, and thanks for clarifying.

I will try to test my proposed generator extensively to see how it stacks up against the others; so far it looks very good regarding speed, output quality, simplicity/size/portability, and massive parallel usage. But I would be happy if others tested it as well.

@bashtage
Contributor

bashtage commented Jun 10, 2020

The version in NumPy is the standard variant, which has k=1.

static NPY_INLINE uint64_t sfc64_next(uint64_t *s) {
    const uint64_t tmp = s[0] + s[1] + s[3]++;
    s[0] = s[1] ^ (s[1] >> 11);
    s[1] = s[2] + (s[2] << 3);
    s[2] = rotl(s[2], 24) + tmp;
    return tmp;
}

@rkern
Member

rkern commented Jun 10, 2020

I don't think we're reopening the discussion about the default PRNG from scratch. We have a very specific issue with our current PRNG and are looking at available, closely-related variants that address that specific issue. One of our concerns is that the current default PRNG exposes certain features of the PRNG, like jumpability, that the variant that replaces it must still expose. SFC64 (either ours or yours) doesn't have that feature.

It's possible that @bashtage would be willing to accept a PR for randomgen to add your Weyl-stream variants of SFC64.

@charris
Member

charris commented Jun 10, 2020

@tylo-work If you are interested in parallel execution you might want to look at NumPy's SeedSequence implementation.

@vigna
Author

vigna commented Jun 11, 2020

I don't think we're reopening the discussion about the default PRNG from scratch. We have a very specific issue with our current PRNG and are looking at available, closely-related variants that address that specific issue.

Assuming you want something PCG-DXS-like, there are further improvements you can make with just better constants (and a very marginal slowdown). For example, PCG-DXS will soon fail two distinct types of tests on two interleaved, correlated subsequences with the same lower 112 bits of state:

rng=PCGDXS_int112, seed=0x4d198651
length= 128 gigabytes (2^37 bytes), time= 5700 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/64]TMFn(0+2):wl             R= +57.3  p~=   2e-27     FAIL !!
  [Low8/64]FPF-14+6/64:(1,14-0)     R= +17.5  p =  8.0e-16    FAIL
  [other failures in the same tests]
  ...and 1893 test result(s) without anomalies

Note that we're talking about just ≈65536 correlated sequences—nothing to be afraid of.

But you can improve the generator by choosing a better multiplier, such as 0x1d605bbb58c8abbfd, and a better mixer, such as 0x9e3779b97f4a7c15. The first number is a 65-bit multiplier with much better spectral scores. The second number is just the golden ratio in a 64-bit fixed-point representation, which is known to have nice mixing properties (see Knuth's TAoCP on multiplicative hashing); for example, it is used by the Eclipse Collections library to mix hash codes.

As a result, you fail just FPF for the same amount of data:

rng=PCG65-DXSϕ_int112, seed=0x4d198651
length= 128 gigabytes (2^37 bytes), time= 5014 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low8/64]FPF-14+6/64:(0,14-0)     R= +16.1  p =  1.5e-14    FAIL
  [other failures in the same test]
  ...and 1892 test result(s) without anomalies

In fact, if we go further, at 2TB PCG-DXS fails three types of tests on the same interleaved, correlated subsequences:

rng=PCGDXS_int112, seed=0x4d198651
length= 2 terabytes (2^41 bytes), time= 53962 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/32]TMFn(0+0):wl             R= +50.2  p~=   4e-23     FAIL !!
  [Low8/64]FPF-14+6/64:(1,14-0)     R=+291.1  p =  4.7e-269   FAIL !!!!!!
  [Low8/64]Gap-16:B                 R= +19.5  p =  1.4e-16    FAIL !
  [other failures in the same tests]
  ...and 2153 test result(s) without anomalies

whereas PCG65-DXSϕ still fails just FPF:

rng=PCGDXS65ϕ_int112, seed=0x4d198651
length= 2 terabytes (2^41 bytes), time= 55280 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low8/64]FPF-14+6/64:(0,14-0)     R=+232.1  p =  2.0e-214   FAIL !!!!!!
  [other failures in the same test]
  ...and 2153 test result(s) without anomalies

Sooner or later, of course, PCG65-DXSϕ will also fail Gap and TMFn. But you need to see much more output than with PCG-DXS.

This is the complete code for PCG65-DXSϕ, which is just PCG-DXS with better constants:

#include <stdint.h>

__uint128_t x; // State
uint64_t c; // Additive constant (an odd "stream" constant, as in PCG)

static inline uint64_t output(__uint128_t internal) {
    uint64_t hi = internal >> 64;
    uint64_t lo = internal;

    lo |= 1;                   // force the low half odd
    hi ^= hi >> 32;
    hi *= 0x9e3779b97f4a7c15;  // golden-ratio mixing constant
    hi ^= hi >> 48;
    hi *= lo;                  // data-dependent multiply by the low half
    return hi;
}

static inline uint64_t next(void) {
    __uint128_t old_x = x;
    // 65-bit multiplier 0x1d605bbb58c8abbfd, written as 2^64 ^ (low 64 bits)
    x = x * ((__uint128_t)1 << 64 ^ 0xd605bbb58c8abbfd) + c;
    return output(old_x);
}

The marginal slowdown is due to an extra add instruction (caused by the 65-bit multiplier) and to having two 64-bit constants to load.

I'm not endorsing generators of this kind in general, but PCG65-DXSϕ is measurably better than PCG-DXS at hiding correlation.

@tylov

tylov commented Jun 11, 2020

@vigna, FYI, I also did some interleaving testing, and noticed that xoshiro256** failed when creating 128 or more interleaved streams; with 256 it failed quickly. The point of the test is to check how well the PRNGs behave when each stream is initialized with some linear dependencies. Essentially, the state is initialized to s[0]=s[1]=s[2]=s[3] = k1 + stream*k2. Then 12 outputs are skipped, which is basically how sfc64 is initialized.

I realize this is not the recommended initialization for xoshiro, but it is still interesting - and a little worrying - that the tests seemed fine for xoshiro with few interleaved streams, but failed with many.

seed: 1591888413
RNG_test using PractRand version 0.95
RNG = RNG_stdin64, seed = unknown
test set = core, folding = standard (64 bit)
...
rng=RNG_stdin64, seed=unknown
length= 2 gigabytes (2^31 bytes), time= 29.6 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/64]FPF-14+6/16:(1,14-1)     R=  +7.2  p =  3.7e-6   unusual
  [Low1/64]FPF-14+6/16:all          R=  +9.6  p =  1.8e-8   very suspicious
  ...and 261 test result(s) without anomalies

rng=RNG_stdin64, seed=unknown
length= 4 gigabytes (2^32 bytes), time= 55.5 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/64]FPF-14+6/16:(0,14-0)     R= +13.4  p =  4.7e-12   VERY SUSPICIOUS
  [Low1/64]FPF-14+6/16:(1,14-0)     R=  +9.4  p =  2.6e-8   suspicious
  [Low1/64]FPF-14+6/16:(2,14-1)     R=  +7.7  p =  1.3e-6   unusual
  [Low1/64]FPF-14+6/16:all          R= +17.4  p =  8.8e-16    FAIL !
  ...and 275 test result(s) without anomalies

I also tried weakening the initialization for SFC64 and TYLO64 to skip only 2 outputs; however, they still seemed OK.
On performance: xoshiro256** runs 33% slower on my machine than the other two. TYLO64 only updates 192 bits of state vars (the Weyl constant k is never modified). Here is the test program:

int main()
{
    //FILE* f = freopen(NULL, "wb", stdout);  // Only necessary on Windows, but harmless.
    enum {THREADS = 256};
    uint64_t seed = 1591888413; // <- e.g. this fails. // (uint64_t) time(NULL);
    fprintf(stderr, "seed: %lu\n", seed);

    static tylo64_t tyl[THREADS];
    static sfc64_t sfc[THREADS];
    static uint64_t xo[THREADS][4];

    for (size_t i = 0; i < THREADS; ++i) {
        tyl[i] = tylo64_seed(seed + (12839732 * i), 19287319823 * i);
        sfc[i] = sfc64_seed(seed + (12839732 * i));
        xo[i][0] = xo[i][1] = xo[i][2] = xo[i][3] = seed + (12839732 * i);
        for (int j = 0; j < 12; ++j) xoshiro256starstar_rand(xo[i]);
    }
    static uint64_t buffer[THREADS];
    size_t n = 1024 * 1024 * 256 / THREADS;

    while (1/*n--*/) {
        for (int i = 0; i < THREADS; ++i) {
            //buffer[i] = tylo64_rand(&tyl[i]);
            //buffer[i] = sfc64_rand(&sfc[i]);
            buffer[i] = xoshiro256starstar_rand(xo[i]);
        }
        fwrite((void*) buffer, sizeof(buffer[0]), THREADS, stdout);
    }
    return 0;
}

I'll include some relevant header code:

typedef struct {uint64_t a, b, w, k;} tylo64_t; // k = stream

static inline uint64_t tylo64_rand(tylo64_t* s) {
    enum {LROT = 24, RSHIFT = 11, LSHIFT = 3};
    const uint64_t b = s->b, w = s->w, out = (s->a + w) ^ (s->w += s->k);
    s->a = (b + (b << LSHIFT)) ^ (b >> RSHIFT);
    s->b = ((b << LROT) | (b >> (64 - LROT))) + out;
    return out;
}

/* stream in range [0, 2^63) */
static inline tylo64_t tylo64_seed(const uint64_t seed, const uint64_t stream) {
    tylo64_t state = {seed, seed, seed, (stream << 1) | 1};
    for (int i = 0; i < 12; ++i) tylo64_rand(&state);
    return state;
}

static inline uint64_t rotl(const uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}
static inline uint64_t xoshiro256starstar_rand(uint64_t* s) {
    const uint64_t result = rotl(s[1] * 5, 7) * 9;
    const uint64_t t = s[1] << 17;
    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];
    s[2] ^= t;
    s[3] = rotl(s[3], 45);
    return result;
}

@rkern
Member

rkern commented Jun 11, 2020

@tylo-work I appreciate the analysis, but I really need this issue to stay focused. If you would like to continue that line of discussion, I encourage you to post your work in your own Github repo, and make one more post here inviting people here to it. Everyone else, please respond there. Thank you for your cooperation.

@charris
Member

charris commented Jun 16, 2020

@imneme @rkern Time is running out for the 1.19 release.

@charris
Member

charris commented Jun 19, 2020

@rkern Looks like PCG64DXSM won't make it into 1.19.0; I'll be releasing this weekend. If you could write the note about our change policy/upcoming changes that you mentioned above, I would appreciate it.

@charris charris added this to the 1.20.0 release milestone Jun 19, 2020
@imneme

imneme commented Jun 19, 2020

Sorry, I've been dealing with some other unrelated matters. Based on our discussion, I don't think a small delay is a big issue, since PCG64DXSM was planned as an alternate option, not a new default (for now, at least).

@bashtage
Contributor

bashtage commented Nov 1, 2020

Now that 1.20 is starting up, is it time to revisit this and move to DXSM?

@seberg
Member

seberg commented Nov 13, 2020

We would still have some time to do the move before branching, but it may be good to get a start on it within the next week or so. @bashtage I guess you have the PCG64DXSM ready to go and this mainly needs the decision to flip the switch on the default stream?

From what I see, it sounds like we should just do this for 1.20 if we have it readily available.

@charris
Member

charris commented Nov 13, 2020

IIRC, we have been waiting for a reference that could be linked. But if the random number folks are happy with the change, we should use it. Do we need any special code for Windows?

@bashtage
Contributor

It is just a different constant and a different scrambling function. Nothing more novel than what @rkern wrote for the original PCG64 implementation on Windows. I think the decision was to have a fully standalone PCG64DXSM rather than to share some code (for performance).

@bashtage
Contributor

It would probably make sense to start from rkern's WIP branch.

@imneme

imneme commented Nov 13, 2020

I said I would write a blog post about it, which I think @rkern wanted, but I've been attending to some other matters and it hasn't happened yet (sorry). In the meantime, the DXSM permutation has been grinding away under test and continues to look like an improvement over the original. From remarks earlier in the thread, I think @rkern might have liked an even stronger output permutation, but doing so either costs you speed or (if you cut corners to gain speed) adds trivial predictability.

@charris
Member

charris commented Apr 30, 2021

@rkern, 1.21.0 will branch in mid-May; it would be nice if the PCG64DXSM variant was in it.

@rkern
Member

rkern commented May 4, 2021

@charris I will work on it this weekend.
