Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: randomgen #13163

Merged
merged 139 commits into from May 28, 2019
Merged

ENH: randomgen #13163

merged 139 commits into from May 28, 2019

Conversation

mattip
Copy link
Member

@mattip mattip commented Mar 20, 2019

A start at mMerging bashtage/randomgen into numpy, as part of NEP 19.

The original repo was cloned, moved to a subdirectory, and then merged into numpy, as documented in _randomgen/README-git.md.
Then I moved the code into numpy/random and the docs into doc/source/random and doc/source/papers.

Still very much a work in progress.

@mattip
Copy link
Member Author

mattip commented Mar 20, 2019

So far I have begun to extend the setup.py to build the modules, but our cythonize script is very primitive.

@rgommers
Copy link
Member

License related question: if we list this under "bundled libraries" with its own license, does it mean that randomgen will continue to be developed standalone? I thought it was meant solely for inclusion in NumPy.

@rkern
Copy link
Member

rkern commented Mar 20, 2019

It's intended for incorporation. randomgen as a separate package will be frozen after this gets merged.

@rgommers
Copy link
Member

It's intended for incorporation. randomgen as a separate package will be frozen after this gets merged.

Nice. If that NCSA license is absolutely necessary then I guess it's no problem, however for something developed for inclusion in NumPy and only being a part of NumPy going forward, just having it under the NumPy license would be preferred.

@mattip
Copy link
Member Author

mattip commented Mar 21, 2019

I had to move pcg32.pyx, pcg64.pyx, and the examples temporarily to randomgen_ignore.

The pcg cython files use compiler directives, which we will need to refactor since the source files generated will differ by platform, preventing distributing a single source-tarball.

The examples use from randomgen import ... or so, and our cythonize script runs from the directory containing the pyx file, so it cannot import that way.

@mattip
Copy link
Member Author

mattip commented Mar 21, 2019

@bashtage, @rkern: I have imported randomgen in a way that preserves all the commits from that repo, assuming we wish to keep the commit history. Is that OK? Are there things we should git rebase to collapse into a single commit?

@bashtage
Copy link
Contributor

@mattip I would think that it should be squashed at least some eventually. When picking, I think it would be nice to keep at least one commit from randomgen contributors in, is possible.

@mattip
Copy link
Member Author

mattip commented Mar 21, 2019

The current failures are due to cython using its own numpy.pxd rather than the one we ship along-side mtrand.pyx(since that is in a different subdir) which has the line cdef extern from "numpy/npy_no_deprecated_api.h": pass. Choices are:

@bashtage
Copy link
Contributor

I think a clear official #12284 would be the best path since it would resolve this ambiguity once and for all.

@mattip
Copy link
Member Author

mattip commented Mar 21, 2019

in d079759 I took the third path, which it turns out is necessary anyway to merge this work with #12284. It was enough on my machine to quiet all the warnings and build the c-extension modules via cython.

@mattip
Copy link
Member Author

mattip commented Mar 23, 2019

Can we add a FIXME here ...

@rkern please feel free to push to my branch, I would be happy for collaboration.

@charris
Copy link
Member

charris commented Mar 23, 2019

When picking, I think it would be nice to keep at least one commit from randomgen contributors in, is possible.

Could maybe add a Thanks file or similar listing all the contributors. Should be easy to generate with git.

@mattip
Copy link
Member Author

mattip commented Mar 23, 2019

The code is sourced from places other than git. For instance, this file has contributors that do not appear in the git history, and has a different license than the rest of randomgen

@bashtage
Copy link
Contributor

Many (all) of the C source files for the BRNGs have no git history since they were made available from the authors directly.

@bashtage
Copy link
Contributor

@rkern 's idea is clearly the right way to do compat in PCG64. I once tried to do it but gave up since the upside for me was small in randomgen. I found it a bit tricky since Cython understands uint128 and so using a fake type wasn't so simple. At one point I thought some small abstraction beyond a simple #ifdef was needed before I stopped trying.

@mattip
Copy link
Member Author

mattip commented Mar 23, 2019

which platforms do not support __unit128?

Does the strategy of breaking the single 128bit int into two 64 bit integers with high and low support both big- and little- endian systems transparently?

@bashtage
Copy link
Contributor

Microsoft compilers (of course).

@bashtage
Copy link
Contributor

I looked through it and I believe that it is endian independent (and it passed on ARM, but I'm not sure whether this is BE or LE, and it might depend). @rkern wrote it, so he would know better.

@mattip
Copy link
Member Author

mattip commented Mar 24, 2019

Tests that come with randomgen now pass.
I think next I should try to move randongen into the numpy.random namespace and drop mtrand. It seems

numpy.random.RandomState = randomgen.legacy.LegacyGenerator
numpy.random.mtrand = randomgen.legacy.LegacyGenerator()
for x in dir(mtrand):
    if x[0] != '_':
        locals()[x] = getattr(mtrand, x)

answers the demands of the NEP (selectively quoted below) except for the missing set_state and get_state. LegacyGenerator().__getstate__ returns a dictionary, where numpy.random.get_state returns a list. I could make that work by coercing the dict to a list and visa-versa for set_state.

Am I on the right track?

First, we will maintain API source compatibility just as we do with the rest of numpy. If we must make a breaking change, we will only do so with an appropriate deprecation period and warnings.
Second, breaking stream-compatibility in order to introduce new features or improve performance will be allowed with caution. ...
[The] legacy distributions class MUST be accessible under the name numpy.random.RandomState for backwards compatibility. All current ways of instantiating numpy.random.RandomState with a given state should instantiate the Mersenne Twister basic RNG with the same state. ... Instances of the legacy distributions class MUST respond True to isinstance(rg, numpy.random.RandomState) because there is current utility code that relies on that check. Similarly, old pickles of numpy.random.RandomState instances MUST unpickle correctly.
Specifically, the initial release of the new PRNG subsystem SHALL leave these (numpy.random.*) convenience functions as aliases to the methods on a global RandomState that is initialized with a Mersenne Twister basic RNG object. A call to numpy.random.seed() will be forwarded to that basic RNG object. In addition, the global RandomState instance MUST be accessible in this initial release by the name numpy.random.mtrand.

@bashtage
Copy link
Contributor

I used a property for state which seems more natural than a java-ish set/get paradigm. I also with with a dictionary which is much easier to handle in case of breaks since nothing in the dictionary is really a committment as long as future versions use different keys.

One or both of these should possibly be overruled.

Does this the method you have above produce an identical dir as current random? I can't remember what I did with ancient functions names like ranf that are only aliases

@mattip
Copy link
Member Author

mattip commented Mar 24, 2019

The aliases are set in __init__.py. The only missing attributes from dir(randomgen.legacy.LegacyGenerator()) are set_state and get_state.

@bashtage
Copy link
Contributor

One further issue about using the same BRNG between legacy and a modern generator -- the legacy state has an additional value containing the next gaussian, since it uses a polar transformation. This might have some implications for setting the state if the basic RNG is shared.

@mattip
Copy link
Member Author

mattip commented Mar 25, 2019

I am getting an non-terminating loop in random_zipf from distributions.c for certain conditions in tests

I modified the code to print the values in each loop, details are below. I am not sure of how to provide a better stopping condition when a == 2.0, X == -sys.maxint - 1, V is very large.

code ``` int64_t random_zipf(brng_t *brng_state, double a) { double T, U, V; int64_t X; double am1, b;

am1 = a - 1.0;
b = pow(2.0, am1);
do {
U = 1.0 - next_double(brng_state);
V = next_double(brng_state);
X = (int64_t)floor(pow(U, -1.0 / am1));
/* The real result may be above what can be represented in a int64.
* It will get casted to -sys.maxint-1. Since this is
* a straightforward rejection algorithm, we can just reject this value
* in the rejection condition below. This function then models a Zipf
* distribution truncated to sys.maxint.
*/
T = pow(1.0 + 1.0 / X, am1);
fprintf(stdout, "am1 %10.3g T %10.3g U %10.3g V %10.3g X %ld\n", T, U, V, X);
} while (((V * X * (T - 1.0) / (b - 1.0)) > (T / b)) || X < 1);
return X;
}

which prints

am1 1 T 0.188 U 0.518 V 1.8e+308 X -9223372036854775808
am1 1 T 0.524 U 0.319 V 1.8e+308 X -9223372036854775808
am1 1 T 0.638 U 0.816 V 1.8e+308 X -9223372036854775808
am1 1 T 0.867 U 0.56 V 1.8e+308 X -9223372036854775808
am1 1 T 0.841 U 0.822 V 1.8e+308 X -9223372036854775808
am1 1 T 0.525 U 0.189 V 1.8e+308 X -9223372036854775808
am1 1 T 0.735 U 0.15 V 1.8e+308 X -9223372036854775808
am1 1 T 0.839 U 0.546 V 1.8e+308 X -9223372036854775808
am1 1 T 0.582 U 0.753 V 1.8e+308 X -9223372036854775808
am1 1 T 0.0421 U 0.854 V 1.8e+308 X -9223372036854775808
am1 1 T 0.0792 U 0.67 V 1.8e+308 X -9223372036854775808
am1 1 T 0.31 U 0.436 V 1.8e+308 X -9223372036854775808
am1 1 T 0.951 U 0.889 V 1.8e+308 X -9223372036854775808
am1 1 T 0.862 U 0.641 V 1.8e+308 X -9223372036854775808
am1 1 T 0.967 U 0.285 V 1.8e+308 X -9223372036854775808
am1 1 T 0.0981 U 0.865 V 1.8e+308 X -9223372036854775808
am1 1 T 0.444 U 0.884 V 1.8e+308 X -9223372036854775808
am1 1 T 0.73 U 0.122 V 1.8e+308 X -9223372036854775808
...

@mattip
Copy link
Member Author

mattip commented Mar 25, 2019

It helps to debug the debugging code. The printf statement was off-by-one, fixing it shows a==NAN.

bashtage and others added 5 commits May 27, 2019 22:58
Remove traces of the three removed bit generators
Add lock to Cython examples
Attempt to avoid defining variables that are incorrect for some platforms
Pep8 fixes
Remove unused imports
Fix name error
* PERF: Reorder header for philox

Reorder header so that support uint128 is always used if avilable, irrespective of platform
@mattip
Copy link
Member Author

mattip commented May 27, 2019

hmm, getting a FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. It seems there is bitrot when merging with master

@mattip
Copy link
Member Author

mattip commented May 28, 2019

tests are passing

@seberg
Copy link
Member

seberg commented May 28, 2019

I would like to thank everyone for making this milestone possible. Especially Robert and Kevin who made all of randomgen possible. I am also very happy and impressed to see so many specialists weighing in!
Probably, there will be a few follow-up issues, but I trust that this is very high quality work and we have discussed the exact API in depth so that I believe it is hashed out very well.

I am looking forward to using the new API, thanks all!

@seberg seberg merged commit 22239d1 into numpy:master May 28, 2019
@bashtage
Copy link
Contributor

Great. How long does it take for this to make it out to projects that test using prerelease builds of NumPy? As fun as the API discussion is about Generator, my main concern is verifying that I didn't accidentally break something in RandomState.

@seberg
Copy link
Member

seberg commented May 28, 2019

Indeed, I do think we have a few projects testing against master so that should be fairly quick. The big test will be the rc releases probably.

@WarrenWeckesser
Copy link
Member

Would anyone object to a pull request that renames distributions-boxmuller.c in numpy/random/src/legacy/ to distributions-legacy.c? Or would that be needless churn? The file contains all the legacy distributions, not just the normal distribution.

When I first saw the file name, I didn't realize that it was the file I was looking for. It only took a quick look to figure it out, so this isn't a high priority, but I think a more accurate name would be helpful.

@bashtage
Copy link
Contributor

bashtage commented Jun 9, 2019

I switched in randomgen after @mattip started the process with legacy-distributions.c and .h

https://github.com/bashtage/randomgen/tree/master/randomgen/src/legacy

I also moved them to src/legacy which I'm not sure is totally necessary, so I'm +1 on the clarifications. IMO a git mv is minimal churn.

Perhaps a comment at the top would be helpful to explain that this is the file where hanged fucntions from distributions.c go to maintain the RandomState promise would be helpful too.

@WarrenWeckesser
Copy link
Member

OK, pull request is #13743

@mattip
Copy link
Member Author

mattip commented Jun 10, 2019

hanged fucntions from distributions.c

@rkern could you rephrase that request in the PR? Do you mean frozen versions of the functions?

@WarrenWeckesser
Copy link
Member

@mattip, it was @bashtage who made that comment.

@bashtage
Copy link
Contributor

Do you mean frozen versions of the functions?

Yes. I mean the canonical version that must be used in RandomState. legacy-distributions.c is the file where the canonical version should be moved to when the generator for a specific distribution in distributions.c is improved.

Does their energy intake deviate systematically from the recommended
value of 7725 kJ?

We ha
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the actual output not shown here? Is it because the indeterminate value causes problems with doctests?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet