Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework random generator to use LCG #928

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

zvirja
Copy link
Member

@zvirja zvirja commented Nov 19, 2017

Fixes #789.

Current random generator suffers from a few issues which I tried to address in this PR:

  • it uses the !generated.Contains(newRnd) statement to guarantee the uniqueness of the produced values. This approach could lead to significant delays when latest elements in a sequence are generated - we should find those "remaining" numbers using random 😕
  • it leads to OOM exceptions when you create large number of items, because underlying collection grows too much.

To address those issues I used the Linear Congruential Generator which generates the unique values if constants match the specific criteria. It has very small memory footprint and offers a good generation performance. I used constants from this document, where they mentioned that values are pretty good. Also I've tested all the constants till 2^33 and proven that they indeed produce a unique sequence (there were a few wrong constants, so I fixed them). I haven't tested the constants above because my PC has no enough power for that (RAM and CPU cores).

The issue with LCG is that it always produces the same sequence for the given constants. To overcome that I used a few techniques:

  • I use the system random to generate a seed - the starting number;
  • we have a few generator constants for the same value range and we randomly pick the constant to use.

Those approaches allows to make the produced sequence even more random and non-repeating.

A few things we should to be aware of:

  • Under the hood generator uses the hardcoded constants, meaning that random value range is fixed (this.m-1). To adapt to the required range I ignore random values above the range (see a cycle in the GenerateNextValue() method). In average we ignore a half of the generated numbers as we have constants for each power of 2. So far I haven't noticed any issues with that during testing.
  • I haven't tested the distribution and randomless quality of the hardcoded constants (using SmallCrush, Diehard and or tests). I believe that it shouldn't be an issue, as we are not a crypt tool and the previous approach didn't guarantee any distribution neither. I reviewed values visually and they look pretty random 😅
  • I used Debug.Assert() a few times - that condition is never expected to happen logically, however left them for documentation purpose and as a sanity check.

No tests were modified (you might see a refactoring only).

Additionally I've referenced the System.ValueTuple package in the AutoFixture package, so we could use the tuples from C# 7. That shouldn't be an issue as we are still building v4. Also that could enable us to support tuples generation by AutoFixture if needed.

@moodmosaic Please take a look and let me know your opinion. I've spent a few weeks working on this, but it could happen that I overlooked something 😉

@zvirja zvirja added this to the 4.0 RTM milestone Nov 19, 2017
Copy link
Member

@moodmosaic moodmosaic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue with LCG is that it always produces the same sequence for the given constants.

This is NOT an issue; that's how PRNGs are supposed to work 😉

  • given the same seed
  • you get back the same result
  • — so how do you have PRNGs generate random numbers, then?
  • you make sure to supply a totally random seed
  • and one way of implementing this is by splitting the seed each time you need to generate a number

I have A LOT feedback to provide here, as my experience from years with AutoFixture, and the issues I've seen by using FsCheck for years, are more or less the result of finding myself working in Hedgehog.

It really depends on you, @zvirja, which direction you'd like to take:

  • Keep around the existing random number generation (as well as the existing way of dealing with ranges (Add support for string based properties decorated with the RangeAttribute #919)), which is fine—after all, these are already there, from version 3—though, we'd have to be conservative with bugfixes, as we can easily end up putting a lipstick on a pig...
  • Fix the random number generation by decoupling the 'range' concept (constant, linear, exponential) from the PRNG itself, and use some reasonable seed, preferably SplitMix.

It's exciting that you took the time to tackle this very issue 👍 — I look forward to hearing your decision.

@zvirja
Copy link
Member Author

zvirja commented Nov 21, 2017

we'd have to be conservative with bugfixes, as we can easily end up putting a lipstick on a pig...

Well, this is approximately what I'm trying to do in this PR 😄 I don't change the overall architecture and
simply change the implementation to require much less memory and provide results with less delay.

It really depends on you, @zvirja, which direction you'd like to take:

Well, actually it's pretty hard for me to decide 😕 The idea of total rework looks attractive, however, I feel that it's probably better to keep the existing approach and simply improve it internally for now. That would decrease the amount of changes we have in v4, as we already applied a set of them. Also that will allow us to save some time and finally release v4 RTM. In v5 when we have more time, we could rework that and probably a few other things.

I feel a high level of uncertainty with this decision, however it looks like a good plan for now.

Does that make sense to you?

@moodmosaic
Copy link
Member

moodmosaic commented Nov 23, 2017

I understand that you threw a lot of effort on this, and it is highly appriciated 🎉


However,

  • Is it a simple change?
  • Is this set of changes warranted?

Why choose LCG over others, battle-tested ones, like

I understand that the current implementation suffers from mediocre performance, however,

  • we haven't run statistical tests (e.g. diehard) between the current and the proposed implementation.

That being said, it might be just as easy as

  • trying to 'fix/optimize' the current one, without throwing in a whole bunch of changes.
  • —or—by moving LCG to its own class, and then change the Fixture instance to use that one instead.

@zvirja
Copy link
Member Author

zvirja commented Nov 23, 2017

Why choose LCG over others, battle-tested ones, like

I'm not very strong in this particular area, therefore I relied on the Google. I didn't want to change the behavior of the existing generator, and the current one has a full cycle (no repetitions within the range). From what I got we have either LCG or LFSR which could produce such sequences. I ended up with the first one as it was documented to provide a bit better quality. Could the "TFGen" or "SplitMix" also produce non-repeating values withing the specified range? Otherwise, that change is breaking and it was not my intention.

we haven't run statistical tests (e.g. diehard) between the current and the proposed implementation.

Yep, it's the whole point. While we haven't tested, I 99% sure that the previous generator implementation suffered from the distribution issues due to the nature of its implementation. Therefore, when I decided to replace it with LCG, I assumed that quality is not the strong point here. Rather, we expect have something a bit random and that's enough. The LCG provided me with such values, so ti looked like a good replacement even if I haven't tested that function has good enough quality. The more important thing is that it preserves the full-cycle semantics as that is asserted in our unit tests.


That being said, it might be just as easy as

I don't see too much value in persisting the previous implementation as a new one provides the exactly same output. The observer will not see any difference after this change. The only visible difference might be the changed distribution of numbers, but our tool is not designed to generate random of the perfect quality, so I'm keeping the status quo.

That is how I see the situation and that's why I suggested to proceed with current non-optimal architecture and re-design it later in v5 in a more architectural way if we will.


Probably, from your experience my changes look like a vandalism or something totally amateurish and that might be so. However I didn't pursue a goal to build a generator with a good distribution. Rather I simply wanted to have:

  • full-cycle generator
  • which doesn't suffer from current issues
  • and generates pseudo-random sequence of "non-obvious" quality 😄.

From my perspective this PR achieved that goal.

If you see that my goal is not achieved or you see the issues I've introduced - please share your feedback, so we are aligned. I'd be really happy to learn this area and see why my proposal is not acceptable (if that is so), despite the fact that I've already invested a lot of resources into that.

@moodmosaic
Copy link
Member

You did a massive work here, and I think we can use this in place of the RandomNumericSequenceGenerator in Fixture.

Would you mind putting this in a new file named as LinearCongruentialNumberGenerator or LinearCongruentialNumericGenerator if we want to be consistent?

It's just that it feels like we're putting a lipstick on a pig trying to fix the existing one is all 😅


That's the same strategy we used with all sorts of new generators, we never deleted or modified the old ones, and instead we've added new ones. (SOLID can be append-only, too.)

@zvirja
Copy link
Member Author

zvirja commented Nov 30, 2017

The more I think about this PR and review the stuff, the more internal pain I start to feel 😅

My initial intent was to fix the current generator issues, without changing the observable behavior. I assumed that our clients will not notice any difference after the change, while internally we win performance. In fact, the public API wasn't changed, none of the test failed - I simply adjusted the implementation detail, which is safe to do even in bug fix releases.

Previously, you mentioned and now I that clearly see, that current implementation is.. not the best one. I also don't like the idea that RandomNumericSequenceGenerator has too much responsibilities both in implementation and usage. Because of that class contract is very fuzzy. For instance, if you request sbyte elements, you'll quickly start to see negative numbers due to overflow. As for the int type you would probably never get the negative numbers. This builder blows my mind each time I see it, because it pretends that it offers full cycle, however that will be only if you request "types" large enough to hold those values. I also don't see any reason in that contract when we offer the full cycle for sporadic member value requests. And so on 😧😟😢

I was fine to keep that architecture as-is and simply change the implementation. However, if you decline that idea and the only option is to add a new builder, that's another story. I don't want to add a shiny new builder which will be poisoned by the bad architecture from the beginning. We'll need to keep it in our code for a few major versions even if we have a good replacement. In that light, I'd vote to close this PR (keeping the implementation in branch to re-use it later) and do nothing for now. Later we'll design the better architecture (as you initially suggested) where we'll fix all the drawbacks and split responsibilities, so this issue will be automatically fixed. We can develop that in minor updates, as that will not lead to the breaking changes.

Therefore, let's see your final voice. If you see reasons why we cannot alter implementation of the RandomNumericSequenceGenerator, then let's close the PR - I don't want to spread this chaos across the project 😄

@moodmosaic
Copy link
Member

Yes, you probably didn't see any tests failing as we haven't done any statistical testing on what we currently have.

@zvirja
Copy link
Member Author

zvirja commented Dec 1, 2017

Yep, that is probably the only thing that could be affected. I don't know the project background well to know whether we actually promised the distribution quality to our clients. I was under the impression, that we didn't (as we don't have tests for that) so it's safe to change it.

Therefore, I keep it up to you to decide whether current change is acceptable or not, as you are more experienced on this area than me 😉

@moodmosaic
Copy link
Member

I'd rather keep your implementation in a new file as LcgNumericSequenceGenerator, and then change the Fixture class to use this one instead.

@moodmosaic
Copy link
Member

If you do that, I have a few comments for your implementation, but you'd have to move it to a new file first.

@zvirja
Copy link
Member Author

zvirja commented Dec 12, 2017

@moodmosaic In this case let's follow the other option you suggested - decouple random generator (and the range concept) from the RandomNumericSequenceGenerator, that is used to serve number requests.

I suggest to create the following interface in the Random namespace:

interface IRandomGenerator
{
    long GetNextValue();
}

and also probably:

interface IRangedRandomGenerator: IRandomGenerator
{
    (long value, (long min, long max) range) GetNextValueAndRange();
}

The latter one could be used by RandomRangedNumberGenerator, so we don't hardcode the range and use the default one.

With this approach we could:

  • implement different generators (i.e. Random based, LCG, exponential, etc);
  • easily customize fixture to use different random generator, as we simply need to change the strategy for the existing builder;
  • use generators directly when we need random values (e.g. by ElementsBuilder), rather than RandomNumericSequenceGenerator that provides less obvious API.

Is that design fine to you, or you see some drawbacks with it? Is that what you previously meant?

@moodmosaic
Copy link
Member

@zvirja, I'm not sure we need to expose (and maintain) interfaces right now. After all,

"clients [...] own the abstract interfaces"
-- Agile Principles, Patterns, and Practices, ch. 11

I think, all we really need is two functions, like those in here and a seed like this one.

The previous implementation sufferred from a few major issues.
The first issue is that it failed with OOM if large number
of numeric requests are issued. Another issue is that
generation time of the latest elements in a range
could take large amount of time, as it works in a loop.

The new implementation uses the Linear Congruential Generator
approach where the properties of prime numbers are used to
generate non-repeating sequence on a particular range. While
we are still generating numbers in a loop ignoring out-of-range
values, the predictability of this approach is much better
because of the values distribution. If distribution were uniform,
we'd ignore at most half of the generated numbers. In reality
the distibution is worse, however still good enough to rely on it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RandomNumericSequenceGenerator fails with OutOfMemory exception
3 participants