Rework random generator to use LCG #928
Conversation
> The issue with LCG is that it always produces the same sequence for the given constants.

This is NOT an issue; that's how PRNGs are supposed to work 😉

- given the same seed
- you get back the same result
- so how do you have PRNGs generate random numbers, then?
- you make sure to supply a totally random seed
- and one way of implementing this is by splitting the seed each time you need to generate a number
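The seed-splitting idea above can be sketched with a SplitMix64-style step function. This is an illustrative Python sketch of the general technique only, not AutoFixture's or Hedgehog's actual code; the constants are the well-known SplitMix64 ones:

```python
# Illustrative sketch of a SplitMix64-style generator: each call advances
# ("splits") the state and returns a well-mixed output, so consumers can
# derive a fresh seed for every number they need to generate.
MASK64 = (1 << 64) - 1

def splitmix64(state):
    """Advance the state and return (new_state, mixed_output)."""
    state = (state + 0x9E3779B97F4A7C15) & MASK64  # golden-ratio increment
    z = state
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return state, z ^ (z >> 31)

state = 42  # any starting seed
state, a = splitmix64(state)
state, b = splitmix64(state)
assert a != b  # successive splits yield different outputs
```

Because the mixing steps are bijective on 64-bit words, distinct states always map to distinct outputs, while the same seed deterministically reproduces the same sequence.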
I have A LOT of feedback to provide here; my years of experience with AutoFixture, and the issues I've seen using FsCheck over the years, are more or less what led me to working on Hedgehog.
It really depends on you, @zvirja, which direction you'd like to take:
- Keep around the existing random number generation (as well as the existing way of dealing with ranges (Add support for string based properties decorated with the RangeAttribute #919)), which is fine—after all, these are already there from version 3—though we'd have to be conservative with bugfixes, as we can easily end up putting lipstick on a pig...
- Fix the random number generation by decoupling the 'range' concept (constant, linear, exponential) from the PRNG itself, and use some reasonable seed, preferably SplitMix.
It's exciting that you took the time to tackle this very issue 👍 — I look forward to hearing your decision.
Well, this is approximately what I'm trying to do in this PR 😄 I don't change the overall architecture and
Well, actually it's pretty hard for me to decide 😕 The idea of a total rework looks attractive; however, I feel it's probably better to keep the existing approach and simply improve it internally for now. That would decrease the amount of changes we have in v4, as we've already applied a set of them. It would also allow us to save some time and finally release v4 RTM. In v5, when we have more time, we could rework that and probably a few other things. I feel a high level of uncertainty with this decision, but it looks like a good plan for now. Does that make sense to you?
I understand that you put a lot of effort into this, and it is highly appreciated 🎉 However,
Why choose LCG over other, battle-tested ones, like I understand that the current implementation suffers from mediocre performance, however,
That being said, it might be just as easy as
I'm not very strong in this particular area, therefore I relied on Google. I didn't want to change the behavior of the existing generator, and the current one has a full cycle (no repetitions within the range). From what I gathered, we have either LCG or LFSR which could produce such sequences. I ended up with the first one as it was documented to provide a bit better quality. Could "TFGen" or "SplitMix" also produce non-repeating values within the specified range? Otherwise, that change is breaking, and that was not my intention.
Yep, that's the whole point. While we haven't tested it, I'm 99% sure that the previous generator implementation suffered from distribution issues due to the nature of its implementation. Therefore, when I decided to replace it with LCG, I assumed that quality is not the strong point here. Rather, we expect to have something a bit random, and that's enough. The LCG provided me with such values, so it looked like a good replacement, even though I haven't tested that the function has good enough quality. The more important thing is that it preserves the full-cycle semantics, as that is asserted in our unit tests.
I don't see too much value in preserving the previous implementation, as the new one provides exactly the same output. An observer will not see any difference after this change. The only visible difference might be the changed distribution of numbers, but our tool is not designed to generate randomness of perfect quality, so I'm keeping the status quo. That is how I see the situation, and that's why I suggested proceeding with the current non-optimal architecture and re-designing it later in v5 in a more architectural way if we wish. Probably, from your experience, my changes look like vandalism or something totally amateurish, and that might be so. However, I didn't pursue the goal of building a generator with a good distribution. Rather, I simply wanted to have:
From my perspective this PR achieved that goal. If you see that my goal is not achieved, or you see issues I've introduced, please share your feedback so we are aligned. I'd be really happy to learn this area and see why my proposal is not acceptable (if that is so), despite the fact that I've already invested a lot of resources into it.
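For reference, the "full cycle" property being discussed can be demonstrated with a tiny LCG. By the Hull-Dobell theorem, an LCG `x' = (a*x + c) mod m` visits every value in `[0, m)` exactly once when `c` and `m` are coprime, `a - 1` is divisible by every prime factor of `m`, and `a - 1` is divisible by 4 if `m` is. A minimal Python sketch with illustrative constants (not the ones used in this PR):

```python
# Full-cycle LCG demo: with Hull-Dobell-compliant constants the generator
# enumerates every value in [0, m) exactly once before repeating.
def lcg_cycle(seed, a, c, m):
    x = seed
    for _ in range(m):
        x = (a * x + c) % m
        yield x

# m = 2^5; a = 5 (a - 1 = 4, divisible by 2 and by 4); c = 3 (odd, coprime to m)
values = list(lcg_cycle(seed=0, a=5, c=3, m=32))
assert sorted(values) == list(range(32))  # every value appears exactly once
```

Because a full-period LCG is a single permutation cycle over all residues, any seed yields the same guarantee; only the starting point in the cycle changes.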
You did massive work here, and I think we can use this in place of the Would you mind putting this in a new file named as It's just that it feels like we're putting lipstick on a pig trying to fix the existing one, is all 😅 That's the same strategy we used with all sorts of new generators; we never deleted or modified the old ones, and instead we added new ones. (SOLID can be append-only, too.)
The more I think about this PR and review the stuff, the more internal pain I start to feel 😅 My initial intent was to fix the current generator issues without changing the observable behavior. I assumed that our clients would not notice any difference after the change, while internally we win performance. In fact, the public API wasn't changed and none of the tests failed - I simply adjusted an implementation detail, which is safe to do even in bugfix releases. Previously you mentioned, and now I clearly see, that the current implementation is.. not the best one. I also don't like the fact that I was fine with keeping that architecture as-is and simply changing the implementation. However, if you decline that idea and the only option is to add a new builder, that's another story. I don't want to add a shiny new builder which will be poisoned by the bad architecture from the beginning. We'd need to keep it in our code for a few major versions even if we have a good replacement. In that light, I'd vote to close this PR (keeping the implementation in a branch to re-use later) and do nothing for now. Later we'll design a better architecture (as you initially suggested) where we'll fix all the drawbacks and split responsibilities, so this issue will be fixed automatically. We can develop that in minor updates, as that will not lead to breaking changes. Therefore, let's hear your final word. If you see reasons why we cannot alter the implementation of the
Yes, you probably didn't see any tests failing, as we haven't done any statistical testing on what we currently have.
Yep, that is probably the only thing that could be affected. I don't know the project background well enough to know whether we actually promised distribution quality to our clients. I was under the impression that we didn't (as we don't have tests for that), so it's safe to change it. Therefore, I leave it up to you to decide whether the current change is acceptable or not, as you are more experienced in this area than me 😉
I'd rather keep your implementation in a new file as
If you do that, I have a few comments on your implementation, but you'd have to move it to a new file first.
@moodmosaic In this case let's follow the other option you suggested - decouple the random generator (and the range concept) from the I suggest creating the following interface in the

```csharp
interface IRandomGenerator
{
    long GetNextValue();
}
```

and also probably:

```csharp
interface IRangedRandomGenerator : IRandomGenerator
{
    (long value, (long min, long max) range) GetNextValueAndRange();
}
```

The latter one could be used by With this approach we could:

Is that design fine with you, or do you see some drawbacks with it? Is that what you previously meant?
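To make the proposed split concrete, here is a rough Python analogue of those two interfaces. Everything below is illustrative only (the real design would be C# inside AutoFixture); `UniformRangedGenerator` is a hypothetical implementation backed by Python's stdlib PRNG:

```python
# Illustrative sketch of the proposed decoupling: the PRNG is hidden behind
# IRandomGenerator, and range handling lives in a separate abstraction that
# reports both the value and the range it was drawn from.
from abc import ABC, abstractmethod
import random

class IRandomGenerator(ABC):
    """Stand-in for the suggested C# `IRandomGenerator` interface."""
    @abstractmethod
    def get_next_value(self) -> int: ...

class IRangedRandomGenerator(IRandomGenerator):
    """Stand-in for `IRangedRandomGenerator`: also reports the active range."""
    @abstractmethod
    def get_next_value_and_range(self) -> "tuple[int, tuple[int, int]]": ...

class UniformRangedGenerator(IRangedRandomGenerator):
    """Hypothetical implementation backed by Python's stdlib PRNG."""
    def __init__(self, lo: int, hi: int, seed: int = 0):
        self._lo, self._hi = lo, hi
        self._rng = random.Random(seed)

    def get_next_value(self) -> int:
        return self._rng.randint(self._lo, self._hi)

    def get_next_value_and_range(self):
        # The caller receives both the value and the range it came from,
        # so range logic stays outside the PRNG itself.
        return self.get_next_value(), (self._lo, self._hi)

gen = UniformRangedGenerator(1, 100, seed=42)
value, (lo, hi) = gen.get_next_value_and_range()
assert lo <= value <= hi
```

The point of the second interface is that callers which need to reason about ranges (e.g. to detect exhaustion) get that information from the same call, without reaching into the generator's internals.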
The previous implementation suffered from a few major issues. The first is that it failed with OOM if a large number of numeric requests were issued. Another is that generating the last elements in a range could take a large amount of time, as it works in a loop. The new implementation uses the Linear Congruential Generator approach, where the properties of prime numbers are used to generate a non-repeating sequence over a particular range. While we still generate numbers in a loop, ignoring out-of-range values, the predictability of this approach is much better because of the value distribution. If the distribution were uniform, we'd ignore at most half of the generated numbers. In reality the distribution is worse, however still good enough to rely on.
Fixes #789.
The current random generator suffers from a few issues which I tried to address in this PR:

- It relies on the `!generated.Contains(newRnd)` statement to guarantee the uniqueness of the produced values. This approach could lead to significant delays when the last elements in a sequence are generated - we have to find those "remaining" numbers using random 😕

To address those issues I used the Linear Congruential Generator, which generates unique values if the constants match specific criteria. It has a very small memory footprint and offers good generation performance. I used constants from this document, where they mentioned that the values are pretty good. Also, I've tested all the constants up to 2^33 and proven that they indeed produce a unique sequence (there were a few wrong constants, so I fixed them). I haven't tested the constants above that because my PC doesn't have enough power for it (RAM and CPU cores).
The issue with LCG is that it always produces the same sequence for the given constants. To overcome that I used a few techniques:
Those approaches make the produced sequence even more random and non-repeating.
A few things we should be aware of:

- The generator produces values within its own range (up to `this.m-1`). To adapt to the required range, I ignore random values above the range (see the loop in the `GenerateNextValue()` method). On average we ignore half of the generated numbers, as we have constants for each power of 2. So far I haven't noticed any issues with that during testing.
- I used `Debug.Assert()` a few times - those conditions are never expected to happen logically, however I left the asserts for documentation purposes and as a sanity check.

No tests were modified (you might see a refactoring only).
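The rejection loop described in the first bullet can be sketched as follows. This is an illustrative Python model (the constants and helper names are not from the PR): run a full-cycle LCG whose modulus is the next power of two at or above the requested range, and skip values that land outside the range, which yields uniqueness with O(1) memory:

```python
# Rejection-loop sketch: a full-cycle LCG over the next power of two,
# skipping out-of-range values. No history is stored, yet every in-range
# value is produced exactly once per cycle.
def next_pow2(n):
    p = 1
    while p < n:
        p *= 2
    return p

def ranged_unique(count, a=5, c=3):
    """Yield `count` distinct values in [0, count) without storing past ones.

    a=5, c=3 satisfy the Hull-Dobell conditions for any power-of-two
    modulus (c is odd; a - 1 is divisible by 2 and by 4), so the LCG has
    a full period over m.
    """
    m = next_pow2(count)
    x, produced = 0, 0
    while produced < count:
        x = (a * x + c) % m
        if x < count:      # rejection step: ignore out-of-range values
            produced += 1
            yield x

values = list(ranged_unique(20))
assert sorted(values) == list(range(20))  # unique, full coverage
```

Since `m < 2 * count`, fewer than half the LCG outputs are rejected in this uniform model; as the PR description notes, the real distribution is less favorable, but the bound on wasted iterations stays proportional to the range size.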
Additionally, I've referenced the `System.ValueTuple` package in the AutoFixture package, so we could use the tuples from C# 7. That shouldn't be an issue as we are still building v4. It could also enable us to support tuple generation by AutoFixture if needed.

@moodmosaic Please take a look and let me know your opinion. I've spent a few weeks working on this, but it could happen that I overlooked something 😉