Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

frequency is inefficient and wrong #337

Open
treeowl opened this issue Nov 11, 2019 · 3 comments · May be fixed by #338
Open

frequency is inefficient and wrong #337

treeowl opened this issue Nov 11, 2019 · 3 comments · May be fixed by #338

Comments

@treeowl
Copy link
Contributor

treeowl commented Nov 11, 2019

Correctness

frequency is a little bit wrong because it relies on the sum of the frequencies being representable by an Int, but does not document this. So frequency [(maxBound `quot` 2, x), ((maxBound `quot` 4) * 3, y), (2034, z)] will do something hard to predict. The best fix is likely to use double-word integers in the calculations:

data DW = DW {hi :: !Word, lo :: !Word}

This is almost certainly available in a library somewhere, but it wouldn't be too hard to roll our own if necessary.

Efficiency

The argument list is traversed for every value generated. When the list is long, that's really lousy. The solution is to build a tree representing the input list, making lookup much faster. A simple hack using DW above would use an IntMap of IntMaps and some lookup functions like lookupLE, lookupGE, etc. There are almost certainly better specialized data structures available for the purpose, but I don't know if they'd be worth the trouble.

@treeowl
Copy link
Contributor Author

treeowl commented Nov 11, 2019

Another option to deal with correctness is to document the issue and throw an error when overflow is detected. A third is to change the type to use Int32 (or better, Word32) in the input, and Word64 internally.

@treeowl
Copy link
Contributor Author

treeowl commented Nov 11, 2019

The big advantage of sticking with Word64 internally, instead of a double word, is that things should be considerably simpler than otherwise.

treeowl added a commit to treeowl/haskell-hedgehog that referenced this issue Nov 11, 2019
* Make `frequency` detect negative frequencies and frequency
  total overflow.

* Rewrite `frequency` to build an `IntMap` to represent the
  frequency list, which greatly improves efficiency for long
  lists.

Fixes hedgehogqa#337.
@treeowl treeowl linked a pull request Nov 11, 2019 that will close this issue
treeowl added a commit to treeowl/haskell-hedgehog that referenced this issue Nov 11, 2019
* Make `frequency` detect negative frequencies, frequency
  total overflow, and zero total frequency.

* Rewrite `frequency` to build an `IntMap` to represent the
  frequency list, which greatly improves efficiency for long
  lists.

Fixes hedgehogqa#337.
treeowl added a commit to treeowl/haskell-hedgehog that referenced this issue Nov 11, 2019
* Make `frequency` detect negative frequencies, frequency
  total overflow, and zero total frequency.

* Rewrite `frequency` to build an `IntMap` to represent the
  frequency list, which greatly improves efficiency for long
  lists.

* Bump `containers` lower bound to 0.5.11.

Fixes hedgehogqa#337.
@jacobstanley
Copy link
Member

jacobstanley commented Nov 14, 2019

fwiw frequency and its semantics are copied from QuickCheck

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants