
smart analysis improvements #921

Open
t184256 opened this issue Apr 5, 2024 · 3 comments

@t184256
Collaborator

t184256 commented Apr 5, 2024

Feature request

Is your feature request related to a problem? Please describe

The feature works as described, in that when one uses the defaults, it is equivalent to asking for 1 ns CIs, and the resulting CIs are ~1 ns wide.

The feature doesn't work as expected, in that one can approach a CI close to, yet larger than, a nanosecond, keep collecting data, and never have the CIs improve, because the feature rejects the extra data.

Describe the solution you'd like

  1. Estimating the amount of data to process should use random sampling, so that the beginning of the run doesn't skew the whole picture.
  2. Some overshoot coefficient should be introduced (~4x?), just in case the estimation got lucky and the whole run is noisier than the estimate suggested.
  3. The feature should log prominently in the report that it has discarded, e.g., 87.95% of the data, so that users who hit the limit and want to go deeper have a better chance of noticing why they hit a wall (see the sketch after this list).
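A minimal sketch of how points 1-3 could fit together, assuming a hypothetical `estimate_sample_size()` helper; the function names, the 100k pilot size, and the 4x overshoot factor are illustrative assumptions, not existing tlsfuzzer code:

```python
import logging
import random

# overshoot factor from point 2 above (illustrative)
OVERSHOOT = 4

def plan_analysis(observations, estimate_sample_size, rng=random.Random(42)):
    """Pick which observations to analyze, sizing the run from a random pilot."""
    total = len(observations)
    # 1. estimate from a random subsample, not from the start of the run
    pilot = rng.sample(observations, min(total, 100_000))
    needed = estimate_sample_size(pilot)
    # 2. overshoot, in case the pilot was luckier than the full run
    needed = min(total, int(needed * OVERSHOOT))
    # 3. report loudly how much data is being discarded
    discarded = 1 - needed / total
    if discarded > 0:
        logging.warning(
            "Analyzing %d of %d observations (%.2f%% discarded); "
            "the requested CI limits how much data is used.",
            needed, total, discarded * 100)
    return rng.sample(observations, needed)
```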

Describe alternatives you've considered

  • Not using the feature at all - not cool.
  • Disabling the feature by default and documenting that one needs to overshoot - maybe that's a sensible option.

Additional context

@tomato42
Member

tomato42 commented Apr 5, 2024

If we use random sampling for both the initial estimation and the later analysis, then the resulting CI should be really close to the specified desired value.
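For intuition only, a toy sketch (not the actual analysis code, which uses different statistics) of why a randomly drawn pilot subsample sizes the run well: for a mean, the usual n ≈ (z*s/h)^2 rule uses the pilot's standard deviation, and a random pilot estimates that without warm-up bias. All names here are made up:

```python
import math
import random
import statistics

def required_n(values, half_width, z=1.96, pilot=10_000, rng=random.Random(0)):
    """Rough count of observations needed for a +/- half_width CI on the mean."""
    # random pilot subsample, so start-of-run warm-up doesn't bias the estimate
    pilot_sample = rng.sample(values, min(len(values), pilot))
    s = statistics.stdev(pilot_sample)
    return math.ceil((z * s / half_width) ** 2)
```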

@GeorgePantelakis
Contributor

I do not think disabling it by default would be good; I would say most regular users want okay-ish results in a short amount of time. The other features I do like. It might also be useful to add an option to limit the amount of data analyzed yourself: for example, if the script says that you need 230,000 points but you want to push it a bit more, you could specify (via a flag) to use 250,000 points. What do you think about that?
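One possible shape for that option, purely as a sketch; the `--sample-limit` flag name and its wiring are hypothetical, not an existing analysis.py option:

```python
import argparse

parser = argparse.ArgumentParser(description="timing analysis (sketch)")
# hypothetical flag, named here only for illustration
parser.add_argument(
    "--sample-limit", type=int, default=None,
    help="analyze this many observations instead of the automatically "
         "estimated count (e.g. 250000)")
args = parser.parse_args()

# later, after the automatic estimation (pseudo-wiring):
# needed = estimate_sample_size(...)
# if args.sample_limit is not None:
#     needed = args.sample_limit  # let the user push past the estimate
```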

@GeorgePantelakis
Contributor

GeorgePantelakis commented May 16, 2024

Notes:

  • If there are fewer samples than needed for the first estimation, skip the estimation and use the whole sample.
  • Use random sampling (in parallel, like Skillings-Mack).
  • For the estimation, create a random subsample of 10k or 100k tuples (take each tuple randomly with probability 1 in (total/100k)).
  • From that, estimate the sample size needed to reach the requested CI.
  • Add 10% to the calculated size to make sure we end up with a smaller CI (aim for a CI 5% smaller than requested).
  • Do the rest of the steps as normal (a rough sketch of the estimation step follows).
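Roughly, the estimation step from these notes could look like the sketch below; `ci_width_for` is a stand-in for whatever computes the achieved CI width on a subsample, and the 100k subsample size and 10% margin are the numbers from the notes:

```python
import math
import random

def estimation_subsample(tuples, target=100_000, rng=random.Random(0)):
    """Keep each tuple with probability 1 in (total/target), i.e. ~target tuples."""
    total = len(tuples)
    if total <= target:
        # fewer samples than needed for the estimation: use the whole sample
        return list(tuples)
    keep_probability = 1 / (total / target)
    return [t for t in tuples if rng.random() < keep_probability]

def needed_sample_size(tuples, requested_ci, ci_width_for, rng=random.Random(0)):
    """Estimate how many tuples are needed to reach requested_ci, plus 10%."""
    sub = estimation_subsample(tuples, rng=rng)
    achieved = ci_width_for(sub)
    # CI width shrinks roughly as 1/sqrt(n), so scale the subsample size up
    needed = math.ceil(len(sub) * (achieved / requested_ci) ** 2)
    # add 10%, aiming for a CI about 5% smaller than requested
    return min(len(tuples), math.ceil(needed * 1.1))
```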
