More sophisticated sampling techniques #7

Open
oxinabox opened this issue Dec 7, 2020 · 0 comments

from @glennmoy

The idea behind randomly/periodically selecting weekly blocks is that it allows us to adequately sample a 2-year period with some statistical guarantees about the proportional representation of weekdays/weekends and seasons within the validation and holdout sets.

The implication is that this provides (albeit somewhat weaker) guarantees about the distribution of the underlying grid state, seasonality effects, and our performance over the period.

In the ensembling squad, this assumption was undermined by one problem: our model performance varied a lot between years, and one year happened to be sampled more than the other.
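To make the failure mode concrete, here is a minimal Python sketch of random weekly-block selection, plus a helper that reports how much of a set falls in each calendar year. It is not the RandomSelector implementation: the function names, the `holdout_fraction` and `seed` parameters, and the independent per-block coin flip are all assumptions made for illustration.

```python
import random
from collections import Counter
from datetime import date, timedelta


def random_weekly_blocks(start, end, holdout_fraction, seed):
    """Split [start, end] into 7-day blocks and randomly assign each block.

    Hypothetical helper: each block goes to holdout independently with
    probability `holdout_fraction`, otherwise to validation.
    """
    rng = random.Random(seed)
    n_days = (end - start).days + 1
    days = [start + timedelta(days=i) for i in range(n_days)]
    # The last block may be shorter than 7 days.
    blocks = [days[i:i + 7] for i in range(0, n_days, 7)]
    validation, holdout = [], []
    for block in blocks:
        target = holdout if rng.random() < holdout_fraction else validation
        target.extend(block)
    return validation, holdout


def share_by_year(dates):
    """Fraction of the given dates that falls in each calendar year."""
    counts = Counter(d.year for d in dates)
    total = sum(counts.values())
    return {year: n / total for year, n in sorted(counts.items())}


validation, holdout = random_weekly_blocks(
    date(2019, 1, 1), date(2020, 12, 31), holdout_fraction=0.5, seed=1
)
print(share_by_year(validation))
print(share_by_year(holdout))
```

Nothing in this scheme constrains the per-year shares, so any given seed can leave one year over-represented in either set. Weekday/weekend balance holds by construction because whole weeks are kept together.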

The current RandomSelector is therefore not robust enough to provide any guarantees about the statistics of our returns, and such guarantees are necessary for a reliable baseline against which we can compare optimised models.

This issue is more to document the concern and some possible avenues for taking this forward with different, and more sophisticated, selectors in future. Namely:

As a simple remedy to the above, we might have instead done something like

  • Cluster the dates by season
  • Within each cluster, sort the dates by some difficulty measure
  • Systematically select dates (e.g. every second date or in blocks of 7) to ensure (roughly) proportional statistics

This would retain the same seasonal guarantees as before and somewhat weaken the weekday/weekend guarantee, but with the benefit of more similar return statistics. This is just a simple example; perhaps there's an easier/better way of doing it.
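The three steps above could be sketched in Python roughly as follows. Everything here is an assumption for illustration: the month-based `season_of` clustering, the `difficulty` mapping (date to score, however that score ends up being defined), and the names `systematic_split` and `step`. Only the "every second date" variant is shown, not the blocks-of-7 one.

```python
from collections import defaultdict
from datetime import date, timedelta

# Assumed clustering: meteorological seasons keyed by month.
SEASONS = {12: "winter", 1: "winter", 2: "winter",
           3: "spring", 4: "spring", 5: "spring",
           6: "summer", 7: "summer", 8: "summer",
           9: "autumn", 10: "autumn", 11: "autumn"}


def season_of(d):
    return SEASONS[d.month]


def systematic_split(dates, difficulty, step=2):
    """Cluster by season, rank by difficulty, send every `step`-th date to holdout."""
    clusters = defaultdict(list)
    for d in dates:
        clusters[season_of(d)].append(d)
    validation, holdout = [], []
    for season in sorted(clusters):
        # Ties in difficulty are broken by date so the split is deterministic.
        ranked = sorted(clusters[season], key=lambda d: (difficulty[d], d))
        for i, d in enumerate(ranked):
            (holdout if i % step == step - 1 else validation).append(d)
    return sorted(validation), sorted(holdout)


dates = [date(2019, 1, 1) + timedelta(days=i) for i in range(731)]
# Placeholder scores only; a real difficulty measure would come from the data.
difficulty = {d: (d.toordinal() * 37) % 101 for d in dates}
validation, holdout = systematic_split(dates, difficulty, step=2)
```

Because neighbouring dates in the difficulty ranking alternate between the two sets, each season contributes a similar difficulty profile to both. Individual days rather than whole weeks are assigned, which is why the weekday/weekend guarantee weakens.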

Moreover, if we ever wish to discriminate by other criteria, e.g. grid regimes, the example gets more complicated but the same principle applies.
