Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better Sampling Interface #44

Open
ParadaCarleton opened this issue Oct 21, 2022 · 15 comments
Open

Better Sampling Interface #44

ParadaCarleton opened this issue Oct 21, 2022 · 15 comments

Comments

@ParadaCarleton
Copy link
Collaborator

The sampling interface is a bit of a mess at the moment. The way users are asked to specify sample sizes is clunky and unnatural, demanding that they perform unusual calculations to avoid shooting themselves in the foot. Sobol nets, for example, only exist for powers of two, but users can request any sample size. At the moment we just give users these sample sizes by truncating the sequence, even though these truncated sequences are not guaranteed to converge to the correct answer, and will typically perform worse than Monte Carlo point sets. For Faure nets, the situation is safer (we error if the sample size is inappropriate) but even more annoying for users (they have to calculate multiples of powers of prime numbers).

For digital nets, a better approach is outlined here, and used by Art Owen in his QMC packages. Rather than request a sample size, users can provide parameters for a point set's stratification properties (m), how many independent replications they'd like (λ), and a base (b), before providing them λ*b^m points. I'd also suggest a feature where users can set an approximate sample size by using a keyword, then receive a net with the smallest possible net larger than the requested sample size.

This would be a breaking change for sure.

@ChrisRackauckas
Copy link
Member

This makes a lot of sense.

@dmetivie
Copy link
Contributor

About making the API evolve, I had few of questions/comments:

  • Could we change the name LowDiscrepancySample for HaltonSample?
    I find it very confusing, as Low Discrepency is a generic term used to define all QMC sequences.
  • Could we allow the method sample(n, d::Integer, SamplingAlgorithm()) for [0,1]^d sampling without arbitrary bounds.
    This is for people only working in [0,1]^d make more sense. Plus, randomization methods are defined in this interval.
  • I wondered if some sequence like Faure, Halton, Sobol could have an option to be produced as Rational numbers (since they are often exactly defined with fraction //). On performance, I am not sure of the result, but it would for example allow exact binary expansion in base b).
  • Following this post (and ref there), could we allow alternative Sobol sample? Like S0b0lSample (with zero instead of o to highlight hat the zero is in the sequence, contrary to the SobolSample). The name is probably a bad idea!
    I have a piece of code from Owen website doing that.

@ParadaCarleton
Copy link
Collaborator Author

ParadaCarleton commented Nov 24, 2022

About making the API evolve, I had few of questions/comments:

  • Could we change the name LowDiscrepancySample for HaltonSample?
    I find it very confusing, as Low Discrepency is a generic term used to define all QMC sequences.
  • Could we allow the method sample(n, d::Integer, SamplingAlgorithm()) for [0,1]^d sampling without arbitrary bounds.
    This is for people only working in [0,1]^d make more sense. Plus, randomization methods are defined in this interval.
  • I wondered if some sequence like Faure, Halton, Sobol could have an option to be produced as Rational numbers (since they are often exactly defined with fraction //). On performance, I am not sure of the result, but it would for example allow exact binary expansion in base b).
  • Following this post (and ref there), could we allow alternative Sobol sample? Like S0b0lSample (with zero instead of o to highlight hat the zero is in the sequence, contrary to the SobolSample). The name is probably a bad idea!
    I have a piece of code from Owen website doing that.

I agree with all of these, and I think they'd be great additions! I was planning on making PRs for most of these myself. I think the Sobol Sample should just be centered, though--every point should be placed in the middle of the appropriate box, rather than at the start. So the first two points would be 1/4 and 3/4 instead of 0 and 1/2, for example. This is what FaureSample currently does.

@ChrisRackauckas
Copy link
Member

Could we change the name LowDiscrepancySample for HaltonSample?
I find it very confusing, as Low Discrepency is a generic term used to define all QMC sequences.

100% agreed. It's something one of the students did that I never ended up correcting. It's one of the reasons I wouldn't make a v1.0 on this library yet 😅.

Could we allow the method sample(n, d::Integer, SamplingAlgorithm()) for [0,1]^d sampling without arbitrary bounds. This is for people only working in [0,1]^d make more sense. Plus, randomization methods are defined in this interval.

Yes, though the sample(n, lb, ub, sampler) is still going to be the core algorithm that most people use. We can document sample(n, d::Integer, SamplingAlgorithm()) though, and it would serve as a nice abstraction to implement sample(n, lb, ub, sampler) generically just by rescaling the outputs of sample(n, d::Integer, SamplingAlgorithm()).

I wondered if some sequence like Faure, Halton, Sobol could have an option to be produced as Rational numbers (since they are often exactly defined with fraction //). On performance, I am not sure of the result, but it would for example allow exact binary expansion in base b).

The interface should be type-matching. If you use Rational for your lb and ub, then you should get Rational values for the points. This is one reason why a shorthand for sample(n, d::Integer, SamplingAlgorithm()) is scary: it's necessarily information dropping.

Following this JuliaMath/Sobol.jl#31 (comment) (and ref there), could we allow alternative Sobol sample? Like S0b0lSample (with zero instead of o to highlight hat the zero is in the sequence, contrary to the SobolSample). The name is probably a bad idea!

Or just make it an option?

@ParadaCarleton
Copy link
Collaborator Author

ParadaCarleton commented Nov 27, 2022

Could we change the name LowDiscrepancySample for HaltonSample?
I find it very confusing, as Low Discrepency is a generic term used to define all QMC sequences.

100% agreed. It's something one of the students did that I never ended up correcting. It's one of the reasons I wouldn't make a v1.0 on this library yet sweat_smile.

With regards to names, I actually think we should drop Sample from all the types; it's 6 extra letters but contains 0 new information.

Could we allow the method sample(n, d::Integer, SamplingAlgorithm()) for [0,1]^d sampling without arbitrary bounds. This is for people only working in [0,1]^d make more sense. Plus, randomization methods are defined in this interval.

Yes, though the sample(n, lb, ub, sampler) is still going to be the core algorithm that most people use. We can document sample(n, d::Integer, SamplingAlgorithm()) though, and it would serve as a nice abstraction to implement sample(n, lb, ub, sampler) generically just by rescaling the outputs of sample(n, d::Integer, SamplingAlgorithm()).

Yep, that makes sense! Although actually, it makes me think of a possibly-better interface. We might want users to specify what kind of region they want (e.g. Box(lb, ub), or Box(d, type=Float64) for the unit box), then map to said region if it's not a box; by default we can have a method that assumes the unit box.

I wondered if some sequence like Faure, Halton, Sobol could have an option to be produced as Rational numbers (since they are often exactly defined with fraction //). On performance, I am not sure of the result, but it would for example allow exact binary expansion in base b).

The interface should be type-matching. If you use Rational for your lb and ub, then you should get Rational values for the points. This is one reason why a shorthand for sample(n, d::Integer, SamplingAlgorithm()) is scary: it's necessarily information dropping.

I don't think that's a huge deal; we can add an optional type argument that defaults to Float64.

Following this stevengj/Sobol.jl#31 (comment) (and ref there), could we allow alternative Sobol sample? Like S0b0lSample (with zero instead of o to highlight hat the zero is in the sequence, contrary to the SobolSample). The name is probably a bad idea!

Or just make it an option?

I'm not sure we should include an option--there's really no reason to drop the initial 0 when we can use the centered Sobol samples instead, which have better discrepancy and don't have the problem of an initial 0. Including an option would just be a "shoot self in foot" argument.

@ChrisRackauckas
Copy link
Member

With regards to names, I actually think we should drop Sample from all the types; it's 6 extra letters but contains 0 new information.

It can cause namespacing issues when exported. And Sobol is already the module name of Sobol.jl

Box(d, type=Float64)

Passing DataTypes is almost always a bad idea because of how that plays with specialization heuristics. I mean, you can do it, but any function that isn't careful will lose inference.

Yep, that makes sense! Although actually, it makes me think of a possibly-better interface. We might want users to specify what kind of region they want (e.g. Box(lb, ub), or Box(d, type=Float64) for the unit box), then map to said region if it's not a box; by default we can have a method that assumes the unit box.

sample(n,region,sampler)?

@ParadaCarleton
Copy link
Collaborator Author

With regards to names, I actually think we should drop Sample from all the types; it's 6 extra letters but contains 0 new information.

It can cause namespacing issues when exported. And Sobol is already the module name of Sobol.jl

Hmm, maybe, but would that come up all that often? The majority of users probably don't have 2 separate packages for doing QMC. And using import QuasiMonteCarlo as QMC instead of using QuasiMonteCarlo isn't too much of a problem--QMC.Faure is still shorter than FaureSample.

As for Sobol, that shouldn't be a problem because we don't reexport it, right?

Box(d, type=Float64)

Passing DataTypes is almost always a bad idea because of how that plays with specialization heuristics. I mean, you can do it, but any function that isn't careful will lose inference.

Hmm, that's strange; I haven't had any problems with it before. Is Box{Float64}(d) any better?

Yep, that makes sense! Although actually, it makes me think of a possibly-better interface. We might want users to specify what kind of region they want (e.g. Box(lb, ub), or Box(d, type=Float64) for the unit box), then map to said region if it's not a box; by default we can have a method that assumes the unit box.

sample(n,region,sampler)?

Yep!

@dmetivie dmetivie mentioned this issue Nov 30, 2022
@ParadaCarleton
Copy link
Collaborator Author

Another suggestion--maybe it's better to use pointset or a similar name instead of sample? Partly this is so it doesn't conflict with sample, since not exporting sample is annoying for users. But mostly this is because I think sample gives the wrong impression that QMC is a drop-in replacement for Monte Carlo. Latin hypercube and Faure samples work like that, but not much else; Sobol samples in particular assume only the first few dimensions are relevant to your problem, so you have to choose your parametrization carefully or you'll get bad results (much worse than LHS).

@ChrisRackauckas
Copy link
Member

I agree pointset is a good name. I think what we want to do is update the interface and then claim it to be the v1.0 release, so now would be the time to break things if they need to break.

@ParadaCarleton
Copy link
Collaborator Author

ParadaCarleton commented Dec 15, 2022

@dcarrera Just wanted to ask about this because you opened an issue about it in Sobol.jl. Would you be interested in making a pull request implementing your suggested interface here in QMC.jl?

@dcarrera
Copy link

@ParadaCarleton Hey! I'm not sure I feel qualified to implement this. As I noted in that other issue on Sobol.jl, I looked at the implementations scipy.stats.qmc.Sobol and Art Owen's R package. Superficially it didn't LOOK complicated, but I could not figure out what the code was doing in either case. I think I just don't understand the Sobol algorithm ell enough. I mentioned in that other issue that the R package is BSD licensed. So someone that understands it could translate it to Julia. I just had another look at the R code:

https://artowen.su.domains/code/rsobol.R

I don't know R but I can see that there are two ways to scrambe the points: A "nested uniform" ("nestu") and "Matousek's linear scramble" ("mato"). A minimal Julia version only needs one way to scramble the points. I had a look, but I don't really understand how they work.

@ParadaCarleton
Copy link
Collaborator Author

ParadaCarleton commented Jul 21, 2023

@ChrisRackauckas do you think you or someone else could make a draft PR to Surrogates.jl to swap to the "iterator over points" interface? Now that eachcol makes it easy to return an iterator over points, I'd like to switch over sooner rather than later to avoid breaking more code than we have to.

@ChrisRackauckas
Copy link
Member

Yes sorry JuliaCon delays. I talked with @sharanry about it at JuliaCon. @sharanry did my description of the issue make sense? I'd like to get this prioritized a bit because right now it's the only big major bound on any SciML library (Surrogates.jl still doesn't allow the latest major on QuasiMonteCarlo, 6 months later) and it's somewhat a ticking timebomb.

@ChrisRackauckas
Copy link
Member

I meant @thazhemadam

@ParadaCarleton
Copy link
Collaborator Author

If @thazhemadam wants to talk on Slack about this I’d be happy to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants