
Long non-linear run times on large models #174

Open
jrawbits opened this issue Feb 28, 2022 · 2 comments
jrawbits commented Feb 28, 2022

Based on timings reported in log files from running multiple scenarios in the H-GAC models, a few of the modules appear to have undesirable algorithmic properties, with runtimes suggesting order N-squared or worse. It is worth delving into how these modules do their work to see if there is some way to reduce the complexity. All of the "victims" are doing some kind of sampling and balancing toward target proportions, and we'll easily hit order N-squared if the process requires doing N samples N times. That may be inevitable, but perhaps there's a craftier sampling algorithm out there. I'll edit this issue with the names of the affected modules once I've downloaded the enormous set of results.

The modules I'm starting with are PredictWorkers and CalculateVehicleOwnCost; the latter's runtime collapses whenever pay-as-you-drive is set up in the scenario.


jrawbits commented Mar 3, 2022

The problem appears to be the use of R's sample function in many places to generate a set of values that are randomly distributed according to a probability/proportion. The sample function without replacement (essentially classifying the items we're sampling from) is excruciatingly slow and scales very badly to big samples from big populations.
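For intuition on why weighted sampling without replacement scales so badly, here is a hedged sketch (in Python rather than R, and not taken from the VisionEval modules) of the naive sequential approach: each draw renormalizes and scans the remaining weights, so k draws from n items cost O(n·k), i.e. O(n²) when k grows with n.

```python
import random

def weighted_sample_without_replacement(items, weights, k):
    """Naive sequential weighted draws without replacement.

    Each draw sums the remaining weights and scans them linearly to
    locate the chosen item (O(n) per draw), so k draws cost O(n * k).
    This mirrors the cost profile that sample(..., prob = w,
    replace = FALSE) can exhibit on large inputs.
    """
    items = list(items)
    weights = list(weights)
    chosen = []
    for _ in range(k):
        total = sum(weights)               # renormalization: O(n)
        r = random.uniform(0, total)
        acc = 0.0
        for i, w in enumerate(weights):    # linear scan: O(n)
            acc += w
            if acc >= r:
                chosen.append(items.pop(i))
                weights.pop(i)
                break
        else:
            # Guard against float round-off leaving acc just below r.
            chosen.append(items.pop())
            weights.pop()
    return chosen
```

The function and its name are illustrative only; the point is the nested loop structure, not any particular API.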

There is a much faster algorithm available, and even though it appears to scale to cases with probabilities, it is really only valid as a "non-probability" (unweighted) sampler. To sample with probabilities and get something like the same answer requires iterating once over (on average) half the population for each required worker (removing each selected worker, then rescaling the remaining probabilities). I'll keep researching for a while, but for now I'm just going to pursue the module fixups mentioned in the next comment.
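One published candidate for a "craftier" algorithm is the Efraimidis–Spirakis key method: give each item the key u^(1/w) with u drawn uniform on (0, 1), then take the k largest keys. This reproduces the distribution of sequential probability-proportional draws with removal, but in a single O(n log k) pass instead of k renormalizing scans. A hedged Python sketch (names hypothetical, not code from the modules in question):

```python
import heapq
import random

def weighted_sample_es(items, weights, k):
    """Efraimidis-Spirakis weighted sampling without replacement.

    Each item gets key u ** (1/w); the k items with the largest keys
    form the sample. A size-k min-heap keeps the current best keys,
    giving one O(n log k) pass over the population.
    """
    heap = []  # min-heap of (key, item); smallest key sits at heap[0]
    for item, w in zip(items, weights):
        key = random.random() ** (1.0 / w)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]
```

Whether this matches the balancing semantics the VisionEval modules actually need would have to be verified against the existing sample-and-rescale behavior before swapping it in.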


jrawbits commented Mar 8, 2022

Since I have the modules open, I'm making some other cleanups to the estimation and documentation builds (and getting rid of some R CMD check warnings along the way), and I'll put that up as a pull request soon.
