New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

[WIP] Allow NUTS to do eager evaluation on forward and backward trajectory in parallel #3103

Open

SteveBronder wants to merge 11 commits into develop from review1/speculative-nuts

Collaborator

SteveBronder commented Feb 17, 2022

Submission Checklist

Run unit tests: ./runTests.py src/test/unit
Run cpplint: make cpplint
Declare copyright holder and open-source license: see below

Summary

This is a WIP to allow NUTS to perform calculation of the forward and backwards trajectory in parallel.

Intended Effect

How to Verify

Side Effects

Documentation

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Sebastian Weber and Steve Bronder

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

Sebastian Weber and others added 11 commits

May 7, 2018 21:14


          Merge branch 'develop' of https://github.com/stan-dev/stan into develop

e4ab35b


          Merge branch 'develop' of https://github.com/stan-dev/stan into develop

54b927c


          Merge branch 'develop' of https://github.com/stan-dev/stan into develop

cc44118


          prototype speculative NUTS

586989f


          get rid of mutex when sampling uniforms for parallel nuts

5d9a09a


          clang format

7669f6d


          update with some comments

4cda135


          tests running for parallel adaptation

dfdc55a


          clang format

f18ae74


          update to not use get_num_theads() to detect thread level

ff668c9


          clang format

334d242

SteveBronder commented

View reviewed changes

Collaborator Author

SteveBronder left a comment

@wds15 had a couple Qs on the impl, it's extremely cool! I've always wanted to dive into the tbb graph stuff more and how you did things here helped that stuff make a lot more since. Big favor, but could you write out the actual graph? Like with a pen and paper, google slide, latex, whatever is most convenient for you. I'm just worried I'm missing some key component of the graph structure and want to sanity check I'm reading this correctly

src/stan/mcmc/hmc/nuts/base_parallel_nuts.hpp

+                      if (fwd_idx != 0) {
+                        // in this case this is not the starting node, we
+                        // connect this with its predecessor
+                        tbb::flow::make_edge(*(fwd_iter - 1), *fwd_iter);

Collaborator Author

SteveBronder Feb 17, 2022

Is this dependency on past forward iterations because we want to check if valid_subtree_fwd[fwd_idx - 1] is false and if so do no work? Kind of a wacky idea, but what if we made valid_parent a reference to a bool and passed it to extend_tree so we keep checking it at each successive call to extend_tree/build_tree? Then when a parent fails the children can stop as well

EDIT: I'll leave this comment in case I'm wrong but staring at build_tree idt its safe to do what I said above since z_fwd is modified at each depth

src/stan/mcmc/hmc/nuts/base_parallel_nuts.hpp

Comment on lines +342 to +343

		n_leapfrog += std::get<5>(subtree_result);
		sum_metro_prob += std::get<6>(subtree_result);

Collaborator Author

SteveBronder Feb 17, 2022

If the forward and backward update at the same time is there a chance for a race condition to happen here? This comment applies to most of the stuff here made outside of scope and assigned to. Maybe we should just have vector for these things and sum them up at the end?

src/stan/mcmc/hmc/nuts/base_parallel_nuts.hpp

Comment on lines +420 to +431

+                  if (fwd_direction[0]) {
+                    fwd_builder[0].try_put(tbb::flow::continue_msg());
+                    // the first turn is fwd, so kick off the bck walker if needed
+                    if (num_bck != 0) {
+                      bck_builder[0].try_put(tbb::flow::continue_msg());
+                    }
+                  } else {
+                    bck_builder[0].try_put(tbb::flow::continue_msg());
+                    if (num_fwd != 0) {
+                      fwd_builder[0].try_put(tbb::flow::continue_msg());
+                    }
+                  }

Collaborator Author

SteveBronder Feb 17, 2022

[Q] Could we just have one vector here to store all the builders? It looks like we have two because it lets us just do

tbb::flow::make_edge(*(fwd_iter - 1), *fwd_iter);

But could we just have one vector and keep track of which the index for the last forward or backward iteration. So the line that uses the edge above would turn into something like

// Before loop
tbb::concurrent_vector<tree_builder_t> builder;
builder.reserve(this->max_depth_);
// ...
        if (fwd_idx != 0) {
          // in this case this is not the starting node, we
          // connect this with its predecessor
          tbb::flow::make_edge(*(builder[fwd_idx]), *fwd_iter);
          // I _think_ this is safe as long as we reserve max_depth_ size
          //  of the vector before running 
          fwd_idx = std::distance(*(builder[0]), *fwd_iter);
        }

src/stan/mcmc/hmc/nuts/base_parallel_nuts.hpp

+                                           checks.back());
+                    }
+                    if (depth != 0) {
+                      tbb::flow::make_edge(checks[depth - 1], checks.back());

Collaborator Author

SteveBronder Feb 17, 2022

If each check is dependent on one another and each check is also dependent on the forward or backward iteration, are we essentially doing all these checks serially? If so could we just make it a function_node with a concurrency of 1?

src/stan/mcmc/hmc/nuts/base_parallel_nuts.hpp

+                  double accept_prob
+                      = sum_metro_prob / static_cast<double>(this->n_leapfrog_);
+                  this->z_.ps_point::operator=(z_sample);

Collaborator Author

SteveBronder Feb 17, 2022

TODO: Would be nice to have a move operator for ps point so this doesn't make a copy

Contributor

bob-carpenter Feb 17, 2022

This also doesn't need an explicit static cast. C++ automatically promotes the lower type to the higher type in binary arithmetic.

And shouldn't we be able to write line 447 as this->z.ps_point = z_sample; or even just z.ps_point = z_sample;? I thought that was the whole point of implementing operator=().

Contributor

wds15 commented Feb 17, 2022

I think I did pencil the graph structure down somewhere…let me search for it.

Contributor

bob-carpenter commented Feb 17, 2022

I only saw this in the email preview (in the same file as @SteveBronder is commenting on), I'd have coded this

fwd_idx == 0 ? true : valid_subtree_fwd[fwd_idx - 1];

as

fwd_idx == 0 || valid_subtree_fwd[fwd_idx - 1];

It's the same behavior because of the short-circuiting of ||.

Contributor

wds15 commented Feb 18, 2022

Here is what my iPad has in my notes app on this:

(ignore for now the red arrows..I explain them below)

So you see the left and the right legs of the trajectory which have yellow path's. Each blob is a point on the trajectory where something happens. Specifically at these points the state is being send along the arrows to the middle process in black, which is the "abort checker process". The blue and green blobs fire off that we explore 2^n (n being the current depth) steps deeper the trajectory in that direction. The checker process in the middle recieves the states which must be compared to evaluate the U-turn criterion.

The idea is that you lay out the full tree to it's maximum depth first (putting lambda's into the leaves doing the work when called) and then you let the TBB run the tree for you in full automation. The execution is canceled once the U-turn criterion is met. At that stage execution of the tree is stopped and the identified state is returned.

If I recall correctly, the red connections are only wired up if you want a serial execution of things. So I intentionally wire-up the graph differently in the case of running this sequentially. This is to help the TBB figure out easier what to do when running the graph. I recall that this was good for single-core performance.

Ok... now that's what is behind this. Does this help already or should I try to annotate the source with reference to the figure?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment