Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add sampling options of iterative and binary search #831

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

hhorii
Copy link
Collaborator

@hhorii hhorii commented Jul 13, 2020

Summary

Optimization of Sampling in QubitVector

Details and comments

In original implementation, a loop for sampling is iterated based on sample count.
In this optimized implementation, a loop for sampling is iterated based on indices.
In OpenMP, threads execute the same iterations if loop conditions are the same in most case.
By using the same conditions with index construction and sampling, memory access is optimized if sampled values are randomly allocated in a qubitvector.

@hhorii hhorii force-pushed the sampling_opt branch 4 times, most recently from 3f9c6c3 to 8952b07 Compare July 14, 2020 13:10
@hhorii hhorii changed the title add optimization of sampling based on NUMA architecture WIP: add optimization of sampling based on NUMA architecture Jul 14, 2020
@hhorii hhorii changed the title WIP: add optimization of sampling based on NUMA architecture add optimization of sampling based on NUMA architecture Jul 14, 2020
@hhorii hhorii force-pushed the sampling_opt branch 9 times, most recently from da633b0 to 2f5d8f3 Compare July 21, 2020 16:11
@levbishop
Copy link
Member

levbishop commented Jul 22, 2020

Here's a toy demo based on the conditional binomial method as implemented in numpy, etc.
The single-threaded version is similar to the implementation in numpy. Performance depends on the binomial sampler used (on my laptop GSL seems faster than C++ std library), but is comparible to numpy.

// single-threaded
#include <iostream>
#include <random>

#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <chrono>
using namespace std::chrono;
int main()
{
    const int nq = 25, nshot = 800000;
    // const int nq=8, nshot=1000;

    std::default_random_engine generator;
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    double *probs = new double[1 << nq], totalprob = 0.0;
    int samples[nshot];

    // Generate (unnormalized) nq-qubit probability distr. Don't include this in timing
    for (int i = 0; i < 1 << nq; i++)
        totalprob += (probs[i] = uniform(generator));

    std::binomial_distribution<int> binom;
    gsl_rng *gslgen = gsl_rng_alloc(gsl_rng_taus);

    int s, offset = 0, r = nshot;
    auto start = high_resolution_clock::now();

    // Take nshot of samples from the above distribution, by conditional-binomial method:
    for (int j = 0; j < (1 << nq) - 1; j++)
    {
        // s = binom(generator, std::binomial_distribution<int>::param_type(r, probs[j]/totalprob));
        s = gsl_ran_binomial(gslgen, probs[j] / totalprob, r);
        r -= s;
        for (int k = 0; k < s; k++)
            samples[offset++] = j;
        if (!r)
            break;
        totalprob -= probs[j];
    }
    for (int k = 0; k < r; k++)
        samples[offset++] = (1 << nq) - 1;


    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<milliseconds>(stop - start);
    std::cout << duration.count() << std::endl;
    return 0;
}

For a parallel version you can divide the wavefn into roughly equal bins and then do a single-threaded multinomial sample to distribute samples between the bins before using OMP to sample within the bins. On my laptop the timing seems to scale well with the number of threads and is limited by the random number generation:

// multi-threaded
#include <iostream>
#include <random>

#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <omp.h>
#include <chrono>
using namespace std::chrono;
int main()
{
    const int nq = 25, nshot = 800000, nthread = 8;
    //const int nq = 8, nshot = 10000, nthread = 6;
    std::default_random_engine generator;
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    double *probs = new double[1 << nq], totalprob = 0.0, partialtotal[nthread];
    int samples[nshot];

    // Generate (unnormalized) nq-qubit probability distr, summing partial accumulated probability into nthread ~equal bins
    // Don't include this in timing (I assume that the accumulating partial probability was performed during last sweep over the wavefn before the measurement)
    int js[nthread + 1];
    js[0] = 0;
    for (int i = 0, t = 0; t < nthread; t++)
    {
        partialtotal[t] = 0.0;
        js[t + 1] = (t + 1) * (1 << nq) / nthread;
        for (; i < js[t + 1]; i++)
            partialtotal[t] += (probs[i] = uniform(generator));
        totalprob += partialtotal[t];
    }

    gsl_rng *gslgen = gsl_rng_alloc(gsl_rng_taus);
    // std::binomial_distribution<int> binom;
    auto start = high_resolution_clock::now();

    // Do a single-threaded multinomial to divide the shots among the bins 
    int r = nshot, offsets[nthread + 1];
    offsets[0] = 0;
    for (int t = 0; t < nthread - 1; t++)
    {
        int nsucc = gsl_ran_binomial(gslgen, partialtotal[t] / totalprob, r);
        offsets[t + 1] = offsets[t] + nsucc;
        r -= nsucc;
        totalprob -= partialtotal[t];
    }
    offsets[nthread] = nshot;



    // Do a parallel set of multinomials to sample from each of the bins in parallel. This should be the rate-limiting step
#pragma omp parallel num_threads(nthread)
    {
        int t = omp_get_thread_num();
        int off = offsets[t], r = offsets[t + 1] - offsets[t], jn = js[t + 1], s;
        gsl_rng *gslgen1 = gsl_rng_alloc(gsl_rng_taus);
        double partial = partialtotal[t];

        for (int j = js[t]; j < jn - 1; j++)
        {
            // s = binom(generator, std::binomial_distribution<int>::param_type(r, probs[j]/totalprob));
            s = gsl_ran_binomial(gslgen1, probs[j] / partial, r);
            r -= s;
            for (int k = 0; k < s; k++)
                samples[off++] = j;
            if (!r)
                break;
            partial -= probs[j];
        }
        for (int k = 0; k < r; k++)
            samples[off++] = jn - 1;
    }


    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<milliseconds>(stop - start);
    std::cout << duration.count() << std::endl;
    /*     for(int i=0; i<nshot; i++){
        std::cout << samples[i] << std::endl;
    } */
    return 0;
}

@hhorii hhorii force-pushed the sampling_opt branch 3 times, most recently from 044019b to 986a37f Compare August 26, 2020 02:17
@hhorii hhorii changed the title add optimization of sampling based on NUMA architecture add sampling options of iterative and binary search Aug 26, 2020
@hhorii hhorii requested a review from vvilpas as a code owner December 15, 2020 11:07
@chriseclectic chriseclectic added this to the Aer 0.8 milestone Feb 16, 2021
@chriseclectic chriseclectic removed this from the Aer 0.8 milestone Mar 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants