add sampling options of iterative and binary search #831

hhorii · 2020-07-13T15:39:37Z

Summary

Optimization of Sampling in QubitVector

Details and comments

In the original implementation, the sampling loop iterates over the samples (one iteration per shot).
In this optimized implementation, the sampling loop iterates over the indices of the qubit vector.
With OpenMP, loops that have the same conditions are, in most cases, divided among threads in the same way, so each thread executes the same iterations.
By using the same loop conditions for index construction and for sampling, memory access is optimized when the sampled values are spread randomly across the qubit vector.
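As a rough illustration of a loop that iterates over indices rather than over shots, here is a minimal single-threaded sketch. It is not the code in this PR: the function name `sample_by_index_sweep`, the use of sorted random numbers, and the assumption that `probs` is normalized are all made up for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not the PR's implementation): draws `shots` outcomes
// from the normalized distribution `probs` in a single pass over the indices.
std::vector<std::size_t> sample_by_index_sweep(const std::vector<double>& probs,
                                               std::size_t shots,
                                               std::mt19937_64& rng) {
    // Draw all random numbers up front and sort them, so that one sweep over
    // the cumulative probabilities can serve every shot.
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::vector<double> rnds(shots);
    for (double& r : rnds) r = uniform(rng);
    std::sort(rnds.begin(), rnds.end());

    std::vector<std::size_t> samples;
    samples.reserve(shots);
    double cumulative = 0.0;
    std::size_t next = 0;  // index of the next unassigned random number
    for (std::size_t i = 0; i < probs.size() && next < shots; ++i) {
        cumulative += probs[i];
        // Every random number below the running total falls on index i.
        while (next < shots && rnds[next] < cumulative) {
            samples.push_back(i);
            ++next;
        }
    }
    // Rounding can leave the running total slightly below 1; any leftover
    // shots are assigned to the last index.
    samples.resize(shots, probs.size() - 1);
    return samples;
}
```

The returned samples come out ordered by index rather than in random order, which is fine when only the counts matter. Each entry of `probs` is read exactly once, in order, regardless of the number of shots.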

levbishop · 2020-07-22T18:57:07Z

Here's a toy demo based on the conditional binomial method, as implemented in numpy and elsewhere.
The single-threaded version is similar to the implementation in numpy. Performance depends on the binomial sampler used (on my laptop GSL seems faster than the C++ standard library), but is comparable to numpy.

// single-threaded
#include <iostream>
#include <random>

#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <chrono>
using namespace std::chrono;
int main()
{
    const int nq = 25, nshot = 800000;
    // const int nq=8, nshot=1000;

    std::default_random_engine generator;
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    double *probs = new double[1 << nq], totalprob = 0.0;
    int *samples = new int[nshot]; // heap-allocated: nshot ints (~3 MB) may not fit on the stack

    // Generate (unnormalized) nq-qubit probability distr. Don't include this in timing
    for (int i = 0; i < 1 << nq; i++)
        totalprob += (probs[i] = uniform(generator));

    std::binomial_distribution<int> binom;
    gsl_rng *gslgen = gsl_rng_alloc(gsl_rng_taus);

    int s, offset = 0, r = nshot;
    auto start = high_resolution_clock::now();

    // Take nshot of samples from the above distribution, by conditional-binomial method:
    for (int j = 0; j < (1 << nq) - 1; j++)
    {
        // s = binom(generator, std::binomial_distribution<int>::param_type(r, probs[j]/totalprob));
        s = gsl_ran_binomial(gslgen, probs[j] / totalprob, r);
        r -= s;
        for (int k = 0; k < s; k++)
            samples[offset++] = j;
        if (!r)
            break;
        totalprob -= probs[j];
    }
    for (int k = 0; k < r; k++)
        samples[offset++] = (1 << nq) - 1;


    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<milliseconds>(stop - start);
    std::cout << duration.count() << std::endl;
    return 0;
}
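For reference, the loop above relies on the standard factorization of a multinomial into a chain of conditional binomials. In the notation below (introduced here, not in the demo), $w_j$ corresponds to `probs[j]`, $W_j$ to the running value of `totalprob`, $r_j$ to `r`, $s_j$ to `s`, $n$ to `nshot`, and $N = 2^{nq}$:

```latex
\begin{aligned}
W_j &= \sum_{i \ge j} w_i, \qquad r_j = n - \sum_{i < j} s_i, \\
s_j \mid s_0, \dots, s_{j-1} &\sim \mathrm{Binomial}\!\left(r_j,\ \frac{w_j}{W_j}\right), \\
P(s_0, \dots, s_{N-1})
  &= \prod_{j=0}^{N-1} \binom{r_j}{s_j}
     \left(\frac{w_j}{W_j}\right)^{s_j}
     \left(1 - \frac{w_j}{W_j}\right)^{r_j - s_j}
   = \frac{n!}{\prod_j s_j!} \prod_{j=0}^{N-1} \left(\frac{w_j}{W_0}\right)^{s_j}.
\end{aligned}
```

The right-hand side is the multinomial distribution with probabilities $w_j / W_0$, so drawing the counts one index at a time gives the same distribution as drawing all shots independently. For the last index $w_{N-1} / W_{N-1} = 1$, which is why the demo assigns the remaining `r` shots to it without drawing another binomial.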

For a parallel version you can divide the wavefunction into roughly equal bins and then do a single-threaded multinomial sample to distribute the shots among the bins, before using OpenMP to sample within the bins. On my laptop the timing seems to scale well with the number of threads and is limited by the random number generation:

// multi-threaded
#include <iostream>
#include <random>

#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <omp.h>
#include <chrono>
using namespace std::chrono;
int main()
{
    const int nq = 25, nshot = 800000, nthread = 8;
    //const int nq = 8, nshot = 10000, nthread = 6;
    std::default_random_engine generator;
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    double *probs = new double[1 << nq], totalprob = 0.0, partialtotal[nthread];
    int *samples = new int[nshot]; // heap-allocated: nshot ints (~3 MB) may not fit on the stack

    // Generate (unnormalized) nq-qubit probability distr, summing partial accumulated probability into nthread ~equal bins
    // Don't include this in timing (I assume that the accumulating partial probability was performed during last sweep over the wavefn before the measurement)
    int js[nthread + 1];
    js[0] = 0;
    for (int i = 0, t = 0; t < nthread; t++)
    {
        partialtotal[t] = 0.0;
        js[t + 1] = (t + 1) * (1 << nq) / nthread;
        for (; i < js[t + 1]; i++)
            partialtotal[t] += (probs[i] = uniform(generator));
        totalprob += partialtotal[t];
    }

    gsl_rng *gslgen = gsl_rng_alloc(gsl_rng_taus);
    // std::binomial_distribution<int> binom;
    auto start = high_resolution_clock::now();

    // Do a single-threaded multinomial to divide the shots among the bins 
    int r = nshot, offsets[nthread + 1];
    offsets[0] = 0;
    for (int t = 0; t < nthread - 1; t++)
    {
        int nsucc = gsl_ran_binomial(gslgen, partialtotal[t] / totalprob, r);
        offsets[t + 1] = offsets[t] + nsucc;
        r -= nsucc;
        totalprob -= partialtotal[t];
    }
    offsets[nthread] = nshot;



    // Do a parallel set of multinomials to sample from each of the bins in parallel. This should be the rate-limiting step
#pragma omp parallel num_threads(nthread)
    {
        int t = omp_get_thread_num();
        int off = offsets[t], r = offsets[t + 1] - offsets[t], jn = js[t + 1], s;
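        gsl_rng *gslgen1 = gsl_rng_alloc(gsl_rng_taus);
        gsl_rng_set(gslgen1, 1000 + t); // seed per thread; otherwise every thread draws the same random stream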
        double partial = partialtotal[t];

        for (int j = js[t]; j < jn - 1; j++)
        {
            // s = binom(generator, std::binomial_distribution<int>::param_type(r, probs[j]/totalprob));
            s = gsl_ran_binomial(gslgen1, probs[j] / partial, r);
            r -= s;
            for (int k = 0; k < s; k++)
                samples[off++] = j;
            if (!r)
                break;
            partial -= probs[j];
        }
        for (int k = 0; k < r; k++)
            samples[off++] = jn - 1;
    }


    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<milliseconds>(stop - start);
    std::cout << duration.count() << std::endl;
    /*     for(int i=0; i<nshot; i++){
        std::cout << samples[i] << std::endl;
    } */
    return 0;
}
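The single-threaded step that divides the shots among the bins can also be written without GSL. The sketch below is an illustration only: the helper name `split_shots_among_bins` is hypothetical, and it uses `std::binomial_distribution` from the C++ standard library instead of `gsl_ran_binomial`, which the comment above suggests may be slower.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: distributes `shots` among bins with (unnormalized)
// weights `bin_weights` using a chain of conditional binomial draws.
std::vector<long> split_shots_among_bins(const std::vector<double>& bin_weights,
                                         long shots, std::mt19937_64& rng) {
    double remaining_weight = 0.0;
    for (double w : bin_weights) remaining_weight += w;

    std::vector<long> counts(bin_weights.size(), 0);
    long remaining_shots = shots;
    for (std::size_t t = 0; t + 1 < bin_weights.size() && remaining_shots > 0; ++t) {
        // Probability of bin t among the bins not yet processed; guard
        // against rounding pushing the ratio outside [0, 1].
        double p = remaining_weight > 0.0
                       ? std::min(1.0, bin_weights[t] / remaining_weight)
                       : 1.0;
        std::binomial_distribution<long> binom(remaining_shots, p);
        counts[t] = binom(rng);
        remaining_shots -= counts[t];
        remaining_weight -= bin_weights[t];
    }
    // Whatever is left belongs to the last bin.
    if (!counts.empty()) counts.back() += remaining_shots;
    return counts;
}
```

A prefix sum over the returned counts gives the per-bin offsets into the output array, which is the role `offsets` plays in the demo. Each bin can then be sampled independently by its own thread with its own random number generator.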

hhorii requested review from atilag and chriseclectic as code owners July 13, 2020 15:39

hhorii force-pushed the sampling_opt branch 4 times, most recently from 3f9c6c3 to 8952b07 Compare July 14, 2020 13:10

hhorii changed the title ~~add optimization of sampling based on NUMA architecture~~ WIP: add optimization of sampling based on NUMA architecture Jul 14, 2020

hhorii force-pushed the sampling_opt branch from 775af7c to 341cd8f Compare July 14, 2020 14:25

hhorii changed the title ~~WIP: add optimization of sampling based on NUMA architecture~~ add optimization of sampling based on NUMA architecture Jul 14, 2020

yaelbh mentioned this pull request Jul 19, 2020

Reduce complexity of sample_measure #836

Closed

hhorii force-pushed the sampling_opt branch 9 times, most recently from da633b0 to 2f5d8f3 Compare July 21, 2020 16:11

hhorii force-pushed the sampling_opt branch 3 times, most recently from 044019b to 986a37f Compare August 26, 2020 02:17

hhorii changed the title ~~add optimization of sampling based on NUMA architecture~~ add sampling options of iterative and binary search Aug 26, 2020

hhorii force-pushed the sampling_opt branch from 986a37f to 6d00053 Compare August 27, 2020 02:55

add sampling variations

77c6039

hhorii force-pushed the sampling_opt branch from 6d00053 to 77c6039 Compare December 15, 2020 11:07

hhorii requested a review from vvilpas as a code owner December 15, 2020 11:07

chriseclectic added this to the Aer 0.8 milestone Feb 16, 2021

chriseclectic removed this from the Aer 0.8 milestone Mar 23, 2021

yaelbh mentioned this pull request Aug 18, 2022

[WIP] Improve speed of sample_counts from O(N) to O(1) Qiskit/qiskit#8547

Draft

4 tasks
