Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

boost::histogram::axis::variant, allow users to choose between sorted_array+std::lower_bound and eytzinger_layout+eytzinger_binary_search #369

Open
jhcarl0814 opened this issue Dec 8, 2022 · 14 comments

Comments

@jhcarl0814
Copy link
Contributor

According to the test,

(boost::histogram::axis::variant is currently using std::upper_bound.)

comparison (x86-64 gcc (trunk), -std=c++23 -O3 -Wall -pedantic -pthread):

#pragma GCC optimize("O3")
#include <bits/stdc++.h>

constexpr size_t n = (1<<20);
constexpr size_t block_size = 64 / sizeof(int); // = 64 / 4 = cache_line_size / sizeof(int)
alignas(64) int a[n], b[n+1];

constexpr size_t query_count=(1<<20);
int targets[query_count];
int results_eytzinger[query_count];
int results_lower_bound[query_count];

int eytzinger(int i = 0, size_t k = 1) {
    if (k <= n) {
        i = eytzinger(i, 2 * k);
        b[k] = a[i++];
        i = eytzinger(i, 2 * k + 1);
    }
    return i;
}

int search(int x) {
    size_t k = 1;
    while (k <= n) {
        __builtin_prefetch(b + k * block_size);
        k = 2 * k + (b[k] < x);
    }
    k >>= __builtin_ffs(~k);
    return k;
}

int main()
{
    std::random_device rd;  //Will be used to obtain a seed for the random number engine
    std::mt19937 gen(rd()); //Standard mersenne_twister_engine seeded with rd()
    std::uniform_int_distribution<> distrib(std::numeric_limits<int>::min(), std::numeric_limits<int>::max());
    for (size_t i = 0; i != n; ++i)
    {
        a[i]=std::round(distrib(gen));
    }
    std::sort(std::begin(a),std::end(a));

    eytzinger();

    for (size_t i = 0; i != query_count; ++i)
    {
        targets[i]=std::round(distrib(gen));
    }

    {
        auto start = std::chrono::steady_clock::now();
        for (size_t i = 0; i != query_count; ++i)
        {
            results_eytzinger[i]=search(targets[i]);
        }
        auto end = std::chrono::steady_clock::now();
        std::chrono::duration<double> elapsed_seconds = end-start;
        std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
    }
    {
        auto start = std::chrono::steady_clock::now();
        for (size_t i = 0; i != query_count; ++i)
        {
            results_lower_bound[i]=std::lower_bound(std::begin(a),std::end(a),targets[i])-std::begin(a);
        }
        auto end = std::chrono::steady_clock::now();
        std::chrono::duration<double> elapsed_seconds = end-start;
        std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
    }
    for (size_t i = 0; i != query_count; ++i)
    {
        assert(a[results_lower_bound[i]]==b[results_eytzinger[i]]);   
    }
}

image

@HDembinski
Copy link
Collaborator

HDembinski commented Dec 8, 2022

Hi, thanks for bringing this up, this sounds pretty cool. Are you interested in contributing a patch? We could make a compile-time option for the builtin variant axis so that a user can switch between eytzinger and standard binary search.

Before we are doing this, some more research is necessary, however. We need some benchmarks that show the performance on "typical" distributions that one encounters in the wild. Uniform is a rather artificial distribution that is not usually encounted. Gaussian occurs often (as the prototype for "peaked" distributions). Then there are exponential and power-law distributions.

A platform-independent implementation in Boost cannot use GNU extensions. Can you write an implementation and benchmark it without the use of __builtin_prefetch and __builtin_ffs? An optimising compiler should replace naive code into specialized instructions automatically if they are available on the target platform.

@HDembinski
Copy link
Collaborator

HDembinski commented Dec 8, 2022

Please also keep in mind that an axis in practice is rarely larger than about 10000 entries. A typical axis has 100 bins. So we are talking about a situation where the sorted array typically fits into the L1 cache. Your "axis" array has ~1M entries, which is not realistic. I tried the benchmark with smaller numbers and still found that eytzinger was 1.5-4 faster.

@HDembinski
Copy link
Collaborator

Your implementation returns the value in the sorted array for eytzinger, but we need the position of the value instead of the value itself.

@HDembinski
Copy link
Collaborator

HDembinski commented Dec 8, 2022

Here is an implementation that is closer to reality. Link to Godbolt

#include <bits/stdc++.h>
#include <array>
#include <vector>
#include <boost/core/lightweight_test.hpp>
#include <iostream>

class eytzinger {
public:
    eytzinger(const std::vector<double>& a) :
        a_(a), b_(a.size() + 1), idx_(a.size() + 1) {
        init();
        idx_[(*this)(a.back()+1)] = a.size();
    }

    int operator()(int x) const {
        size_t k = 1;
        while (k <= a_.size()) {
            k = 2 * k + (b_[k] < x);
        }
        k >>= __builtin_ffs(~k);
        return idx_[k];
    }

private:
    int init(size_t i = 0, size_t k = 1) {
        if (k <= a_.size()) {
            i = init(i, 2 * k);
            idx_[k] = i;
            b_[k] = a_[i++];
            i = init(i, 2 * k + 1);
        }
        return i;
    }

    const std::vector<double>& a_;
    std::vector<double> b_;
    std::vector<int> idx_;
};

class stopwatch {
public:
    stopwatch(const char* message, double* duration = nullptr) : 
        message_{message},
        duration_{duration},
        start_{std::chrono::high_resolution_clock::now()} {
    }
    
    ~stopwatch() {
        std::chrono::duration<double> elapsed = std::chrono::high_resolution_clock::now() - start_;
        if (duration_) *duration_ = elapsed.count();
        if (message_[0])
            std::cout << message_ << elapsed.count() << std::endl;
    }

private:
    const char* message_;
    double* duration_;
    std::chrono::time_point<std::chrono::high_resolution_clock> start_;
};

constexpr size_t query_count=(1<<20);
int targets[query_count];
int results_eytzinger[query_count];
int results_lower_bound[query_count];

template <class Dist>
void run_tests(const char* title, Dist& dist) {
    std::cout << title << std::endl;

    std::mt19937 gen(1);

    size_t n_list[] = {2, 10, 100, 1000, 10000};

    for (const auto& n : n_list) {
        std::cout << "  n " << n << std::endl;
        std::vector<double> a(n);

        for (size_t i = 0; i != n; ++i)
            a[i] = dist(gen);

        std::sort(std::begin(a), std::end(a));

        for (size_t i = 0; i != query_count; ++i)
            targets[i] = dist(gen);

        double t1, t2;
        {
            eytzinger search(a);
            stopwatch _{"    eytzinger   elapsed time: ", &t1};
            for (size_t i = 0; i != query_count; ++i)
                results_eytzinger[i]=search(targets[i]);
        }
        {
            stopwatch _{"    lower_bound elapsed time: ", &t2};
            for (size_t i = 0; i != query_count; ++i)
                results_lower_bound[i]=std::lower_bound(std::begin(a), std::end(a), targets[i]) - std::begin(a);
        }

        std::cout << "    ratio                   : " << t2 / t1 << std::endl;

        BOOST_TEST_ALL_EQ(
            std::begin(results_lower_bound), 
            std::end(results_lower_bound),
            std::begin(results_eytzinger),
            std::end(results_eytzinger)
        );   
    }
}

int main()
{
    std::uniform_real_distribution<> uni;
    run_tests("uniform", uni);

    std::normal_distribution<> norm;
    run_tests("normal", norm);

    std::exponential_distribution<> expo;
    run_tests("exponential", expo);

    std::chi_squared_distribution<> chi2(5.0);
    run_tests("chi2", chi2);

    return boost::report_errors();
}
  • It encapsulates the eytzinger logic and the creating of the temporary array b in a class. The size of arrays a and b are not known at compile time, so we need to use vectors.
  • Bin edges are doubles in general, not integers.
  • eytzinger must return the index and not the bin edge. This costs us another array lookup, since the internal index k must be translated back into the an index in the original array. If there is a faster way to handle this, let me know.

The results seem fairly noisy on godbolt, so this should be checked for stability. Still, depending on the distribution, eytzinger seems a bit faster, from 20 % up to a factor 2.

@jhcarl0814
Copy link
Contributor Author

@HDembinski Thank you for the implementation. I'm trying to solve some of the problems while there are some open issues.

find first set

Here are some implementations of find_first_set from BitScan - Chessprogramming wiki and Bit Twiddling Hacks. I'm using the second one in other tests.
ffs (x86-64 gcc (trunk), -std=c++23 -O2 -Wall -pedantic -pthread):

#include <bits/stdc++.h>
#include <boost/core/lightweight_test.hpp>
#include <iostream>

int ffs1(uint64_t v)//Sean Eron Anderson, Jim Cole, Jason Cunningham https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightLinear
{
    int c;
    if (v)
    {
        v = (v ^ (v - 1)) >> 1;  // Set v's trailing 0s to 1s and zero rest
        for (c = 1; v; c++)
        {
            v >>= 1;
        }
    }
    else
    {
        c = 0;
    }
    return c;
}

int ffs2(uint64_t v)//Sean Eron Anderson, Bill Burdick https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightParallel
{
    if(v==0) return 0;
    int c = 64; // c will be the number of zero bits on the right
    v &= -int64_t(v);
    // if (v) c--;
    if (v & 0x00000000FFFFFFFF) c -= 32;
    if (v & 0x0000FFFF0000FFFF) c -= 16;
    if (v & 0x00FF00FF00FF00FF) c -= 8;
    if (v & 0x0F0F0F0F0F0F0F0F) c -= 4;
    if (v & 0x3333333333333333) c -= 2;
    if (v & 0x5555555555555555) c -= 1;
    return c;
}

int ffs3(uint64_t v)//Sean Eron Anderson, Matt Whitlock, Andrew Shapira https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightBinSearch
{
    if(v==0) return 0;
    int c;  // c will be the number of zero bits on the right,
            // so if v is 1101000 (base 2), then c will be 3
    // NOTE: if 0 == v, then c = 31.
    if (v & 0x1) 
    {
        // special case for odd v (assumed to happen half of the time)
        c = 1;
    }
    else
    {
        c = 2;
        if ((v & 0xffffffff) == 0) 
        {  
            v >>= 32;  
            c += 32;
        }
        if ((v & 0xffff) == 0) 
        {  
            v >>= 16;  
            c += 16;
        }
        if ((v & 0xff) == 0) 
        {  
            v >>= 8;  
            c += 8;
        }
        if ((v & 0xf) == 0) 
        {  
            v >>= 4;
            c += 4;
        }
        if ((v & 0x3) == 0) 
        {  
            v >>= 2;
            c += 2;
        }
        c -= v & 0x1;
    }
    return c;
}

int ffs4(uint64_t bb)//Martin Läuter (1997), Charles E. Leiserson, Harald Prokop, Keith H. Randall https://www.chessprogramming.org/BitScan#With_isolated_LS1B
{
    if(bb==0) return 0;
    constexpr int index64[64] = {
        0,  1, 48,  2, 57, 49, 28,  3,
        61, 58, 50, 42, 38, 29, 17,  4,
        62, 55, 59, 36, 53, 51, 43, 22,
        45, 39, 33, 30, 24, 18, 12,  5,
        63, 47, 56, 27, 60, 41, 37, 16,
        54, 35, 52, 21, 44, 32, 23, 11,
        46, 26, 40, 15, 34, 20, 31, 10,
        25, 14, 19,  9, 13,  8,  7,  6
    };
    constexpr uint64_t debruijn64 = 0x03f79d71b4cb0a89;
    return index64[((bb & -bb) * debruijn64) >> 58]+1;
}

int ffs5(uint64_t bb)//Kim Walisch (2012) https://www.chessprogramming.org/BitScan#With_separated_LS1B
{
    if(bb==0) return 0;
    constexpr int index64[64] = {
        0, 47,  1, 56, 48, 27,  2, 60,
        57, 49, 41, 37, 28, 16,  3, 61,
        54, 58, 35, 52, 50, 42, 21, 44,
        38, 32, 29, 23, 17, 11,  4, 62,
        46, 55, 26, 59, 40, 36, 15, 53,
        34, 51, 20, 43, 31, 22, 10, 45,
        25, 39, 14, 33, 19, 30,  9, 24,
        13, 18,  8, 12,  7,  6,  5, 63,
    };
    constexpr uint64_t debruijn64 = 0x03f79d71b4cb0a89;
    return index64[((bb ^ (bb-1)) * debruijn64) >> 58]+1;
}

class stopwatch {
public:
    stopwatch(const char* message, double* duration = nullptr) : 
        message_{message},
        duration_{duration},
        start_{std::chrono::high_resolution_clock::now()} {
    }
    
    ~stopwatch() {
        std::chrono::duration<double> elapsed = std::chrono::high_resolution_clock::now() - start_;
        if (duration_) *duration_ = elapsed.count();
        if (message_[0])
            std::cout << message_ << elapsed.count() << std::endl;
    }

private:
    const char* message_;
    double* duration_;
    std::chrono::time_point<std::chrono::high_resolution_clock> start_;
};

constexpr size_t query_count=(1<<23);
uint64_t targets[query_count];
// int results_ffs1[query_count];
// int results_ffs2[query_count];
// int results_ffs3[query_count];
// int results_ffs4[query_count];
// int results_ffs5[query_count];
int result_ffs1, result_ffs2, result_ffs3, result_ffs4, result_ffs5;

void run_tests() {
    std::uniform_int_distribution<uint64_t> dist(std::numeric_limits<uint64_t>::min(), std::numeric_limits<uint64_t>::max());
    std::mt19937 gen(1);
    for (size_t i = 0; i != query_count; ++i)
        targets[i] = dist(gen);

    double t1, t2, t3, t4, t5;
    {
        stopwatch _{"ffs1 elapsed time: ", &t1};
        for (size_t i = 0; i != query_count; ++i)
            // results_ffs1[i]=ffs1(targets[i]);
            result_ffs1+=ffs1(targets[i]);
    }
    {
        stopwatch _{"ffs2 elapsed time: ", &t2};
        for (size_t i = 0; i != query_count; ++i)
            // results_ffs2[i]=ffs2(targets[i]);
            result_ffs2+=ffs2(targets[i]);
    }
    {
        stopwatch _{"ffs3 elapsed time: ", &t3};
        for (size_t i = 0; i != query_count; ++i)
            // results_ffs3[i]=ffs3(targets[i]);
            result_ffs3+=ffs3(targets[i]);
    }
    {
        stopwatch _{"ffs4 elapsed time: ", &t4};
        for (size_t i = 0; i != query_count; ++i)
            // results_ffs4[i]=ffs4(targets[i]);
            result_ffs4+=ffs4(targets[i]);
    }
    {
        stopwatch _{"ffs5 elapsed time: ", &t5};
        for (size_t i = 0; i != query_count; ++i)
            // results_ffs5[i]=ffs5(targets[i]);
            result_ffs5+=ffs5(targets[i]);
    }

    // BOOST_TEST_ALL_EQ(
    //     std::begin(results_ffs1), 
    //     std::end(results_ffs1),
    //     std::begin(results_ffs2),
    //     std::end(results_ffs2)
    // );
    // BOOST_TEST_ALL_EQ(
    //     std::begin(results_ffs1), 
    //     std::end(results_ffs1),
    //     std::begin(results_ffs3),
    //     std::end(results_ffs3)
    // );
    // BOOST_TEST_ALL_EQ(
    //     std::begin(results_ffs1), 
    //     std::end(results_ffs1),
    //     std::begin(results_ffs4),
    //     std::end(results_ffs4)
    // );
    // BOOST_TEST_ALL_EQ(
    //     std::begin(results_ffs1), 
    //     std::end(results_ffs1),
    //     std::begin(results_ffs5),
    //     std::end(results_ffs5)
    // );
    BOOST_TEST_EQ(result_ffs1,result_ffs2);
    BOOST_TEST_EQ(result_ffs1,result_ffs3);
    BOOST_TEST_EQ(result_ffs1,result_ffs4);
    BOOST_TEST_EQ(result_ffs1,result_ffs5);
}

int main()
{
    run_tests();

    return boost::report_errors();
}

prefetch

Did not find a cross-platform solution.

alignment

replacing

std::vector<T> b_;

with

#include<boost/align/aligned_allocator.hpp>
#include<new>
std::vector<T, boost::alignment::aligned_allocator<T, std::hardware_constructive_interference_size>> b_;

should improve the result. But the test environment is too unstable to observe the difference.

convert from breadth-index to inorder-index

Currently the lookup table method is the fastest.
I implemented a converting function hoping it can save the space of the index lookup table. But as there are too many variables and branches, the code is only half as fast as std::lower_bound. (std::lower_bound becomes much much slower when adding one more 0 at the end of n, but as you said we should also take care of the situations when n is not large.) I will come back later to see if this is optimizable.
using eytzinger_one_based_to_inorder_zero_based in place of lookup table (x86-64 gcc (trunk), -std=c++23 -O2 -Wall -pedantic -pthread):

#include<iostream>
#include<algorithm>
#include<vector>
#include<random>
#include<new>
#include<chrono>
#include<boost/align/aligned_allocator.hpp>
#include<boost/core/lightweight_test.hpp>
#include<climits>

int ffs(uint64_t v)//Sean Eron Anderson, Bill Burdick https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightParallel
{
    if(v==0) return 0;
    int c = 64; // c will be the number of zero bits on the right
    v &= -int64_t(v);
    // if (v) c--;
    if (v & 0x00000000FFFFFFFF) c -= 32;
    if (v & 0x0000FFFF0000FFFF) c -= 16;
    if (v & 0x00FF00FF00FF00FF) c -= 8;
    if (v & 0x0F0F0F0F0F0F0F0F) c -= 4;
    if (v & 0x3333333333333333) c -= 2;
    if (v & 0x5555555555555555) c -= 1;
    return c;
}

int find_last_set(size_t v)
{
	return CHAR_BIT*sizeof(size_t)-__builtin_clzl(v);
}
// int find_last_set(uint64_t v)//Sean Eron Anderson https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
// {
// 	if(v==0) return 0;
// 	int r = 1;
// 	while (v >>= 1)
// 	    r++;
// 	return r;
// }
size_t eytzinger_one_based_to_inorder_zero_based(size_t tree_element_count, size_t tree_row_count, size_t eytzinger_one_based, size_t eytzinger_one_based_digit_count)
{
    if(tree_row_count==0)
        std::unreachable();
    size_t result=0;
    while(tree_row_count!=1)
    {
        size_t bottom_row_element_count=tree_element_count-((1zu<<(tree_row_count-1))-1);
        // size_t left_subtree_element_count=(1zu<<(tree_row_count-2))-1 + std::min(bottom_row_element_count,1zu<<(tree_row_count-2));
        size_t left_subtree_element_count=(1zu<<(tree_row_count-2))-1 + ((bottom_row_element_count-1)&(1zu<<(tree_row_count-2))?1zu<<(tree_row_count-2):bottom_row_element_count);
        size_t right_subtree_element_count=tree_element_count-1-left_subtree_element_count;
        if(eytzinger_one_based==1)
        {
            result+=left_subtree_element_count;
            break;
        }
        else if((eytzinger_one_based & (1zu<<(eytzinger_one_based_digit_count-2)))==0)
        {
            std::tie(tree_element_count,tree_row_count,eytzinger_one_based,eytzinger_one_based_digit_count)=std::forward_as_tuple(
                left_subtree_element_count,
                tree_row_count-1,
                (eytzinger_one_based&~(1zu<<(eytzinger_one_based_digit_count-1)))|(1zu<<(eytzinger_one_based_digit_count-2)),
                eytzinger_one_based_digit_count-1);
        }
        else if((eytzinger_one_based & (1zu<<(eytzinger_one_based_digit_count-2)))!=0)
        {
            result+=left_subtree_element_count+1;
            // std::tie(tree_element_count,tree_row_count,eytzinger_one_based,eytzinger_one_based_digit_count)=std::forward_as_tuple(
            //     right_subtree_element_count,
            //     bottom_row_element_count>1zu<<(tree_row_count-2)?tree_row_count-1:tree_row_count-2,
            //     eytzinger_one_based&~(1zu<<(eytzinger_one_based_digit_count-1)),
            //     eytzinger_one_based_digit_count-1);
            std::tie(tree_element_count,tree_row_count,eytzinger_one_based,eytzinger_one_based_digit_count)=std::forward_as_tuple(
                right_subtree_element_count,
                ((bottom_row_element_count-1)&(1zu<<(tree_row_count-2)))?tree_row_count-1:tree_row_count-2,
                eytzinger_one_based&~(1zu<<(eytzinger_one_based_digit_count-1)),
                eytzinger_one_based_digit_count-1);
        }
        else
            std::unreachable();
    }
    return result;

    // if(tree_row_count==0)
    //     std::unreachable();
    // else if(tree_row_count==1)
    //     return 0;
    // else if(tree_row_count>=2)
    // {
    //     size_t bottom_row_element_count=tree_element_count-((1zu<<(tree_row_count-1))-1);
    //     size_t left_subtree_element_count=(1zu<<(tree_row_count-2))-1 + std::min(bottom_row_element_count,1zu<<(tree_row_count-2));
    //     size_t right_subtree_element_count=tree_element_count-1-left_subtree_element_count;
    //     if(eytzinger_one_based==0)
    //         std::unreachable();
    //     else if(eytzinger_one_based==1)
    //         return left_subtree_element_count;
    //     else if((eytzinger_one_based & (1zu<<(eytzinger_one_based_digit_count-2)))==0)
    //     {
    //         return eytzinger_one_based_to_inorder_zero_based(
    //             left_subtree_element_count,
    //             tree_row_count-1,
    //             (eytzinger_one_based&~(1zu<<(eytzinger_one_based_digit_count-1)))|(1zu<<(eytzinger_one_based_digit_count-2)),
    //             eytzinger_one_based_digit_count-1);
    //     }
    //     else if((eytzinger_one_based & (1zu<<(eytzinger_one_based_digit_count-2)))!=0)
    //     {
    //         return left_subtree_element_count+1+eytzinger_one_based_to_inorder_zero_based(
    //             right_subtree_element_count,
    //             bottom_row_element_count>1zu<<(tree_row_count-2)?tree_row_count-1:tree_row_count-2,
    //             eytzinger_one_based&~(1zu<<(eytzinger_one_based_digit_count-1)),
    //             eytzinger_one_based_digit_count-1);
    //     }
    //     else
    //         std::unreachable();
    // }
    // else
    //     std::unreachable();
}

template<typename T>
struct eytzinger {
    eytzinger(const std::vector<T>& a) :
        a_(a), b_(a.size() + 1)
        // , idx_(a.size() + 1)
    {
        init();
        // idx_[0] = a.size();
    }

    auto operator()(auto const& x) const {
        size_t k = 1;
        while (k <= a_.size()) {
            k = 2 * k + (b_[k] < x);
        }
        k >>= ffs(~k);
        if(k==0)
            return a_.end();
        else
            return a_.begin()+eytzinger_one_based_to_inorder_zero_based(a_.size(),find_last_set(a_.size()),k,find_last_set(k));
        // return a_.begin()+idx_[k];
    }

    size_t init(size_t i = 0, size_t k = 1) {
        if (k <= a_.size()) {
            i = init(i, 2 * k);
            // idx_[k] = i;
            b_[k] = a_[i++];
            i = init(i, 2 * k + 1);
        }
        return i;
    }

    const std::vector<T>& a_;
    std::vector<T, boost::alignment::aligned_allocator<T, std::hardware_constructive_interference_size>> b_;
    // std::vector<size_t> idx_;
};

class stopwatch {
public:
    stopwatch(const char* message, double* duration = nullptr) : 
        message_{message},
        duration_{duration},
        start_{std::chrono::high_resolution_clock::now()} {
    }
    
    ~stopwatch() {
        std::chrono::duration<double> elapsed = std::chrono::high_resolution_clock::now() - start_;
        if (duration_) *duration_ = elapsed.count();
        if (message_[0])
            std::cout << message_ << elapsed.count() << std::endl;
    }

private:
    const char* message_;
    double* duration_;
    std::chrono::time_point<std::chrono::high_resolution_clock> start_;
};

template <class Dist>
void run_tests(const char* title, Dist& dist, size_t query_count) {
    std::cout << title << std::endl;

    std::mt19937 gen(1);

    for (size_t n : {2, 10, 100, 1000, 10000}) {
        std::cout << "  n " << n << std::endl;
        std::vector<double> data(n);
        std::vector<double> targets(query_count);
        std::vector<size_t> results_eytzinger(query_count);
        std::vector<size_t> results_lower_bound(query_count);

        for (size_t i = 0; i != n; ++i)
            data[i] = dist(gen);

        std::sort(std::begin(data), std::end(data));

        for (size_t i = 0; i != query_count; ++i)
            targets[i] = dist(gen);

        double t1, t2;
        {
            eytzinger search(data);
            stopwatch _{"    eytzinger   elapsed time: ", &t1};
            for (size_t i = 0; i != query_count; ++i)
                results_eytzinger[i]=search(targets[i]) - std::begin(data);
        }
        {
            stopwatch _{"    lower_bound elapsed time: ", &t2};
            for (size_t i = 0; i != query_count; ++i)
                results_lower_bound[i]=std::lower_bound(std::begin(data), std::end(data), targets[i]) - std::begin(data);
        }

        std::cout << "    ratio                   : " << t2 / t1 << std::endl;

        BOOST_TEST_ALL_EQ(
            std::begin(results_lower_bound), 
            std::end(results_lower_bound),
            std::begin(results_eytzinger),
            std::end(results_eytzinger)
        );   
    }
}

int main() {
    std::uniform_real_distribution<> uni;
    run_tests("uniform", uni, 1<<20);

    std::normal_distribution<> norm;
    run_tests("normal", norm, 1<<20);

    std::exponential_distribution<> expo;
    run_tests("exponential", expo, 1<<20);

    std::chi_squared_distribution<> chi2(5.0);
    run_tests("chi2", chi2, 1<<20);

    return boost::report_errors();
}

@HDembinski
Copy link
Collaborator

HDembinski commented Dec 12, 2022

Thanks for the additional research on this. Please try use C++14 when you write the benchmarks, because we need to support C++14 in Boost.Histogram. I think we can do without prefetch in general and try to figure out a way to enable it when it is available on the target platform. It shouldn't make a big difference, because the sorted array is likely going to fit into the L1 cache in this applications, so there is no prefetching happening. I think the alignas trick goes into a similar direction and may not be needed here, although using Boost.Align is fine, in principle, too.

The main issue seems to be ffs. If we do not use __builtin_ffs, the performance drops too much to generally replace lower_bound for all n. Am I reading this correctly?

@jhcarl0814
Copy link
Contributor Author

@HDembinski Thank you for the advice and addressing the prefetch and alignment issues.

find first set (problem: compiler does not replace software based implementation with instruction(s) the intrinsic could get)

Sorry I was so stupid that I did not find a fast ffs. I just found out that ffs is already inside boost (boost::core::countr_zero and boost::multiprecision::lsb) and std (c++20 std::countr_zero). They have almost the same speed as __builtin_ffs, so we don't need to worry about compiler not identifying and replacing the code.
boost::core::countr_zero(v)+1 and boost::multiprecision::lsb(v)+1 are as fast as __builtin_ffs(v) (x86-64 gcc (trunk), -std=c++14 -O2 -Wall -pedantic -pthread)

#include <bits/stdc++.h>
#include <boost/core/lightweight_test.hpp>
#include <iostream>
#include<boost/core/bit.hpp>
#include<boost/multiprecision/integer.hpp>

int ffs0(uint64_t v)
{
    return __builtin_ffs(v);
}
int ffs1(uint64_t v)//Sean Eron Anderson, Jim Cole, Jason Cunningham https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightLinear
{
    int c;
    if (v)
    {
        v = (v ^ (v - 1)) >> 1;  // Set v's trailing 0s to 1s and zero rest
        for (c = 1; v; c++)
        {
            v >>= 1;
        }
    }
    else
    {
        c = 0;
    }
    return c;
}

int ffs2(uint64_t v)//Sean Eron Anderson, Bill Burdick https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightParallel
{
    if(v==0) return 0;
    int c = 64; // c will be the number of zero bits on the right
    v &= -int64_t(v);
    // if (v) c--;
    if (v & 0x00000000FFFFFFFF) c -= 32;
    if (v & 0x0000FFFF0000FFFF) c -= 16;
    if (v & 0x00FF00FF00FF00FF) c -= 8;
    if (v & 0x0F0F0F0F0F0F0F0F) c -= 4;
    if (v & 0x3333333333333333) c -= 2;
    if (v & 0x5555555555555555) c -= 1;
    return c;
}

int ffs3(uint64_t v)//Sean Eron Anderson, Matt Whitlock, Andrew Shapira https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightBinSearch
{
    if(v==0) return 0;
    int c;  // c will be the number of zero bits on the right,
            // so if v is 1101000 (base 2), then c will be 3
    // NOTE: if 0 == v, then c = 31.
    if (v & 0x1) 
    {
        // special case for odd v (assumed to happen half of the time)
        c = 1;
    }
    else
    {
        c = 2;
        if ((v & 0xffffffff) == 0) 
        {  
            v >>= 32;  
            c += 32;
        }
        if ((v & 0xffff) == 0) 
        {  
            v >>= 16;  
            c += 16;
        }
        if ((v & 0xff) == 0) 
        {  
            v >>= 8;  
            c += 8;
        }
        if ((v & 0xf) == 0) 
        {  
            v >>= 4;
            c += 4;
        }
        if ((v & 0x3) == 0) 
        {  
            v >>= 2;
            c += 2;
        }
        c -= v & 0x1;
    }
    return c;
}

int ffs4(uint64_t bb)//Martin Läuter (1997), Charles E. Leiserson, Harald Prokop, Keith H. Randall https://www.chessprogramming.org/BitScan#With_isolated_LS1B
{
    if(bb==0) return 0;
    constexpr int index64[64] = {
        0,  1, 48,  2, 57, 49, 28,  3,
        61, 58, 50, 42, 38, 29, 17,  4,
        62, 55, 59, 36, 53, 51, 43, 22,
        45, 39, 33, 30, 24, 18, 12,  5,
        63, 47, 56, 27, 60, 41, 37, 16,
        54, 35, 52, 21, 44, 32, 23, 11,
        46, 26, 40, 15, 34, 20, 31, 10,
        25, 14, 19,  9, 13,  8,  7,  6
    };
    constexpr uint64_t debruijn64 = 0x03f79d71b4cb0a89;
    return index64[((bb & -bb) * debruijn64) >> 58]+1;
}

int ffs5(uint64_t bb)//Kim Walisch (2012) https://www.chessprogramming.org/BitScan#With_separated_LS1B
{
    if(bb==0) return 0;
    constexpr int index64[64] = {
        0, 47,  1, 56, 48, 27,  2, 60,
        57, 49, 41, 37, 28, 16,  3, 61,
        54, 58, 35, 52, 50, 42, 21, 44,
        38, 32, 29, 23, 17, 11,  4, 62,
        46, 55, 26, 59, 40, 36, 15, 53,
        34, 51, 20, 43, 31, 22, 10, 45,
        25, 39, 14, 33, 19, 30,  9, 24,
        13, 18,  8, 12,  7,  6,  5, 63,
    };
    constexpr uint64_t debruijn64 = 0x03f79d71b4cb0a89;
    return index64[((bb ^ (bb-1)) * debruijn64) >> 58]+1;
}

int ffs6(uint64_t v)//Using de Bruijn Sequences to Index 1 in a Computer Word by Charles E. Leiserson, Harald Prokof, and Keith H. Randall https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightModLookup
{
    if(v==0) return 0;
    return boost::core::countr_zero(v)+1;
}

int ffs7(uint64_t v)
{
    if(v==0) return 0;
    return boost::multiprecision::lsb(v)+1;
}

class stopwatch {
public:
    stopwatch(const char* message, double* duration = nullptr) : 
        message_{message},
        duration_{duration},
        start_{std::chrono::high_resolution_clock::now()} {
    }
    
    ~stopwatch() {
        std::chrono::duration<double> elapsed = std::chrono::high_resolution_clock::now() - start_;
        if (duration_) *duration_ = elapsed.count();
        if (message_[0])
            std::cout << message_ << elapsed.count() << std::endl;
    }

private:
    const char* message_;
    double* duration_;
    std::chrono::time_point<std::chrono::high_resolution_clock> start_;
};

constexpr size_t query_count=(1<<23);
uint64_t targets[query_count];
// int results_ffs0[query_count];
// int results_ffs1[query_count];
// int results_ffs2[query_count];
// int results_ffs3[query_count];
// int results_ffs4[query_count];
// int results_ffs5[query_count];
// int results_ffs6[query_count];
// int results_ffs7[query_count];
int result_ffs0, result_ffs1, result_ffs2, result_ffs3, result_ffs4, result_ffs5, result_ffs6, result_ffs7;

void run_tests() {
    std::uniform_int_distribution<uint64_t> dist(std::numeric_limits<uint64_t>::min(), std::numeric_limits<uint64_t>::max());
    std::mt19937 gen(1);
    for (size_t i = 0; i != query_count; ++i)
        targets[i] = dist(gen);

    double t0, t1, t2, t3, t4, t5, t6, t7;
    {
        stopwatch _{"ffs0 elapsed time: ", &t0};
        for (size_t i = 0; i != query_count; ++i)
            // results_ffs0[i]=ffs0(targets[i]);
            result_ffs0+=ffs0(targets[i]);
    }
    {
        stopwatch _{"ffs1 elapsed time: ", &t1};
        for (size_t i = 0; i != query_count; ++i)
            // results_ffs1[i]=ffs1(targets[i]);
            result_ffs1+=ffs1(targets[i]);
    }
    {
        stopwatch _{"ffs2 elapsed time: ", &t2};
        for (size_t i = 0; i != query_count; ++i)
            // results_ffs2[i]=ffs2(targets[i]);
            result_ffs2+=ffs2(targets[i]);
    }
    {
        stopwatch _{"ffs3 elapsed time: ", &t3};
        for (size_t i = 0; i != query_count; ++i)
            // results_ffs3[i]=ffs3(targets[i]);
            result_ffs3+=ffs3(targets[i]);
    }
    {
        stopwatch _{"ffs4 elapsed time: ", &t4};
        for (size_t i = 0; i != query_count; ++i)
            // results_ffs4[i]=ffs4(targets[i]);
            result_ffs4+=ffs4(targets[i]);
    }
    {
        stopwatch _{"ffs5 elapsed time: ", &t5};
        for (size_t i = 0; i != query_count; ++i)
            // results_ffs5[i]=ffs5(targets[i]);
            result_ffs5+=ffs5(targets[i]);
    }
    {
        stopwatch _{"ffs6 elapsed time: ", &t6};
        for (size_t i = 0; i != query_count; ++i)
            // results_ffs6[i]=ffs6(targets[i]);
            result_ffs6+=ffs6(targets[i]);
    }
    {
        stopwatch _{"ffs7 elapsed time: ", &t7};
        for (size_t i = 0; i != query_count; ++i)
            // results_ffs7[i]=ffs7(targets[i]);
            result_ffs7+=ffs7(targets[i]);
    }

    // BOOST_TEST_ALL_EQ(
    //     std::begin(results_ffs1), 
    //     std::end(results_ffs1),
    //     std::begin(results_ffs2),
    //     std::end(results_ffs2)
    // );
    // BOOST_TEST_ALL_EQ(
    //     std::begin(results_ffs1), 
    //     std::end(results_ffs1),
    //     std::begin(results_ffs3),
    //     std::end(results_ffs3)
    // );
    // BOOST_TEST_ALL_EQ(
    //     std::begin(results_ffs1), 
    //     std::end(results_ffs1),
    //     std::begin(results_ffs4),
    //     std::end(results_ffs4)
    // );
    // BOOST_TEST_ALL_EQ(
    //     std::begin(results_ffs1), 
    //     std::end(results_ffs1),
    //     std::begin(results_ffs5),
    //     std::end(results_ffs5)
    // );
    BOOST_TEST_EQ(result_ffs0,result_ffs1);
    BOOST_TEST_EQ(result_ffs0,result_ffs2);
    BOOST_TEST_EQ(result_ffs0,result_ffs3);
    BOOST_TEST_EQ(result_ffs0,result_ffs4);
    BOOST_TEST_EQ(result_ffs0,result_ffs5);
    BOOST_TEST_EQ(result_ffs0,result_ffs6);
    BOOST_TEST_EQ(result_ffs0,result_ffs7);
}

int main()
{
    run_tests();

    return boost::report_errors();
}

convert from breadth-index to inorder-index (in a complete binary tree) (problem: Is it possible to remove the lookup table/compute the index on the fly, while still be faster than std::lower_bound?)

The lookup table method is already faster than std::lower_bound for (size_t n : {2, 10, 100, 1000, 10000}): using a lookup table (x86-64 gcc (trunk), -std=c++14 -O2 -Wall -pedantic -pthread).

Until now I haven't find a fast way to convert from eytzinger index to original index (using a divide and conquer loop (x86-64 gcc (trunk), -std=c++14 -O2 -Wall -pedantic -pthread)).
I'm wondering if it is possible to tamper with the problem, e.g. change from an array-based complete binary tree to an array-based tree where left subtree node count - right subtree node count == 0 or 1, because the tree will still be compact but there will be less arithmetic operations when converting the index:

  • if node_count == 1, eytzinger_one_based_index should be 1, inorder_zero_based_index results in 0;
  • if node_count == 2k+0 (left_subtree_node_count is k, right_subtree_node_count is k-1) or node_count == 2k+1 (left_subtree_node_count is k, right_subtree_node_count is k): (computing left_subtree_node_count and right_subtree_node_count is much simpler: no std::min, only one decrement operation)
    • if eytzinger_one_based_index is 1, inorder_zero_based_index results in left_subtree_node_count;
    • if eytzinger_one_based_index begins with "10", inorder_zero_based_index results in ... (recursion);
    • if eytzinger_one_based_index begins with "11", inorder_zero_based_index results in ... (recursion);

The only work left is to figure out how to build the tampered version of eytzinger layout and search in the tampered version of eytzinger layout.

@HDembinski
Copy link
Collaborator

HDembinski commented Dec 12, 2022

This is nice and impressive work. I am surprised to see that ff6 and ff7 are even seem to be faster than ff0 on average. If we can use boost::core::countr_zero(v) + 1 that is ideal, because boost::core is already a dependency. I was not aware that this function exists in boost::core.

@HDembinski
Copy link
Collaborator

@jhcarl0814 So this is cool, how do we proceed?

I think the results are promising enough that we should do the next step, which is to replace the standard search in axis::variable with the eytzinger search in a PR. Once this is done, we should run some more benchmarks with the new code under realistic conditions. This should give us the data to decide whether the new search should be the default or whether we need to add an option so that a user can select the trade-off they prefer.

Are you interested in submitting a PR? You put so much research into this, so you should be the main author of the PR to give you the credit.

@HDembinski
Copy link
Collaborator

HDembinski commented Dec 22, 2022

The place to construct the data structures is about here:
https://github.com/boostorg/histogram/blob/develop/include/boost/histogram/axis/variable.hpp#L101
We expect that the user provides a sorted list (and throw if that's not the case).

This function computes the index for a real argument:
https://github.com/boostorg/histogram/blob/develop/include/boost/histogram/axis/variable.hpp#L163
There are some complications, because an axis can be circular, and there is a special treatment if the value ends up on the edge of the last bin in some cases. Also note that the code uses upper_bound and then subtracts 1, because we want the intervals to be inclusive on the lower edge and exclusive on the upper edge (except for the last bin in some situations).

Finally, there is also a function to update the axis, because an axis can grow. This should update the eytzinger structure. We don't have to make the update super efficient, because this is not called very often.
https://github.com/boostorg/histogram/blob/develop/include/boost/histogram/axis/variable.hpp#L175

You can read on the concept of an axis type here:
https://www.boost.org/doc/libs/master/libs/histogram/doc/html/histogram/concepts.html#histogram.concepts.Axis
which explains in words what the interface is supposed to do.

@jhcarl0814
Copy link
Contributor Author

@HDembinski Sorry for being late. (The company I work for was testing the new workflow in the past week (make a cms website in 1 week by 1 person given an html5 template) so I didn't make time.)
I'm very interested in submitting a PR! Thank you very much for the links and hints! I'm starting to work on it.

@HDembinski
Copy link
Collaborator

HDembinski commented Dec 22, 2022

Don't worry, I also have a job and cannot always react immediately :). I am looking forward to it. For reference, I removed the experimental parts in your code and removed the vector a from the eytzinger struct. We only need to hold a reference to a during construction of the tree. I replaced the call to std::lower_bound with the real thing, a call to the index function of axis::variable. I need to subtract 1 from the index in eytzinger to get the same behavior.
https://godbolt.org/z/M74e7WeWv

@jhcarl0814
Copy link
Contributor Author

@HDembinski The k = 2 * k + !(x < b_[k]); // double negation to handle nan correctly you used in 8ab311f not only covered nan but also covered bin edges.

using targets = data exposes the problem (i.e. lower_bound - 1 can not replace upper_bound - 1, in other words, lower_bound can not replace upper_bound), previously targets = <random> doesn't expose the problem because double has so many bits that it's almost impossible for <random> to generate exact bin edges;

using targets = data and k = 2 * k + !(x < b_[k]); fixes the problem;

@HDembinski
Copy link
Collaborator

I know that it works also for bin edges, I had that tested in an intermediate version of the local benchmark. Let's continue the discussion in the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants