
Uses Eigen's internal multi-indexing for multi indexing on vectors and matrices #3250

Open
wants to merge 11 commits into base: develop

Conversation

SteveBronder
Collaborator

Submission Checklist

  • Run unit tests: ./runTests.py src/test/unit
  • Run cpplint: make cpplint
  • Declare copyright holder and open-source license: see below

Summary

Uses Eigen 3.4's new non-contiguous indexing expressions so that the rvalue() functions that use multi_index return an indexing expression instead of eagerly creating a new matrix from the indices.

In the Stan language, a loop like the code in (a) is faster than a multi index like the code in (b), because (b) currently has to create a large intermediate matrix for the multi index.

(a)

for (i in 1:N)
  mu[i] = beta * X[idx[i]];

(b)

mu = beta * X[idx];

Eigen 3.4 introduced an expression for non-contiguous indexing, which we can use to avoid the large matrix creation in (b).
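For reference, here is a minimal standalone sketch of the Eigen 3.4 feature in question (the variable names are illustrative, not taken from this PR): indexing a vector with a container of indices returns a lazy indexed view, and nothing is copied until the expression is assigned or otherwise evaluated.

#include <Eigen/Dense>
#include <vector>

int main() {
  Eigen::VectorXd x(5);
  x << 10, 20, 30, 40, 50;
  std::vector<int> idx{4, 0, 2, 2};  // zero-based, non-contiguous, repeats allowed

  // x(idx) builds an indexed-view expression; no temporary vector is created here.
  auto view = x(idx);

  // The expression is only evaluated when it is consumed, e.g. on assignment.
  Eigen::VectorXd y = view;  // y == [50, 10, 30, 30]
  return 0;
}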

Intended Effect

Make vectorized multi indexing as fast as the loop

How to Verify

The existing rvalue tests pass with no changes

Side Effects

Nope

Documentation

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Simons Foundation

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

Member

@WardBrian left a comment


A few questions - might be easier to explain next week over a board.

Have you fed this into your benchmarking framework to make sure it has the desired effect?

Review threads (resolved): src/stan/model/indexing/rvalue.hpp (3), src/test/unit/model/indexing/rvalue_test.cpp (1, outdated)
@WardBrian
Member

@SteveBronder did you get a chance to try this in your stan-benchmark playground?

@SteveBronder
Collaborator Author

I'm waiting until I have the Math PR for this up as well, so that I can benchmark this and the new matrix type all at once.

@stan-buildbot
Contributor


| Name | Old Result | New Result | Ratio | Performance change (1 - new / old) |
|---|---|---|---|---|
| arma/arma.stan | 0.2 | 0.19 | 1.06 | 5.87% faster |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.01 | 0.01 | 1.1 | 8.95% faster |
| gp_regr/gen_gp_data.stan | 0.02 | 0.02 | 1.07 | 6.34% faster |
| gp_regr/gp_regr.stan | 0.11 | 0.1 | 1.06 | 5.62% faster |
| sir/sir.stan | 77.44 | 75.91 | 1.02 | 1.98% faster |
| irt_2pl/irt_2pl.stan | 3.82 | 3.71 | 1.03 | 2.93% faster |
| eight_schools/eight_schools.stan | 0.05 | 0.05 | 1.06 | 5.86% faster |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.26 | 0.24 | 1.08 | 7.3% faster |
| pkpd/one_comp_mm_elim_abs.stan | 18.31 | 17.54 | 1.04 | 4.21% faster |
| garch/garch.stan | 0.46 | 0.44 | 1.04 | 3.4% faster |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 2.76 | 2.74 | 1.01 | 0.7% faster |
| arK/arK.stan | 1.62 | 1.59 | 1.02 | 2.0% faster |
| gp_pois_regr/gp_pois_regr.stan | 2.53 | 2.41 | 1.05 | 4.99% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 9.13 | 8.88 | 1.03 | 2.72% faster |
| performance.compilation | 174.23 | 178.9 | 0.97 | -2.68% slower |

Mean result: 1.0426783571790479

Jenkins Console Log
Blue Ocean
Commit hash: 934e17704b703a32c4c29f9ab5ae1913c4a58a57


Machine information: Ubuntu 20.04.3 LTS (focal); Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, 80 CPUs (2 sockets, 20 cores/socket, 2 threads/core); g++ 9.4.0; clang 10.0.0.

Member

@WardBrian left a comment


@SteveBronder and I discussed some benchmarking he did (which hopefully he can post a nice graph of) which showed that this change did not immediately lead to the improvements we thought it might - that will require some specializations in the arena matrix representation.

But, it also didn't make things any worse, so I'm approving this on its own now.

@SteveBronder
Collaborator Author

SteveBronder commented Apr 24, 2024

After running some benchmarks, I don't think we should merge this PR yet. The benchmark can be found here. I took the code that brms generates and compared the new and current multi indexing to a loop, as well as to a raw Eigen call that does the same thing.

The plot is faceted by the number of groups (M) and plots vector size N against the average time for each combination of M and N. The indices are set up so that each inner index hits the worst-case scenario of random access to elements in the vectors.

Each line represents

  1. Code generated from a for loop in Stan
  2. Code generated from a multi index in Stan with the current multi index code
  3. Code generated from a multi index in Stan with the new multi index code
  4. Code generated from a multi index in Stan with the new multi index code, removing any checks
  5. Using plain eigen to do the multi index

The results were very surprising to me.

(benchmark plot image)

Looping is very consistent with linear time, but it looks like our multi indexing has a kink in it once we cross 50K observations or so, irrespective of the number of groups. The main reason the new and current multi indexing are slower is the checks, which you can see from the nocheck time.

The loop does not have to do any checks. For a multi index we have to check that each index is valid, i.e. within the range given by the size of the vector. For the current multi index we do that check as we are creating the index:

        return plain_type_t<EigVec>::NullaryExpr(
            idx.ns_.size(), [name, &idx, &v_ref](Eigen::Index i) {
              math::check_range("vector[multi] indexing", name, v_ref.size(),
                                idx.ns_[i]);
              return v_ref.coeff(idx.ns_[i] - 1);
            });

but in the new code we do that check in a separate loop before constructing the expression:

  for (auto idx_i : idx.ns_) {
    math::check_range("vector[multi] indexing", name, v_size, idx_i);
  }

I think iterating over the index for the check is causing the main slowdown. The kink in the graph after 50K most likely happens because the index and vectors no longer fit in the lower levels of cache. Having that check inside or outside of the index creation most likely explains why this PR's multi indexing is slower than the current multi indexing.

The nocheck benchmark does a few things that I think we could do in Stan, from most to least performance gain:

  1. Removes the check. This could be done with a flag defined at compile time. We could also do something smart in the compiler if we wanted to.
    • Since we know the size of the containers we will be indexing into, we could have a check in the constructor that validates against the size once. Then we know that the index will be correct for all future operations.
  2. Removes multi_index. The multi_index class is really just a holder for a vector with nothing else, but constructing that class requires a copy of the indexing vector. We would need to write rvalue() functions that accept vectors as the index, and then the compiler would need to generate the right code.
  3. Removes the deep_copy(mu) normally generated by the code. We don't manipulate mu here, so it is safe to remove. When and when not to deep copy is kind of tricky. We want the deep copy when:
    • The left-hand side and right-hand side use the same object (mu here for the +=)
    • The right-hand side accesses the data used in the lhs in a non-linearly-increasing and/or non-sequential order
      • (e.g. mu[1:3] += mu[{1, 1, 1}] * b, where the first indexed value changes after the first assignment)

I think 1–3 are very feasible and the speed gain seems well worth it (a rough sketch of 1 and 2 is below).
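As an illustration of what 1 and 2 could look like (the function name and the use of a STAN_NO_RANGE_CHECKS-style guard here are assumptions for the sketch, not the actual signatures in this PR): take the raw index vector instead of a multi_index wrapper, validate it once up front, and let a compile-time flag remove the check entirely.

// Illustrative sketch only; not the rvalue() code in this PR.
#include <Eigen/Dense>
#include <stdexcept>
#include <string>
#include <vector>

inline Eigen::VectorXd rvalue_multi(const Eigen::VectorXd& v, const char* name,
                                    const std::vector<int>& idx) {
#ifndef STAN_NO_RANGE_CHECKS
  // Validate the whole index in one pass up front instead of once per element access.
  for (int i : idx) {
    if (i < 1 || i > v.size()) {
      throw std::out_of_range(std::string(name) + "[multi] index out of range");
    }
  }
#endif
  Eigen::VectorXd out(idx.size());
  for (Eigen::Index i = 0; i < out.size(); ++i) {
    out[i] = v[idx[i] - 1];  // Stan indices are 1-based
  }
  return out;  // eager result for simplicity; the real version could return an expression
}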

@SteveBronder
Collaborator Author

SteveBronder commented Apr 24, 2024

But my end recommendation is that we close this PR without merging. I can write up a separate PR that does 1 and 2 from the list above. We can add those signatures and then add the necessary components to the compiler.

The good news is we have a reasonable path forward that should give an order-of-magnitude speedup to any brms model that does mixed effects!

@WardBrian
Member

I am still not sure if the benchmark you posted represents @bob-carpenter's original case he was concerned with, since it doesn't ever end up calling an lpdf. The concern as I understand it was that when these matrices are constructed for use in a distribution, one way of writing it ends up leading to an un-vectorized lpdf, which goes against the other advice we give.

@SteveBronder
Collaborator Author

Doing the multi index inside an lpdf would give the same effects

@SteveBronder reopened this Apr 25, 2024
@SteveBronder
Collaborator Author

Actually @WardBrian I want to reopen this. Running the benchmarks on my mac, just turning on STAN_NO_RANGE_CHECKS with the new code gives us a very good speedup.

(benchmark plot image)

@bob-carpenter
Contributor

The plots make it look like the Eigen approach is the best. Am I missing something?

Also, what specifically are you evaluating? I see the label on one of the plots involves multi-indexing with an array and then elementwise multiplication? Usually we just do the indexing for random effects, so I'm not sure where the elementwise multiply is coming from. Normally it'd just be foo[idxs] + bar[idxs2] + ... for varying effects.

I had thought part of the problem with multi-indexing was creating an intermediate, but it's now returning a template expression created by binding.

template <typename EigVec, require_eigen_vector_t<EigVec>* = nullptr>
inline auto rvalue(EigVec&& v, const char* name, const index_multi& idx) {
  return stan::math::make_holder(
      [name, &idx](auto& v_ref) {
        return plain_type_t<EigVec>::NullaryExpr(
            idx.ns_.size(), [name, &idx, &v_ref](Eigen::Index i) {
              math::check_range("vector[multi] indexing", name, v_ref.size(),
                                idx.ns_[i]);
              return v_ref.coeff(idx.ns_[i] - 1);
            });
      },
      stan::math::to_ref(v));
}

Instead, it looks like the issue is creating the nested NullaryExpr for every single argument. Won't that be at least as expensive as the range check?

With the Eigen approach, how do you avoid the - 1 for indexing?

@SteveBronder
Collaborator Author

SteveBronder commented Apr 25, 2024

The plots make it look like the Eigen approach is the best. Am I missing something?

The code for all of the benchmarks is here. The Eigen benchmark is just doing

Eigen::Matrix<double, -1, 1> mu = r_1_1(J_1_map).array() * Z_1_1.array() +
     r_1_2(J_1_map).array() * Z_1_2.array();

This is a baseline for "if we remove all of the Stan overhead like checks, indexing from 1, expression safety, etc., what is the best possible performance we could get?"

Also, what specifically are you evaluating?

I have all the benchmark code in the link above, and the comment above (comment here) gives all the info for each benchmark. But tl;dr, each line is for

  1. (loop) Code generated from a for loop in Stan
  2. (orig) Code generated from a multi index in Stan with the current multi index code
  3. (new) Code generated from a multi index in Stan with the new multi index code
  4. (newnocheck) Code generated from a multi index in Stan with the new multi index code, removing any checks
  5. (eigen) Using plain eigen to do the multi index

I see the label on one of the plots involves multi-indexing with an array and then elementwise multiplication? Usually we just do the indexing for random effects, so I'm not sure where the elementwise multiply is coming from. Normally it'd just be foo[idxs] + bar[idxs2] + ... for varying effects.

The elementwise multiplication and multi index are what brms generates for models that do random effects like y ~ (effect | group)

Instead, it looks like the issue is creating the nested NullaryExpr for every single argument. Won't that be at least as expensive as the range check?

The NullaryExpr is only constructed once and then returned, so it should not be expensive. After all these benchmarks I'm really convinced the slowdown is from the range check. In the graph above, removing the range check gives about an order of magnitude of performance speedup. You can see in the graph in the comment above that all of the Stan versions that do the range check sit near one another, but removing that range check with STAN_NO_RANGE_CHECK gets us much closer to the baseline of a plain Eigen call.

With the Eigen approach, how do you avoid the - 1 for indexing?

It just uses a zero-based index instead of Stan's 1-based index, since it's a baseline for how fast this operation is without any of Stan's modifications.
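For illustration, one way to sidestep a per-element "- 1" while still accepting Stan's 1-based indices (a hypothetical helper, not what this PR or the benchmark does) is to shift the whole index to zero-based once and then use Eigen 3.4's indexed view:

#include <Eigen/Dense>
#include <vector>

// Hypothetical helper: shift Stan's 1-based indices to Eigen's 0-based indices in a
// single pass, then use Eigen 3.4's non-contiguous indexing to gather the elements.
inline Eigen::VectorXd index_one_based(const Eigen::VectorXd& v, std::vector<int> idx) {
  for (auto& i : idx) --i;  // one pass instead of "- 1" on every element access
  return v(idx);            // indexed view, evaluated into the returned vector
}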

@bob-carpenter
Contributor

Thanks for all the clarification. That makes a lot more sense now.

The elementwise multiplication and multi index are what brms generates for models that do random effects like y ~ (effect | group)

Aha! A varying slope model. I just couldn't think of where we'd want to do a multi-index and then elementwise multiply. In any case, it sounds like the Eigen baseline isn't something that'd be workable in Stan due to the indexing-from-1 problem.

Sorry, but no insight about how to fix it easily. It does seem like the current Stan approach is allocating a new class for each index. Isn't that causing a lot of overhead or is the overhead all compiled away somehow? I would've thought you could get away with just one level of indirection here.

@stan-buildbot
Contributor


| Name | Old Result | New Result | Ratio | Performance change (1 - new / old) |
|---|---|---|---|---|
| arma/arma.stan | 0.2 | 0.18 | 1.08 | 7.32% faster |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.01 | 0.01 | 1.09 | 8.42% faster |
| gp_regr/gen_gp_data.stan | 0.02 | 0.02 | 1.06 | 5.29% faster |
| gp_regr/gp_regr.stan | 0.11 | 0.11 | 0.97 | -2.6% slower |
| sir/sir.stan | 80.44 | 75.15 | 1.07 | 6.58% faster |
| irt_2pl/irt_2pl.stan | 4.15 | 3.95 | 1.05 | 4.69% faster |
| eight_schools/eight_schools.stan | 0.05 | 0.05 | 1.05 | 4.53% faster |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.26 | 0.25 | 1.02 | 1.54% faster |
| pkpd/one_comp_mm_elim_abs.stan | 18.57 | 18.35 | 1.01 | 1.23% faster |
| garch/garch.stan | 0.48 | 0.47 | 1.03 | 3.2% faster |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 2.86 | 2.85 | 1.0 | 0.34% faster |
| arK/arK.stan | 1.67 | 1.66 | 1.01 | 0.6% faster |
| gp_pois_regr/gp_pois_regr.stan | 2.59 | 2.6 | 1.0 | -0.29% slower |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 9.39 | 9.44 | 0.99 | -0.53% slower |
| performance.compilation | 181.91 | 186.82 | 0.97 | -2.7% slower |

Mean result: 1.026985752701417

Jenkins Console Log
Blue Ocean
Commit hash: 13017193216fe150e9a8fadbc7b3f9cb71f8a58d


Machine information: Ubuntu 20.04.3 LTS (focal); Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, 80 CPUs (2 sockets, 20 cores/socket, 2 threads/core); g++ 9.4.0; clang 10.0.0.

@SteveBronder
Collaborator Author

No problem!

In any case, it sounds like the Eigen baseline isn't something that'd be workable in Stan due to the indexing-from-1 problem.

Totally agree. I just have it there as a sort of "if we could do the most performant thing, what would that look like?" It's not a feasible answer, but it's there for comparison.

It does seem like the current Stan approach is allocating a new class for each index. Isn't that causing a lot of overhead or is the overhead all compiled away somehow?

It does make a new class, but that class is an expression, so there won't be any actual allocation of values. There's a small amount of overhead, but most of it is compiled away.

I think we should merge this PR, and if people want speed they can define the macro to remove the range checks. I don't think there's any other way for us to make multi indexing fast.
