Store block-offset list as an std::unordered_map #336

Open · wants to merge 9 commits into base: v3
Conversation

mtfishman (Member)

This PR implements a suggestion by Alex to use an std::unordered_map to store the block-offset list in the QDense storage instead of an std::vector.

In profiles, we have seen the function offsetOf take up a significant fraction of the runtime (for example, around 10% for the 2D Hubbard model with momentum conservation around the cylinder). With std::vector storage, the offset of a block is found by searching a sorted std::vector of (block, offset) pairs, and this search is what is taking up most of that time. With std::unordered_map, finding an offset is an average O(1) operation, at the expense of a higher memory overhead.

Here is a comparison for a 4x4 Hubbard model calculation with particle number, spin and momentum around the cylinder conserved. On v3 (with 12 OMP threads), we get:

    vN Entropy at center bond b=8 = 3.038161828684
    Eigs at center bond b=8: 0.1626 0.1609 0.1593 0.1591 0.0265 0.0258 0.0133 0.0131 0.0093 0.0093
    Largest link dim during sweep 5/5 was 9998
    Largest truncation error: 6.43277e-09
    Energy after sweep 5/5 is -18.867716016279
    Sweep 5/5 CPU time = 8m, 22.6s (Wall time = 1m, 6.9s)

while on this branch we get:

    vN Entropy at center bond b=8 = 3.038161829366
    Eigs at center bond b=8: 0.1626 0.1609 0.1593 0.1591 0.0265 0.0258 0.0133 0.0131 0.0093 0.0093
    Largest link dim during sweep 5/5 was 9998
    Largest truncation error: 6.43277e-09
    Energy after sweep 5/5 is -18.867716015839
    Sweep 5/5 CPU time = 12m, 0s (Wall time = 1m, 1.9s)

The speedup is not huge, but not insignificant (about 7.5% in wall time).

The main concern with this design is a possible increase in memory overhead. I did some heap profiling of this example with gperftools and did not see any noticeable difference. This makes sense to me: the memory overhead of this design scales linearly in the number of blocks, so it should not be noticeable in the large-bond-dimension limit, where presumably the block sizes grow faster than the number of blocks, and the memory tied to the block data scales quadratically in the bond dimension for an MPS. So for memory-limited calculations, I wouldn't expect this to significantly change what bond dimension can be reached. It is definitely something to keep in mind, though.

@mtfishman (Member, Author)

Also in this PR, I added an improvement to the OMP multithreading of QDense contraction that I have discussed with Nils. The way the multithreading is done now, block contractions with the same output block are done serially on a given thread, to avoid race conditions. For the case when there is only one output block (like contracting to a scalar), this is not ideal, and it can clearly be improved by parallelizing over the block contractions and adding up the partial results at the end. This improvement doesn't make a significant difference for the DMRG calculations I have seen (since cases like that are an insignificant cost compared to the other contractions being done), but at least for toy cases it improves things a lot:

#include "itensor/all.h"
#include <chrono>
#include <iostream>

using namespace itensor;

int
main()
    {
    auto n = 300;
    auto i = Index(QN(0),n,QN(1),n,QN(2),n,QN(3),n,
                   QN(4),n,QN(5),n,QN(6),n,QN(7),n,
                   QN(8),n,QN(9),n,QN(10),n,QN(11),n,"i");
    auto A = randomITensor(QN(),i,dag(prime(i)));
    auto t1 = std::chrono::high_resolution_clock::now();
    for(auto j : range(10000))
        {
        (void)j; // loop variable unused; the contraction is repeated for timing
        A*dag(A);
        }
    auto t2 = std::chrono::high_resolution_clock::now();
    auto time_taken = std::chrono::duration<double>(t2-t1).count();
    std::cout << std::fixed << "Time taken by program is: " << time_taken << " sec" << std::endl;
    }

On v3 we get:

➜  OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 ./run
Time taken by program is: 2.623507 sec 
➜  OMP_NUM_THREADS=2 MKL_NUM_THREADS=1 ./run
Time taken by program is: 2.571339 sec 
➜  OMP_NUM_THREADS=4 MKL_NUM_THREADS=1 ./run
Time taken by program is: 2.702704 sec 

(so no improvement with threads) and on this branch we get:

➜  OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 ./run
Time taken by program is: 2.633502 sec 
➜  OMP_NUM_THREADS=2 MKL_NUM_THREADS=1 ./run
Time taken by program is: 1.387145 sec 
➜  OMP_NUM_THREADS=4 MKL_NUM_THREADS=1 ./run
Time taken by program is: 0.796539 sec 

so a nearly perfect speedup.

I also tried sorting the block contractions by contraction cost (not just for this single-output-block case, but for any type of contraction). However, I saw that it either did not help or made things slightly slower. The code to sort by contraction cost starts here and is currently commented out; we can try it again in the future.

@mtfishman changed the title from "Unordered map" to "Store block-offset list as an std::unordered_map" on Mar 18, 2020
@emstoudenmire (Contributor)

Sounds really good. As long as the switch to unordered_map doesn't make the code significantly more complicated then I'm for it (otherwise need to think through it since the speedup is modest).

That's great about the optimization when many contractions end up in the same block. Is that handled in a general way now? Or is it just a special mode when there's exactly one output block?

I'll do a detailed review of the code this time before we merge - hopefully get back to you later tonight or tomorrow.

@mtfishman (Member, Author)

> Sounds really good. As long as the switch to unordered_map doesn't make the code significantly more complicated then I'm for it (otherwise need to think through it since the speedup is modest).

The code is not any more complicated. If anything, things like adding ITensors are a lot simpler: the block-offset lists need to be merged, and that is easier with std::unordered_map since the blocks don't need to be kept sorted (so it was just a simple .insert call). Most of the changes were just replacing std::vector functions with the closest std::unordered_map equivalents.

> That's great about the optimization when many contractions end up in the same block. Is that handled in a general way now? Or is it just a special mode when there's exactly one output block?

It is just a special case when there is exactly one output block. When Alex and I did some analysis of DMRG, it seemed like the most expensive contractions didn't involve multiple contractions to the same block, but it would definitely be good to think more about optimizing the cases in between (where there is more than one output block, and multiple contractions to the same blocks).

> I'll do a detailed review of the code this time before we merge - hopefully get back to you later tonight or tomorrow.

Great, it is definitely a fairly significant design change, so I would appreciate someone else looking it over. No rush though; I asked Alex and Daniel to try it out in their codes to see if it helps them, but they can just use this branch.
