Store block-offset list as an std::unordered_map #336

Open · wants to merge 9 commits into base: v3
Conversation

mtfishman (Member)

This PR implements a suggestion by Alex to use an std::unordered_map to store the block-offset list in the QDense storage instead of an std::vector.

In profiles, we have seen the function offsetOf take up a significant fraction of the runtime (for example, around 10% for the 2D Hubbard model with momentum conservation around the cylinder). With std::vector storage, the offset of a block is found by searching a sorted std::vector of (block, offset) pairs, and this search is what is taking up most of that time. With std::unordered_map, finding an offset is an average O(1) operation, at the expense of a higher memory overhead.

Here is a comparison for a 4x4 Hubbard model calculation with particle number, spin and momentum around the cylinder conserved. On v3 (with 12 OMP threads), we get:

    vN Entropy at center bond b=8 = 3.038161828684
    Eigs at center bond b=8: 0.1626 0.1609 0.1593 0.1591 0.0265 0.0258 0.0133 0.0131 0.0093 0.0093
    Largest link dim during sweep 5/5 was 9998
    Largest truncation error: 6.43277e-09
    Energy after sweep 5/5 is -18.867716016279
    Sweep 5/5 CPU time = 8m, 22.6s (Wall time = 1m, 6.9s)

while on this branch we get:

    vN Entropy at center bond b=8 = 3.038161829366
    Eigs at center bond b=8: 0.1626 0.1609 0.1593 0.1591 0.0265 0.0258 0.0133 0.0131 0.0093 0.0093
    Largest link dim during sweep 5/5 was 9998
    Largest truncation error: 6.43277e-09
    Energy after sweep 5/5 is -18.867716015839
    Sweep 5/5 CPU time = 12m, 0s (Wall time = 1m, 1.9s)

The speedup is not huge, but not insignificant (about 7.5% in wall time).

The main concern with this design is a possible increase in memory overhead. I did some heap profiling of this example with gperftools and did not see any noticeable difference. This makes sense to me: the memory overhead of this design scales linearly in the number of blocks, so it should not be noticeable in the large-bond-dimension limit, where presumably the block sizes grow faster than the number of blocks, and the memory tied to the block data scales quadratically in the bond dimension for an MPS. So for memory-limited calculations, I wouldn't expect this to significantly change what bond dimension can be reached. It is definitely something to keep in mind, though.

@mtfishman (Member, Author)

Also in this PR, I added an improvement to the OMP multithreading of QDense contraction that I have discussed with Nils. The way the multithreading is done now, block contractions with the same output block are done serially on a given thread, to avoid race conditions. For the case when there is only one output block (like contracting to a scalar), this is not ideal, and it can clearly be improved by parallelizing over the block contractions and adding up the partial results at the end. This improvement doesn't make a significant difference for the DMRG calculations I have seen (since cases like that are an insignificant cost compared to the other contractions being done), but at least for toy cases it improves things a lot:

#include "itensor/all.h"
#include <chrono>
#include <iostream>

using namespace itensor;

int
main()
    {
    auto n = 300;
    auto i = Index(QN(0),n,QN(1),n,QN(2),n,QN(3),n,
                   QN(4),n,QN(5),n,QN(6),n,QN(7),n,
                   QN(8),n,QN(9),n,QN(10),n,QN(11),n,"i");
    auto A = randomITensor(QN(),i,dag(prime(i)));
    auto t1 = std::chrono::high_resolution_clock::now();
    for(auto j : range(10000))
        {
        (void)j; // loop variable unused; the contraction is repeated for timing
        A*dag(A);
        }
    auto t2 = std::chrono::high_resolution_clock::now();
    auto time_taken = std::chrono::duration<double>(t2-t1).count();
    std::cout << std::fixed << "Time taken by program is: " << time_taken << " sec" << std::endl;
    }

On v3 we get:

➜  OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 ./run
Time taken by program is: 2.623507 sec 
➜  OMP_NUM_THREADS=2 MKL_NUM_THREADS=1 ./run
Time taken by program is: 2.571339 sec 
➜  OMP_NUM_THREADS=4 MKL_NUM_THREADS=1 ./run
Time taken by program is: 2.702704 sec 

(so no improvement with threads) and on this branch we get:

➜  OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 ./run
Time taken by program is: 2.633502 sec 
➜  OMP_NUM_THREADS=2 MKL_NUM_THREADS=1 ./run
Time taken by program is: 1.387145 sec 
➜  OMP_NUM_THREADS=4 MKL_NUM_THREADS=1 ./run
Time taken by program is: 0.796539 sec 

so a nearly perfect speedup.

I also tried sorting the block contractions by contraction cost (not just for this single-output-block case, but for any type of contraction). However, I saw that it either did not help or made things slightly slower. The code to sort by contraction cost starts here and is currently commented out; we can try it again in the future.

@mtfishman changed the title from "Unordered map" to "Store block-offset list as an std::unordered_map" on Mar 18, 2020
@emstoudenmire (Contributor)

Sounds really good. As long as the switch to unordered_map doesn't make the code significantly more complicated then I'm for it (otherwise need to think through it since the speedup is modest).

That's great about the optimization when many contractions end up in the same block. Is that handled in a general way now? Or is it just a special mode when there's exactly one output block?

I'll do a detailed review of the code this time before we merge - hopefully get back to you later tonight or tomorrow.

@mtfishman (Member, Author)

> Sounds really good. As long as the switch to unordered_map doesn't make the code significantly more complicated then I'm for it (otherwise need to think through it since the speedup is modest).

The code is not any more complicated. If anything, things like adding ITensors are a lot simpler: the block-offset lists need to be merged, and that is easier with std::unordered_map since the blocks don't need to be kept sorted (so it was just a simple .insert call). Most of the changes were just replacing std::vector functions with the closest std::unordered_map equivalents.

> That's great about the optimization when many contractions end up in the same block. Is that handled in a general way now? Or is it just a special mode when there's exactly one output block?

It is just a special case when there is exactly one output block. When Alex and I did some analysis of DMRG, it seemed like the most expensive contractions didn't involve multiple contractions to the same block, but it would definitely be good to think more about optimizing the cases in between (where there is more than one output block, and multiple contractions to the same blocks).

> I'll do a detailed review of the code this time before we merge - hopefully get back to you later tonight or tomorrow.

Great, it is definitely a fairly significant design change, so I would appreciate someone else looking it over. No rush though; I asked Alex and Daniel to try it out in their codes to see if it helps them, but they can just use this branch.
