Store block-offset list as an std::unordered_map #336
base: v3
Conversation
Also in this PR, I added an improvement to the OMP multithreading of QDense contraction that I discussed with Nils. The way the multithreading is done now, block contractions with the same output block are done serially on a single thread, to avoid race conditions. For the case when there is only one output block (such as contracting to a scalar), this is not ideal, and can clearly be improved by parallelizing over the block contractions and summing the partial results at the end. This improvement doesn't make a significant difference for the DMRG calculations I have seen (since contractions like that are an insignificant cost compared to the other contractions being done), but at least for toy cases it improves things a lot:

```cpp
#include "itensor/all.h"
#include <chrono>
#include <iomanip>
#include <iostream>

using namespace itensor;

int
main()
{
auto n = 300;
auto i = Index(QN(0),n,QN(1),n,QN(2),n,QN(3),n,
               QN(4),n,QN(5),n,QN(6),n,QN(7),n,
               QN(8),n,QN(9),n,QN(10),n,QN(11),n,"i");
auto A = randomITensor(QN(),i,dag(prime(i)));
auto t1 = std::chrono::high_resolution_clock::now();
for(auto j : range(10000)) A*dag(A);
auto t2 = std::chrono::high_resolution_clock::now();
auto time_taken = std::chrono::duration<double>(t2-t1).count();
std::cout << "Time taken by program is: " << std::fixed
          << std::setprecision(9) << time_taken << " sec" << std::endl;
}
```

On v3 we get:
(so no improvement with threads) and on this branch we get:
so a nearly perfect speedup. I also tried sorting the block contractions by contraction cost (not just for this single-output-block case, but for any type of contraction). However, I saw that it either did not help or made things slightly slower. The code to sort by contraction cost starts here and is currently commented out; we can try it again in the future.
Sounds really good. As long as the switch to unordered_map doesn't make the code significantly more complicated, I'm for it (otherwise we need to think it through, since the speedup is modest). That's great about the optimization when many contractions end up in the same block. Is that handled in a general way now, or is it just a special mode for when there's exactly one output block? I'll do a detailed review of the code this time before we merge; hopefully I'll get back to you later tonight or tomorrow.
The code is not any more complicated. If anything, things like adding ITensors are a lot simpler: the block-offset lists need to be merged, and that is easier with std::unordered_map since the blocks don't need to be kept sorted (so it is just a simple insertion).
It is just a special case when there is exactly one output block. When Alex and I did some analysis of DMRG, it seemed like the most expensive contractions didn't involve multiple contractions to the same block, but it would definitely be good to think about optimizing cases in between some more (where there is more than one output block, and multiple contractions to the same blocks).
Great. It is definitely a fairly significant design change, so I will appreciate someone else looking over it. No rush, though. I asked Alex and Daniel to try it out in their codes to see if it helps them, but they can just use this branch.
This PR implements a suggestion by Alex to use an std::unordered_map to store the block-offset list in the QDense storage instead of an std::vector.
In profiles, we have seen the function `offsetOf` take up a significant amount of runtime (for example, around 10% for the 2D Hubbard model with momentum conservation around the cylinder). With std::vector storage, the offset of a block is determined by searching the sorted std::vector of block-offset pairs, and this search is what takes up the time. With std::unordered_map, finding an offset is an O(1) operation on average (at the expense of higher memory overhead).

Here is a comparison for a 4x4 Hubbard model calculation with particle number, spin, and momentum around the cylinder conserved. On v3 (with 12 OMP threads), we get:
while on this branch we get:
The speedup is not huge, but not insignificant (about 7.5%).
The main concern with this design is a possible increase in memory overhead. I did some heap profiling of this example with gperftools and did not see any noticeable difference. This makes sense to me: the memory overhead of this design scales linearly in the number of blocks, so it should not be noticeable in the large bond dimension limit, where presumably the block sizes grow faster than the number of blocks (for an MPS, the memory taken by the block data scales quadratically in the bond dimension). So for memory-limited calculations, I wouldn't expect this to significantly change what bond dimension can be reached. It is definitely something to keep in mind, though.