
[ITensors] [BUG] Bad performance of DMRG in AMD CPU #1298

Open
ZipWin opened this issue Jan 3, 2024 · 4 comments
ZipWin commented Jan 3, 2024

I was running the same DMRG code on an M3 MacBook Pro and on a server with an AMD EPYC 7763 64-core processor. The EPYC is much slower than the M3. I also tested the same code on my AMD R7 4800H laptop; it is faster than the EPYC but slower than the M3. I'm not sure whether this is a problem with AMD CPUs or not. Is there any way to improve the performance?

This is the output on the M3:
[screenshot: DMRG sweep output on the M3]

And this one is on the EPYC:
[screenshot: DMRG sweep output on the EPYC]

Minimal code

My code is nothing but a simple DMRG:

using ITensors

os = ...  # a 2D quantum spin model (OpSum construction elided)
N = 3 * 4 * 8
sites = siteinds("S=1/2", N)
H = MPO(os, sites)
psi0 = randomMPS(sites, 10)
energy, psi = dmrg(H, psi0; nsweeps=30, maxdim=60, cutoff=1E-5)

Version information

  • Output from versioninfo():
  • M3
julia> versioninfo()
Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 × Apple M3
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 16 on 4 virtual cores
Environment:
  JULIA_NUM_THREADS = 16
  • EPYC
Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 256 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 128 on 256 virtual cores
Environment:
  JULIA_NUM_THREADS = 128
  • Output from using Pkg; Pkg.status("ITensors"):
  • M3
julia> using Pkg; Pkg.status("ITensors")
Status `~/.julia/environments/v1.9/Project.toml`
  [9136182c] ITensors v0.3.52
  • EPYC
julia> using Pkg; Pkg.status("ITensors")
Status `~/.julia/environments/v1.9/Project.toml`
  [9136182c] ITensors v0.3.52
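One difference visible in the reports above: the EPYC run has JULIA_NUM_THREADS = 128 while the M3 run has 16. A minimal sketch for checking and pinning thread counts, on the assumption that oversubscription between Julia threads and OpenBLAS threads is worth ruling out (the thread counts below are arbitrary example values):

using ITensors, LinearAlgebra

# How many Julia threads and BLAS threads are active?
@show Threads.nthreads()
@show BLAS.get_num_threads()

# With many Julia threads, letting OpenBLAS also spawn its own threads can
# oversubscribe the CPU. Pinning both is one thing to try:
BLAS.set_num_threads(8)             # arbitrary example value
ITensors.Strided.disable_threads()  # if your ITensors version exposes it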
ZipWin added the bug and ITensors labels Jan 3, 2024
mtfishman (Member) commented:

Thanks for the report. Very likely this is due to differences in the performance of BLAS and LAPACK on those two systems. I would recommend comparing BLAS/LAPACK functionality like matrix multiplication, SVD, etc. independent of ITensor, and seeing whether you find similar discrepancies.
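For example, a minimal sketch of such a standalone comparison (not from the thread; the matrix size is an arbitrary choice) might look like:

using LinearAlgebra

# Check which BLAS backend Julia is using (OpenBLAS by default) and its threads
@show BLAS.get_config()
@show BLAS.get_num_threads()

n = 2000
A = randn(n, n)
B = randn(n, n)
C = similar(A)

# Warm up to exclude compilation, then time dense matrix multiplication
mul!(C, A, B)
@time mul!(C, A, B)

# Warm up and time an SVD, the other kernel DMRG leans on heavily
svd(A)
@time svd(A)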

kmp5VT self-assigned this Jan 16, 2024
kmp5VT (Collaborator) commented Jan 16, 2024

@mtfishman I see that I have access to some Rome AMD EPYC™ 7002 CPUs, which have 128 cores, so I can also run some performance tests.

kmp5VT (Collaborator) commented Jan 16, 2024

Okay, I have run a quick test with two different processors. The test looks like this:

using ITensors, LinearAlgebra

N = 3 * 4 * 8
Nx = 3 * 4
Ny = 8
sites = siteinds("S=1/2", N)
lattice = square_lattice(Nx, Ny; yperiodic=false)

os = OpSum()
for b in lattice
  os .+= 0.5, "S+", b.s1, "S-", b.s2
  os .+= 0.5, "S-", b.s1, "S+", b.s2
  os .+= "Sz", b.s1, "Sz", b.s2
end
H = MPO(os, sites)

state = [isodd(n) ? "Up" : "Dn" for n in 1:N]
psi0 = randomMPS(sites, state, 10)
energy, psi = dmrg(H, psi0; nsweeps=30, maxdim=60, cutoff=1E-5)

I am using the AMD Rome chip and a Cascade Lake Intel chip. The versioninfo output on the AMD is:

Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × AMD EPYC 7742 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
  Threads: 1 on 128 virtual cores
Environment:
  LD_LIBRARY_PATH = /cm/shared/apps/slurm/current/lib64:/mnt/sw/nix/store/pmwk60bp5k4qr8vsg411p7vzhr502d83-openblas-0.3.23/lib

And on the Cascade Lake it is:

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
  Threads: 1 on 32 virtual cores
Environment:
  LD_LIBRARY_PATH = /mnt/sw/nix/store/hayjz1l94cb2ky37bhcv71aygjzq7fci-openblas-0.3.21/lib:/cm/shared/apps/slurm/current/lib64

The AMD has a clock speed of 2.25 GHz and 64 cores (128 threads), and the Intel has a clock speed of 3.6 GHz and 32 cores. That puts my estimate at 2.3 TFLOPS for the AMD and ~1.8 TFLOPS for the Intel. Maybe I should account for boost clock speeds, but I am not considering them in these estimates. I made sure both were using the OpenBLAS linear algebra, and I do not have MKL loaded on the Intel chip.
** When I look up the AMD chip it says there are 64 cores and 128 threads, so I am not sure whether I should use 64 or 128; my FLOP count could be off by a factor of two. This reference implies that I am off by a factor of 2, which only makes my AMD results look worse.
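For reference, the arithmetic behind those estimates looks like the following sketch; the 16 double-precision FLOPs per cycle per core is an assumed figure, and is exactly the factor questioned above:

using Printf

# Peak throughput estimate: clock (GHz) × cores × FLOPs per cycle per core,
# in TFLOPS. 16 DP FLOPs/cycle/core is an assumption (e.g. two 256-bit FMA
# units); counting 128 hardware threads instead of 64 cores would double
# the AMD figure.
peak_tflops(clock_ghz, cores; flops_per_cycle=16) =
    clock_ghz * cores * flops_per_cycle / 1000

@printf("AMD EPYC 7742:  %.2f TFLOPS\n", peak_tflops(2.25, 64))  # ≈ 2.3
@printf("Xeon Gold 6244: %.2f TFLOPS\n", peak_tflops(3.6, 32))   # ≈ 1.8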
Here is the output for the first few iterations on the AMD:

After sweep 1 energy=-57.54141295684079  maxlinkdim=38 maxerr=9.95E-06 time=5.758
After sweep 2 energy=-58.904767252032805  maxlinkdim=60 maxerr=3.38E-04 time=7.347
After sweep 3 energy=-59.11224893888357  maxlinkdim=60 maxerr=5.26E-04 time=8.478
After sweep 4 energy=-59.150691419028654  maxlinkdim=60 maxerr=6.03E-04 time=8.192
After sweep 5 energy=-59.164719100449744  maxlinkdim=60 maxerr=6.33E-04 time=6.386
After sweep 6 energy=-59.17082567081569  maxlinkdim=60 maxerr=6.44E-04 time=6.386
After sweep 7 energy=-59.174711234075126  maxlinkdim=60 maxerr=6.46E-04 time=7.625

And here is the output for the Intel:

After sweep 1 energy=-57.59077526962814  maxlinkdim=39 maxerr=9.98E-06 time=0.694
After sweep 2 energy=-58.90697215419522  maxlinkdim=60 maxerr=3.26E-04 time=2.616
After sweep 3 energy=-59.118974149770764  maxlinkdim=60 maxerr=5.14E-04 time=3.790
After sweep 4 energy=-59.14627690267304  maxlinkdim=60 maxerr=5.79E-04 time=3.603
After sweep 5 energy=-59.15889103852813  maxlinkdim=60 maxerr=6.06E-04 time=3.108
After sweep 6 energy=-59.1689043895468  maxlinkdim=60 maxerr=6.35E-04 time=3.512
After sweep 7 energy=-59.1743327295998  maxlinkdim=60 maxerr=6.56E-04 time=3.536

So it does look like the AMD is running significantly slower. This could potentially be related to Slurm; I have talked to Miles about something weird I have found with Slurm. I do not use Slurm to run on the Intel, just on the AMD. To be sure, I am trying to run on an Intel Ice Lake node that I have access to. I will update when I have those results.

kmp5VT closed this as completed Jan 16, 2024
kmp5VT reopened this Jan 16, 2024
kmp5VT (Collaborator) commented Jan 17, 2024

Update on the Ice Lake node. Here are the results for the first few iterations:

After sweep 1 energy=-57.63326237341011  maxlinkdim=38 maxerr=9.95E-06 time=9.316
After sweep 2 energy=-58.920829223723956  maxlinkdim=60 maxerr=3.22E-04 time=2.890
After sweep 3 energy=-59.12751248183843  maxlinkdim=60 maxerr=5.89E-04 time=3.399
After sweep 4 energy=-59.15597966767008  maxlinkdim=60 maxerr=6.52E-04 time=2.623
After sweep 5 energy=-59.1670197697073  maxlinkdim=60 maxerr=6.64E-04 time=2.420
After sweep 6 energy=-59.172930867376905  maxlinkdim=60 maxerr=6.74E-04 time=2.552
After sweep 7 energy=-59.176015668923334  maxlinkdim=60 maxerr=6.79E-04 time=2.629

And here is the versioninfo:

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × Intel(R) Xeon(R) Platinum 8362 CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, icelake-server)
  Threads: 1 on 64 virtual cores
Environment:
  LD_LIBRARY_PATH = /cm/shared/apps/slurm/current/lib64:/mnt/sw/nix/store/hayjz1l94cb2ky37bhcv71aygjzq7fci-openblas-0.3.21/lib

The node has a peak performance of 2.86 TFLOPS, which shows that the performance issue does not seem to be related to Slurm.
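That quoted figure is consistent with the same counting as above (64 cores at 2.80 GHz, assuming 16 double-precision FLOPs per cycle per core):

# Xeon Platinum 8362: 2.80 GHz × 64 cores × 16 FLOPs/cycle
2.80 * 64 * 16 / 1000  # ≈ 2.87 TFLOPS, matching the quoted 2.86 TFLOPS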

mtfishman added the mps and ITensorMPS labels and removed the ITensors label May 6, 2024