Recommendations for Faster Simulation #1667

Open

jpvantassel opened this issue Jan 1, 2024 · 3 comments
@jpvantassel

Dear SPECFEM3D Team,

I am new to SPECFEM3D, but I have used SPECFEM2D quite extensively. I am working on simulating a 30 Hz Ricker point source at the surface of a 60 m by 60 m by 25 m domain. The simulation takes about 30 hours to complete using 121 cores of an AMD EPYC 7702.

I have two questions.

  • First, does this runtime seem about right? For reference, the corresponding 2D simulation takes 40 seconds on the same node with 30 cores.
  • Second, if it does seem about right, what is the best way to accelerate the simulation: should I run it across multiple nodes, switch to GPUs, or try something else?

Thank you in advance for your help and for building such an excellent resource.

In case they are helpful, I have included the output_meshfem3D.txt and output_solver.txt from the run below.

output_meshfem3D.txt

 ******************************************
 *** Specfem3D MPI meshfem3D - f90 version ***
 ******************************************

 Running Git package version of the code: 4.1.0
 which is Git unknown
 dating unknown

 Reading parameters from ./DATA/Par_file

 Reading mesh parameters from file ./DATA/meshfem3D_files/Mesh_Par_file
   input parameters...
   doubling layers...
   visualization...
   CPML...
   domain materials...
     material            1  elastic
   domain regions...
     region            1  with material            1
       nex_xi  begin/end =            1         242
       nex_eta begin/end =            1         242
       nz      begin/end =            1         101

   reading Mesh_Par_file done successfully

   checking mesh setup...
   all okay


 Reading interface data from file ./DATA/meshfem3D_files/interfaces.dat
   maximum interface points x/y =            2           2
   interfaces done

 parameter setup:
   total number of elements =        48884
   total number of points   =       411075


 Creating global slice addressing

 Spatial distribution of slice numbers:
  110  111  112  113  114  115  116  117  118  119  120 
   99  100  101  102  103  104  105  106  107  108  109 
   88   89   90   91   92   93   94   95   96   97   98 
   77   78   79   80   81   82   83   84   85   86   87 
   66   67   68   69   70   71   72   73   74   75   76 
   55   56   57   58   59   60   61   62   63   64   65 
   44   45   46   47   48   49   50   51   52   53   54 
   33   34   35   36   37   38   39   40   41   42   43 
   22   23   24   25   26   27   28   29   30   31   32 
   11   12   13   14   15   16   17   18   19   20   21 
    0    1    2    3    4    5    6    7    8    9   10 
 This is process            0
 There are          121  MPI processes
 Processes are numbered from 0 to          120

 There are          242  elements along xi
 There are          242  elements along eta
 There are          101  elements along Z

 There are          101  spectral elements along Z in layer            1

 There are           11  slices along xi
 There are           11  slices along eta
 There is a total of          121  slices

 Shape functions defined by NGNOD =            8  control nodes
 Surface shape functions defined by NGNOD2D =            4  control nodes
 Beware! Curvature (i.e. HEX27 elements) is not handled by our internal mesher

 region selected:

 latitude min =    0.0000000000000000     
 latitude max =    60.000000000000000     

 longitude min =    0.0000000000000000     
 longitude max =    60.000000000000000     

 this is given directly as UTM

 UTM X min =    0.0000000000000000     
 UTM X max =    60.000000000000000     

 UTM Y min =    0.0000000000000000     
 UTM Y max =    60.000000000000000     

 UTM size of model along X is    5.9999999999999998E-002  km
 UTM size of model along Y is    5.9999999999999998E-002  km

 Bottom of the mesh is at a depth of    2.5000000000000001E-002  km


 suppressing UTM projection


 **************************
 Creating interfaces
 **************************

 Reading interface data from file ./DATA/meshfem3D_files/interfaces.dat

 number of interfaces:            1

 mesh:
   origin UTM minimum x/y        (m) =    0.00000000       0.00000000    
   origin UTM maximum x/y        (m) =    60.0000000       60.0000000    

 reading interface            1
   interface file   : topo.dat

   number of points x/y =            2           2
   origin x/y     (m) =    0.00000000       0.00000000    
   spacing x/y    (m) =    60.0000000       60.0000000    

   dimension x-direction (m) =    0.00000000     /   60.0000000    
   dimension y-direction (m) =    0.00000000     /   60.0000000    

   total number of file points =            4  should be            4
   this point total is okay

   original elevation min/max             =    0.00000000       0.00000000    
   interpolated mesh elevation min/max    =    0.00000000       0.00000000    

   interpolated mesh UTM minimum x/y (m) =    0.00000000       0.00000000    
   interpolated mesh UTM maximum x/y (m) =    60.0000000       60.0000000    


 **************************
 Creating mesh in the model
 **************************

 creating mesh:
   NGLLX_M/NGLLY_M/NGLLZ_M =            3           3           3
   NGNOD/NGNOD2D           =            8           4
   NSPEC_AB                =        48884
   NGLOB_AB                =       411075

 allocating mesh arrays

 number of subregions =            1
   defining subregion            1
     has material            1

 number of mesh regions =            1
   creating mesh region            1  (regular mesh)

 mesh dimensions:
   Xmin and Xmax of the model =    0.00000000       60.0000000    
   Ymin and Ymax of the model =    0.00000000       60.0000000    
   Zmin and Zmax of the model =   -25.0000000       0.00000000    

 exact area =    3600.00000     (m^2)
            =    3.59999994E-03 (km^2)

   Max element size =   0.247936249     (m)
   Min element size =   0.247524261     (m)
   Max/min ratio =    1.00166440    


 creating indirect addressing for unstructured mesh


 File "./DATA/meshfem3D_files/dummy" not found: assume no cavity

 no PML region


 saving mesh files

 **************************
 Checking mesh quality
 **************************

 start computing the minimum and maximum edge size
 done processing 

 ------------
 mesh quality parameter definitions:

 equiangle skewness: 0. perfect,  1. bad
 skewness max deviation angle: 0. perfect,  90. bad
 edge aspect ratio: 1. perfect,  above 1. gives stretching factor
 diagonal aspect ratio: 1. perfect,  above 1. gives stretching factor
 ------------

 minimum length of an edge in the whole mesh (m) =   0.24752475247524330     

 maximum length of an edge in the whole mesh (m) =   0.24793388429752383     

 ***
 *** max equiangle skewness =    3.6470360347448326E-014  in element          393  of slice           25
 ***

 max deviation angle from a right angle (90 degrees) is therefore =    3.2823324312703493E-012

 worst angle in the mesh is therefore    89.999999999996717     
 or    90.000000000003283       degrees

 max edge aspect ratio =    1.0016528925620134     

 max diagonal aspect ratio =    1.0000000000000431     

 ***
 *** Maximum suggested time step for simulation =    0.00003351
 ***
 *** Max CFL stability condition of the time scheme (must be below about 0.55 or so) =   0.47040000000000004     
 *** computed using the maximum P wave velocity =    600.00000000000000     
 ***
 that value is below the upper CFL limit of   0.55000000000000004     
 therefore the run should be stable

 creating histogram of mesh quality

 histogram of skewness (0. good - 1. bad):

   0.00000000      -    5.00000007E-02     5914964     100.000000      %
   5.00000007E-02  -   0.100000001               0     0.00000000      %
  0.100000001      -   0.150000006               0     0.00000000      %
  0.150000006      -   0.200000003               0     0.00000000      %
  0.200000003      -   0.250000000               0     0.00000000      %
  0.250000000      -   0.300000012               0     0.00000000      %
  0.300000012      -   0.349999994               0     0.00000000      %
  0.349999994      -   0.400000006               0     0.00000000      %
  0.400000006      -   0.449999988               0     0.00000000      %
  0.449999988      -   0.500000000               0     0.00000000      %
  0.500000000      -   0.550000012               0     0.00000000      %
  0.550000012      -   0.600000024               0     0.00000000      %
  0.600000024      -   0.649999976               0     0.00000000      %
  0.649999976      -   0.699999988               0     0.00000000      %
  0.699999988      -   0.750000000               0     0.00000000      %
  0.750000000      -   0.800000012               0     0.00000000      %
  0.800000012      -   0.850000024               0     0.00000000      %
  0.850000024      -   0.899999976               0     0.00000000      %
  0.899999976      -   0.949999988               0     0.00000000      %
  0.949999988      -    1.00000000               0     0.00000000      %
 plotting skewness to VTK-file: ./OUTPUT_FILES/DATABASES_MPI/proc000000_skewness.vtk
 mesh files:
   saving files: proc***_Database
   done mesh files
 Repartition of elements:
 -----------------------
 total number of elements in mesh slice 0:        48884
 total number of points in mesh slice 0:       411075
 total number of elements in entire mesh:      5914964
 approximate total number of points in entire mesh (with duplicates on MPI edges):              49740075
 approximate total number of DOFs in entire mesh (with duplicates on MPI edges):             149220225
 using single precision for the calculations
 smallest and largest possible floating-point numbers are:    1.17549435E-38   3.40282347E+38
 Elapsed time for mesh generation and buffer creation in seconds =    26.917678016000000     
 End of mesh generation
 done

output_solver.txt

**********************************************
 **** Specfem 3-D Solver - MPI version f90 ****
 **********************************************

 Running Git package version of the code: 4.1.0
 which is Git unknown
 dating unknown




 There are          121  MPI processes
 Processes are numbered from 0 to          120

 There is a total of          121  slices

  NDIM =            3

  NGLLX =            5
  NGLLY =            5
  NGLLZ =            5

 using single precision for the calculations

 smallest and largest possible floating-point numbers are:    1.17549435E-38   3.40282347E+38

 velocity model:   default 

 Reading mesh databases...
   reads binary mesh files: proc***_external_mesh.bin
   from directory         : ./OUTPUT_FILES/DATABASES_MPI

   simulation w/ acoustic    domain:  F
   simulation w/ elastic     domain:  T
   simulation w/ poroelastic domain:  F

   slice 0 has:
   number of elements acoustic   :           0
   number of elements elastic    :       48884
   number of elements poroelastic:           0
   done

   total acoustic elements    :           0
   total elastic elements     :     5914964
   total poroelastic elements :           0

 Mesh resolution:

 ********
 minimum and maximum number of elements
 and points in the CUBIT + SCOTCH mesh:

 NSPEC_global_min =        48884
 NSPEC_global_max =        48884
 NSPEC_global_max / NSPEC_global_min imbalance =    1.00000000      =    0.00000000      %
 NSPEC_global_sum =      5914964

 NGLOB_global_min =      3208005
 NGLOB_global_max =      3208005
 NGLOB_global_max / NGLOB_global_min imbalance =    1.00000000      =    0.00000000      %
 NGLOB_global_sum =    388168605

 If you have elements of a single type (all acoustic, all elastic, all poroelastic, and without CPML)
 in the whole mesh, then there should be no significant imbalance in the above numbers.
 Otherwise, it is normal to have imbalance in elements and points because the domain decomposer
 compensates for the different cost of different elements by partitioning them unevenly among processes.
 ********


 ********
 Model: P   velocity min,max =    600.000000       600.000000    
 Model: S   velocity min,max =    300.000000       300.000000    

 Model: Poisson's ratio min,max =   0.333333343      0.333333343    
 ********

 *********************************************
 *** Verification of simulation parameters ***
 *********************************************

 *** Xmin and Xmax of the model =    0.00000000       60.0000000    
 *** Ymin and Ymax of the model =    0.00000000       60.0000000    
 *** Zmin and Zmax of the model =   -25.0000000       0.00000000    

 *** Max GLL point distance =    8.11576843E-02
 *** Min GLL point distance =    4.27398682E-02
 *** Max/min ratio =    1.89887536    

 *** Max element size =   0.247936249    
 *** Min element size =   0.247524261    
 *** Max/min ratio =    1.00166440    

 *** Minimum period resolved =    1.03306770E-03
 *** Maximum suggested time step =    3.55000011E-05

 *** for DT :    2.0000000000000002E-005
 *** Max stability for wave velocities =   0.280768305    

 Elapsed time for checking mesh resolution in seconds =    5.8972541000000003E-002

 ******************************************
 There is a total of          121  slices
 ******************************************


 kd-tree:
   total data points:        48884
   theoretical   number of nodes:        97757
                tree memory size:    2.98330688     MB
   actual        number of nodes:        97767
                tree memory size:    2.98361206     MB
   maximum depth   :           17
   creation timing :    1.34897232E-02 (s)


 sources:            1

 ********************
  locating sources
 ********************

 reading source information from ./DATA/FORCESOLUTION file

 no UTM projection


 source #            1
   source located in slice           60
                  in element        48631
                  in elastic domain

   using force point source: 
     xi coordinate of source in that element:    1.0000000000000000     
     eta coordinate of source in that element:    1.0000000000000000     
     gamma coordinate of source in that element:    1.0000000000000000     

     component of direction vector in East direction:    0.00000000    
     component of direction vector in North direction:    0.00000000    
     component of direction vector in Vertical direction:   -1.00000000    

     nu1 =    1.00000000       0.00000000       0.00000000    
     nu2 =    0.00000000       1.00000000       0.00000000    
     nu3 =    0.00000000       0.00000000       1.00000000    

     at (x,y,z) coordinates =    30.000000000000000        30.000000000000000        0.0000000000000000     

   source time function:
     using Ricker source time function

     using a source of dominant frequency    30.000000000000000     
     t0_ricker =    4.0000000000000001E-002 tshift_src =    0.0000000000000000     

     lambda_S at dominant frequency =    57.735026041666664     
     lambda_S at highest significant frequency =    23.094010416666666     

     half duration in frequency:    30.000000000000000       seconds**(-1)

     time shift:    0.0000000000000000       seconds

   magnitude of the source:
     factor =    10000000000.000000     

   original (requested) position of the source:

             latitude:    30.000000000000000     
            longitude:    30.000000000000000     

                x:    30.000000000000000     
                y:    30.000000000000000     
            depth:    0.0000000000000000       km
   topo elevation:    0.0000000000000000     

   position of the source that will be used:

                x:    30.000000000000000     
                y:    30.000000000000000     
            depth:    0.0000000000000000       km
                z:    0.0000000000000000     

   error in location of the source:    0.00000000      m



 maximum error in location of the sources:    0.00000000      m


 Elapsed time for detection of sources in seconds =    3.9835930999999991E-002

 End of source detection - done


 receivers:

 there are           24  stations in file ./DATA/STATIONS
 saving           24  stations inside the model in file ./DATA/STATIONS_FILTERED
 excluding            0  stations located outside the model


 Total number of receivers =           24


 ********************
  locating receivers
 ********************

 reading receiver information from ./DATA/STATIONS_FILTERED file

 station details from SU_stations_info.bin

 Elapsed time for receiver detection in seconds =    9.9468883999999980E-002

 End of receiver detection - done

 found a total of           24  receivers in all the slices

 source arrays:
   number of sources is            1
   size of source array                 =    1.43051147E-03 MB
                                        =    1.39698386E-06 GB

 seismograms:
   seismograms written by all processes

   Total number of simulation steps (NSTEP)                       =        50000
   writing out seismograms at every NTSTEP_BETWEEN_OUTPUT_SEISMOS =        50000
   number of subsampling steps for seismograms                    =           50
   Total number of samples for seismograms                        =         1000

   maximum number of local receivers is            3  in slice           57
   size of maximum seismogram array       =    3.43322754E-02 MB
                                          =    3.35276127E-05 GB


 Total number of samples for seismograms =        50000


 Simulation setup:

   no acoustic simulation
 incorporating elastic simulation
   no poroelastic simulation

   no attenuation
   no anisotropy
   no oceans
   no gravity
   no movie simulation


 preparing mass matrices
 preparing constants
 preparing wavefields
 preparing fault simulation
   no dynamic faults
   no kinematic faults
   no fault simulation
 preparing gravity
   no gravity simulation
 preparing Stacey absorbing boundaries
 preparing optimized arrays
   number of regular shaped elements  :            0
   number of irregular shaped elements:        48884
   fused array done
   bandwidth test (STREAM TRIAD): 
      memory accesses =    110.138107     MB
      timing  min/max =    6.85643181E-02 s /    6.86828420E-02 s
      timing      avg =    6.86385110E-02 s
      bandwidth       =    1.56700301     GB/s


 Elapsed time for preparing timerun in seconds =    1.4492546520000000     

 ************
  time loop
 ************
               scheme:         Newmark

            time step:    1.99999995E-05  s
 number of time steps:        50000
 total simulated time:    1.00000000      seconds
 start time:  -3.99999991E-02  seconds

 All processes are synchronized before the time loop

 Starting time iteration loop...

 Time step #            5
 Time:   -3.99199985E-02  seconds
 Elapsed time in seconds =    8.3381129709999993     
 Elapsed time in hh:mm:ss =      0 h 00 m 08 s
 Mean elapsed time per time step in seconds =    1.66762257    
 Max norm displacement vector U in all slices (m) =    1.87417548E-02
 Time steps done =            5  out of        50000
 Time steps remaining =        49995
 Estimated remaining time in seconds =    83372.7891    
 Estimated remaining time in hh:mm:ss =     23 h 09 m 32 s
 Estimated total run time in seconds =    83381.1328    
 Estimated total run time in hh:mm:ss =     23 h 09 m 41 s
 We have done    9.99999978E-03 % of that
 The run will finish approximately on (in local time): Sun Dec 31, 2023 11:11
 ************************************************************
 **** BEWARE: the above time estimates are not very reliable
 **** because fewer than 100 iterations have been performed
 ************************************************************

 Time step #        10000
 Time:   0.159979999      seconds
 Elapsed time in seconds =    19877.248016485002     
 Elapsed time in hh:mm:ss =      5 h 31 m 17 s
 Mean elapsed time per time step in seconds =    1.98772478    
 Max norm displacement vector U in all slices (m) =   0.247961834    
 Time steps done =        10000  out of        50000
 Time steps remaining =        40000
 Estimated remaining time in seconds =    79508.9922    
 Estimated remaining time in hh:mm:ss =     22 h 05 m 08 s
 Estimated total run time in seconds =    99386.2422    
 Estimated total run time in hh:mm:ss =     27 h 36 m 26 s
 We have done    20.0000000     % of that
 The run will finish approximately on (in local time): Sun Dec 31, 2023 15:37

 Time step #        20000
 Time:   0.359979987      seconds
 Elapsed time in seconds =    39946.529897007000     
 Elapsed time in hh:mm:ss =     11 h 05 m 46 s
 Mean elapsed time per time step in seconds =    1.99732649    
 Max norm displacement vector U in all slices (m) =    2.71000173E-02
 Time steps done =        20000  out of        50000
 Time steps remaining =        30000
 Estimated remaining time in seconds =    59919.7930    
 Estimated remaining time in hh:mm:ss =     16 h 38 m 39 s
 Estimated total run time in seconds =    99866.3281    
 Estimated total run time in hh:mm:ss =     27 h 44 m 26 s
 We have done    40.0000000     % of that
 The run will finish approximately on (in local time): Sun Dec 31, 2023 15:46

 Time step #        30000
 Time:   0.559979975      seconds
 Elapsed time in seconds =    60058.661398420998     
 Elapsed time in hh:mm:ss =     16 h 40 m 58 s
 Mean elapsed time per time step in seconds =    2.00195527    
 Max norm displacement vector U in all slices (m) =    3.30846384E-03
 Time steps done =        30000  out of        50000
 Time steps remaining =        20000
 Estimated remaining time in seconds =    40039.1094    
 Estimated remaining time in hh:mm:ss =     11 h 07 m 19 s
 Estimated total run time in seconds =    100097.766    
 Estimated total run time in hh:mm:ss =     27 h 48 m 17 s
 We have done    60.0000000     % of that
 The run will finish approximately on (in local time): Sun Dec 31, 2023 15:49
 Time step #        40000
 Time:   0.759980023      seconds
 Elapsed time in seconds =    80135.239633543999     
 Elapsed time in hh:mm:ss =     22 h 15 m 35 s
 Mean elapsed time per time step in seconds =    2.00338101    
 Max norm displacement vector U in all slices (m) =    4.06415318E-04
 Time steps done =        40000  out of        50000
 Time steps remaining =        10000
 Estimated remaining time in seconds =    20033.8105    
 Estimated remaining time in hh:mm:ss =      5 h 33 m 53 s
 Estimated total run time in seconds =    100169.047    
 Estimated total run time in hh:mm:ss =     27 h 49 m 29 s
 We have done    80.0000000     % of that
 The run will finish approximately on (in local time): Sun Dec 31, 2023 15:51
 Time step #        50000
 Time:   0.959980011      seconds
 Elapsed time in seconds =    100302.82204794600     
 Elapsed time in hh:mm:ss =     27 h 51 m 42 s
 Mean elapsed time per time step in seconds =    2.00605655    
 Max norm displacement vector U in all slices (m) =    4.77911555E-04
 Time steps done =        50000  out of        50000
 Time steps remaining =            0
 Estimated remaining time in seconds =    0.00000000    
 Estimated remaining time in hh:mm:ss =      0 h 00 m 00 s
 Estimated total run time in seconds =    100302.820    
 Estimated total run time in hh:mm:ss =     27 h 51 m 42 s
 We have done    100.000000     % of that
 Writing the seismograms
 Total number of time steps written:         1000
 Writing the seismograms in parallel took   0.158207759      seconds
 Time loop finished. Timing info:
 Total elapsed time in seconds =    100305.00411447301     
 Total elapsed time in hh:mm:ss =     27 h 51 m 45 s
 finalizing simulation
 End of the simulation
@danielpeter
Contributor

That's indeed quite slow performance. I have no access to such an AMD chip, but comparing this to a similar setup on my Apple M1 Pro chip, the mean elapsed time per time step should be about 10x-20x faster.

Something I notice is that your bandwidth test shows a very low bandwidth:

   bandwidth test (STREAM TRIAD): 
      memory accesses =    110.138107     MB
      timing  min/max =    6.85643181E-02 s /    6.86828420E-02 s
      timing      avg =    6.86385110E-02 s
      bandwidth       =    1.56700301     GB/s

Compared to the bandwidth on the M1 chip, which is around:

   bandwidth test (STREAM TRIAD): 
      memory accesses =    124.998627     MB
      timing  min/max =    2.20200000E-03 s /    2.41199997E-03 s
      timing      avg =    2.25510006E-03 s
      bandwidth       =    54.1301842     GB/s
      with force_vectorization:
      timing  min/max =    1.67999999E-03 s /    1.97000010E-03 s
      timing      avg =    1.72840001E-03 s
      bandwidth       =    70.6254196     GB/s

This bandwidth with force_vectorization turned on (you can compile with the flag ./configure --enable-vectorization ...) is around ~45x faster. Given that the SPECFEM code by default uses 4th-order elements (NGLL == 5), the arithmetic intensity of the main computations is still quite low, and therefore the code is mostly memory-bound. Having a faster memory bandwidth thus helps in getting better performance.
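
As a rough sketch, a rebuild with vectorization enabled would look something like the following (the compilers and flags here are only examples; adjust them to your system and toolchain):

   # rebuild SPECFEM3D with forced vectorization (example compilers/flags)
   ./configure FC=gfortran CC=gcc MPIFC=mpif90 --with-mpi --enable-vectorization
   make clean
   make all

After recompiling, rerun the solver and compare the "bandwidth test (STREAM TRIAD)" lines in output_solver.txt to see whether the force_vectorization numbers improve.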

The main simulation features that affect performance are the number of elastic elements per process, i.e., 48884 in your case:

 NSPEC_global_min =        48884
 NSPEC_global_max =        48884

and whether attenuation and PML are turned on or not. Compared with a similar simulation setup on my system, the mean elapsed time per time step for your number of elements per process would be around ~0.11 s, i.e., between 10x and 20x faster.
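
As a rough back-of-the-envelope check using the numbers from your log (estimates only):

   ~0.11 s/step x 50,000 steps ≈   5,500 s ≈  1.5 h   (expected for 48884 elements per process)
   ~2.00 s/step x 50,000 steps ≈ 100,000 s ≈ 28 h     (what your run currently reports)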

The system specifics depend on the compiler, chip, node memory, and MPI fabric. In your case, the following could be checked:

  • Compile first with a good compiler and optimized flags for your system, and check also whether --enable-vectorization helps.
  • Make sure you're not oversubscribing a single node, i.e., running too many MPI processes on it. Your EPYC chip seems to have 64 cores; try using only half of them per node to see if this speeds things up, i.e., assign only 32 MPI processes (or fewer) to a single node and see if it affects performance (see the mpirun sketch after this list). It is unclear from your message whether you run all 121 processes on such a single chip; if you do, then this might be the culprit, as your processes compete against each other for system resources. If you don't have access to more compute nodes, then reduce the number of MPI processes to 64 at most.
  • The same goes for the amount of memory per node, although your simulation doesn't seem to need much memory per process.
  • Finally, switch to another MPI library if possible to check whether there is a slowdown due to the communication.
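
For example, with OpenMPI something along these lines would place 32 ranks on each node and pin them to cores (a sketch only; the exact options differ between MPI libraries and versions, and NPROC in the Par_file / mesh partitioning must match the total rank count):

   # example: 64 MPI ranks total, 32 per node, one rank bound to each core
   mpirun -np 64 --map-by ppr:32:node --bind-to core ./bin/xspecfem3D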

@jpvantassel
Author

Hi @danielpeter,

Thank you for your help with this. I have tried a few things (details below), but I have not been able to reach the very fast bandwidth numbers you reported. I am hoping you may have some additional suggestions.

Attempt 1: Reduce cores to 64 and use --enable-vectorization

I recompiled with --enable-vectorization. Specifically, I configured with ./configure FC=gfortran CC=gcc MPIFC=mpif90 --with-mpi --enable-vectorization and ran the code on only 64 cores. The impact on the solver was:

**********************************************
 **** Specfem 3-D Solver - MPI version f90 ****
 **********************************************

 Running Git package version of the code: 4.1.0
 which is Git unknown
 dating unknown




 There are           64  MPI processes
 Processes are numbered from 0 to           63

 There is a total of           64  slices

  NDIM =            3

  NGLLX =            5
  NGLLY =            5
  NGLLZ =            5

 using single precision for the calculations
 using force vectorization

 smallest and largest possible floating-point numbers are:    1.17549435E-38   3.40282347E+38

 velocity model:   default 

 Reading mesh databases...
   reads binary mesh files: proc***_external_mesh.bin
   from directory         : ./OUTPUT_FILES/DATABASES_MPI

   simulation w/ acoustic    domain:  F
   simulation w/ elastic     domain:  T
   simulation w/ poroelastic domain:  F

   slice 0 has:
   number of elements acoustic   :           0
   number of elements elastic    :       90900
   number of elements poroelastic:           0
   done

   total acoustic elements    :           0
   total elastic elements     :     5817600
   total poroelastic elements :           0

 Mesh resolution:

 ********
 minimum and maximum number of elements
 and points in the CUBIT + SCOTCH mesh:

 NSPEC_global_min =        90900
 NSPEC_global_max =        90900
 NSPEC_global_max / NSPEC_global_min imbalance =    1.00000000      =    0.00000000      %
 NSPEC_global_sum =      5817600

 NGLOB_global_min =      5929605
 NGLOB_global_max =      5929605
 NGLOB_global_max / NGLOB_global_min imbalance =    1.00000000      =    0.00000000      %
 NGLOB_global_sum =    379494720

 If you have elements of a single type (all acoustic, all elastic, all poroelastic, and without CPML)
 in the whole mesh, then there should be no significant imbalance in the above numbers.
 Otherwise, it is normal to have imbalance in elements and points because the domain decomposer
 compensates for the different cost of different elements by partitioning them unevenly among processes.
 ********


 ********
 Model: P   velocity min,max =    600.000000       600.000000    
 Model: S   velocity min,max =    300.000000       300.000000    

 Model: Poisson's ratio min,max =   0.333333343      0.333333343    
 ********

 *********************************************
 *** Verification of simulation parameters ***
 *********************************************

 *** Xmin and Xmax of the model =    0.00000000       60.0000000    
 *** Ymin and Ymax of the model =    0.00000000       60.0000000    
 *** Zmin and Zmax of the model =   -25.0000000       0.00000000    

 *** Max GLL point distance =    8.18328857E-02
 *** Min GLL point distance =    4.27398682E-02
 *** Max/min ratio =    1.91467333    

 *** Max element size =   0.250000000    
 *** Min element size =   0.247524261    
 *** Max/min ratio =    1.01000202    

 *** Minimum period resolved =    1.04166672E-03
 *** Maximum suggested time step =    3.55000011E-05

 *** for DT :    2.0000000000000002E-005
 *** Max stability for wave velocities =   0.280768305    

 Elapsed time for checking mesh resolution in seconds =    8.5677223999999996E-002

 ******************************************
 There is a total of           64  slices
 ******************************************


 kd-tree:
   total data points:        90900
   theoretical   number of nodes:       181793
                tree memory size:    5.54788208     MB
   actual        number of nodes:       181799
                tree memory size:    5.54806519     MB
   maximum depth   :           17
   creation timing :    2.17423439E-02 (s)


 sources:           21

 ********************
  locating sources
 ********************

 reading source information from ./DATA/FORCESOLUTION file

 no UTM projection


 *************************************
  using sources           21
 *************************************


 maximum error in location of the sources:    0.00000000      m


 Elapsed time for detection of sources in seconds =    7.3242452000000013E-002

 End of source detection - done


 receivers:

 there are           24  stations in file ./DATA/STATIONS
 saving           24  stations inside the model in file ./DATA/STATIONS_FILTERED
 excluding            0  stations located outside the model


 Total number of receivers =           24


 ********************
  locating receivers
 ********************

 reading receiver information from ./DATA/STATIONS_FILTERED file

 station details from SU_stations_info.bin

 Elapsed time for receiver detection in seconds =    8.0566551000000097E-002

 End of receiver detection - done

 found a total of           24  receivers in all the slices

 source arrays:
   number of sources is           21
   size of source array                 =    3.00407410E-02 MB
                                        =    2.93366611E-05 GB

 seismograms:
   seismograms written by all processes

   Total number of simulation steps (NSTEP)                       =        50000
   writing out seismograms at every NTSTEP_BETWEEN_OUTPUT_SEISMOS =        50000
   number of subsampling steps for seismograms                    =           50
   Total number of samples for seismograms                        =         1000

   maximum number of local receivers is            4  in slice           25
   size of maximum seismogram array       =    4.57763672E-02 MB
                                          =    4.47034836E-05 GB


 Total number of samples for seismograms =        50000

 Using           21  point sources


 Simulation setup:

   no acoustic simulation
 incorporating elastic simulation
   no poroelastic simulation

   no attenuation
   no anisotropy
   no oceans
   no gravity
   no movie simulation


 preparing mass matrices
 preparing constants
 preparing wavefields
 preparing fault simulation
   no dynamic faults
   no kinematic faults
   no fault simulation
 preparing gravity
   no gravity simulation
 preparing Stacey absorbing boundaries
 preparing optimized arrays
   number of regular shaped elements  :            0
   number of irregular shaped elements:        90900
   fused array done
   bandwidth test (STREAM TRIAD): 
      memory accesses =    203.576828     MB
      timing  min/max =    6.46131188E-02 s /    6.47902414E-02 s
      timing      avg =    6.46588057E-02 s
      bandwidth       =    3.07468581     GB/s
      with force_vectorization:
      timing  min/max =    6.46693558E-02 s /    6.48961514E-02 s
      timing      avg =    6.48031160E-02 s
      bandwidth       =    3.06783867     GB/s
 Elapsed time for preparing timerun in seconds =    2.3109655500000001     
 ************
  time loop
 ************
               scheme:         Newmark
            time step:    1.99999995E-05  s
 number of time steps:        50000
 total simulated time:    1.00000000      seconds
 start time:  -3.99999991E-02  seconds
 All processes are synchronized before the time loop
 Starting time iteration loop...
 Time step #            5
 Time:   -3.99199985E-02  seconds
 Elapsed time in seconds =    3.4316378739999998     
 Elapsed time in hh:mm:ss =      0 h 00 m 03 s
 Mean elapsed time per time step in seconds =   0.686327577    
 Max norm displacement vector U in all slices (m) =    1.84836667E-02
 Time steps done =            5  out of        50000
 Time steps remaining =        49995
 Estimated remaining time in seconds =    34312.9453    
 Estimated remaining time in hh:mm:ss =      9 h 31 m 52 s
 Estimated total run time in seconds =    34316.3789    
 Estimated total run time in hh:mm:ss =      9 h 31 m 56 s
 We have done    9.99999978E-03 % of that
 The run will finish approximately on (in local time): Sat Mar 30, 2024 19:07
 ************************************************************
 **** BEWARE: the above time estimates are not very reliable
 **** because fewer than 100 iterations have been performed
 ************************************************************

Attempt 2: Using IntelMPI and Intel Xeon Platinum 9242

You recommended trying a different MPI library, so rather than OpenMPI (as above) I switched to IntelMPI. I also switched node types from the AMD EPYC 7702 (above) to the Intel Xeon Platinum 9242:

**********************************************
 **** Specfem 3-D Solver - MPI version f90 ****
 **********************************************
 
 Running Git package version of the code: 4.1.0
 which is Git unknown
 dating unknown
 
 
 
 
 There are           36  MPI processes
 Processes are numbered from 0 to           35
 
 There is a total of           36  slices
 
  NDIM =            3
 
  NGLLX =            5
  NGLLY =            5
  NGLLZ =            5
 
 using single precision for the calculations
 using force vectorization
 
 smallest and largest possible floating-point numbers are:   1.1754944E-38   3.4028235E+38
 
 velocity model:   default 
 
 Reading mesh databases...
   reads binary mesh files: proc***_external_mesh.bin
   from directory         : ./OUTPUT_FILES/DATABASES_MPI
 
   simulation w/ acoustic    domain:  F
   simulation w/ elastic     domain:  T
   simulation w/ poroelastic domain:  F
 
   slice 0 has:
   number of elements acoustic   :           0
   number of elements elastic    :      161600
   number of elements poroelastic:           0
   done
 
   total acoustic elements    :           0
   total elastic elements     :     5817600
   total poroelastic elements :           0
 
 Mesh resolution:
 
 ********
 minimum and maximum number of elements
 and points in the CUBIT + SCOTCH mesh:
 
 NSPEC_global_min =       161600
 NSPEC_global_max =       161600
 NSPEC_global_max / NSPEC_global_min imbalance =    1.000000      =   0.0000000E+00  %
 NSPEC_global_sum =      5817600
 
 NGLOB_global_min =     10498005
 NGLOB_global_max =     10498005
 NGLOB_global_max / NGLOB_global_min imbalance =    1.000000      =   0.0000000E+00  %
 NGLOB_global_sum =    377928180
 
 If you have elements of a single type (all acoustic, all elastic, all poroelastic, and without CPML)
 in the whole mesh, then there should be no significant imbalance in the above numbers.
 Otherwise, it is normal to have imbalance in elements and points because the domain decomposer
 compensates for the different cost of different elements by partitioning them unevenly among processes.
 ********
 
 
 ********
 Model: P   velocity min,max =    600.0000       600.0000    
 Model: S   velocity min,max =    300.0000       300.0000    
 
 Model: Poisson's ratio min,max =   0.3333333      0.3333333    
 ********
 
 *********************************************
 *** Verification of simulation parameters ***
 *********************************************
 
 *** Xmin and Xmax of the model =   0.0000000E+00   60.00000    
 *** Ymin and Ymax of the model =   0.0000000E+00   60.00000    
 *** Zmin and Zmax of the model =   -25.00000      0.0000000E+00
 
 *** Max GLL point distance =   8.1832886E-02
 *** Min GLL point distance =   4.2739868E-02
 *** Max/min ratio =    1.914673    
 
 *** Max element size =   0.2500000    
 *** Min element size =   0.2475243    
 *** Max/min ratio =    1.010002    
 
 *** Minimum period resolved =   1.0416667E-03
 *** Maximum suggested time step =   3.5500001E-05
 
 *** for DT :   2.000000000000000E-005
 *** Max stability for wave velocities =   0.2807683    
 
 Elapsed time for checking mesh resolution in seconds =   0.123510264791548     
 
 ******************************************
 There is a total of           36  slices
 ******************************************
 
 
 kd-tree:
   total data points:       161600
   theoretical   number of nodes:       323192
                tree memory size:    9.863037     MB
   actual        number of nodes:       323199
                tree memory size:    9.863251     MB
   maximum depth   :           19
   creation timing :   7.0978165E-02 (s)
 
 
 sources:           21
 
 ********************
  locating sources
 ********************
 
 reading source information from ./DATA/FORCESOLUTION file
 
 no UTM projection
 
 
 *************************************
  using sources           21
 *************************************
 
 
 maximum error in location of the sources:   0.0000000E+00  m
 
 
 Elapsed time for detection of sources in seconds =   8.017714228481054E-002
 
 End of source detection - done
 
 
 receivers:
 
 there are           24  stations in file ./DATA/STATIONS
 saving           24  stations inside the model in file ./DATA/STATIONS_FILTERED
 excluding            0  stations located outside the model
 
 
 Total number of receivers =           24
 
 
 ********************
  locating receivers
 ********************
 
 reading receiver information from ./DATA/STATIONS_FILTERED file
 
 
 maximum error in location of all the receivers:   0.0000000E+00  m
 
 Elapsed time for receiver detection in seconds =   0.134181050583720     
 
 End of receiver detection - done
 
 found a total of           24  receivers in all the slices
 
 source arrays:
   number of sources is           21
   size of source array                 =   3.0040741E-02 MB
                                        =   2.9336661E-05 GB
 
 seismograms:
   seismograms written by all processes
 
   Total number of simulation steps (NSTEP)                       =        50000
   writing out seismograms at every NTSTEP_BETWEEN_OUTPUT_SEISMOS =        50000
   number of subsampling steps for seismograms                    =           50
   Total number of samples for seismograms                        =         1000
 
   maximum number of local receivers is            5  in slice           13
   size of maximum seismogram array       =   5.7220459E-02 MB
                                          =   5.5879354E-05 GB
 
 
 Total number of samples for seismograms =        50000
 
 Using           21  point sources
 
 
 Simulation setup:
 
   no acoustic simulation
 incorporating elastic simulation
   no poroelastic simulation
 
   no attenuation
   no anisotropy
   no oceans
   no gravity
   no movie simulation
 
 
 preparing mass matrices
 preparing constants
 preparing wavefields
 preparing fault simulation
   no dynamic faults
   no kinematic faults
   no fault simulation
 preparing gravity
   no gravity simulation
 preparing Stacey absorbing boundaries
 preparing optimized arrays
   number of regular shaped elements  :            0
   number of irregular shaped elements:       161600
   fused array done
   bandwidth test (STREAM TRIAD): 
      memory accesses =    360.4204     MB
      timing  min/max =   4.6040960E-02 s /   4.6381809E-02 s
      timing      avg =   4.6192527E-02 s
      bandwidth       =    7.619697     GB/s
      with force_vectorization:
      timing  min/max =   4.5109615E-02 s /   4.5594495E-02 s
      timing      avg =   4.5234695E-02 s
      bandwidth       =    7.781042     GB/s
 
 
 Elapsed time for preparing timerun in seconds =    1.87096668966115     
 
 ************
  time loop
 ************
               scheme:         Newmark
 
            time step:   1.9999999E-05  s
 number of time steps:        50000
 total simulated time:    1.000000      seconds
 start time: -3.9999999E-02  seconds
 
 All processes are synchronized before the time loop
 
 Starting time iteration loop...
 
 Time step #            5
 Time:  -3.9919998E-02  seconds
 Elapsed time in seconds =    4.27697095926851     
 Elapsed time in hh:mm:ss =      0 h 00 m 04 s
 Mean elapsed time per time step in seconds =   0.8553942    
 Max norm displacement vector U in all slices (m) =   1.8483667E-02
 Time steps done =            5  out of        50000
 Time steps remaining =        49995
 Estimated remaining time in seconds =    42765.43    
 Estimated remaining time in hh:mm:ss =     11 h 52 m 45 s
 Estimated total run time in seconds =    42769.71    
 Estimated total run time in hh:mm:ss =     11 h 52 m 49 s
 We have done   9.9999998E-03 % of that
 The run will finish approximately on (in local time): Sat Mar 30, 2024 22:11
 ************************************************************
 **** BEWARE: the above time estimates are not very reliable
 **** because fewer than 100 iterations have been performed
 ************************************************************

Thanks again for your help!

@danielpeter
Contributor

Well, there is probably not much else to do other than running this simulation on a bigger system and across multiple nodes, where you can parallelize it over a larger number of processor cores, or getting access to a GPU system.

Just to follow up on your outputs with some more suggestions:

  • To get more reliable total run time estimates, you should probably run for a couple of hundred time steps; for example, change NTSTEP_BETWEEN_OUTPUT_INFO to, say, 100 and compare the estimated total run times.
  • The Intel Xeon Platinum 9242 chip looks like it has a total of 48 cores, so aim for the same number of MPI processes for your simulation. The chip has hyper-threading capabilities, so you might compare runs with and without hyper-threading turned on; with hyper-threading you can double the number of threads, i.e., use 96 MPI processes. Or, even fancier, compile with OpenMP turned on (--enable-openmp) and use 2 OpenMP threads per MPI process (see the sketch after this list). I'm not sure how the chip will bind your threads onto cores; there are ways to play with thread binding when calling mpirun, but that's going to be too difficult to figure out in this discussion without more system info. Maybe a sysadmin can help you out with that?
  • Last, instead of the generic gfortran compiler, use a vendor compiler; that is, if you're running on an Intel chip, the Intel compiler ifort usually gives better performance.
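
A minimal sketch of these last points (rank/thread counts and compiler names are assumptions; adjust them to your node, and make sure NPROC and the mesh partitioning match the total rank count):

   # in DATA/Par_file: report timing info more often for quicker runtime estimates
   NTSTEP_BETWEEN_OUTPUT_INFO      = 100

   # hybrid MPI+OpenMP rebuild with the Intel toolchain, e.g. 48 ranks x 2 threads on a 48-core node
   ./configure FC=ifort CC=icc MPIFC=mpiifort --with-mpi --enable-vectorization --enable-openmp
   make clean && make all
   export OMP_NUM_THREADS=2
   mpirun -np 48 ./bin/xspecfem3D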

And don't try to match the high bandwidth numbers of the Apple M1 Pro with these AMD and Intel chips; the M1 has exceptionally high bandwidth compared to other CPUs.

Better to find a larger system to run your simulation, or try out Google Colab to see if it fits onto a free TPU/GPU. This example in the package might help you with setting this up:
https://github.com/SPECFEM/specfem3d/blob/master/EXAMPLES/notebooks/specfem3d-colab-example.ipynb
