Flux scale testing Oct 2022 #4686

garlick · 2022-10-13T19:16:26Z

garlick
Oct 13, 2022
Maintainer

We have a Flux scaling test opportunity scheduled for Wed Oct 19 on quartz (~2970 nodes). Let's discuss the plan and the results here, and we can spin off issues for any weak spots we discover.

Quartz is running TOSS3 (EL7 based). We don't support flux-security on this OS so we can't test a multi-user system instance. However, we should be able to launch a large Flux instance as a slurm job and put it through its paces.

Ideas:

measure time to launch large MPI and non-MPI jobs (@trws)
get times for powers of 2 size jobs leading up to the big one to observe scaling effects
repeat launch time tests using different values of tbon.fanout in the enclosing instance (64, 128, 256)
repeat launch time tests on an El Cap size enclosing instance (multiple brokers per node)

Other thoughts?

grondo · 2022-10-13T19:51:57Z

grondo
Oct 13, 2022
Maintainer

Before the DAT we should build a recent version of Flux for this system (not sure if we should install into a shared location as well), choose the versions of MPI we're going to test, and ensure it can bootstrap without issues on the target system under Slurm.

Would it also make sense to test the time it takes Flux to bootstrap a new instance of Flux, in addition to the MPI launch timing?

9 replies

grondo Oct 18, 2022
Maintainer

Ok, I've also got a modified version of flux-mini.py which generates a set of timings similar to the MPI timings above:

$ flux mini alloc -N8 --bg --timing
    8       0.0287       0.2934       4.6655       4.9875       0.0993

The values with headings for reference are:

NODES      T_ALLOC        T_URI      T_READY  T_BOOTSTRAP   T_SHUTDOWN
    8       0.0287       0.2934       4.6655       4.9875       0.0993

Probably the most interesting column here is T_READY which is the time from when the URI memo is posted for the job (early in rank 0 broker startup) to when rc1 is complete and the instance is ready to accept jobs.
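
As a rough cross-check of how the columns relate: in the sample rows posted in this thread, T_BOOTSTRAP appears to be the sum of T_ALLOC, T_URI, and T_READY (T_SHUTDOWN excluded), to within rounding. That relationship is inferred from the printed numbers, not from the modified flux-mini.py itself. A minimal sketch using awk, where `check_row` is a hypothetical helper name:

```shell
#!/bin/bash
# Hypothetical helper: given rows of --timing output on stdin
# (NODES T_ALLOC T_URI T_READY T_BOOTSTRAP T_SHUTDOWN), print the sum of
# the first three timings next to the reported T_BOOTSTRAP.
check_row()
{
    awk '{ printf "%d sum=%.4f reported=%.4f\n", $1, $2 + $3 + $4, $5 }'
}

# Row copied from the 8-node example above.
echo "    8       0.0287       0.2934       4.6655       4.9875       0.0993" | check_row
```

If the sum and the reported value differ by more than the last printed digit, the assumed relationship does not hold for that row.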

Unfortunately there is no bulksubmit capability for flux mini alloc, so we'll have to run these serially unless I have time to refactor the code that implements the background feature of mini alloc.

$ for i in 1 2 3 4 5 6 7 8; do flux mini alloc -N$i --bg --timing; done
    1       0.0274       0.1010       1.7399       1.8682       0.0847
    2       0.0270       0.1464       3.4871       3.6605       0.0985
    3       0.0267       0.2172       3.6404       3.8843       0.0868
    4       0.0265       0.2139       4.6139       4.8544       0.0905
    5       0.9456       0.1239       3.6659       4.7354       0.1021
    6       0.8928       0.1298       3.6781       4.7006       0.0937
    7       1.0458       0.1392       3.5415       4.7264       0.0909
    8       1.1885       0.2058       4.3990       5.7932       0.0932

Also might look into adding a --brokers-per-node option here for testing.

garlick Oct 18, 2022
Maintainer Author

Nice! Really like the columnar timing output. Great work here.

grondo Oct 19, 2022
Maintainer

Ok, I've pushed the mods for today's testing to a 2022-scaling branch in my fork of flux-core.

flux mini alloc --bg --timing now also prints the fanout of the child instance along with the size:

$ flux mini alloc -N1 --bg --timing --timing-header; for fanout in 2 8 0; do for i in 1 2 4 8 16 32 64 128; do flux mini alloc -o per-resource.count=$i --broker-opts=-Stbon.fanout=$fanout -N8 --bg --timing; done; done
 SIZE FNOUT      T_ALLOC        T_URI      T_READY      (TOTAL)   T_SHUTDOWN
    1     2       0.0257       0.0820       1.6126       1.7203       0.0011
    8     2       0.5704       0.1076       4.4684       5.1464       0.0012
   16     2       0.9961       0.1077       7.4465       8.5503       0.0014
   32     2       1.2487       0.1063       6.7478       8.1028       0.0015
   64     2       2.0444       0.1255       8.2978      10.4677       0.0019
  128     2      29.2535       2.3477      14.7678      46.3689       0.0014
  256     2       1.9166       0.4383      16.9521      19.3070       0.0017
  512     2       2.8540       1.2333      25.4969      29.5842       0.0032
 1024     2       5.5962       4.2328      37.1204      46.9494       0.0120
    8     8       9.8835       0.1039       2.4965      12.4839       0.0014
   16     8       0.4887       0.1056       2.7617       3.3561       0.0016
   32     8       1.0565       0.1063       2.9195       4.0824       0.0014
   64     8       1.0077       0.1234       4.2508       5.3820       0.0013
  128     8       0.5649       0.1811       5.8610       6.6069       0.0014
  256     8       1.0985       0.4419      10.6946      12.2349       0.0014
  512     8       1.1641       1.2162      19.1300      21.5103       0.0022
 1024     8       2.2705       4.1710      37.1337      43.5752       0.0081
    8     0       5.4416       0.1080       2.3535       7.9031       0.0014
   16     0       0.4576       0.1104       2.1036       2.6716       0.0013
   32     0       0.7799       0.1137       2.3758       3.2694       0.0016
   64     0       0.7138       0.1358       2.7932       3.6428       0.0012
  128     0       0.8158       0.1757       3.8780       4.8694       0.0015
  256     0       0.2515       0.3060       6.3961       6.9536       0.0012
  512     0       0.5516       0.7219      11.6166      12.8901       0.0012
 1024     0       0.7781       2.1567      20.8355      23.7703       0.0029
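
One way to read the table above is to ask, for each instance size, which fanout gave the lowest T_READY. The sketch below does that for a few rows copied from the output; `best_fanout` is a hypothetical helper and assumes the column layout SIZE FNOUT T_ALLOC T_URI T_READY. A fanout of 0 is the flat case, where all ranks are children of rank 0.

```shell
#!/bin/bash
# Hypothetical helper: read rows of "SIZE FNOUT T_ALLOC T_URI T_READY ..."
# on stdin and print, per SIZE, the fanout with the smallest T_READY.
best_fanout()
{
    awk '
        !($1 in best) || $5 + 0 < best[$1] { best[$1] = $5 + 0; fan[$1] = $2 }
        END { for (s in best) printf "%d %s %.4f\n", s, fan[s], best[s] }
    ' | sort -n
}

# Rows for sizes 256 and 1024, copied from the table above.
best_fanout <<'EOF'
  256     2       1.9166       0.4383      16.9521      19.3070       0.0017
 1024     2       5.5962       4.2328      37.1204      46.9494       0.0120
  256     8       1.0985       0.4419      10.6946      12.2349       0.0014
 1024     8       2.2705       4.1710      37.1337      43.5752       0.0081
  256     0       0.2515       0.3060       6.3961       6.9536       0.0012
 1024     0       0.7781       2.1567      20.8355      23.7703       0.0029
EOF
```

Each output line is SIZE, the winning fanout, and its T_READY in seconds. These are single runs per configuration, so small differences should not be over-interpreted.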

trws Oct 19, 2022
Maintainer

That looks great!

trws Oct 19, 2022
Maintainer

On the openmpi issue, let's leave it out for now. Getting the IB stuff working has proven a tough nut to crack.

grondo · 2022-10-19T17:40:45Z

grondo
Oct 19, 2022
Maintainer

Here's a script to drive the MPI scale testing. In this DAT we can basically run this under srun --pty -N <nnodes> flux start [OPTIONS] ./mpi-scale.sh

#!/bin/bash

NNODES=$(flux resource list -no {nnodes})
NCORES=$(flux resource list -no {ncores})
CPN=$((${NCORES}/${NNODES}))

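# Pass values as printf arguments so a stray % in them is not
# interpreted as a format specifier.
printf "MPI scale testing on %s\n" "${LCSCHEDCLUSTER}"
printf "TIME:   %s\n" "$(date -Is)"
printf "INFO:   %s\n" "$(flux resource info)"
printf "FANOUT: %s\n" "$(flux getattr tbon.fanout)"
printf "\n"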
printf " NODES   NTASKS           INIT        BARRIER       FINALIZE          TOTAL\n"

# Print powers of two from $1 up to (but not including) $2, then always
# print $2 itself so the maximum of the range is tested exactly once.
seq2()
{
    local start=$1
    local end=$2

    while [[ $start -lt $end ]]; do
        printf "%s\n" "$start"
        ((start*=2))
    done
    printf "%s\n" "$end"
}

flux mini bulksubmit --watch --progress --quiet --nodes={0} --tasks-per-node={1} \
    --exclusive \
    --env=FLUX_MPI_TEST_TIMING=t \
    ./t/mpi/hello \
    ::: $(seq2 1 ${NNODES}) \
    ::: $(seq2 1 ${CPN})

# vi: ts=4 sw=4 expandtab

Let me know if powers of 2 scaling isn't the right approach.
E.g.

$ srun --pty -N8 ./src/cmd/flux start -o-Stbon.fanout=0 ./mpi-scale.sh 
MPI scale testing on mammoth
TIME:   2022-10-19T10:38:20-0700
INFO:   8 Nodes, 1024 Cores, 0 GPUs
FANOUT: 0

 NODES   NTASKS           INIT        BARRIER       FINALIZE          TOTAL
     1        1    0.120408295    0.000017530    0.026350396    0.146776392
     1        2    0.247044578    0.000021280    0.027165694    0.274231694
     1        8    0.255578820    0.003441866    0.022947635    0.281968421
     1       16    0.300907664    0.006436076    0.023918238    0.331262128
     1       32    0.353205447    0.015671256    0.051539656    0.420416549
     1        4    0.515359992    0.000083761    0.016126541    0.531570464
     2        2    0.124149830    0.005388638    0.177004715    0.306543333
     1       64    0.342178547    0.034258889    0.114106157    0.490543783
     2        4    0.152538355    0.002849584    0.184196495    0.339584594
     2        8    0.145661494    0.001002305    0.173005612    0.319669571
     1      128    0.695548800    0.072997695    0.201960088    0.970506764
     2       16    0.161562866    0.000070312    0.177473684    0.339107022
     2       32    0.210315840    0.000649573    0.202386340    0.413351863
     2       64    0.303383488    0.005213797    0.229790180    0.538387616
     2      128    0.515357182    0.026913300    0.272929648    0.815200291
     4        4    0.148437575    0.000023210    0.190720598    0.339181533
     4        8    0.154291603    0.001456992    0.181385187    0.337133942
     2      256    1.347883719    0.025394977    0.368588671    1.741867507
     4       16    0.157359998    0.000026741    0.173216745    0.330603614
     4       32    0.180073968    0.002974244    0.191913165    0.374961557
     4       64    0.228568515    0.013123602    0.211228428    0.452920705
     4      128    0.329307955    0.018139371    0.239887321    0.587334807
     4      256    0.865110944    0.022677813    0.275205535    1.162994442
     4      512    1.639541113    0.128465618    0.397119047    2.165125948
     8        8    0.168247919    0.004781156    0.180335400    0.353364655
     8       16    0.154139966    0.004597001    0.177126766    0.335863874
     8       32    0.160068415    0.007155382    0.181946403    0.349170350
     8       64    0.186454171    0.022474939    0.184045723    0.392974994
     8      128    0.224943759    0.047555480    0.245997619    0.518497009
     8      256    1.649888681    0.094998605    0.244259213    1.989146669
     8      512    0.868126479    0.105073766    0.288091267    1.261291642
     8     1024    2.798268776    0.089336412    0.436704179    3.324309537
PD:0  R:0  CD:32 F:0  │██████████████████████████████████████████│100.0% 0:00:22
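
To repeat the run for several fanout values, the invocation above could be wrapped in a loop. The sketch below is a dry run that only prints one command per fanout instead of executing it. The srun and flux start options are copied from the example invocation above; the helper name `print_fanout_runs` and the use of tee with fanout=N.txt output files are assumptions for illustration.

```shell
#!/bin/bash
# Dry run: print (rather than execute) one srun invocation per fanout.
# Usage: print_fanout_runs NNODES FANOUT...
# The output file naming (fanout=N.txt) is an assumption for illustration.
print_fanout_runs()
{
    local nnodes=$1
    shift
    local fanout
    for fanout in "$@"; do
        printf 'srun --pty -N%s flux start -o-Stbon.fanout=%s ./mpi-scale.sh | tee fanout=%s.txt\n' \
            "$nnodes" "$fanout" "$fanout"
    done
}

print_fanout_runs 8 2 8 16 0
```

Printing the commands first makes it easy to review the plan before committing allocation time to it.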

3 replies

trws Oct 19, 2022
Maintainer

Power of two scaling should be good, but we probably want at least one data point at whatever the maximum number of stable nodes we can target is, if only so we can get as close to full scale as possible.

garlick Oct 19, 2022
Maintainer Author

Looks good to me and @trws makes a good point also.

grondo Oct 19, 2022
Maintainer

Great! I made sure the sequence generating function always prints the maximum of the range at least once so that should be good.
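
To illustrate the point about the maximum, here is a standalone sketch of the same idea as seq2 from the script above, written with portable test and arithmetic syntax. The node count 2974 is just an example of a maximum that is not a power of two.

```shell
#!/bin/bash
# Powers of two below the maximum, followed by the maximum itself.
seq2()
{
    local start=$1
    local end=$2

    while [ "$start" -lt "$end" ]; do
        printf '%s\n' "$start"
        start=$((start * 2))
    done
    printf '%s\n' "$end"
}

# Prints: 1 2 4 8 16 32 64 128 256 512 1024 2048 2974
seq2 1 2974 | tr '\n' ' '
```

When the maximum is itself a power of two (for example 8), the loop stops before reaching it and the final printf emits it once, so it is not duplicated.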

grondo · 2022-10-20T15:51:02Z

grondo
Oct 20, 2022
Maintainer

We got through most of the MPI scale testing with parent instance fanouts of 2 (the current default), 8, 16, and 0 (all ranks are children of rank 0). We didn't have time to get through all of the fanout=0 tests.

We were able to get data for one run for each fanout value at the maximum number of tasks: 107064 tasks across 2974 nodes.

           NODES   NTASKS           INIT        BARRIER       FINALIZE          TOTAL
fanout=0:   2974   107064  340.923443580   25.542872260   98.308813571  464.775129636
fanout=16:  2974   107064  187.370121511    0.023706568   14.326472172  201.720300533
fanout=2:   2974   107064  110.171576376    0.582222329   15.171241097  125.925040051
fanout=8:   2974   107064  172.175688041    1.347049127   14.890609175  188.413346524
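
As a quick sanity check on the table above: 107064 tasks over 2974 nodes is 36 tasks per node, and most of each TOTAL is spent in INIT. The sketch below computes both from the rows as posted; `summarize` is a hypothetical helper that assumes the column order LABEL NODES NTASKS INIT BARRIER FINALIZE TOTAL.

```shell
#!/bin/bash
# Hypothetical helper: read "LABEL NODES NTASKS INIT BARRIER FINALIZE TOTAL"
# rows on stdin; print tasks per node and INIT as a percentage of TOTAL.
summarize()
{
    awk '{ printf "%s tasks/node=%d init=%.1f%%\n", $1, $3 / $2, 100 * $4 / $7 }'
}

# Rows copied from the table above.
summarize <<'EOF'
fanout=0:   2974   107064  340.923443580   25.542872260   98.308813571  464.775129636
fanout=16:  2974   107064  187.370121511    0.023706568   14.326472172  201.720300533
fanout=2:   2974   107064  110.171576376    0.582222329   15.171241097  125.925040051
fanout=8:   2974   107064  172.175688041    1.347049127   14.890609175  188.413346524
EOF
```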

It would be nice to have another chance to run the fanout=0 case to verify that the above is not an outlier. However, the plots of the raw data (MPI_Init time only) suggest it probably is not, as fanout=0 starts to fall over at larger scale:

[plot not preserved: number of tasks on x axis, seconds to complete `MPI_Init` on y axis]

Here's the same data with a log2 scaled x axis:

[plot not preserved]

5 replies

trws Oct 20, 2022
Maintainer

Oh boy, ok, thanks for getting all this data! The barrier times on non-zero fanouts look great. We're going to need to do some testing with cray-bootstrap scaling as soon as feasible so we know what our time looks like with their scalable bootstrap. The language in the SOW says the overhead introduced by the contractor mustn't exceed 2 minutes. I believe that means the MPI init time of Cray MPI independent of us, but we may need to keep an eye on our times for distributing the scalable bootstrap information to the ranks. @jameshcorbett, how are we set for doing some scalability testing with Cray MPI (just running some hello-world type jobs so we can get a baseline)?

grondo Oct 20, 2022
Maintainer

I actually should have clarified that the barrier times are for MPI_Barrier. I don't know if MPI_Barrier uses PMI at all, so I had assumed the time for the barrier should not be influenced by Flux (one reason I thought perhaps something else was going on in the fanout=0 case). However, @garlick might know for sure whether the enclosing instance could influence MPI_Barrier.

For reference, here are the results for 16K tasks and up from fanout=0:

 NODES   NTASKS           INIT        BARRIER       FINALIZE          TOTAL
   512    16384   27.066083841    1.548645226    1.351088941   29.965818278
  1024    16384   18.430317203    0.866934314   13.582741258   32.879992986
  2048    16384   23.014128891    0.375052778   19.217176221   42.606358114
   512    18432   20.222747741    2.520668702   13.499597855   36.243014554
  1024    32768   41.756627363    0.413432647    3.022614049   45.192674259
  2048    32768  387.741308550    0.000738472    4.864661770  392.606709011
  1024    36864   47.922098785    1.816483253    4.390996822   54.129579084
  2048    65536  173.022034511   46.220443489   22.067525959  241.310004236
  2048    73728  157.151180888   17.703697521   67.690411742  242.545290370
  2974   107064  340.923443580   25.542872260   98.308813571  464.775129636

So there does seem to be a definite effect there starting around 64K tasks?
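
One way to make the cutoff visible is to filter the rows above by INIT time. The sketch below prints the rows whose INIT exceeds a threshold; `slow_init` is a hypothetical helper, and the 100 second default is an arbitrary cutoff chosen for illustration. On the data above it selects every run at 64K tasks and up, plus the 2048-node, 32768-task run at about 388 seconds, which sits below 64K tasks.

```shell
#!/bin/bash
# Hypothetical helper: print NODES, NTASKS and INIT for rows whose INIT
# time (column 3) exceeds a threshold in seconds (default 100, an
# arbitrary cutoff chosen for illustration).
slow_init()
{
    awk -v limit="${1:-100}" '$3 + 0 > limit + 0 { print $1, $2, $3 }'
}

# Rows copied from the fanout=0 table above.
slow_init 100 <<'EOF'
   512    16384   27.066083841    1.548645226    1.351088941   29.965818278
  1024    16384   18.430317203    0.866934314   13.582741258   32.879992986
  2048    16384   23.014128891    0.375052778   19.217176221   42.606358114
   512    18432   20.222747741    2.520668702   13.499597855   36.243014554
  1024    32768   41.756627363    0.413432647    3.022614049   45.192674259
  2048    32768  387.741308550    0.000738472    4.864661770  392.606709011
  1024    36864   47.922098785    1.816483253    4.390996822   54.129579084
  2048    65536  173.022034511   46.220443489   22.067525959  241.310004236
  2048    73728  157.151180888   17.703697521   67.690411742  242.545290370
  2974   107064  340.923443580   25.542872260   98.308813571  464.775129636
EOF
```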

I've attached the raw output files here for reference

fanout=0.txt
fanout=2.txt
fanout=8.txt
fanout=16.txt

grondo Oct 20, 2022
Maintainer

I should also note these times do not include the time for the shell to initialize. Luckily there is always one shell per node, but it was taking at least tens of seconds to get through the initial job shell barrier. We'll want to characterize that as well as the instance bootstrap time if we ever get another chance to do this kind of scale testing.

trws Oct 20, 2022
Maintainer

Ok, still these are pretty good numbers I think. Do we have any idea how they compare to similar slurm runs?

grondo Oct 20, 2022
Maintainer

No, unfortunately I didn't think of running full scale tests of the same MPI hello job under srun 😞
