
MPI job with multiple nodes cannot be launched correctly on Polaris #639

Open
GKNB opened this issue Mar 17, 2023 · 3 comments
Comments

GKNB commented Mar 17, 2023

I have a workflow that consists of a single stage with a single task. The task is an MPI job that uses multiple processes and is supposed to run on multiple nodes, but I find that it actually runs on only a single node. This is on the Polaris machine, where the resource manager is PBS and task launching is handled by mpiexec. Below is my EnTK script:

from radical import entk
import os
import argparse, sys, math

class MVP(object):

    def __init__(self):
        self.am = entk.AppManager()

    def set_resource(self, res_desc):
        self.am.resource_desc = res_desc

    def generate_task(self):
        t = entk.Task()
        t.pre_exec = []
        t.executable = '/bin/echo'
        t.arguments = ["mytest"]
        t.post_exec = []
        t.cpu_reqs = {
                'cpu_processes'     : 8,
                'cpu_process_type'  : 'MPI',
                'cpu_threads'       : 16,
                'cpu_thread_type'   : 'OpenMP'
                }
        return t

    def generate_pipeline(self):
        p = entk.Pipeline()
        s = entk.Stage()
        t = self.generate_task()
        s.add_tasks(t)
        p.add_stages(s)
        return p

    def run_workflow(self):
        p = self.generate_pipeline()
        self.am.workflow = [p]
        self.am.run()


if __name__ == '__main__':

    mvp = MVP()
    n_nodes = 2
    mvp.set_resource(res_desc = {
        'resource'  : 'anl.polaris',
        'queue'     : 'debug',
        'walltime'  : 60,
        'cpus'      : 64 * n_nodes,
        'gpus'      : 4 * n_nodes,
        'project'   : 'CSC249ADCD08'
        })
    mvp.run_workflow()

Here my MPI job is basically an echo command. I launch 8 processes, each with 16 cores; since Polaris has 64 cores per node (32 physical cores with 64 hardware threads, and in resource_anl.json, cpu_per_node is set to 64), I ask for two Polaris nodes. This is supposed to generate an output file with 8 lines of "mytest". However, I only see 4 lines of "mytest" (see the sandbox below, task.0000.out). The script runs without any error message.
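As a sanity check on the resource request, the node count follows directly from the task's CPU requirements. This is a back-of-the-envelope sketch (the helper function is hypothetical, not part of EnTK): 8 MPI processes with 16 OpenMP threads each need 128 CPU slots, which at 64 slots per Polaris node means 2 nodes.

```python
import math

def nodes_needed(processes: int, threads_per_process: int,
                 cpus_per_node: int = 64) -> int:
    """Return the number of nodes required for the given CPU request."""
    return math.ceil(processes * threads_per_process / cpus_per_node)

print(nodes_needed(8, 16))   # 8 * 16 = 128 slots -> 2 nodes
```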

My understanding is that RADICAL is not generating the mpiexec command correctly. Looking at task.0000.launch.sh in the sandbox, the generated mpiexec command is:

/opt/cray/pe/pals/1.1.7/bin/mpiexec -host x3006c0s1b0n0,x3006c0s1b0n0,x3006c0s1b0n0,x3006c0s1b0n0 -n 4 -host x3006c0s1b1n0,x3006c0s1b1n0,x3006c0s1b1n0,x3006c0s1b1n0 -n 4 $RP_TASK_SANDBOX/task.0000.exec.sh

However, I think this command cannot launch a job across the two nodes (x3006c0s1b0n0 and x3006c0s1b1n0); my guess is that only the first -host flag is recognized. I did a small test using interactive nodes: I first asked for two interactive nodes on Polaris, then ran the two commands below:

a). /opt/cray/pe/pals/1.1.7/bin/mpiexec -host x3004c0s25b1n0,x3004c0s25b1n0,x3004c0s25b1n0,x3004c0s25b1n0 -n 4 -host x3004c0s31b0n0,x3004c0s31b0n0,x3004c0s31b0n0,x3004c0s31b0n0 -n 4 echo "mytest"
(The two hostnames are taken from $PBS_NODEFILE; this mimics what RCT does.) This outputs only four lines of "mytest".

b). mpiexec -n 8 --ppn 4 echo "mytest"
This outputs eight lines of "mytest", which is what we want.

Because of that, I think there is an issue with the mpiexec command RCT generates on Polaris. Could you take a look at that? Thanks!

PS. It seems GitHub does not allow tar files, so I wrapped it in a zip.
mpi_issue.zip

@andre-merzky
Member

At first glance this seems like an incompatibility between mpiexec implementations. We should be able to switch that to a different parameter mode. @mtitov: should we add an LM config option to always use hostfiles? That's nice for debugging anyway and would resolve problems like this (hostfiles are part of the MPI spec and should be universally supported, IIRC).

@mtitov
Contributor

mtitov commented Mar 17, 2023

@andre-merzky agree, using a hostfile seems a safe approach, but when I did a quick check I found several names for that flag: -f or [most common] -hostfile. Would need to dig into this more.
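To make the idea concrete, here is a sketch (not RP's actual code) of launching via a hostfile instead of repeated -host flags: write one line per rank's node into a file, then pass that file with a single flag. The flag name (`-f`, `-hostfile`, `--hostfile`) varies by implementation, which is exactly the open question above; `-hostfile` is only an assumed default here.

```python
import os
import tempfile

def build_hostfile_cmd(mpiexec, slots, executable, flag='-hostfile'):
    """Write a hostfile (one node name per MPI rank in `slots`) and
    return the mpiexec argument vector plus the hostfile path."""
    fd, path = tempfile.mkstemp(suffix='.hosts')
    with os.fdopen(fd, 'w') as f:
        f.write('\n'.join(slots) + '\n')
    return [mpiexec, flag, path, '-n', str(len(slots)), executable], path

# 4 ranks on each of the two nodes from the report above
cmd, path = build_hostfile_cmd(
    '/opt/cray/pe/pals/1.1.7/bin/mpiexec',
    ['x3006c0s1b0n0'] * 4 + ['x3006c0s1b1n0'] * 4,
    '/bin/echo')
print(' '.join(cmd))
```

A single `-n 8` with one hostfile avoids the ambiguity of stacking multiple `-host`/`-n` pairs on one command line.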

@andre-merzky
Member

The MPI standard defines -host for mpiexec (see section 11.5). So that should be a safe bet...
