
MPI job with multiple nodes cannot be launched correctly on Polaris #639

Open
GKNB opened this issue Mar 17, 2023 · 3 comments
Comments

GKNB commented Mar 17, 2023

I have a workflow that consists of a single stage with a single task. The task is an MPI job that uses multiple processes and is supposed to run on multiple nodes, but I find that it actually runs on only a single node. This is on the Polaris machine, where the resource manager is PBS and task launching is handled by mpiexec. Below is my EnTK script:

from radical import entk
import os
import argparse, sys, math

class MVP(object):

    def __init__(self):
        self.am = entk.AppManager()

    def set_resource(self, res_desc):
        self.am.resource_desc = res_desc

    def generate_task(self):
        t = entk.Task()
        t.pre_exec = []
        t.executable = '/bin/echo'
        t.arguments = ["mytest"]
        t.post_exec = []
        t.cpu_reqs = {
                'cpu_processes'     : 8,
                'cpu_process_type'  : 'MPI',
                'cpu_threads'       : 16,
                'cpu_thread_type'   : 'OpenMP'
                }
        return t

    def generate_pipeline(self):
        p = entk.Pipeline()
        s = entk.Stage()
        t = self.generate_task()
        s.add_tasks(t)
        p.add_stages(s)
        return p

    def run_workflow(self):
        p = self.generate_pipeline()
        self.am.workflow = [p]
        self.am.run()


if __name__ == '__main__':

    mvp = MVP()
    n_nodes = 2
    mvp.set_resource(res_desc = {
        'resource'  : 'anl.polaris',
        'queue'     : 'debug',
        'walltime'  : 60,
        'cpus'      : 64 * n_nodes,
        'gpus'      : 4 * n_nodes,
        'project'   : 'CSC249ADCD08'
        })
    mvp.run_workflow()

Here my MPI job is basically an echo command. I launch 8 processes, each with 16 cores; since Polaris has 64 cores per node (32 physical cores with 64 hardware threads, and in resource_anl.json, cpu_per_node is set to 64), I ask for two Polaris nodes. This is supposed to generate an output file with 8 lines of "mytest". However, I only see 4 lines of "mytest" (see the sandbox below, task.0000.out). The script runs without any error message.
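As a sanity check on the resource request, the node count follows directly from the task's CPU requirements. This is a back-of-the-envelope sketch (the helper function is hypothetical, not part of EnTK): 8 MPI processes with 16 OpenMP threads each need 128 CPU slots, which at 64 slots per Polaris node means 2 nodes.

```python
import math

def nodes_needed(processes: int, threads_per_process: int,
                 cpus_per_node: int = 64) -> int:
    """Return the number of nodes required for the given CPU request."""
    return math.ceil(processes * threads_per_process / cpus_per_node)

print(nodes_needed(8, 16))   # 8 * 16 = 128 slots -> 2 nodes
```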

My understanding is that RADICAL is not generating the mpiexec command correctly. Looking at task.0000.launch.sh in the sandbox, the generated mpiexec command is:

/opt/cray/pe/pals/1.1.7/bin/mpiexec -host x3006c0s1b0n0,x3006c0s1b0n0,x3006c0s1b0n0,x3006c0s1b0n0 -n 4 -host x3006c0s1b1n0,x3006c0s1b1n0,x3006c0s1b1n0,x3006c0s1b1n0 -n 4 $RP_TASK_SANDBOX/task.0000.exec.sh

However, I think this command cannot launch a job across the two nodes (x3006c0s1b0n0 and x3006c0s1b1n0); my guess is that only the first -host flag is recognized. I did a small test using interactive nodes: I first asked for two interactive nodes on Polaris, then ran the two commands below:

a). /opt/cray/pe/pals/1.1.7/bin/mpiexec -host x3004c0s25b1n0,x3004c0s25b1n0,x3004c0s25b1n0,x3004c0s25b1n0 -n 4 -host x3004c0s31b0n0,x3004c0s31b0n0,x3004c0s31b0n0,x3004c0s31b0n0 -n 4 echo "mytest"
(The two hostnames are taken from $PBS_NODEFILE; this mimics what RCT does.) This outputs only four lines of "mytest".

b). mpiexec -n 8 --ppn 4 echo "mytest"
This outputs eight lines of "mytest", which is what we want.

Because of that, I think there is an issue with the mpiexec command RCT generates on Polaris. Could you take a look at that? Thanks!

PS. It seems GitHub does not allow tar files, so I wrapped it in a zip.
mpi_issue.zip

@andre-merzky
Member

At first glance this seems like an incompatibility between mpiexec implementations. We should be able to switch that to a different parameter mode. @mtitov: should we add an LM config option to always use hostfiles? That's nice for debugging anyway and would resolve problems like this (hostfiles are part of the MPI spec and should be universally supported, IIRC).

@mtitov
Contributor

mtitov commented Mar 17, 2023

@andre-merzky agree, using a hostfile seems a safe approach, but when I did a quick check I found several names for that flag: -f or [most common] -hostfile. Would need to dig into this more.
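To make the idea concrete, here is a sketch (not RP's actual code) of launching via a hostfile instead of repeated -host flags: write one line per rank's node into a file, then pass that file with a single flag. The flag name (`-f`, `-hostfile`, `--hostfile`) varies by implementation, which is exactly the open question above; `-hostfile` is only an assumed default here.

```python
import os
import tempfile

def build_hostfile_cmd(mpiexec, slots, executable, flag='-hostfile'):
    """Write a hostfile (one node name per MPI rank in `slots`) and
    return the mpiexec argument vector plus the hostfile path."""
    fd, path = tempfile.mkstemp(suffix='.hosts')
    with os.fdopen(fd, 'w') as f:
        f.write('\n'.join(slots) + '\n')
    return [mpiexec, flag, path, '-n', str(len(slots)), executable], path

# 4 ranks on each of the two nodes from the report above
cmd, path = build_hostfile_cmd(
    '/opt/cray/pe/pals/1.1.7/bin/mpiexec',
    ['x3006c0s1b0n0'] * 4 + ['x3006c0s1b1n0'] * 4,
    '/bin/echo')
print(' '.join(cmd))
```

A single `-n 8` with one hostfile avoids the ambiguity of stacking multiple `-host`/`-n` pairs on one command line.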

@andre-merzky
Member

The MPI standard defines -host for mpiexec (see section 11.5). So that should be a safe bet...
