The issue is most likely unrelated to mpi4py; it looks like your Open MPI installation may be broken. You have not said which Open MPI version you are using, how it was installed, etc. Your best bet is to ask the support staff of the computing resources you are using for help; there is almost nothing I can do to debug your issue further.
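As a first diagnostic step before contacting the support staff, you could print which MPI library mpi4py was built against on every node involved; mismatched Open MPI builds across nodes are a common cause of launch failures. This is a generic sketch, not something from the thread itself:

```python
# Generic diagnostic sketch (not from the original thread): report the
# MPI library that mpi4py is linked against. Run it on each node and
# compare the output; the versions should match across nodes.
try:
    from mpi4py import MPI
    std_version = MPI.Get_version()                 # (major, minor) of the MPI standard
    lib_version = MPI.Get_library_version().strip() # vendor/version string, e.g. "Open MPI v4.1.x ..."
    print("MPI standard:", std_version)
    print("Library:", lib_version)
except ImportError:
    # mpi4py is not installed in this Python environment
    std_version, lib_version = None, None
    print("mpi4py is not installed in this Python environment")
```

Running `mpirun --version` and `ompi_info` on each node gives the same information from the command line.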
-
I have tried to use MPI for Python (mpi4py) to run a compute-intensive task across several nodes, but unfortunately I get an error.
I'm trying to spread my program over 2 nodes; however, I can't get it to work that way, and I don't understand the error message. The only configuration that works is running the program on a single node.
I hope I am in the right place here and that someone can help me.
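The script itself is not shown in the post; a minimal mpi4py Monte Carlo estimator of the kind described might look like the sketch below. The structure is an assumption, since `monte_carlo_mpi.py` is not included, and the serial fallback is only there so the sketch runs without MPI:

```python
# Hypothetical sketch of a monte_carlo_mpi.py-style script (the real
# script is not shown in the post). Estimates pi by Monte Carlo
# sampling, splitting the samples across MPI ranks with mpi4py.
import random

try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
except ImportError:
    # Serial fallback for illustration when mpi4py is unavailable.
    comm, rank, size = None, 0, 1

TOTAL_SAMPLES = 1_000_000
local_n = TOTAL_SAMPLES // size

# Each rank draws its own share of random points in the unit square
# and counts how many fall inside the quarter circle.
random.seed(rank)
local_hits = sum(
    1
    for _ in range(local_n)
    if random.random() ** 2 + random.random() ** 2 <= 1.0
)

# Combine the per-rank counts on rank 0.
if comm is not None:
    total_hits = comm.reduce(local_hits, op=MPI.SUM, root=0)
else:
    total_hits = local_hits

if rank == 0:
    pi_estimate = 4.0 * total_hits / (local_n * size)
    print(f"pi ~= {pi_estimate:.4f}")
```

Launched as in the command below, each of the 16 ranks would compute its share independently, so the crash during launch happens before any of this application code runs.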
~/py4mpi$ mpiexec -hostfile hosts.txt -n 16 --display-map python3 monte_carlo_mpi.py --mca plm_base_verbose 30
[server-backend-gpu:139171] mca: base: components_register: registering framework plm components
[server-backend-gpu:139171] mca: base: components_register: found loaded component rsh
[server-backend-gpu:139171] mca: base: components_register: component rsh register function successful
[server-backend-gpu:139171] mca: base: components_register: found loaded component isolated
[server-backend-gpu:139171] mca: base: components_register: component isolated has no register or open function
[server-backend-gpu:139171] mca: base: components_register: found loaded component slurm
[server-backend-gpu:139171] mca: base: components_register: component slurm register function successful
[server-backend-gpu:139171] mca: base: components_open: opening plm components
[server-backend-gpu:139171] mca: base: components_open: found loaded component rsh
[server-backend-gpu:139171] mca: base: components_open: component rsh open function successful
[server-backend-gpu:139171] mca: base: components_open: found loaded component isolated
[server-backend-gpu:139171] mca: base: components_open: component isolated open function successful
[server-backend-gpu:139171] mca: base: components_open: found loaded component slurm
[server-backend-gpu:139171] mca: base: components_open: component slurm open function successful
[server-backend-gpu:139171] mca:base:select: Auto-selecting plm components
[server-backend-gpu:139171] mca:base:select:( plm) Querying component [rsh]
[server-backend-gpu:139171] mca:base:select:( plm) Query of component [rsh] set priority to 10
[server-backend-gpu:139171] mca:base:select:( plm) Querying component [isolated]
[server-backend-gpu:139171] mca:base:select:( plm) Query of component [isolated] set priority to 0
[server-backend-gpu:139171] mca:base:select:( plm) Querying component [slurm]
[server-backend-gpu:139171] mca:base:select:( plm) Selected component [rsh]
[server-backend-gpu:139171] mca: base: close: component isolated closed
[server-backend-gpu:139171] mca: base: close: unloading component isolated
[server-backend-gpu:139171] mca: base: close: component slurm closed
[server-backend-gpu:139171] mca: base: close: unloading component slurm
[server-backend-gpu:139171] [[56867,0],0] plm:rsh: final template argv:
/usr/bin/ssh orted -mca ess "env" -mca ess_base_jobid "3726835712" -mca ess_base_vpid "" -mca ess_base_num_procs "3" -mca orte_node_regex "server-backend-gpu,[3:192].168.0.52,[3:192].168.0.24@0(3)" -mca orte_hnp_uri "3726835712.0;tcp://192.168.0.53:47097" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "3726835712.0;tcp://192.168.0.53:47097" -mca plm_base_verbose "30" -mca rmaps_base_display_map "1" -mca pmix "^s1,s2,cray,isolated"
[feldbus:227053] mca: base: components_register: registering framework plm components
[feldbus:227053] mca: base: components_register: found loaded component rsh
[feldbus:227053] mca: base: components_register: component rsh register function successful
[feldbus:227053] mca: base: components_open: opening plm components
[feldbus:227053] mca: base: components_open: found loaded component rsh
[feldbus:227053] mca: base: components_open: component rsh open function successful
[feldbus:227053] mca:base:select: Auto-selecting plm components
[feldbus:227053] mca:base:select:( plm) Querying component [rsh]
[feldbus:227053] mca:base:select:( plm) Query of component [rsh] set priority to 10
[feldbus:227053] mca:base:select:( plm) Selected component [rsh]
[server-backend:09192] mca: base: components_register: registering framework plm components
[server-backend:09192] mca: base: components_register: found loaded component rsh
[server-backend:09192] mca: base: components_register: component rsh register function successful
[server-backend:09192] mca: base: components_open: opening plm components
[server-backend:09192] mca: base: components_open: found loaded component rsh
[server-backend:09192] mca: base: components_open: component rsh open function successful
[server-backend:09192] mca:base:select: Auto-selecting plm components
[server-backend:09192] mca:base:select:( plm) Querying component [rsh]
[server-backend:09192] mca:base:select:( plm) Query of component [rsh] set priority to 10
[server-backend:09192] mca:base:select:( plm) Selected component [rsh]
[server-backend-gpu:139171] [[56867,0],0] complete_setup on job [56867,1]
Data for JOB [56867,1] offset 0 Total slots allocated 24
======================== JOB MAP ========================
Data for node: 192.168.0.52 Num slots: 8 Max slots: 0 Num procs: 8
Process OMPI jobid: [56867,1] App: 0 Process rank: 0 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 1 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 2 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 3 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 4 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 5 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 6 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 7 Bound: N/A
Data for node: 192.168.0.24 Num slots: 8 Max slots: 0 Num procs: 8
Process OMPI jobid: [56867,1] App: 0 Process rank: 8 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 9 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 10 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 11 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 12 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 13 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 14 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 15 Bound: N/A
=============================================================
Data for JOB [56867,1] offset 0 Total slots allocated 24
======================== JOB MAP ========================
Data for node: 192.168.0.52 Num slots: 8 Max slots: 0 Num procs: 8
Process OMPI jobid: [56867,1] App: 0 Process rank: 0 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 1 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 2 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 3 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 4 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 5 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 6 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 7 Bound: N/A
Data for node: 192.168.0.24 Num slots: 8 Max slots: 0 Num procs: 8
Process OMPI jobid: [56867,1] App: 0 Process rank: 8 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 9 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 10 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 11 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 12 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 13 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 14 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 15 Bound: UNBOUND
=============================================================
Data for JOB [56867,1] offset 0 Total slots allocated 24
======================== JOB MAP ========================
Data for node: 192.168.0.52 Num slots: 8 Max slots: 0 Num procs: 8
Process OMPI jobid: [56867,1] App: 0 Process rank: 0 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 1 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 2 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 3 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 4 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 5 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 6 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 7 Bound: UNBOUND
Data for node: 192.168.0.24 Num slots: 8 Max slots: 0 Num procs: 8
Process OMPI jobid: [56867,1] App: 0 Process rank: 8 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 9 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 10 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 11 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 12 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 13 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 14 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 15 Bound: N/A
=============================================================
[server-backend-gpu:139171] [[56867,0],0] plm:base:receive update proc state command from [[56867,0],2]
[server-backend-gpu:139171] [[56867,0],0] plm:base:receive got update_proc_state for job [56867,1]
[server-backend-gpu:139171] [[56867,0],0] plm:base:receive update proc state command from [[56867,0],1]
[server-backend-gpu:139171] [[56867,0],0] plm:base:receive got update_proc_state for job [56867,1]
[feldbus:227053] PMIX ERROR: NOT-FOUND in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 181
[feldbus:227053] PMIX ERROR: NOT-FOUND in file ../../../../../../src/mca/common/dstore/dstore_base.c at line 2571
[feldbus:227053] PMIX ERROR: NOT-FOUND in file ../../../src/server/pmix_server.c at line 2462
[server-backend:09192] *** Process received signal ***
[server-backend:09192] Signal: Segmentation fault (11)
[server-backend:09192] Signal code: Address not mapped (1)
[server-backend:09192] Failing at address: (nil)
[server-backend:09192] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f342bbb3520]
[server-backend:09192] [ 1] /lib/x86_64-linux-gnu/libpmix.so.2(pmix_bfrops_base_pack_value+0x4b)[0x7f34292a1fdb]
[server-backend:09192] [ 2] /lib/x86_64-linux-gnu/libpmix.so.2(pmix_bfrops_base_pack_kval+0x8f)[0x7f342929fe3f]
[server-backend:09192] [ 3] /lib/x86_64-linux-gnu/libpmix.so.2(pmix_bfrops_base_pack+0x7f)[0x7f34292a2d6f]
[server-backend:09192] [ 4] /lib/x86_64-linux-gnu/libpmix.so.2(pmix_common_dstor_store+0x2c5)[0x7f342929d515]
[server-backend:09192] [ 5] /lib/x86_64-linux-gnu/libpmix.so.2(+0x9f9dc)[0x7f34292699dc]
[server-backend:09192] [ 6] /lib/x86_64-linux-gnu/libevent_core-2.1.so.7(+0x1dee8)[0x7f342ba2cee8]
[server-backend:09192] [ 7] /lib/x86_64-linux-gnu/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7f342ba2ebf7]
[server-backend:09192] [ 8] /lib/x86_64-linux-gnu/libpmix.so.2(+0x9c406)[0x7f3429266406]
[server-backend:09192] [ 9] /lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7f342bc05b43]
[server-backend:09192] [10] /lib/x86_64-linux-gnu/libc.so.6(+0x126a00)[0x7f342bc97a00]
[server-backend:09192] *** End of error message ***
ORTE has lost communication with a remote daemon.
HNP daemon : [[56867,0],0] on node server-backend-gpu
Remote daemon: [[56867,0],1] on node 192.168.0.52
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
[feldbus:227053] mca: base: close: component rsh closed
[feldbus:227053] mca: base: close: unloading component rsh
[server-backend-gpu:139171] mca: base: close: component rsh closed
[server-backend-gpu:139171] mca: base: close: unloading component rsh