pytest test/integration error #379

ChaiBapchya · 2020-06-10T08:29:18Z

Test integration

pytest test/integration/sagemaker/test_horovod.py --docker-base-name sm-tf-horovod-integration --tag latest --framework-version 1.15.0 --processor gpu

Error stacktrace:

sagemaker.exceptions.UnexpectedStatusException: Error for Training job test-tf-horovod-1591768266-74da: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
E           Command "mpirun --host algo-1 -np 1 --allow-run-as-root --display-map --tag-output -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -bind-to socket -map-by slot -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -x NCCL_MIN_NRINGS=4 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD=/usr/local/lib/python3.6/dist-packages/gethostname.cpython-36m-x86_64-linux-gnu.so -verbose -x orte_base_help_aggregate=0 -x SM_HOSTS -x SM_NETWORK_INTERFACE_NAME -x SM_HPS -x SM_USER_ENTRY_POINT -x SM_FRAMEWORK_PARAMS -x SM_RESOURCE_CONFIG -x SM_INPUT_DATA_CONFIG -x SM_OUTPUT_DATA_DIR -x SM_CHANNELS -x SM_CURRENT_HOST -x SM_MODULE_NAME -x SM_LOG_LEVEL -x SM_FRAMEWORK_MODULE -x SM_INPUT_DIR -x SM_INPUT_CONFIG_DIR -x SM_OUTPUT_DIR -x SM_NUM_CPUS -x SM_NUM_GPUS -x SM_MODEL_DIR -x SM_MODULE_DIR -x SM_TRAINING_ENV -x SM_USER_ARGS -x SM_OUTPUT_INTERMEDIATE_DIR -x SM_HP_MODEL_DIR -x PYTHONPATH /usr/bin/python

Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 23, in <module>
    subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['train']' returned non-zero exit status 1.

2020-06-10 05:55:55 Uploading - Uploading generated training model
2020-06-10 05:55:55 Failed - Training job failed
======================================================= short test summary info =======================================================
FAILED test/integration/sagemaker/test_horovod.py::test_distributed_training_horovod[gpu-3] - sagemaker.exceptions.UnexpectedStatusE...


laurenyu · 2020-06-16T21:45:40Z

--docker-base-name sm-tf-horovod-integration --tag latest

What image did you use for your test run?

ChaiBapchya · 2020-06-17T04:09:04Z

Likely an image I built locally and pushed to ECR using the steps in the README.

laurenyu · 2020-06-17T17:07:29Z

running

pytest test/integration/sagemaker/test_horovod.py --account-id 763104351884 --docker-base-name tensorflow-training --tag 1.15.0-gpu-py3 --processor gpu --dockerfile-type dlc.gpu

produced

[ip-10-0-79-182.us-west-2.compute.internal:00039] 1 more process has sent help message help-orte-odls-default.txt / memory not bound
[ip-10-0-79-182.us-west-2.compute.internal:00039] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,0]<stderr>:    "__main__", mod_spec)
[1,0]<stderr>:  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
[1,0]<stderr>:    exec(code, run_globals)
[1,0]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/mpi4py/__main__.py", line 7, in <module>
[1,0]<stderr>:    main()
[1,0]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/mpi4py/run.py", line 196, in main
[1,0]<stderr>:    run_command_line(args)
[1,0]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/mpi4py/run.py", line 47, in run_command_line
[1,0]<stderr>:    run_path(sys.argv[0], run_name='__main__')
[1,0]<stderr>:  File "/usr/lib/python3.6/runpy.py", line 263, in run_path
[1,0]<stderr>:    pkg_name=pkg_name, script_name=fname)
[1,0]<stderr>:  File "/usr/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,0]<stderr>:    mod_name, mod_spec, pkg_name, script_name)
[1,0]<stderr>:  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
[1,0]<stderr>:    exec(code, run_globals)
[1,0]<stderr>:  File "horovod_mnist.py", line 46, in <module>
[1,0]<stderr>:    loss = tf.losses.SparseCategoricalCrossentropy()
[1,0]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/module_wrapper.py", line 193, in __getattr__
[1,0]<stderr>:    attr = getattr(self._tfmw_wrapped_module, name)
[1,0]<stderr>:AttributeError: module 'tensorflow._api.v1.losses' has no attribute 'SparseCategoricalCrossentropy'
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[1061,1],1]
  Exit code:    1
--------------------------------------------------------------------------
2020-06-17 16:26:30,741 sagemaker-containers ERROR    ExecuteUserScriptError:
Command "mpirun --host algo-1,algo-2 -np 2 --allow-run-as-root --display-map --tag-output -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -bind-to socket -map-by slot -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -x NCCL_MIN_NRINGS=4 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD=/usr/local/lib/python3.6/dist-packages/gethostname.cpython-36m-x86_64-linux-gnu.so -verbose -x orte_base_help_aggregate=0 -x SM_HOSTS -x SM_NETWORK_INTERFACE_NAME -x SM_HPS -x SM_USER_ENTRY_POINT -x SM_FRAMEWORK_PARAMS -x SM_RESOURCE_CONFIG -x SM_INPUT_DATA_CONFIG -x SM_OUTPUT_DATA_DIR -x SM_CHANNELS -x SM_CURRENT_HOST -x SM_MODULE_NAME -x SM_LOG_LEVEL -x SM_FRAMEWORK_MODULE -x SM_INPUT_DIR -x SM_INPUT_CONFIG_DIR -x SM_OUTPUT_DIR -x SM_NUM_CPUS -x SM_NUM_GPUS -x SM_MODEL_DIR -x SM_MODULE_DIR -x SM_TRAINING_ENV -x SM_USER_ARGS -x SM_OUTPUT_INTERMEDIATE_DIR -x SM_HP_MODEL_DIR -x PYTHONPATH /usr/bin/python3 -m mpi4py horovod_mnist.py --model_dir s3://sagemaker-us-west-2-583851319346/test-tf-horovod-1592410946-69f0/model"
Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 23, in <module>
    subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['train']' returned non-zero exit status 1.

2020-06-17 16:26:39 Failed - Training job failed

which seems to match the partial stacktrace you provided. The actual error message looks to be:

AttributeError: module 'tensorflow._api.v1.losses' has no attribute 'SparseCategoricalCrossentropy'
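
For anyone digging through a similar log, here is a rough sketch of how that line can be pulled out automatically. It assumes the `[jobid,rank]<stream>:` prefix that `mpirun --tag-output` adds, as seen in the output above; the function name `find_root_errors` and the regular expressions are hypothetical helpers, not part of this repository or of SageMaker.

```python
import re

# mpirun --tag-output prefixes each line with "[jobid,rank]<stream>:".
TAGGED_LINE = re.compile(r"^\[(\d+),(\d+)\]<(stdout|stderr)>:(.*)$")
# A Python exception summary line such as "AttributeError: ...".
EXCEPTION_LINE = re.compile(r"^[A-Za-z_][\w.]*(Error|Exception):\s")


def find_root_errors(log_text):
    """Return (rank, message) pairs for exception lines in tagged stderr."""
    errors = []
    for line in log_text.splitlines():
        match = TAGGED_LINE.match(line)
        if not match or match.group(3) != "stderr":
            continue  # skip untagged lines and stdout
        message = match.group(4)
        if EXCEPTION_LINE.match(message):
            errors.append((int(match.group(2)), message))
    return errors


# A few lines copied from the log above.
SAMPLE = """\
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "horovod_mnist.py", line 46, in <module>
[1,0]<stderr>:    loss = tf.losses.SparseCategoricalCrossentropy()
[1,0]<stderr>:AttributeError: module 'tensorflow._api.v1.losses' has no attribute 'SparseCategoricalCrossentropy'
subprocess.CalledProcessError: Command '['train']' returned non-zero exit status 1.
"""

for rank, message in find_root_errors(SAMPLE):
    print(f"rank {rank}: {message}")
```

The untagged `CalledProcessError` line is skipped on purpose, since it only reports that the `train` command failed and not why.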

which seems to have been a known bug in older versions of TF: tensorflow/tensorflow#26007, tensorflow/tensorflow#26012.

Running with TF 1.15.2 also failed, but running with TF 2.2 passed.

This makes me believe that the issue is with the TF installation rather than the code in this repository. I'll pass this along to the owners of https://github.com/aws/deep-learning-containers.
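
In the meantime, one possible workaround on the script side is to look the class up defensively. This is only a sketch: it assumes `tf.keras.losses.SparseCategoricalCrossentropy` is available in the TF 1.15 image, which has not been verified in this thread, and `resolve_sparse_cce` is a hypothetical helper, not something `horovod_mnist.py` currently contains. The stand-in objects below let the lookup logic run without TensorFlow installed.

```python
from types import SimpleNamespace


def resolve_sparse_cce(tf_module):
    """Return SparseCategoricalCrossentropy from tf.losses, else tf.keras.losses."""
    candidates = (
        getattr(tf_module, "losses", None),
        getattr(getattr(tf_module, "keras", None), "losses", None),
    )
    for namespace in candidates:
        loss_cls = getattr(namespace, "SparseCategoricalCrossentropy", None)
        if loss_cls is not None:
            return loss_cls
    raise AttributeError(
        "SparseCategoricalCrossentropy not found under tf.losses or tf.keras.losses"
    )


class FakeLoss:
    """Stand-in for the real loss class, used only for this illustration."""


# Mimics the failing layout: tf.losses exists but lacks the class.
tf1_like = SimpleNamespace(
    losses=SimpleNamespace(),
    keras=SimpleNamespace(losses=SimpleNamespace(SparseCategoricalCrossentropy=FakeLoss)),
)
# Mimics a layout where tf.losses exposes the class directly.
tf2_like = SimpleNamespace(
    losses=SimpleNamespace(SparseCategoricalCrossentropy=FakeLoss),
    keras=SimpleNamespace(losses=SimpleNamespace(SparseCategoricalCrossentropy=FakeLoss)),
)

print(resolve_sparse_cce(tf1_like).__name__)
print(resolve_sparse_cce(tf2_like).__name__)
```

In a real script the call would look like `loss = resolve_sparse_cce(tf)()` after `import tensorflow as tf`.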

ChaiBapchya · 2020-06-18T06:28:04Z

Awesome. Thanks for redirecting to the concerned folks.

chuyang-deng assigned ChaiBapchya and unassigned ChaiBapchya Jun 11, 2020

laurenyu added the type: question label Jun 17, 2020
