pytest test/integration error #379

Open
ChaiBapchya opened this issue Jun 10, 2020 · 4 comments

@ChaiBapchya

ChaiBapchya commented Jun 10, 2020

Test integration

pytest test/integration/sagemaker/test_horovod.py --docker-base-name sm-tf-horovod-integration --tag latest --framework-version 1.15.0 --processor gpu

Error stacktrace:

sagemaker.exceptions.UnexpectedStatusException: Error for Training job test-tf-horovod-1591768266-74da: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
E           Command "mpirun --host algo-1 -np 1 --allow-run-as-root --display-map --tag-output -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -bind-to socket -map-by slot -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -x NCCL_MIN_NRINGS=4 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD=/usr/local/lib/python3.6/dist-packages/gethostname.cpython-36m-x86_64-linux-gnu.so -verbose -x orte_base_help_aggregate=0 -x SM_HOSTS -x SM_NETWORK_INTERFACE_NAME -x SM_HPS -x SM_USER_ENTRY_POINT -x SM_FRAMEWORK_PARAMS -x SM_RESOURCE_CONFIG -x SM_INPUT_DATA_CONFIG -x SM_OUTPUT_DATA_DIR -x SM_CHANNELS -x SM_CURRENT_HOST -x SM_MODULE_NAME -x SM_LOG_LEVEL -x SM_FRAMEWORK_MODULE -x SM_INPUT_DIR -x SM_INPUT_CONFIG_DIR -x SM_OUTPUT_DIR -x SM_NUM_CPUS -x SM_NUM_GPUS -x SM_MODEL_DIR -x SM_MODULE_DIR -x SM_TRAINING_ENV -x SM_USER_ARGS -x SM_OUTPUT_INTERMEDIATE_DIR -x SM_HP_MODEL_DIR -x PYTHONPATH /usr/bin/python

Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 23, in <module>
    subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['train']' returned non-zero exit status 1.

2020-06-10 05:55:55 Uploading - Uploading generated training model
2020-06-10 05:55:55 Failed - Training job failed
======================================================= short test summary info =======================================================
FAILED test/integration/sagemaker/test_horovod.py::test_distributed_training_horovod[gpu-3] - sagemaker.exceptions.UnexpectedStatusE...
@laurenyu
Contributor

--docker-base-name sm-tf-horovod-integration --tag latest

What image did you use for your test run?

@ChaiBapchya
Author

Likely an image I built locally and pushed to ECR using the steps in the README.

@laurenyu
Contributor

laurenyu commented Jun 17, 2020

running

pytest test/integration/sagemaker/test_horovod.py --account-id 763104351884 --docker-base-name tensorflow-training --tag 1.15.0-gpu-py3 --processor gpu --dockerfile-type dlc.gpu

produced

[ip-10-0-79-182.us-west-2.compute.internal:00039] 1 more process has sent help message help-orte-odls-default.txt / memory not bound
[ip-10-0-79-182.us-west-2.compute.internal:00039] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,0]<stderr>:    "__main__", mod_spec)
[1,0]<stderr>:  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
[1,0]<stderr>:    exec(code, run_globals)
[1,0]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/mpi4py/__main__.py", line 7, in <module>
[1,0]<stderr>:    main()
[1,0]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/mpi4py/run.py", line 196, in main
[1,0]<stderr>:    run_command_line(args)
[1,0]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/mpi4py/run.py", line 47, in run_command_line
[1,0]<stderr>:    run_path(sys.argv[0], run_name='__main__')
[1,0]<stderr>:  File "/usr/lib/python3.6/runpy.py", line 263, in run_path
[1,0]<stderr>:    pkg_name=pkg_name, script_name=fname)
[1,0]<stderr>:  File "/usr/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,0]<stderr>:    mod_name, mod_spec, pkg_name, script_name)
[1,0]<stderr>:  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
[1,0]<stderr>:    exec(code, run_globals)
[1,0]<stderr>:  File "horovod_mnist.py", line 46, in <module>
[1,0]<stderr>:    loss = tf.losses.SparseCategoricalCrossentropy()
[1,0]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/module_wrapper.py", line 193, in __getattr__
[1,0]<stderr>:    attr = getattr(self._tfmw_wrapped_module, name)
[1,0]<stderr>:AttributeError: module 'tensorflow._api.v1.losses' has no attribute 'SparseCategoricalCrossentropy'
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[1061,1],1]
  Exit code:    1
--------------------------------------------------------------------------
2020-06-17 16:26:30,741 sagemaker-containers ERROR    ExecuteUserScriptError:
Command "mpirun --host algo-1,algo-2 -np 2 --allow-run-as-root --display-map --tag-output -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -bind-to socket -map-by slot -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -x NCCL_MIN_NRINGS=4 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD=/usr/local/lib/python3.6/dist-packages/gethostname.cpython-36m-x86_64-linux-gnu.so -verbose -x orte_base_help_aggregate=0 -x SM_HOSTS -x SM_NETWORK_INTERFACE_NAME -x SM_HPS -x SM_USER_ENTRY_POINT -x SM_FRAMEWORK_PARAMS -x SM_RESOURCE_CONFIG -x SM_INPUT_DATA_CONFIG -x SM_OUTPUT_DATA_DIR -x SM_CHANNELS -x SM_CURRENT_HOST -x SM_MODULE_NAME -x SM_LOG_LEVEL -x SM_FRAMEWORK_MODULE -x SM_INPUT_DIR -x SM_INPUT_CONFIG_DIR -x SM_OUTPUT_DIR -x SM_NUM_CPUS -x SM_NUM_GPUS -x SM_MODEL_DIR -x SM_MODULE_DIR -x SM_TRAINING_ENV -x SM_USER_ARGS -x SM_OUTPUT_INTERMEDIATE_DIR -x SM_HP_MODEL_DIR -x PYTHONPATH /usr/bin/python3 -m mpi4py horovod_mnist.py --model_dir s3://sagemaker-us-west-2-583851319346/test-tf-horovod-1592410946-69f0/model"
Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 23, in <module>
    subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['train']' returned non-zero exit status 1.

2020-06-17 16:26:39 Failed - Training job failed

which seems to match the partial stacktrace you provided. The actual error message looks to be:

AttributeError: module 'tensorflow._api.v1.losses' has no attribute 'SparseCategoricalCrossentropy'

which seems to have been a known bug in older versions of TF: tensorflow/tensorflow#26007, tensorflow/tensorflow#26012.
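
For anyone digging through a similar log: the real exception is buried between the mpirun help messages and the outer `CalledProcessError`. Below is a minimal sketch of pulling it out automatically. It assumes the log was produced with `mpirun --tag-output`, so each worker line carries a `[job,rank]<stderr>:` prefix as in the output above. The function name `last_exceptions` and the two regexes are illustrative, not part of this repository or of sagemaker-containers.

```python
import re

# Prefix added by `mpirun --tag-output`, e.g. "[1,0]<stderr>:".
TAG = re.compile(r"^\[(\d+),(\d+)\]<(stdout|stderr)>:(.*)$")
# An unindented line that starts with a name ending in Error/Exception,
# e.g. "AttributeError: module ... has no attribute ...".
EXC = re.compile(r"^[A-Za-z_][\w.]*(Error|Exception)\b.*")


def last_exceptions(log_text):
    """Return {rank: last exception line seen on that rank's stderr}."""
    found = {}
    for line in log_text.splitlines():
        m = TAG.match(line)
        if not m or m.group(3) != "stderr":
            continue  # untagged line (mpirun banner, outer traceback) or stdout
        rank, body = int(m.group(2)), m.group(4)
        if EXC.match(body):
            found[rank] = body  # later matches on the same rank overwrite earlier ones
    return found


# A few lines copied from the log above.
sample = "\n".join([
    "[1,0]<stderr>:Traceback (most recent call last):",
    '[1,0]<stderr>:  File "horovod_mnist.py", line 46, in <module>',
    "[1,0]<stderr>:    loss = tf.losses.SparseCategoricalCrossentropy()",
    "[1,0]<stderr>:AttributeError: module 'tensorflow._api.v1.losses' "
    "has no attribute 'SparseCategoricalCrossentropy'",
    "Primary job  terminated normally, but 1 process returned",
])
print(last_exceptions(sample))
```

The untagged `subprocess.CalledProcessError` at the end of the log is skipped on purpose, since it only reports that `train` exited non-zero.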

Running with TF 1.15.2 also failed, but running with TF 2.2 passed.

This makes me believe that the issue is with the TF installation rather than the code in this repository. I'll pass this along to the owners of https://github.com/aws/deep-learning-containers.
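
If the test script ever needs to tolerate both API layouts while that is sorted out, one option is to look the class up under more than one path. The sketch below shows only the lookup pattern, using a stand-in object instead of TensorFlow so it runs anywhere. The helper name `first_available` is hypothetical, and the assumption that `tf.keras.losses.SparseCategoricalCrossentropy` resolves on TF 1.15 has not been verified in this thread.

```python
from types import SimpleNamespace


def first_available(root, dotted_paths):
    """Return the first attribute under `root` that resolves, trying paths in order."""
    for path in dotted_paths:
        obj = root
        try:
            for part in path.split("."):
                obj = getattr(obj, part)
        except AttributeError:
            continue  # this path is missing on this version, try the next one
        return obj
    raise AttributeError("none of %r could be resolved" % (dotted_paths,))


# Stand-in for a module where `losses` lacks the class but `keras.losses` has it.
# With real TensorFlow, `root` would be the imported `tensorflow` module.
fake_tf = SimpleNamespace(
    losses=SimpleNamespace(),
    keras=SimpleNamespace(
        losses=SimpleNamespace(SparseCategoricalCrossentropy=object)
    ),
)

loss_cls = first_available(
    fake_tf,
    [
        "losses.SparseCategoricalCrossentropy",
        "keras.losses.SparseCategoricalCrossentropy",
    ],
)
```

This keeps the failure explicit: if neither path exists, the helper still raises `AttributeError` rather than silently picking a different loss.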

@ChaiBapchya
Author

Awesome. Thanks for redirecting to the concerned folks.
