lingvo/jax:main fails with "undefined symbol: _ZNK10tensorflow6Status14GetAllPayloadsEv" #283

Open
ruomingp opened this issue Mar 24, 2022 · 5 comments

@ruomingp
Contributor

To reproduce:

# bazel run -c opt lingvo/jax:main -- \
    --model=lm.ptb.PTBCharTransformerSmallSgd \
    --job_log_dir=/tmp/jax_log_dir/exp01 --alsologtostderr
...
2022-03-24 17:46:18.859227: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/core/ops/__init__.py", line 22, in <module>
    from lingvo.core.ops import gen_x_ops  # pylint: disable=g-import-not-at-top
ImportError: cannot import name 'gen_x_ops' from 'lingvo.core.ops' (/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/core/ops/__init__.py)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/jax/main.py", line 36, in <module>
    from lingvo.jax import eval as eval_lib
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/jax/eval.py", line 29, in <module>
    from lingvo.jax import base_input
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/jax/base_input.py", line 23, in <module>
    from lingvo.core import datasource
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/core/datasource.py", line 33, in <module>
    from lingvo.core import base_layer
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/core/base_layer.py", line 27, in <module>
    from lingvo.core import py_utils
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/core/py_utils.py", line 43, in <module>
    from lingvo.core import ops
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/core/ops/__init__.py", line 25, in <module>
    tf.resource_loader.get_path_to_datafile('x_ops.so'))
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/load_library.py", line 54, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/core/ops/x_ops.so: undefined symbol: _ZNK10tensorflow6Status14GetAllPayloadsEv
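
For readers decoding the error above: the missing symbol is a mangled C++ name. The sketch below is a deliberately minimal demangler that handles only simple nested names with no arguments (the function name `demangle_nested_name` is hypothetical; in practice a tool such as `c++filt` does this properly). An undefined symbol like this typically suggests that `x_ops.so` was compiled against TensorFlow headers that do not match the TensorFlow runtime being loaded, though the thread does not confirm that as the cause here.

```python
import re


def demangle_nested_name(mangled: str) -> str:
    """Demangle a simple Itanium-ABI nested name such as _ZNK3foo3BarEv.

    Illustrative only: supports the `_ZN[K]<len><name>...E` form followed
    by `v` (no arguments). Anything else raises ValueError.
    """
    prefix = re.match(r"_ZN(K?)", mangled)
    if not prefix:
        raise ValueError("not a simple nested name")
    is_const = bool(prefix.group(1))  # K marks a const member function
    pos = prefix.end()
    parts = []
    # Each component is encoded as <decimal length><identifier>.
    while pos < len(mangled) and mangled[pos] != "E":
        digits = re.match(r"\d+", mangled[pos:])
        if not digits:
            raise ValueError("unsupported mangled name")
        length = int(digits.group())
        pos += len(digits.group())
        parts.append(mangled[pos:pos + length])
        pos += length
    if pos >= len(mangled):
        raise ValueError("missing terminating 'E'")
    if mangled[pos + 1:] != "v":
        raise ValueError("only argument-less functions are supported")
    return "::".join(parts) + "()" + (" const" if is_const else "")


if __name__ == "__main__":
    print(demangle_nested_name("_ZNK10tensorflow6Status14GetAllPayloadsEv"))
```

Run on the symbol from the traceback, this prints `tensorflow::Status::GetAllPayloads() const`, i.e. the op library expects a `GetAllPayloads` method on `tensorflow::Status` that the loaded TensorFlow library does not export.
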
@laurentes
Contributor

I'm unable to reproduce internally :(
I'm wondering if the external docker image is somehow different.
I will try to reproduce starting from the OSS version.

@laurentes
Contributor

Could you confirm which model name you used? (Probably not lm.ptb.PTBCharTransformerSmallSgd.)

@ruomingp
Contributor Author

Thanks for looking into this, Laurent! It's actually lm.ptb.PTBCharTransformerSmallSgd:

% bazel run -c opt     lingvo/jax:main --  \
   --model=lm.ptb.PTBCharTransformerSmallSgd  \
   --job_log_dir=/tmp/jax_log_dir/exp01 --alsologtostderr

@ruomingp
Contributor Author

I noticed that lingvo/jax/pip_package/build.Dockerfile does not specify dependency versions explicitly, so maybe we are using different versions of TF?

I see:

tensorflow                        2.8.0
tensorflow-datasets               4.5.2
tensorflow-hub                    0.12.0
tensorflow-io-gcs-filesystem      0.24.0
tensorflow-metadata               1.7.0
tensorflow-text                   2.8.1

@laurentes
Contributor

My TF versions for python3.7 are exactly the same as yours.

Separately, and this is definitely not the main issue: just for the record, we didn't open-source the PTB configs such as lm.ptb.PTBCharTransformerSmallSgd, so you may want to try e.g. lm.lm_cloud.LmCloudSpmdTest instead.

2 participants