Target //lingvo/jax:main failed to build #281

Open
ruomingp opened this issue Mar 19, 2022 · 9 comments

Comments

@ruomingp
Contributor

To reproduce:

% docker build --tag tensorflow:lingvo - < "$LINGVO_DIR/lingvo/jax/pip_package/build.Dockerfile"
...
% docker run --rm $(test "$LINGVO_DEVICE" = "gpu" && echo "--runtime=nvidia") -it -v ${LINGVO_DIR}:/tmp/lingvo -v ${HOME}/.gitconfig:/home/${USER}/.gitconfig:ro -p 6006:6006 -p 8888:8888 --name lingvo tensorflow:lingvo bash
#
# bazel run -c opt \
>     lingvo/jax:main -- \
>     --model=lm.ptb.PTBCharTransformerSmallSgd \
>     --job_log_dir=/tmp/jax_log_dir/exp01 --alsologtostderr
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
DEBUG: Rule 'subpar' indicated that a canonical reproducible form can be obtained by modifying arguments commit = "35bb9f0092f71ea56b742a520602da9b3638a24f", shallow_since = "1557863961 -0400" and dropping ["tag"]
DEBUG: Repository subpar instantiated at:
  /tmp/lingvo/WORKSPACE:12:15: in <toplevel>
Repository rule git_repository defined at:
  /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/external/bazel_tools/tools/build_defs/repo/git.bzl:199:33: in <toplevel>
INFO: Analyzed target //lingvo/jax:main (36 packages loaded, 6669 targets configured).
INFO: Found 1 target...
INFO: From Compiling icu4c/source/common/unistr.cpp:
external/icu/icu4c/source/common/unistr.cpp:1975:13: warning: 'void uprv_UnicodeStringDummy()' defined but not used [-Wunused-function]
 static void uprv_UnicodeStringDummy(void) {
             ^
INFO: From Compiling icu4c/source/common/ucptrie.cpp:
external/icu/icu4c/source/common/ucptrie.cpp: In function 'UChar32 {anonymous}::getRange(const void*, UChar32, uint32_t (*)(const void*, uint32_t), const void*, uint32_t*)':
external/icu/icu4c/source/common/ucptrie.cpp:404:5: warning: 'value' may be used uninitialized in this function [-Wmaybe-uninitialized]
     if (maybeFilterValue(highValue, trie->nullValue, nullValue,
     ^
INFO: From Compiling lingvo/core/ops/record_yielder.cc:
lingvo/core/ops/record_yielder.cc:347:6: warning: 'tensorflow::lingvo::{anonymous}::register_text_iterator' defined but not used [-Wunused-variable]
 bool register_text_iterator = RecordIterator::Register(
      ^
lingvo/core/ops/record_yielder.cc:356:6: warning: 'tensorflow::lingvo::{anonymous}::register_indirect_text_iterator' defined but not used [-Wunused-variable]
 bool register_indirect_text_iterator =
      ^
lingvo/core/ops/record_yielder.cc:366:6: warning: 'tensorflow::lingvo::{anonymous}::register_tf_record_iterator' defined but not used [-Wunused-variable]
 bool register_tf_record_iterator =
      ^
lingvo/core/ops/record_yielder.cc:371:6: warning: 'tensorflow::lingvo::{anonymous}::register_tf_record_gzip_iterator' defined but not used [-Wunused-variable]
 bool register_tf_record_gzip_iterator =
      ^
lingvo/core/ops/record_yielder.cc:376:6: warning: 'tensorflow::lingvo::{anonymous}::register_iota_iterator' defined but not used [-Wunused-variable]
 bool register_iota_iterator = RecordIterator::RegisterWithPatternParser(
      ^
ERROR: /tmp/lingvo/lingvo/core/ops/BUILD:180:18: Compiling lingvo/core/ops/input_common.cc failed: (Exit 1): gcc failed: error executing command /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 64 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox gcc failed: error executing command /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 64 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox
In file included from lingvo/core/ops/input_common.cc:16:0:
./lingvo/core/ops/input_common.h:143:55: error: expected class-name before '{' token
 class InputResource : public tensorflow::ResourceBase {
                                                       ^
./lingvo/core/ops/input_common.h: In member function 'void tensorflow::lingvo::InputOpV2Create<RecordProcessorClass>::Compute(tensorflow::OpKernelContext*)':
./lingvo/core/ops/input_common.h:228:9: error: 'MakeRefCountingHandle' is not a member of 'tensorflow::ResourceHandle'
         ResourceHandle::MakeRefCountingHandle(resource, ctx->device()->name(),
         ^
./lingvo/core/ops/input_common.h: In member function 'void tensorflow::lingvo::InputOpV2GetNext<RecordProcessorClass>::Compute(tensorflow::OpKernelContext*)':
./lingvo/core/ops/input_common.h:252:28: error: 'const class tensorflow::ResourceHandle' has no member named 'GetResource'
     auto statusor = handle.GetResource<resource_type>();
                            ^
./lingvo/core/ops/input_common.h:252:53: error: expected primary-expression before '>' token
     auto statusor = handle.GetResource<resource_type>();
                                                     ^
./lingvo/core/ops/input_common.h:252:55: error: expected primary-expression before ')' token
     auto statusor = handle.GetResource<resource_type>();
                                                       ^
Target //lingvo/jax:main failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 257.942s, Critical Path: 137.19s
INFO: 147 processes: 22 internal, 125 processwrapper-sandbox.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully
@laurentes
Contributor

Are you able to build using the provided build.sh script?
https://github.com/tensorflow/lingvo/blob/master/lingvo/jax/pip_package/build.sh

The script sets several environment variables and flags, and it is hard for me to infer which of them you need to fix your issue.

@laurentes
Contributor

Nvm, I could reproduce your issue after modifying the script.

@laurentes
Contributor

Heads up that I have a fix (pending review) that will hopefully land Tuesday morning PDT.

@laurentes
Contributor

This should be fixed if you sync after fe60d03.
Also make sure to update your Docker build with the latest optax-shampoo (v0.0.5).
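
To confirm that the rebuilt image really picked up the newer package, a quick check along these lines may help. This is only a sketch: it assumes Python 3.8 or newer for `importlib.metadata` (on 3.7 the `importlib_metadata` backport offers the same interface), and the helper names `installed_version` and `at_least` are hypothetical, not part of Lingvo.

```python
from importlib import metadata  # Assumption: Python 3.8+; not from the original thread.


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def at_least(version, minimum):
    """Compare purely numeric dotted versions such as '0.0.5' component by component.

    Raises ValueError for versions with non-numeric parts (e.g. '1.0rc1').
    """
    def parse(v):
        return tuple(int(part) for part in v.split("."))
    return parse(version) >= parse(minimum)


if __name__ == "__main__":
    # "optax-shampoo" and "0.0.5" are the package and version named above.
    found = installed_version("optax-shampoo")
    if found is None:
        print("optax-shampoo is not installed for this interpreter")
    else:
        status = "ok" if at_least(found, "0.0.5") else "older than 0.0.5"
        print("optax-shampoo", found, status)
```

Run it with the same interpreter that bazel will use, since each Python installation has its own set of installed packages.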

@ruomingp
Contributor Author

Thank you so much, Laurent. Let me try it.

@ruomingp
Contributor Author

Strangely, I am now running into the "No module named 'clu'" error again:

# bazel run -c opt \
>     lingvo/jax:main -- \
>     --model=lm.ptb.PTBCharTransformerSmallSgd \
>     --job_log_dir=/tmp/jax_log_dir/exp01 --alsologtostderr
Extracting Bazel installation...
...
INFO: Build completed successfully, 239 total actions
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/jax/main.runfiles/__main__/lingvo/jax/main.py", line 34, in <module>
    from clu import platform
ModuleNotFoundError: No module named 'clu'

even though it's installed according to pip list:

root@5c3049184a19:/tmp/lingvo# pip list
Package                           Version
--------------------------------- -------------------
absl-py                           1.0.0
...
clu                               0.0.6
...

@ruomingp
Contributor Author

Before this run, Docker was running out of space, so I ran:

docker system prune --all

@laurentes
Contributor

I think I know the issue. I should have warned you about this.

Could you just run python3, check which version is the default, and try to import clu?

My intuition is that the default will be python3.6, which is unsupported / doesn't come with the right dependencies.

All you have to do is to set another python version as the default, e.g. using:
update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.7 1
and then re-run your bazel command.
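
The check described above can be sketched as a small script. It is an illustration only: `interpreter_report` is a hypothetical helper, and the `(3, 7)` minimum is an assumption taken from the python3.7 suggestion in this comment, not a documented Lingvo requirement.

```python
import importlib.util
import sys


def interpreter_report(module_name, minimum=(3, 7)):
    """Summarize which interpreter is running and whether it can see a module."""
    return {
        # Path of the interpreter actually executing this script.
        "executable": sys.executable,
        "version": tuple(sys.version_info[:3]),
        # Assumed minimum version; adjust to whatever the project supports.
        "meets_minimum": sys.version_info[:2] >= minimum,
        # find_spec returns None when a top-level module cannot be located.
        "module_found": importlib.util.find_spec(module_name) is not None,
    }


if __name__ == "__main__":
    for key, value in interpreter_report("clu").items():
        print(f"{key}: {value}")
```

Saving this to a file and running it with python3 before and after the update-alternatives call should show whether the default interpreter changed and whether clu is visible to it. A package listed by pip can still be missing here if pip belongs to a different Python installation.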

@ruomingp
Contributor Author

Thanks!

Indeed, the default is 3.6, and update-alternatives solves the problem. Can we update the Dockerfile to avoid 3.6?

# python3 --version
Python 3.6.13

After that, I'm running into another issue. Let me file a separate issue.
