Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TensorFlow 2.5 compatibility #401

Open
oliverhu opened this issue Jul 10, 2021 · 1 comment
Open

TensorFlow 2.5 compatibility #401

oliverhu opened this issue Jul 10, 2021 · 1 comment

Comments

@oliverhu
Copy link

Describe the bug
The example tensorflow2_keras_mnist.py example is not compatible with TF 2.5.

To Reproduce
Steps to reproduce the behavior:

  1. install tensorflow-gpu (2.5)
  2. install byteps latest (0.2.5)
2021-07-09 20:47:55.265246: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "tensorflow2_keras_mnist.py", line 19, in <module>
    import byteps.tensorflow.keras as bps
  File "/home/pi/p3/lib/python3.8/site-packages/byteps/tensorflow/__init__.py", line 27, in <module>
    from byteps.tensorflow.ops import broadcast, _push_pull
  File "/home/pi/p3/lib/python3.8/site-packages/byteps/tensorflow/ops.py", line 52, in <module>
    C_LIB = _load_library('c_lib' + get_ext_suffix())
  File "/home/pi/p3/lib/python3.8/site-packages/byteps/tensorflow/ops.py", line 48, in _load_library
    library = load_library.load_op_library(filename)
  File "/home/pi/p3/lib/python3.8/site-packages/tensorflow/python/framework/load_library.py", line 58, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /home/pi/p3/lib/python3.8/site-packages/byteps/tensorflow/c_lib.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN10tensorflow6StatusC1ENS_5error4CodeEN4absl14lts_2020_02_2511string_viewEOSt6vectorINS_10StackFrameESaIS7_EE
(p3) pi@pig ~/tf/byteps/example/tensorflow (master) $

Expected behavior
It should work

Environment (please complete the following information):

  • OS: Ubuntu 20.04
  • GCC version: 9.3
  • CUDA and NCCL version: 11.4, 8.2
  • Framework (TF, PyTorch, MXNet): TF
@oliverhu
Copy link
Author

Recompiled byteps locally and the incompatibility issue is gone, got a new issue:

(p3) pi@pig ~/tf/byteps (master) $ python example/tensorflow/tensorflow2_keras_mnist.py
2021-07-09 21:08:50.745545: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-09 21:08:52.169294: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-07-09 21:08:52.169430: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-09 21:08:52.169951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:08:00.0 name: NVIDIA GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.6325GHz coreCount: 28 deviceMemorySize: 10.91GiB deviceMemoryBandwidth: 451.17GiB/s
2021-07-09 21:08:52.169967: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-09 21:08:52.172159: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-07-09 21:08:52.172192: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-07-09 21:08:52.173500: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-07-09 21:08:52.173668: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-07-09 21:08:52.175531: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-07-09 21:08:52.176099: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-07-09 21:08:52.176199: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-07-09 21:08:52.176287: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-09 21:08:52.176844: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-09 21:08:52.177321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-07-09 21:08:52.447661: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-07-09 21:08:52.448289: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-09 21:08:52.448825: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:08:00.0 name: NVIDIA GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.6325GHz coreCount: 28 deviceMemorySize: 10.91GiB deviceMemoryBandwidth: 451.17GiB/s
2021-07-09 21:08:52.448880: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-09 21:08:52.449390: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-09 21:08:52.449868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-07-09 21:08:52.449901: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-09 21:08:52.449981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-07-09 21:08:52.449992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0
2021-07-09 21:08:52.449999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N
2021-07-09 21:08:52.450072: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-09 21:08:52.450587: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-09 21:08:52.451099: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-09 21:08:52.451592: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10373 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce GTX 1080 Ti, pci bus id: 0000:08:00.0, compute capability: 6.1)
Epoch 1/24
Traceback (most recent call last):
  File "example/tensorflow/tensorflow2_keras_mnist.py", line 94, in <module>
    mnist_model.fit(dataset, steps_per_epoch=500 // bps.size(), callbacks=callbacks, epochs=24, verbose=verbose)
  File "/home/pi/p3/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1183, in fit
    tmp_logs = self.train_function(iterator)
  File "/home/pi/p3/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "/home/pi/p3/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 933, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/home/pi/p3/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 763, in _initialize
    self._stateful_fn._get_concrete_function_internal_garbage_collected(  # pylint: disable=protected-access
  File "/home/pi/p3/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3050, in _get_concrete_function_internal_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "/home/pi/p3/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3444, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/home/pi/p3/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3279, in _create_graph_function
    func_graph_module.func_graph_from_py_func(
  File "/home/pi/p3/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 999, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/home/pi/p3/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 672, in wrapped_fn
    out = weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/home/pi/p3/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 986, in wrapper
    raise e.ag_error_metadata.to_exception(e)
Exception: in user code:

    /home/pi/p3/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:855 train_function  *
        return step_function(self, iterator)
    /home/pi/p3/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:845 step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    /home/pi/p3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py:1285 run
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /home/pi/p3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py:2833 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /home/pi/p3/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py:3608 _call_for_each_replica
        return fn(*args, **kwargs)
    /home/pi/p3/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:838 run_step  **
        outputs = model.train_step(data)
    /home/pi/p3/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:799 train_step
        self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    /home/pi/p3/lib/python3.8/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:530 minimize
        return self.apply_gradients(grads_and_vars, name=name)
    /home/pi/p3/lib/python3.8/site-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/_keras/__init__.py:70 apply_gradients
        raise Exception('`apply_gradients()` was called without a call to '

    Exception: `apply_gradients()` was called without a call to `get_gradients()` or `_aggregate_gradients`. If you're using TensorFlow 2.0, please specify `experimental_run_tf_function=False` in `compile()`.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant