No OpKernel was registered to support Op 'PreprocessingForward' Error for Multi Machine, Multi GPU #829

Open
wangcaihua opened this issue Apr 24, 2023 · 0 comments

Comments

@wangcaihua

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 20.04): Linux Ubuntu 20.04, Official GPU Image 2304
  • DeepRec version or commit id: deeprec2302
  • Python version: 3.8.10
  • Bazel version (if compiling from source): not compiling from source
  • GCC/Compiler version (if compiling from source): gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
  • CUDA/cuDNN version: 11.6

Describe the current behavior
[1,9]:Traceback (most recent call last):
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
[1,9]: return fn(*args)
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1348, in _run_fn
[1,9]: self._extend_graph()
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1388, in _extend_graph
[1,9]: tf_session.ExtendSession(self._session)
[1,9]:tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'PreprocessingForward' used by {{node input_layer/input_layer/group_embedding_lookup/PreprocessingForward/PreprocessingForward}}with these attrs: [rank=9, id_in_local_rank=0, num_ranks=16, num_gpus=16, Toffsets=DT_INT64, Tindices=DT_INT64, num_lookups=26, combiners=["mean", "mean", "mean", "mean", "mean", ..., "mean", "mean", "mean", "mean", "mean"], dimensions=[16, 16, 16, 16, 16, ..., 16, 16, 16, 16, 16], shard=[-1, -1, -1, -1, -1, ..., -1, -1, -1, -1, -1]]
[1,9]:Registered devices: [CPU, XLA_CPU]
[1,9]:Registered kernels:
[1,9]: device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT32]
[1,9]: device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT64]
[1,9]: device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT32]
[1,9]: device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT64]
[1,9]:
[1,9]: [[input_layer/input_layer/group_embedding_lookup/PreprocessingForward/PreprocessingForward]]
[1,9]:
[1,9]:During handling of the above exception, another exception occurred:
[1,9]:
[1,9]:Traceback (most recent call last):
[1,9]: File "train.py", line 887, in <module>
[1,9]: main()
[1,9]: File "train.py", line 642, in main
[1,9]: train(sess_config, hooks, model, train_init_op, train_steps,
[1,9]: File "train.py", line 505, in train
[1,9]: with tf.train.MonitoredTrainingSession(
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 655, in MonitoredTrainingSession
[1,9]: return MonitoredSession(
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1085, in __init__
[1,9]: super(MonitoredSession, self).__init__(
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 800, in __init__
[1,9]: self._sess = _RecoverableSession(self._coordinated_creator)
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1282, in __init__
[1,9]: _WrappedSession.__init__(self, self._create_session())
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1287, in _create_session
[1,9]: return self._sess_creator.create_session()
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 953, in create_session
[1,9]: self.tf_sess = self._session_creator.create_session()
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 713, in create_session
[1,9]: return self._get_session_manager().prepare_session(
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/session_manager.py", line 306, in prepare_session
[1,9]: sess.run(init_op, feed_dict=init_feed_dict)
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 955, in run
[1,9]: result = self._run(None, fetches, feed_dict, options_ptr,
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1179, in _run
[1,9]: results = self._do_run(handle, final_targets, final_fetches,
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1358, in _do_run
[1,9]: return self._do_call(_run_fn, feeds, fetches, targets, options,
[1,9]: File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
[1,9]: raise type(e)(node_def, op, message)
[1,9]:tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'PreprocessingForward' used by node input_layer/input_layer/group_embedding_lookup/PreprocessingForward/PreprocessingForward (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [rank=9, id_in_local_rank=0, num_ranks=16, num_gpus=16, Toffsets=DT_INT64, Tindices=DT_INT64, num_lookups=26, combiners=["mean", "mean", "mean", "mean", "mean", ..., "mean", "mean", "mean", "mean", "mean"], dimensions=[16, 16, 16, 16, 16, ..., 16, 16, 16, 16, 16], shard=[-1, -1, -1, -1, -1, ..., -1, -1, -1, -1, -1]]
[1,9]:Registered devices: [CPU, XLA_CPU]
[1,9]:Registered kernels:
[1,9]: device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT32]
[1,9]: device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT64]
[1,9]: device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT32]
[1,9]: device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT64]
[1,9]:
[1,9]: [[input_layer/input_layer/group_embedding_lookup/PreprocessingForward/PreprocessingForward]]
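
One detail in the traceback above stands out: the registered devices on this rank are only CPU and XLA_CPU, while every registered PreprocessingForward kernel is for device='GPU'. That suggests, but does not prove, that the process for rank 9 did not see a GPU at session creation. The following is a minimal sketch for pulling that mismatch out of a log like this one. The function names `parse_kernel_report` and `missing_devices` are hypothetical helpers written for illustration, not part of DeepRec or TensorFlow, and the prefix pattern assumes rank tags of the form `[1,9]:` as they appear here.

```python
import re

# Assumed rank-tag prefix as seen in this log, e.g. "[1,9]:" (an optional
# "<stream>" part is tolerated in case the launcher emits one).
_PREFIX = re.compile(r"^\[\d+,\d+\](?:<\w+>)?:\s*")


def parse_kernel_report(log_text):
    """Return (registered_devices, kernel_devices) found in a 'No OpKernel' log."""
    devices, kernel_devices = set(), set()
    for raw in log_text.splitlines():
        line = _PREFIX.sub("", raw)
        dev = re.match(r"Registered devices: \[(.*)\]", line)
        if dev:
            devices.update(d.strip() for d in dev.group(1).split(",") if d.strip())
        kern = re.match(r"device='(\w+)'", line)
        if kern:
            kernel_devices.add(kern.group(1))
    return devices, kernel_devices


def missing_devices(log_text):
    """Device types that have a kernel for the op but are not registered."""
    devices, kernel_devices = parse_kernel_report(log_text)
    return sorted(kernel_devices - devices)


# Lines copied from the traceback in this issue.
sample = """[1,9]:Registered devices: [CPU, XLA_CPU]
[1,9]:Registered kernels:
[1,9]: device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT32]
[1,9]: device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT64]
"""

print(missing_devices(sample))
```

Running the helper over the output of each rank would show whether only some ranks are missing a GPU device, which may help narrow the problem to particular machines.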

