
Bug on specifying GPU to tutorial example mnist #2292

Closed
gaoteng-git opened this issue May 9, 2016 · 11 comments

@gaoteng-git

gaoteng-git commented May 9, 2016

I tried to specify a GPU ID when running the tutorial example mnist. I changed the code to:

with tf.device('/gpu:3'):
    # Generate placeholders for the images and labels.
    images_placeholder, labels_placeholder = placeholder_inputs(
        FLAGS.batch_size)
    # Build a Graph that computes predictions from the inference model.
    logits = mnist.inference(images_placeholder,
                                FLAGS.hidden1,
                                FLAGS.hidden2)
    # Add to the Graph the Ops for loss calculation.
    loss = mnist.loss(logits, labels_placeholder)

    # Add to the Graph the Ops that calculate and apply gradients.
    train_op = mnist.training(loss, FLAGS.learning_rate)

Then it reports an error when running:

tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'global_step': Could not satisfy explicit device specification '/device:GPU:3' because no supported kernel for GPU devices is available
[[Node: global_step = Variablecontainer="", dtype=DT_INT32, shape=[], shared_name="", _device="/device:GPU:3"]]
Caused by op u'global_step', defined at:
File "fully_connected_feed.py", line 232, in
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "fully_connected_feed.py", line 228, in main
run_training()
File "fully_connected_feed.py", line 150, in run_training
train_op = mnist.training(loss, FLAGS.learning_rate)
File "/search/guangliang/package/tensorflow/tensorflow/examples/tutorials/mnist/mnist.py", line 125, in training
global_step = tf.Variable(0, name='global_step', trainable=False)
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 209, in init
dtype=dtype)
...

Then I fixed line 125 in "mnist.py" with the following code:

with tf.device('/cpu:0'):
    global_step = tf.Variable(0, name='global_step', trainable=False)

Then it reports the following error on rerunning:

tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'gradients/xentropy_mean_grad/Prod': Could not satisfy explicit device specification '/device:GPU:3' because no supported kernel for GPU devices is available
[[Node: gradients/xentropy_mean_grad/Prod = Prod[T=DT_INT32, keep_dims=false, _device="/device:GPU:3"](gradients/xentropy_mean_grad/Shape_2, gradients/xentropy_mean_grad/range_1)]]
Caused by op u'gradients/xentropy_mean_grad/Prod', defined at:
File "fully_connected_feed.py", line 232, in
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "fully_connected_feed.py", line 228, in main
run_training()
File "fully_connected_feed.py", line 150, in run_training
train_op = mnist.training(loss, FLAGS.learning_rate)
File "/search/guangliang/package/tensorflow/tensorflow/examples/tutorials/mnist/mnist.py", line 129, in training
train_op = optimizer.minimize(loss, global_step=global_step)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py", line 190, in minimize
colocate_gradients_with_ops=colocate_gradients_with_ops)
...

Would you please help on this?
Thanks a lot in advance!

@gaoteng-git

gaoteng-git commented May 10, 2016

I just followed mrry's suggestion here, adding "allow_soft_placement=True" as follows:

config = tf.ConfigProto(allow_soft_placement = True)
sess = tf.Session(config = config)

Then it works.

I reviewed the Using GPUs tutorial. It mentions adding "allow_soft_placement" for the error "Could not satisfy explicit device specification '/gpu:X'", but it does not mention that this also resolves the error "no supported kernel for GPU devices is available". It may be worth adding this to the tutorial text to avoid confusing future users.
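The workaround can be sketched end-to-end (TF 1.x API, as used throughout this thread; `/gpu:3` is just the device from this report, and the ops stand in for the tutorial graph):

```python
import tensorflow as tf  # TF 1.x API

# Pin the model ops to a specific GPU, as in the original report.
with tf.device('/gpu:3'):
    a = tf.constant([1.0, 2.0], name='a')
    b = tf.constant([3.0, 4.0], name='b')
    total = tf.add(a, b, name='total')

# allow_soft_placement lets TensorFlow fall back to CPU for ops that
# have no GPU kernel (e.g. the int32 global_step variable), instead of
# failing with "no supported kernel for GPU devices is available".
# log_device_placement prints where each op actually ran, so the
# placement can be verified in the logs.
config = tf.ConfigProto(allow_soft_placement=True,
                        log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(total))
```

This is a configuration sketch; it needs a TF 1.x install (and a fourth GPU for `/gpu:3` to resolve without soft placement).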

@ZhuFengdaaa

Have you noticed that even when no error occurs, the /gpu:3 device is not actually used?

I have a problem, described here, where I cannot make use of the GPUs on the second machine. If I use something like tf.device("/gpu:5"), I get an error like InvalidArgumentError: Cannot assign a device to node.... But if I set allow_soft_placement to True, all tasks end up running on the 4 GPUs of machine A.

@gaoteng-git

GPU 3 really is in use once "allow_soft_placement = True" is added.
It seems the multi-GPU tower style can't assign work to another machine; it can only parallelize work across the GPUs inside one machine. If you want to parallelize across a multi-node GPU cluster, you should try Distributed TensorFlow.
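For the multi-machine case, Distributed TensorFlow addresses devices by job and task rather than by a local GPU index. A minimal sketch (TF 1.x API; the cluster layout, hostnames, and ports here are hypothetical):

```python
import tensorflow as tf  # TF 1.x distributed API

# Hypothetical two-machine cluster; replace hosts/ports with real ones.
cluster = tf.train.ClusterSpec({
    'worker': ['machineA.example:2222', 'machineB.example:2222'],
})
# Each machine starts a server for its own task index.
server = tf.train.Server(cluster, job_name='worker', task_index=0)

# A remote GPU is addressed through its job/task, not a local index
# like /gpu:5 -- a bare /gpu:N only ever refers to a local device.
with tf.device('/job:worker/task:1/gpu:0'):
    weights = tf.Variable(tf.zeros([10]), name='weights')
```

This fragment only runs against a live cluster; it is meant to show the device-string convention, not a complete training setup.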

@ZhuFengdaaa

Yes, you are right. /gpu:%d is for local devices.

@sherrym

sherrym commented May 13, 2016

As @smartcat2010 mentioned, the tutorial is to illustrate the use of allow_soft_placement.

Closing this as it's not a bug.

@MInner

MInner commented Jun 9, 2016

I want to note that after setting tf.ConfigProto(allow_soft_placement=True, log_device_placement=True), TensorFlow does actually place ops on the device you specify (gpu_n), without the "no supported kernel for GPU devices is available" error.

@geyang

geyang commented Sep 10, 2016

Why is this the case?

I'm happy that this also solved my problem, but I'm a bit confused.

According to the docs, allow_soft_placement=True is a flag for finding a substitute device when the specified device is unavailable. In this case we specified a different device that is available, so we shouldn't need this flag.
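One reading of why the flag is still needed (a sketch, based on how TF 1.x kernel registration works): the specified device exists, but a few individual ops in the graph have no GPU kernel at all, so placement fails op-by-op rather than device-by-device. allow_soft_placement moves only those ops to CPU while everything else stays on the requested GPU:

```python
import tensorflow as tf  # TF 1.x

# The GPU itself is available; the failing piece is a single op.
# In TF 1.x an int32 Variable such as global_step has no GPU kernel,
# so pinning the whole graph to /gpu:3 fails on that one node.
with tf.device('/gpu:3'):
    global_step = tf.Variable(0, name='global_step', trainable=False)

# Soft placement silently moves just the unsupported ops to CPU.
config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    # Without allow_soft_placement, this would raise the
    # InvalidArgumentError quoted earlier in the thread.
    sess.run(tf.global_variables_initializer())
```

So the flag isn't substituting for a missing device; it's rescuing the handful of kernel-less ops inside an otherwise valid placement.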

@ilovin

ilovin commented Mar 27, 2017

after setting `allow_soft_placement=True` I get

site-packages/tensorflow/python/framework/test_util.py", line 248, in prepare_config
    config.allow_soft_placement = False
AttributeError: 'NoneType' object has no attribute 'allow_soft_placement'

@adler-j

adler-j commented Dec 19, 2017

I was not getting this issue with TensorFlow 1.1, but after upgrading to 1.4 I keep hitting it (running the exact same file).

If I use allow_soft_placement=True I get a new error:

InvalidArgumentError: AttrValue must not have reference type value of float_ref
	 for attr 'tensor_type'
	; NodeDef: Conv/weights/Adam_1/_515 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3091_Conv/weights/Adam_1", tensor_type=DT_FLOAT_REF, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^optimizer/beta1_power/read/_281, ^optimizer/beta2_power/read/_283, ^optimizer/learning_rate/mul_2/_285, ^optimizer/Adam/beta1/_287, ^optimizer/Adam/beta2/_289, ^optimizer/Adam/epsilon/_291, ^optimizer/gradients/AddN_40/_517); Op<name=_Recv; signature= -> tensor:tensor_type; attr=tensor_type:type; attr=tensor_name:string; attr=send_device:string; attr=send_device_incarnation:int; attr=recv_device:string; attr=client_terminated:bool,default=false; is_stateful=true>
	 [[Node: Conv/weights/Adam_1/_515 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3091_Conv/weights/Adam_1", tensor_type=DT_FLOAT_REF, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^optimizer/beta1_power/read/_281, ^optimizer/beta2_power/read/_283, ^optimizer/learning_rate/mul_2/_285, ^optimizer/Adam/beta1/_287, ^optimizer/Adam/beta2/_289, ^optimizer/Adam/epsilon/_291, ^optimizer/gradients/AddN_40/_517)]]

@magick93

I'm getting this issue on TensorFlow 1.5.

@fmkazemi

I had this problem with tensorflow-gpu 1.8 and tensorflow-gpu 1.5 on GPU clusters, but I did not hit it after installing tensorflow-gpu 1.0.1, so my problem was solved.
Of course, I used the code below for all tests (note the gpu_options must be set on the same config object that is passed to the Session):

config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
config.gpu_options.allocator_type = 'BFC'
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
