
Bug on specifying GPU to tutorial example mnist #2292

Closed
gaoteng-git opened this issue May 9, 2016 · 11 comments

@gaoteng-git

gaoteng-git commented May 9, 2016

I tried to specify a GPU ID when running the tutorial example mnist. I changed the code to:

with tf.device('/gpu:3'):
    # Generate placeholders for the images and labels.
    images_placeholder, labels_placeholder = placeholder_inputs(
        FLAGS.batch_size)
    # Build a Graph that computes predictions from the inference model.
    logits = mnist.inference(images_placeholder,
                                FLAGS.hidden1,
                                FLAGS.hidden2)
    # Add to the Graph the Ops for loss calculation.
    loss = mnist.loss(logits, labels_placeholder)

    # Add to the Graph the Ops that calculate and apply gradients.
    train_op = mnist.training(loss, FLAGS.learning_rate)

Then it reports an error when running:

tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'global_step': Could not satisfy explicit device specification '/device:GPU:3' because no supported kernel for GPU devices is available
[[Node: global_step = Variablecontainer="", dtype=DT_INT32, shape=[], shared_name="", _device="/device:GPU:3"]]
Caused by op u'global_step', defined at:
File "fully_connected_feed.py", line 232, in
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "fully_connected_feed.py", line 228, in main
run_training()
File "fully_connected_feed.py", line 150, in run_training
train_op = mnist.training(loss, FLAGS.learning_rate)
File "/search/guangliang/package/tensorflow/tensorflow/examples/tutorials/mnist/mnist.py", line 125, in training
global_step = tf.Variable(0, name='global_step', trainable=False)
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 209, in init
dtype=dtype)
...

Then I fixed line 125 in "mnist.py" with the following code:

with tf.device('/cpu:0'):
    global_step = tf.Variable(0, name='global_step', trainable=False)

Then it reports the following error on rerunning:

tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'gradients/xentropy_mean_grad/Prod': Could not satisfy explicit device specification '/device:GPU:3' because no supported kernel for GPU devices is available
[[Node: gradients/xentropy_mean_grad/Prod = Prod[T=DT_INT32, keep_dims=false, _device="/device:GPU:3"](gradients/xentropy_mean_grad/Shape_2, gradients/xentropy_mean_grad/range_1)]]
Caused by op u'gradients/xentropy_mean_grad/Prod', defined at:
File "fully_connected_feed.py", line 232, in
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "fully_connected_feed.py", line 228, in main
run_training()
File "fully_connected_feed.py", line 150, in run_training
train_op = mnist.training(loss, FLAGS.learning_rate)
File "/search/guangliang/package/tensorflow/tensorflow/examples/tutorials/mnist/mnist.py", line 129, in training
train_op = optimizer.minimize(loss, global_step=global_step)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py", line 190, in minimize
colocate_gradients_with_ops=colocate_gradients_with_ops)
...

Would you please help on this?
Thanks a lot in advance!

@gaoteng-git

gaoteng-git commented May 10, 2016

I just followed mrry's suggestion here, adding "allow_soft_placement=True" as follows:

config = tf.ConfigProto(allow_soft_placement = True)
sess = tf.Session(config = config)

Then it works.

I reviewed the Using GPUs tutorial. It mentions adding "allow_soft_placement" for the error "Could not satisfy explicit device specification '/gpu:X'", but it does not mention that this also resolves the error "no supported kernel for GPU devices is available". It may be worth adding this to the tutorial text to avoid confusing future users.
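The workaround can be sketched end-to-end (TF 1.x API, as used throughout this thread; `/gpu:3` is just the device from this report, and the ops stand in for the tutorial graph):

```python
import tensorflow as tf  # TF 1.x API

# Pin the model ops to a specific GPU, as in the original report.
with tf.device('/gpu:3'):
    a = tf.constant([1.0, 2.0], name='a')
    b = tf.constant([3.0, 4.0], name='b')
    total = tf.add(a, b, name='total')

# allow_soft_placement lets TensorFlow fall back to CPU for ops that
# have no GPU kernel (e.g. the int32 global_step variable), instead of
# failing with "no supported kernel for GPU devices is available".
# log_device_placement prints where each op actually ran, so the
# placement can be verified in the logs.
config = tf.ConfigProto(allow_soft_placement=True,
                        log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(total))
```

This is a configuration sketch; it needs a TF 1.x install (and a fourth GPU for `/gpu:3` to resolve without soft placement).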

@ZhuFengdaaa

Have you noticed that even when no error occurs, the /gpu:3 device is not actually used?

I have a problem, described here, where I cannot make use of the GPUs on the second machine. If I use something like tf.device("/gpu:5"), I get an error like InvalidArgumentError: Cannot assign a device to node.... But if I set allow_soft_placement to True, all tasks end up running on the 4 GPUs of machine A.

@gaoteng-git

GPU 3 really is in use once "allow_soft_placement = True" is added.
It seems the multi-GPU tower style can't assign work to another machine; it can only parallelize work across the GPUs inside one machine. If you want to parallelize across a multi-node GPU cluster, you should try Distributed TensorFlow.
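For the multi-machine case, Distributed TensorFlow addresses devices by job and task rather than by a local GPU index. A minimal sketch (TF 1.x API; the cluster layout, hostnames, and ports here are hypothetical):

```python
import tensorflow as tf  # TF 1.x distributed API

# Hypothetical two-machine cluster; replace hosts/ports with real ones.
cluster = tf.train.ClusterSpec({
    'worker': ['machineA.example:2222', 'machineB.example:2222'],
})
# Each machine starts a server for its own task index.
server = tf.train.Server(cluster, job_name='worker', task_index=0)

# A remote GPU is addressed through its job/task, not a local index
# like /gpu:5 -- a bare /gpu:N only ever refers to a local device.
with tf.device('/job:worker/task:1/gpu:0'):
    weights = tf.Variable(tf.zeros([10]), name='weights')
```

This fragment only runs against a live cluster; it is meant to show the device-string convention, not a complete training setup.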

@ZhuFengdaaa

Yes, you are right. /gpu:%d is for local devices.

@sherrym

sherrym commented May 13, 2016

As @smartcat2010 mentioned, the tutorial is to illustrate the use of allow_soft_placement.

Closing this as it's not a bug.

@MInner

MInner commented Jun 9, 2016

I want to note that after setting tf.ConfigProto(allow_soft_placement=True, log_device_placement=True), TensorFlow does actually place ops on the device you specify (gpu_n), without the "no supported kernel for GPU devices is available" error.

@geyang

geyang commented Sep 10, 2016

Why is this the case?

I'm happy that this also solved my problem, but I'm a bit confused.

According to the docs, allow_soft_placement=True is a flag for finding a substitute device when the specified device is unavailable. In this case we specified a different device that is available, so we shouldn't need this flag.
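One reading of why the flag is still needed (a sketch, based on how TF 1.x kernel registration works): the specified device exists, but a few individual ops in the graph have no GPU kernel at all, so placement fails op-by-op rather than device-by-device. allow_soft_placement moves only those ops to CPU while everything else stays on the requested GPU:

```python
import tensorflow as tf  # TF 1.x

# The GPU itself is available; the failing piece is a single op.
# In TF 1.x an int32 Variable such as global_step has no GPU kernel,
# so pinning the whole graph to /gpu:3 fails on that one node.
with tf.device('/gpu:3'):
    global_step = tf.Variable(0, name='global_step', trainable=False)

# Soft placement silently moves just the unsupported ops to CPU.
config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    # Without allow_soft_placement, this would raise the
    # InvalidArgumentError quoted earlier in the thread.
    sess.run(tf.global_variables_initializer())
```

So the flag isn't substituting for a missing device; it's rescuing the handful of kernel-less ops inside an otherwise valid placement.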

@ilovin

ilovin commented Mar 27, 2017

after setting `allow_soft_placement=True` I get

site-packages/tensorflow/python/framework/test_util.py", line 248, in prepare_config
    config.allow_soft_placement = False
AttributeError: 'NoneType' object has no attribute 'allow_soft_placement'

@adler-j

adler-j commented Dec 19, 2017

I was not getting this issue with TensorFlow 1.1, but after upgrading to 1.4 I keep hitting it (running the exact same file).

If I use allow_soft_placement=True I get a new error:

InvalidArgumentError: AttrValue must not have reference type value of float_ref
	 for attr 'tensor_type'
	; NodeDef: Conv/weights/Adam_1/_515 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3091_Conv/weights/Adam_1", tensor_type=DT_FLOAT_REF, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^optimizer/beta1_power/read/_281, ^optimizer/beta2_power/read/_283, ^optimizer/learning_rate/mul_2/_285, ^optimizer/Adam/beta1/_287, ^optimizer/Adam/beta2/_289, ^optimizer/Adam/epsilon/_291, ^optimizer/gradients/AddN_40/_517); Op<name=_Recv; signature= -> tensor:tensor_type; attr=tensor_type:type; attr=tensor_name:string; attr=send_device:string; attr=send_device_incarnation:int; attr=recv_device:string; attr=client_terminated:bool,default=false; is_stateful=true>
	 [[Node: Conv/weights/Adam_1/_515 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3091_Conv/weights/Adam_1", tensor_type=DT_FLOAT_REF, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^optimizer/beta1_power/read/_281, ^optimizer/beta2_power/read/_283, ^optimizer/learning_rate/mul_2/_285, ^optimizer/Adam/beta1/_287, ^optimizer/Adam/beta2/_289, ^optimizer/Adam/epsilon/_291, ^optimizer/gradients/AddN_40/_517)]]

@magick93

I'm getting this issue on TensorFlow 1.5.

@fmkazemi

I had this problem with tensorflow-gpu 1.8 and tensorflow-gpu 1.5 on GPU clusters, but I did not hit it after installing tensorflow-gpu 1.0.1, so my problem was solved.
Of course, I used the code below for all tests (note the gpu_options must be set on the same config object that is passed to the Session):

config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
config.gpu_options.allocator_type = 'BFC'
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
