Is "multi-gpus training support" in next branch only for linux? #1437

Closed
trainewbie opened this issue May 16, 2018 · 9 comments

Comments

@trainewbie

My OS is Windows 7 (with Python 3.5, 3.6, or Anaconda; TensorFlow 1.4; CUDA 8.0; and cuDNN 6.0).
With the master branch tf code, I had no problems using net_to_model and training a network on a single GPU.

When I run the python net_to_model.py and python parse.py commands with the multi-GPU tf code, errors occur.
Am I missing something?

@gcp
Member

gcp commented May 16, 2018

errors occur.

What errors occur?

@trainewbie
Author

trainewbie commented May 16, 2018

This is the error message I get when using net_to_model:

J:\mtraining\tf>python net_to_model.py a.txt
Version 1
Channels 256
Blocks 20
2018-05-17 02:58:29.349474: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that th
is TensorFlow binary was not compiled to use: AVX AVX2
2018-05-17 02:58:29.861504: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.721
pciBusID: 0000:01:00.0
totalMemory: 11.00GiB freeMemory: 10.72GiB
2018-05-17 02:58:30.180522: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.721
pciBusID: 0000:02:00.0
totalMemory: 11.00GiB freeMemory: 10.73GiB
2018-05-17 02:58:30.196523: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1045] Device peer to peer matrix
2018-05-17 02:58:30.203523: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1051] DMA: 0 1
2018-05-17 02:58:30.210524: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1061] 0: Y N
2018-05-17 02:58:30.217524: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1061] 1: N Y
2018-05-17 02:58:30.225525: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/devic
e:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute cap
ability: 6.1)
2018-05-17 02:58:30.237525: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/devic
e:GPU:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute cap
ability: 6.1)
Traceback (most recent call last):
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\common_shapes.py", line 686, in _call_cpp_shape_fn_impl
input_tensors_as_shapes, status)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\errors_impl.py", line 473, in exit
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension size must be eve
nly divisible by 2 but is 1
Number of ways to split should evenly divide the split dimension for 'split' (op
: 'Split') with input shapes: [], [1,18,361] and with computed input tensors: input[0] =
<0>.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "net_to_model.py", line 27, in
tfprocess.init(batch_size=1)
File "J:\mtraining\tf\tfprocess.py", line 162, in init
self.init_net(planes, probs, winner)
File "J:\mtraining\tf\tfprocess.py", line 166, in init_net
self.sx = tf.split(planes, self.gpus_num)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\ops\array_ops.py", line 1265, in split
split_dim=axis, num_split=num_or_size_splits, value=value, name=name)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\ops\gen_array_ops.py", line 5093, in _split
name=name)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\ops.py", line 2958, in create_op
set_shapes_for_outputs(ret)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\ops.py", line 2209, in set_shapes_for_outputs
shapes = shape_func(op)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\ops.py", line 2159, in call_with_requiring
return call_cpp_shape_fn(op, require_shape_fn=True)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\common_shapes.py", line 627, in call_cpp_shape_fn
require_shape_fn)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\common_shapes.py", line 691, in _call_cpp_shape_fn_impl
raise ValueError(err.message)
ValueError: Dimension size must be evenly divisible by 2 but is 1
Number of ways to split should evenly divide the split dimension for 'split' (op
: 'Split') with input shapes: [], [1,18,361] and with computed input tensors: input[0] =
<0>.

J:\mtraining\tf>

@godmoves
Contributor

godmoves commented May 17, 2018

Number of ways to split should evenly divide the split dimension for 'split' (op
: 'Split') with input shapes: [], [1,18,361] and with computed input tensors: input[0] =
<0>.

This looks like an error from tf.split; I will check it.

UPDATE: When using net_to_model, the number of GPUs should be set to 1 :)
I have fixed this in a new branch. @trainewbie, can you check whether it works?
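
For readers following along: the traceback shows tf.split(planes, self.gpus_num) being asked to cut a batch of size 1 (shape [1,18,361]) into 2 equal parts, which cannot be done. Below is a minimal pure-Python sketch of that divisibility rule. It is not the project's code; the helper name split_batch is hypothetical, and it only imitates the check that produced the error message.

```python
def split_batch(batch, gpus_num):
    """Split a batch (a list of samples) into gpus_num equal shards.

    Imitates the rule seen in the traceback: when splitting into an
    integer number of parts, the batch size must be evenly divisible.
    """
    if gpus_num < 1:
        raise ValueError("gpus_num must be at least 1")
    if len(batch) % gpus_num != 0:
        raise ValueError(
            "Dimension size must be evenly divisible by {} but is {}".format(
                gpus_num, len(batch)))
    shard = len(batch) // gpus_num
    # One contiguous slice of the batch per GPU.
    return [batch[i * shard:(i + 1) * shard] for i in range(gpus_num)]


if __name__ == "__main__":
    print(split_batch(list(range(8)), 4))  # four shards of two samples each
    print(split_batch([0], 1))             # batch_size=1 works with one GPU
    try:
        split_batch([0], 2)                # batch_size=1 with two GPUs
    except ValueError as err:
        print("error:", err)
```

With batch_size=1, as net_to_model.py uses, the only split count that divides evenly is 1, which matches the advice to set the GPU number to 1 for that script.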

@trainewbie
Author

trainewbie commented May 17, 2018

Thanks @godmoves. I downloaded the tf code from your new branch and got the result below.
(Before running it, I set 256 filters, 20 blocks, and self.gpu_num=1;
for one more test, I set 256 filters, 20 blocks, and self.gpu_num=4.)

J:\mtrainingfix\tf>python net_to_model.py b.txt
Version 1
Channels 256
Blocks 20
2018-05-17 12:35:35.438078: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that th
is TensorFlow binary was not compiled to use: AVX AVX2
2018-05-17 12:35:35.903105: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.721
pciBusID: 0000:01:00.0
totalMemory: 11.00GiB freeMemory: 10.65GiB
2018-05-17 12:35:36.158120: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.721
pciBusID: 0000:02:00.0
totalMemory: 11.00GiB freeMemory: 10.73GiB
2018-05-17 12:35:36.455137: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.721
pciBusID: 0000:03:00.0
totalMemory: 11.00GiB freeMemory: 10.73GiB
2018-05-17 12:35:36.715152: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.721
pciBusID: 0000:04:00.0
totalMemory: 11.00GiB freeMemory: 10.73GiB
2018-05-17 12:35:36.728152: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1045] Device peer to peer matrix
2018-05-17 12:35:36.735153: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1051] DMA: 0 1 2 3
2018-05-17 12:35:36.742153: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1061] 0: Y N N N
2018-05-17 12:35:36.748153: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1061] 1: N Y N N
2018-05-17 12:35:36.754154: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1061] 2: N N Y N
2018-05-17 12:35:36.761154: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1061] 3: N N N Y
2018-05-17 12:35:36.767154: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/devic
e:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute cap
ability: 6.1)
2018-05-17 12:35:36.778155: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/devic
e:GPU:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute cap
ability: 6.1)
2018-05-17 12:35:36.790156: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/devic
e:GPU:2) -> (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute cap
ability: 6.1)
2018-05-17 12:35:36.801156: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/devic
e:GPU:3) -> (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute cap
ability: 6.1)
Traceback (most recent call last):
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\client\session.py", line 1323, in _do_call
return fn(*args)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\client\session.py", line 1293, in _run_fn
self._extend_graph()
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\client\session.py", line 1354, in _extend_graph
self._session, graph_def.SerializeToString(), status)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\errors_impl.py", line 473, in exit
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for
operation 'tower_0/get_regularization_penalty/l2_regularizer_45/L2Loss': Could not sati
sfy explicit device specification '/device:GPU:0' because no supported kernel for GPU de
vices is available.
[[Node: tower_0/get_regularization_penalty/l2_regularizer_45/L2Loss = L2LossT=
DT_FLOAT, _device="/device:GPU:0"
]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "net_to_model.py", line 27, in
tfprocess.init(batch_size=1, gpus_num=1)
File "J:\mtrainingfix\tf\tfprocess.py", line 164, in init
self.init_net(planes, probs, winner, gpus_num)
File "J:\mtrainingfix\tf\tfprocess.py", line 287, in init_net
self.session.run(tf.global_variables_initializer())
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\client\session.py", line 889, in run
run_metadata_ptr)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\client\session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\client\session.py", line 1317, in _do_run
options, run_metadata)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\client\session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for
operation 'tower_0/get_regularization_penalty/l2_regularizer_45/L2Loss': Could not sati
sfy explicit device specification '/device:GPU:0' because no supported kernel for GPU de
vices is available.
[[Node: tower_0/get_regularization_penalty/l2_regularizer_45/L2Loss = L2LossT=
DT_FLOAT, _device="/device:GPU:0"
]]

Caused by op 'tower_0/get_regularization_penalty/l2_regularizer_45/L2Loss', defined at:
File "net_to_model.py", line 27, in
tfprocess.init(batch_size=1, gpus_num=1)
File "J:\mtrainingfix\tf\tfprocess.py", line 164, in init
self.init_net(planes, probs, winner, gpus_num)
File "J:\mtrainingfix\tf\tfprocess.py", line 191, in init_net
self.sx[i], self.sy_[i], self.sz_[i])
File "J:\mtrainingfix\tf\tfprocess.py", line 322, in tower_loss
tf.contrib.layers.apply_regularization(regularizer, reg_variables)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\contrib\layers\python\layers\regularizers.py", line 193, in apply_regularization
penalties = [regularizer(w) for w in weights_list]
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\contrib\layers\python\layers\regularizers.py", line 193, in
penalties = [regularizer(w) for w in weights_list]
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\contrib\layers\python\layers\regularizers.py", line 107, in l2
return standard_ops.multiply(my_scale, nn.l2_loss(weights), name=name)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\ops\gen_nn_ops.py", line 2586, in l2_loss
"L2Loss", t=t, name=name)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\ops.py", line 2956, in create_op
op_def=op_def)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\ops.py", line 1470, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'to
wer_0/get_regularization_penalty/l2_regularizer_45/L2Loss': Could not satisfy explicit d
evice specification '/device:GPU:0' because no supported kernel for GPU devices is avail
able.
[[Node: tower_0/get_regularization_penalty/l2_regularizer_45/L2Loss = L2LossT=
DT_FLOAT, _device="/device:GPU:0"
]]

J:\mtrainingfix\tf>

The blue text above is rendered strangely. The original text is T=DT_FLOAT, _device="/device:GPU:0"](w_fc_3/read)]].

@bjiyxo

bjiyxo commented May 17, 2018

I have heard that this can be solved by compiling TensorFlow yourself. You may want to try that.

@trainewbie
Author

trainewbie commented May 17, 2018

I'll try compiling TF 1.4 and rebuilding it as a .whl file.
I guess it will take 2 to 4 hours. OMG.

Thanks @bjiyxo

@godmoves
Contributor

godmoves commented May 17, 2018

@trainewbie This issue is related to TensorFlow's variable assignment mechanism, and soft placement may solve it. I have updated the code; can you check it?

EDIT: something similar here
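
For context, in TensorFlow 1.x soft placement is a session option: with allow_soft_placement=True, an op pinned to a device that has no kernel for it (here L2Loss on /device:GPU:0) can be placed on another device instead of raising InvalidArgumentError. The sketch below shows one way to turn it on. The helper names soft_placement_options and make_session are hypothetical, and the thread does not show whether the updated branch does exactly this.

```python
def soft_placement_options(log_placement=False):
    """Keyword arguments for tf.ConfigProto (TensorFlow 1.x).

    allow_soft_placement lets TensorFlow fall back to another device
    when the requested one has no kernel for an op.
    log_device_placement optionally logs where each op was placed.
    """
    return {
        "allow_soft_placement": True,
        "log_device_placement": log_placement,
    }


def make_session():
    """Create a TF 1.x session with soft placement enabled."""
    # Imported here so the helper above can be used without TensorFlow.
    import tensorflow as tf

    config = tf.ConfigProto(**soft_placement_options())
    return tf.Session(config=config)
```

Passing log_placement=True is a convenient way to see which ops ended up on a different device than the one requested.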

@trainewbie
Author

trainewbie commented May 17, 2018

@godmoves Solved! Great!
I didn't see any error messages.
The model was created, and training on 4 GPUs works perfectly.

Thanks a lot. :-)

@godmoves
Contributor

Thanks for your feedback. I will open a PR for this 😄
