Is "multi-gpus training support" in next branch only for linux? #1437

trainewbie · 2018-05-16T15:22:36Z

My OS is Windows 7 (with Python 3.5, 3.6, or Anaconda; TensorFlow 1.4, CUDA 8.0, and cuDNN 6.0).
With the tf code from the master branch, I had no problem using net_to_model and training a network on a single GPU.

When I run the python net_to_model.py and python parse.py commands with the multi-GPU tf code, errors occur.
Am I missing something?

gcp · 2018-05-16T16:42:29Z

errors occur.

What errors occur?

trainewbie · 2018-05-16T18:09:16Z

This is the error message I get when running net_to_model:

J:\mtraining\tf>python net_to_model.py a.txt
Version 1
Channels 256
Blocks 20
2018-05-17 02:58:29.349474: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that th
is TensorFlow binary was not compiled to use: AVX AVX2
2018-05-17 02:58:29.861504: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.721
pciBusID: 0000:01:00.0
totalMemory: 11.00GiB freeMemory: 10.72GiB
2018-05-17 02:58:30.180522: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.721
pciBusID: 0000:02:00.0
totalMemory: 11.00GiB freeMemory: 10.73GiB
2018-05-17 02:58:30.196523: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1045] Device peer to peer matrix
2018-05-17 02:58:30.203523: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1051] DMA: 0 1
2018-05-17 02:58:30.210524: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1061] 0: Y N
2018-05-17 02:58:30.217524: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1061] 1: N Y
2018-05-17 02:58:30.225525: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/devic
e:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute cap
ability: 6.1)
2018-05-17 02:58:30.237525: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/devic
e:GPU:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute cap
ability: 6.1)
Traceback (most recent call last):
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\common_shapes.py", line 686, in _call_cpp_shape_fn_impl
input_tensors_as_shapes, status)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\errors_impl.py", line 473, in exit
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension size must be eve
nly divisible by 2 but is 1
Number of ways to split should evenly divide the split dimension for 'split' (op
: 'Split') with input shapes: [], [1,18,361] and with computed input tensors: input[0] =
<0>.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "net_to_model.py", line 27, in
tfprocess.init(batch_size=1)
File "J:\mtraining\tf\tfprocess.py", line 162, in init
self.init_net(planes, probs, winner)
File "J:\mtraining\tf\tfprocess.py", line 166, in init_net
self.sx = tf.split(planes, self.gpus_num)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\ops\array_ops.py", line 1265, in split
split_dim=axis, num_split=num_or_size_splits, value=value, name=name)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\ops\gen_array_ops.py", line 5093, in _split
name=name)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\ops.py", line 2958, in create_op
set_shapes_for_outputs(ret)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\ops.py", line 2209, in set_shapes_for_outputs
shapes = shape_func(op)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\ops.py", line 2159, in call_with_requiring
return call_cpp_shape_fn(op, require_shape_fn=True)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\common_shapes.py", line 627, in call_cpp_shape_fn
require_shape_fn)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\common_shapes.py", line 691, in _call_cpp_shape_fn_impl
raise ValueError(err.message)
ValueError: Dimension size must be evenly divisible by 2 but is 1
Number of ways to split should evenly divide the split dimension for 'split' (op
: 'Split') with input shapes: [], [1,18,361] and with computed input tensors: input[0] =
<0>.

J:\mtraining\tf>

godmoves · 2018-05-17T01:58:12Z

Number of ways to split should evenly divide the split dimension for 'split' (op
: 'Split') with input shapes: [], [1,18,361] and with computed input tensors: input[0] =
<0>.

This looks like an error in the split op; I will check it.

UPDATE: When using net_to_model, the GPU number should be set to 1 : )
I have fixed this in a new branch. @trainewbie, can you check whether it works?
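The traceback above shows tf.split(planes, self.gpus_num) being called on an input of shape [1,18,361], so a batch of 1 cannot be divided evenly across 2 GPUs. The following is a minimal pure-Python sketch of that divisibility rule, not the code from the branch; the helper name split_batch is hypothetical and used only for illustration.

```python
def split_batch(batch_size, gpus_num):
    """Return the per-GPU batch size, or raise if the batch cannot be split evenly.

    Hypothetical helper: it only mirrors the even-split rule that the
    traceback reports for tf.split; it does not call TensorFlow.
    """
    if gpus_num < 1:
        raise ValueError("gpus_num must be at least 1")
    if batch_size % gpus_num != 0:
        raise ValueError(
            "Dimension size must be evenly divisible by %d but is %d"
            % (gpus_num, batch_size))
    return batch_size // gpus_num


if __name__ == "__main__":
    # net_to_model uses batch_size=1, so only gpus_num=1 splits evenly.
    print(split_batch(1, 1))
    # A training batch of 256 (an assumed value) splits evenly across 4 GPUs.
    print(split_batch(256, 4))
```

With batch_size=1, as net_to_model uses, any gpus_num greater than 1 raises the error, which is consistent with the advice to set the GPU number to 1 for that script.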

trainewbie · 2018-05-17T03:25:08Z

Thanks @godmoves. I downloaded the tf code from your new branch and got the result below.
(Before running it, I set 256 filters, 20 blocks, and self.gpu_num=1;
for a second test, I set 256 filters, 20 blocks, and self.gpu_num=4.)

J:\mtrainingfix\tf>python net_to_model.py b.txt
Version 1
Channels 256
Blocks 20
2018-05-17 12:35:35.438078: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that th
is TensorFlow binary was not compiled to use: AVX AVX2
2018-05-17 12:35:35.903105: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.721
pciBusID: 0000:01:00.0
totalMemory: 11.00GiB freeMemory: 10.65GiB
2018-05-17 12:35:36.158120: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.721
pciBusID: 0000:02:00.0
totalMemory: 11.00GiB freeMemory: 10.73GiB
2018-05-17 12:35:36.455137: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.721
pciBusID: 0000:03:00.0
totalMemory: 11.00GiB freeMemory: 10.73GiB
2018-05-17 12:35:36.715152: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.721
pciBusID: 0000:04:00.0
totalMemory: 11.00GiB freeMemory: 10.73GiB
2018-05-17 12:35:36.728152: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1045] Device peer to peer matrix
2018-05-17 12:35:36.735153: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1051] DMA: 0 1 2 3
2018-05-17 12:35:36.742153: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1061] 0: Y N N N
2018-05-17 12:35:36.748153: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1061] 1: N Y N N
2018-05-17 12:35:36.754154: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1061] 2: N N Y N
2018-05-17 12:35:36.761154: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1061] 3: N N N Y
2018-05-17 12:35:36.767154: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/devic
e:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute cap
ability: 6.1)
2018-05-17 12:35:36.778155: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/devic
e:GPU:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute cap
ability: 6.1)
2018-05-17 12:35:36.790156: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/devic
e:GPU:2) -> (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute cap
ability: 6.1)
2018-05-17 12:35:36.801156: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\t
ensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/devic
e:GPU:3) -> (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute cap
ability: 6.1)
Traceback (most recent call last):
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\client\session.py", line 1323, in _do_call
return fn(*args)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\client\session.py", line 1293, in _run_fn
self._extend_graph()
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\client\session.py", line 1354, in _extend_graph
self._session, graph_def.SerializeToString(), status)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\errors_impl.py", line 473, in exit
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for
operation 'tower_0/get_regularization_penalty/l2_regularizer_45/L2Loss': Could not sati
sfy explicit device specification '/device:GPU:0' because no supported kernel for GPU de
vices is available.
[[Node: tower_0/get_regularization_penalty/l2_regularizer_45/L2Loss = L2LossT=
DT_FLOAT, _device="/device:GPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "net_to_model.py", line 27, in
tfprocess.init(batch_size=1, gpus_num=1)
File "J:\mtrainingfix\tf\tfprocess.py", line 164, in init
self.init_net(planes, probs, winner, gpus_num)
File "J:\mtrainingfix\tf\tfprocess.py", line 287, in init_net
self.session.run(tf.global_variables_initializer())
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\client\session.py", line 889, in run
run_metadata_ptr)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\client\session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\client\session.py", line 1317, in _do_run
options, run_metadata)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\client\session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for
operation 'tower_0/get_regularization_penalty/l2_regularizer_45/L2Loss': Could not sati
sfy explicit device specification '/device:GPU:0' because no supported kernel for GPU de
vices is available.
[[Node: tower_0/get_regularization_penalty/l2_regularizer_45/L2Loss = L2LossT=
DT_FLOAT, _device="/device:GPU:0"]]

Caused by op 'tower_0/get_regularization_penalty/l2_regularizer_45/L2Loss', defined at:
File "net_to_model.py", line 27, in
tfprocess.init(batch_size=1, gpus_num=1)
File "J:\mtrainingfix\tf\tfprocess.py", line 164, in init
self.init_net(planes, probs, winner, gpus_num)
File "J:\mtrainingfix\tf\tfprocess.py", line 191, in init_net
self.sx[i], self.sy_[i], self.sz_[i])
File "J:\mtrainingfix\tf\tfprocess.py", line 322, in tower_loss
tf.contrib.layers.apply_regularization(regularizer, reg_variables)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\contrib\layers\python\layers\regularizers.py", line 193, in apply_regularization
penalties = [regularizer(w) for w in weights_list]
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\contrib\layers\python\layers\regularizers.py", line 193, in
penalties = [regularizer(w) for w in weights_list]
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\contrib\layers\python\layers\regularizers.py", line 107, in l2
return standard_ops.multiply(my_scale, nn.l2_loss(weights), name=name)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\ops\gen_nn_ops.py", line 2586, in l2_loss
"L2Loss", t=t, name=name)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\ops.py", line 2956, in create_op
op_def=op_def)
File "C:\Users\home\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflo
w\python\framework\ops.py", line 1470, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'to
wer_0/get_regularization_penalty/l2_regularizer_45/L2Loss': Could not satisfy explicit d
evice specification '/device:GPU:0' because no supported kernel for GPU devices is avail
able.
[[Node: tower_0/get_regularization_penalty/l2_regularizer_45/L2Loss = L2LossT=
DT_FLOAT, _device="/device:GPU:0"]]

J:\mtrainingfix\tf>

The blue text above was rendered strangely. The original text is T=DT_FLOAT, _device="/device:GPU:0"](w_fc_3/read)]].

bjiyxo · 2018-05-17T05:51:51Z

I have heard that this can be solved by compiling TensorFlow yourself. You may want to try it.

trainewbie · 2018-05-17T06:19:02Z

I'll try compiling and rebuilding TF 1.4 as a .whl file.
I guess it will take 2 to 4 hours. OMG.

Thanks @bjiyxo

godmoves · 2018-05-17T08:18:47Z

@trainewbie This issue is related to how TensorFlow assigns variables and ops to devices, and soft placement may solve it. I have updated the code; can you check it?

EDIT: something similar here
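To illustrate what soft placement means here: the error above says the L2Loss op was pinned to '/device:GPU:0' but no GPU kernel was available for it. With soft placement enabled, such an op falls back to a device that does have a kernel instead of raising. The sketch below is a pure-Python simulation of that rule under stated assumptions, not TensorFlow's implementation and not necessarily what the updated branch does; place_op and its arguments are hypothetical names. In TensorFlow 1.x the corresponding switch is tf.ConfigProto(allow_soft_placement=True), passed to the session.

```python
def place_op(requested_device, kernel_device_types, allow_soft_placement=False):
    """Pick a device for an op, simulating hard vs. soft placement.

    requested_device: a device string such as "/device:GPU:0".
    kernel_device_types: set of device types ("CPU", "GPU") for which
        the op has a kernel (an illustrative stand-in, not a TF API).
    """
    # "/device:GPU:0" -> ["/device", "GPU", "0"]; the type is second to last.
    requested_type = requested_device.split(":")[-2].upper()
    if requested_type in kernel_device_types:
        return requested_device
    if allow_soft_placement and "CPU" in kernel_device_types:
        # Soft placement: fall back to the CPU instead of failing.
        return "/device:CPU:0"
    raise ValueError(
        "Cannot assign a device for operation: no supported kernel "
        "for %s devices is available." % requested_type)


if __name__ == "__main__":
    # Simulates an op that only has a CPU kernel but is pinned to GPU 0.
    print(place_op("/device:GPU:0", {"CPU"}, allow_soft_placement=True))
```

Without the flag, the same call raises, which matches the shape of the failure reported earlier in the thread.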

trainewbie · 2018-05-17T09:00:42Z

@godmoves Solved! Great!
I didn't see any error messages.
The model was created, and training on 4 GPUs works perfectly.

Thanks a lot. :-)

godmoves · 2018-05-17T09:15:17Z

Thanks for your feedback, I will open a PR for this 😄

godmoves mentioned this issue May 17, 2018

[Multi GPU] fix split and variable placement error on Windows #1443

Merged

trainewbie closed this as completed May 17, 2018
