Restarting kernel... while running in multi gpu mode! #95

Open
bhralzz opened this issue Aug 1, 2021 · 2 comments

@bhralzz

bhralzz commented Aug 1, 2021

Hi,
Dear all contributors,
I have tested multi-GPU mode and ran into an error:
Restarting kernel...
The only change was switching the multi-GPU flag to True. Training progresses for a few batches and then stops suddenly, whereas single-GPU mode works fine without any error.
The full log of the run is below:

Thanks for your valuable hints

debugfile('/media/hpds/harda/MISE/KITS/MIScnn-master_WV_400_04_31/KITS_TEST.py', wdir='/media/hpds/harda/MISE/KITS/MIScnn-master_WV_400_04_31')

/media/hpds/harda/MISE/KITS/MIScnn-master_WV_400_04_31/KITS_TEST.py(4)()
2 # Wavelet setting
3
----> 4 wv_l=3
5 wv_f='db1'
6

!continue
2021-08-01 06:51:45.454993: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
All samples: ['case_00001', 'case_00002', 'case_00003', 'case_00004', 'case_00005', 'case_00006', 'case_00007', 'case_00008', 'case_00009', 'case_00011', 'case_00012', 'case_00013', 'case_00014', 'case_00015', 'case_00016', 'case_00017', 'case_00018', 'case_00019', 'case_00021', 'case_00022', 'case_00023', 'case_00024', 'case_00025', 'case_00026', 'case_00027', 'case_00028', 'case_00029', 'case_00031', 'case_00032', 'case_00033', 'case_00034', 'case_00035', 'case_00036', 'case_00037', 'case_00038', 'case_00039', 'case_00041', 'case_00042', 'case_00043', 'case_00044', 'case_00045', 'case_00046', 'case_00047', 'case_00048', 'case_00049', 'case_00051', 'case_00052', 'case_00053', 'case_00054', 'case_00055', 'case_00056', 'case_00057', 'case_00058', 'case_00059', 'case_00061', 'case_00062', 'case_00063', 'case_00064', 'case_00065', 'case_00066', 'case_00067', 'case_00068', 'case_00069', 'case_00071', 'case_00072', 'case_00073', 'case_00074', 'case_00075', 'case_00076', 'case_00077', 'case_00078', 'case_00079', 'case_00081', 'case_00082', 'case_00083', 'case_00084', 'case_00085', 'case_00086', 'case_00087', 'case_00088', 'case_00089', 'case_00091', 'case_00092', 'case_00093', 'case_00094', 'case_00095', 'case_00096', 'case_00097', 'case_00098', 'case_00099', 'case_00101', 'case_00102', 'case_00103', 'case_00104', 'case_00105', 'case_00106', 'case_00107', 'case_00108', 'case_00109', 'case_00111', 'case_00112', 'case_00113', 'case_00114', 'case_00115', 'case_00116', 'case_00117', 'case_00118', 'case_00119', 'case_00121', 'case_00122', 'case_00123', 'case_00124', 'case_00125', 'case_00126', 'case_00127', 'case_00128', 'case_00129', 'case_00131', 'case_00132', 'case_00133', 'case_00134', 'case_00135', 'case_00136', 'case_00137', 'case_00138', 'case_00139', 'case_00141', 'case_00142', 'case_00143', 'case_00144', 'case_00145', 'case_00146', 'case_00147', 'case_00148', 'case_00149', 'case_00151', 'case_00152', 'case_00153', 'case_00154', 'case_00155', 'case_00156', 
'case_00157', 'case_00158', 'case_00159', 'case_00161', 'case_00162', 'case_00163', 'case_00164', 'case_00165', 'case_00166', 'case_00167', 'case_00168', 'case_00169', 'case_00171', 'case_00172', 'case_00173', 'case_00174', 'case_00175', 'case_00176', 'case_00177', 'case_00178', 'case_00179', 'case_00181', 'case_00182', 'case_00183', 'case_00184', 'case_00185', 'case_00186', 'case_00187', 'case_00188', 'case_00189', 'case_00191', 'case_00192', 'case_00193', 'case_00194', 'case_00195', 'case_00196', 'case_00197', 'case_00198', 'case_00199', 'case_00201', 'case_00202', 'case_00203', 'case_00204', 'case_00205', 'case_00206', 'case_00207', 'case_00208', 'case_00209']
WARNING:tensorflow:Collective ops is not configured at program startup. Some performance features may not be enabled.
2021-08-01 06:51:48.991371: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-08-01 06:51:49.595050: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:04:00.0 name: Quadro RTX 8000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 47.46GiB deviceMemoryBandwidth: 625.94GiB/s
2021-08-01 06:51:49.600184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties:
pciBusID: 0000:06:00.0 name: Quadro RTX 8000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 47.46GiB deviceMemoryBandwidth: 625.94GiB/s
2021-08-01 06:51:49.605247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 2 with properties:
pciBusID: 0000:07:00.0 name: Quadro RTX 8000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 47.46GiB deviceMemoryBandwidth: 625.94GiB/s
2021-08-01 06:51:49.609882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 3 with properties:
pciBusID: 0000:08:00.0 name: Quadro RTX 8000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 47.46GiB deviceMemoryBandwidth: 625.94GiB/s
2021-08-01 06:51:49.614662: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 4 with properties:
pciBusID: 0000:0c:00.0 name: Quadro RTX 8000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 47.46GiB deviceMemoryBandwidth: 625.94GiB/s
2021-08-01 06:51:49.619480: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 5 with properties:
pciBusID: 0000:0d:00.0 name: Quadro RTX 8000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 47.46GiB deviceMemoryBandwidth: 625.94GiB/s
2021-08-01 06:51:49.624138: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 6 with properties:
pciBusID: 0000:0e:00.0 name: Quadro RTX 8000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 47.46GiB deviceMemoryBandwidth: 625.94GiB/s
2021-08-01 06:51:49.628801: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 7 with properties:
pciBusID: 0000:0f:00.0 name: Quadro RTX 8000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 47.46GiB deviceMemoryBandwidth: 625.94GiB/s
2021-08-01 06:51:49.628832: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-01 06:51:49.632107: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-08-01 06:51:49.632158: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-08-01 06:51:49.633234: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-08-01 06:51:49.633475: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-08-01 06:51:49.634412: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-08-01 06:51:49.635249: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-08-01 06:51:49.635386: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-08-01 06:51:49.708710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
2021-08-01 06:51:49.710022: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-01 06:51:52.131807: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:04:00.0 name: Quadro RTX 8000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 47.46GiB deviceMemoryBandwidth: 625.94GiB/s
2021-08-01 06:51:52.134313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties:
pciBusID: 0000:06:00.0 name: Quadro RTX 8000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 47.46GiB deviceMemoryBandwidth: 625.94GiB/s
2021-08-01 06:51:52.136454: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 2 with properties:
pciBusID: 0000:07:00.0 name: Quadro RTX 8000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 47.46GiB deviceMemoryBandwidth: 625.94GiB/s
2021-08-01 06:51:52.138645: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 3 with properties:
pciBusID: 0000:08:00.0 name: Quadro RTX 8000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 47.46GiB deviceMemoryBandwidth: 625.94GiB/s
2021-08-01 06:51:52.141143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 4 with properties:
pciBusID: 0000:0c:00.0 name: Quadro RTX 8000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 47.46GiB deviceMemoryBandwidth: 625.94GiB/s
2021-08-01 06:51:52.143483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 5 with properties:
pciBusID: 0000:0d:00.0 name: Quadro RTX 8000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 47.46GiB deviceMemoryBandwidth: 625.94GiB/s
2021-08-01 06:51:52.148341: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 6 with properties:
pciBusID: 0000:0e:00.0 name: Quadro RTX 8000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 47.46GiB deviceMemoryBandwidth: 625.94GiB/s
2021-08-01 06:51:52.155488: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 7 with properties:
pciBusID: 0000:0f:00.0 name: Quadro RTX 8000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 47.46GiB deviceMemoryBandwidth: 625.94GiB/s
2021-08-01 06:51:52.193012: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
2021-08-01 06:51:52.193112: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-01 06:51:55.478323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-08-01 06:51:55.478373: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 1 2 3 4 5 6 7
2021-08-01 06:51:55.478382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N Y Y Y Y Y Y Y
2021-08-01 06:51:55.478387: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1: Y N Y Y Y Y Y Y
2021-08-01 06:51:55.478391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 2: Y Y N Y Y Y Y Y
2021-08-01 06:51:55.478396: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 3: Y Y Y N Y Y Y Y
2021-08-01 06:51:55.478401: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 4: Y Y Y Y N Y Y Y
2021-08-01 06:51:55.478405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 5: Y Y Y Y Y N Y Y
2021-08-01 06:51:55.478410: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 6: Y Y Y Y Y Y N Y
2021-08-01 06:51:55.478414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 7: Y Y Y Y Y Y Y N
2021-08-01 06:51:55.520887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 47220 MB memory) -> physical GPU (device: 0, name: Quadro RTX 8000, pci bus id: 0000:04:00.0, compute capability: 7.5)
2021-08-01 06:51:55.524191: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 47220 MB memory) -> physical GPU (device: 1, name: Quadro RTX 8000, pci bus id: 0000:06:00.0, compute capability: 7.5)
2021-08-01 06:51:55.527576: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 47220 MB memory) -> physical GPU (device: 2, name: Quadro RTX 8000, pci bus id: 0000:07:00.0, compute capability: 7.5)
2021-08-01 06:51:55.530847: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 47220 MB memory) -> physical GPU (device: 3, name: Quadro RTX 8000, pci bus id: 0000:08:00.0, compute capability: 7.5)
2021-08-01 06:51:55.534140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 47220 MB memory) -> physical GPU (device: 4, name: Quadro RTX 8000, pci bus id: 0000:0c:00.0, compute capability: 7.5)
2021-08-01 06:51:55.537678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:5 with 47220 MB memory) -> physical GPU (device: 5, name: Quadro RTX 8000, pci bus id: 0000:0d:00.0, compute capability: 7.5)
2021-08-01 06:51:55.540923: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:6 with 47220 MB memory) -> physical GPU (device: 6, name: Quadro RTX 8000, pci bus id: 0000:0e:00.0, compute capability: 7.5)
2021-08-01 06:51:55.544237: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:7 with 47220 MB memory) -> physical GPU (device: 7, name: Quadro RTX 8000, pci bus id: 0000:0f:00.0, compute capability: 7.5)
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3', '/job:localhost/replica:0/task:0/device:GPU:4', '/job:localhost/replica:0/task:0/device:GPU:5', '/job:localhost/replica:0/task:0/device:GPU:6', '/job:localhost/replica:0/task:0/device:GPU:7')
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
/root/anaconda3/envs/MSE1/lib/python3.8/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:374: UserWarning: The lr argument is deprecated, use learning_rate instead.
warnings.warn(
Validation samples: ['case_00001', 'case_00002', 'case_00003', 'case_00004', 'case_00005', 'case_00006', 'case_00007', 'case_00008', 'case_00009', 'case_00011', 'case_00012', 'case_00013', 'case_00014', 'case_00015', 'case_00016', 'case_00017', 'case_00018', 'case_00019', 'case_00021', 'case_00022', 'case_00023', 'case_00024', 'case_00025', 'case_00026', 'case_00027', 'case_00028', 'case_00029', 'case_00031', 'case_00032', 'case_00033', 'case_00034', 'case_00035', 'case_00036', 'case_00037', 'case_00038', 'case_00039', 'case_00041', 'case_00042', 'case_00043', 'case_00044', 'case_00045', 'case_00046', 'case_00047', 'case_00048', 'case_00049', 'case_00051', 'case_00052', 'case_00053', 'case_00054', 'case_00055', 'case_00056', 'case_00057', 'case_00058', 'case_00059', 'case_00061', 'case_00062', 'case_00063', 'case_00064', 'case_00065', 'case_00066', 'case_00067', 'case_00068', 'case_00069', 'case_00071', 'case_00072', 'case_00073', 'case_00074', 'case_00075', 'case_00076', 'case_00077', 'case_00078', 'case_00079', 'case_00081', 'case_00082', 'case_00083', 'case_00084', 'case_00085', 'case_00086', 'case_00087', 'case_00088', 'case_00089', 'case_00091', 'case_00092', 'case_00093', 'case_00094', 'case_00095', 'case_00096', 'case_00097', 'case_00098', 'case_00099', 'case_00101', 'case_00102', 'case_00103', 'case_00104', 'case_00105', 'case_00106', 'case_00107', 'case_00108', 'case_00109', 'case_00111', 'case_00112', 'case_00113', 'case_00114', 'case_00115', 'case_00116', 'case_00117', 'case_00118', 'case_00119', 'case_00121', 'case_00122', 'case_00123', 'case_00124', 'case_00125', 'case_00126', 'case_00127', 'case_00128', 'case_00129', 'case_00131', 'case_00132', 'case_00133', 'case_00134', 'case_00135', 'case_00136', 'case_00137', 'case_00138', 'case_00139', 'case_00141', 'case_00142', 'case_00143', 'case_00144', 'case_00145', 'case_00146', 'case_00147', 'case_00148', 'case_00149', 'case_00151', 'case_00152', 'case_00153', 'case_00154', 'case_00155', 'case_00156', 
'case_00157', 'case_00158', 'case_00159', 'case_00161', 'case_00162', 'case_00163', 'case_00164', 'case_00165', 'case_00166', 'case_00167', 'case_00168', 'case_00169', 'case_00171', 'case_00172', 'case_00173', 'case_00174', 'case_00175', 'case_00176', 'case_00177', 'case_00178', 'case_00179', 'case_00181', 'case_00182', 'case_00183', 'case_00184', 'case_00185', 'case_00186', 'case_00187', 'case_00188', 'case_00189', 'case_00191', 'case_00192', 'case_00193', 'case_00194', 'case_00195', 'case_00196', 'case_00197', 'case_00198', 'case_00199', 'case_00201', 'case_00202', 'case_00203', 'case_00204', 'case_00205', 'case_00206', 'case_00207', 'case_00208', 'case_00209']
2021-08-01 07:04:00.514661: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:695] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Did not find a shardable source, walked to a node which is not a dataset: name: "FlatMapDataset/_2"
op: "FlatMapDataset"
input: "TensorDataset/_1"
attr {
key: "Targuments"
value {
list {
}
}
}
attr {
key: "f"
value {
func {
name: "__inference_Dataset_flat_map_flat_map_fn_10297"
}
}
}
attr {
key: "output_shapes"
value {
list {
shape {
dim {
size: -1
}
dim {
size: -1
}
dim {
size: -1
}
dim {
size: -1
}
dim {
size: -1
}
}
shape {
dim {
size: -1
}
dim {
size: -1
}
dim {
size: -1
}
dim {
size: -1
}
dim {
size: -1
}
}
shape {
dim {
size: -1
}
dim {
size: -1
}
dim {
size: -1
}
dim {
size: -1
}
dim {
size: -1
}
}
}
}
}
attr {
key: "output_types"
value {
list {
type: DT_FLOAT
type: DT_FLOAT
type: DT_FLOAT
}
}
}
. Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new tf.data.Options() object then setting options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA before applying the options object to the dataset via dataset.with_options(options).
2021-08-01 07:04:00.661989: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-08-01 07:04:00.683221: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2199990000 Hz
Epoch 1/500
INFO:tensorflow:batch_all_reduce: 82 all-reduces with algorithm = hierarchical_copy, num_packs = 1
INFO:tensorflow:batch_all_reduce: 82 all-reduces with algorithm = hierarchical_copy, num_packs = 1
2021-08-01 07:06:32.153633: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-08-01 07:06:33.447127: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8101
2021-08-01 07:06:37.275470: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8101
2021-08-01 07:06:38.062356: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-08-01 07:06:38.426856: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8101
2021-08-01 07:06:40.052920: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8101
2021-08-01 07:06:40.475483: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-08-01 07:06:41.434251: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8101
2021-08-01 07:06:42.987919: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8101
2021-08-01 07:06:44.766192: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8101
2021-08-01 07:06:46.079553: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8101

3/100 [..............................] - ETA: 2:16 - loss: 2.4387 - dice_soft: 0.1871 - dice_crossentropy: 2.4058
Restarting kernel...
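
Note on the log above: it ends at "Restarting kernel..." without a Python traceback, which usually means the interpreter process itself died rather than raising an exception. One way to get more information is to run the script from a plain terminal with Python's standard `faulthandler` module enabled. The following is only a minimal sketch, not part of MIScnn or TensorFlow; the helper name `enable_crash_traces` and the file name `crash_trace.log` are made up for illustration. `faulthandler` reports fatal signals such as SIGSEGV or SIGABRT, so a process killed by the operating system for running out of host memory would still leave no trace.

```python
import faulthandler
import sys


def enable_crash_traces(log_path="crash_trace.log"):
    """Dump Python tracebacks to log_path if the process hits a fatal signal.

    log_path is a hypothetical file name chosen for this sketch. The file
    must stay open for as long as faulthandler is enabled.
    """
    log_file = open(log_path, "w")
    # all_threads=True also dumps the stacks of worker threads,
    # e.g. data-loading threads running next to the training loop.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    trace_file = enable_crash_traces()
    print("faulthandler enabled:", faulthandler.is_enabled(), file=sys.stderr)
    # Import tensorflow / miscnn and start the training below this point.
```

If the process crashes with a segmentation fault, `crash_trace.log` should then show which Python call was active at that moment, which is useful material for a bug report.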

@bhralzz
Author

bhralzz commented Aug 11, 2021

Any comments?

@muellerdo
Member

Hey @bhralzz,

MIScnn uses TensorFlow for its neural network operations, including TensorFlow's multi-GPU API.
Sadly, I'm unfamiliar with this error. However, TensorFlow has various open issues with multi-GPU support for Keras models, because the integration is still quite 'experimental'.

I would highly recommend creating a reproducible example of this behaviour (e.g. in a Jupyter Notebook) and sharing it as an issue on the TensorFlow GitHub project.
BUT: Please check their wiki beforehand on how to use the multi-GPU support correctly (missing drivers, incompatible hardware, etc.).
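
When building such a reproducible example, it may help to first narrow down whether the crash depends on how many (or which) GPUs are used. A minimal sketch, assuming the standard CUDA environment variable `CUDA_VISIBLE_DEVICES` is honoured on the machine; the helper name `limit_visible_gpus` is hypothetical and not part of MIScnn or TensorFlow:

```python
import os


def limit_visible_gpus(gpu_ids):
    """Expose only the given GPU indices to CUDA for this process.

    Hypothetical helper for this sketch. It has to be called before
    TensorFlow is imported, because the variable is read when CUDA
    is initialised.
    """
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


if __name__ == "__main__":
    # Try a pair of GPUs first instead of all eight, then vary the pair.
    limit_visible_gpus([0, 1])
    # Import tensorflow / miscnn and start the multi-GPU training after this.
```

If training is stable with two devices but not with eight, or only fails when a particular device is included, that detail would be worth adding to the TensorFlow issue.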

Cheers,
Dominik

@muellerdo muellerdo self-assigned this Sep 8, 2021
@muellerdo muellerdo added the wontfix This will not be worked on label Sep 8, 2021