
EOFError on ubuntu 20.04 #89

Open
dameng123 opened this issue Nov 17, 2021 · 4 comments

@dameng123

I'm sorry to disturb you, but I really need your help. When I run "python examples/train_battle.py --train", the following error occurs during training. The full output is as follows:

(marl) dzl112@dzl112:~/dameng/python/MAgent$ python examples/train_battle.py --train
/home/dzl112/anaconda3/envs/marl/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/dzl112/anaconda3/envs/marl/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/dzl112/anaconda3/envs/marl/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:521: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/dzl112/anaconda3/envs/marl/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:522: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/dzl112/anaconda3/envs/marl/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/dzl112/anaconda3/envs/marl/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
From /home/dzl112/dameng/python/MAgent/python/magent/builtin/tf_model/dqn.py:185: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
2021-11-16 18:29:04.489464: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
From /home/dzl112/dameng/python/MAgent/python/magent/builtin/tf_model/dqn.py:185: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
2021-11-16 18:29:04.942377: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Namespace(alg='dqn', eval=False, greedy=False, load_from=None, map_size=125, n_round=2000, name='battle', render=False, render_every=10, save_every=5, train=True)
view_space (13, 13, 7)
feature_space (34,)
===== sample =====
eps 1.00 number [625, 625]
step 0, nums: [625, 625] reward: [-26.53 -26.23], total_reward: [-26.53 -26.23]
step 50, nums: [625, 625] reward: [-26.63 -26.83], total_reward: [-1350.08 -1369.58]
step 100, nums: [624, 625] reward: [-27.42 -26.03], total_reward: [-2679.7 -2689.73]
step 150, nums: [624, 624] reward: [-27.22 -25.12], total_reward: [-4001.7 -4032.14]
step 200, nums: [619, 624] reward: [-25.1 -26.22], total_reward: [-5314.39 -5333.84]
step 250, nums: [616, 621] reward: [-25.78 -26.91], total_reward: [-6597.48 -6632.02]
step 300, nums: [613, 617] reward: [-24.37 -25.69], total_reward: [-7868.09 -7935.38]
step 350, nums: [612, 616] reward: [-26.86 -27.78], total_reward: [-9170.89 -9239.29]
step 400, nums: [610, 616] reward: [-25.95 -26.28], total_reward: [-10452.42 -10512.19]
step 450, nums: [609, 614] reward: [-26.35 -26.17], total_reward: [-11720.11 -11790.2 ]
step 500, nums: [605, 610] reward: [-25.03 -24.95], total_reward: [-12987.11 -13057.87]
step 550, nums: [602, 608] reward: [-24.61 -26.94], total_reward: [-14243.59 -14325.24]
steps: 551, total time: 36.23, step average 0.07
===== train =====
batch number: 6663 add: 341185 replay_len: 341185/1048576
batch number: 6625 add: 339220 replay_len: 339220/1048576
batch 0, loss 0.066977, eval 0.008397
batch 0, loss 0.298991, eval -0.234968
batch 1000, loss 0.024740, eval 0.163353
batch 1000, loss 0.009467, eval 0.127751
batch 2000, loss 0.007746, eval 0.121407
batch 2000, loss 0.004247, eval 0.178776
batch 3000, loss 0.001601, eval 0.168760
batch 3000, loss 0.001414, eval 0.217879
batch 4000, loss 0.000912, eval 0.184864
batch 4000, loss 0.001028, eval 0.248757
batch 5000, loss 0.000803, eval 0.196566
batch 5000, loss 0.000792, eval 0.247229
batch 6000, loss 0.000605, eval 0.193597
batch 6000, loss 0.000946, eval 0.254682
batches: 6625, total time: 648.04, 1k average: 97.82
batches: 6663, total time: 650.94, 1k average: 97.69
train_time 652.10
round 0 loss: [0.01, 0.01] num: [602, 608] reward: [-14243.59, -14325.24] value: [0.21, 0.3]
round time 688.33 total time 688.33

===== sample =====
eps 1.00 number [625, 625]
step 0, nums: [625, 625] reward: [-25.63 -26.43], total_reward: [-25.63 -26.43]
step 50, nums: [625, 625] reward: [-27.93 -27.03], total_reward: [-1362.88 -1357.58]
step 100, nums: [625, 625] reward: [-24.13 -26.33], total_reward: [-2701.53 -2703.03]
step 150, nums: [624, 625] reward: [-28.02 -25.33], total_reward: [-4026.69 -4034.38]
step 200, nums: [624, 624] reward: [-25.62 -24.32], total_reward: [-5343.19 -5351.61]
step 250, nums: [620, 624] reward: [-24.8 -24.52], total_reward: [-6660.58 -6645.71]
step 300, nums: [616, 623] reward: [-27.38 -25.52], total_reward: [-7943.51 -7941.5 ]
step 350, nums: [613, 622] reward: [-24.27 -27.21], total_reward: [-9234.7 -9234.86]
step 400, nums: [606, 619] reward: [-23.73 -25.1 ], total_reward: [-10486.65 -10484.31]
step 450, nums: [605, 618] reward: [-26.33 -26.19], total_reward: [-11764.21 -11796.6 ]
step 500, nums: [601, 617] reward: [-24.91 -25.49], total_reward: [-13033.18 -13079.04]
step 550, nums: [600, 614] reward: [-26.4 -25.87], total_reward: [-14280.83 -14365.36]
steps: 551, total time: 29.94, step average 0.05
===== train =====
batch number: 6623 add: 339111 replay_len: 678331/1048576
batch 0, loss 0.000958, eval 0.165303
batch number: 6691 add: 342622 replay_len: 683807/1048576
batch 0, loss 0.001485, eval 0.283710
batch 1000, loss 0.000656, eval 0.202931
batch 1000, loss 0.001262, eval 0.292073
batch 2000, loss 0.000821, eval 0.212229
batch 2000, loss 0.001034, eval 0.331045
batch 3000, loss 0.000863, eval 0.247228
batch 3000, loss 0.002247, eval 0.345279
batch 4000, loss 0.000870, eval 0.242257
batch 4000, loss 0.002150, eval 0.378697
batch 5000, loss 0.002529, eval 0.290189
batch 5000, loss 0.002757, eval 0.409260
batch 6000, loss 0.003087, eval 0.417159
batch 6000, loss 0.002246, eval 0.272336
batches: 6623, total time: 593.40, 1k average: 89.60
batches: 6691, total time: 597.11, 1k average: 89.24
train_time 599.24
round 1 loss: [0.0, 0.0] num: [600, 614] reward: [-14280.83, -14365.36] value: [0.29, 0.46]
round time 629.18 total time 1317.51

===== sample =====
eps 1.00 number [625, 625]
step 0, nums: [625, 625] reward: [-27.43 -26.23], total_reward: [-27.43 -26.23]
step 50, nums: [625, 625] reward: [-27.13 -25.93], total_reward: [-1362.18 -1367.68]
step 100, nums: [625, 625] reward: [-24.63 -25.53], total_reward: [-2680.03 -2691.03]
step 150, nums: [622, 623] reward: [-25.91 -26.82], total_reward: [-3977.77 -3987.96]
step 200, nums: [618, 621] reward: [-24.29 -26.71], total_reward: [-5281.69 -5281.74]
step 250, nums: [615, 618] reward: [-26.07 -21.89], total_reward: [-6559.37 -6563.23]
step 300, nums: [612, 612] reward: [-24.76 -24.56], total_reward: [-7824.65 -7863.79]
step 350, nums: [612, 612] reward: [-26.26 -27.66], total_reward: [-9131.35 -9150.69]
step 400, nums: [610, 609] reward: [-25.25 -26.85], total_reward: [-10392.15 -10409.55]
step 450, nums: [610, 607] reward: [-28.85 -22.94], total_reward: [-11669.25 -11684.32]
step 500, nums: [610, 605] reward: [-24.95 -24.23], total_reward: [-12936.55 -12956.71]
step 550, nums: [609, 604] reward: [-22.95 -24.92], total_reward: [-14211.54 -14222.43]
steps: 551, total time: 31.51, step average 0.06
===== train =====
batch number: 6630 add: 339483 replay_len: 1017814/1048576
batch 0, loss 0.004495, eval 0.320383
batch 1000, loss 0.004926, eval 0.337636
batch 2000, loss 0.004059, eval 0.395189
batch 3000, loss 0.007036, eval 0.424905
batch 4000, loss 0.007425, eval 0.469079
batch 5000, loss 0.005349, eval 0.484610
batch 6000, loss 0.006193, eval 0.494254
batches: 6630, total time: 355.08, 1k average: 53.56
Traceback (most recent call last):
File "examples/train_battle.py", line 221, in
eps=eps) # for e-greedy
File "examples/train_battle.py", line 124, in play_a_round
total_loss[i], value[i] = models[i].fetch_train()
File "/home/dzl112/dameng/python/MAgent/python/magent/model.py", line 238, in fetch_train
return self.conn.recv()
File "/home/dzl112/anaconda3/envs/marl/lib/python3.6/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/dzl112/anaconda3/envs/marl/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/dzl112/anaconda3/envs/marl/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError

@Kipsora
Collaborator

Kipsora commented Nov 17, 2021

While I have seen other people hit this issue before, I'm afraid we cannot provide much concrete help here because this project is not actively maintained and we are short of hands right now. From the traceback you posted, though, it looks like the process was reading from an empty pipe whose other end had already closed (refer here).
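
For context, here is a minimal, self-contained sketch (not MAgent code) of how Connection.recv() ends up raising EOFError when the peer process goes away without writing anything:

import multiprocessing

parent_conn, child_conn = multiprocessing.Pipe()
child_conn.close()          # e.g. the worker process exited before replying
try:
    parent_conn.recv()      # nothing buffered and the peer end is closed
except EOFError:
    print("EOFError: the other end closed the pipe without sending")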

A tentative workaround is to call wait() before recv(). You could try applying the following patch:

diff --git a/python/magent/model.py b/python/magent/model.py
index a60e793..2f7de98 100644
--- a/python/magent/model.py
+++ b/python/magent/model.py
@@ -208,6 +208,7 @@ class ProcessingModel(BaseModel):
         -------
         actions: numpy array (int32)
         """
+        multiprocessing.connection.wait(self.conn)
         info = self.conn.recv()
         return NDArrayPackage(info).recv_from(self.conn)[0]
 
@@ -235,6 +236,7 @@ class ProcessingModel(BaseModel):
         value: float
             mean state value
         """
+        multiprocessing.connection.wait(self.conn)
         return self.conn.recv()
 
     def save(self, save_dir, epoch, block=True):

But I have to add a disclaimer that this may not be the right solution. You could try it and post here if any problem remains, but please do not expect a prompt response since, again, this project is not actively maintained.

@dameng123
Author

Thank you very much for your reply. I tried the method above, but it produced the error "TypeError: 'Connection' object is not iterable", as follows:

(marl) dzl112@dzl112:~/dameng/python/MAgent$ python examples/train_battle.py --train
/home/dzl112/anaconda3/envs/marl/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/dzl112/anaconda3/envs/marl/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/dzl112/anaconda3/envs/marl/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:521: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/dzl112/anaconda3/envs/marl/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:522: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/dzl112/anaconda3/envs/marl/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/dzl112/anaconda3/envs/marl/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
From /home/dzl112/dameng/python/MAgent/python/magent/builtin/tf_model/dqn.py:185: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
2021-11-17 18:57:59.162981: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
From /home/dzl112/dameng/python/MAgent/python/magent/builtin/tf_model/dqn.py:185: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
2021-11-17 18:57:59.604186: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Namespace(alg='dqn', eval=False, greedy=False, load_from=None, map_size=125, n_round=2000, name='battle', render=False, render_every=10, save_every=5, train=True)
view_space (13, 13, 7)
feature_space (34,)
===== sample =====
eps 1.00 number [625, 625]
Traceback (most recent call last):
File "examples/train_battle.py", line 221, in
eps=eps) # for e-greedy
File "examples/train_battle.py", line 70, in play_a_round
acts[i] = models[i].fetch_action() # fetch actions (blocking)
File "/home/dzl112/dameng/python/MAgent/python/magent/model.py", line 211, in fetch_action
multiprocessing.connection.wait(self.conn)
File "/home/dzl112/anaconda3/envs/marl/lib/python3.6/multiprocessing/connection.py", line 904, in wait
for obj in object_list:
TypeError: 'Connection' object is not iterable


@Kipsora
Collaborator

Kipsora commented Nov 17, 2021

Could you try wrapping self.conn in a list, like multiprocessing.connection.wait([self.conn])?
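
For reference, multiprocessing.connection.wait() takes an iterable of connections, which is why the single Connection object has to be wrapped in a list. A minimal sketch (not MAgent code) of the difference:

import multiprocessing
from multiprocessing import connection

parent_conn, child_conn = multiprocessing.Pipe()
child_conn.send("hello")

# connection.wait(parent_conn)   # TypeError: 'Connection' object is not iterable
ready = connection.wait([parent_conn], timeout=5)   # a list of connections is expected
if ready:
    print(parent_conn.recv())    # -> "hello"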

@dameng123
Author

Thank you very much for your reply. I tried that change, but when the program reached round 880 the EOFError still appeared. After I reduced the size of the experience replay buffer, reduced the number of agents to 100, and set the number of rounds to 1000, everything ran normally. Earlier, when I ran "train_tiger.py", everything was also normal.
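
For anyone trying to reproduce the smaller configuration: the Namespace line in the log shows that map_size and n_round are command-line options of train_battle.py, and in this example script the initial number of agents scales with the map size, so a reduced run can be launched roughly like the line below (the exact values are only an illustration, not from this thread):

python examples/train_battle.py --train --map_size 50 --n_round 1000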
