Recurrent PPO #179

Open
4 tasks done
fede72bari opened this issue May 5, 2023 · 4 comments
Labels
bug Something isn't working more information needed Please fill the issue template completely

Comments

@fede72bari

🐛 Bug

I was running Recurrent PPO on CartPole in a Kaggle background notebook. After 6 hours the task crashed before finishing.

To Reproduce

It was a simple test on the CartPole environment. Here is the code:

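import os

import gym  # with SB3 >= 2.0, use "import gymnasium as gym" instead
import torch as th
from sb3_contrib import RecurrentPPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecMonitor, VecNormalize

# Create log dir
log_dir = "/tmp/gym13/"
os.makedirs(log_dir, exist_ok=True)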

#env = gym.make("CartPole-v1")
#env._max_episode_steps = 500000

env = DummyVecEnv([lambda: gym.make("CartPole-v1")])
# Automatically normalize the input features and reward
env = VecNormalize(env, norm_obs=True, norm_reward=True)  # optionally: clip_obs=10.0

# Logs will be saved in log_dir/monitor.csv
env = VecMonitor(env, log_dir)

total_steps = 2_000_000

# Logs will be saved in log_dir/monitor.csv
#env = Monitor(env, log_dir)

policy_kwargs = dict(activation_fn=th.nn.Mish, #ReLU,
                     net_arch=dict(pi=[64], vf=[64]))

model = RecurrentPPO(
    "MlpLstmPolicy",
    env,
    verbose=0,
    policy_kwargs=policy_kwargs,
    batch_size=128,
    learning_rate=0.0001,
    ent_coef=0,
    # tensorboard_log="/ppo_cartpole_tensorboard/",
)
model.learn(total_timesteps=total_steps, progress_bar=True)

Relevant log output / Error message

18562.3s	585	Traceback (most recent call last):
18562.3s	586	  File "/opt/conda/lib/python3.10/site-packages/nbclient/client.py", line 762, in _async_poll_output_msg
18562.3s	587	    msg = await ensure_async(self.kc.iopub_channel.get_msg(timeout=None))
18562.3s	588	  File "/opt/conda/lib/python3.10/site-packages/nbclient/util.py", line 96, in ensure_async
18562.3s	589	    result = await obj
18562.3s	590	  File "/opt/conda/lib/python3.10/site-packages/jupyter_client/channels.py", line 310, in get_msg
18562.3s	591	    ready = await self.socket.poll(timeout)
18562.3s	592	asyncio.exceptions.CancelledError
18562.3s	593	
18562.3s	594	During handling of the above exception, another exception occurred:
18562.3s	595	
18562.3s	596	Traceback (most recent call last):
18562.3s	597	  File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
18562.3s	598	    return fut.result()
18562.3s	599	asyncio.exceptions.CancelledError
18562.3s	600	
18562.3s	601	The above exception was the direct cause of the following exception:
18562.3s	602	
18562.3s	603	Traceback (most recent call last):
18562.3s	604	  File "/opt/conda/lib/python3.10/site-packages/nbclient/client.py", line 735, in _async_poll_for_reply
18562.3s	605	    await asyncio.wait_for(task_poll_output_msg, self.iopub_timeout)
18562.3s	606	  File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
18562.3s	607	    raise exceptions.TimeoutError() from exc
18562.3s	608	asyncio.exceptions.TimeoutError
18562.3s	609	
18562.3s	610	During handling of the above exception, another exception occurred:
18562.3s	611	
18562.3s	612	Traceback (most recent call last):
18562.3s	613	  File "<string>", line 1, in <module>
18562.3s	614	  File "/opt/conda/lib/python3.10/site-packages/papermill/execute.py", line 113, in execute_notebook
18562.3s	615	    nb = papermill_engines.execute_notebook_with_engine(
18562.3s	616	  File "/opt/conda/lib/python3.10/site-packages/papermill/engines.py", line 49, in execute_notebook_with_engine
18562.3s	617	    return self.get_engine(engine_name).execute_notebook(nb, kernel_name, **kwargs)
18562.3s	618	  File "/opt/conda/lib/python3.10/site-packages/papermill/engines.py", line 367, in execute_notebook
18562.3s	619	    cls.execute_managed_notebook(nb_man, kernel_name, log_output=log_output, **kwargs)
18562.3s	620	  File "/opt/conda/lib/python3.10/site-packages/papermill/engines.py", line 436, in execute_managed_notebook
18562.3s	621	    return PapermillNotebookClient(nb_man, **final_kwargs).execute()
18562.3s	622	  File "/opt/conda/lib/python3.10/site-packages/papermill/clientwrap.py", line 45, in execute
18562.3s	623	    self.papermill_execute_cells()
18562.3s	624	  File "/opt/conda/lib/python3.10/site-packages/papermill/clientwrap.py", line 72, in papermill_execute_cells
18562.3s	625	    self.execute_cell(cell, index)
18562.3s	626	  File "/opt/conda/lib/python3.10/site-packages/nbclient/util.py", line 84, in wrapped
18562.3s	627	    return just_run(coro(*args, **kwargs))
18562.3s	628	  File "/opt/conda/lib/python3.10/site-packages/nbclient/util.py", line 62, in just_run
18562.3s	629	    return loop.run_until_complete(coro)
18562.3s	630	  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
18562.3s	631	    return future.result()
18562.3s	632	  File "/opt/conda/lib/python3.10/site-packages/nbclient/client.py", line 949, in async_execute_cell
18562.3s	633	    exec_reply = await self.task_poll_for_reply
18562.3s	634	  File "/opt/conda/lib/python3.10/site-packages/nbclient/client.py", line 739, in _async_poll_for_reply
18562.3s	635	    raise CellTimeoutError.error_from_timeout_and_cell(
18562.3s	636	nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 4 seconds.
18562.3s	637	The message was: Timeout waiting for IOPub output.
18562.3s	638	Here is a preview of the cell contents:
18562.3s	639	-------------------
18562.3s	640	['# Create log dir', 'log_dir = "/tmp/gym13/"', 'os.makedirs(log_dir, exist_ok=True)', '', '#env = gym.make("CartPole-v1")']
18562.3s	641	...
18562.3s	642	['#     # VecEnv resets automatically', '#     # if done:', '#     #   obs = env.reset()', '', '# env.close()']
18562.3s	643	-------------------
18562.3s	644	
18564.5s	645	/opt/conda/lib/python3.10/site-packages/traitlets/traitlets.py:2930: FutureWarning: --Exporter.preprocessors=["remove_papermill_header.RemovePapermillHeader"] for containers is deprecated in traitlets 5.0. You can pass `--Exporter.preprocessors item` ... multiple times to add items to a list.
18564.5s	646	  warn(
18564.5s	647	[NbConvertApp] WARNING | Config option `kernel_spec_manager_class` not recognized by `NbConvertApp`.
18564.5s	648	[NbConvertApp] Converting notebook __notebook__.ipynb to notebook
18564.9s	649	[NbConvertApp] Writing 87181 bytes to __notebook__.ipynb
18566.7s	650	/opt/conda/lib/python3.10/site-packages/traitlets/traitlets.py:2930: FutureWarning: --Exporter.preprocessors=["nbconvert.preprocessors.ExtractOutputPreprocessor"] for containers is deprecated in traitlets 5.0. You can pass `--Exporter.preprocessors item` ... multiple times to add items to a list.
18566.7s	651	  warn(
18566.7s	652	[NbConvertApp] WARNING | Config option `kernel_spec_manager_class` not recognized by `NbConvertApp`.
18566.7s	653	[NbConvertApp] Converting notebook __notebook__.ipynb to html
18567.7s	654	[NbConvertApp] Writing 365517 bytes to __results__.html
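To read the excerpt above more quickly, here is a minimal sketch that pulls out the last timestamp and the exception names. It assumes each line has the shape `<seconds>s<TAB><line number><TAB><message>`, which is inferred from this excerpt only and is not a documented Kaggle format. `summarize_log` and the regular expressions are hypothetical helpers, not part of any library.

```python
import re

# Assumed line shape, inferred from the excerpt above:
# "<seconds>s<TAB><line number><TAB><message>"
LOG_LINE = re.compile(r"^(?P<seconds>\d+(?:\.\d+)?)s\t(?P<lineno>\d+)\t(?P<message>.*)$")
# A message that starts at column 0 with a dotted name ending in "Error".
EXCEPTION = re.compile(r"^(?P<name>[A-Za-z_][\w.]*Error)\b")


def summarize_log(text):
    """Return (elapsed_hours, exception_names) for a log excerpt in the assumed shape."""
    last_seconds = 0.0
    names = []
    for raw in text.splitlines():
        match = LOG_LINE.match(raw)
        if match is None:
            continue  # skip lines that do not follow the assumed shape
        last_seconds = max(last_seconds, float(match["seconds"]))
        exc = EXCEPTION.match(match["message"])
        if exc is not None:
            names.append(exc["name"])
    return last_seconds / 3600.0, names


# A few lines copied from the log above (messages shortened).
sample = (
    "18562.3s\t592\tasyncio.exceptions.CancelledError\n"
    "18562.3s\t608\tasyncio.exceptions.TimeoutError\n"
    "18562.3s\t635\tnbclient.exceptions.CellTimeoutError: A cell timed out\n"
    "18567.7s\t654\t[NbConvertApp] Writing 365517 bytes to __results__.html\n"
)
hours, names = summarize_log(sample)
print(f"{hours:.2f} h", names)
```

If the leading number is seconds since the notebook started, the last timestamp corresponds to a bit more than 5 hours of wall-clock time, and every exception in the chain comes from `asyncio` or `nbclient`.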

System Info

No response

Checklist

  • I have checked that there is no similar issue in the repo
  • I have read the documentation
  • I have provided a minimal working example to reproduce the bug
  • I've used the markdown code blocks for both code and stack traces.
@fede72bari fede72bari added the bug Something isn't working label May 5, 2023
@araffin araffin added the more information needed Please fill the issue template completely label May 5, 2023
@araffin
Member

araffin commented May 5, 2023

Hello,
the traceback is not complete.
I suspect the problem comes from the Kaggle notebook; there is probably a timeout.

@fede72bari
Author

I copied just the pertinent part of the log; the part I left out above covers the first few seconds, when some modules were installed. There is nothing in between. The strange thing is that I have run much longer scripts in the background on Kaggle, up to 10-11 hours, without any timeout.
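Since the log shows nothing between module installation and the crash, one way to get more context on a later run is to write a periodic heartbeat line to a file. The sketch below uses only the standard library. `Heartbeat` is a hypothetical helper and the file path and interval are placeholders. It only shows that the Python process was still alive at a given time, not that training was making progress.

```python
import os
import tempfile
import threading
import time


class Heartbeat:
    """Append an elapsed-time line to a file at a fixed interval (hypothetical helper)."""

    def __init__(self, path, interval_s=60.0):
        self.path = path
        self.interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._start_time = None

    def _run(self):
        # Event.wait returns False on timeout and True once stop has been requested.
        while not self._stop.wait(self.interval_s):
            elapsed = time.monotonic() - self._start_time
            with open(self.path, "a") as fh:
                fh.write(f"alive after {elapsed:.1f}s\n")

    def __enter__(self):
        self._start_time = time.monotonic()
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._stop.set()
        self._thread.join()
        return False  # never swallow exceptions raised inside the with-block


# Short demonstration with a tiny interval; time.sleep stands in for the long-running call.
demo_path = os.path.join(tempfile.mkdtemp(), "heartbeat.log")
with Heartbeat(demo_path, interval_s=0.01):
    time.sleep(0.2)

with open(demo_path) as fh:
    lines = fh.read().splitlines()
print(len(lines), "heartbeat lines written")
```

In the notebook from this issue, the `with` block would wrap the `model.learn(...)` call, with a longer interval such as 60 seconds. The last heartbeat line then gives a lower bound on how long the process stayed alive.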

@araffin
Member

araffin commented May 5, 2023

> I copied just the pertinent part of the log

The traceback doesn't say anything about why the process was terminated, and nothing in it relates to SB3; it just contains a mix of timeout and cancelled errors.
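The mix of cancelled and timeout errors is what `asyncio.wait_for` produces whenever an awaited operation exceeds its timeout, whatever the operation is. The traceback above shows the same pattern at `asyncio/tasks.py`, line 458 (`raise exceptions.TimeoutError() from exc`). A minimal standard-library sketch follows. `slow_operation` and `poll_with_timeout` are hypothetical names, and the sleep stands in for a message that never arrives in time.

```python
import asyncio


async def slow_operation():
    # Stand-in for an output message that does not arrive in time (hypothetical).
    await asyncio.sleep(10)


async def poll_with_timeout(timeout):
    try:
        await asyncio.wait_for(slow_operation(), timeout)
    except asyncio.TimeoutError as exc:
        # wait_for cancels the inner task on timeout; the resulting CancelledError
        # is usually attached as the cause of the TimeoutError.
        cause = type(exc.__cause__).__name__ if exc.__cause__ is not None else "no explicit cause"
        return f"timed out (cause: {cause})"
    return "finished"


result = asyncio.run(poll_with_timeout(0.01))
print(result)
```

This reproduces the shape of the first two tracebacks without any reinforcement learning code. That fits the point above: the excerpt shows a notebook client timing out, not where or why the training process stopped.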

@fede72bari
Author

fede72bari commented May 5, 2023

Exactly, the only information available is contextual: in other, longer runs I have not experienced a similar timeout. Let's see whether others encounter the same problem with Kaggle background notebooks and SB3 Recurrent PPO.
