Environment hangs up when spawning from different processes #75

GraphicsHunter · 2019-03-18T02:28:53Z

Hello,

I'm trying to run the Obstacle Tower environment through the large-scale-curiosity project. However, it seems to hang when it tries to create the environment from its subprocesses. It prints that the CrashReporter is initialized, along with the mono config paths, then does nothing for a while and hangs with the following image:

This is on a 13-inch 2018 MacBook Pro without a GPU. Is there any way to troubleshoot and debug this? I can't really do anything, as nothing is logged from inside the environment.

awjuliani · 2019-03-19T17:07:24Z

Hi @tianfanzhu

Can you confirm that you are running the latest version of Obstacle Tower (v1.2)? Also, does it work when using the basic usage python notebook we provide as an example?

GraphicsHunter · 2019-03-19T21:00:04Z

Hi @awjuliani ,

I am indeed running this on the latest version, v1.2. Also, I found out from the basic usage notebook that the screen is gray, as shown above, until the env is reset or stepped.

binoalien · 2019-03-21T09:47:48Z

I have the same problem.
iMac end 2015
osx 10.14.3

NancyFulda · 2019-03-30T16:08:50Z

Me too. Except I'm running this via the Unity Obstacle Tower Challenge run.py script, and at startup I see the game character appear and then fall off the blank screen into nothingness. After that, empty gray screen.

iMac running 10.14.4

NancyFulda · 2019-03-30T16:10:29Z

However, when I click on the obstacletower.app file directly, it runs flawlessly.

harperj · 2019-04-01T19:08:25Z

Hi all, it may be difficult to tell whether everyone is experiencing the same issue. A couple of important things to note:

When running multiple environments, the worker_id value (in the environment constructor) must be set to a different integer value for each environment. This is because the gym wrapper and the Unity executable communicate with one another via gRPC over a particular port and each reserves that port.
If you are running run.py in evaluation mode, the script must be launched before the environment executable.
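
A minimal sketch of the first point, assuming the Python package is obstacle_tower_env and the build lives at ./ObstacleTower/obstacletower (both may differ on your machine). The helper names make_envs and obstacle_tower_factory are hypothetical; the only thing taken from the note above is that every instance gets its own worker_id.

```python
from typing import Callable, List

# Assumed location of the environment build; adjust for your setup.
ENV_PATH = "./ObstacleTower/obstacletower"


def make_envs(num_envs: int, env_factory: Callable[[int], object]) -> List[object]:
    """Create num_envs environments, giving each one a distinct worker_id."""
    return [env_factory(worker_id) for worker_id in range(num_envs)]


def obstacle_tower_factory(worker_id: int):
    """Build one Obstacle Tower instance (hypothetical helper)."""
    # Imported lazily so make_envs can be exercised without the package installed.
    from obstacle_tower_env import ObstacleTowerEnv

    return ObstacleTowerEnv(ENV_PATH, worker_id=worker_id)


# Usage (launches real Unity processes):
#   envs = make_envs(4, obstacle_tower_factory)
```

When each environment is created in its own subprocess, the same idea applies: pass the subprocess rank (or another unique integer) as worker_id rather than relying on the default.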

So with that in mind:
@tianfanzhu can you confirm whether your environment construction sets a different worker_id value for each environment?

@NancyFulda are you running in evaluation mode or just directly running the run.py script? If you're directly running the script, could you look for a file called UnitySDK.log in the same folder as the ObstacleTower.app file and share the contents?

NancyFulda · 2019-04-01T19:41:51Z

Hi @harperj, thanks for looking into this!

I'm directly executing run.py. Interestingly, the behavior this morning is different than it was on Saturday (maybe I rebooted in between??) I still see grayness, but the game character does not appear anymore. However, the run.py script no longer hangs, but instead prints out the reward for each episode.

Is this the expected behavior? It would be nice to be able to watch the agent's character navigate the world (to see where it's messing up), but since the environment seems to be executing at faster-than-real-time speed, maybe the grayed out screen is normal?

The UnitySDK.log contents are as follows:

4/1/2019 1:35:59 PM

Log
Academy resetting

Log
Seed: 52

Log
Seed: 47

Log
Academy resetting

Log
Seed: 26

Log
Seed: 91

Log
Academy resetting

Log
Seed: 65

Log
Seed: 17

Log
Academy resetting

Log
Seed: 44

Log
You reached floor: 1

Log
Seed: 64

Log
Academy resetting

Log
Seed: 34

Log
Seed: 58

Log
Academy resetting

Log
Seed: 85

harperj · 2019-04-01T20:04:58Z

@NancyFulda This is the expected behavior. When training, the camera isn't turned on in order to improve performance. You can see the camera by turning on realtime mode in the environment (realtime_mode=True in the constructor).
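
As a rough illustration of that flag, the sketch below toggles realtime_mode depending on whether you want to watch the agent. The make_env helper and its watch parameter are hypothetical; realtime_mode and worker_id are the constructor arguments mentioned in this thread, and the build path in the usage comment is an assumption.

```python
def make_env(env_cls, env_path, worker_id=0, watch=False):
    """Construct an environment, enabling realtime mode only when watching.

    env_cls is expected to accept (path, worker_id=..., realtime_mode=...),
    as the Obstacle Tower gym wrapper does according to this thread.
    """
    # Realtime mode turns the camera on but slows things down,
    # so it stays off for ordinary training runs.
    return env_cls(env_path, worker_id=worker_id, realtime_mode=watch)


# Usage (assumes the obstacle_tower_env package and a local build):
#   from obstacle_tower_env import ObstacleTowerEnv
#   env = make_env(ObstacleTowerEnv, "./ObstacleTower/obstacletower", watch=True)
```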

NancyFulda · 2019-04-01T20:31:38Z

@harperj Ah, that worked perfectly! Everything seems to be in order now. Thank you!

stevenh-tw · 2019-05-07T13:54:03Z

Hi @harperj @awjuliani I also encountered the same issue:

I tried to use ML-Agents 0.8.1 by simply setting options['--env'] = 'ObstacleTower/ObstalceTower'
and options['--num-envs'] = 2.

After launching 2 envs, one env had the agent just spawning and falling down, while the other was 'not responding'. CPU and GPU usage for the env with the falling agent was very high.

This issue occurs on my Windows machine (Windows 10), but the same settings work without problems on my Mac. I've also checked that I'm using ObstacleTower-v1.3.

Here's the reference video [https://youtu.be/u-J7mlwlmr0]

Sohojoe · 2019-05-09T16:34:00Z

I was able to get large-scale-curiosity + Obstacle Challenge working up to about 32 agents

make sure worker_id is unique for each instance
timeout_wait=6000
add a sleep(2) between creating each instance (i.e. 2 seconds)
some worker_id values may clash with Windows - for me, I needed to add if rank >= 35: rank += 1
I copied the render module from OpenAI.Gym to visualize training (realtime_mode=True slows down training)
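
The launch-related items in the list above could be combined roughly as follows. This is a sketch, not the code from that setup: launch_staggered and rank_to_worker_id are hypothetical names, and env_factory stands in for whatever constructs the environment (for example a wrapper around the Obstacle Tower constructor that forwards worker_id and timeout_wait). If the default ML-Agents base port of 5005 applies, worker_id 35 would correspond to port 5040, which might explain the clash, but that is an assumption and the rank to skip is likely machine-specific.

```python
import time
from typing import Callable, List

SKIPPED_RANK = 35      # value from the list above; may differ per machine
TIMEOUT_WAIT = 6000    # generous startup timeout, as in the list above
LAUNCH_DELAY_S = 2.0   # pause between launches


def rank_to_worker_id(rank: int) -> int:
    """Shift ranks at or above SKIPPED_RANK up by one so that id is never used."""
    return rank + 1 if rank >= SKIPPED_RANK else rank


def launch_staggered(
    num_envs: int,
    env_factory: Callable[[int, int], object],
    sleep: Callable[[float], None] = time.sleep,
) -> List[object]:
    """Create environments one at a time, pausing between launches."""
    envs = []
    for rank in range(num_envs):
        if rank > 0:
            sleep(LAUNCH_DELAY_S)  # give the previous instance time to start
        envs.append(env_factory(rank_to_worker_id(rank), TIMEOUT_WAIT))
    return envs


# Usage (assumes obstacle_tower_env and a local build):
#   from obstacle_tower_env import ObstacleTowerEnv
#   envs = launch_staggered(
#       8,
#       lambda wid, timeout: ObstacleTowerEnv(
#           "./ObstacleTower/obstacletower", worker_id=wid, timeout_wait=timeout
#       ),
#   )
```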

@karta1297963 what you see in your video is what happens when the Unity environment does not sync with Python. Even with everything I did above, I still see this 1 in 5 times when starting off a run (even with different code bases)

harperj · 2019-05-09T17:58:02Z

Like @Sohojoe said, this looks like an issue with the connection between Python and Obstacle Tower / Unity. It could be that the port is in use for something else, that the worker_id is not being set correctly, or that the environment takes longer than the timeout_wait to start up. You could potentially have your script fail gracefully and re-launch on timeout as well, or try a new worker_id if you have a reserved port that conflicts.
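
One way to sketch the "fail gracefully and try a new worker_id" idea is a small retry wrapper. Everything here is illustrative: create_with_retry is a hypothetical helper, env_factory is whatever builds your environment from a worker_id, and the exception types to retry on are left to the caller because they depend on the installed ML-Agents version.

```python
from typing import Callable, Tuple, Type


def create_with_retry(
    env_factory: Callable[[int], object],
    first_worker_id: int = 0,
    max_attempts: int = 3,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
):
    """Try to create an environment, moving to the next worker_id on failure.

    Returns (env, worker_id) for the first attempt that succeeds.
    Raises RuntimeError if every attempt fails.
    """
    last_error = None
    for attempt in range(max_attempts):
        worker_id = first_worker_id + attempt
        try:
            return env_factory(worker_id), worker_id
        except retry_on as err:
            # A timeout or a port conflict ends up here; try the next worker_id.
            last_error = err
    raise RuntimeError(
        f"could not start an environment after {max_attempts} attempts"
    ) from last_error
```

When several environments run side by side, first_worker_id should be chosen so that the retry range of one environment does not overlap with the ids used by another.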

stevenh-tw · 2019-05-10T04:29:16Z

@Sohojoe @harperj thanks for helping,
I tried the solution @Sohojoe mentioned, but it didn't work. Later I tried to cross-validate the compatibility between mlagent-env v0.8 and a Unity instance built with mlagent v0.6 (like Obstacle Tower).

I built 2 instances of the default mlagent task, Pyramids, with SDK v0.6 and v0.8 respectively. It turns out the v0.6 one has the same sync issue, while the v0.8 instance doesn't.
Then I compared the git history; it seems v0.8 has the ability to customize the gRPC communication message. I guess that's the reason Python and Unity don't sync (but somehow the issue doesn't occur with only 1 environment).

I guess the possible solutions are:

Wait for ObstacleTower to update to mlagent v0.8
Use mlagent-env v0.6 and somehow make it work with the mlagent v0.8 SubprocessUnityEnvironment

Sohojoe · 2019-05-22T17:09:46Z

@karta1297963 - what platform / OS are you using?

stevenh-tw · 2019-05-22T17:20:31Z

@Sohojoe I'm using Windows 10.
I currently have a workaround using the SubprocVecEnv class from OpenAI baselines, and it works! However, it seems this approach cannot have the step function return both visual and vector observations at the same time.
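
A sketch of what such a workaround might look like, assuming OpenAI baselines is installed and exposes SubprocVecEnv at the import path used below. make_env_fns and build_vec_env are hypothetical helpers, and env_factory is any callable that builds one environment from a worker_id; this is not the exact code used in the workaround above.

```python
from typing import Callable, List


def make_env_fns(
    num_envs: int, env_factory: Callable[[int], object]
) -> List[Callable[[], object]]:
    """Build one zero-argument constructor per environment."""

    def make_thunk(worker_id: int) -> Callable[[], object]:
        # Bind worker_id now; a bare lambda written in the loop would see
        # only the last value of the loop variable.
        return lambda: env_factory(worker_id)

    return [make_thunk(worker_id) for worker_id in range(num_envs)]


def build_vec_env(num_envs: int, env_factory: Callable[[int], object]):
    """Wrap the constructors in a SubprocVecEnv (one process per environment)."""
    # Lazy import: only needed when the vectorized env is actually built.
    from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv

    return SubprocVecEnv(make_env_fns(num_envs, env_factory))
```

SubprocVecEnv calls each constructor inside its own worker process, so each Unity instance is created in the process that will step it.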

Sohojoe · 2019-05-24T17:20:45Z

@karta1297963 - I created a simple repro that spawns many instances as an example of how I do it - https://github.com/Sohojoe/many_towers

awjuliani self-assigned this Mar 19, 2019

awjuliani added the help wanted Extra attention is needed label Mar 19, 2019

harperj self-assigned this Apr 1, 2019
