
Use multiprocess to speed up training and playing. #82

Open · wants to merge 8 commits into base: master

Conversation

@gigayaya (Contributor) commented Sep 5, 2018

I made some changes to pit.py.
In the current version, playing many games (such as 200 games) takes a lot of time.
So I use multiprocessing to improve the throughput of pit.py.
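For readers who just want the gist, the pattern is roughly the following (a minimal sketch, not the exact code in this PR; play_single_game is a hypothetical stand-in for building the game, the two players, and an Arena):

```python
import multiprocessing as mp

def play_single_game(game_idx):
    # In the real pit.py this would build the game, both players and an
    # Arena, call arena.playGame(), and return which player won.
    return game_idx % 2  # placeholder result

def parallel_play(num_games, num_workers=4):
    # Queue every game with apply_async, then collect results with .get(),
    # similar in spirit to the ParallelPlay() function this PR adds.
    with mp.Pool(processes=num_workers) as pool:
        handles = [pool.apply_async(play_single_game, (i,)) for i in range(num_games)]
        return [h.get() for h in handles]

if __name__ == "__main__":
    print(parallel_play(num_games=8))
```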

@JernejHabjan (Contributor)

Hi gigayaya,

I am receiving an unresolved KeyError exception when running your pit.py.
I downloaded your repo, uncommented "ParallelPlay(g)" in pit.py and pressed run.
I am using the following packages with Python 3.6 64-bit:
(screenshot of installed packages)

And the error:

Traceback (most recent call last):
  File "C:/Users/Jernej/Documents/GitHub/alpha-zero-general_multiPit/pit.py", line 111, in <module>
    ParallelPlay(g)
  File "C:/Users/Jernej/Documents/GitHub/alpha-zero-general_multiPit/pit.py", line 100, in ParallelPlay
    result.append(i.get())
  File "C:\Program Files\Python36\lib\multiprocessing\pool.py", line 644, in get
    raise self._value
  File "C:\Program Files\Python36\lib\multiprocessing\pool.py", line 424, in _handle_tasks
    put(task)
  File "C:\Program Files\Python36\lib\multiprocessing\connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "C:\Program Files\Python36\lib\multiprocessing\reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "C:\Users\Jernej\Documents\GitHub\alpha-zero-general_multiPit\utils.py", line 3, in __getattr__
    return self[name]
KeyError: '__getstate__'

I hope you resolve this error.
Have a nice day!

@gigayaya (Contributor, Author)

Hi @JernejHabjan,

Thank you for finding this bug.
I pushed a new commit to fix it.
Can you try again? Thanks!
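For anyone else who hits the same traceback: pickling probes the object for special attributes such as __getstate__, and the dotdict in utils.py answers every attribute lookup with self[name], so the probe surfaces as a KeyError instead of the AttributeError pickle expects. A common way to make such a class pickle-friendly (a sketch of the general fix, not necessarily what this commit does) is:

```python
class dotdict(dict):
    """A dict with attribute access that also pickles cleanly."""
    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            # pickle probes for attributes like __getstate__; raising
            # AttributeError lets it fall back to the default dict
            # pickling instead of crashing with a KeyError.
            raise AttributeError(name)
```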

@JernejHabjan (Contributor)

Hi @gigayaya

Wow, thanks for the quick reply!
Yes, I tried again and it works great.

Thanks for this patch, good job!

@51616 commented Nov 18, 2018

Is there any chance I can use this during training?

@gigayaya (Contributor, Author) commented Nov 19, 2018

Is there any chance I can use this during training?

Yes, I have a multiprocess version that speeds up training.
Maybe I can make a new PR for it after this one.

@51616 commented Nov 19, 2018

Cool! @gigayaya That would be amazing, since I think self-play is the bottleneck of this training loop.
How much faster is it if you do self-play in parallel? Is it just 2 times faster if I have a 2-core CPU?

@gigayaya (Contributor, Author) commented Nov 19, 2018

@51616
It is not possible to get a 2x speedup with only 2 cores.
Here are some records of self-play time with my multiprocess version.

Game: 6x6 Othello
numEps: 128
numMCTSSims: 2000
19-layer ResNet
AMD Threadripper 1950X
MSI 1080 Ti

Time to self-play 128 games:
16 cores: 192 min
32 cores: 108 min

I did not record the time with 1 process (the original version), because it would take too much time.
The speedup of my multiprocess version is not very high, so I have not made a PR for it yet.
But if this is good enough for you, I can make one.

@51616 commented Nov 19, 2018

I would love to see the implementation, of course. @gigayaya
You can just commit to this PR and I can read the code just fine. 👍
Also, have you heard about Ray RLlib? https://ray.readthedocs.io/en/latest/
Is it a better solution for this kind of self-play, or is it essentially the same as what you are doing?
Thank you for your response.

@gigayaya (Contributor, Author) commented Nov 21, 2018

@51616

I committed the multiprocess training version.
It takes 2 minutes 20 seconds to train 6x6 Othello with 25 simulations on 4 cores, while the original version takes 3 minutes 20 seconds.
You can use this version to speed up training.
This change is based on multiprocessing, whereas Ray RLlib is a distributed execution framework.
They are two different ways to speed up the process, but you can use them together.
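Just to illustrate the difference: the same per-game work could be expressed as Ray remote tasks instead of local processes (a rough sketch with a hypothetical play_one_episode function; this PR itself uses Python's multiprocessing module):

```python
import ray

ray.init()  # starts local workers; can also connect to a cluster

@ray.remote
def play_one_episode(seed):
    # Would build the game, model and MCTS inside the worker,
    # play one self-play game, and return its training examples.
    return []  # placeholder

futures = [play_one_episode.remote(i) for i in range(128)]
episodes = ray.get(futures)  # blocks until all remote games finish
```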

@gigayaya changed the title from "Use multiprocess play many games at pit.py" to "Use multiprocess to speed up training and playing." on Nov 21, 2018
@51616 commented Nov 21, 2018

Thanks a lot, @gigayaya. I will be back on my RL project after my final exam in 2 weeks.
See you then :)

@51616 commented Dec 12, 2018

@gigayaya As I understand it, your code parallelizes game play but not MCTS itself, right?
Is it possible to do both, or is it limited by the hardware setup (needing more CPUs or GPUs to do both)?
Also, I would like to know the CPU and GPU utilization during your training process: is the hardware fully utilized or not?

@gigayaya (Contributor, Author) commented Dec 12, 2018

@51616
Yes, I play many games in parallel during self-play, not inside MCTS, because it is very difficult to implement multi-threaded MCTS.
In my case, I use a ResNet as my NN and it costs about 35 MB of VRAM per model (19 blocks, 32 channels, 8x8 Othello).
My CPU is an AMD Threadripper 1950X with 16 cores and 32 threads, so I create 32 processes during self-play, and each process costs about 1.3 GB of RAM.

(screenshot: Task Manager during self-play)

@51616 commented Dec 12, 2018

@gigayaya Thanks a lot! I really appreciate your help and explanation. :)

@51616 commented Jan 8, 2019

@gigayaya Can I ask why it has to create a new model for each simulation during self-play?

Use ResNet as default NN because cost less VRAM then CNN.
@gigayaya (Contributor, Author) commented Jan 9, 2019

@51616

That's my bad. Ideally, it should create many self-play processes and one model process.
When a self-play process needs to predict the policy and value during the search, it should ask the model process to do it.
But after many tries, I could not implement this structure, so I create a new model in every self-play process.
Maybe someone knows how to do it, but for now this is the best I can do.
This structure causes a problem: it can run out of memory, because many models are kept in VRAM at the same time.
So I changed the default NN to a ResNet. It costs less VRAM than the CNN and fixes this problem.
I hope this answers your question.
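For anyone who wants to try that structure, the usual shape is one prediction-server process fed by a request queue, with each self-play worker sending boards and waiting for the reply (a rough sketch only; a real version would batch requests and run the actual network instead of the placeholder predict below):

```python
import multiprocessing as mp

def model_server(request_q, reply_qs):
    # Placeholder for loading the real network once, in this single process.
    def predict(board):
        return [1.0 / 36] * 36, 0.0   # dummy uniform policy and neutral value

    while True:
        msg = request_q.get()
        if msg is None:               # sentinel: shut the server down
            break
        worker_id, board = msg
        reply_qs[worker_id].put(predict(board))

def self_play_worker(worker_id, request_q, reply_q):
    # Inside MCTS, instead of calling the model directly, the worker would:
    request_q.put((worker_id, "canonical board placeholder"))
    pi, v = reply_q.get()

if __name__ == "__main__":
    n_workers = 2
    request_q = mp.Queue()
    reply_qs = [mp.Queue() for _ in range(n_workers)]
    server = mp.Process(target=model_server, args=(request_q, reply_qs))
    server.start()
    workers = [mp.Process(target=self_play_worker, args=(i, request_q, reply_qs[i]))
               for i in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    request_q.put(None)               # stop the server
    server.join()
```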

@51616
Copy link

51616 commented Jan 11, 2019

@gigayaya I can implement many self-play processes with one model process using pytorch multiprocessing instead of the default library. But sometimes when I resume (load) from the latest model it doesn't use the GPU at all.
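For reference, one common torch.multiprocessing pattern for sharing a single set of weights across self-play workers (a sketch only, not necessarily the exact setup described above) is to call share_memory() on the model before spawning the workers; GPU models need extra care. The tiny nn.Linear here is just a stand-in for the real policy/value net:

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp

def self_play_worker(rank, model):
    # Each worker sees the same shared-memory CPU weights as the main process.
    with torch.no_grad():
        dummy_board = torch.zeros(1, 36)
        out = model(dummy_board)
    print(f"worker {rank}: output shape {tuple(out.shape)}")

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    model = nn.Linear(36, 37)   # stand-in for the real network
    model.share_memory()        # move parameters to shared memory (CPU)
    procs = [mp.Process(target=self_play_worker, args=(i, model)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```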

@im0qianqian

Wow, I have had similar needs recently, thank you for your code.
I found that you create a process for each self-play game in the Coach and maintain them with a pool of size numSelfPlayPool. The neural network is then initialized in every AsyncSelfPlay, and I think this initialization is very time consuming (if my thoughts are wrong, please tell me).
I have an idea: we could create only numSelfPlayPool (the number of CPU cores) processes, and then use the same model within each process to complete numEps/numSelfPlayPool self-play games.
In my opinion, you only need to change the AsyncSelfPlay function.
Thank you!

@51616 commented May 6, 2019

@im0qianqian Normally a Python subprocess needs its own memory for each process, so the model will be copied even if you initialize only a single model in the main process. This is a very inefficient implementation of multiprocessing; doing it properly is tricky and requires a lower-level implementation of the parallelization.

@im0qianqian

@51616 Thank you for your reply; I think you misunderstood what I meant. I mean we can create a small number of processes (like the number of CPU cores), perform the initialization once in each process, and then simulate multiple self-play games within each process.

@51616 commented May 6, 2019

@im0qianqian I see. If you don't want to create a model each time a new game is started, you can modify the code pretty easily, since there's nothing complex there. But as I said, this implementation is still not efficient, although at least it makes use of CPU multiprocessing.

@im0qianqian

@51616 Thanks a lot! I really appreciate your explanation.

@gigayaya (Contributor, Author) commented May 8, 2019

Hi @51616 and @im0qianqian, thanks for your replies.
I committed a new version of my multiprocess code.

Now each process plays many games instead of one, so there is no need to load a NN into VRAM every time a new process is created (see the sketch below).
Use numPerProcessSelfPlay to decide how many games one process plays during the self-play phase. The total number of self-play games is therefore:

Total self-play games = numSelfPlayProcess * numPerProcessSelfPlay

Likewise, use numPerProcessAgainst to decide how many games one process plays during the against-play phase:

Total against games = numAgainstPlayProcess * numPerProcessAgainst
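A rough sketch of this layout (hypothetical helper names, not the PR's exact code): each worker process builds its network once and then plays numPerProcessSelfPlay games, so the NN is loaded into VRAM only numSelfPlayProcess times in total.

```python
import multiprocessing as mp

def self_play_worker(worker_id, num_games_per_process):
    # nnet = build_and_load_net()   # hypothetical: done once per process
    examples = []
    for _ in range(num_games_per_process):
        # examples += execute_one_episode(nnet)   # hypothetical per-game call
        pass
    return examples

if __name__ == "__main__":
    numSelfPlayProcess = 4
    numPerProcessSelfPlay = 32   # total games = numSelfPlayProcess * numPerProcessSelfPlay
    with mp.Pool(numSelfPlayProcess) as pool:
        chunks = pool.starmap(self_play_worker,
                              [(i, numPerProcessSelfPlay) for i in range(numSelfPlayProcess)])
    all_examples = [ex for chunk in chunks for ex in chunk]
    print(f"collected {len(all_examples)} examples from "
          f"{numSelfPlayProcess * numPerProcessSelfPlay} games")
```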

If we want a very efficient way to speed up the AlphaZero approach, we should parallelize MCTS itself.
But I do not have an efficient way to implement that, so I wrote this simpler version to speed up my training.
I hope this code helps you. Cheers :)

@im0qianqian

@gigayaya Thank you for your code.
Because I have been working on a similar project recently, I have also added multi-process optimization to my project. I tried it in Google Colab and found that, for me, two processes are probably the best choice.
In this project, if temp=0 and we call mcts.getActionProb(canonicalBoard, temp=0), we get the move the search currently considers best for this position. If we call mcts.getActionProb(canonicalBoard, temp=0) again, it continues building on the Nsa, Qsa, Ps, Ns values obtained last time, so the results may differ between calls.
We know that if temp=0, there is no randomness in the result.
If you feed a fixed input into an existing static network, the output of the network should be the same. This really comes down to how we parallelize MCTS.
However, if you use multiple processes in test self-play and set temp=0 (we think the temp value should be 0 when testing), then each process creates a separate MCTS. Assuming we have 4 processes, the final output of these four processes may be exactly the same, because we have not parallelized MCTS; we have just done exactly the same thing in each process.
So, for now, setting numAgainstPlayProcess = 1 is probably the best choice (because multiple processes are doing exactly the same thing, they do not contribute anything to the final result).
Of course, this is just my understanding. If I have made a mistake, please let me know. Thank you.

@gigayaya (Contributor, Author) commented May 16, 2019

@im0qianqian I think this is alpha-zero-general's problem.
Right now the result of MCTS has no randomness, so MCTS will always make the same move during play if we always create a new process for it.
But in DeepMind's paper, they randomly rotate the board before sending it into the NN to predict the value and policy.
This PR looks like it tries to fix that issue: 0193e71
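For the record, the rotation idea looks roughly like this (an illustrative sketch, not the code in 0193e71; it assumes the network's policy can be reshaped to an n x n grid, whereas the real project flattens it and appends a pass action, so the index bookkeeping there is slightly more involved):

```python
import numpy as np

def predict_with_random_symmetry(predict, board):
    """predict(board) -> (policy_grid, value); policy_grid has shape (n, n)."""
    k = np.random.randint(4)        # rotate by 0, 90, 180 or 270 degrees
    flip = np.random.rand() < 0.5   # optionally reflect as well
    b = np.rot90(board, k)
    if flip:
        b = np.fliplr(b)
    pi, v = predict(b)
    if flip:                        # undo the reflection on the policy grid
        pi = np.fliplr(pi)
    return np.rot90(pi, -k), v      # undo the rotation on the policy grid

# Tiny usage example with a dummy network:
dummy_net = lambda b: (np.full_like(b, 1.0 / b.size, dtype=float), 0.0)
print(predict_with_random_symmetry(dummy_net, np.zeros((6, 6))))
```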

Hope this can help you. :)

@gms2009 commented Mar 22, 2020

I tried this with 16 processes each playing 4 games. The first process plays like normal, but the second process returns after one getNextState in playGame for the first 4 games, throws them away, and then plays 4 normal games. The third process throws away the first 8 games... so the pwin/nwin totals come out as something other than a multiple of 16.
