
Use multiprocess to speed up training and playing. #82

Open · wants to merge 8 commits into base: master

Conversation

@gigayaya (Contributor) commented Sep 5, 2018

I made some changes to pit.py.
In the current version, playing many games (such as 200 games) takes a lot of time.
So I use multiprocessing to improve the throughput of pit.py.
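For readers who just want the gist, the pattern is roughly the following (a minimal sketch, not the exact code in this PR; play_single_game is a hypothetical stand-in for building the game, the two players, and an Arena):

```python
import multiprocessing as mp

def play_single_game(game_idx):
    # In the real pit.py this would build the game, both players and an
    # Arena, call arena.playGame(), and return which player won.
    return game_idx % 2  # placeholder result

def parallel_play(num_games, num_workers=4):
    # Queue every game with apply_async, then collect results with .get(),
    # similar in spirit to the ParallelPlay() function this PR adds.
    with mp.Pool(processes=num_workers) as pool:
        handles = [pool.apply_async(play_single_game, (i,)) for i in range(num_games)]
        return [h.get() for h in handles]

if __name__ == "__main__":
    print(parallel_play(num_games=8))
```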

@JernejHabjan (Contributor)

Hi gigayaya,

I am receiving an unresolved KeyError exception when running your pit.py.
I downloaded your repo, uncommented "ParallelPlay(g)" in pit.py and pressed run.
I am using the following packages with Python 3.6 64-bit:
(screenshot of installed packages)

And the error:

Traceback (most recent call last):
  File "C:/Users/Jernej/Documents/GitHub/alpha-zero-general_multiPit/pit.py", line 111, in <module>
    ParallelPlay(g)
  File "C:/Users/Jernej/Documents/GitHub/alpha-zero-general_multiPit/pit.py", line 100, in ParallelPlay
    result.append(i.get())
  File "C:\Program Files\Python36\lib\multiprocessing\pool.py", line 644, in get
    raise self._value
  File "C:\Program Files\Python36\lib\multiprocessing\pool.py", line 424, in _handle_tasks
    put(task)
  File "C:\Program Files\Python36\lib\multiprocessing\connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "C:\Program Files\Python36\lib\multiprocessing\reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "C:\Users\Jernej\Documents\GitHub\alpha-zero-general_multiPit\utils.py", line 3, in __getattr__
    return self[name]
KeyError: '__getstate__'

I hope you resolve this error.
Have a nice day!

@gigayaya (Contributor, Author)

Hi @JernejHabjan,

Thank you for finding this bug.
I pushed a new commit to fix it.
Can you try again? Thanks!
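For anyone else who hits the same traceback: pickling probes the object for special attributes such as __getstate__, and the dotdict in utils.py answers every attribute lookup with self[name], so the probe surfaces as a KeyError instead of the AttributeError pickle expects. A common way to make such a class pickle-friendly (a sketch of the general fix, not necessarily what this commit does) is:

```python
class dotdict(dict):
    """A dict with attribute access that also pickles cleanly."""
    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            # pickle probes for attributes like __getstate__; raising
            # AttributeError lets it fall back to the default dict
            # pickling instead of crashing with a KeyError.
            raise AttributeError(name)
```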

@JernejHabjan (Contributor)

Hi @gigayaya

Wow, thanks for the quick reply!
Yes, I tried again and it works great.

Thanks for this patch, good job!

@51616 commented Nov 18, 2018

Is there any chance I can use this during training?

@gigayaya (Contributor, Author) commented Nov 19, 2018

Is there any chance I can use this during training?

Yes, I have a multiprocess version that speeds up training.
Maybe I can make a new PR for it after this one.

@51616 commented Nov 19, 2018

Cool! @gigayaya That would be amazing, since I think self-play is the bottleneck of this training loop.
How much faster is it if you do self-play in parallel? Is it just 2 times faster if I have a 2-core CPU?

@gigayaya (Contributor, Author) commented Nov 19, 2018

@51616
It is not possible to get a 2x speedup with only 2 cores.
Here are some records of self-play time with my multiprocess version.

Game: 6x6 Othello
numEps: 128
numMCTSSims: 2000
19-layer ResNet
AMD Threadripper 1950X
MSI 1080 Ti

Time to self-play 128 games:
16 cores: 192 min
32 cores: 108 min

I did not record the time with 1 process (the original version), because it would take too much time.
The speedup of my multiprocess version is not very high, so I have not made a PR for it yet.
But if this is good enough for you, I can make one.

@51616 commented Nov 19, 2018

I would love to see the implementation, of course. @gigayaya
You can just commit to this PR and I can read the code just fine. 👍
Also, have you heard about Ray RLlib? https://ray.readthedocs.io/en/latest/
Is it a better solution for this kind of self-play, or is it essentially the same as what you are doing?
Thank you for your response.

@gigayaya (Contributor, Author) commented Nov 21, 2018

@51616

I committed the multiprocess training version.
It takes 2 minutes 20 seconds to train 6x6 Othello with 25 simulations on 4 cores, while the original version takes 3 minutes 20 seconds.
You can use this version to speed up training.
This change is based on multiprocessing, whereas Ray RLlib is a distributed execution framework.
They are two different ways to speed up the process, but you can use them together.
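Just to illustrate the difference: the same per-game work could be expressed as Ray remote tasks instead of local processes (a rough sketch with a hypothetical play_one_episode function; this PR itself uses Python's multiprocessing module):

```python
import ray

ray.init()  # starts local workers; can also connect to a cluster

@ray.remote
def play_one_episode(seed):
    # Would build the game, model and MCTS inside the worker,
    # play one self-play game, and return its training examples.
    return []  # placeholder

futures = [play_one_episode.remote(i) for i in range(128)]
episodes = ray.get(futures)  # blocks until all remote games finish
```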

@gigayaya changed the title from "Use multiprocess play many games at pit.py" to "Use multiprocess to speed up training and playing." on Nov 21, 2018
@51616 commented Nov 21, 2018

Thanks a lot, @gigayaya. I will be back on my RL project after my final exam in 2 weeks.
See you then :)

@51616 commented Dec 12, 2018

@gigayaya As I understand it, your code parallelizes game play but not MCTS itself, right?
Is it possible to do both, or is it limited by the hardware setup (needing more CPUs or GPUs to do both)?
Also, I would like to know the CPU and GPU utilization during your training process: is the hardware fully utilized or not?

@gigayaya (Contributor, Author) commented Dec 12, 2018

@51616
Yes, I play many games in parallel during self-play, not inside MCTS, because it is very difficult to implement multi-threaded MCTS.
In my case, I use a ResNet as my NN and it costs about 35 MB of VRAM per model (19 blocks, 32 channels, 8x8 Othello).
My CPU is an AMD Threadripper 1950X with 16 cores and 32 threads, so I create 32 processes during self-play, and each process costs about 1.3 GB of RAM.

(screenshot: Task Manager during self-play)

@51616 commented Dec 12, 2018

@gigayaya Thanks a lot! I really appreciate your help and explanation. :)

@51616 commented Jan 8, 2019

@gigayaya Can I ask why it has to create a new model for each simulation during self-play?

Use ResNet as default NN because cost less VRAM then CNN.
@gigayaya (Contributor, Author) commented Jan 9, 2019

@51616

That's my bad. Ideally, it should create many self-play processes and one model process.
When a self-play process needs to predict the policy and value during the search, it should ask the model process to do it.
But after many tries, I could not implement this structure, so I create a new model in every self-play process.
Maybe someone knows how to do it, but for now this is the best I can do.
This structure causes a problem: it can run out of memory, because many models are kept in VRAM at the same time.
So I changed the default NN to a ResNet. It costs less VRAM than the CNN and fixes this problem.
I hope this answers your question.
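For anyone who wants to try that structure, the usual shape is one prediction-server process fed by a request queue, with each self-play worker sending boards and waiting for the reply (a rough sketch only; a real version would batch requests and run the actual network instead of the placeholder predict below):

```python
import multiprocessing as mp

def model_server(request_q, reply_qs):
    # Placeholder for loading the real network once, in this single process.
    def predict(board):
        return [1.0 / 36] * 36, 0.0   # dummy uniform policy and neutral value

    while True:
        msg = request_q.get()
        if msg is None:               # sentinel: shut the server down
            break
        worker_id, board = msg
        reply_qs[worker_id].put(predict(board))

def self_play_worker(worker_id, request_q, reply_q):
    # Inside MCTS, instead of calling the model directly, the worker would:
    request_q.put((worker_id, "canonical board placeholder"))
    pi, v = reply_q.get()

if __name__ == "__main__":
    n_workers = 2
    request_q = mp.Queue()
    reply_qs = [mp.Queue() for _ in range(n_workers)]
    server = mp.Process(target=model_server, args=(request_q, reply_qs))
    server.start()
    workers = [mp.Process(target=self_play_worker, args=(i, request_q, reply_qs[i]))
               for i in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    request_q.put(None)               # stop the server
    server.join()
```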

@51616
Copy link

51616 commented Jan 11, 2019

@gigayaya I can implement many self-play processes with one model process using pytorch multiprocessing instead of the default library. But sometimes when I resume (load) from the latest model it doesn't use the GPU at all.
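For reference, one common torch.multiprocessing pattern for sharing a single set of weights across self-play workers (a sketch only, not necessarily the exact setup described above) is to call share_memory() on the model before spawning the workers; GPU models need extra care. The tiny nn.Linear here is just a stand-in for the real policy/value net:

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp

def self_play_worker(rank, model):
    # Each worker sees the same shared-memory CPU weights as the main process.
    with torch.no_grad():
        dummy_board = torch.zeros(1, 36)
        out = model(dummy_board)
    print(f"worker {rank}: output shape {tuple(out.shape)}")

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    model = nn.Linear(36, 37)   # stand-in for the real network
    model.share_memory()        # move parameters to shared memory (CPU)
    procs = [mp.Process(target=self_play_worker, args=(i, model)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```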

@im0qianqian

Wow, I have had similar needs recently, thank you for your code.
I found that you create a process for each self-play game in the Coach and maintain them with a pool of size numSelfPlayPool. The neural network is then initialized in every AsyncSelfPlay, and I think this initialization is very time consuming (if my thoughts are wrong, please tell me).
I have an idea: we could create only numSelfPlayPool (the number of CPU cores) processes, and then use the same model within each process to complete numEps/numSelfPlayPool self-play games.
In my opinion, you only need to change the AsyncSelfPlay function.
Thank you!

@51616 commented May 6, 2019

@im0qianqian Normally a Python subprocess needs its own memory for each process, so the model will be copied even if you initialize only a single model in the main process. This is a very inefficient implementation of multiprocessing; doing it properly is tricky and requires a lower-level implementation of the parallelization.

@im0qianqian

@51616 Thank you for your reply; I think you misunderstood what I meant. I mean we can create a small number of processes (like the number of CPU cores), perform the initialization once in each process, and then simulate multiple self-play games within each process.

@51616 commented May 6, 2019

@im0qianqian I see. If you don't want to create a model each time a new game is started, you can modify the code pretty easily, since there's nothing complex there. But as I said, this implementation is still not efficient, although at least it makes use of CPU multiprocessing.

@im0qianqian

@51616 Thanks a lot! I really appreciate your explanation.

@gigayaya (Contributor, Author) commented May 8, 2019

Hi @51616 and @im0qianqian, thanks for your replies.
I committed a new version of my multiprocess code.

Now each process plays many games instead of one, so there is no need to load a NN into VRAM every time a new process is created (see the sketch below).
Use numPerProcessSelfPlay to decide how many games one process plays during the self-play phase. The total number of self-play games is therefore:

Total self-play games = numSelfPlayProcess * numPerProcessSelfPlay

Likewise, use numPerProcessAgainst to decide how many games one process plays during the against-play phase:

Total against games = numAgainstPlayProcess * numPerProcessAgainst
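A rough sketch of this layout (hypothetical helper names, not the PR's exact code): each worker process builds its network once and then plays numPerProcessSelfPlay games, so the NN is loaded into VRAM only numSelfPlayProcess times in total.

```python
import multiprocessing as mp

def self_play_worker(worker_id, num_games_per_process):
    # nnet = build_and_load_net()   # hypothetical: done once per process
    examples = []
    for _ in range(num_games_per_process):
        # examples += execute_one_episode(nnet)   # hypothetical per-game call
        pass
    return examples

if __name__ == "__main__":
    numSelfPlayProcess = 4
    numPerProcessSelfPlay = 32   # total games = numSelfPlayProcess * numPerProcessSelfPlay
    with mp.Pool(numSelfPlayProcess) as pool:
        chunks = pool.starmap(self_play_worker,
                              [(i, numPerProcessSelfPlay) for i in range(numSelfPlayProcess)])
    all_examples = [ex for chunk in chunks for ex in chunk]
    print(f"collected {len(all_examples)} examples from "
          f"{numSelfPlayProcess * numPerProcessSelfPlay} games")
```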

If we want a very efficient way to speed up the AlphaZero approach, we should parallelize MCTS itself.
But I do not have an efficient way to implement that, so I wrote this simpler version to speed up my training.
I hope this code helps you. Cheers :)

@im0qianqian

@gigayaya Thank you for your code.
Because I have been working on a similar project recently, I have also added multi-process optimization to my project. I tried it in Google Colab and found that, for me, two processes are probably the best choice.
In this project, if temp=0 and we call mcts.getActionProb(canonicalBoard, temp=0), we get the move the search currently considers best for this position. If we call mcts.getActionProb(canonicalBoard, temp=0) again, it continues building on the Nsa, Qsa, Ps, Ns values obtained last time, so the results may differ between calls.
We know that if temp=0, there is no randomness in the result.
If you feed a fixed input into an existing static network, the output of the network should be the same. This really comes down to how we parallelize MCTS.
However, if you use multiple processes in test self-play and set temp=0 (we think the temp value should be 0 when testing), then each process creates a separate MCTS. Assuming we have 4 processes, the final output of these four processes may be exactly the same, because we have not parallelized MCTS; we have just done exactly the same thing in each process.
So, for now, setting numAgainstPlayProcess = 1 is probably the best choice (because multiple processes are doing exactly the same thing, they do not contribute anything to the final result).
Of course, this is just my understanding. If I have made a mistake, please let me know. Thank you.

@gigayaya (Contributor, Author) commented May 16, 2019

@im0qianqian I think this is alpha-zero-general's problem.
Right now the result of MCTS has no randomness, so MCTS will always make the same move during play if we always create a new process for it.
But in DeepMind's paper, they randomly rotate the board before sending it into the NN to predict the value and policy.
This PR looks like it tries to fix that issue: 0193e71
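For the record, the rotation idea looks roughly like this (an illustrative sketch, not the code in 0193e71; it assumes the network's policy can be reshaped to an n x n grid, whereas the real project flattens it and appends a pass action, so the index bookkeeping there is slightly more involved):

```python
import numpy as np

def predict_with_random_symmetry(predict, board):
    """predict(board) -> (policy_grid, value); policy_grid has shape (n, n)."""
    k = np.random.randint(4)        # rotate by 0, 90, 180 or 270 degrees
    flip = np.random.rand() < 0.5   # optionally reflect as well
    b = np.rot90(board, k)
    if flip:
        b = np.fliplr(b)
    pi, v = predict(b)
    if flip:                        # undo the reflection on the policy grid
        pi = np.fliplr(pi)
    return np.rot90(pi, -k), v      # undo the rotation on the policy grid

# Tiny usage example with a dummy network:
dummy_net = lambda b: (np.full_like(b, 1.0 / b.size, dtype=float), 0.0)
print(predict_with_random_symmetry(dummy_net, np.zeros((6, 6))))
```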

Hope this can help you. :)

@gms2009 commented Mar 22, 2020

I tried this with 16 processes each playing 4 games. The first process plays like normal, but the second process returns after one getNextState in playGame for the first 4 games, throws them away, and then plays 4 normal games. The third process throws away the first 8 games... so the pwin/nwin totals come out as something other than a multiple of 16.
