Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify distributed key-value store usage #450

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

sethah
Copy link

@sethah sethah commented Mar 9, 2018

The current tutorial on using the distributed key-value store fails to mention that you need to start scheduler and server processes and how to do that. It implies that you can make a one-line change to the local kv-store code and it will work, which isn't true. This PR aims to clarify that the other processes must be started, how to do it, and a bit of information on how MXNet determines the role of a process.

@zackchase
Copy link
Owner

Hi @mli can you review this PR as it relates to distributed training?

@zackchase
Copy link
Owner

@mli ....

@@ -167,8 +167,28 @@
"\n",
"Not only can we synchronize data within a machine, with the key-value store we can facilitate inter-machine communication. To use it, one can create a distributed kvstore by using the following command: (Note: distributed key-value store requires `MXNet` to be compiled with the flag `USE_DIST_KVSTORE=1`, e.g. `make USE_DIST_KVSTORE=1`.)\n",
"\n",
"In the distributed setting, `MXNet` launches three kinds of processes (each time, running `python myprog.py` will create a process). One is a *worker*, which runs the user program, such as the code in the previous section. The other two are the *server*, which maintains the data pushed into the store, and the *scheduler*, which monitors the aliveness of each node.\n",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"In the distributed setting, `MXNet` launches three kinds of processes (each time, running `python myprog.py` will create a process). One is a *worker*, which runs the user program, such as the code in the previous section. The other two are the *server*, which maintains the data pushed into the store, and the *scheduler*, which monitors the aliveness of each node.\n",
"In the distributed setting, `MXNet` launches three kinds of processes (each time, running `python myprog.py` will create a process). One is a *worker*, which runs the user program, such as the code in the previous section. The other two are the *server*, which maintains the data pushed into the store, and the *scheduler*, which monitors the status of each node.\n",

Copy link

@ChaiBapchya ChaiBapchya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nitpicks

@@ -167,8 +167,28 @@
"\n",
"Not only can we synchronize data within a machine, with the key-value store we can facilitate inter-machine communication. To use it, one can create a distributed kvstore by using the following command: (Note: distributed key-value store requires `MXNet` to be compiled with the flag `USE_DIST_KVSTORE=1`, e.g. `make USE_DIST_KVSTORE=1`.)\n",
"\n",
"In the distributed setting, `MXNet` launches three kinds of processes (each time, running `python myprog.py` will create a process). One is a *worker*, which runs the user program, such as the code in the previous section. The other two are the *server*, which maintains the data pushed into the store, and the *scheduler*, which monitors the aliveness of each node.\n",
"\n",
"To use the distributed key-value store, we must first start a scheduler process and at least one server process. When the MXNet library is imported in a process, it checks what the process's role is through the `DMLC_ROLE` environment variable. Starting a server or scheduler is as simple as importing MXNet with the appropriate environment variables set.\n",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"To use the distributed key-value store, we must first start a scheduler process and at least one server process. When the MXNet library is imported in a process, it checks what the process's role is through the `DMLC_ROLE` environment variable. Starting a server or scheduler is as simple as importing MXNet with the appropriate environment variables set.\n",
"To use the distributed key-value store, we must first start a scheduler process and at least one server process. When the MXNet library is imported in a process, it checks what the process role is through the `DMLC_ROLE` environment variable. Starting a server or scheduler is as simple as importing MXNet with the appropriate environment variables set.\n",

"In the distributed setting, `MXNet` launches three kinds of processes (each time, running `python myprog.py` will create a process). One is a *worker*, which runs the user program, such as the code in the previous section. The other two are the *server*, which maintains the data pushed into the store, and the *scheduler*, which monitors the aliveness of each node.\n",
"\n",
"It's up to users which machines to run these processes on. But to simplify the process placement and launching, MXNet provides a tool located at [tools/launch.py](https://github.com/dmlc/mxnet/blob/master/tools/launch.py). \n",
"It's up to users which machines to run the worker, scheduler, and server processes on. But to simplify the process placement and launching, MXNet provides a tool located at [tools/launch.py](https://github.com/dmlc/mxnet/blob/master/tools/launch.py). \n",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"It's up to users which machines to run the worker, scheduler, and server processes on. But to simplify the process placement and launching, MXNet provides a tool located at [tools/launch.py](https://github.com/dmlc/mxnet/blob/master/tools/launch.py). \n",
"It's up to users to decide which machines to run the worker, scheduler, and server processes on. But to simplify the process placement and launching, MXNet provides a tool located at [tools/launch.py](https://github.com/dmlc/mxnet/blob/master/tools/launch.py). \n",

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants