Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ParameterServerController for parameter server python api #1051

Merged
merged 11 commits into from
Jan 11, 2017

Conversation

jacquesqiao
Copy link
Member

@jacquesqiao jacquesqiao commented Jan 3, 2017

1,code clean, ParameterServer2Main.cpp and TrainerMain.cpp have duplicated code.
2, the duplicated code use many gflags to control the init of parameter server, so add a wrapper for pserver, change gflags to proto config.
refs #1039

More about this PR (copied from IM chat with @jacquesqiao):

1,ParameterServer2Main.cpp and TrainerMain.cpp 这两个文件有一大段重复代码。使用很多gflags + if/else控制parameterserver的初始化逻辑,比价容易造成混乱
所以现在统一用proto 来配置,兼容之前的gflags方式,在后边的python api中,就可以通过python构造proto的方式来初始化parameter server,而不用通过 initPaddle("-flags=1", "flags2=2")的方式进行初始化了

@@ -0,0 +1,70 @@
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我现在对util这样的命名特别紧张——我发现一般都是大家懒得想明白应该叫什么的时候就叫util了。

在这里,看上去是想叫 PServerController 什么的。

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool,同意。

PServerController是个好名字。
感觉程序员起名字是一个非常痛苦的事情。一痛苦就会用一些比较常用的名字,比如Utils

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

/**
* @brief start all pserver thread in this PServerUtil.
*/
void start();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我们的naming convention指定的怎么样了? @reyoung

在这里constructor是camel形式,但是methods都是小写。显然不一致呀。

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里因为构造函数和类名必须一致。类名必须是UpperCamelCase。比如 "SomeClass".而函数名是"lowerCamelCase",比如"someMethod"。

这样做的好处是,我们可以通过判断出一个东西是不是类型了。

比如

class SomeClass {
public:
    class Helper {
    };

   static Helper helper();
};

SomeClass::Helper 是类型,而SomeClass::helper是函数。

@@ -0,0 +1,43 @@
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

上面有文件叫 PServerUtils.*,这里叫ParameterServer,显然不一致呀。

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个配置文件确实是用来配置parameter server的,目前的pserverutil封装了几个parameter server线程,根据config来创建这些线程。

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我的意思是到底应该叫 pserver 还是 parameter server 呢?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已经按照命名规范修改为ParameterServerController

@@ -0,0 +1,70 @@
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool,同意。

PServerController是个好名字。
感觉程序员起名字是一个非常痛苦的事情。一痛苦就会用一些比较常用的名字,比如Utils

/**
* @brief start all pserver thread in this PServerUtil.
*/
void start();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里因为构造函数和类名必须一致。类名必须是UpperCamelCase。比如 "SomeClass".而函数名是"lowerCamelCase",比如"someMethod"。

这样做的好处是,我们可以通过判断出一个东西是不是类型了。

比如

class SomeClass {
public:
    class Helper {
    };

   static Helper helper();
};

SomeClass::Helper 是类型,而SomeClass::helper是函数。

void join();

private:
std::vector<std::shared_ptr<ParameterServer2>> pservers_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::vector<std::shared_ptr<ParameterServer2>> =>std::vector<std::unique_ptr<ParameterServer2>>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


namespace paddle {

class PServerUtil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

class PServerUtil => class PServerUtil final

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

PServerController* PServerController::createByGflags() {
auto& pServerConfig = *paddle::PServerController::initConfigByGflags();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里有内存泄露

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里直接用栈变量不可以么?也就是

ParameterServerConfig config;
config.set_nics(FLAGS_nics);
...

这样。同时,initConfigByGflags 只被 createByGFlags 调用,没必要extract成一个private的static member function了吧。

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

PServerController::~PServerController() { this->join(); }

ParameterServerConfig* PServerController::initConfigByGflags() {
ParameterServerConfig* config = new ParameterServerConfig();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

尽量不要用C++的new关键词。

最省事的方法是:

auto config = std::make_shared<ParameterServerConfig>();

或者是

auto config = std::make_unique<ParameterServerConfig>();  // since c++ 14

虽然目前没有make_unique,不过回头我加上吧。。

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

对了,勉强在Cpp里面和java new语意一致的东西是 std::make_shared<类型名>(参数)。只是std::make_unique会快一点。

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

赞,多谢,已经修改了,不过make_unique打算如何引入?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

直接加上这个函数也可以。。http://stackoverflow.com/questions/17902405/how-to-implement-make-unique-function-in-c11

判断一下C++版本, if __cplusplus != 14,那么就加上make_unique。

Copy link
Collaborator

@reyoung reyoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Cool.

@jacquesqiao jacquesqiao changed the title Add pserver util for pserver python api Add ParameterServerController for parameter server python api Jan 6, 2017

namespace paddle {

class ParameterServerController final {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

新增的class应该有class comment

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

required int32 trainer_id = 1;
}

message ParameterServerConfig {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里应该有个注释说明这个proto message的用意。

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


package paddle;

message ParameterClientConfig {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里应该有个注释说明这个proto message的用意。

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done,写的比较简单

}

void ParameterServerController::start() {
LOG(INFO) << "pserver sizes : " << pservers_.size();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"pserver sizes" ==> "number of pserver instances"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

LOG(INFO) << "pserver sizes : " << pservers_.size();
int i = 0;
for (const auto& pserver : pservers_) {
LOG(INFO) << "pserver started : " << i;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOG(INFO) << "Staring pserver " << i;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


ParameterServerController::~ParameterServerController() { this->join(); }

ParameterServerController* ParameterServerController::createByGflags() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

createFromGflags

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done,这个地方之前确实纠结过

}
}

void ParameterServerController::join() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

join 对应的是 fork, start 对应的是 wait。这里的join 应该改名为 wait 才对。

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done,好主意

@backyes
Copy link
Contributor

backyes commented Jan 11, 2017

@jacquesqiao

trainer和pserver除了日常的通信,还有一个同步控制(核心目的是为了实现async-sgd、sync-sgd、pass起始、batch 起始等核心逻辑)的逻辑,从这个ParameterServerController实现来看,主要是起到对启动pserver进程的控制,没有同步的控制逻辑。

问题:

  • 如果这个api 将来支持动态增减pserver的功能,那么这个api应该要支持对同步逻辑的感知。 因为pserver动态增减以后, 同步逻辑会有变化(比如同步节点数变化了),trainer是否应该同步这个ParameterServerController来获取pserver同步节点数的信息? 如果是,那么这个api应该是否要考虑对pserver同步逻辑的感知。

@backyes
Copy link
Contributor

backyes commented Jan 11, 2017

@jacquesqiao

现在版本中有一个耦合在trainer中的类似『ParameterServerController』的功能, 参见:

https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/trainer/RemoteParameterUpdater.cpp#L623

它主要强调对SGD算法(sync-sgd, async-sgd)层面的逻辑控制。

@jacquesqiao
Copy link
Member Author

jacquesqiao commented Jan 11, 2017

@backyes

trainer和pserver除了日常的通信,还有一个同步控制(核心目的是为了实现async-sgd、sync-sgd、pass起始、batch 起始等核心逻辑)的逻辑,从这个ParameterServerController实现来看,主要是起到对启动pserver进程的控制,没有同步的控制逻辑。

对的,目前ParameterServerController主要是对parameter server那些thread的封装,干了一些初始化和简单的控制功能,而且只能控制一个ps进程。

如果这个api 将来支持动态增减pserver的功能,那么这个api应该要支持对同步逻辑的感知。 因为pserver动态增减以后, 同步逻辑会有变化(比如同步节点数变化了),trainer是否应该同步这个ParameterServerController来获取pserver同步节点数的信息? 如果是,那么这个api应该是否要考虑对pserver同步逻辑的感知。

后面可以考虑,不过trainer的这个controller貌似是训练这边的一个逻辑,直接和ParameterServer2通信,后面的具体实现,可以再考虑。

@jacquesqiao jacquesqiao merged commit f8a529c into PaddlePaddle:develop Jan 11, 2017
zhhsplendid pushed a commit to zhhsplendid/Paddle that referenced this pull request Sep 25, 2019
* test=develop,adding a description of the CMAKE command
wangxicoding pushed a commit to wangxicoding/Paddle that referenced this pull request Dec 9, 2021
…#1051)

* fix cmake by use abs path

* add boost dependency

Co-authored-by: Zeyu Chen <chenzeyu01@baidu.com>
lizexu123 pushed a commit to lizexu123/Paddle that referenced this pull request Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

4 participants