Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/python.paddle.v2 #1108

Closed
wants to merge 16 commits into from

Conversation

reyoung
Copy link
Collaborator

@reyoung reyoung commented Jan 10, 2017

Rearrange Paddle Packages to paddle.v2. Use import paddle.v2 as paddle currently.

@reyoung reyoung requested review from backyes, beckett1124, wangkuiyi and jacquesqiao and removed request for backyes and beckett1124 January 10, 2017 07:20
import random

import paddle.v2 as paddle
import py_paddle.swig_paddle as api
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我总觉得这个swig_paddle需要稍微封装一下

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

看到下边已经改了

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

inference = fc_layer(input=hidden2, size=10, act=SoftmaxActivation())
cost = classification_cost(
input=inference, label=data_layer(
imgs = paddle.config.data_layer(name='pixel', size=784)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

network的config和上边那些optimizer的config是不是分开比较好

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

其实都是生成protobuf的东西。先放到一个包里面了。

@@ -4,7 +4,8 @@ packages=['paddle',
'paddle.proto',
'paddle.trainer',
'paddle.trainer_config_helpers',
'paddle.utils']
'paddle.utils',
'paddle.v2']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

貌似有个缩进的问题

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@reyoung reyoung mentioned this pull request Jan 11, 2017
@jacquesqiao
Copy link
Member

lgtm

@reyoung reyoung mentioned this pull request Jan 11, 2017
Copy link
Collaborator

@wangkuiyi wangkuiyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

之前我们讨论说先用目前的“不需要定义多个.py文件“的API把所有demo都重写一遍,然后再来看应该如何完善API。

但是我刚才又想了一下,是不是API里如果有一些明显的问题,可以先修正问题,然后再来重写mnist之后的下一个demo。这样效率更高?

在写mnist的时候,我们不要 import * from 已有的Python packages,而是 copy-n-paste 已有的package 到 paddle.v2。这样我们就可以在”把mnist demo写得顾名思义“这个过程里,修改copy 过来的实现。当我们针对每个demo重复这个过程之后,我们是不是就得到了一个完备的v2 API了。

_temp_optimizer_ = api.ParameterOptimizer.create(opt_config)
opt_config_proto = paddle.config.parse_optimizer(optimizer_config)
opt_config = paddle.raw.OptimizationConfig.createFromProto(opt_config_proto)
_temp_optimizer_ = paddle.raw.ParameterOptimizer.create(opt_config)
enable_types = _temp_optimizer_.getParameterTypes()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这一段没看明白——我们需要optimizer吗?为什么只create了一个temproary optimzier,然后取了其中一个property(enable_types),用来生成一个 gradient machine 就完了。我们到底需不需要optimizer这个概念呢?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不需要optimizer这个概念。

_temp_optimizer_ = api.ParameterOptimizer.create(opt_config)
opt_config_proto = paddle.config.parse_optimizer(optimizer_config)
opt_config = paddle.raw.OptimizationConfig.createFromProto(opt_config_proto)
_temp_optimizer_ = paddle.raw.ParameterOptimizer.create(opt_config)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L73 到 L75 看上去是要按照根据一些config信息创建一个optimizer?如果是这样为什么需要 protobuf message 和一个高阶函数 parse_optimizer 呢?是不是下面这样就可以了:

optimizer = paddle.v2.optimizer.Adam(
    learning_rate=1e-4,
    batch_size=1000,
    model_averager=paddle.v2.config.ModelAverage(average_window=0.5),
    regularizator=paddle.v2.regularizer.L2(rate=0.5))

这样也不需要定义 def optimizer_config() 这个函数了。

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

当然,这样更好些。

model_config, api.CREATE_MODE_NORMAL, enable_types)
model_config = paddle.config.parse_network(network_config)
m = paddle.raw.GradientMachine.createFromConfigProto(
model_config, paddle.raw.CREATE_MODE_NORMAL, enable_types)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

类似的,如果这里只是要create一个gradient mahcine,貌似应该是:

images  = paddle.v2.layer.data(name='pixel', size=784)
hidden1 = paddle.v2.layer.fc(input=images, size=200)
hidden2 = paddle.v2.layer.fc(input=hidden1, size=200)
classes = paddle.v2.layer.fc(input=hidden2, size=10, act=paddle.config.SoftmaxActivation())
cost    = paddle.v2.cost.classification(
    input=classes, 
    label=paddle.v2.layer.data(name='label', size=10))
gm = paddle.gradient_machine.create(cost)

这样也不需要定义 network_config 函数了?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这样实现,Paddle需要比较大规模的重构。

主要是现在Paddle解析网络配置的过程,是调用fc之类的函数,在这个函数里面写了一个全局的变量。
而传cost的方式,相当于在这个cost变量里,记录了网络的所有拓扑。这需要我们的返回值记录原来全局变量中的信息。

并且,其实还有一个问题,就是有可能一个神经网络会有多个输出值。可能不只有一个cost。使用传函数的办法也并不算非常不优雅。

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

在目前config_parser的实现里边,还必须是一个函数,或者是一个独立的文件,而且这里的改造代价比较大。https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/trainer/config_parser.py#L3444
这是一个几千行代码的文件,后续应该是要重构掉的。

Copy link
Collaborator Author

@reyoung reyoung Jan 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wangkuiyi 如果确定传变量的方式没有问题的话,那么我们要不就重构一下这个解析了?


# This type check is not useful. Only enable type hint in IDE.
# Such as PyCharm
assert isinstance(m, api.GradientMachine)
assert isinstance(m, paddle.raw.GradientMachine)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我理解如果 m/gm 是上面一行 gm = paddle.v2.gradient_machine.create(...) 这样产生的,读者自然相信 gm 的类型是一个 gradient machine,也就不需要这个 assertation 来增加程序的可读性了。

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这行完全可以不需要。主要是为IDE增加的type信息。常见的IDE可以根据这个type信息做代码提示。

写这行的主要原因是我们用swig生成的python代码。swig并没有按照Python的风格注释函数返回值的类型。如果我们从C-API写的话,就没这个问题了。


# Initialize Parameter by numpy.
init_parameter(network=m)
m.randParameters()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

randParameters是一个函数吗?按照Python style guide,函数名应该是 randomize_parameters

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

是。但这里因为使用的是SWIG编译的C++头文件,所以还是按照C++的命名方式了。如果用C-API暴露,就没有这个问题了。

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

randParameters是从cpp文件中通过swig暴露出来的接口


# Create Local Updater. Local means not run in cluster.
# For a cluster training, here we can change to createRemoteUpdater
# in future.
updater = api.ParameterUpdater.createLocalUpdater(opt_config)
assert isinstance(updater, api.ParameterUpdater)
updater = paddle.raw.ParameterUpdater.createLocalUpdater(opt_config)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updater = paddle.v2.parameter_updater.create(
    optimizer = paddle.v2.optimizer.Adam(),
    learning_rate=1e-4,
    batch_size=1000,
    model_averager=paddle.v2.config.ModelAverage(average_window=0.5),
    regularizator=paddle.v2.regularizer.L2(rate=0.5))

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

没问题。有道理。

updater = api.ParameterUpdater.createLocalUpdater(opt_config)
assert isinstance(updater, api.ParameterUpdater)
updater = paddle.raw.ParameterUpdater.createLocalUpdater(opt_config)
assert isinstance(updater, paddle.raw.ParameterUpdater)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

类似的,如果 updater 是上面一行那样create的,读者应该不依赖这一行来了解updater的类型了。


# Initialize ParameterUpdater.
updater.init(m)

# DataProvider Converter is a utility convert Python Object to Paddle C++
# Input. The input format is as same as Paddle's DataProvider.
converter = DataProviderConverter(
input_types=[dp.dense_vector(784), dp.integer_value(10)])
converter = paddle.data.DataProviderConverter(input_types=[
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里为什么需要暴露一个 converter 出来呢?按照上面comment的解释——把一个Python obj转换为一个C++ obj——这个样的逻辑不应该出现在API里,暴露给用户吧?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

是的,可以封装到数据的Iterator里面。

@@ -0,0 +1,12 @@
from paddle.trainer_config_helpers import *
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里 import *,让我们丧失了对 paddle.v2 package 里的内容的掌握——只要有人修改了 paddle.trianer_config_helpers 里的内容,这里的symbols就发生变化了吧?

我建议,我们趁此机会,先把 mnist 这个 demo 需要的内容 copy-n-paste 过来。然后依据把 mnist demo 写的让读者能“顾名思义”的原则,修改 copy 过来的库。

随后我们一个一个demo的过,重复上述过程,得到的 paddle.v2 应该就是我们想要的了吧。

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不过其实paddle.trainer_config_helpers里面暴露的符号是严格控制的。里面使用了__all__来控制符号暴露。

copy and paste这个包,会少暴露非常多东西。比如,我们教程里面的MNIST可能使用全连接做的。用户可能想改成卷积之类的操作。但是,如果只是copy and paste demo需要的接口的话,卷积很可能就没复制过去。这用户就缺乏了这部分灵活性了。

同时,和之前的一个comment类似,如果我们真的需要使用『返回值』而不是『函数』来去定义网络结构的话,那其实所有的配置解析都要重写一下。copy and paste反而不好,不如直接重写一个解析过程。

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我建议copy-n-paste,就是为了“重写”,而不只是为现有symbols在v2 package下面建立一个link。

@@ -178,7 +170,7 @@ def main():
test_data_generator = input_order_converter(read_from_mnist(test_file))
for data_batch in generator_to_batch(test_data_generator, 512):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

如果能写成下面这样会更容易让用户理解:
data_reader = read_from_mnist(test_file)
for data_batch in data_reader:
m.forward(data_batch, outArgs, paddle.raw.PASS_TEST)
...

input_order_converter 和 converter这两个东西会让用户难以理解。用户能做的多半只是照抄代码,但是并不知道为什么要用这些converter. 前面创建optimizer的那段代码也是同样的问题。

一个好的API最好能做到每一步调用都让用户很容易理解为什么要做。

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

有道理。好的。

这两个可以直接patch到forward这个函数里面。

yield items[:min(i + 1, size)]


class IDataPool(object):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jacquesqiao @wangkuiyi

Please review this interface.

这个是Paddle API返回数据的接口。只有两个函数。另外,其实在Python里面,没有『接口』这个概念。这里写一下只是为了code清晰一点。

Copy link
Collaborator

@wangkuiyi wangkuiyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

两个感觉:

  1. over engineering -- 有些classses和functions在我看来是不需要的。

  2. 新增的概念,例如Layer和Model和DataPool是没有commets来说明“用法”或者说“要支持的语法”的。

  3. 这个PR越长越大,目前10个文件了,不知道将来会有多少个文件。一般一个PR如果不想把reviewer的脑袋撑爆,3个文件比较合适。这个PR如果拆成多个——比如Model一个,Layer一个,DataPool一个,例子一个,会比较合适。

@@ -0,0 +1,12 @@
from paddle.trainer_config_helpers import *
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我建议copy-n-paste,就是为了“重写”,而不只是为现有symbols在v2 package下面建立一个link。

]


class IDataPool(object):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

如注释里说的,这个“接口”意义不大。如果要教会用户如何写一个data pool,貌似只需要两行注释:

# To create your data pool, please define a class with two methods: next, which returns a batch of data, and reset, which resets the pool.

yield tmp


class NaiveDataPool(IDataPool):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

如果意思是"load all samples into memory",那就叫 InMemoryDataPool 好了。

return self.__pool__[begin:end]


def create_data_pool(file_reader,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不同类型的data pool可能有各自不同的配置参数,不需要统一成同一组。我理解,用户自己调用 InMemoryDataPool 的构造函数创建一个 pool 就行,不需要 create_data_pool 这样一个函数。这样有违最简化的原则。

import collections


class Layer(object):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里应该有一些注释,说明Layer应该怎么被使用。

之前 @hedaoyuan 在 review Error(Status)那个PR的时候提醒过需要预先定义语法(syntax)。我觉得非常有道理。

for offset in xrange(0, len(self.data), self.batch_size):
limit = min(offset + self.batch_size, len(self.data))
yield self.data[offset:limit]
batch_evaluator = model.make_evaluator()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这两个evaluator看上去比较confusing。这里没有参数,所以看上去两个evaluator的功能是一样的。后面调用了.start和 .finish,但是也看不出来具体在干什么。根据名字(evaluator),猜测是用某种test data来评测模型,但是也看不出来用的是什么test data?

for each_item in generator:
yield each_item['pixel'], each_item['label']
# Training process.
model.start()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

model不能start吧?这里想做的是 model.start_training? 在training开始之前需要做什么呢?


for pass_id in xrange(2):
model.start_pass()
Copy link
Collaborator

@wangkuiyi wangkuiyi Feb 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

好像我们之前讨论过,不应该由用户控制pass这个概念,因为pass是由数据集合大小决定的?我隐约记得之前讨论过,可以写成如下方式:

for minibatch, last_batch in enumerate(training_data):
    ....
    if last_batch:
        log("Print model quality: %f", evaluate(model, testing_data))
    ....


m.finish()
model.finish()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

model.finish 要做什么呢?

@qingqing01 qingqing01 mentioned this pull request Feb 13, 2017
@reyoung
Copy link
Collaborator Author

reyoung commented Mar 9, 2017

Closed because this PR's job is done by ourselves in last month.

@reyoung reyoung closed this Mar 9, 2017
lizexu123 pushed a commit to lizexu123/Paddle that referenced this pull request Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants