/chapter4/chapter4_questions&keywords #53

Open
qiwang067 opened this issue May 24, 2021 · 8 comments

Comments

@qiwang067
Contributor

https://datawhalechina.github.io/easy-rl/#/chapter4/chapter4_questions&keywords

Description

@Sunnyzhr

Sunnyzhr commented Aug 6, 2021

Therefore $\nabla p_{\theta}(\tau)=\nabla \log p_{\theta}\left(a_{t}^{n} \mid s_{t}^{n}\right)$

Is this written incorrectly?

@yyysjz1997
Contributor

Thanks for your comment. This should not be a mistake; the full derivation can be found in the tutorial, "Chapter 4: Policy Gradient".
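
For reference, the log-derivative step that the chapter's derivation relies on can be sketched as follows. This is the standard policy-gradient argument written in the thread's $p_\theta$ notation, not a quotation from the tutorial; the trajectory factorization and the symbols $R(\tau)$, $\bar{R}_\theta$, $N$, and $T_n$ are assumptions made here for illustration.

$$
p_{\theta}(\tau) = p(s_1) \prod_{t=1}^{T} p_{\theta}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)
$$

$$
\nabla p_{\theta}(\tau) = p_{\theta}(\tau)\, \nabla \log p_{\theta}(\tau),
\qquad
\nabla \log p_{\theta}(\tau) = \sum_{t=1}^{T} \nabla \log p_{\theta}(a_t \mid s_t)
$$

$$
\nabla \bar{R}_{\theta}
= \mathbb{E}_{\tau \sim p_{\theta}}\left[ R(\tau)\, \nabla \log p_{\theta}(\tau) \right]
\approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^{n})\, \nabla \log p_{\theta}\left(a_{t}^{n} \mid s_{t}^{n}\right)
$$

Under these assumptions, $\nabla \log p_{\theta}\left(a_{t}^{n} \mid s_{t}^{n}\right)$ is the per-step term of the sampled gradient, which may be how the line quoted in the question is meant to be read.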

@Strawberry47

Thank you, author! Thanks♪(・ω・)ノ

@SaleJuice

Would it be better to write "Reinforce" in the keywords in all caps, as "REINFORCE"? That would be more consistent with the earlier notes.

@yyysjz1997
Contributor

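Yes, indeed. Here REINFORCE denotes a classic reinforcement learning algorithm that is based on policy gradients and uses episodic updates, so it should be distinguished from "Reinforce". Thanks for the suggestion; it has been corrected~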
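
To make "policy gradient with episodic updates" concrete, here is a minimal sketch of one REINFORCE-style update, not the tutorial's implementation. It assumes a toy, state-independent softmax policy; the names `discounted_returns`, `softmax`, and `reinforce_update`, the `(action, reward)` episode format, and the use of the discounted return-to-go as the weight are all hypothetical choices made for illustration.

```python
import math


def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for each step of one finished episode."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(theta, episode, gamma=0.99, lr=0.1):
    """Apply one gradient-ascent step after a complete episode.

    theta:   list of action logits (assumed toy policy, ignores the state)
    episode: list of (action, reward) pairs collected with the current policy
    """
    actions = [a for a, _ in episode]
    rewards = [r for _, r in episode]
    returns = discounted_returns(rewards, gamma)
    probs = softmax(theta)  # the policy is held fixed while the episode is replayed

    grad = [0.0] * len(theta)
    for a, g in zip(actions, returns):
        for k in range(len(theta)):
            # d/d theta_k of log softmax(theta)[a] is 1[k == a] - probs[k]
            grad[k] += g * ((1.0 if k == a else 0.0) - probs[k])

    # Ascent: increase the log-probability of actions followed by high returns.
    return [t + lr * gk for t, gk in zip(theta, grad)]
```

The update is applied only once the whole episode has been collected, which is the "episodic update" property mentioned above.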

@SCurry-30

Policy Gradient

@chensisi0730

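Is it just me, or is the notation inconsistent? The policy is sometimes written as p and sometimes as π, and the notation also differs from that of the first three chapters.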

@qiwang067
Contributor Author

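> Is it just me, or is the notation inconsistent? The policy is sometimes written as p and sometimes as π, and the notation also differs from that of the first three chapters.

Using p to denote the policy is meant to make things easier for readers to follow; we will consider unifying the notation later (with corresponding annotations).
As for the overall structure, the chapters explain the material from different perspectives; we will consider unifying the style later.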


7 participants