
/chapter9/chapter9_questions&keywords #59

Open
qiwang067 opened this issue May 24, 2021 · 6 comments

Comments

@qiwang067
Contributor

https://datawhalechina.github.io/easy-rl/#/chapter9/chapter9_questions&keywords

Description

@Strawberry47

Thanks♪(・ω・)ノ

@Strawberry47

Is actor-critic off-policy?

@qiwang067
Contributor Author

Is actor-critic off-policy?

Hello, both A2C and A3C are on-policy.
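
As a brief sketch of why: the actor-critic policy gradient is an expectation over states and actions sampled from the current policy $\pi_\theta$,

$$\nabla_\theta J(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_\theta}(s_t, a_t)\big],$$

so once the policy is updated, old samples no longer come from $\pi_\theta$ and cannot be reused without an off-policy correction such as importance sampling.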

@15138922051

Is there A3C code available? Thanks.

@qiwang067
Contributor Author

Is there A3C code available? Thanks.

There is A2C code here:
https://github.com/datawhalechina/easy-rl/tree/master/codes/A2C

@chenjiaqiang-a

The A2C implementation in the repository differs somewhat from the theoretical formulas. I implemented a version that follows the theory, and it trains reasonably well. Could you take a look and tell me whether this implementation has any problems?

import numpy as np
import torch
from torch.distributions import Categorical

def update(self):
    # Draw the whole rollout buffer, then clear it (on-policy: samples are used once)
    state_pool, action_pool, reward_pool, next_state_pool, done_pool = self.memory.sample(len(self.memory), True)
    self.memory.clear()

    states = torch.tensor(state_pool, dtype=torch.float32, device=self.device)
    actions = torch.tensor(action_pool, dtype=torch.long, device=self.device)  # discrete action indices
    next_states = torch.tensor(next_state_pool, dtype=torch.float32, device=self.device)
    rewards = torch.tensor(reward_pool, dtype=torch.float32, device=self.device)
    masks = torch.tensor(1.0 - np.float32(done_pool), device=self.device)  # 0 at terminal transitions

    # Shared network returns action probabilities (actor head) and state values (critic head)
    probs, values = self.model(states)
    _, next_values = self.model(next_states)

    dist = Categorical(probs)
    log_probs = dist.log_prob(actions)
    # One-step TD error as the advantage estimate: r + gamma * V(s') - V(s)
    advantages = rewards + self.gamma * next_values.squeeze().detach() * masks - values.squeeze()
    # Policy-gradient loss; the advantage is detached so it does not backpropagate into the critic
    actor_loss = -(log_probs * advantages.detach()).mean()
    # Critic loss: squared TD error regresses V(s) toward the TD target
    critic_loss = advantages.pow(2).mean()
    # Entropy bonus encourages exploration
    loss = actor_loss + self.critic_factor * critic_loss - self.entropy_coef * dist.entropy().mean()

    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()
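
For reference, a sketch of the objective this update appears to implement, using the one-step TD error as the advantage estimate (here $c_v$ and $c_e$ stand for critic_factor and entropy_coef):

$$A_t \approx r_t + \gamma\,(1 - d_t)\, V(s_{t+1}) - V(s_t)$$

$$\mathcal{L} = -\mathbb{E}\big[\log \pi_\theta(a_t \mid s_t)\, A_t\big] + c_v\, \mathbb{E}\big[A_t^2\big] - c_e\, \mathbb{E}\big[\mathcal{H}\big(\pi_\theta(\cdot \mid s_t)\big)\big]$$

This matches the standard A2C loss with a TD(0) advantage rather than a multi-step or GAE estimate.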
