
refactor(gry): refactor reward model #636

Open · wants to merge 63 commits into main

Conversation

@ruoyuGao (Contributor) commented Apr 5, 2023

Description

This is a draft PR for refactoring the reward model.

Things finished

  • refactor RED (test case passed; no config file for RED; added to the new entry)
  • refactor RND (CartPole test finished; added to the new entry)
  • refactor GAIL (added to the new entry; CartPole test finished)
  • refactor ICM (CartPole test finished; added to the new entry)
  • refactor PDEIL (only changed print to log; test skipped; added to the new entry)
  • refactor PWIL (only changed print to log; test skipped; added to the new entry)
  • refactor TREX (added to the new entry; only trex_onppo_cartpole does not work yet)
  • refactor NGU (CartPole test finished; added to the new entry)
  • refactor DREX (fixed the CartPole config, works now; added to the new entry)
  • refactor GCL (added to the new entry)

Refactoring

New System Design

Pipeline

[Figure: reward_pipeline.drawio, diagram of the new reward model pipeline]

Check List

  • merge the latest version of the source branch/repo and resolve all conflicts
  • pass style check
  • pass all the tests

@ruoyuGao marked this pull request as draft April 5, 2023 04:26
codecov bot commented Apr 5, 2023

Codecov Report

Merging #636 (919c01b) into main (6b188c9) will increase coverage by 1.51%.
The diff coverage is 91.53%.

❗ Current head 919c01b differs from pull request most recent head b78e36c. Consider uploading reports for the commit b78e36c to get more accurate results

@@            Coverage Diff             @@
##             main     #636      +/-   ##
==========================================
+ Coverage   82.06%   83.57%   +1.51%     
==========================================
  Files         586      580       -6     
  Lines       47515    47428      -87     
==========================================
+ Hits        38991    39640     +649     
+ Misses       8524     7788     -736     
Flag | Coverage Δ
unittests | 83.57% <91.53%> (+1.51%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files | Coverage Δ
ding/entry/__init__.py | 100.00% <ø> (ø)
ding/reward_model/guided_cost_reward_model.py | 29.34% <24.32%> (-57.14%) ⬇️
ding/reward_model/pdeil_irl_model.py | 86.73% <83.33%> (+3.40%) ⬆️
ding/entry/serial_entry_reward_model_onpolicy.py | 88.46% <85.71%> (-1.02%) ⬇️
ding/entry/tests/test_serial_entry_reward_model.py | 89.38% <88.40%> (-1.32%) ⬇️
ding/reward_model/network.py | 92.92% <92.92%> (ø)
ding/reward_model/trex_reward_model.py | 98.40% <94.64%> (+7.06%) ⬆️
ding/entry/serial_entry_reward_model_offpolicy.py | 94.87% <100.00%> (+5.26%) ⬆️
ding/policy/ngu.py | 87.24% <100.00%> (+72.44%) ⬆️
ding/reward_model/__init__.py | 100.00% <100.00%> (ø)
... and 12 more

... and 266 files with indirect coverage changes

@PaParaZz1 added the refactor (refactor module or component) label Apr 6, 2023
ding/reward_model/network.py (outdated review threads, resolved)
ding/reward_model/reword_model_utils.py (outdated review threads, resolved)
@ruoyuGao changed the title from "WIP: polish(gry): refactor reward model" to "WIP: refactor(gry): refactor reward model" Apr 6, 2023
@ruoyuGao changed the title from "WIP: refactor(gry): refactor reward model" to "refactor(gry): refactor reward model" May 4, 2023
@@ -32,3 +33,72 @@ def observation(self, obs):
# print('vis_mask:' + vis_mask)
image = grid.encode(vis_mask)
return {**obs, "image": image}


class ObsPlusPrevActRewWrapper(gym.Wrapper):
Member:

Why add this wrapper here rather than use the wrapper in ding/envs?

Contributor (Author):

Because the wrappers in ding/envs use gym, but for MiniGrid we need gymnasium instead of gym. To avoid a bad influence on other environments, I added this wrapper to the MiniGrid wrappers.
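For context, a minimal sketch of what such a gymnasium-based wrapper can look like; this is illustrative only, and the dict keys and reset handling in the PR's actual ObsPlusPrevActRewWrapper may differ:

```python
import gymnasium as gym


class ObsPlusPrevActRewWrapper(gym.Wrapper):
    """Sketch: augment each observation with the previous action and extrinsic reward."""

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)  # gymnasium returns (obs, info) on reset
        # placeholder previous action/reward before the first step (assumed convention)
        return {'obs': obs, 'prev_action': 0, 'prev_reward_extrinsic': 0.0}, info

    def step(self, action):
        # gymnasium returns (obs, reward, terminated, truncated, info) instead of gym's 4-tuple
        obs, reward, terminated, truncated, info = self.env.step(action)
        obs = {'obs': obs, 'prev_action': action, 'prev_reward_extrinsic': reward}
        return obs, reward, terminated, truncated, info
```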

@@ -10,16 +10,18 @@
),
reward_model=dict(
type='trex',
exp_name='cartpole_trex_onppo_seed0',
Member:

Why exp_name here?

Contributor (Author):

In our original implementation we used exp_name to build the TensorBoard logger, so the reward model relied on the whole config file. The new implementation only uses the reward model config, so I added exp_name to the reward model config.
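For illustration, a hedged sketch of what this change enables: the reward model can build its own TensorBoard logger from its own config instead of the whole experiment config (the log path convention here is an assumption):

```python
import os
from tensorboardX import SummaryWriter

reward_model_cfg = dict(
    type='trex',
    exp_name='cartpole_trex_onppo_seed0',  # exp_name now lives inside the reward model config
)

# build the reward model's TensorBoard logger from its own config only
log_dir = os.path.join('./{}'.format(reward_model_cfg['exp_name']), 'log', 'reward_model')
tb_logger = SummaryWriter(log_dir)
```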

ding/policy/ngu.py (outdated review thread, resolved)
ding/reward_model/base_reward_model.py (review thread, resolved)
ding/reward_model/drex_reward_model.py (outdated review thread, resolved)
@@ -201,6 +133,7 @@ def load_expert_data(self) -> None:
with open(self.cfg.data_path + '/expert_data.pkl', 'rb') as f:
self.expert_data_loader: list = pickle.load(f)
self.expert_data = self.concat_state_action_pairs(self.expert_data_loader)
self.expert_data = torch.unbind(self.expert_data, dim=0)
Member:

Why unbind here?

Contributor (Author):

Because we reuse the concat_state_action_pairs function, and its return format differs from the original function in GAIL, so I used unbind here.
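For context, a small sketch of why unbind is needed, assuming the reused concat_state_action_pairs returns one stacked tensor while the original GAIL code expected a sequence of per-sample tensors:

```python
import torch

# suppose the reused helper returns a single stacked tensor of shape (N, obs_dim + action_dim)
expert_data = torch.randn(4, 6)

# torch.unbind splits it back into a tuple of N tensors of shape (obs_dim + action_dim,)
expert_samples = torch.unbind(expert_data, dim=0)
assert len(expert_samples) == 4
assert expert_samples[0].shape == (6, )
```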

ding/entry/serial_entry_reward_model_offpolicy.py (outdated review thread, resolved)
max_train_iter: Optional[int] = int(1e10),
max_env_step: Optional[int] = int(1e10),
cooptrain_reward: Optional[bool] = True,
pretrain_reward: Optional[bool] = False,
Member:

Add comments for the new arguments.

Contributor (Author):

added
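A possible shape for those comments, using DI-engine's docstring style; the wording is a sketch and the argument names follow the diff above (they are renamed later in this review):

```python
from typing import Optional


def serial_pipeline_reward_model_offpolicy(
        input_cfg,
        seed: int = 0,
        max_train_iter: Optional[int] = int(1e10),
        max_env_step: Optional[int] = int(1e10),
        cooptrain_reward: Optional[bool] = True,
        pretrain_reward: Optional[bool] = False,
):
    """
    Overview:
        Off-policy serial pipeline entry with a reward model (signature and docstring sketch only).
    Arguments:
        - cooptrain_reward (:obj:`Optional[bool]`): Whether to keep training the reward model jointly
            with the policy inside the collect/train loop.
        - pretrain_reward (:obj:`Optional[bool]`): Whether to pretrain the reward model before the
            policy training loop starts.
    """
    ...
```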

# update reward_model, when you want to train reward_model inloop
if cooptrain_reward:
reward_model.train()
# clear buffer per fix iters to make sure replay buffer's data count isn't too few.
Member:

clear buffer per fixed iters to make sure the data for RM training is not too offpolicy

Contributor (Author):

changed

@@ -108,11 +111,11 @@ def serial_pipeline_reward_model_offpolicy(
# collect data for reward_model training
reward_model.collect_data(new_data)
Member:

Add an if cooptrain_reward guard here.

Contributor (Author):

added
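Putting the last two review points together, the guarded pattern the reviewers ask for can look roughly like this; the loop variables (count, new_data) and the clear_data signature are taken from the snippets in this PR, so treat the sketch as approximate:

```python
# inside the entry's collect/train loop (sketch)
new_data = collector.collect(train_iter=learner.train_iter)

if cooptrain_reward:
    # only feed collected data to the reward model when it is trained in the loop
    reward_model.collect_data(new_data)
    # update the reward model
    reward_model.train()
    # clear the buffer every fixed number of iterations so the reward model
    # is not trained on data that is too off-policy
    reward_model.clear_data(iter=count)
```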

try:
serial_pipeline_reward_model_offpolicy(config, seed=0, max_train_iter=2)
except Exception:
assert False, "pipeline fail"
Member:

Add a finally branch.

Contributor (Author):

added
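A hedged sketch of the suggested test shape; the cleanup path is an assumption, since the exact experiment directory depends on the config:

```python
import os
import shutil


def test_reward_model_offpolicy_pipeline(config):
    try:
        serial_pipeline_reward_model_offpolicy(config, seed=0, max_train_iter=2)
    except Exception:
        assert False, "pipeline fail"
    finally:
        # clean up experiment artifacts whether the pipeline succeeded or not
        exp_dir = './cartpole_example_exp'  # hypothetical path; the real test derives it from the config
        if os.path.exists(exp_dir):
            shutil.rmtree(exp_dir)
```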

@@ -0,0 +1,106 @@
from typing import Optional, List, Any
Member:

File name typo: 'reword' should be 'reward'.

Contributor (Author):

fixed

@@ -22,44 +22,49 @@
stop_value=int(1e5),
Member:

Remove the Pitfall and Montezuma configs.

Contributor (Author):

removed


# train reward model
serial_pipeline_reward_model_offpolicy(main_config, create_config)
Member:

wrong usage here

Contributor (Author):

fixed
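The corrected call is not shown in this excerpt; based on how other DI-engine serial entries are invoked (an assumption, not confirmed by the diff), the configs would be passed as a single pair together with a seed:

```python
if __name__ == "__main__":
    # assumed call pattern: pass the configs as one (main_config, create_config) pair
    serial_pipeline_reward_model_offpolicy([main_config, create_config], seed=0)
```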

@@ -22,7 +22,6 @@
action_bins_per_branch=2, # mean the action shape is 6, 2 discrete actions for each action dimension
Member:

Why modify this?

Contributor (Author):

It may have been modified by format.sh. Do I need to change it back?

@@ -24,6 +24,7 @@
update_per_collect=5,
batch_size=64,
learning_rate=0.001,
learner=dict(hook=dict(save_ckpt_after_iter=100)),
Member:

Why add this?

Contributor (Author):

Because the DREX unit test needs to modify learner.hook.save_ckpt_after_iter; without this entry the unit test fails, so I added it.

max_train_iter: Optional[int] = int(1e10),
max_env_step: Optional[int] = int(1e10),
cooptrain_reward: Optional[bool] = True,
pretrain_reward: Optional[bool] = False,
Collaborator:

pretrain_reward -> pretrain_reward_model?

Contributor (Author):

changed

model: Optional[torch.nn.Module] = None,
max_train_iter: Optional[int] = int(1e10),
max_env_step: Optional[int] = int(1e10),
cooptrain_reward: Optional[bool] = True,
Collaborator:

cooptrain_reward -> joint_train_reward_model?

Contributor (Author):

changed

self.tb_logger.add_scalar('icm_reward/action_accuracy', accuracy, self.train_cnt_icm)
loss = self.reverse_scale * inverse_loss + forward_loss
self.tb_logger.add_scalar('icm_reward/total_loss', loss, self.train_cnt_icm)
inverse_loss, forward_loss, accuracy = self.reward_model.learn(data_states, data_next_states, data_actions)
loss = self.reverse_scale * inverse_loss + forward_loss
Collaborator:

self.reverse_scale -> self.reverse_loss_weight

Contributor (Author):

changed

self.tb_logger.add_scalar('icm_reward/action_accuracy', accuracy, self.train_cnt_icm)
loss = self.reverse_scale * inverse_loss + forward_loss
self.tb_logger.add_scalar('icm_reward/total_loss', loss, self.train_cnt_icm)
inverse_loss, forward_loss, accuracy = self.reward_model.learn(data_states, data_next_states, data_actions)
Collaborator:

What does accuracy mean here? Add a comment and rename the variable.

Contributor (Author):

added
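For reference, in ICM the inverse model predicts the action taken between two consecutive observations, so accuracy here most naturally means the fraction of transitions whose action is predicted correctly; a self-contained sketch of that computation (the variable names are illustrative, not necessarily the ones used in the PR):

```python
import torch

# sketch: how an inverse-model action accuracy is typically computed
predicted_action_logits = torch.randn(32, 4)   # inverse model output for 32 transitions, 4 discrete actions
true_actions = torch.randint(0, 4, (32, ))     # actions actually taken in those transitions
predicted_actions = predicted_action_logits.argmax(dim=1)
inverse_action_accuracy = (predicted_actions == true_actions).float().mean()
```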

item['reward'] = item['reward'] / self.cfg.extrinsic_reward_norm_max
elif self.intrinsic_reward_type == 'assign':
item['reward'] = icm_rew
train_data_augmented = combine_intrinsic_exterinsic_reward(train_data_augmented, icm_reward, self.cfg)
Collaborator:

icm_reward -> normalized_icm_reward?

Contributor (Author):

changed
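From the fragments visible in this diff, the helper appears to fold the (normalized) intrinsic reward into each transition according to intrinsic_reward_type; the following is a hedged reconstruction for illustration, not the PR's actual implementation:

```python
from typing import List

import torch


def combine_intrinsic_exterinsic_reward(train_data: List[dict], intrinsic_reward: torch.Tensor, cfg) -> List[dict]:
    # sketch only: the real helper in the PR may differ in names and supported modes
    for item, intr_rew in zip(train_data, intrinsic_reward):
        if cfg.intrinsic_reward_type == 'add':
            # scale the extrinsic reward into a comparable range, then add the intrinsic bonus
            item['reward'] = item['reward'] / cfg.extrinsic_reward_norm_max + intr_rew
        elif cfg.intrinsic_reward_type == 'new':
            # keep the extrinsic reward and store the intrinsic reward separately
            item['intrinsic_reward'] = intr_rew
        elif cfg.intrinsic_reward_type == 'assign':
            # replace the extrinsic reward entirely
            item['reward'] = intr_rew
    return train_data
```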

self.only_use_last_five_frames = config.only_use_last_five_frames_for_icm_rnd

def _train(self) -> None:
def _train(self) -> torch.Tensor:
# sample episode's timestep index
train_index = np.random.randint(low=0, high=self.train_obs.shape[0], size=self.cfg.batch_size)

train_obs: torch.Tensor = self.train_obs[train_index].to(self.device) # shape (self.cfg.batch_size, obs_dim)
Collaborator:

The ': torch.Tensor' annotation here can probably be removed; write an Overview-style comment above it instead.

Contributor (Author):

What exactly do you mean here? Why is it OK not to constrain the type once the comment is written?
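What the reviewer likely means (an assumption, based on the Overview-style docstring convention used in ding) is to document the types and shapes in a docstring and drop the inline variable annotation, roughly like this:

```python
def _train(self) -> torch.Tensor:
    """
    Overview:
        Sample a mini-batch of stored observations and run one training step of the
        intrinsic reward model (sketch; the real method body is longer).
    Returns:
        - loss (:obj:`torch.Tensor`): The training loss of this update step.
    """
    # sample episode timestep indices
    train_index = np.random.randint(low=0, high=self.train_obs.shape[0], size=self.cfg.batch_size)
    # the inline ': torch.Tensor' annotation can be dropped once the type/shape is documented above
    train_obs = self.train_obs[train_index].to(self.device)  # shape (batch_size, obs_dim)
    ...
```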

"""
states_data = []
actions_data = []
#check data(dict) has key obs and action
Collaborator:

Spacing: use bash format.sh ding to format the code.

def clear_data(self, iter: int) -> None:
assert hasattr(
self.cfg, 'clear_buffer_per_iters'
), "Reward Model does not have clear_buffer_per_iters, Clear failed"
Collaborator:

For the error message, give a suggestion on how to fix it, e.g. "you need to refer to xxx and implement the xxx method".

Contributor (Author):

fixed
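One possible way to phrase the error message so it tells the user how to fix the problem (illustrative wording; the final message in the PR may differ, and the buffer attribute at the end is a placeholder):

```python
def clear_data(self, iter: int) -> None:
    assert hasattr(self.cfg, 'clear_buffer_per_iters'), (
        "Reward model config has no field 'clear_buffer_per_iters'. "
        "Please add it to your reward model config (see other reward model configs for reference), "
        "or override clear_data() in your reward model subclass."
    )
    if iter % self.cfg.clear_buffer_per_iters == 0:
        self.train_data.clear()  # hypothetical buffer attribute; the actual attribute may differ
```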

type='rnd-ngu',
),
episodic_reward_model=dict(
# means if using rescale trick to the last non-zero reward
Collaborator:

The grammar of this comment could be polished with GPT-4.


Contributor (Author):

Polished.

Labels: refactor (refactor module or component)

3 participants