[Finetune] replace fine-tuning DefaultTrainer with transformers.Trainer #204

Merged
merged 19 commits into intel:main from the replace-trainer branch, May 13, 2024

Conversation

@harborn (Contributor) commented Apr 25, 2024

Refactoring: replace the fine-tuning DefaultTrainer with transformers.Trainer.

This update will:

  1. disable DefaultTrainer, which contains only a very small subset of training functionality.
  2. enable full argument support for the different training tasks.

@harborn changed the title from "[finetune] replace fine-tuning DefaultTrainer with transformers.Trainer" to "[Finetune] replace fine-tuning DefaultTrainer with transformers.Trainer" on Apr 25, 2024
try:
    common.logger.info("trainer prepare start")
    model.training = True
    trainer.prepare(model, tokenizer, datasets, optimizer, accelerator)
Collaborator:

Does this change remove the prepare function in default_trainer.py?

Comment on lines -205 to -210
if accelerate_mode == "FSDP":
    fsdp_plugin = FullyShardedDataParallelPlugin(
        state_dict_config=FullStateDictConfig(offload_to_cpu=False, rank0_only=False),
        optim_state_dict_config=FullOptimStateDictConfig(
            offload_to_cpu=False, rank0_only=False
        ),
Collaborator:

How does transformers.Trainer distinguish between FSDP and DeepSpeed?

Contributor Author:

For FSDP training, we should set the corresponding TrainingArguments options: point fsdp_config to a JSON config file (fsdp_config.json) and enable auto_wrap through the fsdp option.
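For illustration, a minimal sketch of passing these options through transformers.TrainingArguments; the "full_shard" strategy and the fsdp_config.json path are assumed values, not the PR's exact settings:

from transformers import TrainingArguments

# Minimal FSDP setup via TrainingArguments (illustrative values).
training_args = TrainingArguments(
    output_dir="./output",
    fsdp="full_shard auto_wrap",      # enable FSDP with automatic module wrapping
    fsdp_config="fsdp_config.json",   # JSON file holding the remaining FSDP options
)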

Comment on lines 130 to 141
args = {
    "output_dir": config["General"]["output_dir"],
    "gradient_checkpointing": config["General"]["enable_gradient_checkpointing"],
    "save_strategy": save_strategy,
    "bf16": config["Training"]["mixed_precision"] == "bf16",
    "num_train_epochs": config["Training"]["epochs"],
    "per_device_train_batch_size": config["Training"]["batch_size"],
    "per_device_eval_batch_size": config["Training"]["batch_size"],
    "learning_rate": config["Training"]["learning_rate"],
    "logging_steps": config["Training"]["logging_steps"],
    "lr_scheduler_type": config["Training"]["lr_scheduler"],
    "weight_decay": config["Training"]["weight_decay"],
    "gradient_accumulation_steps": config["Training"]["gradient_accumulation_steps"],
}
Contributor:

Can you add the max_train_steps parameter? Otherwise, the UI will not be able to demo a fine-tuning task in a short time.

if max_train_step != 0:
    finetune_config["Training"]["max_train_steps"] = max_train_step

In addition, can other lr_scheduler parameters such as num_warmup_steps be supported?

Contributor Author:

Updated. In our yaml file the option is max_train_steps, while for TrainingArguments the option is max_steps.
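A minimal sketch of that mapping, building on the args dict from the diff above (illustrative, not the PR's exact code):

# Map the yaml option "max_train_steps" onto TrainingArguments' "max_steps".
# TrainingArguments treats -1 as "no explicit step limit".
max_train_steps = config["Training"].get("max_train_steps", 0)
args["max_steps"] = max_train_steps if max_train_steps else -1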

Comment on lines 342 to 339
# accelerate_env_vars = get_accelerate_environment_variable(config)
# runtime_env["env_vars"].update(accelerate_env_vars)
Contributor:
Is get_accelerate_environment_variable no longer needed? Let us remove this function.

Contributor Author:

Yes, it is no longer needed; removed.

@@ -176,13 +173,22 @@ def group_texts(examples):
            desc=f"Grouping texts in chunks of {block_size}",
        )

        return tokenized_datasets

    def convert_dataset(self, tokenizer, dataset):
Contributor:

general_processer is just generating a dataloader; it may be better to call it 'prepare' or 'prepare_dataloader'. Please also align the function names in other files, such as the pretrain modules.

Contributor Author:

Updated!

train_dataloader, eval_dataloader = self.dataprocesser.prepare(tokenizer, dataset)
train_dataloader, eval_dataloader = self.dataprocesser.convert_dataset(tokenizer, dataset)
Contributor:
If we no longer use default_trainer, should this file be removed?

Contributor Author:

default_trainer.py may still be used later. If we are sure the file can be removed, I will delete it in a later change.

Comment on lines +253 to 260
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
Contributor:

How is resuming fine-tuning from a checkpoint supported?

Contributor Author:

Checkpointing and saving of model results are controlled by arguments in training_args; the relevant options include save_only_model, save_strategy, save_steps, and output_dir. Together, these options cover the different needs for saving and loading models and checkpoints. For more details, see https://huggingface.co/docs/transformers/v4.40.2/en/main_classes/trainer#transformers.TrainingArguments
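For illustration only, a minimal sketch (not the PR's code) of how these options combine with Trainer.train's resume_from_checkpoint argument:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",        # checkpoints are written under this directory
    save_strategy="steps",        # save a checkpoint every save_steps steps
    save_steps=500,
    save_only_model=False,        # also keep optimizer/scheduler state so training can resume
)

# Later, resume from the latest checkpoint found in output_dir
# (a specific checkpoint path can be passed instead of True).
trainer.train(resume_from_checkpoint=True)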

Comment on lines -196 to -201
optimizer = common.optimizer.Optimizer.registory.get("DefaultOptimizer")()(
    model,
    config={
        "name": config["Training"]["optimizer"],
        "config": {"lr": config["Training"]["learning_rate"]},
    },
Contributor:
Why remove optimizer?

Contributor Author:

The optimizer will be created inside transformers.Trainer or optimum.habana.transformers.GaudiTrainer.
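A minimal sketch of what this relies on (the variable names here are placeholders, not the PR's code): when the optimizers argument is left at its default, Trainer builds the optimizer and scheduler itself.

from transformers import Trainer

# optimizers defaults to (None, None); in that case Trainer calls its own
# create_optimizer()/create_scheduler() before the training loop, using
# learning_rate, weight_decay, etc. from TrainingArguments.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    optimizers=(None, None),
)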

Comment on lines +253 to +259
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
Contributor:

I think we should pass an optimizer into the Trainer according to the user's configuration.

Contributor Author:

The optimizer's parameters are part of TrainingArguments, and the optimizer is created before the epoch/step loop runs in Trainer.train().
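For reference, a sketch of the optimizer-related knobs that TrainingArguments exposes (the values here are illustrative, not the PR's defaults):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    optim="adamw_torch",      # which optimizer implementation Trainer should build
    learning_rate=5e-5,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)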

Contributor:

But I don’t see the optimizer configured in TrainingArguments. How does the user set optimizer parameters?

args.update({"use_lazy_mode": config["Training"]["hpu_execution_mode"] == "lazy"})
args.update({"pipelining_fwd_bwd": True})
args.update({"throughput_warmup_steps": 3})
args.update({"adam_epsilon": 1e-8})
Contributor:

Are the above three configs hard-coded? Are these values also used in our previous implementation?

Contributor Author:

I removed these hard-coded values. I am considering whether those options should be added to our yaml config file.
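If those options were added to the yaml file, the mapping could look like the following sketch; the key names and defaults are assumptions, not part of this PR:

# Hypothetical yaml keys, with the previously hard-coded values as defaults.
args.update({
    "pipelining_fwd_bwd": config["Training"].get("pipelining_fwd_bwd", True),
    "throughput_warmup_steps": config["Training"].get("throughput_warmup_steps", 3),
    "adam_epsilon": config["Training"].get("adam_epsilon", 1e-8),
})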

def convert_to_training_args(cls, config):
    device = config["Training"]["device"]
    accelerate_mode = config["Training"]["accelerate_mode"]
    checkpoint_dir = config["General"]["checkpoint_dir"]
Contributor:
It seems this is not set.

Contributor Author (@harborn, May 8, 2024):

This option in our yaml config file is used for saving checkpoint files, but with Trainer the checkpoint files are saved to output_dir, so checkpoint_dir seems meaningless. I will change this option to save_strategy, which controls how checkpoint files are saved.
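A small sketch of the resulting mapping; which yaml section holds save_strategy is an assumption here, not confirmed by the PR:

# Read save_strategy from the yaml config ("no", "epoch", or "steps"); Trainer
# writes both checkpoints and the final model under output_dir.
save_strategy = config["General"].get("save_strategy", "no")
args["save_strategy"] = save_strategy
args["output_dir"] = config["General"]["output_dir"]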

"logging_steps": config["Training"]["logging_steps"],
"lr_scheduler_type": config["Training"]["lr_scheduler"],
"weight_decay": config["Training"]["weight_decay"],
"gradient_accumulation_steps": config["Training"]["gradient_accumulation_steps"],
Contributor:
Do we have default values for all these configurations in config? Previously we wrote config["Training"].get("gradient_accumulation_steps", 1)

Contributor Author:

Yes, this option has a default value of 1 in finetune_config.py.
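As a hypothetical illustration of such a default (the actual finetune_config.py may be structured differently):

from pydantic import BaseModel

# Hypothetical config model: gradient_accumulation_steps defaults to 1,
# so the yaml file may omit it.
class Training(BaseModel):
    gradient_accumulation_steps: int = 1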

@harborn force-pushed the replace-trainer branch 3 times, most recently from 55e7968 to 4adf98a, on May 10, 2024 02:22
Comment on lines 334 to 336
"CCL_ZE_IPC_EXCHANGE": "sockets",
"CCL_WORKER_COUNT": str(ccl_worker_count),
"CCL_LOG_LEVEL": "info",
Contributor:

Why are these CCL configurations no longer needed?

@KepingYan (Contributor):
LGTM

@harborn merged commit 3523011 into intel:main on May 13, 2024
25 checks passed