
[trainer] allow processor instead of tokenizer #30864

Open · wants to merge 5 commits into main
Conversation

sanchit-gandhi (Contributor)

What does this PR do?

Fixes #23222 by allowing the user to pass the argument processor to the Trainer and Seq2SeqTrainer (instead of tokenizer).

This is much more intuitive when training multimodal models; previously, we were encouraging users to pass:

trainer = Trainer(
    ...
    tokenizer=processor,
    ...
)

which is quite confusing. After this PR, users can do:

trainer = Trainer(
    ...
    processor=processor,
    ...
)

which is much more sensible.

The Trainer does three things with the tokenizer:

  1. Passes it to the data collator with padding:

         default_collator = (
             DataCollatorWithPadding(tokenizer)
             if tokenizer is not None and isinstance(tokenizer, (PreTrainedTokenizerBase, SequenceFeatureExtractor))
             else default_data_collator
         )

  2. Gets the model input name:

         model_input_name = self.tokenizer.model_input_names[0] if self.tokenizer is not None else None

  3. Saves it during training:

         if self.tokenizer is not None and self.args.should_save:
             self.tokenizer.save_pretrained(output_dir)

We can do all of these things directly with the processor as well. Therefore, all we have to do is set the following in the init method of the Trainer:

self.tokenizer = processor if processor is not None else tokenizer
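
For context, here is a minimal sketch of how that could sit in the Trainer's init (heavily abridged and illustrative; this is not the real Trainer signature):

from typing import Optional

from transformers import PreTrainedTokenizerBase, ProcessorMixin


class Trainer:  # heavily simplified sketch, not the real transformers.Trainer
    def __init__(
        self,
        tokenizer: Optional[PreTrainedTokenizerBase] = None,
        processor: Optional[ProcessorMixin] = None,
        **kwargs,
    ):
        # If a processor is given it takes precedence; it is then the object
        # that is saved alongside checkpoints and pushed to the Hub.
        self.tokenizer = processor if processor is not None else tokenizer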

@muellerzr (Contributor) left a comment


Thanks for doing this! I think it makes sense, but let's guard some user behavior 🤗

Comment on lines 519 to 521
self.tokenizer = processor if processor is not None else tokenizer
if processor is not None and hasattr(processor, "feature_extractor"):
    tokenizer = processor.feature_extractor
Contributor:

We should add a check here: if the user has passed in both tokenizer and processor, raise an error saying only one of them may be passed in.
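
Something like the following, purely as an illustration of the suggested guard (not code from this PR):

# Illustrative guard, assuming `tokenizer` and `processor` are the two init arguments:
if tokenizer is not None and processor is not None:
    raise ValueError(
        "Both `tokenizer` and `processor` were passed to the `Trainer`. "
        "Please pass only one of them."
    )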

Contributor Author:

IMO it's ok to let the user pass both and have the processor take precedence (there's no harm in this for the user)

Contributor:

However, we never save their original tokenizer. This can lead to confusion down the road because their tokenizer is essentially never used. I'd rather guard this explicitly.

Collaborator:

> IMO it's ok to let the user pass both and have the processor take precedence (there's no harm in this for the user)

I disagree here; it makes the behaviour ambiguous. In effect, this PR means we're deprecating the use of the tokenizer argument, so we should make it clear which argument is preferred and push the user towards that, or at least throw a warning.
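
For instance (illustrative only, using the standard library warnings module), something along these lines would at least make the precedence visible:

import warnings

# Illustrative only: surface the precedence when both arguments are supplied.
if tokenizer is not None and processor is not None:
    warnings.warn(
        "Both `tokenizer` and `processor` were passed to the `Trainer`; "
        "`tokenizer` will be ignored in favour of `processor`.",
        UserWarning,
    )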

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@amyeroberts (Collaborator) left a comment


Thanks for working on this!

At the moment, this solution is a bit too audio-specific and introduces silent behaviour. Instead, we should explicitly push users to use processor, and processor should accept all of our processing classes: processors, image processors, feature extractors, and tokenizers.

Comment on lines 520 to 521
if processor is not None and hasattr(processor, "feature_extractor"):
    tokenizer = processor.feature_extractor
Collaborator:

This is super audio-specific and can create surprising behaviour. If I passed in processor=processor, I would expect the processor to be used, not the feature extractor. Instead, if previous scripts, e.g. examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py, want just the feature extractor to be passed in, then that should be specified when calling the trainer, i.e. processor=feature_extractor.

Contributor Author:

Note that all this is doing is setting the padding method in the default data collator:

default_collator = (
    DataCollatorWithPadding(tokenizer)
    if tokenizer is not None and isinstance(tokenizer, (PreTrainedTokenizerBase, SequenceFeatureExtractor))
    else default_data_collator
)

There's no pad method defined for processors, so the processor cannot be used here. Only sequence feature extractors and tokenizers have a pad method defined, so they are the only two viable options.

This is why we look for the corresponding attributes in the processor:

if hasattr(processor, "feature_extractor"):
    tokenizer = processor.feature_extractor
elif hasattr(processor, "tokenizer"):
    tokenizer = processor.tokenizer
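
As a concrete usage example (hypothetical; the Whisper checkpoint is just for illustration), the padding object would be resolved roughly like this before building the default collator:

from transformers import AutoProcessor, DataCollatorWithPadding

# Hypothetical example: the processor itself has no `pad` method, so the
# default collator is built from its inner feature extractor (or tokenizer).
processor = AutoProcessor.from_pretrained("openai/whisper-tiny")

if hasattr(processor, "feature_extractor"):
    padding_object = processor.feature_extractor
elif hasattr(processor, "tokenizer"):
    padding_object = processor.tokenizer
else:
    padding_object = processor

default_collator = DataCollatorWithPadding(padding_object)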

Collaborator:

It doesn't just define padding behaviour: if tokenizer is set, then this object is also uploaded on push_to_hub calls. If we do tokenizer = processor.feature_extractor, then a user specifies a processor but only the feature extractor is uploaded.

Contributor Author:

Note here that we're setting:

tokenizer = processor.feature_extractor

Not:

self.tokenizer = processor.feature_extractor

So the feature extractor is strictly used for padding purposes in the data collator, and is not set as an attribute of the trainer.

In fact, since we set:

self.tokenizer = processor

the processor is correctly the attribute which is both saved and pushed to the hub.

Contributor Author:

Agreed, though, that this behaviour is somewhat "silent" to the user and can be improved on (will iterate on this once we have a design established).

@@ -510,6 +516,10 @@ def __init__(
    ):
        self.place_model_on_device = False

        self.tokenizer = processor if processor is not None else tokenizer
Collaborator:

Instead of setting self.tokenizer to processor, we should:

  • Update all the references of self.tokenizer to self.processor in the trainer. This removes ambiguity for anyone reading the code.
  • If tokenizer is passed in as an argument, raise a warning saying it's deprecated in favour of processor.
  • Add a property tokenizer which returns self.processor, alongside a warning saying self.tokenizer is deprecated (a rough sketch follows this list).
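
A rough sketch of that proposal (illustrative only, not code from this PR):

import warnings


class Trainer:  # simplified sketch of the proposal above
    def __init__(self, processor=None, tokenizer=None, **kwargs):
        if tokenizer is not None:
            warnings.warn(
                "The `tokenizer` argument is deprecated; please use `processor` instead.",
                FutureWarning,
            )
            processor = processor if processor is not None else tokenizer
        self.processor = processor

    @property
    def tokenizer(self):
        warnings.warn(
            "`Trainer.tokenizer` is deprecated; use `Trainer.processor` instead.",
            FutureWarning,
        )
        return self.processor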

Contributor Author:

Note that tokenizer is not deprecated. If we're fine-tuning LLMs, there's no notion of a processor, only a tokenizer. The processor is only relevant when we're training a multimodal model, such as an ASR model.

This is why we maintain the tokenizer attribute in the Trainer. What I propose is that we have two attributes (sketched after the list below):

  • self.tokenizer -> used for LLMs where there is only a tokenizer. Will be None for multimodal models
  • self.processor -> used for multimodal models where there is a processor. Will be None for LLMs
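
A minimal sketch of this two-attribute idea (illustrative only; the helper name is hypothetical):

class Trainer:  # simplified sketch of the two-attribute proposal
    def __init__(self, tokenizer=None, processor=None, **kwargs):
        # Text-only training passes `tokenizer` (processor stays None);
        # multimodal training passes `processor` (tokenizer stays None).
        self.tokenizer = tokenizer
        self.processor = processor

    def _processing_object(self):
        # Hypothetical helper: whichever object should be saved and pushed to the Hub.
        return self.processor if self.processor is not None else self.tokenizer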

Contributor:

I would much rather have self.processor :) Or, to be even clearer: self.multimodal_processor.

@sanchit-gandhi (Contributor Author)

Thanks for the review @amyeroberts! Note that we are not replacing tokenizer with processor (this would break all NLP use-cases), but rather allowing multimodal users the option of passing the processor directly, which allows it to be saved by the Trainer. Let me know what you think of the proposed design here: #30864 (comment)

Likewise, we only extract processor.feature_extractor or processor.tokenizer to do the padding in the data collator; these are the only two classes with valid padding methods, so there's no surprising behaviour here.

@amyeroberts (Collaborator)

@sanchit-gandhi I realise there might be some context missing. There was already a PR to enable this for image processors, #29896, which was ultimately undone in #30129.

The ultimate realisation was that adding an argument like processor has to be thought out carefully to make sure the behaviour is as clear as possible and works as expected for all trainer use cases. One consideration: should we add image_processor and feature_extractor alongside processor? This is definitely clearer, as it disambiguates, but it then means we have to handle many objects and their combinations.

As I mentioned above, this doesn't just affect data collation, but also the objects uploaded to the hub on push_to_hub calls, so it's important that the user can pass everything they need to reload their model to the trainer.

There's an ongoing pull request, #30102, which @NielsRogge has been working on. I agree with his suggestion of a preprocessor argument, which lets users pass any of our processing objects through a single argument.
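
For reference, the kind of union a single preprocessor argument could accept might look like the following (the exact set of classes here is my assumption, not necessarily the design in #30102):

from typing import Union

from transformers import (
    BaseImageProcessor,
    FeatureExtractionMixin,
    PreTrainedTokenizerBase,
    ProcessorMixin,
)

# Assumed union of processing classes a single `preprocessor` argument might accept.
Preprocessor = Union[
    PreTrainedTokenizerBase,
    BaseImageProcessor,
    FeatureExtractionMixin,
    ProcessorMixin,
]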

@sanchit-gandhi (Contributor Author)

I'd missed that PR; thanks for the context @amyeroberts, that's super helpful! Replied directly on the PR: #30102 (comment)

Successfully merging this pull request may close these issues.

ASR example doesn't save tokenizer settings