
[BUG] Cannot log custom transformer model on Azure Databricks when using mlflow.transformers.log_model() #11813

Open
yuxinxu77 opened this issue Apr 24, 2024 · 2 comments
Labels
area/artifacts · area/model-registry · area/models · area/tracking · bug · integrations/databricks

Comments


yuxinxu77 commented Apr 24, 2024

Issues Policy acknowledgement

  • I have read and agree to submit bug reports in accordance with the issues policy

Where did you encounter this bug?

Other

Willingness to contribute

No. I cannot contribute a bug fix at this time.

MLflow version

  • Client: 2.9.2

System information

  • OS Platform and Distribution: Azure Databricks, runtime 14.3 LTS ML, GPU single-node all-purpose cluster, running inside a Databricks notebook
  • Python version: 3.10.12
  • transformers version: I tried both 4.35.2 and a 4.36.* version; both give the same error

Describe the problem

I'm trying to fine-tune an open-source, customized BERT-based model on the toy data provided in its GitHub repository. The goal is to log the training metrics and specs, and eventually log the fine-tuned model.

Somehow, when I run

with mlflow.start_run():
    trainer.train()
    trainer.save_model()
    print(transformers.__version__)
    components = {"model": AutoModel.from_pretrained(training_args.output_dir),
                  "tokenizer": trainer.tokenizer}
    mlflow.transformers.log_model(transformers_model=components, artifact_path=training_args.output_dir, task='feature-extraction')

the call to AutoModel.from_pretrained(training_args.output_dir) raises "OSError: No such device (os error 19)".

Yet when I run

with mlflow.start_run():
    model = AutoModel.from_pretrained(training_args.output_dir)

the error does not occur.

And when I run AutoModel.from_pretrained(training_args.output_dir) on its own, without wrapping it in with mlflow.start_run():, the error also does not appear.

I want to make the first code block work; at the moment this error forces me to manually log the model in a separate run after training (sketched below).
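
For reference, the separate-run workaround looks roughly like this (a sketch only; it reuses the trainer and training_args defined in the full script below):

# Sketch of the current workaround: train in one run, then log the model in a
# second run, where AutoModel.from_pretrained() no longer raises OSError 19.
with mlflow.start_run():
    trainer.train()
    trainer.save_model()

with mlflow.start_run():
    components = {"model": AutoModel.from_pretrained(training_args.output_dir),
                  "tokenizer": trainer.tokenizer}
    mlflow.transformers.log_model(transformers_model=components,
                                  artifact_path=training_args.output_dir,
                                  task='feature-extraction')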

The code imports modules/.py files that are provided here; the relevant ones are arguments.py, data.py, modeling.py, and trainer.py. I put them all inside a folder called finetune at the same directory level as the Databricks notebook I'm running.

Tracking information

REPLACE_ME

Code to reproduce issue

import logging
import os
from pathlib import Path

import transformers
from transformers import AutoConfig, AutoModel, AutoTokenizer, set_seed, pipeline

from finetune.arguments import ModelArguments, DataArguments, \
    RetrieverTrainingArguments as TrainingArguments
from finetune.data import TrainDatasetForEmbedding, EmbedCollator
from finetune.modeling import BiEncoderModel
from finetune.trainer import BiTrainer

import mlflow

logger = logging.getLogger(__name__)

model_args = ModelArguments(model_name_or_path='BAAI/bge-small-en-v1.5')
data_args = DataArguments(train_data = 'toy_finetune_data.jsonl',
                          train_group_size = 2,
                          query_max_len = 64,
                          passage_max_len = 256,
                          query_instruction_for_retrieval = ''
                          )
training_args = TrainingArguments(negatives_cross_device=True,
                                  temperature = 0.02,
                                  fix_position_embedding = True,
                                  sentence_pooling_method = 'cls',
                                  normlized = True,
                                  use_inbatch_neg = True,
                                  output_dir = 'finetune_output',
                                  learning_rate = 1e-5,
                                  num_train_epochs = 5,
                                  per_device_train_batch_size = 1,
                                  dataloader_drop_last = True,
                                  logging_steps = 10,
                                  save_steps = 1000)

if (
    os.path.exists(training_args.output_dir)
    and os.listdir(training_args.output_dir)
    and training_args.do_train
    and not training_args.overwrite_output_dir
):
    raise ValueError(
        f"Output directory ({training_args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome.")

logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO if training_args.local_rank in [-1, 0] else logging.WARN,
)

logger.warning(
    "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
    training_args.local_rank,
    training_args.device,
    training_args.n_gpu,
    bool(training_args.local_rank != -1),
    training_args.fp16,
)
logger.info("Training/evaluation parameters %s", training_args)
logger.info("Model parameters %s", model_args)
logger.info("Data parameters %s", data_args)

set_seed(training_args.seed)

num_labels = 1
tokenizer = AutoTokenizer.from_pretrained(
    model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
    use_fast=False,
)
config = AutoConfig.from_pretrained(
    model_args.config_name if model_args.config_name else model_args.model_name_or_path,
    num_labels=num_labels,
    cache_dir=model_args.cache_dir,
)
logger.info('Config: %s', config)

model = BiEncoderModel(model_name=model_args.model_name_or_path,
                       normlized=training_args.normlized,
                       sentence_pooling_method=training_args.sentence_pooling_method,
                       negatives_cross_device=training_args.negatives_cross_device,
                       temperature=training_args.temperature,
                       use_inbatch_neg=training_args.use_inbatch_neg)

if training_args.fix_position_embedding:
    for k, v in model.named_parameters():
        if "position_embeddings" in k:
            logging.info(f"Freeze the parameters for {k}")
            v.requires_grad = False

train_dataset = TrainDatasetForEmbedding(args=data_args, tokenizer=tokenizer)

trainer = BiTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        data_collator=EmbedCollator(
            tokenizer,
            query_max_len=data_args.query_max_len,
            passage_max_len=data_args.passage_max_len
        ),
        tokenizer=tokenizer
    )

Path(training_args.output_dir).mkdir(parents=True, exist_ok=True)

mlflow.autolog()

with mlflow.start_run():
    trainer.train()
    trainer.save_model()
    print(transformers.__version__)
    components = {"model": AutoModel.from_pretrained(training_args.output_dir),
                  "tokenizer": trainer.tokenizer}
    mlflow.transformers.log_model(transformers_model=components, artifact_path=training_args.output_dir, task='feature-extraction')

Stack trace

OSError: No such device (os error 19)
File <command-1783537123306777>, line 5
      3 trainer.save_model()
      4 print(transformers.__version__)
----> 5 components = {"model": AutoModel.from_pretrained(training_args.output_dir),
      6             "tokenizer": trainer.tokenizer}
      7 mlflow.transformers.log_model(transformers_model = components, artifact_path=training_args.output_dir, task = 'feature-extraction')
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:566, in _BaseAutoModelClass.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    564 elif type(config) in cls._model_mapping.keys():
    565     model_class = _get_model_class(config, cls._model_mapping)
--> 566     return model_class.from_pretrained(
    567         pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
    568     )
    569 raise ValueError(
    570     f"Unrecognized configuration class {config.__class__} for this kind of AutoModel: {cls.__name__}.\n"
    571     f"Model type should be one of {', '.join(c.__name__ for c in cls._model_mapping.keys())}."
    572 )
File /databricks/python_shell/dbruntime/huggingface_patches/transformers.py:21, in _create_patch_function.<locals>.patched_from_pretrained(cls, *args, **kwargs)
     19 call_succeeded = False
     20 try:
---> 21     model = original_method.__func__(cls, *args, **kwargs)
     22     call_succeeded = True
     23     return model
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/transformers/modeling_utils.py:3148, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
   3128     resolved_archive_file, sharded_metadata = get_checkpoint_shard_files(
   3129         pretrained_model_name_or_path,
   3130         resolved_archive_file,
   (...)
   3140         _commit_hash=commit_hash,
   3141     )
   3143 if (
   3144     is_safetensors_available()
   3145     and isinstance(resolved_archive_file, str)
   3146     and resolved_archive_file.endswith(".safetensors")
   3147 ):
-> 3148     with safe_open(resolved_archive_file, framework="pt") as f:
   3149         metadata = f.metadata()
   3151     if metadata.get("format") == "pt":

Other info / logs

REPLACE_ME

What component(s) does this bug affect?

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

What language(s) does this bug affect?

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations
yuxinxu77 added the bug label on Apr 24, 2024
github-actions bot added the area/artifacts, area/model-registry, area/models, area/tracking, and integrations/databricks labels on Apr 24, 2024
daniellok-db (Collaborator) commented:

Hmm, this seems like it might be an issue with the transformers library when the code is unable to reach the Hugging Face cache directory: huggingface/transformers#25179

It looks like this can happen in the case of a permissions error. I wonder if it's possible to try on a different cluster where you have higher permissions.
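
If switching clusters isn't an option, a couple of untested things might help narrow it down, given that the trace fails inside safe_open. The snippet below is only a sketch; use_safetensors is taken from the from_pretrained signature shown in the trace, and the /local_disk0 path is illustrative:

# Untested diagnostic: force the non-safetensors load path to check whether the
# memory-mapped safetensors read is what raises "No such device".
model = AutoModel.from_pretrained(training_args.output_dir, use_safetensors=False)

# Alternatively, copy the checkpoint to cluster-local disk and load from there
# (path is illustrative).
import shutil
local_copy = "/local_disk0/finetune_output_copy"
shutil.copytree(training_args.output_dir, local_copy, dirs_exist_ok=True)
model = AutoModel.from_pretrained(local_copy)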


github-actions bot commented May 2, 2024

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.
