
[BUG] Cannot log custom transformer model on Azure Databricks when using mlflow.transformers.log_model() #11813

Open
yuxinxu77 opened this issue Apr 24, 2024 · 2 comments
Labels
area/artifacts · area/model-registry · area/models · area/tracking · bug · integrations/databricks

Comments


yuxinxu77 commented Apr 24, 2024

Issues Policy acknowledgement

  • I have read and agree to submit bug reports in accordance with the issues policy

Where did you encounter this bug?

Other

Willingness to contribute

No. I cannot contribute a bug fix at this time.

MLflow version

  • Client: 2.9.2

System information

  • OS Platform and Distribution: Azure Databricks, runtime 14.3 LTS ML, GPU single-node all-purpose cluster, running inside a Databricks notebook
  • Python version: 3.10.12
  • transformers version: I tried both 4.35.2 and a 4.36.* version; both give the same error

Describe the problem

I'm trying to fine-tune an open-source, customized BERT-based model on the toy data provided in its GitHub repository. The goal is to log the training metrics and specs, and eventually log the fine-tuned model.

Somehow, when I run

with mlflow.start_run():
    trainer.train()
    trainer.save_model()
    print(transformers.__version__)
    components = {"model": AutoModel.from_pretrained(training_args.output_dir),
                  "tokenizer": trainer.tokenizer}
    mlflow.transformers.log_model(transformers_model=components, artifact_path=training_args.output_dir, task='feature-extraction')

the call to AutoModel.from_pretrained(training_args.output_dir) raises "OSError: No such device (os error 19)".

Yet when I run

with mlflow.start_run():
    model = AutoModel.from_pretrained(training_args.output_dir)

the error does not occur.

And when I run AutoModel.from_pretrained(training_args.output_dir) on its own, without wrapping it in with mlflow.start_run():, the error also does not appear.

I want to make the first code block work; at the moment this error forces me to manually log the model in a separate run after training (sketched below).
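
For reference, the separate-run workaround looks roughly like this (a sketch only; it reuses the trainer and training_args defined in the full script below):

# Sketch of the current workaround: train in one run, then log the model in a
# second run, where AutoModel.from_pretrained() no longer raises OSError 19.
with mlflow.start_run():
    trainer.train()
    trainer.save_model()

with mlflow.start_run():
    components = {"model": AutoModel.from_pretrained(training_args.output_dir),
                  "tokenizer": trainer.tokenizer}
    mlflow.transformers.log_model(transformers_model=components,
                                  artifact_path=training_args.output_dir,
                                  task='feature-extraction')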

The code imports modules/.py files that are provided here; the relevant ones are arguments.py, data.py, modeling.py, and trainer.py. I put them all inside a folder called finetune at the same directory level as the Databricks notebook I'm running.

Tracking information

REPLACE_ME

Code to reproduce issue

import logging
import os
from pathlib import Path

import transformers
from transformers import AutoConfig, AutoModel, AutoTokenizer, set_seed, pipeline

from finetune.arguments import ModelArguments, DataArguments, \
    RetrieverTrainingArguments as TrainingArguments
from finetune.data import TrainDatasetForEmbedding, EmbedCollator
from finetune.modeling import BiEncoderModel
from finetune.trainer import BiTrainer

import mlflow

logger = logging.getLogger(__name__)

model_args = ModelArguments(model_name_or_path='BAAI/bge-small-en-v1.5')
data_args = DataArguments(train_data = 'toy_finetune_data.jsonl',
                          train_group_size = 2,
                          query_max_len = 64,
                          passage_max_len = 256,
                          query_instruction_for_retrieval = ''
                          )
training_args = TrainingArguments(negatives_cross_device=True,
                                  temperature = 0.02,
                                  fix_position_embedding = True,
                                  sentence_pooling_method = 'cls',
                                  normlized = True,
                                  use_inbatch_neg = True,
                                  output_dir = 'finetune_output',
                                  learning_rate = 1e-5,
                                  num_train_epochs = 5,
                                  per_device_train_batch_size = 1,
                                  dataloader_drop_last = True,
                                  logging_steps = 10,
                                  save_steps = 1000)

if (
    os.path.exists(training_args.output_dir)
    and os.listdir(training_args.output_dir)
    and training_args.do_train
    and not training_args.overwrite_output_dir
):
    raise ValueError(
        f"Output directory ({training_args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome.")

logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO if training_args.local_rank in [-1, 0] else logging.WARN,
)

logger.warning(
    "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
    training_args.local_rank,
    training_args.device,
    training_args.n_gpu,
    bool(training_args.local_rank != -1),
    training_args.fp16,
)
logger.info("Training/evaluation parameters %s", training_args)
logger.info("Model parameters %s", model_args)
logger.info("Data parameters %s", data_args)

set_seed(training_args.seed)

num_labels = 1
tokenizer = AutoTokenizer.from_pretrained(
    model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
    use_fast=False,
)
config = AutoConfig.from_pretrained(
    model_args.config_name if model_args.config_name else model_args.model_name_or_path,
    num_labels=num_labels,
    cache_dir=model_args.cache_dir,
)
logger.info('Config: %s', config)

model = BiEncoderModel(model_name=model_args.model_name_or_path,
                       normlized=training_args.normlized,
                       sentence_pooling_method=training_args.sentence_pooling_method,
                       negatives_cross_device=training_args.negatives_cross_device,
                       temperature=training_args.temperature,
                       use_inbatch_neg=training_args.use_inbatch_neg)

if training_args.fix_position_embedding:
    for k, v in model.named_parameters():
        if "position_embeddings" in k:
            logging.info(f"Freeze the parameters for {k}")
            v.requires_grad = False

train_dataset = TrainDatasetForEmbedding(args=data_args, tokenizer=tokenizer)

trainer = BiTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        data_collator=EmbedCollator(
            tokenizer,
            query_max_len=data_args.query_max_len,
            passage_max_len=data_args.passage_max_len
        ),
        tokenizer=tokenizer
    )

Path(training_args.output_dir).mkdir(parents=True, exist_ok=True)

mlflow.autolog()

with mlflow.start_run():
    trainer.train()
    trainer.save_model()
    print(transformers.__version__)
    components = {"model": AutoModel.from_pretrained(training_args.output_dir),
                  "tokenizer": trainer.tokenizer}
    mlflow.transformers.log_model(transformers_model=components, artifact_path=training_args.output_dir, task='feature-extraction')

Stack trace

OSError: No such device (os error 19)
File <command-1783537123306777>, line 5
      3 trainer.save_model()
      4 print(transformers.__version__)
----> 5 components = {"model": AutoModel.from_pretrained(training_args.output_dir),
      6             "tokenizer": trainer.tokenizer}
      7 mlflow.transformers.log_model(transformers_model = components, artifact_path=training_args.output_dir, task = 'feature-extraction')
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:566, in _BaseAutoModelClass.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    564 elif type(config) in cls._model_mapping.keys():
    565     model_class = _get_model_class(config, cls._model_mapping)
--> 566     return model_class.from_pretrained(
    567         pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
    568     )
    569 raise ValueError(
    570     f"Unrecognized configuration class {config.__class__} for this kind of AutoModel: {cls.__name__}.\n"
    571     f"Model type should be one of {', '.join(c.__name__ for c in cls._model_mapping.keys())}."
    572 )
File /databricks/python_shell/dbruntime/huggingface_patches/transformers.py:21, in _create_patch_function.<locals>.patched_from_pretrained(cls, *args, **kwargs)
     19 call_succeeded = False
     20 try:
---> 21     model = original_method.__func__(cls, *args, **kwargs)
     22     call_succeeded = True
     23     return model
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/transformers/modeling_utils.py:3148, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
   3128     resolved_archive_file, sharded_metadata = get_checkpoint_shard_files(
   3129         pretrained_model_name_or_path,
   3130         resolved_archive_file,
   (...)
   3140         _commit_hash=commit_hash,
   3141     )
   3143 if (
   3144     is_safetensors_available()
   3145     and isinstance(resolved_archive_file, str)
   3146     and resolved_archive_file.endswith(".safetensors")
   3147 ):
-> 3148     with safe_open(resolved_archive_file, framework="pt") as f:
   3149         metadata = f.metadata()
   3151     if metadata.get("format") == "pt":

Other info / logs

REPLACE_ME

What component(s) does this bug affect?

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

What language(s) does this bug affect?

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations
yuxinxu77 added the bug label on Apr 24, 2024
github-actions bot added the area/artifacts, area/model-registry, area/models, area/tracking, and integrations/databricks labels on Apr 24, 2024
daniellok-db (Collaborator) commented:

Hmm, this seems like it might be an issue with the transformers library when the code is unable to reach the Hugging Face cache directory: huggingface/transformers#25179

It looks like this can happen in the case of a permissions error. I wonder if it's possible to try on a different cluster where you have higher permissions.
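
If switching clusters isn't an option, a couple of untested things might help narrow it down, given that the trace fails inside safe_open. The snippet below is only a sketch; use_safetensors is taken from the from_pretrained signature shown in the trace, and the /local_disk0 path is illustrative:

# Untested diagnostic: force the non-safetensors load path to check whether the
# memory-mapped safetensors read is what raises "No such device".
model = AutoModel.from_pretrained(training_args.output_dir, use_safetensors=False)

# Alternatively, copy the checkpoint to cluster-local disk and load from there
# (path is illustrative).
import shutil
local_copy = "/local_disk0/finetune_output_copy"
shutil.copytree(training_args.output_dir, local_copy, dirs_exist_ok=True)
model = AutoModel.from_pretrained(local_copy)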


github-actions bot commented May 2, 2024

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.
