[BUG] performing a custom split persists an additional feature labeled split #11932

Open
2 of 23 tasks
hopemiranda opened this issue May 7, 2024 · 1 comment
Labels
area/recipes: MLflow Recipes, Recipes APIs, Recipes configs, Recipe Templates
bug: Something isn't working

hopemiranda commented May 7, 2024

Issues Policy acknowledgement

  • I have read and agree to submit bug reports in accordance with the issues policy

Where did you encounter this bug?

Databricks

Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.

MLflow version

MLflow 2.12.1

System information

  • Databricks: 1 driver (r5d.large, 16 GB memory, 2 cores, 0.45 DBU/h), Runtime 13.3.x-cpu-ml-scala2.12

Describe the problem

When using a custom split instead of the default split_ratios, the output dataframes of the split step (training_data, validation_data, and test_data) contain an extra column labeled split.

Tracking information

REPLACE_ME

Code to reproduce issue

# in split.py

import pandas as pd
from pandas import DataFrame, Series

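def split_fn(df: DataFrame) -> Series:
    # .loc slices are label-based and inclusive, so rows 50 and 100 are
    # assigned twice; the later assignment wins.
    df.loc[0:50, "split"] = "TRAINING"
    df.loc[50:100, "split"] = "TEST"
    df.loc[100:, "split"] = "VALIDATION"
    custom_series = pd.Series(df["split"])

    return custom_series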

## ----------------------------

# in recipe.yaml
  split:
    using: "custom"
    split_method: split_fn

## ----------------------------

# checking the outputs
from mlflow.recipes import Recipe
r = Recipe(profile="databricks")
r.get_artifact("training_data") ## this will show the added column

## ----------------------------

# workaround solution in databricks notebook between steps `split` and `transform`

import pandas as pd
from mlflow.recipes.utils import get_recipe_root_path
from mlflow.recipes.utils.execution import get_step_output_path

_OUTPUT_TRAIN_FILE_NAME = "train.parquet"
_OUTPUT_VALIDATION_FILE_NAME = "validation.parquet"
_OUTPUT_TEST_FILE_NAME = "test.parquet"

training_path = get_step_output_path(get_recipe_root_path(), 'split', _OUTPUT_TRAIN_FILE_NAME)
validation_path = get_step_output_path(get_recipe_root_path(), 'split', _OUTPUT_VALIDATION_FILE_NAME)
test_path = get_step_output_path(get_recipe_root_path(), 'split', _OUTPUT_TEST_FILE_NAME)

train_df = pd.read_parquet(training_path)
validation_df = pd.read_parquet(validation_path)
test_df = pd.read_parquet(test_path)

train_df = train_df.drop(columns=["split"])
validation_df = validation_df.drop(columns=["split"])
test_df = test_df.drop(columns=["split"])

train_df.to_parquet(training_path)
validation_df.to_parquet(validation_path)
test_df.to_parquet(test_path)
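
One thing that may be worth ruling out: the `split_fn` above writes the `split` column into the dataframe it receives. If the split step hands its input frame to the custom function by reference (an assumption, not verified against the MLflow source), that in-place write alone could account for the extra column. Below is a minimal sketch of a variant that builds the labels in a separate Series and leaves the input untouched. It uses row positions rather than index labels, and the boundaries 50 and 100 mirror the effective result of the original function on a default integer index.

```python
import numpy as np
import pandas as pd


def split_fn(df: pd.DataFrame) -> pd.Series:
    """Return a TRAINING/TEST/VALIDATION label per row without modifying df."""
    # Row positions, so the result does not depend on the index labels.
    position = np.arange(len(df))
    # The first matching condition wins: rows 0-49, then rows 50-99, then the rest.
    labels = np.select(
        [position < 50, position < 100],
        ["TRAINING", "TEST"],
        default="VALIDATION",
    )
    # Reuse the input index so the labels align row by row with df.
    return pd.Series(labels, index=df.index, name="split")
```

If the extra column still appears with a function like this, the column is presumably being added inside the split step itself rather than by the user code.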

Stack trace

REPLACE_ME

Other info / logs

# potential solution?
# drop the split column after the dataframes are loaded in lines
# https://github.com/mlflow/mlflow/blob/5cdae7c4321015620032d02a3b84fb6127247392/mlflow/recipes/steps/split.py#L353-L358
# by adding
train_df = train_df.drop(columns=["split"])
validation_df = validation_df.drop(columns=["split"])
test_df = test_df.drop(columns=["split"])
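
The proposed drop can be sketched as a small standalone helper. This only illustrates the idea in plain pandas and is not the MLflow code at the linked lines. The helper name `drop_split_column` and the sample frames are hypothetical, and `errors="ignore"` is used so that frames without a `split` column pass through unchanged.

```python
import pandas as pd


def drop_split_column(*frames: pd.DataFrame, column: str = "split") -> list[pd.DataFrame]:
    """Return copies of the given frames without the split-label column."""
    return [frame.drop(columns=[column], errors="ignore") for frame in frames]


# Hypothetical sample frames standing in for the split step outputs.
train_df = pd.DataFrame({"x": [1, 2], "split": ["TRAINING", "TRAINING"]})
validation_df = pd.DataFrame({"x": [3], "split": ["VALIDATION"]})
test_df = pd.DataFrame({"x": [4]})  # no split column present

train_df, validation_df, test_df = drop_split_column(train_df, validation_df, test_df)
```

Note that if a dataset legitimately contains a feature named `split`, an unconditional drop would remove it too, so a real fix would probably need to tell the two cases apart.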

What component(s) does this bug affect?

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

What language(s) does this bug affect?

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

hopemiranda added the bug label on May 7, 2024
github-actions bot added the area/recipes label on May 7, 2024
hopemiranda changed the title from "[BUG] performing a custom split persists and additional feature labeled split" to "[BUG] performing a custom split persists an additional feature labeled split" on May 8, 2024

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.
