[BUG] performing a custom split persists an additional feature labeled `split` #11932

hopemiranda · 2024-05-07T23:43:47Z

Issues Policy acknowledgement

I have read and agree to submit bug reports in accordance with the issues policy

Where did you encounter this bug?

Databricks

Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.

MLflow version

MLflow 2.12.1

System information

Databricks: 1 Driver 16 GB Memory, 2 Cores Runtime 13.3.x-cpu-ml-scala2.12 r5d.large 0.45 DBU/h

Describe the problem

When using a custom split (`using: "custom"`) instead of the default `split_ratios`, the `training_data`, `validation_data`, and `test_data` output dataframes of the `split` step contain an extra column labeled `split`.

Tracking information

REPLACE_ME

Code to reproduce issue

# in split.py

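import pandas as pd
from pandas import DataFrame


def split_fn(df: DataFrame) -> pd.Series:
    # .loc slices are label-based and inclusive, so rows 50 and 100 are
    # written twice; the later assignment wins.
    df.loc[0:50, 'split'] = 'TRAINING'
    df.loc[50:100, 'split'] = 'TEST'
    df.loc[100:, 'split'] = 'VALIDATION'
    # Note: the assignments above add a `split` column to `df` itself.
    custom_series = pd.Series(df.split)

    return custom_series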

## ----------------------------

# in recipe.yaml
  split:
    using: "custom"
    split_method: split_fn

## ----------------------------

# checking the outputs
from mlflow.recipes import Recipe
r = Recipe(profile="databricks")
r.get_artifact("training_data") ## this will show the added column

## ----------------------------

# workaround solution in databricks notebook between steps `split` and `transform`

import pandas as pd
from mlflow.recipes.utils import get_recipe_root_path
from mlflow.recipes.utils.execution import get_step_output_path

_OUTPUT_TRAIN_FILE_NAME = "train.parquet"
_OUTPUT_VALIDATION_FILE_NAME = "validation.parquet"
_OUTPUT_TEST_FILE_NAME = "test.parquet"

training_path = get_step_output_path(get_recipe_root_path(), 'split', _OUTPUT_TRAIN_FILE_NAME)
validation_path = get_step_output_path(get_recipe_root_path(), 'split', _OUTPUT_VALIDATION_FILE_NAME)
test_path = get_step_output_path(get_recipe_root_path(), 'split', _OUTPUT_TEST_FILE_NAME)

train_df = pd.read_parquet(training_path)
validation_df = pd.read_parquet(validation_path)
test_df = pd.read_parquet(test_path)

train_df = train_df.drop(columns=["split"])
validation_df = validation_df.drop(columns=["split"])
test_df = test_df.drop(columns=["split"])

train_df.to_parquet(training_path)
validation_df.to_parquet(validation_path)
test_df.to_parquet(test_path)
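
One caveat with the workaround above: `DataFrame.drop` raises a `KeyError` when the column is missing, so re-running the cell after the files have already been rewritten would fail. A minimal sketch of a tolerant variant follows; `drop_column_if_present` is a hypothetical helper name, not part of MLflow, and the small dataframe stands in for the parquet outputs.

```python
import pandas as pd


def drop_column_if_present(df: pd.DataFrame, column: str = "split") -> pd.DataFrame:
    """Return a new dataframe without `column`; do nothing if it is absent."""
    # errors="ignore" suppresses the KeyError that drop() raises by default
    # when the label does not exist.
    return df.drop(columns=[column], errors="ignore")


# Stand-in for one of the split step outputs (illustrative data only).
train_df = pd.DataFrame({"x": [1, 2], "split": ["TRAINING", "TRAINING"]})

cleaned = drop_column_if_present(train_df)
# Calling it a second time is safe because the column is already gone.
cleaned_again = drop_column_if_present(cleaned)
```

In the notebook, the three `drop(columns=["split"])` calls could be swapped for this helper so the cell can be re-run without errors.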

Stack trace

REPLACE_ME

Other info / logs

# Potential solution: drop the `split` column after the split dataframes are loaded at
# https://github.com/mlflow/mlflow/blob/5cdae7c4321015620032d02a3b84fb6127247392/mlflow/recipes/steps/split.py#L353-L358
# by adding:
train_df = train_df.drop(columns=["split"])
validation_df = validation_df.drop(columns=["split"])
test_df = test_df.drop(columns=["split"])
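
A possible user-side alternative, sketched under an assumption that has not been verified against the MLflow source: if the split step passes its own dataframe object to `split_fn`, then the `df.loc[..., 'split'] = ...` assignments in the reproduction are what add the column, and building the labels in a separate Series would avoid it. The positional boundaries below match the net effect of the original function on a default integer index (rows 0 to 49 `TRAINING`, 50 to 99 `TEST`, the rest `VALIDATION`).

```python
import pandas as pd


def split_fn(df: pd.DataFrame) -> pd.Series:
    """Label rows by position without writing anything into `df`."""
    # Start with every row labeled VALIDATION, aligned to df's index.
    labels = pd.Series("VALIDATION", index=df.index, name="split")
    # Overwrite the first two positional blocks; iloc slices exclude the stop.
    labels.iloc[:50] = "TRAINING"
    labels.iloc[50:100] = "TEST"
    return labels


# Illustrative input only; the real recipe supplies the ingested dataset.
df = pd.DataFrame({"x": range(150)})
labels = split_fn(df)
```

If the extra column still appears with a non-mutating function like this one, the column is being added inside the split step itself and the fix proposed above would be the relevant one.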

What component(s) does this bug affect?

What interface(s) does this bug affect?

area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support

What language(s) does this bug affect?

language/r: R APIs and clients
language/java: Java APIs and clients
language/new: Proposals for new client languages

What integration(s) does this bug affect?

integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations
integrations/databricks: Databricks integrations


github-actions · 2024-05-15T00:13:33Z

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.

hopemiranda added the bug Something isn't working label May 7, 2024

github-actions bot added the area/recipes MLflow Recipes, Recipes APIs, Recipes configs, Recipe Templates label May 7, 2024

hopemiranda changed the title ~~[BUG] performing a custom split persists and additional feature labeled split~~ [BUG] performing a custom split persists an additional feature labeled split May 8, 2024
