SdcaMaximumEntropy trainer goes into an infinite loop if it takes already transformed data view as an input #4926

@artemiusgreat

System information

  • OS version: Windows 10 Pro x64
  • .NET Version: .NET Core 3.0
  • ML.NET: 1.5.0-preview

Issue

What I did

  • Created a data-preparation pipeline
  • Created an SdcaMaximumEntropy trainer
  • Executed the pipeline on its own, e.g. to debug the transformed data view
  • Appended the trainer to the pipeline and executed the pipeline again, this time with the trainer included

What happened

If I execute the pipeline once, i.e. load the data from an enumerable into a data view and then fit the entire chain (transformations plus trainer), everything works fine.

If I execute the pipeline twice, first on its own and then again as part of the entire chain, the process consumes about 3 GB of the 16 GB of RAM available, then training hangs indefinitely and never finishes.
I worked around this temporarily by setting the MaximumNumberOfIterations option, but I'm not sure that's a good idea.

What I expect

I expect training to finish eventually, no matter how many times I execute the pipeline.
See the comment on the last line of the code below.

Source code

The source code is taken from issue #4903.

public IEstimator<ITransformer> GetPipeline(IEnumerable<string> columns)
{
  var pipeline = Context
    .Transforms
    .Conversion
    .MapValueToKey(new[] { new InputOutputColumnPair("Label", "Strategy") })
    .Append(Context.Transforms.Concatenate("Combination", columns.ToArray())) // merge "dynamic" columns into a single property
    .Append(Context.Transforms.NormalizeMinMax(new[] { new InputOutputColumnPair("Features", "Combination") })) // normalize merged columns into Features
    .Append(Context.Transforms.SelectColumns(new string[] { "Label", "Features" })); // remove everything from data view, except transformed columns

  return pipeline;
}

public IEstimator<ITransformer> GetEstimator()
{
  var options = new SdcaMaximumEntropyMulticlassTrainer.Options
  {
    // MaximumNumberOfIterations = 100  // uncomment this to fix the issue
  };

  var estimator = Context
    .MulticlassClassification
    .Trainers
    .SdcaMaximumEntropy(options)
    .Append(Context.Transforms.Conversion.MapKeyToValue(new[]
    {
      new InputOutputColumnPair("Prediction", "PredictedLabel") // set trainer to use Prediction property as output
    }));

  return estimator;
}

public void TrainModel(IEnumerable<string> columns, IEnumerable<InputModel> items)
{
  var estimator = GetEstimator();
  var pipeline = GetPipeline(columns);
  var inputs = Context.Data.LoadFromEnumerable(items);  // create the data view

  // If I stop execution after the next statement, everything is OK

  var model = pipeline.Append(estimator).Fit(inputs);  // works fine for the data view loaded from enumerables

  // The data-preparation pipeline is already part of the transformation chain, so I don't need the next 2 lines, but I don't understand why they cause the issue

  var pipelineModel = pipeline.Fit(inputs);
  var pipelineView = pipelineModel.Transform(inputs); // execute the pipeline before the training
  var transformedModel = pipeline.Append(estimator).Fit(pipelineView); // use the transformed pipelineView instead of the initial inputs and ... go into an infinite loop ... why?
}
