System information
- OS version: Windows 10 Pro x64
- .NET Version: .NET Core 3.0
- ML.NET: 1.5.0-preview
Issue
What I did
- create the data-preparation pipeline
- create an SdcaMaximumEntropy trainer
- execute the pipeline on its own, e.g. to debug the transformed data view
- append the trainer to the pipeline and execute the pipeline again, this time with the trainer included
What happened
If I execute the pipeline once (i.e. load from enumerables into a data view and then run the entire transformation chain, transformations plus trainer), everything works fine.
If I execute the pipeline twice (first on its own, then as part of the entire transformation chain), it consumes 3 GB of the 16 GB of available RAM, and training then hangs indefinitely and never finishes.
I fixed this temporarily by setting the MaximumNumberOfIterations option, but I'm not sure that's a good idea...
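For reference, the temporary workaround looks roughly like this (a sketch: it caps SDCA's iteration count via the trainer options, using the value 100 from the commented-out line in the repro below; whether that is an appropriate cap in general is unclear to me):

```csharp
// Cap the number of SDCA training iterations so training terminates
// instead of hanging. 100 is the value from the repro; tune as needed.
var options = new SdcaMaximumEntropyMulticlassTrainer.Options
{
    MaximumNumberOfIterations = 100
};
var trainer = Context.MulticlassClassification.Trainers.SdcaMaximumEntropy(options);
```

With the cap in place training stops, but the underlying question of why the second pipeline execution triggers the hang remains.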
What I expect
I expect training to stop eventually, no matter how many times I execute the pipeline.
Check the comment on the last line in the code below.
Source code
The source code is taken from issue #4903.
public IEstimator<ITransformer> GetPipeline(IEnumerable<string> columns)
{
    var pipeline = Context
        .Transforms
        .Conversion
        .MapValueToKey(new[] { new InputOutputColumnPair("Label", "Strategy") })
        .Append(Context.Transforms.Concatenate("Combination", columns.ToArray())) // merge "dynamic" columns into a single property
        .Append(Context.Transforms.NormalizeMinMax(new[] { new InputOutputColumnPair("Features", "Combination") })) // normalize the merged columns into Features
        .Append(Context.Transforms.SelectColumns(new[] { "Label", "Features" })); // remove everything from the data view except the transformed columns

    return pipeline;
}

public IEstimator<ITransformer> GetEstimator()
{
    var options = new SdcaMaximumEntropyMulticlassTrainer.Options
    {
        // MaximumNumberOfIterations = 100 // uncomment this to work around the issue
    };

    var estimator = Context
        .MulticlassClassification
        .Trainers
        .SdcaMaximumEntropy(options)
        .Append(Context.Transforms.Conversion.MapKeyToValue(new[]
        {
            new InputOutputColumnPair("Prediction", "PredictedLabel") // expose the trainer output via the Prediction property
        }));

    return estimator;
}

public void TrainModel(IEnumerable<string> columns, IEnumerable<InputModel> items)
{
    var estimator = GetEstimator();
    var pipeline = GetPipeline(columns);
    var inputs = Context.Data.LoadFromEnumerable(items); // create the data view

    // If I stop execution here, everything is OK.
    var model = pipeline.Append(estimator).Fit(inputs); // works fine for the data view loaded from enumerables

    // The data-preparation pipeline is part of the transformation chain, so I don't need
    // the next two lines, but I don't understand why they cause the issue.
    var pipelineModel = pipeline.Fit(inputs);
    var pipelineView = pipelineModel.Transform(inputs); // execute the pipeline before training

    model = pipeline.Append(estimator).Fit(pipelineView); // use the transformed pipelineView instead of the initial inputs and ... go into an infinite loop ... why?
}