Named Entity Recognition: Generated code to retrain is not working #2877

Open
Polak149 opened this issue Mar 10, 2024 · 2 comments

Comments

@Polak149

System Information (please complete the following information):

  • Model Builder Version (available in Manage Extensions dialog): 17.18.2.2415501
  • Visual Studio Version: 17.9.2

The Train function generated by Model Builder does not work for the "Named Entity Recognition" scenario and throws an exception:

System.ArgumentOutOfRangeException: 'Cannot map column (name: Label, type: Key<UInt32, 0-0>) in data to the user-defined type, Microsoft.ML.Data.VBuffer`1[System.UInt32]. Arg_ParamName_Name'

Using Model Builder, I was able to generate a Named Entity Recognition mlnet model. Model Builder generated a *.training.cs file with a "Train" function:

/// <summary>
/// Train a new model with the provided dataset.
/// </summary>
/// <param name="outputModelPath">File path for saving the model. Should be similar to "C:\YourPath\ModelName.mlnet"</param>
/// <param name="inputDataFilePath">Path to the data file for training.</param>
/// <param name="separatorChar">Separator character for delimited training file.</param>
/// <param name="hasHeader">Boolean if training file has a header.</param>
public static void Train(string outputModelPath, string inputDataFilePath = RetrainFilePath, char separatorChar = RetrainSeparatorChar, bool hasHeader = RetrainHasHeader, bool allowQuoting = RetrainAllowQuoting)
{
    var mlContext = new MLContext();
    var data = LoadIDataViewFromFile(mlContext, inputDataFilePath, separatorChar, hasHeader, true);
    var model = RetrainModel(mlContext, data);
    SaveModel(mlContext, model, data, outputModelPath);
}

Calling this function throws an exception in:

/// <summary>
/// Retrain model using the pipeline generated as part of the training process.
/// </summary>
/// <param name="mlContext"></param>
/// <param name="trainData"></param>
/// <returns></returns>
public static ITransformer RetrainModel(MLContext mlContext, IDataView trainData)
{
    var pipeline = BuildPipeline(mlContext);
    var model = pipeline.Fit(trainData); // <-HERE AN EXCEPTION IS THROWN

    return model;
}

System.ArgumentOutOfRangeException: 'Cannot map column (name: Label, type: Key<UInt32, 0-0>) in data to the user-defined type, Microsoft.ML.Data.VBuffer`1[System.UInt32]. Arg_ParamName_Name'

Here is an example dataset I made for the sake of this post, but every dataset I have tried fails the same way:
test data example.txt

@Polak149
Author

Polak149 commented Mar 19, 2024

The problem originates in the 'LoadIDataViewFromFile' function in *.training.cs, which loads dataset.txt without the tags. I was able to work around this problem by writing my own training function:

private class Label(string key)
{
    public readonly string Key = key;
}

public static void TrainNER(string outputModelPath, string inputLabelsFilePath, string inputDataFilePath)
{
    IEnumerable<Label> GetLabels(string inputLabelsFilePath)
    {
        var lines = File.ReadLines(inputLabelsFilePath);
        return lines.Select(x => new Label(x));
    }
    IEnumerable<ModelInput> GetLine(string fileName)
    {
        using StreamReader sr = File.OpenText(fileName);
        string? line;
        while ((line = sr.ReadLine()) != null)
        {
            var split = line.Split('\t');
            yield return new ModelInput()
            {
                Sentence = split[0],
                Label = split[1..]
            };
        }
    }
    var mlContext = new MLContext();

    var labels = mlContext.Data.LoadFromEnumerable(GetLabels(inputLabelsFilePath));
    var dataView = mlContext.Data.LoadFromEnumerable(GetLine(inputDataFilePath));
    
    var chain = new EstimatorChain<ITransformer>();
    var estimator = chain.Append(mlContext.Transforms.Conversion.MapValueToKey("Label", keyData: labels))
        .Append(mlContext.MulticlassClassification.Trainers.NamedEntityRecognition(outputColumnName: "predicted_label", batchSize: 32, maxEpochs: 10))
        .Append(mlContext.Transforms.Conversion.MapKeyToValue("predicted_label"));
    using var transformer = estimator.Fit(dataView);
    
    // SaveModel is the function generated automatically in *.training.cs
    SaveModel(mlContext, transformer, dataView, outputModelPath);
}
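
Since the workaround above reads the data file by hand (sentence first, then tab-separated tags), it may help to check the file before training. The following is only a minimal sketch: `NerDataCheck` and `FindMalformedLines` are hypothetical names, not part of the generated *.training.cs, and the check assumes that each whitespace-separated word in the sentence is expected to have exactly one tag, which is not confirmed anywhere in this thread.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the generated *.training.cs file.
public static class NerDataCheck
{
    // Returns one message per line that has no tags, or whose tag count
    // differs from its whitespace-separated word count (assumed rule).
    public static List<string> FindMalformedLines(IEnumerable<string> lines)
    {
        var problems = new List<string>();
        int lineNumber = 0;
        foreach (var line in lines)
        {
            lineNumber++;
            if (string.IsNullOrWhiteSpace(line))
                continue; // blank lines are skipped, not reported

            // Same layout as the workaround: sentence, then tab-separated tags.
            var split = line.Split('\t');
            var words = split[0].Split(' ', StringSplitOptions.RemoveEmptyEntries);
            var labels = split[1..];

            if (labels.Length == 0)
                problems.Add($"Line {lineNumber}: no labels found");
            else if (labels.Length != words.Length)
                problems.Add($"Line {lineNumber}: {words.Length} words but {labels.Length} labels");
        }
        return problems;
    }
}
```

It could be called as `NerDataCheck.FindMalformedLines(File.ReadLines(inputDataFilePath))` before `TrainNER`, so that a malformed line is reported with its line number instead of surfacing later as a failure inside `Fit`.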

@LittleLittleCloud
Contributor

@zewditu Can you take a look?
