
Additional Topic Ideas / Suggestions #16

Open
luisquintanilla opened this issue Jun 4, 2020 · 16 comments

Comments

@luisquintanilla
Owner

luisquintanilla commented Jun 4, 2020

Creating this issue to track and get input on ways to improve or expand the contents of this workshop.

Which topics should we dive deeper into?
What are additional topics you'd like to see?

Improvements

  • Add checks to verify that prerequisites are installed correctly
  • Shorten activity time, or add a way to get feedback when participants are done
  • Consider separate workshops for first-time ML users and seasoned ML users
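For the prerequisite checks, one lightweight option is a smoke-test console app that participants run up front. This is only a sketch, not part of the workshop; it assumes a project that already references the Microsoft.ML package:

```csharp
using System;
using Microsoft.ML;

// If both lines print, the .NET SDK works and the Microsoft.ML package restored correctly.
Console.WriteLine($".NET runtime: {Environment.Version}");
var ctx = new MLContext(seed: 1);
Console.WriteLine("MLContext created; ML.NET is installed correctly.");
```

Running `dotnet run` in such a project either prints both lines or fails at restore/startup, which surfaces a broken setup before the first activity.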

Topics

  • Transforms
    • Custom Transforms
    • Expression Transforms
  • Classical ML Scenarios
  • Deep Learning
  • Deployment
    • Additional Targets / Applications
      • ASP.NET Core Web API
      • Azure Functions
      • Desktop
      • Mobile (Xamarin)
  • MLOps
@luisquintanilla
Owner Author

luisquintanilla commented Jun 4, 2020

@briacht @aslotte @bartczernicki @jwood803 tagging you for visibility. It would also be great to get input from community members.

@jwood803
Collaborator

jwood803 commented Jun 4, 2020

We could go over some of the transforms available for pre-processing data. Maybe the custom mapping or expression transforms?
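For reference, a minimal sketch of both transforms; the class and column names here are made up for illustration, and the Expression transform takes a small lambda-like expression string:

```csharp
using System;
using System.Linq;
using Microsoft.ML;

var ctx = new MLContext();

var data = ctx.Data.LoadFromEnumerable(new[]
{
    new InputRow { Text = "the quick brown fox", Score = 3f },
});

var pipeline =
    // CustomMapping: arbitrary C# logic applied to each row
    ctx.Transforms.CustomMapping<InputRow, WordCountOutput>(
            (input, output) => output.WordCount = input.Text.Split(' ').Length,
            contractName: null) // null is fine in memory, but the model can't be saved
        // Expression: a compact expression language evaluated per row
        .Append(ctx.Transforms.Expression("ScoreSquared", "s => s * s", "Score"));

var transformed = pipeline.Fit(data).Transform(data);
Console.WriteLine(string.Join(", ", transformed.Schema.Select(c => c.Name)));

public class InputRow
{
    public string Text { get; set; }
    public float Score { get; set; }
}

public class WordCountOutput
{
    public int WordCount { get; set; }
}
```

The custom mapping is the escape hatch for anything C# can express, while the expression transform stays serializable inside a saved model.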

@CameronVetter

The first issue I ran into: I thought I had the prerequisites installed successfully, but when we got to the point where I was going to use them, I found out I had missed something, and missed significantly enough that I had to try a few things before I got it working. Luckily I could pause the workshop while I got that sorted out. The fix would be to include some test in the prereqs that proves you have everything installed properly.

@CameronVetter

The other issue I ran into was the pace. After I got the prereqs in place, the content felt really slow and the activity times were way too long; I ended up pausing the stream for an hour, doing something else, and then coming back. My suggested fix is to have two tracks / different workshops: one for people brand new to machine learning and ML.NET, and another for people who understand machine learning basics and want to dig into ML.NET, which could go much deeper than the current workshop.

@bartczernicki

I think having some advanced scenarios/architectures that someone can jump to or see demos of would be nice. That way you have a forward-looking view: this is all the coolness you can do if you get started with ML.NET and this workshop. You can pivot that into "best practice tips" or "production considerations". It might not be the best fit here, but with .NET you have many more scenarios to go through, since there are very few full-stack R or Python systems built (they exist, but are few and far between).

@murilocurti

@luisquintanilla I couldn't find your contact info. Could you help me with a problem consuming a GPT-2 ONNX model using an ML.NET transform?

@luisquintanilla
Owner Author

Hi @murilocurti

I suspect this is your post on StackOverflow.

Here is a sample that might help you get started. Decoding the outputs isn't finished yet; basically, the post-processing is still missing.

Hope that helps.

@murilocurti

Hi @luisquintanilla!
That's exactly my post. I'm looking at your sample right now and I think it will help a lot!
I'll try it and get back to you with the results. Thank you so much!

@murilocurti

@luisquintanilla It almost worked :)

I followed your sample, and I'm getting an error when the prediction is evaluated: predictions.First()

See it below:

System.InvalidOperationException
HResult=0x80131509
Message=Splitter/consolidator worker encountered exception while consuming source data
Source=Microsoft.ML.Data
StackTrace:
   at Microsoft.ML.Data.DataViewUtils.Splitter.Batch.SetAll(OutPipe[] pipes)
   at Microsoft.ML.Data.DataViewUtils.Splitter.Cursor.MoveNextCore()
   at Microsoft.ML.Data.RootCursorBase.MoveNext()
   at Microsoft.ML.Data.TypedCursorable`1.TypedCursor.MoveNext()
   at Microsoft.ML.Data.TypedCursorable`1.RowCursorImplementation.MoveNext()
   at Microsoft.ML.PipeEngine`1.<RunPipe>d__2.MoveNext()
   at System.Linq.Enumerable.TryGetFirst[TSource](IEnumerable`1 source, Boolean& found)
   at System.Linq.Enumerable.First[TSource](IEnumerable`1 source)
   at Program.<Main>$(String[] args) in C:\Users\MuriloCurti\source\repos.ai\onnx\src\OnnxGPT2\Program.cs:line 52

This exception was originally thrown at this call stack:
[External Code]

Inner Exception 1:
ArgumentException: Length of memory (12) must match product of dimensions (20).

The Program.cs file is at the link below; the exception is thrown at line 52.
https://gist.github.com/murilocurti/98c24e079b69d5fe3ff404c7bd13445d

In this article, Nikola discusses the same exception (Length of memory (12) must match product of dimensions (20)):
https://rubikscode.net/2021/10/25/using-huggingface-transformers-with-ml-net/

The exception occurred while calling the Predict method of the PredictionEngine object. It turned out that the schema of the PredictionEngine was not correct, even though the VectorType had the correct shape.
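That error message can be reproduced with plain arithmetic: ML.NET flattens the vector column and requires its length to equal the product of the declared dimensions, so 12 token ids against a declared shape of {1, 1, 20} (the dims here are assumed from the message) fail the check:

```csharp
using System;
using System.Linq;

int[] declaredDims = { 1, 1, 20 };                     // product = 20
long[] tokenIds = Enumerable.Repeat(0L, 12).ToArray(); // only 12 ids supplied

int product = declaredDims.Aggregate(1, (a, b) => a * b);
Console.WriteLine(tokenIds.Length == product
    ? "shape matches"
    : $"Length of memory ({tokenIds.Length}) must match product of dimensions ({product})");
```

Padding or truncating the ids to exactly the declared length makes the two numbers agree, which is what the corrected sample in the next comment does.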

What do you think?

Thank you!!!

@luisquintanilla
Owner Author

There was an error in the original sample when padding the data. This should do it. Note that I'm using the LM-head model instead of the standard one, because it gives you the scores for each word in the vocabulary.

using Microsoft.ML;
using Microsoft.ML.Tokenizers;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.Data;

var ctx = new MLContext();

var vocabFilePath = "vocab.json";
var mergeFilePath = "merges.txt";
var onnxModelFilePath = "gpt2-lm-head-10.onnx";

var tokenizer = new Tokenizer(new Bpe(vocabFilePath, mergeFilePath), RobertaPreTokenizer.Instance);

// Fixed shapes for the model's input and output tensors
var shape = new Dictionary<string, int[]>()
{
    {"input1",new int[] {1,1,GPT2Settings.SeqLength}},
    {"output1", new int[] {1,1,GPT2Settings.SeqLength,50257}}
};

// Map the "input1" column through the ONNX model to produce "output1"
var onnxPipeline =
    ctx.Transforms.ApplyOnnxModel(
        modelFile: onnxModelFilePath,
        inputColumnNames: new[] { "input1" },
        outputColumnNames: new[] { "output1" },
        shapeDictionary: shape, gpuDeviceId: null, fallbackToCpu: true);

var inputs = new[] {
    "The brown fox jumped over ",
};

var data =
    inputs.Select(x => new ModelInput { OriginalInput = x, Ids = tokenizer.Encode(x).Ids.Select(n => (long)n).ToArray() });

// Truncate or pad every token sequence to exactly SeqLength entries
var paddedData = data.Select(x => {
    var len = x.Ids.Count();

    var updatedInput = new ModelInput { OriginalInput = x.OriginalInput };
    if(len >= GPT2Settings.SeqLength)
    {
        var truncatedArray = x.Ids.Take(GPT2Settings.SeqLength);
        updatedInput.Ids = truncatedArray.ToArray();
    }
    else
    {
        var paddedArray = Enumerable.Repeat<long>(-50256L, GPT2Settings.SeqLength-len);
        var combinedArray = x.Ids.Concat(paddedArray);
        updatedInput.Ids = combinedArray.ToArray();
    }

    return updatedInput;
});

var test = paddedData.ToArray(); // force evaluation (debugging aid)

var idv = ctx.Data.LoadFromEnumerable(paddedData);

var output = onnxPipeline.Fit(idv).Transform(idv);

var prev = output.Preview();

var predictions = ctx.Data.CreateEnumerable<ModelOutput>(output, reuseRowObject: false);

// Softmax turns raw scores into probabilities
IEnumerable<float> ApplySoftmax(IEnumerable<float> input)
{
    var sum = input.Sum(x => (float)Math.Exp(x));
    var softmax = input.Select(x => (float)Math.Exp(x) / sum);
    return softmax.ToArray();
}

// For each sequence position, pick the highest-probability token from the 50257-entry vocabulary
var nextWords =
    predictions.ToArray()[0].Output1
        .Chunk(50257)
        .AsEnumerable()
        .Select(x =>
        {
            var results =
                ApplySoftmax(x)
                    .Select((c, i) => new { Index = i, Confidence = c })
                    .OrderByDescending(l => l.Confidence)
                    .Take(GPT2Settings.SeqLength)
                    .Select(l => new { Label = tokenizer.Decode(l.Index, true), Confidence = l.Confidence })
                    .First();

            return results;
        })
        .ToArray();

var originalString = inputs.First();
var nextWordIdx = inputs.First().Split(' ').Count();

Console.WriteLine($"{originalString} {nextWords[nextWordIdx].Label}");

struct GPT2Settings
{
    public const int SeqLength = 12;
}

public class ModelInput
{
    public string OriginalInput { get; set; }

    [ColumnName("input1")]
    [VectorType(1, 1, GPT2Settings.SeqLength)]
    public long[] Ids { get; set; }
}

public class ModelOutput
{
    [ColumnName("output1")]
    [VectorType(1 * 1 * GPT2Settings.SeqLength * 50257)]
    public float[] Output1 { get; set; }
}

@murilocurti

@luisquintanilla It worked! I've been swamped with work since the weekend; I'll get back to you with results as soon as I have time.
Thank you so much!

@murilocurti

Hi @luisquintanilla, I finally have time again.

As I said, it worked: the model loads and returns a response, but I can't understand why the response is apparently truncated or not correctly decoded. Do you have any idea?

See the outputs below:

Thanks!

input: .NET Conf is
output: irmed

input: The brown dog jumped over
output: dog

input: My name is Luis and I like
output: !

input: In the darkest depths of mordor
output: C

@murilocurti

murilocurti commented Jan 15, 2023

Looks like the problem is related to the vocab and merges files.

I've re-downloaded the files and the content is OK. I even tested your example from https://devblogs.microsoft.com/dotnet/announcing-ml-net-2-0/

using Microsoft.ML;
using Microsoft.ML.Tokenizers;

// Initialize MLContext
var mlContext = new MLContext();

// Define vocabulary file paths
var vocabFilePath = @"C:\Tokenizers\vocab.json";
var mergeFilePath = @"C:\Tokenizers\merges.txt";

// Initialize Tokenizer
var tokenizer = new Tokenizer(new Bpe(vocabFilePath, mergeFilePath),RobertaPreTokenizer.Instance);

// Define input for tokenization
var input = "the brown fox jumped over the lazy dog!";

// Encode input
var tokenizerEncodedResult = tokenizer.Encode(input);

// Decode results
var decoded = tokenizer.Decode(tokenizerEncodedResult.Ids);
Console.WriteLine(decoded);

And the result was:

the!brown!fox!jumped!over!the!lazy!dog!

@luisquintanilla
Owner Author

@murilocurti thanks for looking into it! You're right; I totally missed that. Can you add a comment with the link to the files you used? Thank you.

@murilocurti

@luisquintanilla for sure!
I'm using the vocabulary resources from the Hugging Face GPT-2 repository, the same ones you mentioned in the blog post.

The direct download links are the following:

https://huggingface.co/gpt2/raw/main/merges.txt
https://huggingface.co/gpt2/raw/main/vocab.json

Thank you!
P.S. My software engineering knowledge is not enough to solve all these problems with machine learning, but I'm learning a lot with your help. Thank you again 🙏

@luisquintanilla
Owner Author

Happy to help. We're learning together here 🙂
