How do I access the n-grams produced by FeaturizeText in Microsoft.ML?

I managed to get a first text analyser running in Microsoft.ML. I would like to get the list of n-grams determined by the model, but I can only get the numerical vectors "counting" occurrences, without knowing which n-gram each slot refers to.

Here is the core of my working code so far:

var mlContext = new MLContext();
var articles = SampleData.Articles.Select(a => new TextData{ Text=a }).ToArray();
var dataview = mlContext.Data.LoadFromEnumerable(articles);
var options = new TextFeaturizingEstimator.Options() {
  OutputTokensColumnName = "OutputTokens",
  CaseMode = TextNormalizingEstimator.CaseMode.Lower,
  KeepDiacritics = false,
  KeepPunctuations = false,
  KeepNumbers = false,
  Norm = TextFeaturizingEstimator.NormFunction.L2,
  StopWordsRemoverOptions = new StopWordsRemovingEstimator.Options() {
    Language = TextFeaturizingEstimator.Language.Dutch,
  },
  WordFeatureExtractor = new WordBagEstimator.Options() {
    NgramLength = 4,
    SkipLength = 1,
    UseAllLengths = true,
    MaximumNgramsCount = new int[] { 20, 10, 10, 10 },
    Weighting = NgramExtractingEstimator.WeightingCriteria.TfIdf,
  },
  CharFeatureExtractor = null,
};
var textPipeline = mlContext.Transforms.Text   
  .FeaturizeText("Features", options, "Text");
var textTransformer = textPipeline.Fit(dataview);
var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData, TransformedTextData>(textTransformer);
foreach (var article in articles)
{
  var prediction = predictionEngine.Predict(article);
  Console.WriteLine($"Article: {article.Text.Substring(0, 30)}...");
  Console.WriteLine($"Number of Features: {prediction.Features.Length}");
  Console.WriteLine($"Features: {string.Join(",", prediction.Features.Take(50).Select(f => f.ToString("0.00")))}\n");
}
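The TextData and TransformedTextData classes are not shown in the question. A minimal sketch of what they might look like, assuming plain classes with a string input and a float[] output whose property name matches the "Features" column (the exact shape is an assumption, not taken from the original code):

```csharp
using System;

// Assumed input type: one article per row, exposed as the "Text" column.
public class TextData
{
    public string Text { get; set; }
}

// Assumed output type: the input plus the numeric vector from the "Features" column.
public class TransformedTextData : TextData
{
    public float[] Features { get; set; }
}
```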
Best answer

Well, I figured it out and wanted to share it here in case anyone runs into the same issue. First, you create your model as usual, here built from the individual text transforms so that the n-gram step has its own output column. Take note of the name of the column where you put the output of the ProduceNgrams step (in our case "NgramFeatures").

Then, on the transformed data, calling GetSlotNames on that column of the schema and GetValues on the resulting slotNames buffer does the trick of fetching the desired n-grams:

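// Namespaces needed: System, System.Collections.Generic, System.Linq,
// Microsoft.ML, Microsoft.ML.Data, Microsoft.ML.Transforms.Text
var textPipeline =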
    mlContext.Transforms.Text.NormalizeText("Tokens", "Text", TextNormalizingEstimator.CaseMode.Lower, false, false, false)
    .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens"))
    .Append(mlContext.Transforms.Text.RemoveDefaultStopWords("Tokens", language: StopWordsRemovingEstimator.Language.Dutch))
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
    .Append(mlContext.Transforms.Text.ProduceNgrams("NgramFeatures", "Tokens"))
    .Append(mlContext.Transforms.Text.LatentDirichletAllocation("LDAFeatures", "NgramFeatures",
      numberOfTopics: 10))
    .Append(mlContext.Transforms.NormalizeLpNorm("Features", "LDAFeatures"));

var textTransformer = textPipeline.Fit(dataview);
var transformedDataView = textTransformer.Transform(dataview);

VBuffer<ReadOnlyMemory<char>> slotNames = default;
transformedDataView.Schema["NgramFeatures"].GetSlotNames(ref slotNames);
// Tokens within an n-gram are joined by '|'; add .Replace('|', ' ') inside the Select for more readable output.
var ngrams = slotNames.GetValues().ToArray().Select(x => x.Span.ToString());
Console.WriteLine($"Ngrams: {string.Join(", ", ngrams)}\n");

var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData, TransformedTextData>(textTransformer);
var articlesWithFeatures = new List<(TextData, TransformedTextData)>();
foreach (var article in articles)
{
  var articleWithFeatures = predictionEngine.Predict(article);
  Console.WriteLine($"Article: {article.Text.Substring(0, 30)}...");
  Console.WriteLine($"Number of Features: {articleWithFeatures.Features.Length}");
  Console.WriteLine($"Features: {string.Join(",", articleWithFeatures.Features.Take(50).Select(f => f.ToString("0.00")))}\n");

  articlesWithFeatures.Add((article, articleWithFeatures));
}
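
Once you have the slot names, you can line them up with a per-article vector from the same column. Note that this has to be the "NgramFeatures" vector; the "Features" column above is LDA-based and has one slot per topic, not per n-gram. A minimal sketch, assuming the n-gram values have already been copied into a float[] (the helper name TopNgrams is hypothetical and not part of Microsoft.ML):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NgramReport
{
    // Hypothetical helper: pairs each slot name with the value at the same index
    // and returns the highest-weighted non-zero entries.
    public static List<(string Ngram, float Weight)> TopNgrams(
        IReadOnlyList<string> slotNames, IReadOnlyList<float> values, int count)
    {
        if (slotNames.Count != values.Count)
            throw new ArgumentException("Slot names and values must have the same length.");

        return slotNames
            .Select((name, i) => (Ngram: name, Weight: values[i]))
            .Where(p => p.Weight != 0f)          // skip n-grams absent from this article
            .OrderByDescending(p => p.Weight)    // strongest n-grams first
            .Take(count)
            .ToList();
    }
}
```

With the code above, this could be called as NgramReport.TopNgrams(ngrams.ToArray(), ngramVector, 10), where ngramVector would come from an extra float[] NgramFeatures property on the output class (again an assumption about how that class is defined).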