
C# AutoML Classification

30 Dec 2019

Motivation for doing this research

Microsoft has created a script to run their AutoML. I have used this script extensively, and the best estimator it found for my dataset was FastTree. That result compelled me to drill down on this estimator to see if I could squeeze out more performance.

In this notebook I run AutoML for a much longer time to observe the training error curve. I want to show a plot of this curve, but that will have to wait for a future version of this notebook.

The other reason for doing this analysis was to obtain a trained model that executes in milliseconds, and Microsoft's ML.NET delivers that. The previous post in this blog uses Python and TPOT; that model is my gold standard, but it took over a week of machine time to train and needs over 4 seconds to make a prediction on new data, which rules it out for live use. With ML.NET I now have a model that is orders of magnitude faster.

#r "nuget:Microsoft.ML"
#r "nuget:Microsoft.ML.AutoML"

using System;
using System.Diagnostics;
using System.Linq;

using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.AutoML;

Routine to print the training progress output

public class BinaryExperimentProgressHandler : IProgress<RunDetail<BinaryClassificationMetrics>> {
  private int _iterationIndex;
  private double _bestF1;

  // Called by AutoML after each run; prints a full row only when the run
  // improves on the best F1 score seen so far, otherwise just the counter.
  public void Report(RunDetail<BinaryClassificationMetrics> iterationResult) {
    if (_iterationIndex++ == 0) {
      Console.WriteLine("     Trainer                               Accuracy    AUC       AUPRC   F1-score  Duration");
    }
    // Failed runs carry no metrics; skip them rather than dereference null.
    if (iterationResult.ValidationMetrics == null) return;

    var trainerName = iterationResult.TrainerName;
    var accuracy = iterationResult.ValidationMetrics.Accuracy;
    var auc      = iterationResult.ValidationMetrics.AreaUnderRocCurve;
    var aupr     = iterationResult.ValidationMetrics.AreaUnderPrecisionRecallCurve;
    var f1       = iterationResult.ValidationMetrics.F1Score;
    var runtimeInSeconds = iterationResult.RuntimeInSeconds;
    if (f1 > _bestF1) {
      _bestF1 = f1;
      Console.WriteLine("{0, 4} {1, -35} {2, 9:F4} {3, 9:F4} {4, 9:F4} {5, 9:F4} {6, 9:F4}",
                        _iterationIndex, trainerName, accuracy, auc, aupr, f1, runtimeInSeconds);
    } else {
      // Overwrite the same console line with the current iteration number.
      Console.Write("{0, 4}\r", _iterationIndex);
    }
  }
}

Routine to print a summary of the model metrics

private static void PrintMetrics(BinaryClassificationMetrics metrics) {
  Console.WriteLine("  Accuracy........................ {0:f6}", metrics.Accuracy);
  Console.WriteLine("  AreaUnderPrecisionRecallCurve... {0:f6}", metrics.AreaUnderPrecisionRecallCurve);
  Console.WriteLine("  AreaUnderRocCurve............... {0:f6}", metrics.AreaUnderRocCurve);
  Console.WriteLine("  F1Score......................... {0:f6}", metrics.F1Score);
  Console.WriteLine("  NegativePrecision............... {0:f6}", metrics.NegativePrecision);
  Console.WriteLine("  NegativeRecall.................. {0:f6}", metrics.NegativeRecall);
  Console.WriteLine("  PositivePrecision............... {0:f6}", metrics.PositivePrecision);
  Console.WriteLine("  PositiveRecall.................. {0:f6}", metrics.PositiveRecall);

  Console.WriteLine("\nConfusion Matrix:\n{0}", metrics.ConfusionMatrix.GetFormattedConfusionTable());
}

Define the Model input

public class ModelInput {
  [ColumnName("BoxRatio"), LoadColumn(0)]
  public float BoxRatio { get; set; }
  [ColumnName("Thrust"), LoadColumn(1)]
  public float Thrust { get; set; }
  [ColumnName("Acceleration"), LoadColumn(2)]
  public float Acceleration { get; set; }
  [ColumnName("Velocity"), LoadColumn(3)]
  public float Velocity { get; set; }
  [ColumnName("OnBalRun"), LoadColumn(4)]
  public float OnBalRun { get; set; }
  [ColumnName("vwapGain"), LoadColumn(5)]
  public float VwapGain { get; set; }
  [ColumnName("Altitude"), LoadColumn(6)]
  public bool Altitude { get; set; }
}
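
The LoadColumn indices above follow the column order of the CSV files, which are loaded with hasHeader: true below. Assuming the header row matches the column names used here, the first line of each file would be:

BoxRatio,Thrust,Acceleration,Velocity,OnBalRun,vwapGain,Altitude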

Define the Model output

public class ModelOutput {
  [ColumnName("PredictedLabel")]
  public bool Prediction { get; set; }
  public float Score { get; set; }
}
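
These two classes are all ML.NET needs to score one example at a time. Here is a minimal sketch of how they pair up; trainedModel is a placeholder for any ITransformer (such as the best model found below), and the feature values are made up purely for illustration:

// Sketch: score a single in-memory example.
// "trainedModel" is a placeholder for an ITransformer, e.g. bestRun.Model.
var mlContext = new MLContext();
var predictionEngine = mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(trainedModel);
var sample = new ModelInput { BoxRatio = 1.0f, Thrust = 1.0f, Acceleration = 1.0f,
                              Velocity = 1.0f, OnBalRun = 1.0f, VwapGain = 1.0f };
ModelOutput output = predictionEngine.Predict(sample);
Console.WriteLine("Prediction: {0}, Score: {1:f4}", output.Prediction, output.Score);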

Routine to load the Bottle Rocket dataset and train the model

The program prints only the runs that improve the F1 score, and then summarizes the results.

  public static void Run() {
    var sw = Stopwatch.StartNew();
    var mlContext = new MLContext(seed: 1);

    // STEP 1: Load data
    var trainDataView = mlContext.Data.LoadFromTextFile<ModelInput>(
      path: @"H:\HedgeTools\Datasets\rocket-train-classify-smote.csv",
      hasHeader: true,
      separatorChar: ',');

    var testDataView = mlContext.Data.LoadFromTextFile<ModelInput>(
      path: @"H:\HedgeTools\Datasets\rocket-test-classify.csv",
      hasHeader: true,
      separatorChar: ',');

    var settings = new BinaryExperimentSettings {
      MaxExperimentTimeInSeconds = 30*60,
      OptimizingMetric = BinaryClassificationMetric.F1Score,
      CacheDirectory = null
    };
    // Restrict the AutoML search to the FastTree trainer only.
    settings.Trainers.Clear();
    settings.Trainers.Add(BinaryClassificationTrainer.FastTree);

    // STEP 2: Run AutoML experiment
    Console.WriteLine("\nRunning AutoML binary classification experiment:");
    ExperimentResult<BinaryClassificationMetrics> experimentResult = mlContext.Auto()
        .CreateBinaryClassificationExperiment(settings)
        .Execute(trainData: trainDataView,
                 labelColumnName: "Altitude",
                 progressHandler: new BinaryExperimentProgressHandler());

    // Step 3: Print metrics from the best model
    var bestRun = experimentResult.BestRun;
    sw.Stop();
    Console.WriteLine("Total time: {0}", sw.Elapsed);
    Console.WriteLine("Total models produced.... {0}", experimentResult.RunDetails.Count());
    Console.WriteLine("Best model's trainer..... {0}", bestRun.TrainerName);
    Console.WriteLine("Metrics of best model from validation data:");
    PrintMetrics(bestRun.ValidationMetrics);

    // Step 4: Evaluate test data
    IDataView testDataViewWithBestScore = bestRun.Model.Transform(testDataView);
    var testMetrics = mlContext.BinaryClassification.EvaluateNonCalibrated(data: testDataViewWithBestScore,
                                                                           labelColumnName: "Altitude");
    Console.WriteLine("\nMetrics of best model on test data:");
    PrintMetrics(testMetrics);

    // Cross-validate the best pipeline on the test set with 6 folds.
    var crossValidationResults = mlContext.BinaryClassification.CrossValidateNonCalibrated(testDataView,
                                                                                           bestRun.Estimator,
                                                                                           numberOfFolds: 6,
                                                                                           labelColumnName: "Altitude");
    // Summarize the per-fold accuracies: mean, sample standard deviation,
    // and a 95% confidence interval (1.96 * stddev / sqrt(n - 1)).
    var accuracyValues = crossValidationResults.Select(r => r.Metrics.Accuracy).ToArray();
    double accuracyAverage = accuracyValues.Average();
    double sumOfSquaresOfDifferences = accuracyValues.Sum(val => (val - accuracyAverage) * (val - accuracyAverage));
    double accuraciesStdDeviation = Math.Sqrt(sumOfSquaresOfDifferences / (accuracyValues.Length - 1));
    double accuraciesConfidenceInterval95 = 1.96 * accuraciesStdDeviation / Math.Sqrt(accuracyValues.Length - 1);
    Console.WriteLine("\nCross Validation Metrics:");
    Console.WriteLine("    Average Accuracy: {0:f4}, Standard deviation: {1:f4}, Confidence Interval 95%: {2:f4}",
                      accuracyAverage, accuraciesStdDeviation, accuraciesConfidenceInterval95);

    // Step 5: Save the model
    var mlModel = bestRun.Model;
    mlContext.Model.Save(mlModel, trainDataView.Schema, "MLModel.zip");
    Console.WriteLine("The model is saved.");
  }
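
Once MLModel.zip is on disk it can be reloaded and used for the millisecond-scale predictions mentioned in the motivation. A minimal sketch, assuming the file is in the working directory and reusing the hypothetical sample from the ModelOutput section:

// Sketch: reload the saved model and time one prediction.
var loadContext = new MLContext();
ITransformer loadedModel = loadContext.Model.Load("MLModel.zip", out DataViewSchema inputSchema);
var engine = loadContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(loadedModel);

var timer = Stopwatch.StartNew();
ModelOutput prediction = engine.Predict(sample);
timer.Stop();
Console.WriteLine("Predicted {0} in {1} ms", prediction.Prediction, timer.ElapsedMilliseconds);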

Now train the model

Run();
Running AutoML binary classification experiment:
     Trainer                               Accuracy    AUC       AUPRC   F1-score  Duration
   1 FastTreeBinary                         0.8480    0.9254    0.9291    0.8537    1.4517
   4 FastTreeBinary                         0.8743    0.9390    0.9418    0.8805    2.8786
  13 FastTreeBinary                         0.9380    0.9827    0.9828    0.9412    3.2096
  29 FastTreeBinary                         0.9410    0.9838    0.9811    0.9437    4.3452
  34 FastTreeBinary                         0.9416    0.9831    0.9801    0.9443    4.4156
  35 FastTreeBinary                         0.9422    0.9837    0.9817    0.9452    4.8248
  41 FastTreeBinary                         0.9428    0.9830    0.9818    0.9458    4.9269
  45 FastTreeBinary                         0.9440    0.9844    0.9852    0.9463    4.4346
  73 FastTreeBinary                         0.9434    0.9836    0.9814    0.9464    4.9082
  85 FastTreeBinary                         0.9446    0.9844    0.9838    0.9471    4.2477
 102 FastTreeBinary                         0.9446    0.9845    0.9855    0.9476    6.4925
 116 FastTreeBinary                         0.9476    0.9867    0.9879    0.9501    4.3367
 156 FastTreeBinary                         0.9482    0.9879    0.9864    0.9507    5.6110
 164 FastTreeBinary                         0.9493    0.9871    0.9887    0.9517    6.6294
 204 FastTreeBinary                         0.9511    0.9875    0.9881    0.9536    5.4857
 245 FastTreeBinary                         0.9523    0.9877    0.9860    0.9547    6.7041
Total time: 00:30:03.8017430
Total models produced.... 417
Best model's trainer..... FastTreeBinary
Metrics of best model from validation data:
  Accuracy........................ 0.952324
  AreaUnderPrecisionRecallCurve... 0.986018
  AreaUnderRocCurve............... 0.987704
  F1Score......................... 0.954700
  NegativePrecision............... 0.961783
  NegativeRecall.................. 0.937888
  PositivePrecision............... 0.944009
  PositiveRecall.................. 0.965636
Confusion Matrix:
TEST POSITIVE RATIO:	0.5203 (873.0/(873.0+805.0))
Confusion table
          ||======================
PREDICTED || positive | negative | Recall
TRUTH     ||======================
 positive ||      843 |       30 | 0.9656
 negative ||       50 |      755 | 0.9379
          ||======================
Precision ||   0.9440 |   0.9618 |


Metrics of best model on test data:
  Accuracy........................ 0.975520
  AreaUnderPrecisionRecallCurve... 0.985299
  AreaUnderRocCurve............... 0.998282
  F1Score......................... 0.886724
  NegativePrecision............... 0.999103
  NegativeRecall.................. 0.973776
  PositivePrecision............... 0.801762
  PositiveRecall.................. 0.991826
Confusion Matrix:
TEST POSITIVE RATIO:	0.0966 (367.0/(367.0+3432.0))
Confusion table
          ||======================
PREDICTED || positive | negative | Recall
TRUTH     ||======================
 positive ||      364 |        3 | 0.9918
 negative ||       90 |    3,342 | 0.9738
          ||======================
Precision ||   0.8018 |   0.9991 |


Cross Validation Metrics:
    Average Accuracy: 0.9101, Standard deviation: 0.0145, Confidence Interval 95%: 0.0127
The model is saved.

Summary

These performance metrics are sufficient to move this model into HedgeTools.
