ML.NET accuracy is always 100%

I am using the Question Pairs dataset from Kaggle together with SdcaLogisticRegression. The ML.NET version is 14.0.

My machine:

  1. OS: Microsoft Windows 10 Pro
  2. System type: x64-based PC
  3. RAM: 32.0 GB
  4. CPU: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz, 2208 MHz, 6 cores, 12 logical processors

Program.cs:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using Microsoft.ML;
    using Microsoft.ML.Data;
    using static Microsoft.ML.DataOperationsCatalog;
    using Microsoft.ML.Trainers;
    using Microsoft.ML.Transforms.Text;

    namespace Csharp_machieneLearning
    {
        class Program
        {
            private static IDataView TransData;

            public static void Evaluate(MLContext mlContext, ITransformer model, IDataView splitTestSet)
            {
                Console.WriteLine("=============== Evaluating Model accuracy with Test data===============");
                IDataView predictions = model.Transform(splitTestSet);
                CalibratedBinaryClassificationMetrics metrics =
                    mlContext.BinaryClassification.Evaluate(predictions, "is_duplicate");
                Console.WriteLine();
                Console.WriteLine("Model quality metrics evaluation");
                Console.WriteLine("--------------------------------");
                Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
                Console.WriteLine($"Auc: {metrics.AreaUnderRocCurve:P2}");
                Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
                Console.WriteLine("=============== End of model evaluation ===============");
            }

            static void Main(string[] args)
            {
                MLContext mlContext = new MLContext();

                Console.WriteLine($"=============== Loading Dataset  ===============");
                IDataView file = mlContext.Data.LoadFromTextFile<QuestionPairs>(
                    @"C:\Users\ludwi\source\repos\Csharp_machieneLearning\questions.csv",
                    separatorChar: ',', hasHeader: true);
                Console.WriteLine($"=============== Finished Loading Dataset  ===============");

                IEstimator<ITransformer> pipeline = mlContext.Transforms.Conversion.ConvertType("is_duplicate", outputKind: DataKind.Boolean)
                    //.Append(mlContext.Transforms.Conversion.MapValueToKey(inputColumnName: "is_duplicate", outputColumnName: "Label"))
                    .Append(mlContext.Transforms.Text.FeaturizeText(inputColumnName: "question1", outputColumnName: "question1Featurized"))
                    .Append(mlContext.Transforms.Text.FeaturizeText(inputColumnName: "question2", outputColumnName: "question2Featurized"))
                    .Append(mlContext.Transforms.Concatenate("Features", "question1Featurized", "question2Featurized"))
                    .Append(mlContext.Transforms.NormalizeMinMax("Features"));

                IEstimator<ITransformer> estimator = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
                    labelColumnName: "is_duplicate", featureColumnName: "Features");

                var transData = pipeline.Fit(file).Transform(file);
                var data = mlContext.Data.TrainTestSplit(transData, testFraction: 0.25);
                var model = estimator.Fit(data.TrainSet);
                Evaluate(mlContext, model, data.TestSet);
            }
        }
    }

QuestionPairs.cs:

    using System;
    using System.Collections.Generic;
    using System.Dynamic;
    using System.Text;
    using Microsoft.ML.Data;

    namespace Csharp_machieneLearning
    {
        public class QuestionPairs
        {
            [LoadColumn(3)]
            public string question1 { get; set; }

            [LoadColumn(4)]
            public string question2 { get; set; }

            [LoadColumn(5)]
            public string is_duplicate { get; set; }
        }

        public class QuestionPrediction : QuestionPairs
        {
            [ColumnName("PredictedLabel")]
            public bool Prediction { get; set; }

            public float Probability { get; set; }

            public float Score { get; set; }
        }
    }

Output: (screenshot not preserved)


Answer:

I suspected the problem was in ConvertType("is_duplicate", outputKind: DataKind.Boolean), so I replaced it with a custom mapping:

    // transformOutput is the custom output class (not shown in this post); it has a bool Label.
    Action<QuestionPairs, transformOutput> mapping = (input, output) =>
    {
        output.Label = input.is_duplicate.Equals("1");
    };

    IEstimator<ITransformer> pipeline = mlContext.Transforms.CustomMapping(mapping, contractName: null)
        .Append(mlContext.Transforms.Text.FeaturizeText(inputColumnName: "question1", outputColumnName: "question1Featurized"))
        .Append(mlContext.Transforms.Text.FeaturizeText(inputColumnName: "question2", outputColumnName: "question2Featurized"))
        .Append(mlContext.Transforms.Concatenate("Features", "question1Featurized", "question2Featurized"))
        //.Append(mlContext.Transforms.NormalizeMinMax("Features"))
        //.AppendCacheCheckpoint(mlContext)
        .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
            labelColumnName: nameof(customTransform.Label), featureColumnName: "Features"));

That did not help, so I started to wonder whether the program was loading the dataset correctly.

To check, I added a call to Preview().

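    // Preview the first 10 rows of the loaded data (file is the IDataView from LoadFromTextFile).
    var preview = file.Preview(10);
    foreach (var row in preview.RowView)
    {
        // Each entry is a column name / value pair.
        foreach (var column in row.Values)
        {
            Console.WriteLine(column);
        }
        Console.WriteLine("=============================================================");
    }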

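Output: (screenshot not preserved)

As you can see, the is_duplicate column sometimes contains a string that should have been part of a feature. This happens because some of the question sentences contain a "," character, which is also the column separator.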
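To make the column shift concrete, here is a minimal sketch in plain C# (no ML.NET involved). The sample row is made up, and its layout (id, qid1, qid2, question1, question2, is_duplicate) is an assumption inferred from the LoadColumn(3), LoadColumn(4) and LoadColumn(5) indices in the question. SplitNaive and SplitQuoted are hypothetical helpers written for this illustration; they are not how ML.NET parses files internally.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvQuotingDemo
{
    // Naive split: every comma starts a new field, quotes are ignored.
    public static string[] SplitNaive(string line) => line.Split(',');

    // Quote-aware split: commas inside double quotes stay in the field.
    // A doubled quote ("") inside a quoted field is read as one literal quote.
    public static string[] SplitQuoted(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // escaped quote
                    i++;
                }
                else if (c == '"')
                {
                    inQuotes = false;    // closing quote
                }
                else
                {
                    current.Append(c);
                }
            }
            else if (c == '"')
            {
                inQuotes = true;         // opening quote
            }
            else if (c == ',')
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString());
        return fields.ToArray();
    }

    public static void Main()
    {
        // Made-up row; question1 contains a comma inside quotes.
        string row = "0,1,2,\"What is ML, exactly?\",\"What is machine learning?\",0";

        // Column 5 is where the question's class expects is_duplicate.
        Console.WriteLine("naive  column 5: " + SplitNaive(row)[5]);
        Console.WriteLine("quoted column 5: " + SplitQuoted(row)[5]);
    }
}
```

With the naive split, the extra comma pushes every later field one position to the right, so column 5 holds the text of question2 instead of the label. The quote-aware split keeps six fields, and column 5 is the label again.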

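After a quick search I found the allowQuoting parameter of LoadFromTextFile(). With allowQuoting: true, the preview shows the columns as expected (screenshot not preserved), and after running the full code the results are as expected too.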
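For reference, a sketch of what the corrected load call could look like. It assumes the Microsoft.ML NuGet package and the QuestionPairs class from the question; the file path "questions.csv" is a placeholder, and the Preview loop is only there to check the first rows by eye.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

public class QuestionPairs
{
    [LoadColumn(3)]
    public string question1 { get; set; }

    [LoadColumn(4)]
    public string question2 { get; set; }

    [LoadColumn(5)]
    public string is_duplicate { get; set; }
}

public static class LoadWithQuoting
{
    public static void Main(string[] args)
    {
        var mlContext = new MLContext();

        // allowQuoting: true makes the loader treat a double-quoted field
        // as one value, even when it contains the separator character.
        IDataView file = mlContext.Data.LoadFromTextFile<QuestionPairs>(
            path: "questions.csv",   // placeholder path
            separatorChar: ',',
            hasHeader: true,
            allowQuoting: true);

        // Inspect the first rows to confirm is_duplicate no longer
        // contains fragments of the question text.
        var preview = file.Preview(maxRows: 10);
        foreach (var row in preview.RowView)
        {
            foreach (var column in row.Values)
            {
                Console.WriteLine($"{column.Key}: {column.Value}");
            }
            Console.WriteLine("----");
        }
    }
}
```

The rest of the pipeline from the question can stay as it is; only the load call changes.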
