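I am using the question pairs dataset from Kaggle together with SdcaLogisticRegression. The ML.NET version is 14.0.
My machine is configured as follows:
- Operating system: Microsoft Windows 10 Pro
- System type: x64-based PC
- Memory: 32.0 GB
- CPU: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz, 2208 MHz, 6 cores, 12 logical processors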
Program.cs:
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using Microsoft.ML;
    using Microsoft.ML.Data;
    using static Microsoft.ML.DataOperationsCatalog;
    using Microsoft.ML.Trainers;
    using Microsoft.ML.Transforms.Text;

    namespace Csharp_machieneLearning
    {
        class Program
        {
            private static IDataView TransData;

            public static void Evaluate(MLContext mlContext, ITransformer model, IDataView splitTestSet)
            {
                Console.WriteLine("=============== Evaluating Model accuracy with Test data===============");
                IDataView predictions = model.Transform(splitTestSet);
                CalibratedBinaryClassificationMetrics metrics = mlContext.BinaryClassification.Evaluate(predictions, "is_duplicate");
                Console.WriteLine();
                Console.WriteLine("Model quality metrics evaluation");
                Console.WriteLine("--------------------------------");
                Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
                Console.WriteLine($"Auc: {metrics.AreaUnderRocCurve:P2}");
                Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
                Console.WriteLine("=============== End of model evaluation ===============");
            }

            static void Main(string[] args)
            {
                MLContext mlContext = new MLContext();

                Console.WriteLine($"=============== Loading Dataset ===============");
                IDataView file = mlContext.Data.LoadFromTextFile<QuestionPairs>(
                    @"C:\Users\ludwi\source\repos\Csharp_machieneLearning\questions.csv",
                    separatorChar: ',',
                    hasHeader: true);
                Console.WriteLine($"=============== Finished Loading Dataset ===============");

                IEstimator<ITransformer> pipeline = mlContext.Transforms.Conversion.ConvertType("is_duplicate", outputKind: DataKind.Boolean)
                    //.Append(mlContext.Transforms.Conversion.MapValueToKey(inputColumnName: "is_duplicate", outputColumnName: "Label"))
                    .Append(mlContext.Transforms.Text.FeaturizeText(inputColumnName: "question1", outputColumnName: "question1Featurized"))
                    .Append(mlContext.Transforms.Text.FeaturizeText(inputColumnName: "question2", outputColumnName: "question2Featurized"))
                    .Append(mlContext.Transforms.Concatenate("Features", "question1Featurized", "question2Featurized"))
                    .Append(mlContext.Transforms.NormalizeMinMax("Features"));

                IEstimator<ITransformer> estimator = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(labelColumnName: "is_duplicate", featureColumnName: "Features");

                var transData = pipeline.Fit(file).Transform(file);
                var data = mlContext.Data.TrainTestSplit(transData, testFraction: 0.25);
                var model = estimator.Fit(data.TrainSet);
                Evaluate(mlContext, model, data.TestSet);
            }
        }
    }
QuestionPairs.cs:
    using System;
    using System.Collections.Generic;
    using System.Dynamic;
    using System.Text;
    using Microsoft.ML.Data;

    namespace Csharp_machieneLearning
    {
        public class QuestionPairs
        {
            [LoadColumn(3)]
            public string question1 { get; set; }

            [LoadColumn(4)]
            public string question2 { get; set; }

            [LoadColumn(5)]
            public string is_duplicate { get; set; }
        }

        public class QuestionPrediction : QuestionPairs
        {
            [ColumnName("PredictedLabel")]
            public bool Prediction { get; set; }

            public float Probability { get; set; }

            public float Score { get; set; }
        }
    }
Answer:
I thought the problem might be in ConvertType("is_duplicate", outputKind: DataKind.Boolean), so I created a custom transformer:
    Action<QuestionPairs, transformOutput> mapping = (input, output) =>
    {
        output.Label = input.is_duplicate.Equals("1") ? true : false;
    };

    IEstimator<ITransformer> pipeline = mlContext.Transforms.CustomMapping(mapping, contractName: null)
        .Append(mlContext.Transforms.Text.FeaturizeText(inputColumnName: "question1", outputColumnName: "question1Featurized"))
        .Append(mlContext.Transforms.Text.FeaturizeText(inputColumnName: "question2", outputColumnName: "question2Featurized"))
        .Append(mlContext.Transforms.Concatenate("Features", "question1Featurized", "question2Featurized"))
        //.Append(mlContext.Transforms.NormalizeMinMax("Features"))
        //.AppendCacheCheckpoint(mlContext)
        .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(labelColumnName: nameof(customTransform.Label), featureColumnName: "Features"));
That did not seem to help, so I started to wonder whether the program was loading the dataset correctly.
To check, I added a Preview() call:
    // Preview the first 10 rows of the loaded IDataView ("file" from Program.cs).
    var preview = file.Preview(10);
    foreach (var row in preview.RowView)
    {
        foreach (var column in row.Values)
        {
            Console.WriteLine(column);
        }
        Console.WriteLine("=============================================================");
    }
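Output:
As you can see, the "is_duplicate" column sometimes contains a string that should have been part of a feature. This is caused by the "," characters that appear inside the question sentences.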
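To illustrate why a comma inside a question shifts the columns, here is a minimal sketch that uses only the standard library. The sample row is made up, and the six-column layout (id, qid1, qid2, question1, question2, is_duplicate) is an assumption inferred from the LoadColumn indices; NaiveSplit is a hypothetical helper that only approximates a loader that ignores quotes.

```csharp
using System;

public static class QuotingDemo
{
    // Hypothetical helper: splits on every comma and treats quotes as ordinary characters.
    public static string[] NaiveSplit(string line) => line.Split(',');

    public static void Main()
    {
        // Made-up row; question1 contains a comma inside its quotes.
        string line = "0,1,2,\"What is ML.NET, exactly?\",\"What is ML.NET?\",0";

        string[] fields = NaiveSplit(line);

        // The quoted question1 is cut in two, so there are 7 fields instead of 6.
        Console.WriteLine(fields.Length);   // 7

        // Index 5 should be is_duplicate, but it now holds the question2 text.
        Console.WriteLine(fields[5]);       // "What is ML.NET?" (with the quotes)
    }
}
```

Every field after the split question moves one position to the right, which matches the symptom in the preview: text that belongs to a question ends up in the label column.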
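After a quick search, I found the allowQuoting: true parameter of LoadFromTextFile(). With it set, the preview looked as expected, and after running the full code the results were as expected too.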
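A minimal sketch of the loading step with quoting enabled, assuming the same QuestionPairs class as in the question. The relative path "questions.csv" is a placeholder, not the original path, and the preview loop is only there to check the columns by eye.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

public class QuestionPairs
{
    [LoadColumn(3)] public string question1 { get; set; }
    [LoadColumn(4)] public string question2 { get; set; }
    [LoadColumn(5)] public string is_duplicate { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext();

        // Assumption: placeholder path; point it at your copy of the dataset.
        string path = "questions.csv";

        // allowQuoting: true makes the loader treat a quoted field as one value,
        // so commas inside a question no longer start a new column.
        IDataView file = mlContext.Data.LoadFromTextFile<QuestionPairs>(
            path,
            separatorChar: ',',
            hasHeader: true,
            allowQuoting: true);

        // Print the first rows to confirm is_duplicate only holds label values.
        var preview = file.Preview(maxRows: 10);
        foreach (var row in preview.RowView)
        {
            foreach (var column in row.Values)
            {
                Console.WriteLine($"{column.Key}: {column.Value}");
            }
            Console.WriteLine(new string('=', 60));
        }
    }
}
```

The rest of the pipeline can stay unchanged; only the loader call differs from the version in the question.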