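I have a data frame with 4 columns, shown below. Each row holds the classification or regression result for a particular dataset under a particular parameter setting. I also have a second data frame containing the gold-standard result for each dataset (Kappa and Accuracy for classification, R-squared and RMSE for regression). I want to produce a new data frame that keeps the existing columns and adds two new columns showing the error in each of the two metrics.
That is, for every row of the first (sample) data frame, I want the difference between Metric_1 in the gold-standard data frame and Metric_1 in the sample data frame, and likewise for Metric_2. The new columns can be named Error_1 and Error_2. Each row of the sample data frame should be matched to the gold-standard row for the same dataset.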
Sample data frame:

    Dataset, Metric_1, Metric_2, ML_Type
    ccp, 11.8076142844202, 0.628949889120101, regression
    pageblocks, 0.968940316686967, 0.84426843805383, classification
    onp, 0.65282098713529, 0.305364681866831, classification
    pageblocks, 0.961023142509135, 0.795966628677049, classification
    concrete, 10.4831489351907, 0.62767229736877, regression
    onp, 0.650802993357437, 0.301621021444335, classification
    concrete, 10.8875688078687, 0.599691053769861, regression
    ccp, 4.60154386445267, 0.927419750011992, regression

Gold-standard data frame:

    Dataset, Metric_1, Metric_2, ML_Type
    ccp, 4.52997493965786, 0.929612792495658, regression
    pageblocks, 0.971376370280146, 0.853898273639253, classification
    onp, 0.66476078365425, 0.329343309931143, classification
    concrete, 9.98998588557546, 0.598660395228019, regression

Answer:
If you just want the error for each model type, the following code will work:

    library(dplyr)

    # Sample results: one row per dataset and parameter setting
    df <- tribble(
      ~Dataset,     ~Metric_1,         ~Metric_2,         ~ML_Type,
      "ccp",        11.8076142844202,  0.628949889120101, "regression",
      "pageblocks", 0.968940316686967, 0.84426843805383,  "classification",
      "onp",        0.65282098713529,  0.305364681866831, "classification",
      "pageblocks", 0.961023142509135, 0.795966628677049, "classification",
      "concrete",   10.4831489351907,  0.62767229736877,  "regression",
      "onp",        0.650802993357437, 0.301621021444335, "classification",
      "concrete",   10.8875688078687,  0.599691053769861, "regression",
      "ccp",        4.60154386445267,  0.927419750011992, "regression"
    )

    # Gold-standard results: one row per dataset
    gold <- tribble(
      ~Dataset,     ~Metric_1,         ~Metric_2,         ~ML_Type,
      "ccp",        4.52997493965786,  0.929612792495658, "regression",
      "pageblocks", 0.971376370280146, 0.853898273639253, "classification",
      "onp",        0.66476078365425,  0.329343309931143, "classification",
      "concrete",   9.98998588557546,  0.598660395228019, "regression"
    )

    # Suffix every gold column except the key, join onto the sample rows
    # by Dataset, then compute each error as sample minus gold
    err <- gold %>%
      rename_with(~paste0(., "_gold"), .cols = -Dataset) %>%
      right_join(df, by = "Dataset") %>%
      mutate(
        Metric_1_err = Metric_1 - Metric_1_gold,
        Metric_2_err = Metric_2 - Metric_2_gold
      )

    # Drop the helper gold columns from the result
    select(err, -ends_with("gold"))

    # A tibble: 8 x 6
      Dataset    Metric_1 Metric_2 ML_Type        Metric_1_err Metric_2_err
      <chr>         <dbl>    <dbl> <chr>                 <dbl>        <dbl>
    1 ccp          11.8      0.629 regression          7.28        -0.301
    2 ccp           4.60     0.927 regression          0.0716      -0.00219
    3 pageblocks    0.969    0.844 classification     -0.00244     -0.00963
    4 pageblocks    0.961    0.796 classification     -0.0104      -0.0579
    5 onp           0.653    0.305 classification     -0.0119      -0.0240
    6 onp           0.651    0.302 classification     -0.0140      -0.0277
    7 concrete     10.5      0.628 regression          0.493        0.0290
    8 concrete     10.9      0.600 regression          0.898        0.00103
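
If you would rather not depend on dplyr, the same join-then-subtract idea can be sketched in base R with `merge()`. This is only an illustration, not part of the answer above: the names `sample_df`, `gold_df`, `merged`, and `result` are hypothetical, and the data is a three-row subset of the question's values. The error keeps the same sign convention as the answer (sample minus gold); swap the operands if you want gold minus sample.

```r
# Illustrative subset of the question's sample data (hypothetical object names)
sample_df <- data.frame(
  Dataset  = c("ccp", "pageblocks", "ccp"),
  Metric_1 = c(11.8076142844202, 0.968940316686967, 4.60154386445267),
  Metric_2 = c(0.628949889120101, 0.84426843805383, 0.927419750011992)
)

# Matching gold-standard rows, one per dataset
gold_df <- data.frame(
  Dataset  = c("ccp", "pageblocks"),
  Metric_1 = c(4.52997493965786, 0.971376370280146),
  Metric_2 = c(0.929612792495658, 0.853898273639253)
)

# all.x = TRUE keeps every sample row; the suffixes leave the sample
# columns unchanged and append "_gold" to the gold-standard columns
merged <- merge(sample_df, gold_df, by = "Dataset",
                all.x = TRUE, suffixes = c("", "_gold"))

# Error = sample value minus gold-standard value
merged$Metric_1_err <- merged$Metric_1 - merged$Metric_1_gold
merged$Metric_2_err <- merged$Metric_2 - merged$Metric_2_gold

# Drop the helper gold columns
result <- merged[, !grepl("_gold$", names(merged))]
print(result)
```

Both versions match on `Dataset` alone, so they assume the gold-standard data frame has exactly one row per dataset. A sample row whose dataset is missing from the gold standard would get `NA` in both error columns.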