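I'm trying to implement a custom performance metric in Python. The goal is to find the probability threshold that gives the lowest value of metric A. I wrote the following code to compute the confusion matrix and the threshold.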
def confusion_matrix(self):
    """This method returns the confusion matrix for the given pair of Y and Y_Predicted"""
    #y,ypred
    y = self.df["y"]
    ypred = self.df["ypred"]
    self.setVariables()
    try:
        assert len(y) == len(ypred)
        for val in range(len(df["proba"])):
            print(val)
            if y[val] == 1 and ypred[val] == 1:
                self._truePositive +=1
            if y[val] == 1 and ypred[val] == 0:
                self._trueNegative +=1
            if y[val] == 0 and ypred[val] == 1:
                self._falsePositive +=1
            if y[val] == 0 and ypred[val] == 0:
                self._falseNegtive +=1
        for i in self._truePositive,self._trueNegative,self._falsePositive,self._falseNegtive:
            self._cnf_matrix.append(i)
        cnfMatrix = self._cnf_matrix.copy()
        return np.array(cnfMatrix).reshape(2,2)
    except AssertionError:
        print("Input error: y and ypred do not have the same length.")

def metricForLowestValues(self):
    """Compute the best probability threshold, i.e. the one giving the lowest value of metric A"""
    dict_metricA = {}
    for item in tqdm(self.df['proba']):
        if item != None:
            self.predict(item)
            cnf = self.confusion_matrix()
            # A = 500 × number of false negatives + 100 × number of false positives
            metricA = 500 * self._falseNegtive + 100* self._falsePositive
            dict_metricA[item] = metricA
            self.df.drop(columns=["ypred"],inplace=True)
    sorted_metricAList = sorted(dict_metricA.items(),key=lambda item:item[1])
    minKey = sorted_metricAList[0][0]
    minValue = dict_metricA[minKey]
    return minKey, minValue
However, when I run this code, I get the following KeyError while the confusion matrix is being computed.
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-164-38aae4fab9c1> in <module>
----> 1 performance3.metricForLowestValues()

<ipython-input-148-fe3aeec53878> in metricForLowestValues(self)
     91             if item != None:
     92                 self.predict(item)
---> 93                 cnf = self.confusion_matrix()
     94                 # A = 500 × number of false negatives + 100 × number of false positives
     95                 metricA = 500 * self._falseNegtive + 100* self._falsePositive

<ipython-input-148-fe3aeec53878> in confusion_matrix(self)
     30             for val in range(len(df["proba"])):
     31                 print(val)
---> 32                 if y[val] == 1 and ypred[val] == 1:
     33                     self._truePositive +=1
     34                 if y[val] == 1 and ypred[val] == 0:

~/Anaconda/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in __getitem__(self, key)
    869         key = com.apply_if_callable(key, self)
    870         try:
--> 871             result = self.index.get_value(self, key)
    872
    873         if not is_scalar(result):

~/Anaconda/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
   4403         k = self._convert_scalar_indexer(k, kind="getitem")
   4404         try:
-> 4405             return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
   4406         except KeyError as e1:
   4407             if len(self) > 0 and (self.holds_integer() or self.is_boolean()):

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 2852
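The dataset I'm using has shape (2852, 2). In theory the iteration should run from 0 to 2851, since there are 2852 rows in total. I suspect the error comes from an extra row being generated, but I don't know how to fix it. I tried filtering out None values in metricForLowestValues, but that didn't help.
Am I doing something wrong? Any insight would be appreciated.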
Answer:
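The error is probably because you are looping over the length of df['proba'] (a df from outside the class) rather than self.df['proba']. If that df has more rows than self.df, val eventually reaches a label that does not exist in y, and y[val] raises the KeyError. It may be simpler to loop over len(y), since you know that is the right length. It would also help if you could post the output of df.tail().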
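As a rough sketch of the idea above, the counts can be computed by position on NumPy arrays, so the pandas index labels never come into play. This is an illustration under assumptions, not the asker's class: confusion_counts, best_threshold and the toy DataFrame are hypothetical names, and the rule "predict 1 when proba >= threshold" is assumed because self.predict is not shown in the question.

```python
import numpy as np
import pandas as pd


def confusion_counts(y, ypred):
    """Return (TP, TN, FP, FN), comparing the two sequences by position."""
    y = np.asarray(y)
    ypred = np.asarray(ypred)
    if len(y) != len(ypred):
        raise ValueError("y and ypred must have the same length")
    tp = int(np.sum((y == 1) & (ypred == 1)))
    tn = int(np.sum((y == 0) & (ypred == 0)))
    fp = int(np.sum((y == 0) & (ypred == 1)))
    fn = int(np.sum((y == 1) & (ypred == 0)))
    return tp, tn, fp, fn


def best_threshold(frame):
    """Try each distinct probability as a threshold and keep the lowest A."""
    y = frame["y"].to_numpy()
    proba = frame["proba"].to_numpy()
    best = None
    for threshold in np.unique(proba):
        # Assumed prediction rule; the question's self.predict may differ.
        ypred = (proba >= threshold).astype(int)
        _, _, fp, fn = confusion_counts(y, ypred)
        # A = 500 * false negatives + 100 * false positives (from the question)
        metric_a = 500 * fn + 100 * fp
        if best is None or metric_a < best[1]:
            best = (float(threshold), metric_a)
    return best


# Toy data with a gap in the index, as can happen after rows are filtered out.
toy = pd.DataFrame(
    {"y": [1, 0, 1, 0], "proba": [0.9, 0.4, 0.6, 0.2]},
    index=[0, 1, 2, 5],
)
result = best_threshold(toy)
print(result)
```

With an integer index, y[val] on a Series looks up the label val rather than the position, so any mismatch between the loop range and the index raises a KeyError. Converting to arrays with to_numpy(), or using .iloc, avoids that dependency on the index.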