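I am using the iris dataset from sklearn. I split the data, fit a Perceptron, and record the scores in a dictionary that maps the sample size used to fit the model (the key) to the corresponding scores (the training and test scores as a tuple).
Since I run the loop 3 times, this produces 3 dictionaries. How can I find the average of the scores across the 3 iterations? I tried storing the dictionaries in a list and averaging them, but that did not work.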
For example, if the dictionaries are:

    {21: (0.85, 0.82), 52: (0.80, 0.62), 73: (0.82, 0.45), 94: (0.81, 0.78)}
    {21: (0.95, 0.91), 52: (0.80, 0.89), 73: (0.84, 0.87), 94: (0.79, 0.41)}
    {21: (0.809, 0.83), 52: (0.841, 0.77), 73: (0.84, 0.44), 94: (0.79, 0.33)}

the output should be {21:(0.869,0.853),52.....}, where the first element of the value for key 21 is (0.85+0.95+0.809)/3 and the second element is (0.82+0.91+0.83)/3.

    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_iris
    from sklearn.linear_model import Perceptron
    from sklearn.model_selection import train_test_split

    score_list=shape_list=[]
    iris = load_iris()
    props=[0.2,0.5,0.7,0.9]
    df = pd.DataFrame(data= np.c_[iris['data'], iris['target']], columns= iris['feature_names'] + ['target'])
    y=df[list(df.loc[:,df.columns.values =='target'])]
    X=df[list(df.loc[:,df.columns.values !='target'])]

    # number of trials
    for i in range(3):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, train_size=0.7)
        results = {}
        for i in props:
            size = int(i*len(X_train))
            ix = np.random.choice(X_train.index, size=size, replace = False)
            sampleX = X_train.loc[ix]
            sampleY = y_train.loc[ix]
            # apply the model
            modelP = Perceptron(tol=1e-3)
            modelP.fit(sampleX, sampleY)
            train_score = modelP.score(sampleX,sampleY)
            test_score = modelP.score(X_test,y_test)
            # store in the dictionary
            results[size] = (train_score, test_score)
        print(results)

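Also, if anyone knows statistics: is there a way to compute the standard error across the trials and print the average standard error for each sample size (each key of the dictionary)?

Answer: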
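- Update the existing loop to save results to a list named rl.
- Load rl into a DataFrame, since you are already using pandas.
- Expand the columns of tuples into separate columns.
- Use .agg to get the metrics.
- Tested with python 3.8 and pandas 1.3.1.
  - f-strings (e.g. f'TrS{c}', f'TeS{c}') require python >= 3.6.
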
Updates to the existing code

    # select the columns for X and y
    y = df.loc[:, 'target']
    X = df.loc[:, iris['feature_names']]

    # number of trials
    rl = list()  # add: save the results to a list
    for i in range(3):
        ...
        results = {}
        for i in props:
            ...
            ...
        rl.append(results)  # add: append the results

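New code to get the metrics

- Converting metrics to a list of tuples is easier than converting it to a tuple of tuples, because a tuple is immutable once created. This means tuples can be added to an existing list, but not to an existing tuple.
  - Therefore, it is easier to use defaultdict to build a list of tuples, and then use map to convert each value to a tuple.
  - k[3:] requires that the number always starts at index 3 of the column name.
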
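To make the two points above concrete, here is a minimal sketch. The names frozen and key exist only for this illustration, and the numbers are the mean and std values for sample size 21 reported further down.

```python
from collections import defaultdict

# A list can grow after it is created, so tuples can be appended one at a time.
dd = defaultdict(list)
dd[21].append((0.667, 0.126))
dd[21].append((0.63, 0.068))

# A tuple has no append method, so it is built in one step from the finished list.
frozen = tuple(dd[21])

# k[3:] drops the first three characters, so it only works for column names
# with a three-character prefix such as 'TrS' or 'TeS'.
key = int('TrS21'[3:])
```

If the prefixes had different lengths, the slice would need to change accordingly.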

    from collections import defaultdict
    import pandas as pd

    # convert rl to a dataframe
    rl = [{21: (0.5714285714285714, 0.6888888888888889), 52: (0.6153846153846154, 0.7111111111111111), 73: (0.7123287671232876, 0.6222222222222222), 94: (0.7127659574468085, 0.6)},
          {21: (0.6190476190476191, 0.6444444444444445), 52: (0.6923076923076923, 0.6444444444444445), 73: (0.3698630136986301, 0.35555555555555557), 94: (0.7978723404255319, 0.7777777777777778)},
          {21: (0.8095238095238095, 0.5555555555555556), 52: (0.7307692307692307, 0.5555555555555556), 73: (0.7534246575342466, 0.5777777777777777), 94: (0.6170212765957447, 0.7555555555555555)}]
    df = pd.DataFrame(rl)

    # display(df)
    #    21                                        52                                        73                                         94
    # 0  (0.5714285714285714, 0.6888888888888889)  (0.6153846153846154, 0.7111111111111111)  (0.7123287671232876, 0.6222222222222222)   (0.7127659574468085, 0.6)
    # 1  (0.6190476190476191, 0.6444444444444445)  (0.6923076923076923, 0.6444444444444445)  (0.3698630136986301, 0.35555555555555557)  (0.7978723404255319, 0.7777777777777778)
    # 2  (0.8095238095238095, 0.5555555555555556)  (0.7307692307692307, 0.5555555555555556)  (0.7534246575342466, 0.5777777777777777)   (0.6170212765957447, 0.7555555555555555)

    # expand the tuples into separate train (TrS) and test (TeS) columns
    for c in df.columns:
        df[[f'TrS{c}', f'TeS{c}']] = pd.DataFrame(df[c].tolist(), index= df.index)
        df.drop(c, axis=1, inplace=True)

    # get the mean and std
    metrics = df.agg(['mean', 'std']).round(3)

    # display(metrics)
    #       TrS21  TeS21  TrS52  TeS52  TrS73  TeS73  TrS94  TeS94
    # mean  0.667  0.630  0.679  0.637  0.612  0.519  0.709  0.711
    # std   0.126  0.068  0.059  0.078  0.211  0.143  0.090  0.097

    # convert to a dict: {size: ((train mean, train std), (test mean, test std))}
    dd = defaultdict(list)
    for k, v in metrics.to_dict().items():
        dd[int(k[3:])].append(tuple(v.values()))

    dd = dict(zip(dd, map(tuple, dd.values())))
    print(dd)

    # [out]:
    # {21: ((0.667, 0.126), (0.63, 0.068)), 52: ((0.679, 0.059), (0.637, 0.078)), 73: ((0.612, 0.211), (0.519, 0.143)), 94: ((0.709, 0.09), (0.711, 0.097))}
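
The metrics above report the standard deviation, while the question also asks about the standard error. Assuming "standard error" means the standard error of the mean across trials, it is the sample standard deviation (ddof=1) divided by the square root of the number of trials. The sketch below is one possible way to compute it with NumPy, using the example dictionaries from the question; the names sizes, scores and summary are made up for this illustration.

```python
import numpy as np

# example dictionaries from the question, one per trial
rl = [
    {21: (0.85, 0.82), 52: (0.80, 0.62), 73: (0.82, 0.45), 94: (0.81, 0.78)},
    {21: (0.95, 0.91), 52: (0.80, 0.89), 73: (0.84, 0.87), 94: (0.79, 0.41)},
    {21: (0.809, 0.83), 52: (0.841, 0.77), 73: (0.84, 0.44), 94: (0.79, 0.33)},
]

sizes = sorted(rl[0])

# array of shape (trials, sizes, 2); the last axis is (train score, test score)
scores = np.array([[trial[s] for s in sizes] for trial in rl])

# average over the trials
mean = scores.mean(axis=0)

# standard error of the mean: sample std (ddof=1) / sqrt(number of trials)
sem = scores.std(axis=0, ddof=1) / np.sqrt(len(rl))

# {size: {'mean': (train, test), 'sem': (train, test)}}
summary = {
    s: {
        'mean': tuple(mean[i].round(3).tolist()),
        'sem': tuple(sem[i].round(3).tolist()),
    }
    for i, s in enumerate(sizes)
}
print(summary)
```

With only 3 trials the standard error is a rough estimate, so more trials would likely be needed before reading much into it.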