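I'm trying to accomplish a somewhat unusual task. I need to do the following without sklearn, preferably with numpy:
- Given a dataset, split the data into 5 equal "folds", or partitions
- Within each partition, split the data into "training" and "testing" sets with an 80/20 ratio
- Here is the catch: the dataset is labeled by class. Take, for example, a dataset of 100 instances with 33 samples of class A and 67 samples of class B. I should create 5 folds of 20 instances each, where each fold contains roughly 6 or 7 class A values (1/3) and class B makes up the rest
My problem is this: although I can split the data, I don't know how to correctly return the test and training sets for each fold and, more importantly, I don't know how to split so that each class keeps its share of elements.
My current code is below, with comments at the points where I'm stuck: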
    import numpy

    def csv_to_array(file):
        # Open the file and load the comma-separated values
        data = open(file, 'r')
        data = numpy.loadtxt(data, delimiter=',')
        # Iterate over the data in the array
        for index in range(len(data)):
            # Use try/except to convert to float; if conversion fails, set to 0
            try:
                data[index] = [float(x) for x in data[index]]
            except Exception:
                data[index] = 0
            except ValueError:
                data[index] = 0
        # Return the now type-formatted data
        return data

    def five_cross_fold_validation(dataset):
        # print("DATASET", dataset)
        numpy.random.shuffle(dataset)
        num_rows = dataset.shape[0]
        split_mark = int(num_rows / 5)
        folds = []
        temp1 = dataset[:split_mark]
        # print("TEMP1", temp1)
        temp2 = dataset[split_mark:split_mark*2]
        # print("TEMP2", temp2)
        temp3 = dataset[split_mark*2:split_mark*3]
        # print("TEMP3", temp3)
        temp4 = dataset[split_mark*3:split_mark*4]
        # print("TEMP4", temp4)
        temp5 = dataset[split_mark*4:]
        # print("TEMP5", temp5)
        folds.append(temp1)
        folds.append(temp2)
        folds.append(temp3)
        folds.append(temp4)
        folds.append(temp5)
        # folds = numpy.asarray(folds)
        for fold in folds:
            # fold = numpy.asarray(fold)
            num_rows = fold.shape[0]
            split_mark = int(num_rows * .8)
            fold_training = fold[split_mark:]
            fold_testing = fold[:split_mark]
            print(type(fold))
            # fold.tolist()
            list(fold)
            print(type(fold))
            del fold[0:len(fold)]
            fold.append(fold_training)
            fold.append(fold_testing)
            fold = numpy.asarray(fold)
            # Somehow, return a test and a training set within each fold
        # print(folds)
        return folds

    def confirm_size(folds):
        total = 0
        for fold in folds:
            curr = len(fold)
            total = total + curr
        return total

    def main():
        print("Starting CFV")
        ecoli = csv_to_array('Classification/ecoli.csv')
        print(len(ecoli))
        folds = five_cross_fold_validation(ecoli)
        size = confirm_size(folds)
        print(size)

    main()
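Also, for reference, I've attached the csv file I'm working with (a modification of the UCI Ecoli dataset). The classes are the values in the last column: 0, 1, 2, 3, 4. Note that the number of samples per class is not equal.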
    0.61,0.45,0.48,0.5,0.48,0.35,0.41,0
    0.17,0.38,0.48,0.5,0.45,0.42,0.5,0
    0.44,0.35,0.48,0.5,0.55,0.55,0.61,0
    0.43,0.4,0.48,0.5,0.39,0.28,0.39,0
    0.42,0.35,0.48,0.5,0.58,0.15,0.27,0
    0.23,0.33,0.48,0.5,0.43,0.33,0.43,0
    0.37,0.52,0.48,0.5,0.42,0.42,0.36,0
    0.29,0.3,0.48,0.5,0.45,0.03,0.17,0
    0.22,0.36,0.48,0.5,0.35,0.39,0.47,0
    0.23,0.58,0.48,0.5,0.37,0.53,0.59,0
    0.47,0.47,0.48,0.5,0.22,0.16,0.26,0
    0.54,0.47,0.48,0.5,0.28,0.33,0.42,0
    0.51,0.37,0.48,0.5,0.35,0.36,0.45,0
    0.4,0.35,0.48,0.5,0.45,0.33,0.42,0
    0.44,0.34,0.48,0.5,0.3,0.33,0.43,0
    0.44,0.49,0.48,0.5,0.39,0.38,0.4,0
    0.43,0.32,0.48,0.5,0.33,0.45,0.52,0
    0.49,0.43,0.48,0.5,0.49,0.3,0.4,0
    0.47,0.28,0.48,0.5,0.56,0.2,0.25,0
    0.32,0.33,0.48,0.5,0.6,0.06,0.2,0
    0.34,0.35,0.48,0.5,0.51,0.49,0.56,0
    0.35,0.34,0.48,0.5,0.46,0.3,0.27,0
    0.38,0.3,0.48,0.5,0.43,0.29,0.39,0
    0.38,0.44,0.48,0.5,0.43,0.2,0.31,0
    0.41,0.51,0.48,0.5,0.58,0.2,0.31,0
    0.34,0.42,0.48,0.5,0.41,0.34,0.43,0
    0.51,0.49,0.48,0.5,0.53,0.14,0.26,0
    0.25,0.51,0.48,0.5,0.37,0.42,0.5,0
    0.29,0.28,0.48,0.5,0.5,0.42,0.5,0
    0.25,0.26,0.48,0.5,0.39,0.32,0.42,0
    0.24,0.41,0.48,0.5,0.49,0.23,0.34,0
    0.17,0.39,0.48,0.5,0.53,0.3,0.39,0
    0.04,0.31,0.48,0.5,0.41,0.29,0.39,0
    0.61,0.36,0.48,0.5,0.49,0.35,0.44,0
    0.34,0.51,0.48,0.5,0.44,0.37,0.46,0
    0.28,0.33,0.48,0.5,0.45,0.22,0.33,0
    0.4,0.46,0.48,0.5,0.42,0.35,0.44,0
    0.23,0.34,0.48,0.5,0.43,0.26,0.37,0
    0.37,0.44,0.48,0.5,0.42,0.39,0.47,0
    0,0.38,0.48,0.5,0.42,0.48,0.55,0
    0.39,0.31,0.48,0.5,0.38,0.34,0.43,0
    0.3,0.44,0.48,0.5,0.49,0.22,0.33,0
    0.27,0.3,0.48,0.5,0.71,0.28,0.39,0
    0.17,0.52,0.48,0.5,0.49,0.37,0.46,0
    0.36,0.42,0.48,0.5,0.53,0.32,0.41,0
    0.3,0.37,0.48,0.5,0.43,0.18,0.3,0
    0.26,0.4,0.48,0.5,0.36,0.26,0.37,0
    0.4,0.41,0.48,0.5,0.55,0.22,0.33,0
    0.22,0.34,0.48,0.5,0.42,0.29,0.39,0
    0.44,0.35,0.48,0.5,0.44,0.52,0.59,0
    0.27,0.42,0.48,0.5,0.37,0.38,0.43,0
    0.16,0.43,0.48,0.5,0.54,0.27,0.37,0
    0.06,0.61,0.48,0.5,0.49,0.92,0.37,1
    0.44,0.52,0.48,0.5,0.43,0.47,0.54,1
    0.63,0.47,0.48,0.5,0.51,0.82,0.84,1
    0.23,0.48,0.48,0.5,0.59,0.88,0.89,1
    0.34,0.49,0.48,0.5,0.58,0.85,0.8,1
    0.43,0.4,0.48,0.5,0.58,0.75,0.78,1
    0.46,0.61,0.48,0.5,0.48,0.86,0.87,1
    0.27,0.35,0.48,0.5,0.51,0.77,0.79,1
Answer:
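Edit: I replaced np.random.shuffle(A) with np.random.permutation(A). The only difference is that permutation does not modify the input array. That makes no difference in this code, but it is safer in general.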
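To make the difference concrete, here is a minimal sketch on a small made-up array (the array `a` and the variable names are illustrative only, not part of the original code). It shows that `np.random.permutation` returns a shuffled copy, while `np.random.shuffle` reorders the rows in place and returns `None`.

```python
import numpy as np

# Small illustrative array: 5 rows, 2 columns
a = np.arange(10).reshape(5, 2)
original = a.copy()

# permutation returns a new array with the rows in random order
shuffled = np.random.permutation(a)
untouched = np.array_equal(a, original)  # `a` itself has not changed

# shuffle reorders the rows of `a` in place and returns None
returned = np.random.shuffle(a)
```

Both calls permute along the first axis only, so each row stays intact.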
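The idea is to randomly sample the input using numpy.random.permutation. Once the rows are shuffled, we only need to iterate over all possible test sets (here, a sliding window of 20% of the input size). The corresponding training set is simply made up of all the remaining elements.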
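The sliding window can be sketched on row indices alone. The helper name `sliding_splits` is hypothetical, and the sketch assumes the input has enough rows for the window to contain at least one (otherwise the step of `range` would be zero).

```python
import numpy as np

def sliding_splits(n_rows, test_p=0.2):
    """Hypothetical helper: (train_idx, test_idx) pairs over shuffled row indices."""
    order = np.random.permutation(n_rows)   # shuffled row indices
    window = int(test_p * n_rows)           # test window size, assumed >= 1
    splits = []
    for start in range(0, n_rows, window):
        end = start + window
        test_idx = order[start:end]
        # Training indices are everything outside the current window
        train_idx = np.concatenate([order[:start], order[end:]])
        splits.append((train_idx, test_idx))
    return splits

splits = sliding_splits(10)  # 10 rows, window of 2 rows
```

When the row count is not a multiple of the window size, the stepped range produces an extra, shorter final window. For example, 169 rows with a 33-row window give six windows, the last holding only 4 rows.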
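Because the input has been shuffled, every subset keeps approximately the original class distribution, even though we pick the subsets in order. The match is only approximate: a plain shuffle does not guarantee exact per-class counts in each subset.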
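If per-class counts need to be as even as possible across folds, as in the question's 33/67 example, one option is to shuffle within each class and then deal the indices out in turn. The following is a sketch of that idea, not the approach taken in this answer; `stratified_folds` and its parameters are hypothetical names.

```python
import numpy as np

def stratified_folds(labels, k=5, seed=0):
    """Hypothetical helper: k index arrays with near-equal class counts."""
    rng = np.random.default_rng(seed)
    # Shuffle the row indices of each class separately, then lay the
    # classes end to end so every class forms one contiguous block
    order = np.concatenate([
        rng.permutation(np.flatnonzero(labels == cls))
        for cls in np.unique(labels)
    ])
    # Taking every k-th index deals each block out evenly, so per-class
    # counts (and fold sizes) differ by at most one between folds
    return [order[i::k] for i in range(k)]

# The example from the question: 33 rows of class 0, 67 rows of class 1
labels = np.array([0] * 33 + [1] * 67)
folds = stratified_folds(labels)

for i, test_idx in enumerate(folds):
    # One fold is the test set, the other four form the training set (80/20)
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
```

With a real dataset, `labels` would be the last column (for instance `A[:, -1]`), and `A[train_idx]` and `A[test_idx]` would select the rows.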
The following code iterates over the test/train combinations:
    import numpy as np

    def csv_to_array(file):
        with open(file, 'r') as f:
            data = np.loadtxt(f, delimiter=',')
        return data

    def classes_distribution(A):
        """Print the class distribution of array A."""
        nb_classes = np.unique(A[:,-1]).shape[0]
        total_size = A.shape[0]
        for i in range(nb_classes):
            class_size = sum(row[-1] == i for row in A)
            class_p = class_size/total_size
            print(f"\t P(class_{i}) = {class_p:.3f}")

    def random_samples(A, test_set_p=0.2):
        """Split input array A into two uniformly chosen random sets: test/train.
        Repeat until every row has been drawn as part of a test set at least once."""
        A = np.random.permutation(A)
        sample_size = int(test_set_p*A.shape[0])
        for start in range(0, A.shape[0], sample_size):
            end = start + sample_size
            yield {
                "test": A[start:end,],
                "train": np.append(A[:start,], A[end:,], 0)
            }

    def main():
        ecoli = csv_to_array('ecoli.csv')
        print("Input set shape: ", ecoli.shape)
        print("Input set class distribution:")
        classes_distribution(ecoli)
        print("Training sets class distribution:")
        for iteration in random_samples(ecoli):
            test_set = iteration["test"]
            training_set = iteration["train"]
            classes_distribution(training_set)
            print("---")
            # ... do whatever you want with these two sets

    main()
It produces output of the following form:
    Input set shape: (169, 8)
    Input set class distribution:
         P(class_0) = 0.308
         P(class_1) = 0.213
         P(class_2) = 0.207
         P(class_3) = 0.118
         P(class_4) = 0.154
    Training sets class distribution:
         P(class_0) = 0.316
         P(class_1) = 0.206
         P(class_2) = 0.199
         P(class_3) = 0.118
         P(class_4) = 0.162
    ...
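As a side note, the per-class proportions can also be computed without a Python loop over rows. This is a small sketch using `np.unique` with `return_counts=True`; the function name `class_proportions` and the toy array are made up for illustration, and the class label is assumed to sit in the last column.

```python
import numpy as np

def class_proportions(A):
    """Hypothetical helper: map each class label to its share of the rows of A."""
    # Labels are assumed to be in the last column
    labels, counts = np.unique(A[:, -1], return_counts=True)
    shares = counts / counts.sum()
    return dict(zip(labels.tolist(), shares.tolist()))

# Toy input: two rows of class 0 and two rows of class 1
toy = np.array([[0.1, 0], [0.2, 0], [0.3, 1], [0.4, 1]])
proportions = class_proportions(toy)
```

Unlike a loop over `range(nb_classes)`, this does not assume the labels are the consecutive integers 0, 1, 2, and so on.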