I have a trained classifier that was working correctly.
I tried modifying it with a loop so that it processes multiple .csv files, but this broke it: the previously working code now returns the same error even on .csv files it handled without any problems before.
I'm stumped as to why this error has suddenly appeared when everything worked before. The original (working) code is as follows:
```python
# -*- coding: utf-8 -*-
import csv
import pandas
import numpy as np
import sklearn.ensemble as ske
import re
import os
import collections
import pickle
from sklearn.externals import joblib
from sklearn import model_selection, tree, linear_model, svm

# Load dataset
url = 'test_6_During_100.csv'
dataset = pandas.read_csv(url)
dataset.set_index('Name', inplace = True)
##dataset = dataset[['ProcessorAffinity','ProductVersion','Handle','Company',
##                   'UserProcessorTime','Path','Product','Description',]]

# Open a file to output everything to
new_url = re.sub(r'\.csv$', '', url)
f = open(new_url + " output report", 'w')
f.write(new_url + " output report\n")
f.write("\n")

# Shape
print(dataset.shape)
print("\n")
f.write("Dataset shape " + str(dataset.shape) + "\n")
f.write("\n")

clf = joblib.load(os.path.join(
        os.path.dirname(os.path.realpath(__file__)),
        'classifier/classifier.pkl'))

Class_0 = []
Class_1 = []
prob = []

for index, row in dataset.iterrows():
    res = clf.predict([row])
    if res == 0:
        if index in Class_0:  # was "malware" in the pasted code -- an undefined name
            Class_0.append(index)
        elif index in Class_1:
            Class_1.append(index)
        else:
            print "Is ", index, " recognised?"
            designation = raw_input()
            if designation == "No":
                Class_0.append(index)
            else:
                Class_1.append(index)

dataset['Type'] = 1
dataset.loc[dataset.index.str.contains('|'.join(Class_0)), 'Type'] = 0
print "\n"

results = []
results.append(collections.OrderedDict.fromkeys(dataset.index[dataset['Type'] == 0]))
print (results)

X = dataset.drop(['Type'], axis=1).values
Y = dataset['Type'].values

clf.set_params(n_estimators = len(clf.estimators_) + 40, warm_start = True)
clf.fit(X, Y)
joblib.dump(clf, 'classifier/classifier.pkl')

output = collections.Counter(Class_0)
print "Class_0; \n"
f.write ("Class_0; \n")
for key, value in output.items():
    f.write(str(key) + " ; " + str(value) + "\n")
    print(str(key) + " ; " + str(value))
print "\n"
f.write ("\n")

output_1 = collections.Counter(Class_1)
print "Class_1; \n"
f.write ("Class_1; \n")
for key, value in output_1.items():
    f.write(str(key) + " ; " + str(value) + "\n")
    print(str(key) + " ; " + str(value))
print "\n"

f.close()
```
My new code is identical to the original, but wrapped in a couple of nested loops so that the script keeps running while there are files in the folder left to process. The new code (the one producing the error) is below:
```python
# -*- coding: utf-8 -*-
import csv
import pandas
import numpy as np
import sklearn.ensemble as ske
import re
import os
import time
import collections
import pickle
from sklearn.externals import joblib
from sklearn import model_selection, tree, linear_model, svm

# Arrays we'll store the processing details in and print the data from later
Class_0 = []
Class_1 = []
prob = []
results = []

# Open a file to output our report to
timestr = time.strftime("%Y%m%d%H%M%S")
f = open(timestr + " output report.txt", 'w')
f.write(timestr + " output report\n")
f.write("\n")

count = len(os.listdir('.'))

while (count > 0):  # NB: count is never updated, so this loop runs until the script is killed
    # Load dataset
    for filename in os.listdir('.'):
        if filename.endswith('.csv') and filename.startswith("processes_"):
            url = filename
            dataset = pandas.read_csv(url)
            dataset.set_index('Name', inplace = True)

            clf = joblib.load(os.path.join(
                    os.path.dirname(os.path.realpath(__file__)),
                    'classifier/classifier.pkl'))

            for index, row in dataset.iterrows():
                res = clf.predict([row])
                if res == 0:
                    if index in Class_0:
                        Class_0.append(index)
                    elif index in Class_1:
                        Class_1.append(index)
                    else:
                        print "Is ", index, " recognised?"
                        designation = raw_input()
                        if designation == "No":
                            Class_0.append(index)
                        else:
                            Class_1.append(index)

            dataset['Type'] = 1
            dataset.loc[dataset.index.str.contains('|'.join(Class_0)), 'Type'] = 0
            print "\n"

            results.append(collections.OrderedDict.fromkeys(dataset.index[dataset['Type'] == 0]))
            print (results)

            X = dataset.drop(['Type'], axis=1).values
            Y = dataset['Type'].values

            clf.set_params(n_estimators = len(clf.estimators_) + 40, warm_start = True)
            clf.fit(X, Y)
            joblib.dump(clf, 'classifier/classifier.pkl')

            os.remove(filename)

output = collections.Counter(Class_0)
print "Class_0; \n"
f.write ("Class_0; \n")
for key, value in output.items():
    f.write(str(key) + " ; " + str(value) + "\n")
    print(str(key) + " ; " + str(value))
print "\n"
f.write ("\n")

output_1 = collections.Counter(Class_1)
print "Class_1; \n"
f.write ("Class_1; \n")
for key, value in output_1.items():
    f.write(str(key) + " ; " + str(value) + "\n")
    print(str(key) + " ; " + str(value))
print "\n"

f.close()
```
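As an aside on the outer loop's structure, a sketch of the same "keep watching the folder" pattern (my own layout, not the script above, and `pending_files`/`process` are illustrative names) could look like this:

```python
# Sketch of a folder-polling loop: glob for matching files, process each,
# then sleep briefly so an empty folder doesn't cause busy-waiting.
import glob
import os
import time

def pending_files(folder='.'):
    """Return the processes_*.csv files currently waiting in `folder`."""
    return sorted(glob.glob(os.path.join(folder, 'processes_*.csv')))

# while True:                      # run until interrupted
#     for filename in pending_files():
#         process(filename)        # load, predict, re-fit (as in the script)
#         os.remove(filename)
#     time.sleep(5)
```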
The error (`IndexError: index 1 is out of bounds for size 1`) refers to the prediction line `res = clf.predict([row])`. As far as I can tell, the problem is that there aren't enough classes / label types in the data (I'm aiming for a binary classifier)? But I had no issues at all using this approach before I added the nested loops.
https://codeshare.io/Gkpb44 – codeshare link containing the code above, together with the .csv data for the .csv file mentioned.
Answer:
So I've realised what the problem was.
I had created a setup in which the classifier is loaded and then re-fitted on the new data with warm_start, in an attempt to mimic incremental/online learning. This worked fine when the data contained examples of both classes. However, if the data contains only the positive class, then re-fitting the classifier breaks it.
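The failure mode can be reproduced in a few lines. The sketch below is my own minimal example (the data and parameter values are made up, not from the question's .csv files), assuming a scikit-learn `RandomForestClassifier`: after a `warm_start` re-fit on a batch containing only one class, `classes_` collapses to a single entry, so the original two-class trees no longer agree with the forest at prediction time, which is where an error like `IndexError: index 1 is out of bounds for size 1` surfaces.

```python
# Minimal reproduction sketch (illustrative data only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Initial fit on data that contains both classes.
X1 = np.array([[0.0], [0.1], [0.9], [1.0]])
y1 = np.array([0, 0, 1, 1])
clf = RandomForestClassifier(n_estimators=10)
clf.fit(X1, y1)
print(list(clf.classes_))     # two classes known

# warm_start re-fit on a batch containing ONLY class 1.
X2 = np.array([[0.8], [0.95]])
y2 = np.array([1, 1])
clf.set_params(n_estimators=20, warm_start=True)
clf.fit(X2, y2)
print(list(clf.classes_))     # classes_ has collapsed to a single class

# The 10 original trees still emit two-class outputs, so prediction
# fails once the forest tries to combine them with the new trees.
# (The exact exception type can vary between scikit-learn versions.)
try:
    clf.predict(X1)
except Exception as e:
    print(type(e).__name__)
```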
For now, I have commented out the following lines:

```python
clf.set_params(n_estimators = len(clf.estimators_) + 40, warm_start = True)
clf.fit(X, Y)
joblib.dump(clf, 'classifier/classifier.pkl')
```
which has fixed the issue. Going forward I may add (yet another!) conditional statement to decide whether or not the data should be re-fitted.
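That conditional could be as simple as checking that the freshly labelled batch contains more than one distinct label before re-fitting. A sketch, where `should_refit` is my own illustrative helper name:

```python
import numpy as np

def should_refit(Y):
    """Only re-fit when the labelled batch contains more than one class."""
    return len(np.unique(Y)) > 1

# Where the re-fit happens in the script, guard it like so:
# if should_refit(Y):
#     clf.set_params(n_estimators=len(clf.estimators_) + 40, warm_start=True)
#     clf.fit(X, Y)
#     joblib.dump(clf, 'classifier/classifier.pkl')

print(should_refit(np.array([1, 1, 1])))   # False: single-class batch, skip
print(should_refit(np.array([0, 1, 1])))   # True: both classes present
```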
I was going to delete this question, but since nothing I came across while searching covered this fact, I thought I would leave it up with the answer in case anyone runs into the same problem.