Please tell me what is going wrong and how to fix it.
data = open(r"C:\Users\HS\Desktop\WORK\R\R DATA\g textonly2.txt").read()
labels, texts = [], []
#print(data)
for i, line in enumerate(data.split("\n")):
    content = line.split()
    #print(content)
    if len(content) is not 0:
        labels.append(content[0])
        texts.append(content[1:])

# create a dataframe using texts and labels
trainDF = pandas.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels

# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'])

# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)

# create a count vectorizer object
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['text'])
The data file contains data like this:
0 #\xdaltimahora Es tracta d'un aparell de Germanwings amb 152 passatgers a bord
0 Route map now being shared by http:
0 Pray for #4U9525 http:
0 Airbus A320 #4U9525 crash: \nFlight tracking data here: \nhttp
Error:
Traceback:
"C:\Program Files\Python36\python.exe" "C:/Users/HS/PycharmProjects/R/C/Text classification1.py"
Using TensorFlow backend.
Traceback (most recent call last):
  File "C:/Users/HS/PycharmProjects/R/C/Text classification1.py", line 38, in <module>
    count_vect.fit(trainDF['text'])
  File "C:\Program Files\Python36\lib\site-packages\sklearn\feature_extraction\text.py", line 836, in fit
    self.fit_transform(raw_documents)
  File "C:\Program Files\Python36\lib\site-packages\sklearn\feature_extraction\text.py", line 869, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Program Files\Python36\lib\site-packages\sklearn\feature_extraction\text.py", line 792, in _count_vocab
    for feature in analyze(doc):
  File "C:\Program Files\Python36\lib\site-packages\sklearn\feature_extraction\text.py", line 266, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Program Files\Python36\lib\site-packages\sklearn\feature_extraction\text.py", line 232, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'

Process finished with exit code 1
Answer:
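According to the documentation:
fit(raw_documents, y=None)
    Learn a vocabulary dictionary of all tokens in the raw documents.
    Parameters: raw_documents : iterable
        An iterable which yields either str, unicode or file objects.
    Returns: self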
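You get the error AttributeError: 'list' object has no attribute 'lower' because you passed an iterable of list objects (here, a pd.Series whose elements are lists of tokens) rather than an iterable of strings. As the last frame of the traceback shows, the vectorizer calls .lower() on each document, and a list has no such method.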
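If you want to confirm this before calling fit, a quick check along the following lines may help. This is only a sketch: find_non_string_docs is a hypothetical helper name, not part of scikit-learn, and the sample documents are made up for illustration.

```python
def find_non_string_docs(docs):
    """Return (index, type name) for every document that is not a str."""
    return [
        (i, type(doc).__name__)
        for i, doc in enumerate(docs)
        if not isinstance(doc, str)
    ]


# Made-up sample: the first document is a list of tokens, the second a string.
docs = [["Pray", "for", "#4U9525"], "Route map now being shared"]

problems = find_non_string_docs(docs)
print(problems)  # [(0, 'list')]
```

Any entry in the result points at a document that would fail when the vectorizer tries to lowercase it.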
You can fix this by using texts.append(' '.join(content[1:])) instead of texts.append(content[1:]):
for i, line in enumerate(data.split("\n")):
    content = line.split()
    #print(content)
    if len(content) != 0:
        labels.append(content[0])
        #texts.append(content[1:])
        texts.append(' '.join(content[1:]))
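For a version you can run without the original file, here is a minimal sketch of the same loop wrapped in a function. The name parse_labeled_lines and the inline sample string are assumptions for illustration. The sample reuses two lines from the data shown in the question.

```python
def parse_labeled_lines(data):
    """Split each non-empty line into a label (first token) and a text string."""
    labels, texts = [], []
    for line in data.split("\n"):
        content = line.split()
        if content:  # skip blank lines
            labels.append(content[0])
            # Join the remaining tokens so each document is one str, not a list.
            texts.append(" ".join(content[1:]))
    return labels, texts


# Inline stand-in for the file contents (assumption: one "label text" pair per line).
sample = "0 Pray for #4U9525 http:\n\n0 Route map now being shared by http:\n"

labels, texts = parse_labeled_lines(sample)
print(labels)  # ['0', '0']
print(texts)   # ['Pray for #4U9525 http:', 'Route map now being shared by http:']
```

Every element of texts is now a plain string, which is what fit expects according to the documentation quoted above. Note that joining with a single space collapses any runs of whitespace in the original line.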