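I have two .csv files containing tweets and their classifications: pos, neg, and neutral. The class column holds the classification and the text column holds the tweet.
Here is my code: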
def prediction():
    print("Reading files...")
    # The model learns from this dataset.
    train = file2SentencesArray('twitter-sanders-apple3')
    # Test dataset.
    test = file2SentencesArray('twitter-sanders-apple2')
    print("Done!")

    print("Cleaning sentences...")
    # cleanSentences removes HTML, stop words, and some characters.
    cleanTrainSentences = cleanSentences(train["text"])
    cleanTestSentences = cleanSentences(test["text"])
    print("Done!...")

    print("Fitting sentences...")
    vectorizer = CountVectorizer(analyzer="word",
                                 tokenizer=None,
                                 preprocessor=None,
                                 stop_words=None,
                                 max_features=5000)
    trainDataFeatures = vectorizer.fit_transform(cleanTrainSentences)
    np.asarray(trainDataFeatures)
    testDataFeatures = vectorizer.transform(cleanTestSentences)
    np.asarray(testDataFeatures)

    # The error occurs here.
    randomized_lasso = RandomizedLasso()
    randomized_lasso.fit_transform(trainDataFeatures, testDataFeatures)
    trainDataFeatures = randomized_lasso.transform(trainDataFeatures)

    # And here.
    #pca = decomposition.PCA(n_components=2)
    #pca.fit_transform(trainDataFeatures)
    #trainDataFeatures = pca.transform(trainDataFeatures)
    print("Done!")

    print("Predicting...")
    forest = RandomForestClassifier(n_estimators=100)
    forest = forest.fit(trainDataFeatures, train["class"])
    result = forest.predict(testDataFeatures)
    print("Done...")
    return result
Both RandomizedLasso and PCA raise an exception:
PCA: PCA does not support sparse input.
RandomizedLasso: bad input shape
My trainDataFeatures looks like this:
(0, 573)    1
(0, 1411)   2
(0, 2748)   1
(0, 1073)   1
(1, 126)    1
(2, 1203)   1
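Answer:
PCA and RandomizedLasso are not receiving input in the format they expect. np.asarray does not turn a SciPy sparse matrix into a dense array, and the code discards its return value anyway, so both feature matrices stay sparse. Replace the following two lines and try again.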
np.asarray(trainDataFeatures)
np.asarray(testDataFeatures)

# Replace the two lines above with the following:
trainDataFeatures = trainDataFeatures.toarray()
testDataFeatures = testDataFeatures.toarray()
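
To make the fix concrete, here is a minimal, self-contained sketch of the dense conversion followed by PCA and the random forest. The toy sentences, labels, and variable names are hypothetical stand-ins for the asker's cleaned tweets, not the real dataset. RandomizedLasso is left out because it has been removed from recent scikit-learn releases. One further point, offered as a likely explanation rather than a confirmed diagnosis: an estimator's fit and fit_transform expect the target labels as the second argument, so passing testDataFeatures there may be what triggers "bad input shape".

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the cleaned tweets and their labels.
train_sentences = [
    "love the new phone",
    "hate the battery life",
    "the phone is okay",
    "love this great screen",
    "hate this terrible update",
    "the update is okay",
]
train_labels = ["pos", "neg", "neutral", "pos", "neg", "neutral"]
test_sentences = ["love the screen", "hate the phone"]

# CountVectorizer returns SciPy sparse matrices.
vectorizer = CountVectorizer(analyzer="word", max_features=5000)
train_sparse = vectorizer.fit_transform(train_sentences)
test_sparse = vectorizer.transform(test_sentences)

# toarray() returns a regular 2-D ndarray, which PCA accepts.
# The result must be assigned; the sparse matrix itself is not modified.
train_dense = train_sparse.toarray()
test_dense = test_sparse.toarray()

# Fit PCA on the training features only, then apply the same projection
# to both sets so the classifier sees matching columns.
pca = PCA(n_components=2)
train_reduced = pca.fit_transform(train_dense)
test_reduced = pca.transform(test_dense)

# Train on the reduced training features and predict the test set.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(train_reduced, train_labels)
result = forest.predict(test_reduced)
print(train_reduced.shape, test_reduced.shape, result)
```

Note that the test features go through the same PCA transform as the training features; in the original code only trainDataFeatures was reduced, which would leave the two matrices with different numbers of columns. Densifying a large vocabulary can also use a lot of memory. If that becomes a problem, sklearn.decomposition.TruncatedSVD accepts sparse input directly and may be a reasonable substitute for PCA.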