I have a CSV file (corpus.csv) containing graded abstracts (text), in the following format:
Institute, Score, Abstract
----------------------------------------------------------------------
UoM, 3.0, Hello, this is abstract one
UoM, 3.2, Hello, this is abstract two and yet counting.
UoE, 3.1, Hello, yet another abstract but this is a unique one.
UoE, 2.2, Hello, please no more abstract.
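I am trying to create a KNN classification program in Python that accepts a user-supplied abstract, for example "This is a new unique abstract", matches it against the corpus (CSV), and returns a predicted score/grade for that abstract. How can I achieve this?
I have the following code: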
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import numpy as np
import pandas as pd
from csv import reader,writer
import operator as op
import string

#Read data from corpus
r = reader(open('corpus.csv','r'))
abstract_list = []
score_list = []
institute_list = []
row_count = 0
for row in list(r)[1:]:
    institute,score,abstract = row
    if len(abstract.split()) > 0:
        institute_list.append(institute)
        score = float(score)
        score_list.append(score)
        abstract = abstract.translate(string.punctuation).lower()
        abstract_list.append(abstract)
        row_count = row_count + 1

print("Total processed data: ", row_count)

#Vectorize (TF-IDF, ngrams 1-4, no stop words) using sklearn -->
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,4),
                             min_df = 0, stop_words = 'english', sublinear_tf=True)
response = vectorizer.fit_transform(abstract_list)
feature_names = vectorizer.get_feature_names()
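In the code above, how can I use the features computed by TF-IDF for the KNN classification described earlier? (Possibly with the sklearn.neighbors.KNeighborsClassifier framework.)
P.S. The classes for this use case are the corresponding scores/grades of the abstracts.
I have a background in visual deep learning, but I lack knowledge of text classification, especially with KNN. Any help would be greatly appreciated. Thank you in advance.
Answer: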
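KNN is a classification algorithm, which means you must have a class attribute. KNN can take the TF-IDF output as its input matrix (TrainX), but you still need TrainY, the class of each row in your data. Because your scores are continuous numbers rather than discrete labels, you can use a KNN regressor instead. Use your scores as the class variable: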
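import string
from csv import reader

from sklearn import neighbors
from sklearn.feature_extraction.text import TfidfVectorizer

# Read data from corpus
abstract_list = []
score_list = []
institute_list = []
row_count = 0

# Translation table that deletes every punctuation character
# (passing string.punctuation directly to translate() strips nothing)
strip_punctuation = str.maketrans('', '', string.punctuation)

with open('corpus.csv', 'r', newline='') as f:
    rows = list(reader(f))[1:]  # skip the header row

for row in rows:
    if len(row) < 3:
        continue  # skip blank or malformed lines, such as a separator line
    # Unquoted abstracts may contain commas, so rejoin everything after the score
    institute, score, abstract = row[0], row[1], ','.join(row[2:])
    if len(abstract.split()) > 0:
        institute_list.append(institute)
        score_list.append(float(score))
        abstract = abstract.translate(strip_punctuation).lower()
        abstract_list.append(abstract)
        row_count = row_count + 1

print("Total processed data: ", row_count)

# Vectorize (TF-IDF, ngrams 1-4, no stop words) using sklearn
# min_df=1 because recent scikit-learn releases reject an integer min_df of 0
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 4),
                             min_df=1, stop_words='english', sublinear_tf=True)
response = vectorizer.fit_transform(abstract_list)
classes = score_list
# get_feature_names() was removed in newer scikit-learn; this is its replacement
feature_names = vectorizer.get_feature_names_out()

# Fit a 1-nearest-neighbour regressor on the TF-IDF matrix and the scores
clf = neighbors.KNeighborsRegressor(n_neighbors=1)
clf.fit(response, classes)
predictions = clf.predict(response)
print(predictions)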
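"predict" returns a predicted score for each instance (row) passed to it. In the code above it is called on the training matrix itself; to score a new abstract, convert the text with the same fitted vectorizer (vectorizer.transform) and pass the result to predict.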
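To score a user-supplied abstract as the question asks, the new text has to go through the vectorizer that was fitted on the corpus. The sketch below is one possible way to do that, not the only one. It assumes scikit-learn is installed and wraps both steps in a pipeline (`make_pipeline` is a choice made here, not something from the code above). The four sample rows from the question are hard-coded in place of reading corpus.csv, and the names `abstracts`, `scores`, `model`, and `new_abstract` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline

# Toy corpus: the sample rows from the question, hard-coded for illustration.
abstracts = [
    "Hello, this is abstract one",
    "Hello, this is abstract two and yet counting.",
    "Hello, yet another abstract but this is a unique one.",
    "Hello, please no more abstract.",
]
scores = [3.0, 3.2, 3.1, 2.2]

# The pipeline fits TF-IDF on the corpus, then reuses the same vocabulary
# and IDF weights whenever new text is passed to predict().
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 4), stop_words="english", sublinear_tf=True),
    KNeighborsRegressor(n_neighbors=1),
)
model.fit(abstracts, scores)

# Predict a score for an abstract that is not in the corpus.
new_abstract = "This is a new unique abstract"
predicted = model.predict([new_abstract])[0]
print("Predicted score:", predicted)
```

With `n_neighbors=1` the prediction is simply the score of the single closest abstract in the corpus; a larger value averages the scores of several neighbours. If the grades are discrete labels rather than continuous scores, `KNeighborsClassifier` accepts the same inputs.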