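Text classification with Python

Hi everyone, I'm new to the Python programming language. Working from various references, I have built a text classification model using logistic regression. Here is the code.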

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
import string
import nltk
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Labelled training data (default sheet) and the unlabelled data to predict on ('Test' sheet)
Train = pd.read_excel("/Desktop/ML Based Text classification/test.xlsx")
real = pd.read_excel("/Desktop/ML Based Text classification/test.xlsx", sheet_name='Test')

Train_data = Train['description']
Test_data = real['description']

stop = stopwords.words('english')
porter = PorterStemmer()

def remove_stopwords(text):
    # Lowercase each word and drop English stopwords
    text = [word.lower() for word in text.split() if word.lower() not in stop]
    return " ".join(text)

def stemmer(stem_text):
    # Reduce each word to its Porter stem
    stem_text = [porter.stem(word) for word in stem_text.split()]
    return " ".join(stem_text)

def clean_data(data):
    # regex=True is required in recent pandas versions, where str.replace
    # treats the pattern as a literal string by default
    text_clean = (data.str.replace(r'[^\w\s]', '', regex=True)
                  .str.replace(r'\d+', '', regex=True)
                  .apply(remove_stopwords)
                  .apply(stemmer)
                  .astype(str))
    return text_clean

Train_data = clean_data(Train_data)

# Map each tag to an integer id, most frequent tag first (up to 50 tags)
counter = Counter(Train['tags'].tolist())
top_10_varieties = {i[0]: idx for idx, i in enumerate(counter.most_common(50))}
Train['Mapping'] = Train['tags'].map(top_10_varieties)
#top_10_varieties = {'Outlook Related Issue': 0, 'Password Reset': 1, 'VPN Issue': 2}

tfidf_converter = TfidfVectorizer()
model_log = LogisticRegression()

X = Train_data
Y = Train['Mapping']

# Note: test_size=0.95 holds out 95% of the rows and trains on only 5%
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.95, random_state=0)

svc = Pipeline([('tfidf', TfidfVectorizer()),
                ('clf', LogisticRegression()),
                ])
svc.fit(X_train, y_train)

ytest = np.array(y_test)
y_pred = svc.predict(X_test)

# Predict on the unlabelled data; this overwrites the y_pred from the held-out split
Test_data = clean_data(Test_data)
y_pred = svc.predict(Test_data)

The code now runs without errors. When I print y_pred, I get the following output:

array([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 2, 1, 2, 0, 2, 2, 2, 1, 0, 1,
       1, 2, 1, 2, 0, 0, 2, 2, 1, 0, 0, 2, 0, 0, 0], dtype=int64)

I'm not sure how to convert these numbers back into the mapped tag strings and attach them to my original data. I would like output like this:

Answer:

Please try the following code:

# Invert the tag -> id mapping, then look up each predicted id
reverse_top_10_varieties = {idx: i[0] for idx, i in enumerate(counter.most_common(50))}
[reverse_top_10_varieties[label_id] for label_id in y_pred]

See whether this solves your problem.
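To also attach the decoded tags to the original rows, something like the following might work. This is only a minimal sketch: the two small DataFrames, the hard-coded y_pred array, and the names tag_to_id, id_to_tag and predicted_tag are hypothetical stand-ins for the asker's Train, real and svc.predict(Test_data), not part of the original code.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the asker's Train and real DataFrames.
train = pd.DataFrame({
    "description": ["outlook crash", "reset my password", "outlook frozen",
                    "vpn down", "forgot password", "outlook error"],
    "tags": ["Outlook Related Issue", "Password Reset", "Outlook Related Issue",
             "VPN Issue", "Password Reset", "Outlook Related Issue"],
})
real = pd.DataFrame({
    "description": ["cannot open outlook", "vpn keeps dropping", "need a new password"],
})

# Same construction as in the question: the most frequent tag gets id 0.
counter = Counter(train["tags"].tolist())
tag_to_id = {tag: idx for idx, (tag, _) in enumerate(counter.most_common(50))}

# Invert the forward mapping so both directions are guaranteed to agree.
id_to_tag = {idx: tag for tag, idx in tag_to_id.items()}

# Assumed stand-in for svc.predict(Test_data): one predicted id per row of `real`.
y_pred = np.array([0, 2, 1])

# Decode the ids and store them as a new column, aligned on the row index of `real`.
real["predicted_tag"] = pd.Series(y_pred, index=real.index).map(id_to_tag)

print(real)
```

Inverting the dictionary that was used for training, rather than rebuilding it from the counter, keeps the two mappings in sync if the tag list changes. Note that any tag outside the 50 most common ones has no id in this scheme, so its rows would get NaN in the Mapping column of the question's code.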
