Text classification using Python

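Hi everyone, I'm new to the Python programming language. Based on various references, I have built a text classification model using logistic regression. The code is below.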

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
import string
import nltk
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Training data (default sheet) and the data to be labelled (sheet 'Test')
Train = pd.read_excel("/Desktop/ML Based Text classification/test.xlsx")
real = pd.read_excel("/Desktop/ML Based Text classification/test.xlsx", sheet_name = 'Test')
Train_data = Train['description']
Test_data = real['description']

stop = stopwords.words('english')
porter = PorterStemmer()

def remove_stopwords(text):
    text = [word.lower() for word in text.split() if word.lower() not in stop]
    return " ".join(text)

def stemmer(stem_text):
    stem_text = [porter.stem(word) for word in stem_text.split()]
    return " ".join(stem_text)

def clean_data(data):
    # regex=True is passed explicitly: pandas 2.0+ treats the pattern as a literal string by default
    text_clean = (data.str.replace(r'[^\w\s]', '', regex=True)
                  .str.replace(r'\d+', '', regex=True)
                  .apply(remove_stopwords)
                  .apply(stemmer)
                  .astype(str))
    return (text_clean)

Train_data = clean_data(Train_data)

# Map each tag to an integer id, ordered by frequency (most common tag -> 0)
counter = Counter(Train['tags'].tolist())
top_10_varieties = {i[0]: idx for idx, i in enumerate(counter.most_common(50))}
Train['Mapping'] = Train['tags'].map(top_10_varieties)
#top_10_varieties = {'Outlook Related Issue': 0, 'Password Reset': 1, 'VPN Issue': 2}

tfidf_converter = TfidfVectorizer()
model_log = LogisticRegression()
X = Train_data
Y = Train['Mapping']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.95, random_state = 0)

svc = Pipeline([('tfidf', TfidfVectorizer()),
                ('clf', LogisticRegression()),
                ])
svc.fit(X_train, y_train)
ytest = np.array(y_test)
y_pred = svc.predict(X_test)

# Predict on the cleaned 'Test' sheet; this overwrites the y_pred above
Test_data = clean_data(Test_data)
y_pred = svc.predict(Test_data)

The code now runs without errors, and when I print "y_pred" I get this output:

array([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 2, 1, 2, 0, 2, 2, 2, 1, 0, 1,
       1, 2, 1, 2, 0, 0, 2, 2, 1, 0, 0, 2, 0, 0, 0], dtype=int64)

I'm not sure how to convert these numbers back into the mapped strings and tag them onto my original data. I want output like this:

[Image: screenshot of the desired output, not preserved]


Answer:

Try the following code:

reverse_top_10_varieties = {idx: i[0] for idx, i in enumerate(counter.most_common(50))}
[reverse_top_10_varieties[label] for label in y_pred]

See whether this solves your problem.
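To make the reverse lookup concrete, here is a minimal sketch that runs without the Excel file. The sample tags, predictions, and descriptions are made up for illustration, and the names `train_tags`, `test_descriptions`, `tag_to_id`, `id_to_tag`, and `predicted_tags` are hypothetical stand-ins rather than names from the question. It assumes the ids were built with `counter.most_common(50)` exactly as in the question.

```python
from collections import Counter

# Hypothetical stand-ins for Train['tags'], y_pred and real['description'].
train_tags = [
    "Outlook Related Issue", "Outlook Related Issue", "Outlook Related Issue",
    "Password Reset", "Password Reset",
    "VPN Issue",
]
y_pred = [0, 0, 1, 2, 1]
test_descriptions = [
    "outlook not opening",
    "mail stuck in outbox",
    "forgot my password",
    "vpn keeps disconnecting",
    "account locked out",
]

# Same construction as in the question: the most frequent tag gets id 0, and so on.
counter = Counter(train_tags)
tag_to_id = {tag: idx for idx, (tag, _) in enumerate(counter.most_common(50))}

# Invert the mapping so each integer id points back to its tag string.
id_to_tag = {idx: tag for tag, idx in tag_to_id.items()}

# Translate every predicted id into its tag.
predicted_tags = [id_to_tag[label_id] for label_id in y_pred]

# Pair each description with its predicted tag.
for description, tag in zip(test_descriptions, predicted_tags):
    print(f"{description} -> {tag}")
```

With the DataFrame from the question, the resulting list could presumably be attached as a new column, for example `real['predicted_tag'] = predicted_tags` (the column name is an assumption), since the predictions are in the same row order as `real['description']`.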
