Hi everyone, I'm new to Python. Working from various references, I have built a text classification model using logistic regression. Here is the code.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
import string
import nltk
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Training data and the data to be labelled come from the same workbook.
Train = pd.read_excel("/Desktop/ML Based Text classification/test.xlsx")
real = pd.read_excel("/Desktop/ML Based Text classification/test.xlsx", sheet_name='Test')
Train_data = Train['description']
Test_data = real['description']

stop = stopwords.words('english')
porter = PorterStemmer()

def remove_stopwords(text):
    # Lowercase every word and drop English stopwords.
    text = [word.lower() for word in text.split() if word.lower() not in stop]
    return " ".join(text)

def stemmer(stem_text):
    # Reduce every word to its Porter stem.
    stem_text = [porter.stem(word) for word in stem_text.split()]
    return " ".join(stem_text)

def clean_data(data):
    # regex=True is required on recent pandas versions, where str.replace
    # treats the pattern as a literal string by default.
    text_clean = (data.str.replace(r'[^\w\s]', '', regex=True)
                      .str.replace(r'\d+', '', regex=True)
                      .apply(remove_stopwords)
                      .apply(stemmer)
                      .astype(str))
    return text_clean

Train_data = clean_data(Train_data)

# Map each tag to an integer id, most frequent tag first.
counter = Counter(Train['tags'].tolist())
top_10_varieties = {i[0]: idx for idx, i in enumerate(counter.most_common(50))}
Train['Mapping'] = Train['tags'].map(top_10_varieties)
#top_10_varieties = {'Outlook Related Issue': 0, 'Password Reset': 1, 'VPN Issue': 2}

tfidf_converter = TfidfVectorizer()
model_log = LogisticRegression()

X = Train_data
Y = Train['Mapping']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.95, random_state=0)

svc = Pipeline([('tfidf', TfidfVectorizer()),
                ('clf', LogisticRegression()),
                ])
svc.fit(X_train, y_train)

ytest = np.array(y_test)
y_pred = svc.predict(X_test)

# Predict on the unlabelled data; this overwrites the y_pred above.
Test_data = clean_data(Test_data)
y_pred = svc.predict(Test_data)
The code now runs without errors, and when I print "y_pred" I get this output:
array([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 2, 1, 2, 0, 2, 2, 2, 1, 0, 1, 1, 2, 1, 2, 0, 0, 2, 2, 1, 0, 0, 2, 0, 0, 0], dtype=int64)
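I'm not sure how to convert these numbers back into the strings they were mapped from and attach them to my original data. I would like output like this: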
Answer:
Try the following code:
reverse_top_10_varieties = {idx: i[0] for idx, i in enumerate(counter.most_common(50))}
[reverse_top_10_varieties[id] for id in y_pred]
See whether this solves your problem.
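To show the whole round trip in one place, here is a minimal sketch that builds the reverse mapping and pairs each predicted tag with its description. The sample tags, descriptions, and predictions are made up for illustration, and the names tag_to_id, id_to_tag, predicted_tags, and labelled are hypothetical; they are not in the original code. Inverting the forward dictionary, rather than calling most_common a second time, is just one way to guarantee that the two mappings agree.

```python
from collections import Counter

# Made-up stand-ins for Train['tags'], real['description'] and the model output.
tags = ["Outlook Related Issue"] * 3 + ["Password Reset"] * 2 + ["VPN Issue"]
descriptions = [
    "outlook not opening",
    "outlook keeps crashing",
    "forgot my password",
    "cannot connect to vpn",
    "need password reset",
]
y_pred = [0, 0, 1, 2, 1]  # in the question this is a NumPy array of int64

counter = Counter(tags)

# Forward mapping as in the question: tag -> integer id, most frequent tag first.
tag_to_id = {tag: idx for idx, (tag, _) in enumerate(counter.most_common(50))}

# Reverse mapping: integer id -> tag, obtained by inverting the forward one.
id_to_tag = {idx: tag for tag, idx in tag_to_id.items()}

# int() turns NumPy integers into plain ints before the dictionary lookup.
predicted_tags = [id_to_tag[int(i)] for i in y_pred]

# Pair every description with its predicted tag.
labelled = list(zip(descriptions, predicted_tags))
for description, tag in labelled:
    print(f"{description} -> {tag}")
```

With the DataFrame from the question, the same list could presumably be stored as a new column, for example real['predicted_tag'] = predicted_tags, as long as y_pred was produced from real['description'] in its original row order.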