I trained a model with Naive Bayes and its accuracy is high, but now I want to feed it a single sentence and see its sentiment. Here is my code:
# Data analysis
import pandas as pd

# Data preprocessing and feature engineering
from textblob import TextBlob
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# Model selection and validation
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import joblib
import warnings
import mlflow

warnings.filterwarnings("ignore")

train_tweets = pd.read_csv('data/train.csv')
tweets = train_tweets.tweet.values
labels = train_tweets.label.values

processed_features = []
for sentence in range(0, len(tweets)):
    # Remove all special characters
    processed_feature = re.sub(r'\W', ' ', str(tweets[sentence]))
    # Remove all single characters
    processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)
    # Remove single characters from the start
    processed_feature = re.sub(r'\^[a-zA-Z]\s+', ' ', processed_feature)
    # Replace multiple spaces with a single space
    processed_feature = re.sub(r'\s+', ' ', processed_feature, flags=re.I)
    # Remove the 'b' prefix
    processed_feature = re.sub(r'^b\s+', '', processed_feature)
    # Convert to lowercase
    processed_feature = processed_feature.lower()
    processed_features.append(processed_feature)

vectorizer = TfidfVectorizer(max_features=2500, min_df=7, max_df=0.8, stop_words=stopwords.words('english'))
processed_features = vectorizer.fit_transform(processed_features).toarray()

X_train, X_test, y_train, y_test = train_test_split(processed_features, labels, test_size=0.2, random_state=0)

text_classifier = MultinomialNB()
text_classifier.fit(X_train, y_train)
predictions = text_classifier.predict(X_test)

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))

joblib.dump(text_classifier, 'model.pkl')
As you can see, I saved my model. Now I want to input a sentence like this:
new_sentence = "I am very happy today"
model.predict(new_sentence)
And I want to see output like this:
sentence = "I am very happy today"
sentiment = Positive
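How can I do this?
Answer:
First, put the preprocessing steps into a function, so the exact same cleaning can be applied to the training tweets and to any new sentence: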
def preproc(tweets):
    processed_features = []
    for tweet in tweets:
        # Remove all special characters
        processed_feature = re.sub(r'\W', ' ', str(tweet))
        # Remove all single characters
        processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)
        # Remove single characters from the start
        # (pattern kept exactly as in the question so the features match the trained model)
        processed_feature = re.sub(r'\^[a-zA-Z]\s+', ' ', processed_feature)
        # Replace multiple spaces with a single space
        processed_feature = re.sub(r'\s+', ' ', processed_feature, flags=re.I)
        # Remove the 'b' prefix
        processed_feature = re.sub(r'^b\s+', '', processed_feature)
        # Convert to lowercase
        processed_feature = processed_feature.lower()
        processed_features.append(processed_feature)
    return processed_features

processed_features = preproc(tweets)
vectorizer = TfidfVectorizer(max_features=2500, min_df=7, max_df=0.8, stop_words=stopwords.words('english'))
processed_features = vectorizer.fit_transform(processed_features).toarray()
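Then use it to preprocess the test strings, and call transform (not fit_transform) on the already fitted vectorizer before passing them to the classifier: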
# Feed in two single-sentence tweets (a flat list of strings):
test = preproc(["I hate this book.", "I love this movie."])
predictions = text_classifier.predict(vectorizer.transform(test).toarray())
print(predictions)
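If the prediction runs in a separate script, the fitted vectorizer is needed there as well as the classifier, because transform relies on the vocabulary and IDF weights learned during training. The following is a minimal sketch of that idea, not your actual pipeline: the toy corpus, its labels, and the file names are made up for illustration, and the 1 = positive, 0 = negative encoding is an assumption.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus invented for this sketch; assumed encoding: 1 = positive, 0 = negative.
train_texts = [
    "i love this movie",
    "what a great day",
    "i hate this book",
    "this is a terrible film",
]
train_labels = [1, 1, 0, 0]

# fit_transform learns the vocabulary and IDF weights from the training text.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
classifier = MultinomialNB().fit(X_train, train_labels)

# Save BOTH objects, then load them back as a later script would.
# The file names are hypothetical; a temporary directory keeps the sketch tidy.
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pkl")
    vectorizer_path = os.path.join(tmp, "vectorizer.pkl")
    joblib.dump(classifier, model_path)
    joblib.dump(vectorizer, vectorizer_path)
    loaded_classifier = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

# transform (not fit_transform) reuses the learned vocabulary, so the new
# sentence gets the same number of feature columns as the training data.
X_new = loaded_vectorizer.transform(["i love this book"])
prediction = loaded_classifier.predict(X_new)
print(X_new.shape[1] == X_train.shape[1], prediction)
```

Calling fit_transform on the new sentence instead would build a fresh vocabulary from that one sentence, and the classifier would reject the resulting feature matrix because its width no longer matches.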
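The output depends on which labels your dataset contains and how train_tweets.label.values is encoded; you can then map those values to strings. For example, if the labels in the dataset are encoded as 1 = positive and 0 = negative, you might get [0, 1].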
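To get the sentence/sentiment output asked for in the question, the numeric predictions can be mapped to names with a small lookup. This is a sketch only: LABEL_NAMES and to_sentiment are hypothetical helpers, the 0/1 encoding is an assumption that should be checked against the dataset, and the hard-coded predictions list stands in for the result of text_classifier.predict(...).

```python
# Assumed encoding; check how the labels in train.csv are actually defined.
LABEL_NAMES = {0: "negative", 1: "positive"}


def to_sentiment(predictions, label_names=LABEL_NAMES):
    """Map numeric class labels to readable sentiment names."""
    # int(...) also accepts NumPy integers, which is what predict() returns.
    return [label_names[int(p)] for p in predictions]


sentences = ["I hate this book.", "I love this movie."]
predictions = [0, 1]  # stand-in for text_classifier.predict(...)

for sentence, sentiment in zip(sentences, to_sentiment(predictions)):
    print(f'sentence = "{sentence}" sentiment = {sentiment}')
```

If the dataset uses a different encoding (for example, 1 marking an offensive tweet rather than a positive one), only the dictionary needs to change.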