How do I run sentiment analysis on a new sentence with a trained model?

I trained a model with Naive Bayes and its accuracy is high, but now I want to input a sentence and see its sentiment. Here is my code:

    # Data analysis
    import pandas as pd

    # Data preprocessing and feature engineering
    from textblob import TextBlob
    import re
    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Model selection and validation
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
    import joblib
    import warnings
    import mlflow

    warnings.filterwarnings("ignore")

    train_tweets = pd.read_csv('data/train.csv')
    tweets = train_tweets.tweet.values
    labels = train_tweets.label.values

    processed_features = []
    for sentence in range(0, len(tweets)):
        # Remove all special characters
        processed_feature = re.sub(r'\W', ' ', str(tweets[sentence]))
        # Remove all single characters
        processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)
        # Remove a single character from the start
        # (unescaped ^ anchors at the start; \^ would match a literal caret)
        processed_feature = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature)
        # Replace multiple spaces with a single space
        processed_feature = re.sub(r'\s+', ' ', processed_feature, flags=re.I)
        # Remove the 'b' prefix
        processed_feature = re.sub(r'^b\s+', '', processed_feature)
        # Convert to lowercase
        processed_feature = processed_feature.lower()
        processed_features.append(processed_feature)

    vectorizer = TfidfVectorizer(max_features=2500, min_df=7, max_df=0.8, stop_words=stopwords.words('english'))
    processed_features = vectorizer.fit_transform(processed_features).toarray()

    X_train, X_test, y_train, y_test = train_test_split(processed_features, labels, test_size=0.2, random_state=0)

    text_classifier = MultinomialNB()
    text_classifier.fit(X_train, y_train)

    predictions = text_classifier.predict(X_test)
    print(confusion_matrix(y_test, predictions))
    print(classification_report(y_test, predictions))
    print(accuracy_score(y_test, predictions))

    joblib.dump(text_classifier, 'model.pkl')

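As you can see, I saved my model. Now I want to input a sentence like this:

    new_sentence = "I am very happy today"
    model.predict(new_sentence)

And I want to see output like this:

    sentence = "I am very happy today"
    sentiment = positive

How can I do this?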


Answer:

First, put the preprocessing steps in a function:

    import re

    def preproc(tweets):
        """Clean a sequence of raw tweets and return a list of strings."""
        processed_features = []
        for sentence in range(0, len(tweets)):
            # Remove all special characters
            processed_feature = re.sub(r'\W', ' ', str(tweets[sentence]))
            # Remove all single characters
            processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)
            # Remove a single character from the start
            # (unescaped ^ anchors at the start; \^ would match a literal caret)
            processed_feature = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature)
            # Replace multiple spaces with a single space
            processed_feature = re.sub(r'\s+', ' ', processed_feature, flags=re.I)
            # Remove the 'b' prefix
            processed_feature = re.sub(r'^b\s+', '', processed_feature)
            # Convert to lowercase
            processed_feature = processed_feature.lower()
            processed_features.append(processed_feature)
        return processed_features

    processed_features = preproc(tweets)
    vectorizer = TfidfVectorizer(max_features=2500, min_df=7, max_df=0.8, stop_words=stopwords.words('english'))
    processed_features = vectorizer.fit_transform(processed_features).toarray()
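To see what these cleaning steps do to one sentence, here is a minimal sketch. The helper name `clean_text` is hypothetical and not part of the original code. It leaves out the 'b' prefix step for brevity and adds a final `strip()`. The start-of-string pattern is written as `^[a-zA-Z]\s+` with an unescaped caret so that it anchors at the start of the string; an escaped `\^` would only match a literal caret character.

```python
import re

def clean_text(text):
    """Hypothetical single-string version of the cleaning steps (sketch)."""
    text = re.sub(r'\W', ' ', str(text))          # non-word characters -> spaces
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)   # drop isolated single letters
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)     # drop a single letter at the start
    text = re.sub(r'\s+', ' ', text)              # collapse runs of whitespace
    return text.lower().strip()                   # lowercase, trim the ends

print(clean_text("I am very happy today!!"))  # prints: am very happy today
```

Note that the leading "I" is removed as a single character, so it contributes nothing to the features.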

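Then use `preproc` to preprocess the test strings, and call the vectorizer's `transform` (not `fit_transform`, which would refit the vocabulary) before passing the result to the classifier: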

    # Feed in two single-sentence tweets (a flat list of strings):
    test = preproc(["I hate this book.", "I love this movie."])
    predictions = text_classifier.predict(vectorizer.transform(test).toarray())
    print(predictions)
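One practical detail: the question saves only the classifier with `joblib.dump`, but predicting in a later session also needs the fitted vectorizer, since it holds the vocabulary and IDF weights learned from the training data. The sketch below shows one way to persist both. The toy training data, the file name `vectorizer.pkl`, and the temporary directory are assumptions for illustration, and the `TfidfVectorizer` uses default settings because the toy corpus is too small for `min_df=7`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, made up for illustration (1 = positive, 0 = negative).
train_texts = ["i love this movie", "what a great happy day",
               "i hate this book", "what a terrible sad day"]
train_labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
classifier = MultinomialNB()
classifier.fit(vectorizer.fit_transform(train_texts), train_labels)

# Save the fitted vectorizer next to the classifier.
model_dir = tempfile.mkdtemp()
joblib.dump(vectorizer, os.path.join(model_dir, "vectorizer.pkl"))
joblib.dump(classifier, os.path.join(model_dir, "model.pkl"))

# Later, for example in another script: load both and predict on new text.
loaded_vectorizer = joblib.load(os.path.join(model_dir, "vectorizer.pkl"))
loaded_classifier = joblib.load(os.path.join(model_dir, "model.pkl"))

new_sentences = ["i love this happy day", "i hate this terrible movie"]
# transform reuses the learned vocabulary; it does not refit anything.
features = loaded_vectorizer.transform(new_sentences)
loaded_predictions = loaded_classifier.predict(features)
print(loaded_predictions)
```

In a real project the new sentences would go through the same preprocessing function as the training tweets before being transformed.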

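The output depends on the labels in your dataset, that is, on how `train_tweets.label.values` is encoded, and you can map it to strings. For example, if the labels in the dataset are encoded as 1 = positive and 0 = negative, you might get [0, 1].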
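To turn numeric predictions into the readable output asked for in the question, a small lookup is enough. This is a sketch that assumes integer labels with 1 = positive and 0 = negative; the names `LABEL_NAMES` and `describe` are hypothetical, and the mapping should be checked against how your own `train.csv` encodes its labels.

```python
# Assumed encoding; verify it against the label column of your dataset.
LABEL_NAMES = {0: "negative", 1: "positive"}

def describe(sentences, predictions, label_names=LABEL_NAMES):
    """Pair each sentence with the readable name of its predicted label."""
    lines = []
    for sentence, label in zip(sentences, predictions):
        name = label_names.get(int(label), f"unknown label {label}")
        lines.append(f'sentence = "{sentence}" sentiment = {name}')
    return lines

# Stand-in for the array returned by text_classifier.predict(...).
example_sentences = ["I hate this book.", "I love this movie."]
example_predictions = [0, 1]
for line in describe(example_sentences, example_predictions):
    print(line)
```

With a fitted model, `example_predictions` would be replaced by the output of `text_classifier.predict(...)` for the same sentences.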
