Fake news classifier performs poorly on new data

I have a dataset that looks like this:

   content                                              label
0  Sainte-Nathalène – Si les scientifiques sonnen...   1
1  Le musicien américano-néerlandais Eddie Van Ha...   0
2  Angela Merkel écoute Emmanuel Macron, lors d’u...   0
3  Analyse. Telle qu’elle a été présentée, dimanc...   0
4  Sur l’esplanade du Trocadéro, à Paris, 24 août...   0

The data contains 1000 fake news articles and 1000 real news articles.

I train the model like this:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['content'], df['label'], test_size=0.20)

# Random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Vectorizing and applying TF-IDF
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', RandomForestClassifier())
])

# Fitting the model
model = pipeline.fit(X_train, y_train)

# Accuracy
from sklearn import metrics
from sklearn.metrics import accuracy_score, confusion_matrix, plot_confusion_matrix
prediction = model.predict(X_test)
print("accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100,2)))

accuracy: 95.95%

rf_cm = metrics.confusion_matrix(y_test, prediction)
print(rf_cm)

[[193  18]
 [  0 233]]

So the model trains well.

I use model.pickle to load the model in Flask.

When I use this model on new articles, it always predicts fake news, even when the article is real.

The model.py in the Flask app looks like this:

In routes.py I do this:

# Receiving the input url from the user and using Web Scraping to extract the news content
@app.route('/predict', methods=['GET', 'POST'])
def predict():
    url = request.get_data(as_text=True)[5:]
    url = urllib.parse.unquote(url)
    article = Article(str(url))
    article.download()
    article.parse()
    article.nlp()
    news = article.summary
    # Passing the news article to the model and returning whether it is Fake or Real
    pred = model.predict([news])
    dic = {1:'Fake',0:'Real'}
    return render_template('home.html', prediction_text='The news is "{}"'.format(dic[pred[0]]))

What could be the cause? How can I get better results on new data with the trained model?

Answer:

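Detecting fake news is hard. It requires a lot of knowledge about the world, not just the probabilities of certain words occurring. A few years ago, an article titled "US president suggests nuking hurricanes" would clearly have been flagged as "fake news" by many people. But today? Not so sure…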

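Your model seems to fit your dataset well. But is the dataset really representative of your problem? Your model has clearly learned something, but what has it learned? Maybe some phrase in the dataset signals "real" news but never appears in the website articles? Or maybe it is the other way around?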

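Also, have you checked that the scraped data is preprocessed correctly? Are there HTML tags or similar leftovers in the data? This could also affect the classifier.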
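A quick sanity check on the scraped text could look like the sketch below. It uses only the standard library; the helpers `markup_report` and `strip_markup` and the sample string are hypothetical, and the regular expressions are a rough heuristic, not a full HTML parser.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                 # anything that looks like a tag
ENTITY_RE = re.compile(r"&(?:[a-zA-Z]+|#\d+);")  # named or numeric entities


def markup_report(text):
    """Count HTML tags and entities left in a scraped string."""
    return {
        "tags": len(TAG_RE.findall(text)),
        "entities": len(ENTITY_RE.findall(text)),
        "n_words": len(text.split()),
    }


def strip_markup(text):
    """Remove tags, decode entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", text)
    decoded = html.unescape(no_tags)
    return " ".join(decoded.split())


# Hypothetical scraped string, made up for illustration.
scraped = "<p>Le ministre a d&eacute;clar&eacute; &nbsp;que <b>rien</b> ne change.</p>"
print(markup_report(scraped))
print(strip_markup(scraped))
```

Running the same report on a few training rows and on the string actually passed to `model.predict` makes differences visible. The word count is included because the route in the question passes `article.summary`, which may be much shorter than the `content` the model was trained on.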

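But in general, I would be very surprised by a model that can learn to detect fake news from only 2000 samples. Fact-checking is a hard task even for human experts!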
