Why do I get random results from seemingly non-random Python sklearn code?

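I have updated the question based on the responses.

I have a list of strings named "str_tuple". I want to compute the similarity between the first element of the list and each of the remaining elements. I ran the six-line snippet below.

What confuses me completely is that the result appears to be different every time I run the code, yet I cannot see anything in my six lines that introduces randomness.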

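Update:

It was pointed out that TruncatedSVD() has a "random_state" parameter. Specifying "random_state" gives a fixed result (which is entirely true). However, if you change "random_state", the result changes as well. For other strings (e.g. str2), the result is the same no matter how you change "random_state". These strings actually come from the HOME_DEPOT Kaggle competition. I have a pd.Series containing thousands of such strings, and most of them behave like str2 and give the same result for any "random_state". For some unknown reason, str1 is one of the examples that gives a different result every time "random_state" changes. I am starting to think that some intrinsic property of str1 causes the difference.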

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer

# str1 gives random results
str1 = [
    u'l bracket',
    u'simpson strong tie 12 gaug angl',
    u'angl make joint stronger provid consist straight corner simpson strong tie offer wide varieti angl various size thick handl light duti job project structur connect need bent skew match project outdoor project moistur present use zmax zinc coat connector provid extra resist corros look "z" end model number .versatil connector various 90 connect home repair projectsstrong angl nail screw fasten alonehelp ensur joint consist straight strongdimensions: 3 in. xbi 3 in. xbi 1 0.5 in. made 12 gaug steelgalvan extra corros resistanceinstal 10 d common nail 9 xbi 1 0.5 in. strong drive sd screw',
    u'simpson strong-tie',
    u'',
    u'versatile connector for various 90\xe2\xb0 connections and home repair projects stronger than angled nailing or screw fastening alone help ensure joints are consistently straight and strong dimensions: 3 in. x 3 in. x 1-1/2 in. made from 12-gauge steel galvanized for extra corrosion resistance install with 10d common nails or #9 x 1-1/2 in. strong-drive sd screws',
]

# str2 gives non-random results
str2 = [
    u'angl bracket',
    u'simpson strong tie 12 gaug angl',
    u'angl make joint stronger provid consist straight corner simpson strong tie offer wide varieti angl various size thick handl light duti job project structur connect need bent skew match project outdoor project moistur present use zmax zinc coat connector provid extra resist corros look "z" end model number .versatil connector various 90 connect home repair projectsstrong angl nail screw fasten alonehelp ensur joint consist straight strongdimensions: 3 in. xbi 3 in. xbi 1 0.5 in. made 12 gaug steelgalvan extra corros resistanceinstal 10 d common nail 9 xbi 1 0.5 in. strong drive sd screw',
    u'simpson strong-tie',
    u'',
    u'versatile connector for various 90\xe2\xb0 connections and home repair projects stronger than angled nailing or screw fastening alone help ensure joints are consistently straight and strong dimensions: 3 in. x 3 in. x 1-1/2 in. made from 12-gauge steel galvanized for extra corrosion resistance install with 10d common nails or #9 x 1-1/2 in. strong-drive sd screws',
]

vectorizer = CountVectorizer(token_pattern=r"\d+\.\d+|\d+\/\d+|\b\w+\b")

# Replacing str1 with str2 gives non-random results regardless of random_state
cmat = vectorizer.fit_transform(str1).astype(float)  # sparse matrix
cmat = TruncatedSVD(2).fit_transform(cmat)  # dense numpy array
cmat = Normalizer().fit_transform(cmat)  # dense numpy array
sim = np.dot(cmat, cmat.T)
sim[0, 1:].tolist()

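Answer:

By default, TruncatedSVD uses a randomized algorithm. To get reproducible results you therefore have to pass a fixed random_state, for example an integer seed, which is then used to seed the random number generator.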

cmat = TruncatedSVD(n_components=2, random_state=42).fit_transform(cmat)

Docs

class sklearn.decomposition.TruncatedSVD(n_components=2, algorithm='randomized', n_iter=5, random_state=None, tol=0.0)
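As a minimal sketch of the point above, the snippet below runs the question's pipeline twice with the same seed and checks that the two results agree. The short `docs` list is a made-up stand-in for str1, the helper name `first_row_similarity` is hypothetical, and the token pattern is simplified to `\b\w+\b`.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Normalizer

# Shortened, made-up stand-in for the str1 list in the question.
docs = [
    "l bracket",
    "simpson strong tie 12 gaug angl",
    "angl make joint stronger provid consist straight corner",
    "simpson strong-tie",
    "versatile connector for various connections and home repair projects",
]


def first_row_similarity(texts, seed):
    """Cosine similarity of the first text to the others after a 2-component SVD."""
    vectorizer = CountVectorizer(token_pattern=r"\b\w+\b")
    counts = vectorizer.fit_transform(texts).astype(float)
    # A fixed random_state makes the randomized SVD solver repeatable.
    reduced = TruncatedSVD(n_components=2, random_state=seed).fit_transform(counts)
    unit = Normalizer().fit_transform(reduced)
    return np.dot(unit, unit.T)[0, 1:]


run_a = first_row_similarity(docs, seed=42)
run_b = first_row_similarity(docs, seed=42)
same_seed_matches = bool(np.allclose(run_a, run_b))
print(same_seed_matches)
```

Calling the helper with a different seed may or may not give the same numbers, which is the behaviour the question's update describes for str1.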


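To produce non-random output, the first element of the list must contain a term that occurs more than once in the list. That is, if the first element of str1 were angl, versatile, or simpson, it would give non-random results. Because the first element of str2 contains angl, which is repeated at least once elsewhere in the list, it does not return random output.

The randomness is therefore a measure of how dissimilar the occurrences of elements in the given list are. In such cases, specifying random_state helps to produce a unique output.
[Thanks to @wen for pointing this out]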
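One way to check the condition described above is to look at which tokens of the first element also occur in the rest of the list. The sketch below uses only the standard library and the question's token pattern. The helper names `tokens` and `shared_with_rest` are hypothetical, and the two short lists are shortened stand-ins for str1 and str2.

```python
import re

# Same token pattern as the CountVectorizer in the question.
TOKEN_PATTERN = re.compile(r"\d+\.\d+|\d+\/\d+|\b\w+\b")


def tokens(text):
    """Return the set of lowercased tokens in a string."""
    return set(TOKEN_PATTERN.findall(text.lower()))


def shared_with_rest(texts):
    """Tokens of the first element that also occur in any later element."""
    first = tokens(texts[0])
    rest = set()
    for text in texts[1:]:
        rest |= tokens(text)
    return first & rest


# Shortened stand-ins for the two lists in the question.
str1_like = ["l bracket", "simpson strong tie 12 gaug angl", "angl make joint stronger"]
str2_like = ["angl bracket", "simpson strong tie 12 gaug angl", "angl make joint stronger"]

print(shared_with_rest(str1_like))  # set()
print(shared_with_rest(str2_like))  # {'angl'}
```

An empty result means the first element shares no vocabulary with the other elements, which is the situation the answer associates with seed-dependent output.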
