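I have updated the question based on the responses.
I have a list of strings named "str_tuple". I want to compute the similarity between the first element of the list and each of the other elements. I ran the six-line code snippet below.
What completely baffles me is that the result appears to be random every time I run the code, yet I cannot see anything in my six lines that introduces randomness.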
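Update:
Someone pointed out that TruncatedSVD() has a "random_state" parameter. Specifying "random_state" gives a fixed result (which is entirely correct). However, if you change "random_state", the result changes as well. For other strings (for example str2), the result is the same no matter how you change "random_state". These strings actually come from the HOME_DEPOT Kaggle competition. I have a pd.Series containing thousands of such strings, and most of them behave like str2, giving non-random results for any "random_state". For some unknown reason, str1 is one of the cases that gives a different result every time "random_state" changes. I am starting to think that some intrinsic property of str1 causes the difference.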
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.preprocessing import Normalizer

    # str1 produces random results
    str1 = [
        u'l bracket',
        u'simpson strong tie 12 gaug angl',
        u'angl make joint stronger provid consist straight corner simpson strong tie offer wide varieti angl various size thick handl light duti job project structur connect need bent skew match project outdoor project moistur present use zmax zinc coat connector provid extra resist corros look "z" end model number .versatil connector various 90 connect home repair projectsstrong angl nail screw fasten alonehelp ensur joint consist straight strongdimensions: 3 in. xbi 3 in. xbi 1 0.5 in. made 12 gaug steelgalvan extra corros resistanceinstal 10 d common nail 9 xbi 1 0.5 in. strong drive sd screw',
        u'simpson strong-tie',
        u'',
        u'versatile connector for various 90\xe2\xb0 connections and home repair projects stronger than angled nailing or screw fastening alone help ensure joints are consistently straight and strong dimensions: 3 in. x 3 in. x 1-1/2 in. made from 12-gauge steel galvanized for extra corrosion resistance install with 10d common nails or #9 x 1-1/2 in. strong-drive sd screws',
    ]

    # str2 produces non-random results
    str2 = [
        u'angl bracket',
        u'simpson strong tie 12 gaug angl',
        u'angl make joint stronger provid consist straight corner simpson strong tie offer wide varieti angl various size thick handl light duti job project structur connect need bent skew match project outdoor project moistur present use zmax zinc coat connector provid extra resist corros look "z" end model number .versatil connector various 90 connect home repair projectsstrong angl nail screw fasten alonehelp ensur joint consist straight strongdimensions: 3 in. xbi 3 in. xbi 1 0.5 in. made 12 gaug steelgalvan extra corros resistanceinstal 10 d common nail 9 xbi 1 0.5 in. strong drive sd screw',
        u'simpson strong-tie',
        u'',
        u'versatile connector for various 90\xe2\xb0 connections and home repair projects stronger than angled nailing or screw fastening alone help ensure joints are consistently straight and strong dimensions: 3 in. x 3 in. x 1-1/2 in. made from 12-gauge steel galvanized for extra corrosion resistance install with 10d common nails or #9 x 1-1/2 in. strong-drive sd screws',
    ]

    vectorizer = CountVectorizer(token_pattern=r"\d+\.\d+|\d+\/\d+|\b\w+\b")
    # Replacing str1 with str2 gives non-random results regardless of random_state
    cmat = vectorizer.fit_transform(str1).astype(float)  # sparse matrix
    cmat = TruncatedSVD(2).fit_transform(cmat)  # dense numpy array
    cmat = Normalizer().fit_transform(cmat)  # dense numpy array
    sim = np.dot(cmat, cmat.T)
    sim[0,1:].tolist()
Answer:
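By default, TruncatedSVD uses a randomized algorithm. To get the same result on every run, you therefore have to pass a fixed random_state (an integer seed or a numpy.random.RandomState instance):

    cmat = TruncatedSVD(n_components=2, random_state=42).fit_transform(cmat)

For reference, the constructor signature quoted from the documentation:

    class sklearn.decomposition.TruncatedSVD(n_components=2, algorithm='randomized', n_iter=5, random_state=None, tol=0.0)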
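As a quick check of the reproducibility claim, here is a minimal sketch. The toy `docs` list and the helper `embed` are made up for illustration and are not part of the question's code. It only shows that two fits with the same seed agree; it says nothing about what happens across different seeds.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, loosely modelled on the strings in the question.
docs = [
    "angl bracket",
    "simpson strong tie 12 gaug angl",
    "angl make joint stronger simpson strong tie",
    "simpson strong-tie",
]


def embed(docs, seed):
    """Count-vectorize the documents and reduce them to 2 SVD components."""
    counts = CountVectorizer().fit_transform(docs).astype(float)
    svd = TruncatedSVD(n_components=2, random_state=seed)
    return svd.fit_transform(counts)


first = embed(docs, seed=42)
second = embed(docs, seed=42)

# Same seed, same data: the two embeddings should match.
print(np.allclose(first, second))
```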
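To get non-random output, the first element of the list must share at least one token with the rest of the list. That is, if the first element of str1 began with angl, versatile, or simpson, it would give a non-random result. Because the first element of str2 contains angl, which is repeated elsewhere in the list, it does not return random output.
The randomness is therefore a consequence of how the tokens of the first element overlap with the other elements. In cases like str1, specifying random_state makes the output reproducible.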
[Thanks to @wen for pointing this out]
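One plausible mechanism behind this, offered as an assumption rather than something the answer states: if the first document shares no token with the others, its row is orthogonal to the dominant singular directions, so its 2-component projection is essentially zero. Normalizer then rescales that near-zero vector to unit length, which would amplify whatever noise the randomized solver leaves behind. The sketch below uses an exact SVD from `np.linalg.svd` in place of TruncatedSVD to take randomness out of the picture. The toy corpora and the helper `first_row_norm` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Token pattern taken from the question, so single-character tokens like "l" are kept.
TOKEN_PATTERN = r"\d+\.\d+|\d+\/\d+|\b\w+\b"


def first_row_norm(docs, k=2):
    """Length of the first document after projection onto the top-k singular directions."""
    vectorizer = CountVectorizer(token_pattern=TOKEN_PATTERN)
    counts = vectorizer.fit_transform(docs).toarray().astype(float)
    # Exact, deterministic SVD (swapped in for TruncatedSVD on purpose).
    _, _, vt = np.linalg.svd(counts, full_matrices=False)
    projected = counts @ vt[:k].T
    return float(np.linalg.norm(projected[0]))


# First document shares no token with the rest (like 'l bracket' in str1).
isolated = ["l bracket", "angl angl angl simpson", "screw screw screw simpson"]
# First document shares 'angl' with the rest (like 'angl bracket' in str2).
shared = ["angl bracket", "angl angl angl simpson", "screw screw screw simpson"]

print(first_row_norm(isolated))  # essentially zero before normalization
print(first_row_norm(shared))    # clearly non-zero
```

If this is indeed what happens with str1, checking the row norms before calling Normalizer would be one way to spot documents whose similarity scores are not meaningful.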