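For a machine learning task, I am looking for a way to merge two feature matrices with different dimensions so that I can feed both of them into a single estimator. I can't use the scipy merge methods because they require compatible shapes. I can use the numpy merge methods, but they cause a problem when I actually try to split the array for cross-validation. The error message is: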
Traceback (most recent call last):
  File "C:\Users\Ano\workspace\final_submission\src\linearSVM.py", line 50, in <module>
    result = ridge(train_text,train_labels,test_set,train_state,test_state)
  File "C:\Users\Ano\workspace\final_submission\src\Algorithms.py", line 90, in ridge
    x_train, x_test, y_train, y_test = cross_validation.train_test_split(train, labels, test_size = 0.2, random_state = 42)
  File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1394, in train_test_split
    arrays = check_arrays(*arrays, **options)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 211, in check_arrays
    % (size, n_samples))
ValueError: Found array with dim 77946. Expected 2
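I found the cause of this error in a StackOverflow thread: "Concatenate sparse matrices in Python using SciPy/Numpy". Apparently np.vstack/hstack creates two matrix objects, which caused my error.
The shapes I am dealing with are:
(77946, 63677)
(77946, 55)
Basically, I am looking for a way to append the 55 extra features for each sample in the second matrix to the features of the first matrix.
I also tried creating a numpy array with the appropriate dimensions and simply filling it with the feature matrices, but even creating that matrix gave me a memory error. I tried converting it to a sparse matrix, but that did not work either. Maybe I am doing something wrong here?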
new_matrix = sparse.csr_matrix(np.zeros((77946,63727)))
new_matrix[:,0:63676] = big_feature_matrix
new_matrix[:,63677:63727] = small_feature_matrix
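The memory error from the attempt above follows from simple arithmetic: `np.zeros` allocates a full dense float64 array before it is ever converted to a sparse matrix. The following is a rough sketch of that calculation, using the shape from the snippet; the variable names are only illustrative.

```python
import numpy as np

# Shape passed to np.zeros in the attempt above.
n_rows, n_cols = 77946, 63727

# np.zeros defaults to float64, so every element takes 8 bytes,
# whether or not it is zero.
itemsize = np.dtype(np.float64).itemsize
dense_bytes = n_rows * n_cols * itemsize

print("dense array size: %.1f GB" % (dense_bytes / 1e9))

# Width actually needed to hold both matrices side by side.
needed_cols = 63677 + 55
print("columns needed:", needed_cols)
```

The dense array would need tens of gigabytes, which explains the failure on a typical workstation. Note also that 63677 + 55 is 63732, so the 63727-column target in the snippet would be five columns too narrow, and the slices `0:63676` and `63677:63727` do not match the widths of the two source matrices.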
Update
So I tried Jaime's solution, but it gave me an error:
The code involved
def feature_extraction(train,test,train_small,test_small):
    vectorizer = TfidfVectorizer(min_df = 3,strip_accents = "unicode",ngram_range = (1,2))
    cv = CountVectorizer(strip_accents = "unicode",analyzer = "word",token_pattern = r'\w{1,}')
    print("fitting Vectorizer")
    vectorizer.fit(train)
    train_small = cv.fit_transform(train_state)
    test_small = cv.transform(test_state)
    print("transforming text")
    train = vectorizer.transform(train)
    test = vectorizer.transform(test)
    new_train = sparse.hstack((train, train_small), format='csr')
    new_test = sparse.hstack((test, test_small), format='csr')
    return new_train,new_test
The full error traceback
Traceback (most recent call last):
  File "C:\Users\Ano\workspace\final_submission\src\linearSVM.py", line 50, in <module>
    result = ridge(train_text,train_labels,test_set,train_small,test_small)
  File "C:\Users\Ano\workspace\final_submission\src\Algorithms.py", line 89, in ridge
    train,test = feature_extraction(train,test,train_small,test_small)
  File "C:\Users\Ano\workspace\final_submission\src\Preprocessing.py", line 109, in feature_extraction
    format='csr')
  File "C:\Python27\lib\site-packages\scipy\sparse\construct.py", line 423, in hstack
    return bmat([blocks], format=format, dtype=dtype)
  File "C:\Python27\lib\site-packages\scipy\sparse\construct.py", line 523, in bmat
    raise ValueError('blocks[%d,:] has incompatible row dimensions' % i)
ValueError: blocks[0,:] has incompatible row dimensions
The dimensions of the training set are the same as before. The test set has fewer samples (42157).
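The "incompatible row dimensions" error means the blocks passed to `sparse.hstack` do not have the same number of rows. As a minimal sketch, the check below uses tiny made-up matrices and a hypothetical helper, `check_same_rows` (not part of scipy), to catch that mismatch before stacking.

```python
import numpy as np
import scipy.sparse as sparse

def check_same_rows(*matrices):
    """Hypothetical helper: fail early if the blocks differ in row count."""
    rows = [m.shape[0] for m in matrices]
    if len(set(rows)) != 1:
        raise ValueError("row counts differ: %s" % rows)

# Small stand-ins for the real feature matrices (shapes are made up).
text_features = sparse.csr_matrix(np.arange(60).reshape(6, 10) % 3)
extra_ok = sparse.csr_matrix(np.ones((6, 3)))
extra_bad = sparse.csr_matrix(np.ones((4, 3)))  # wrong number of samples

check_same_rows(text_features, extra_ok)  # passes silently
combined = sparse.hstack((text_features, extra_ok), format='csr')

try:
    sparse.hstack((text_features, extra_bad), format='csr')
    mismatch_error = None
except ValueError as exc:
    # scipy refuses to stack blocks with different row counts
    mismatch_error = exc
```

Printing the `.shape` of each block right before the `hstack` call is usually enough to see which input was loaded with the wrong number of samples.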
Update
Jaime's solution does in fact work; I had simply messed up when loading the files. Thanks for all your help!
Answer:
You can use scipy.sparse.hstack:
new_matrix = scipy.sparse.hstack((big_feature_matrix, small_feature_matrix), format='csr')
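To illustrate the call above, here is a small self-contained sketch. The two matrices are tiny made-up stand-ins for the (77946, 63677) and (77946, 55) matrices in the question; only the variable names are taken from the answer.

```python
import numpy as np
import scipy.sparse as sparse

# Made-up stand-ins: same number of rows, different numbers of columns.
big_feature_matrix = sparse.csr_matrix(np.eye(5, 8))
small_feature_matrix = sparse.csr_matrix(np.arange(15).reshape(5, 3))

# Stack column-wise; the result stays sparse and is returned in CSR format.
new_matrix = sparse.hstack((big_feature_matrix, small_feature_matrix), format='csr')

print(new_matrix.shape)   # same row count, column counts added together
print(new_matrix.format)

# CSR supports row slicing, which is what a train/test split relies on.
first_rows = new_matrix[:3]
```

Requesting `format='csr'` matters here because the default output of `hstack` is a COO matrix, which does not support row indexing.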