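I am doing topic detection with supervised learning. However, my matrices are very large (202180 x 15000) and do not fit into the models I want to use. Most of the matrix entries are zero. Only logistic regression works. Is there a way to keep working with the same matrices but make them usable with the models I want? For example, can I create my matrices in a different way?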
Here is my code:
    import numpy as np
    import subprocess
    from sklearn.linear_model import SGDClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn import metrics

    def run(command):
        output = subprocess.check_output(command, shell=True)
        return output

    # Load the vocabulary
    f = open('/Users/win/Documents/wholedata/RightVo.txt', 'r')
    vocab_temp = f.read().split()
    f.close()
    col = len(vocab_temp)
    print("Training column size:")
    print(col)

    # Create the training matrix
    row = run('cat ' + '/Users/win/Documents/wholedata/X_tr.txt' + " | wc -l").split()[0]
    print("Training row size:")
    print(row)
    matrix_tmp = np.zeros((int(row), col), dtype=np.int64)
    print("Training matrix size:")
    print(matrix_tmp.size)
    label_tmp = np.zeros((int(row)), dtype=np.int64)
    f = open('/Users/win/Documents/wholedata/X_tr.txt', 'r')
    count = 0
    for line in f:
        line_tmp = line.split()
        #print(line_tmp)
        for word in line_tmp[0:]:
            if word not in vocab_temp:
                continue
            matrix_tmp[count][vocab_temp.index(word)] = 1
        count = count + 1
    f.close()
    print("Training matrix is:\n ")
    print(matrix_tmp)
    print(label_tmp)
    print("Training label size:")
    print(len(label_tmp))

    f = open('/Users/win/Documents/wholedata/RightVo.txt', 'r')
    vocab_temp = f.read().split()
    f.close()
    col = len(vocab_temp)
    print("Test column size:")
    print(col)

    # Create the test matrix
    row = run('cat ' + '/Users/win/Documents/wholedata/X_te.txt' + " | wc -l").split()[0]
    print("Test row size:")
    print(row)
    matrix_tmp_test = np.zeros((int(row), col), dtype=np.int64)
    print("Test matrix size:")
    print(matrix_tmp_test.size)
    label_tmp_test = np.zeros((int(row)), dtype=np.int64)
    f = open('/Users/win/Documents/wholedata/X_te.txt', 'r')
    count = 0
    for line in f:
        line_tmp = line.split()
        #print(line_tmp)
        for word in line_tmp[0:]:
            if word not in vocab_temp:
                continue
            matrix_tmp_test[count][vocab_temp.index(word)] = 1
        count = count + 1
    f.close()
    print("Test matrix is: \n")
    print(matrix_tmp_test)
    print(label_tmp_test)
    print("Test label size:")
    print(len(label_tmp_test))

    xtrain = []
    with open("/Users/win/Documents/wholedata/Y_te.txt") as filer:
        for line in filer:
            xtrain.append(line.strip().split())
    xtrain = np.ravel(xtrain)
    label_tmp_test = xtrain

    ytrain = []
    with open("/Users/win/Documents/wholedata/Y_tr.txt") as filer:
        for line in filer:
            ytrain.append(line.strip().split())
    ytrain = np.ravel(ytrain)
    label_tmp = ytrain

    # Load the supervised model
    model = LogisticRegression()
    model = model.fit(matrix_tmp, label_tmp)
    #print(model)
    print("Entered 1")
    y_train_pred = model.predict(matrix_tmp_test)
    print("Entered 2")
    print(metrics.accuracy_score(label_tmp_test, y_train_pred))

Answer:
You can use a special data structure from the scipy package called a sparse matrix: http://docs.scipy.org/doc/scipy/reference/sparse.html
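By definition:
A sparse matrix is simply a matrix with a large number of zero values. In contrast, a matrix where many or most entries are non-zero is said to be dense. There are no strict rules for what constitutes a sparse matrix, so we can say that a matrix is sparse if there is some benefit to exploiting its sparsity. In addition, there are a variety of sparse matrix formats, which are designed to exploit different sparsity patterns (the structure of the non-zero values in a sparse matrix) and different methods for accessing and manipulating matrix entries.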
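As a rough illustration of how the binary word-presence matrix from the question could be built as a sparse matrix instead of with np.zeros, here is a minimal sketch. The helper name build_binary_csr and the tiny vocab and docs lists are made up for the example (they stand in for RightVo.txt and X_tr.txt), and CSR is only one of the available formats, chosen here as an assumption about what suits row-wise training data.

```python
import numpy as np
from scipy.sparse import csr_matrix


def build_binary_csr(lines, vocab):
    """Build a binary document-term matrix in CSR format.

    lines: iterable of whitespace-separated documents, one per row.
    vocab: list of vocabulary words; column j corresponds to vocab[j].
    (Hypothetical helper, not part of scipy or scikit-learn.)
    """
    # A dict gives constant-time word -> column lookups, unlike list.index().
    # setdefault keeps the first position of a repeated word, like list.index().
    word_to_col = {}
    for j, word in enumerate(vocab):
        word_to_col.setdefault(word, j)

    rows, cols = [], []
    n_rows = 0
    for i, line in enumerate(lines):
        n_rows = i + 1
        # A set drops repeated words so every stored entry stays 1.
        seen = {word_to_col[w] for w in line.split() if w in word_to_col}
        for j in sorted(seen):
            rows.append(i)
            cols.append(j)

    # Only the non-zero entries are stored; zeros cost no memory.
    data = np.ones(len(rows), dtype=np.int8)
    return csr_matrix((data, (rows, cols)), shape=(n_rows, len(vocab)))


# Tiny made-up data standing in for the vocabulary and training files.
vocab = ["apple", "banana", "cherry", "date"]
docs = ["apple cherry apple", "unknown banana", "", "date apple"]
X = build_binary_csr(docs, vocab)
print(X.shape, X.nnz)
print(X.toarray())
```

For scale, a dense 202180 x 15000 array of int64 needs about 24 GB (202180 * 15000 * 8 bytes), while the sparse version stores only the non-zero entries. scikit-learn's LogisticRegression and SGDClassifier accept scipy sparse matrices in fit and predict; for other estimators, check their documentation to see whether sparse input is supported.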