I've run into the following problem. I'm currently building a classification system that takes text plus some supplementary information as input. I keep the supplementary information in a pandas DataFrame, and I transform the text with CountVectorizer, which gives me a sparse matrix. To train the classifier I now need to merge both inputs into the same data frame. The problem is that when I merge the DataFrame with the CountVectorizer output, I end up with a dense matrix and quickly run out of memory. Is there a way to avoid this and properly combine the two inputs into a single DataFrame without producing a dense matrix?
Example code:
import datetime

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# number of most frequent words to consider
n_features = 5000

df = pd.read_csv('DataWithSentimentAndTopics.csv', index_col=None)

# vectorize the text
tf_vectorizer = CountVectorizer(max_df=0.5, min_df=2,
                                max_features=n_features,
                                stop_words='english')

# get the TF matrix
tf = tf_vectorizer.fit_transform(df['reviewText'])

# tf.A densifies the sparse matrix -- this is where memory blows up
df = pd.concat([df.drop(['reviewText', 'Summary'], axis=1),
                pd.DataFrame(tf.A)], axis=1)

# bin the target variable into 4 buckets
df['helpful'] = pd.cut(df['helpful'], [-1, 0, 10, 50, 100000], labels=[0, 1, 2, 3])

# create the X and Y variables
train = df.drop(['helpful'], axis=1)
Y = df['helpful']

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(train, Y, test_size=0.1)

# create the GBC
gbc = GradientBoostingClassifier(max_depth=7, n_estimators=1500, min_samples_leaf=10)

print('Training GBC')
print(datetime.datetime.now())

# fit the classifier
gbc.fit(X_train, y_train)
As you can see, I set the CountVectorizer to keep 5000 words. My original DataFrame has only 50000 rows, yet I already get a 50000 x 5000 matrix, which is 250 million cells. That already takes a huge amount of memory.
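For a rough sense of scale (a back-of-the-envelope sketch; the 1% density is an assumed, illustrative figure): a dense 50000 x 5000 block of 8-byte values costs almost 2 GiB, while a CSR matrix only pays for its non-zeros.

import numpy as np
import scipy.sparse as sp

rows, cols = 50_000, 5_000

# dense footprint: every cell costs 8 bytes (int64/float64)
print(rows * cols * 8 / 1024**3)   # ~1.86 GiB

# sparse CSR footprint: only the non-zeros are stored
# (1% density is a hypothetical value; real TF matrices are often sparser)
tf = sp.random(rows, cols, density=0.01, format='csr')
print((tf.data.nbytes + tf.indices.nbytes + tf.indptr.nbytes) / 1024**2)  # ~30 MiB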
Answer:
As @ already mentioned, you don't need to put your one-hot-encoded data into a DataFrame at all.
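The usual way to keep everything sparse (a sketch, not from the original answer; which supplementary columns you use is up to you, 'price' and 'LenReview' here are just taken from the example output below) is to stack the numeric features next to the TF matrix with scipy.sparse.hstack and pass the result straight to the estimator:

import scipy.sparse as sp
from sklearn.model_selection import train_test_split

# supplementary numeric features as their own sparse block
extra = sp.csr_matrix(df[['price', 'LenReview']].values)

# horizontal stacking keeps the combined matrix sparse
X = sp.hstack([tf, extra], format='csr')

# scikit-learn estimators accept scipy sparse matrices directly
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.1)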
But if you do want to, you can add the columns as Pandas SparseSeries:
# vectorize the text
tf_vectorizer = CountVectorizer(max_df=0.5, min_df=2,
                                max_features=n_features,
                                stop_words='english')

# get the TF matrix
tf = tf_vectorizer.fit_transform(df.pop('reviewText'))

# add the "feature" columns as SparseSeries
for i, col in enumerate(tf_vectorizer.get_feature_names()):
    df[col] = pd.SparseSeries(tf[:, i].toarray().ravel(), fill_value=0)
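Note that pd.SparseSeries was removed in pandas 1.0, and get_feature_names was removed in scikit-learn 1.2. On current versions the equivalent (a sketch assuming pandas >= 0.25 and scikit-learn >= 1.0) is pd.DataFrame.sparse.from_spmatrix:

# build sparse-dtype columns directly from the scipy matrix
tf_df = pd.DataFrame.sparse.from_spmatrix(
    tf, index=df.index, columns=tf_vectorizer.get_feature_names_out())
df = pd.concat([df, tf_df], axis=1)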
Result:
In [107]: df.head(3)
Out[107]:
        asin  price      reviewerID  LenReview                  Summary  LenSummary  overall  helpful  reviewSentiment         0  \
0  151972036   8.48  A14NU55NQZXML2        199  really a difficult read          23        3        2          -0.7203  0.002632
1  151972036   8.48  A1CSBLAPMYV8Y0         77                      wha           3        4        0          -0.1260  0.005556
2  151972036   8.48  A1DDECXCGHDYZK        114       wordy and drags on          18        1        4           0.5707  0.004545

   ...  think  thought  trailers  trying  wanted  words  worth  wouldn  writing  young
0  ...      0        0         0       0       1      0      0       0        0      0
1  ...      0        0         0       1       0      0      0       0        0      0
2  ...      0        0         0       0       1      0      1       0        0      0

[3 rows x 78 columns]
Note the memory usage:
In [108]: df.memory_usage()
Out[108]:
Index               80
asin               112
price              112
reviewerID         112
LenReview          112
Summary            112
LenSummary         112
overall            112
helpful            112
reviewSentiment    112
0                  112
1                  112
2                  112
3                  112
4                  112
5                  112
6                  112
7                  112
8                  112
9                  112
10                 112
11                 112
12                 112
13                 112
14                 112
                  ...
parts               16    # memory usage: number of non-zero values * 8 (np.int64)
peter               16
picked              16
point               16
quick               16
rating              16
reader              16
reading             24
really              24
reviews             16
stars               16
start               16
story               32
tedious             16
things              16
think               16
thought             16
trailers            16
trying              16
wanted              24
words               16
worth               16
wouldn              16
writing             24
young               16
dtype: int64
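If you want to sanity-check the savings on your own data, a small illustrative snippet ('young' is just one of the token columns from the output above):

# total DataFrame footprint in bytes
print(df.memory_usage().sum())

# on modern pandas (sparse-dtype columns) you can also inspect the density,
# i.e. the fraction of values that are actually stored
print(df['young'].sparse.density)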