Memory error when merging multiple Pandas DataFrames

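We are trying to load the IDS-2018 dataset, which consists of 10 CSV files with a total size of 6.4 GB. When we try to merge all the CSV files on a server with 32 GB of RAM, the program crashes (the process is killed).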

We even tried to reduce the memory footprint of the pandas DataFrames with the following function:

import numpy as np

def reduce_mem_usage(df):
    """Downcast numeric columns to the smallest dtype that can hold their range."""
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Try integer types from narrowest to widest; the bounds are inclusive.
                for int_type in (np.int8, np.int16, np.int32, np.int64):
                    if c_min >= np.iinfo(int_type).min and c_max <= np.iinfo(int_type).max:
                        df[col] = df[col].astype(int_type)
                        break
            else:
                # Note: float16 keeps only about 3 significant decimal digits.
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

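But this did not help. The server still crashes while merging the CSV files. We merged the files using pd.concat. The full code is here. How can we accomplish this so that we can do further processing?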

Answer:

I would try the following:

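Specify the column types in read_csv through the dtype parameter.
Avoid creating 10 separate DataFrames and relying on del to manage memory; append each file to a single DataFrame as it is read.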

import numpy as np
import pandas as pd

data_files = [
    './data/CSVs/02-14-2018.csv',
    './data/CSVs/02-15-2018.csv',
    ...  # and several more
]

# Define the column data types
data_types = {
    "col_a": np.float64,
    ...  # other types
}

# Load the first file, then append each remaining file in turn
df = reduce_mem_usage(
    pd.read_csv(data_files[0], dtype=data_types, index_col=False)
)

for filename in data_files[1:]:
    df = pd.concat(
        [
            df,
            reduce_mem_usage(
                pd.read_csv(
                    filename,
                    dtype=data_types,
                    index_col=False,
                )
            ),
        ],
        ignore_index=True,
    )

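This way, you can make sure the column types are exactly what you need, and you reduce the memory footprint. In addition, if your data has categorical columns, which are usually encoded as strings in CSV files, you can greatly reduce memory usage by using the categorical dtype.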
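To illustrate the point about categorical columns, here is a minimal sketch that reads the same CSV text twice and compares the memory used by a string column. The column names (flow_duration, label) and the values are made up for the example and are not taken from the IDS-2018 files.

```python
import io

import pandas as pd

# Hypothetical CSV content: one numeric column and one low-cardinality label column.
csv_text = "flow_duration,label\n" + "\n".join(
    f"{i},{'Benign' if i % 3 else 'DoS'}" for i in range(10_000)
)

# With default type inference, the label column is stored as one string per row.
df_plain = pd.read_csv(io.StringIO(csv_text))

# Declaring the column as "category" stores each distinct value once,
# plus a small integer code per row.
df_category = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"flow_duration": "int32", "label": "category"},
)

# deep=True counts the string payloads themselves, not only the references to them.
plain_bytes = df_plain["label"].memory_usage(deep=True)
category_bytes = df_category["label"].memory_usage(deep=True)
print(f"plain strings: {plain_bytes} bytes, category: {category_bytes} bytes")
```

One thing to watch when combining this with pd.concat: if the files do not share exactly the same set of categories, the concatenated column falls back to object dtype. Passing the same pd.CategoricalDtype with an explicit category list for every file is one way to avoid that.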
