OS resource requirements of Keras and TensorFlow

During training I keep getting the error F tensorflow/core/platform/default/env.cc:73] Check failed: ret == 0 (11 vs. 0)Thread tf_data_private_threadpool creation via pthread_create() failed., even though the machine is powerful:

RAM: 256 GiB
2 x AMD EPYC 7302 16-core processors
8 x NVIDIA A2

That is 64 logical cores in total.

ulimit -s reports 32768 and ulimit -u reports 1030608.
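For reference, the same two limits can also be read from inside the Python process that runs the training. This is only a sketch for a Unix-like system; the helper name describe_limit is made up for illustration. Note that ulimit -s prints the stack size in KiB, while the resource module reports it in bytes.

```python
import resource


def describe_limit(name, rlimit_id):
    """Return a one-line description of the soft and hard value of a limit."""
    soft, hard = resource.getrlimit(rlimit_id)

    def fmt(value):
        # RLIM_INFINITY is the marker the OS uses for "unlimited"
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    return f"{name}: soft={fmt(soft)} hard={fmt(hard)}"


# RLIMIT_STACK corresponds to ulimit -s (bytes here, KiB in the shell)
print(describe_limit("stack size (bytes)", resource.RLIMIT_STACK))
# RLIMIT_NPROC corresponds to ulimit -u
print(describe_limit("max user processes", resource.RLIMIT_NPROC))
```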

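I want to train the network below on a set of 512*512 grayscale images that are generated on the fly, plus two additional parameters per image. The images are generated in a C++ function called through Pybind11. The C++ function itself does not consume many resources.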

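This is the first time I have written AI training code, so I simply copied code from a similar application and adjusted the parameters. I need a relatively high resolution because the network has to infer a real number from small repeating parts of the image.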

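The situation is the same when I keep only the CNN part of the model and drop the concatenation. I have also counted the processes created during the run. The crash occurs when there are about 31000 python3 processes, which seems extreme to me. Meanwhile, nvidia-smi reports only about 13G of memory in use, on a single GPU.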
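The 11 in the log line is the value returned by pthread_create(). On Linux that number is EAGAIN, which the pthread_create man page associates with thread-count limits: the RLIMIT_NPROC soft limit, /proc/sys/kernel/threads-max, or /proc/sys/kernel/pid_max. The following Linux-only sketch prints those values next to the thread count of the current process. The helper names read_int and count_threads are hypothetical, and whether any of these limits was the one actually hit here is an assumption.

```python
import errno
import os
import threading


def read_int(path):
    """Return the integer stored in a /proc file, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return int(handle.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def count_threads():
    """Count the kernel threads of the current process.

    Falls back to the number of Python-level threads when /proc is unavailable.
    """
    try:
        return len(os.listdir("/proc/self/task"))
    except OSError:
        return threading.active_count()


# Symbolic name of error number 11 on this platform
print("errno 11:", errno.errorcode.get(11, "unknown"))
print("threads in this process:", count_threads())
print("kernel.threads-max:", read_int("/proc/sys/kernel/threads-max"))
print("kernel.pid_max:", read_int("/proc/sys/kernel/pid_max"))
```

Running this periodically inside the training process would show whether the thread count keeps growing toward one of the limits.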

# This lives in the module landscapeGenerator
def generate(aBatchSize:int=32, aRepeatParameter:int=2):
  dim = (512, 512)
  paraShape = (aRepeatParameter * 2)
  def generator():
    xParameter = numpy.empty(paraShape, dtype=float)
    xImage     = numpy.empty(dim, dtype=float)
    y          = numpy.empty((1), dtype=float)
    # Set the parameters and use them to fetch the image through Pybind11
    xImage = randomLandscape(dist, height, tempAmb, tempBase)
    xParameter[0] = xImage[0, 0] / 0.04  # 视场最大为0.04弧度
    xImage[0, 0]  = xImage[0, 1]
    xParameter[aRepeatParameter] = something
    for i in range(1, aRepeatParameter):
      xParameter[i] = xParameter[0]
      xParameter[aRepeatParameter + i] = xParameter[aRepeatParameter]
    y[0]          = something
    yield {"parameters": xParameter, "image": xImage}, y
  dataset = tensorflow.data.Dataset.from_generator(generate,
    output_signature=(
      (tensorflow.TensorSpec(shape=paraShape, dtype=tensorflow.float32, name="parameters"),
      tensorflow.TensorSpec(shape=dim, dtype=tensorflow.float32, name="image")),
      tensorflow.TensorSpec(shape=(1), dtype=tensorflow.float32, name="y")
            ))
  dataset = dataset.batch(aBatchSize)
  return dataset

def createMlp(aRepeatParameter:int=2):
  model = Sequential()
  vectorSize = aRepeatParameter * 2
  model.add(Dense(vectorSize, input_dim=(vectorSize), activation="relu"))
  model.add(Dense(aRepeatParameter, activation="relu"))
  return model

def createCnn():
  filters=(512, 128, 32)
  inputShape = (512, 512, 1)
  chanDim = -1
  inputs = Input(shape=inputShape)
  for (i, f) in enumerate(filters):
    if i == 0:
      x = inputs
    x = Conv2D(f, (3, 3), padding="same")(x)
    x = Activation("relu")(x)
    x = BatchNormalization(axis=chanDim)(x)
    x = MaxPooling2D(pool_size=(2, 2))(x)
  x = Flatten()(x)
  x = Dense(16)(x)
  x = Activation("relu")(x)
  x = BatchNormalization(axis=chanDim)(x)
  x = Dropout(0.5)(x)
  x = Dense(4)(x)
  x = Activation("relu")(x)
  model = Model(inputs, x)
  return model

repeatParameter:int = 2
mlp = createMlp(repeatParameter)
cnn = createCnn()
combinedInput = concatenate([mlp.output, cnn.output])
x = Dense(4, activation="relu")(combinedInput)
x = Dense(1, activation="linear")(x)
model = Model(inputs=[mlp.input, cnn.input], outputs=x)
opt = Adam(learning_rate=1e-3, decay=1e-3 / 200)
model.compile(loss="mean_absolute_percentage_error", optimizer=opt)
batchSize = 32
model.fit(landscapeGenerator.generate(batchSize, repeatParameter), validation_data=landscapeGenerator.generate(batchSize, repeatParameter),
  epochs=10, steps_per_epoch=10, validation_split=0.3)
model.save('trainAiTemp.model')

What can I do to make it run?

Answer:

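Sorry, everyone. There was a typo in the code that caused infinite recursion. It was hard to spot because the process ran out of resources before the infinite recursion could cause a stack overflow.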

def generate(aBatchSize:int=32, aRepeatParameter:int=2):
  dim = (512, 512)
  paraShape = (aRepeatParameter * 2)
  def generator():
    xParameter = numpy.empty(paraShape, dtype=float)
    # ...
  dataset = tensorflow.data.Dataset.from_generator(generate,) # ...
  # Here generate refers to the enclosing function, which causes the infinite recursion.
  # It should refer to generator.
  dataset = dataset.batch(aBatchSize)
  return dataset
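The mechanism can be reproduced without TensorFlow. In the sketch below, LazyDataset is a hypothetical stand-in, not a TensorFlow class. It only assumes that the dataset calls the function it was given when iteration starts. Passing the enclosing function then builds a new dataset on every call and never produces data. Passing the inner generator works.

```python
class LazyDataset:
    """Hypothetical stand-in for a dataset built from a generator function."""

    def __init__(self, factory):
        self._factory = factory

    def __iter__(self):
        # The factory is only called once iteration starts
        return iter(self._factory())


def generate_buggy():
    def generator():
        yield 1
        yield 2

    # Typo: passes the enclosing function instead of the inner generator
    return LazyDataset(generate_buggy)


def generate_fixed():
    def generator():
        yield 1
        yield 2

    # Correct: passes the inner generator
    return LazyDataset(generator)


def try_consume(dataset):
    """Return the items of a dataset, or the name of the error it raises."""
    try:
        return list(dataset)
    except RecursionError as error:
        return type(error).__name__


print(try_consume(generate_fixed()))
print(try_consume(generate_buggy()))
```

In plain Python the recursion limit ends the loop quickly. In the training run described above, each level apparently also consumed OS threads, so thread creation failed before any stack overflow.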
