PyTorch Distributed Run with SLURM results in an "address family not found" error

I am trying to run an example Python file with torch.distributed.run through a SLURM script on a cluster, using 2 nodes with 2 GPUs each. When I do, I get the following error:

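[W socket.cpp:426] [c10d] The server socket cannot be initialized on [::]:16773 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [clara06.url.de]:16773 (errno: 97 - Address family not supported by protocol).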

This is the SLURM script:

#!/bin/bash
#SBATCH --job-name=distribution-test        # name
#SBATCH --nodes=2                           # number of nodes
#SBATCH --ntasks-per-node=1                 # crucial - only one distributed task per node!
#SBATCH --cpus-per-task=4                   # number of cores per task
#SBATCH --partition=clara
#SBATCH --gres=gpu:v100:2                   # number of GPUs
#SBATCH --time 0:15:00                      # maximum execution time (HH:MM:SS)
#SBATCH --output=%x-%j.out                  # output file name

module load Python
pip install --user -r requirements.txt

MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
GPUS_PER_NODE=2

LOGLEVEL=INFO python -m torch.distributed.run --rdzv_id=$SLURM_JOBID --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR\:$MASTER_PORT --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES  torch-distributed-gpu-test.py
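
The MASTER_PORT line in the script derives a port from the job ID: 10000 plus the last four characters of SLURM_JOBID. The same arithmetic can be sketched in Python as follows. The function name derive_master_port is hypothetical and exists only for this illustration.

```python
def derive_master_port(job_id, base=10000):
    """Mirror the shell expression: base + the last four characters of the job ID."""
    # str() so that both "1234567" and 1234567 are accepted.
    tail = str(job_id)[-4:]
    # int() reads the digits as decimal, so "0567" becomes 567.
    return base + int(tail)


print(derive_master_port("1234567"))  # 14567
```

Because at most four digits are added to 10000, the result always lies between 10000 and 19999, which stays above the privileged port range.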

And this is the Python code that should run:

import fcntl
import os
import socket

import torch
import torch.distributed as dist


def printflock(*msgs):
    """Solves the problem of interleaved printing from multiple processes."""
    with open(__file__, "r") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            print(*msgs)
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)
hostname = socket.gethostname()

gpu = f"[{hostname}-{local_rank}]"

try:
    # test distributed communication
    dist.init_process_group("nccl")
    dist.all_reduce(torch.ones(1).to(device), op=dist.ReduceOp.SUM)
    dist.barrier()

    # test that CUDA is available and can allocate memory
    torch.cuda.is_available()
    torch.ones(1).cuda(local_rank)

    # global rank
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    printflock(f"{gpu} is OK (global rank: {rank}/{world_size})")

    dist.barrier()
    if rank == 0:
        printflock(f"pt={torch.__version__}, cuda={torch.version.cuda}, nccl={torch.cuda.nccl.version()}")

except Exception:
    printflock(f"{gpu} is broken")
    raise

I tried several different ways of launching the Python script:

LOGLEVEL=INFO python -m torch.distributed.run --master_addr $MASTER_ADDR --master_port $MASTER_PORT --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES  torch-distributed-gpu-test.py

LOGLEVEL=INFO torchrun --rdzv_id=$SLURM_JOBID --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR\:$MASTER_PORT --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES  torch-distributed-gpu-test.py

LOGLEVEL=INFO python -m torch.distributed.launch --rdzv_id=$SLURM_JOBID --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR\:$MASTER_PORT --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES  torch-distributed-gpu-test.py

All of them result in the same error.

I also tried specifying the IP address explicitly instead of using MASTER_ADDR:

IP_ADDRESS=$(srun hostname --ip-address | head -n 1)

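I checked the open ports: all ports above 1023 are open.
I checked /etc/resolv.conf: the hostname mapping is clear.
I pinged the nodes, which also succeeded.
I specified the IP version by appending .ipv4 to MASTER_ADDR, without success.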
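
The name-resolution checks above can also be done from Python. The following is a minimal sketch using only the standard socket module; the helper name ipv4_addresses is hypothetical. It restricts the lookup to IPv4, which shows what a node name such as MASTER_ADDR resolves to when IPv6 is left out.

```python
import socket


def ipv4_addresses(hostname):
    """Return the unique IPv4 addresses that `hostname` resolves to."""
    # family=AF_INET restricts the lookup to IPv4 results only.
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses


# A numeric address needs no DNS lookup and resolves to itself.
print(ipv4_addresses("127.0.0.1"))  # ['127.0.0.1']
```

On a compute node, calling ipv4_addresses with the first hostname of the job would return the IPv4 address (or addresses) that could be passed as the rendezvous endpoint. A socket.gaierror is raised if the name has no IPv4 mapping.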

Answer:

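The "address family not found" errors are related to the IPv4 and IPv6 protocol versions. They appear because my service does not provide IPv6 connectivity between the nodes.

However, they can be treated as warnings: the connection over IPv4 is still established.

I did not find any way to disable the IPv6 connection attempts, but since the messages are only informational, I ignored them.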
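
To illustrate why such a message can be harmless, here is a minimal sketch of a "try IPv6 first, then fall back to IPv4" listener built on the standard socket module. It is an illustration of the general pattern only, not the actual c10d implementation, and the function name bind_with_fallback is hypothetical. On Linux, errno 97 is EAFNOSUPPORT, which is what the operating system reports when the requested address family is unavailable.

```python
import errno
import socket


def bind_with_fallback(port=0):
    """Try to bind an IPv6 listener first, then fall back to IPv4."""
    attempts = [(socket.AF_INET6, "::"), (socket.AF_INET, "0.0.0.0")]
    for family, host in attempts:
        sock = None
        try:
            sock = socket.socket(family, socket.SOCK_STREAM)
            sock.bind((host, port))
            return sock
        except OSError as exc:
            # Report the failure as a warning and try the next address family.
            name = errno.errorcode.get(exc.errno, "unknown")
            print(f"warning: cannot bind {host}:{port} (errno {exc.errno}, {name})")
            if sock is not None:
                sock.close()
    raise OSError("no usable address family found")
```

With port=0 the operating system picks a free port. On a host without IPv6 support the first attempt prints a warning and the function still returns a working IPv4 socket, which matches the behaviour described above.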
