### Top-p sampling not working. CUDA error: device-side assert triggered

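I am trying to reimplement the `model.generate()` function from Hugging Face Transformers models. I am doing this to implement logit bias, which the stock function does not allow. Before I could get to that, though, I ran into a lot of problems with my top-p sampling.

Here is the code snippet: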

    generation_args = {
        "max_new_tokens": 500,
        "temperature": 0.4,  # Adjust the temperature for more or less randomness
        "do_sample": True,  # Enable sampling
        "top_p": 0.5,  # Cumulative probability for nucleus sampling
        "top_k": None,  # Optionally set top_k to use with or instead of top_p
    }

    def top_p_filtering(logits, top_p):
        """Filter logits using top-p (nucleus) sampling."""
        # Sort logits in descending order and get the sorted indices
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        # Compute the cumulative probabilities of the sorted logits
        cumulative_probs = torch.cumsum(torch.nn.functional.softmax(sorted_logits, dim=-1), dim=-1)
        # Create a mask for the tokens to keep
        sorted_indices_to_keep = cumulative_probs <= top_p
        # Make sure at least one token is kept (the first one, which has the highest logit)
        sorted_indices_to_keep[..., 0] = True
        # Filter out the tokens to remove by setting their logits to negative infinity
        logits[sorted_indices[~sorted_indices_to_keep]] = float('-inf')
        return logits

    def custom_generate(input_ids, streamer, max_new_tokens, temperature, top_p):
        past_key_values = None
        attention_mask = torch.ones(input_ids.shape, device=input_ids.device)
        for _ in range(max_new_tokens):
            with torch.no_grad():
                outputs = model(
                    input_ids=input_ids,
                    past_key_values=past_key_values,
                    attention_mask=attention_mask,
                    use_cache=True
                )
            logits = outputs.logits[:, -1, :]  # Logits of the last token
            # Apply temperature to the logits
            if temperature != 1.0:
                logits = logits / temperature
            # Apply top-p sampling
            if top_p is not None and top_p < 1.0:
                logits = top_p_filtering(logits, top_p)
            print("1")
            next_token_probs = torch.nn.functional.softmax(logits, dim=-1)
            print("2")
            # Check whether next_token_probs contains valid probabilities
            next_token_id = torch.multinomial(next_token_probs,
                                              num_samples=1)
            print("3")
            streamer.put(next_token_id)  # Pass the tensor directly to the streamer
            input_ids = next_token_id  # The next input is the last generated token
            attention_mask = torch.cat(
                [attention_mask, torch.ones((attention_mask.shape[0], 1), device=attention_mask.device)], dim=1)
            past_key_values = outputs.past_key_values
            if next_token_id.item() == tokenizer.eos_token_id:
                break

    with torch.no_grad():
        custom_generate(input_ids, streamer, generation_args["max_new_tokens"], generation_args["temperature"], generation_args["top_p"])

The error I get is:

    ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [10,0,0], thread: [63,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
    Exception in thread Thread-18 (generate):
    Traceback (most recent call last):
      File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
        self.run()
      File "/usr/lib/python3.10/threading.py", line 953, in run
        self._target(*self._args, **self._kwargs)
      File "/mnt/c/Users/User/Documents/EmpatheticChatBot/Inference-Server.py", line 130, in generate
        custom_generate(input_ids, streamer, generation_args["max_new_tokens"], generation_args["temperature"], generation_args["top_p"])
      File "/mnt/c/Users/User/Documents/EmpatheticChatBot/Inference-Server.py", line 108, in custom_generate
        next_token_id = torch.multinomial(next_token_probs,
    RuntimeError: CUDA error: device-side assert triggered
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

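The problem only appeared after I added top-p sampling.

I expected my sampling to work, since I have checked my code about 30 times. ChatGPT says this code is perfect, and the error is really hard to debug. My guess is that values are being filtered incorrectly or set to "bad" values.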

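Answer:

The problem is the indexing you do in this line:

    logits[sorted_indices[~sorted_indices_to_keep]] = float('-inf')

For reasons I will explain, this causes an index-out-of-bounds error. Out-of-bounds indexing is a common cause of `CUDA error: device-side assert triggered`.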

Consider the following:

    import torch

    torch.manual_seed(42)
    top_p = 0.2
    logits = torch.randn(8, 128)  # random logits

    # Sort the logits
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)

    # Compute cumulative probabilities
    cumulative_probs = torch.cumsum(torch.nn.functional.softmax(sorted_logits, dim=-1), dim=-1)

    # Apply the top-p threshold to the cumulative probabilities
    sorted_indices_to_keep = cumulative_probs <= top_p

    # Make sure at least one index is kept
    sorted_indices_to_keep[..., 0] = True

    # This is the problem: logits[sorted_indices[~sorted_indices_to_keep]] = float('-inf')
    print(logits.shape, sorted_indices[~sorted_indices_to_keep].shape)
    # > torch.Size([8, 128]) torch.Size([989])

When you index `sorted_indices[~sorted_indices_to_keep]`, both inputs have shape (8, 128), but the output has shape (989,) (the exact size depends on the random seed used for the dummy logits).

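This is because `sorted_indices_to_keep` has a different number of True values in each row. The indexing operation therefore cannot resolve the output into a clean 2D tensor in which every row has the same size. PyTorch handles this by returning a flattened 1D vector of the elements selected by the mask.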
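The flattening is easy to see on a tiny tensor. The following is a minimal sketch with made-up values (the 2x4 tensors and the names `keep` and `removed` are illustrative, not taken from the question's code):

```python
import torch

# Small stand-in for sorted_indices: 2 rows, 4 "vocabulary" entries each.
sorted_indices = torch.tensor([[2, 0, 3, 1],
                               [1, 3, 0, 2]])

# Row 0 keeps one entry and row 1 keeps two, so the rows are ragged after masking.
keep = torch.tensor([[True, False, False, False],
                     [True, True, False, False]])

# Boolean-mask indexing returns the selected elements as a flat 1D tensor.
removed = sorted_indices[~keep]
print(removed.shape)  # torch.Size([5])
print(removed)        # tensor([0, 3, 1, 0, 2])
```

The row structure is gone: nothing in `removed` records which row each vocabulary position came from.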

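This means that when you try to compute `logits[sorted_indices[~sorted_indices_to_keep]]`, you are using a long 1D tensor to index a small 2D tensor. A 1D index tensor selects along dimension 0, so the vocabulary positions it contains are checked against the batch size. If you run this on the CPU, you get an error like `IndexError: index 20 is out of bounds for dimension 0 with size 8`. When you run it on the GPU, you get the CUDA assert error instead.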
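Moving the failing operation to the CPU is often the quickest way to get a readable message. Here is a minimal sketch of the same failure mode, assuming a batch of 1 and a made-up vocabulary of 10 (the names `flat_indices` and `error_message` are illustrative):

```python
import torch

logits = torch.zeros(1, 10)              # batch of 1, vocabulary of 10, on the CPU
flat_indices = torch.tensor([3, 7, 9])   # vocabulary positions as a flat 1D tensor

error_message = None
try:
    # A 1D integer index selects along dimension 0 (the batch dimension),
    # so every value >= 1 is out of range for this tensor.
    logits[flat_indices] = float("-inf")
except IndexError as exc:
    error_message = str(exc)

print(error_message)
```

On the CPU the exception is raised at the offending line, whereas CUDA may report the assert at a later call, such as `torch.multinomial` in the traceback above.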

To fix this, use a scatter operation. Something like this:

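    def top_p_filtering(logits, top_p, shift_indices=True, debug=False):
        """Filter logits using top-p (nucleus) sampling."""
        # Sort logits in descending order and get the sorted indices
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        # Compute the cumulative probabilities of the sorted logits
        cumulative_probs = torch.cumsum(torch.nn.functional.softmax(sorted_logits, dim=-1), dim=-1)
        # Create a mask for the tokens to keep
        sorted_indices_to_keep = cumulative_probs <= top_p

        # Optional: shift the mask one position to the right. This also keeps the
        # first token that crosses the top_p threshold. Skip this step to ensure
        # all token probabilities are strictly below the top_p threshold.
        if shift_indices:
            sorted_indices_to_keep[..., 1:] = sorted_indices_to_keep[..., :-1].clone()

        # Make sure at least one token is kept (the first one, which has the highest logit)
        sorted_indices_to_keep[..., 0] = True

        # Use scatter to build the top-p mask in the original token order
        mask = sorted_indices_to_keep.scatter(dim=1, index=sorted_indices, src=sorted_indices_to_keep)

        # Optional debug check that top_p is respected.
        # Note that the probabilities must be computed before masking, because
        # applying softmax after masking renormalizes the distribution to sum to 1.
        if debug:
            probs = torch.nn.functional.softmax(logits, dim=-1)
            probs[~mask] = 0
            print(probs.sum(-1))

        # Use the mask to set the filtered logit values to -inf
        logits[~mask] = float('-inf')
        return logits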
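To see what the scatter step does in isolation, here is a minimal sketch on a 2x4 tensor. The keep mask is hard-coded rather than derived from a `top_p` threshold, and `keep_sorted`, `filtered`, and `next_token` are illustrative names:

```python
import torch

logits = torch.tensor([[1.0, 3.0, 0.0, 2.0],
                       [0.5, 0.0, 4.0, 1.0]])
sorted_logits, sorted_indices = torch.sort(logits, descending=True)

# In sorted order: keep the top 2 entries of row 0 and only the top entry of row 1.
keep_sorted = torch.tensor([[True, True, False, False],
                            [True, False, False, False]])

# scatter writes keep_sorted[i, j] to position sorted_indices[i, j] of row i,
# mapping the mask from sorted order back to the original token order.
mask = torch.zeros_like(keep_sorted).scatter(dim=1, index=sorted_indices, src=keep_sorted)
print(mask)

# Masked-out tokens get -inf logits, so softmax assigns them zero probability.
filtered = logits.masked_fill(~mask, float("-inf"))
probs = torch.softmax(filtered, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
print(filtered)
print(next_token)
```

Because the mask has the same shape as `logits`, each row is filtered independently and no index ever refers to the wrong dimension. In row 1 only token 2 survives, so sampling always returns it for that row.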
