PyTorch and CUDA Graphs in Depth: Using CUDA Graphs to Eliminate CPU Launch Overhead in Small-Model Inference

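In AI inference acceleration, attention usually goes to FLOPS or compute density. In latency-sensitive scenarios, however (especially with small models, or large models built from many sequential layers), CPU-side kernel launch overhead often becomes the main performance bottleneck. Each time PyTorch dispatches an operation to the GPU, the CPU spends time on the order of microseconds enqueueing a CUDA kernel. When each kernel does very little work but there are many kernels, this accumulated CPU overhead can exceed the time the GPU actually spends computing.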
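To make the accumulation concrete, here is a back-of-the-envelope sketch. The function name, the simplified timing model, and all of the numbers (kernel count, per-launch cost, per-kernel GPU time) are hypothetical values chosen for illustration, not measurements.

```python
def launch_bound_latency_us(n_kernels, launch_us, gpu_us_per_kernel):
    """Rough latency of one forward pass when kernels are launched one by one.

    Simplified model (an assumption): the CPU enqueues kernels serially and the
    GPU executes them serially, so latency is bounded by the slower of the two.
    """
    cpu_time = n_kernels * launch_us
    gpu_time = n_kernels * gpu_us_per_kernel
    return max(cpu_time, gpu_time)


# Hypothetical workload: 50 kernels, 5 us to launch each, 1 us of GPU work each.
n_kernels, launch_us, gpu_us = 50, 5.0, 1.0
total = launch_bound_latency_us(n_kernels, launch_us, gpu_us)
gpu_share = n_kernels * gpu_us / total
print(f"latency ~ {total:.0f} us, GPU busy {gpu_share:.0%} of the time")
# prints: latency ~ 250 us, GPU busy 20% of the time
```

Under these assumed numbers the GPU sits idle for most of each forward pass, which is the situation CUDA Graphs target.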

Introduction to CUDA Graphs

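CUDA Graphs is a mechanism provided by NVIDIA that lets developers record a sequence of CUDA operations (memory operations, kernel launches, and so on) into a static, replayable graph. Once the graph has been captured, the CPU issues a single "launch graph" call per replay, and the GPU then executes all of the recorded operations asynchronously with respect to the CPU. This removes most of the CPU overhead that would otherwise be paid for every individual kernel launch.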

This tutorial shows how to use torch.cuda.CUDAGraph() in PyTorch to speed up a small inference workload.

1. Environment Setup

Make sure your PyTorch version supports CUDA Graphs (PyTorch 1.10 or later is generally recommended).
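A version check along these lines can catch an outdated install early. This is a minimal sketch: the helper name supports_cuda_graph_api is hypothetical, and it only compares the major and minor numbers against the 1.10 recommendation above.

```python
import re


def supports_cuda_graph_api(version: str, minimum=(1, 10)) -> bool:
    """Return True if a PyTorch version string is at least `minimum`."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


print(supports_cuda_graph_api("1.9.1"))        # False
print(supports_cuda_graph_api("2.1.0+cu118"))  # True
```

In a real script you would pass torch.__version__ to the helper. Checking hasattr(torch.cuda, "CUDAGraph") is another simple way to confirm that the API is present.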

import torch
import time

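# Check that a CUDA device is available
if not torch.cuda.is_available():
    raise RuntimeError("CUDA not available. Please run this on a compatible GPU.")

device = torch.device("cuda")

# Benchmark parameters
N_RUNS = 5000   # number of timed iterations, large enough to amplify the overhead
MAT_SIZE = 128  # matrix dimension, kept small so compute stays cheap

print(f"Device: {torch.cuda.get_device_name(0)}")
print(f"Timed iterations: {N_RUNS}")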

2. Defining a Small-Compute Model

We define a very simple model made of several small consecutive operations, so that CPU launch overhead becomes the bottleneck.

class SmallWorkload(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Simulate a small model with several consecutive layers
        self.m1 = torch.nn.Linear(MAT_SIZE, MAT_SIZE).to(device)
        self.m2 = torch.nn.Linear(MAT_SIZE, MAT_SIZE).to(device)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        x = self.m1(x)
        x = self.relu(x)
        x = self.m2(x)
        return x

model = SmallWorkload().eval()

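# Preallocate static input and output tensors (a CUDA graph requires
# fixed memory addresses for the tensors it reads from and writes to)
static_input = torch.randn(1, MAT_SIZE, device=device)
# Preallocated output buffer that the graph writes into
static_output = torch.empty_like(static_input)

# Warm-up runs (lazy CUDA/cuBLAS initialization) before any timing or capture
with torch.no_grad():
    for _ in range(50):
        model(static_input)
    torch.cuda.synchronize()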

3. Baseline (No Graph) Performance Test

We first measure performance under PyTorch's standard eager dispatch.

with torch.no_grad():
    torch.cuda.synchronize()
    start_time = time.perf_counter()

    for _ in range(N_RUNS):
        model(static_input)

    torch.cuda.synchronize()
    end_time = time.perf_counter()
    standard_time = end_time - start_time
    print("\n--- Performance comparison ---")
    print(f"1. Baseline total time ({N_RUNS} runs): {standard_time * 1000:.2f} ms")

4. Capturing and Replaying a CUDA Graph

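Important restrictions: every tensor the graph reads from or writes to must stay at a fixed memory address across replays, which is why the input and output are preallocated and updated in place. The captured work must also be static: shapes are fixed, and the capture cannot contain CPU-GPU synchronization or control flow that depends on GPU data (such as an if/else on a tensor value), because only the path taken during capture is recorded. Intermediate tensors allocated during capture are fine; PyTorch serves them from a graph-private memory pool. To capture the operation sequence, we use the torch.cuda.graph context manager.

# 4a. Create the CUDAGraph object
graph = torch.cuda.CUDAGraph()

# 4b. Set up a side stream for capture (capture must not run on the default stream)
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())

# 4c. Capture the graph (under no_grad, to match the baseline inference run)
with torch.cuda.stream(s), torch.no_grad():
    with torch.cuda.graph(graph):
        # Run one forward pass. The graph records the memory addresses of the
        # input and output tensors used here.
        captured_output = model(static_input)
        # Write the final result into the preallocated static output tensor
        static_output.copy_(captured_output)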

# 4d. Replay the graph and measure performance
torch.cuda.synchronize()
start_time_graph = time.perf_counter()

for _ in range(N_RUNS):
    graph.replay()

torch.cuda.synchronize()
end_time_graph = time.perf_counter()
graph_time = end_time_graph - start_time_graph

print(f"2. CUDA Graph replay total time ({N_RUNS} runs): {graph_time * 1000:.2f} ms")

speedup = standard_time / graph_time
print(f"Speedup: {speedup:.2f}x")
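The benchmark replays the graph on the same input every time. In real inference each call carries new data, which has to be copied into the captured input buffer before replaying. The helper below is a minimal sketch of that pattern; the name replay_with_new_input is hypothetical, and it assumes the graph, static_input, and static_output objects from section 4.

```python
def replay_with_new_input(graph, static_input, static_output, new_input):
    """Run one graphed forward pass on fresh data.

    Assumes `graph` was captured reading from `static_input` and writing to
    `static_output`, and that `new_input` has the same shape and dtype.
    """
    static_input.copy_(new_input)  # overwrite the captured input buffer in place
    graph.replay()                 # re-launch the recorded kernels
    return static_output.clone()   # copy out before the next replay overwrites it
```

With the objects from section 4 this would be called as result = replay_with_new_input(graph, static_input, static_output, torch.randn(1, MAT_SIZE, device=device)). The clone() keeps the result valid after later replays; it can be dropped if the output is consumed before the next replay.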

5. Analyzing the Results

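A model like this one does very little compute (tens of thousands of floating-point operations per forward pass) but still issues several kernel launches. Most of its per-iteration latency therefore comes from CPU-side dispatch and launch overhead rather than from the GPU's matrix multiplications. Capturing and replaying the work as a CUDA graph can yield speedups of several times, and figures in the 5x to 20x range are possible when launch overhead accounts for nearly all of the latency. The actual number depends on the GPU, CPU, driver, and PyTorch version, so rely on the ratio printed by the benchmark rather than on a fixed figure.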

Key takeaways:

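Static requirement: once captured, a graph is static. Input shapes and dtypes must stay fixed, and data-dependent control flow cannot be captured.
Memory management: capture and replay must use the same memory addresses for the input and output tensors, so these are preallocated (like static_input) and new data is copied into them in place.
Where it fits: CUDA Graphs work best for inference workloads that run for a very short time (for example, under a millisecond) and are executed repeatedly.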
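Because a captured graph is tied to one input shape, one common workaround when a few different shapes occur is to keep a separate graph per shape. The sketch below illustrates the idea. GraphCache and capture_fn are hypothetical names, and capture_fn is assumed to perform the capture steps from section 4 for a given shape and return the graph together with its static buffers.

```python
class GraphCache:
    """Keep one captured graph per input shape (hypothetical helper)."""

    def __init__(self, capture_fn):
        # capture_fn(shape) is assumed to capture a graph for that shape and
        # return a (graph, static_input, static_output) tuple.
        self._capture_fn = capture_fn
        self._entries = {}

    def run(self, new_input):
        shape = tuple(new_input.shape)
        if shape not in self._entries:
            # First time this shape is seen: pay the capture cost once.
            self._entries[shape] = self._capture_fn(shape)
        graph, static_input, static_output = self._entries[shape]
        static_input.copy_(new_input)  # refresh the captured input buffer
        graph.replay()                 # re-launch the recorded kernels
        return static_output.clone()   # detach result from the static buffer
```

Each cached graph holds its own static buffers, so this trades GPU memory for flexibility and is only reasonable when the set of distinct shapes is small.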
