CANN Fmod Operator Design Document

FmodScalar / FmodTensor Operator Design Document


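Requirement background

Requirement source

This requirement comes from a community task for contributing open-source Ascend operators.
Taking the built-in TBE Mod operator in Ascend CANN as the reference, implement functionally equivalent aclnnFmodScalar and aclnnFmodTensor operators in Ascend C for Atlas A2 / Atlas A3 series products.
The deliverables are the operator project code, a README, a self-verification report, the design document, and several sets of aclnn invocation test code.

Background

The Fmod operator computes an element-wise modulo. Its semantics are the remainder of division truncated toward 0, not the remainder of floor division. The comment in the TBE source reads: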

truncate_mod(x, y) = x - truncate_div(x, y) * y

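where:

x corresponds to self / input_x;
y corresponds to other / input_y;
truncate_div(x, y) is the quotient of x / y truncated toward 0;
the output has the same sign as x, or is 0.

The formula is: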

out = self - trunc(self / other) * other

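When self / other is greater than or equal to 0, trunc(self / other) equals floor(self / other); when self / other is less than 0, trunc(self / other) equals ceil(self / other). The TBE implementation therefore combines vmin, vmax, floor, and ceil to truncate toward 0.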
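
To make the difference between truncated and floored remainders concrete, here is a minimal Python sketch. The helper name truncate_mod is chosen for illustration after the TBE comment and is not part of any CANN API. The standard-library math.fmod follows the same truncation-toward-0 convention, while Python's % operator on floats follows floor division, so the two disagree whenever the operands have opposite signs.

```python
import math


def truncate_mod(x: float, y: float) -> float:
    """Remainder of truncated division: x - trunc(x / y) * y."""
    return x - math.trunc(x / y) * y


# Values are chosen to be exactly representable in binary floating point.
for x, y in [(5.5, 2.0), (-5.5, 2.0), (5.5, -2.0), (-5.5, -2.0)]:
    print(f"x={x:5} y={y:5} truncate_mod={truncate_mod(x, y):5} "
          f"math.fmod={math.fmod(x, y):5} floor-style x % y={x % y:5}")
```

The truncate_mod and math.fmod columns carry the sign of x, whereas the x % y column carries the sign of y.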

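This task covers two aclnn interfaces:

aclnnFmodScalar
- self is the input Tensor;
- other is a Scalar;
- the shape of out matches self.
aclnnFmodTensor
- self is an input Tensor;
- other is an input Tensor;
- the shape of out matches self;
- other must either be broadcastable to self or have the same shape as self, in which case the computation is element-wise.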

Operator implementation information

TBE reference implementation
- kernel implementation:
  - /usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/impl/ops_legacy/dynamic/mod.py
- operator prototype:
  - /usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_proto/inc/ops_proto_math.h
- operator information library:
  - /usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/config/ascend910b/aic-ascend910b-ops-info-legacy.json
Open-source repository paths
- Repository: https://gitcode.com/cann/ops-math
- Reference sample: https://gitcode.com/cann/ops-math/tree/master/math/mod
- Target merge path: experimental/math
Target hardware and language
- Target hardware: Atlas A2 training series products / Atlas A3 series products
- Development language: Ascend C
- Supported format: ND
- Supported dtypes: BFLOAT16, FLOAT16, FLOAT32, INT16

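Analysis of the existing TBE operator

How the TBE operator provides its support

According to mod.py, the TBE version supports dynamic shape, broadcasting, fusion, and bfloat16 as follows:

@register_operator_compute("Mod", op_mode="dynamic", support_fusion=True, support_bfp16=True)
def mod_compute(input_x, input_y, output_z, kernel_name="mod"):
    ...

The TBE entry function calls:

ins = classify([input_x, input_y], OpPatternMode.ELEWISE_WITH_BROADCAST)

to classify the inputs under ELEWISE_WITH_BROADCAST. For each classification result it then creates TVM placeholders, builds the compute graph, and auto-schedules.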

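The dtypes supported by the TBE source are:

check_list = ("float16", "float32", "int8", "uint8", "int32", "bfloat16")

The target dtypes of this task are:

BFLOAT16, FLOAT16, FLOAT32, INT16

The two lists differ: INT16 is not in the TBE check_list, and int8, uint8, and int32 are not required by this task. The Ascend C design therefore follows the task specification for dtypes, while reusing the core computation semantics and precision-upcast strategy of TBE.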

TBE parameter validation and shape handling

The TBE entry function mod(input_x, input_y, output_z, kernel_name="mod") proceeds as follows:

@para_check.check_op_params checks the two inputs, the one output, and kernel_name.
shape_util.compare_tensor_dict_key(input_x, input_y, "dtype") requires both inputs to have the same dtype.
para_check.check_dtype(input_dtype, check_list, param_name="input_x") checks that the input dtype is in the supported list.
classify([input_x, input_y], OpPatternMode.ELEWISE_WITH_BROADCAST) classifies the dynamic-shape and broadcast scenarios.
For each classification result:
- call shape_util.variable_shape([input_x_dict, input_y_dict]) to obtain the dynamic shapes;
- call shape_util.refine_shapes_for_broadcast(shape_x, shape_y) to normalize the broadcast shapes;
- create the two placeholders data_x and data_y;
- call mod_compute to build the compute graph;
- call tbe.auto_schedule(res) to schedule automatically;
- finally, compile the operator with tbe.build(schedule, config).

TBE core computation logic

The core logic of the TBE mod_compute is:

Read the input shapes and dtype.
If the input dtype is not float32 and the platform supports float32 vdiv, upcast input_x and input_y to float32.
If the two input shapes differ, call shape_util.broadcast_shapes to obtain the broadcast shape and apply tbe.broadcast to both inputs.
Compute the division:

data_div = tbe.vdiv(input_x, input_y)

Build a zero constant vector:

data_zero = tbe.broadcast(tvm.const(0, "float32"), shape_broadcast, "float32")

Split data_div into its non-positive and non-negative parts:

data_div_min = tbe.vmin(data_div, data_zero)
data_div_max = tbe.vmax(data_div, data_zero)

Apply floor to the non-negative part and ceil to the non-positive part:

data_div_max_floor = tbe.floor(data_div_max, "float32")
data_div_min_ceil = tbe.ceil(data_div_min, "float32")

Add the two parts to get the quotient truncated toward 0:

data_div_res = tbe.vadd(data_div_max_floor, data_div_min_ceil)

Cast the truncated quotient to the dtype of input_y, then multiply and subtract:

data_div_res = tbe.cast_to(data_div_res, input_y.dtype.lower())
data_mul = tbe.vmul(data_div_res, input_y)
res = tbe.vsub(input_x, data_mul)

If the inputs were upcast earlier, cast the result back to the original dtype.

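Notes on the TBE computation semantics

The core TBE formula can be written equivalently as:

q_pos = floor(max(self / other, 0))
q_neg = ceil(min(self / other, 0))
q = q_pos + q_neg
out = self - q * other

For any value of self / other, at least one of max(self / other, 0) and min(self / other, 0) is 0, so at most one of q_pos and q_neg is nonzero and their sum is the quotient rounded toward 0. The implementation is therefore equivalent to:

out = self - trunc(self / other) * other

where trunc denotes truncation toward 0.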
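
The equivalence can be checked numerically with a short scalar sketch. It mirrors the q_pos / q_neg decomposition above using only the Python standard library; the function names are illustrative and do not correspond to TBE or Ascend C APIs, and the real operator works on vectors rather than single values.

```python
import math


def trunc_div_tbe_style(x: float, y: float) -> int:
    """Truncated quotient built from floor(max(d, 0)) + ceil(min(d, 0))."""
    div = x / y
    q_pos = math.floor(max(div, 0.0))  # nonzero only when div > 0
    q_neg = math.ceil(min(div, 0.0))   # nonzero only when div < 0
    return q_pos + q_neg


def fmod_tbe_style(x: float, y: float) -> float:
    """out = self - q * other, with q from the decomposition above."""
    return x - trunc_div_tbe_style(x, y) * y


samples_x = [-7.5, -5.5, -1.0, 0.0, 1.0, 5.5, 7.5]
samples_y = [-2.0, -0.5, 0.5, 2.0]
mismatches = [
    (x, y)
    for x in samples_x
    for y in samples_y
    if trunc_div_tbe_style(x, y) != math.trunc(x / y)
]
print("mismatches against math.trunc:", mismatches)
```

On these samples the mismatch list is empty, which is consistent with the decomposition being an alternative way of writing trunc.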

TBE operator implementation flowchart

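Detailed requirement design

Enablement

The operator is invoked through the aclnn interfaces:

aclnnFmodScalar
- Element-wise fmod of a Tensor and a Scalar.
aclnnFmodTensor
- Element-wise fmod of a Tensor and a Tensor.
- The Tensor version must support both same-shape and valid broadcast scenarios.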

Operator prototype

aclnnFmodScalar

Parameter	Category	Description	Data type	Data format	Shape	Non-contiguous Tensor
self	Input Tensor	Input Tensor to apply Fmod to	BFLOAT16, FLOAT16, FLOAT32, INT16	ND	0 to 8 dimensions	√
other	Input Scalar	Scalar to apply Fmod with	BFLOAT16, FLOAT16, FLOAT32, INT16	-	-	-
out	Output Tensor	Fmod result; shape matches self	BFLOAT16, FLOAT16, FLOAT32, INT16	ND	Same as self	Per interface requirements

aclnnFmodTensor

Parameter	Category	Description	Data type	Data format	Shape	Non-contiguous Tensor
self	Input Tensor	Left operand of Fmod	BFLOAT16, FLOAT16, FLOAT32, INT16	ND	0 to 8 dimensions	√
other	Input Tensor	Right operand of Fmod	BFLOAT16, FLOAT16, FLOAT32, INT16	ND	0 to 8 dimensions	√
out	Output Tensor	Fmod result; shape matches self	BFLOAT16, FLOAT16, FLOAT32, INT16	ND	Same as self	√

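Functional and semantic constraints

Output shape
- FmodScalar: out.shape == self.shape.
- FmodTensor: out.shape == self.shape; other must be broadcastable to self.shape or identical to self.shape.
dtype support
- BFLOAT16, FLOAT16, FLOAT32, and INT16 are supported.
- The dtypes of self, other, and out must satisfy the type-inference and convertibility rules on the aclnn prototype side.
- The kernel implementation preferentially writes results back in the output dtype.
Data format
- ND is supported.
Shape range
- 0 to 8 dimensions are supported.
- Generalization scenarios to cover include 0-dimensional scalar Tensors, empty Tensors, small shapes, large shapes, broadcast shapes, and non-contiguous Tensors.
Division by 0
- This design defines no extra division-by-0 rule inside the kernel; the behavior follows the actual behavior of the built-in CANN/TBE operator.
- Testing must cover the other == 0 scenario and match the TBE results.
Negative numbers
- The remainder must be computed with truncation toward 0, so that results match the TBE Mod semantics.
- For example:
  - fmod(5.3, 2.0) = 1.3
  - fmod(-5.3, 2.0) = -1.3
  - fmod(5.3, -2.0) = 1.3
  - fmod(-5.3, -2.0) = -1.3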

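Overall requirement design

Overall implementation approach

This operator is an element-wise binary operator. The implementation is split into host-side tiling and kernel-side computation:

host side
- Parse the input shape, stride, dtype, data volume, and broadcast relationship;
- Determine whether each Tensor is contiguous;
- Select the tilingKey based on shape, broadcast, and contiguity;
- Compute the core-partition strategy, per-core processing length, tail-block length, and UB tile size;
- Write the required tiling parameters into the tiling data.
kernel side
- Select the contiguous, broadcast, non-contiguous, or scalar path according to tilingKey;
- Assign each block its GM data range;
- Copy the input data into UB;
- Compute fmod on the current tile;
- Copy the result out to GM.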

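host-side design

1. Input parameter parsing

The host side needs to parse the following:

the shape, stride, storage offset, dtype, and element count of self;
the shape, stride, storage offset, and dtype of other, or the scalar value;
the shape, stride, storage offset, and dtype of out;
the broadcast relationship between self and other;
whether self, other, and out are contiguous Tensors;
the element size in bytes for the current dtype;
the total element count totalLength = NumElements(self).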

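2. Core-partition strategy

Prefer a strategy in which all cores take part in the computation.
Let coreNum be the number of available AI Cores and totalLength the total number of elements.
When totalLength >= coreNum, partition as follows:

blockLength = ceil(totalLength / coreNum)
usedCoreNum = coreNum

When totalLength < coreNum:

blockLength = 1
usedCoreNum = totalLength

Each core derives its starting logical index from blockIdx:

start = blockIdx * blockLength
end = min(start + blockLength, totalLength)

Because blockLength is rounded up, trailing cores can end up with start >= totalLength and must skip the computation. For example, totalLength = 10 and coreNum = 8 give blockLength = 2, so only the first 5 cores receive data.

For an empty Tensor, the host side can either return directly or generate a tiling with totalLength = 0 so that the kernel exits early.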
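
The partition rules above can be sketched as a small host-side helper. This is an illustrative Python model rather than Ascend C tiling code; the function name split_across_cores and its return format are assumptions made for the example. It also drops cores whose range would be empty, a case that arises when the rounded-up blockLength covers all elements before the last core is reached.

```python
def split_across_cores(total_length: int, core_num: int):
    """Return (block_length, ranges), one (start, end) range per busy core."""
    if total_length == 0:
        # Empty Tensor: nothing to schedule, the kernel can exit early.
        return 0, []
    if total_length >= core_num:
        block_length = (total_length + core_num - 1) // core_num  # ceil division
        used_core_num = core_num
    else:
        block_length = 1
        used_core_num = total_length
    ranges = []
    for block_idx in range(used_core_num):
        start = block_idx * block_length
        end = min(start + block_length, total_length)
        if start < end:  # skip trailing cores that received no elements
            ranges.append((start, end))
    return block_length, ranges


print(split_across_cores(10, 8))
print(split_across_cores(3, 8))
```

With 10 elements and 8 cores, blockLength is 2 and only five ranges are produced, so a real implementation would either lower the launched core count or let the idle cores return immediately.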

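3. UB tiling strategy

The Fmod computation needs at least the following UB buffers:

xLocal: the self tile;
yLocal: the other tile, or the tile produced by broadcasting the scalar;
tmpDiv: the intermediate division result;
tmpMin/tmpMax: intermediates of the truncated-quotient computation, or a reused temporary buffer;
tmpFloor/tmpCeil: the floor / ceil results, or a reused temporary buffer;
qLocal: the truncated quotient;
outLocal: the output tile.

To reduce UB usage, buffer reuse is recommended:

tmpDiv holds x / y.
tmpMax can be reused once floor(max(div, 0)) has been computed.
tmpMin can be reused once ceil(min(div, 0)) has been computed.
qLocal = tmpMax + tmpMin.
outLocal = xLocal - qLocal * yLocal.

For FLOAT16, BFLOAT16, and INT16, if the computation is upcast to FLOAT32, the intermediate UB buffers must be planned as float and the tile length reduced accordingly.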
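
The effect of upcasting on tile length can be estimated with a back-of-the-envelope calculation. The sketch below is only a planning aid under stated assumptions: the 192 KiB UB budget, the count of three I/O buffers and three float32 temporaries, and the 32-byte alignment are all illustrative values, not figures taken from this design or queried from the platform. A real implementation would obtain the UB size from the platform information and account for double buffering.

```python
def max_tile_length(ub_bytes: int, io_elem_bytes: int,
                    io_buffers: int = 3, fp32_buffers: int = 3,
                    align_bytes: int = 32) -> int:
    """Largest tile length (in elements) whose buffers fit in ub_bytes.

    io_buffers:   buffers kept in the native dtype (e.g. x, y, out).
    fp32_buffers: temporaries kept in float32 (e.g. tmpDiv, tmpMax, tmpMin).
    """
    bytes_per_elem = io_buffers * io_elem_bytes + fp32_buffers * 4
    raw = ub_bytes // bytes_per_elem
    align_elems = align_bytes // io_elem_bytes  # elements per aligned chunk
    return raw // align_elems * align_elems     # round down to alignment


UB_BYTES = 192 * 1024  # assumed budget, not a queried value

print("2-byte dtype tile length:", max_tile_length(UB_BYTES, 2))
print("4-byte dtype tile length:", max_tile_length(UB_BYTES, 4))
```

Under these assumptions a 2-byte dtype does not get twice the tile length of float32, because the float32 temporaries dominate the per-element footprint.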

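4. tilingKey planning

The following tilingKey values are suggested:

tilingKey	Scenario	Description
0	Scalar + contiguous self/out	High-performance `FmodScalar` path: other is a scalar, self/out are contiguous
1	Tensor + same-shape contiguous inputs and output	High-performance `FmodTensor` path: self/other/out are contiguous and have the same shape
2	Tensor + broadcast, contiguous inputs and output	other is broadcastable to self; logically contiguous access
3	Scalar + non-contiguous self or out	Generic stride path for the scalar scenario
4	Tensor + non-contiguous input or output	Generic stride / broadcast path for the tensor scenario

To limit risk, the first version may implement tilingKey = 0/1/2 first and add the non-contiguous paths afterwards. However, task acceptance requires coverage of non-contiguous Tensors, so the final version must either implement tilingKey = 3/4 or rely on a framework-side mechanism that makes Tensors contiguous to guarantee correctness in non-contiguous scenarios.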
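
The table maps directly onto a small decision function. The following Python sketch shows one way the host side might pick the key; the function name and its three boolean parameters are hypothetical, and a real tiling function would derive them from the parsed shapes and strides.

```python
def select_tiling_key(other_is_scalar: bool, same_shape: bool,
                      all_contiguous: bool) -> int:
    """Map the scenario flags to the tilingKey values in the table above."""
    if other_is_scalar:
        # Keys 0 and 3: scalar path, contiguous or generic stride.
        return 0 if all_contiguous else 3
    if not all_contiguous:
        # Key 4: generic stride / broadcast path for tensors.
        return 4
    # Keys 1 and 2: contiguous tensors, same shape or broadcast.
    return 1 if same_shape else 2


for flags in [(True, True, True), (False, True, True), (False, False, True),
              (True, True, False), (False, False, False)]:
    print(flags, "->", select_tiling_key(*flags))
```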

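5. Handling non-contiguous Tensors

For a non-contiguous Tensor, a logical index maps to a physical GM offset as follows:

physical_offset = storage_offset + sum(index[d] * stride[d])

where:

index[d] is the logical multi-dimensional index;
stride[d] is the physical stride of that dimension;
storage_offset is the starting offset of the Tensor.

The host side writes the 0 to 8 dimensional shapes and strides into the tiling data. The kernel side recovers the multi-dimensional index from the linear logical index and then computes the physical addresses of self, other, and out.

For broadcast dimensions:

if other_shape[d] == 1 and self_shape[d] != 1:
    other_index[d] = 0
else:
    other_index[d] = self_index[d]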
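
The index mapping described above can be modeled in a few lines of Python. This is a reference model for checking the arithmetic, not kernel code; the helper names are illustrative, and the sketch assumes other_shape has already been padded with leading 1s to the same rank as self_shape.

```python
def unravel_index(linear: int, shape):
    """Recover the row-major multi-dimensional index from a linear index."""
    index = [0] * len(shape)
    for d in range(len(shape) - 1, -1, -1):
        index[d] = linear % shape[d]
        linear //= shape[d]
    return index


def physical_offset(index, stride, storage_offset: int = 0) -> int:
    """physical_offset = storage_offset + sum(index[d] * stride[d])."""
    return storage_offset + sum(i * s for i, s in zip(index, stride))


def broadcast_index(self_index, self_shape, other_shape):
    """Pin the index to 0 on every dimension where other is broadcast."""
    return [0 if (o == 1 and s != 1) else i
            for i, s, o in zip(self_index, self_shape, other_shape)]


# self: logical shape (2, 3) stored transposed, i.e. strides (1, 2).
# other: shape (1, 3), contiguous, strides (3, 1), broadcast along dim 0.
self_idx = unravel_index(4, (2, 3))
other_idx = broadcast_index(self_idx, (2, 3), (1, 3))
print(self_idx, physical_offset(self_idx, (1, 2)))
print(other_idx, physical_offset(other_idx, (3, 1)))
```

For linear index 4 the logical index is [1, 1]; self is read from offset 3 in its transposed storage, while other is read from offset 1 because its first dimension is broadcast.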

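kernel-side design

High-performance path for contiguous Tensors

In the contiguous path, each core processes one contiguous logical range.

Scalar path

In the Scalar path, other is an aclScalar. The host side writes the scalar value into the tiling data or passes it in as a kernel argument. The kernel side converts the scalar to the computation type CalcT and broadcasts it into a vector in UB:

yLocal = broadcast(otherScalar)
outLocal = fmod(xLocal, yLocal)

The Scalar path does not read other from GM, so it usually outperforms the Tensor path.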
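
As a reference model for the Scalar path, the tile loop can be written in plain Python. Lists stand in for GM and UB buffers and math.fmod stands in for the vectorized computation; the function name and the tile_len parameter are assumptions for illustration. The sketch mainly shows how the tail tile is handled and how the scalar is broadcast per tile instead of being read from memory.

```python
import math


def fmod_scalar_tiled(x, other_scalar: float, tile_len: int):
    """Process x in tiles of tile_len elements, broadcasting the scalar."""
    out = [0.0] * len(x)
    for begin in range(0, len(x), tile_len):
        end = min(begin + tile_len, len(x))          # last tile may be shorter
        x_local = x[begin:end]                       # stands in for GM -> UB copy
        y_local = [other_scalar] * len(x_local)      # scalar broadcast in UB
        out_local = [math.fmod(a, b) for a, b in zip(x_local, y_local)]
        out[begin:end] = out_local                   # stands in for UB -> GM copy
    return out


print(fmod_scalar_tiled([5.5, -5.5, 4.0, -1.0, 7.0], 2.0, tile_len=2))
```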

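Tensor broadcast path

The Tensor broadcast path computes the read position of other from the logical index. For contiguous broadcast scenarios, the following optimizations apply:

If other is 0-dimensional or has a single element, the path degenerates to a Scalar-like path.
If broadcasting happens only in the outer dimensions and the inner contiguous dimensions match self, data can be copied in batches of inner contiguous blocks.
If broadcasting happens in the inner dimensions, split by dimension to avoid the performance drop caused by element-by-element scalar copies.
For general 0 to 8 dimensional broadcasts, implement a generic index-mapping path as the correctness fallback.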
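
One possible way to tell these four cases apart is to look at which dimensions are broadcast. The Python sketch below is an interpretation for illustration only: the category names are invented here, the outer/inner criteria (broadcast dimensions form a pure prefix or a pure suffix) are an assumption about how the cases could be separated, and other_shape is assumed to be already padded to the rank of self_shape.

```python
from math import prod


def classify_broadcast(self_shape, other_shape) -> str:
    """Classify how other is broadcast against self (same-rank shapes)."""
    if prod(other_shape) == 1:
        return "scalar_like"          # 0-dimensional or single element
    # True on every dimension where other is stretched to match self.
    mask = [o == 1 and s != 1 for s, o in zip(self_shape, other_shape)]
    if not any(mask):
        return "same_shape"
    if mask == sorted(mask, reverse=True):
        return "outer_broadcast"      # broadcast dims come first, inner dims match
    if mask == sorted(mask):
        return "inner_broadcast"      # broadcast dims are the innermost ones
    return "generic"                  # mixed pattern: use index mapping


for shapes in [((4, 3), (1, 1)), ((4, 3), (1, 3)), ((4, 3), (4, 1)),
               ((2, 3, 4), (1, 3, 1)), ((4, 3), (4, 3))]:
    print(shapes, "->", classify_broadcast(*shapes))
```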

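Generic non-contiguous path

The non-contiguous path puts correctness first.

To improve performance, contiguous sub-segments can be identified and copied in batches as small blocks; where no contiguous block can be formed, use DataCopyPad / Gather-like scalar reads and writes, or loop-based processing.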

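Ascend C implementation flowchart

Differences between the Ascend C and TBE implementations

TBE handles broadcasting and scheduling automatically through classify(..., ELEWISE_WITH_BROADCAST) and auto_schedule; Ascend C must implement tiling, core partitioning, UB allocation, data movement, and broadcast index mapping explicitly.
In TBE, tbe.broadcast is an operation in the logical expression graph; in Ascend C, the implementation chooses among scalar broadcast, contiguous copy, per-dimension mapping, or the generic non-contiguous path depending on the scenario.
TBE upcasts non-float32 types to float32 before computing, depending on platform capability; Ascend C implements the corresponding strategy explicitly through template parameters and Cast instructions.
In TBE, floor/ceil are described by DSL expressions; Ascend C must confirm that the corresponding Vector API is available and, where a dtype is not supported, go through a float32 intermediate type.
In TBE, non-contiguous Tensors are usually handled jointly by upper-layer graph compilation and the Tensor description; the Ascend C implementation must explicitly handle stride, storage offset, and non-contiguous output write-back.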

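Precision design

Precision target

Precision must meet the default thresholds of the AscendOpTest tool and align with the results of the built-in TBE operator.

Performance design

Performance target

When all cores take part in the computation, performance must be no lower than 95% of the original TBE operator.
For small shapes that cannot meet this target, scenarios under 10 us may instead use performance simulation charts and analysis conclusions to show that the Ascend C implementation matches or outperforms TBE.

Supported hardware

Product	Supported
Atlas A2 training series products	√
Atlas A3 series products	√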

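Suggested README usage examples

The README should provide the following:

a description of the operator's functionality;
the formula;
the supported dtype / shape / format;
an aclnnFmodScalar invocation example;
an aclnnFmodTensor invocation example;
build instructions;
instructions for running the tests;
a precision and performance comparison.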
