aclrtResetDevice returns error 507007
aclrtResetDevice() returns error 507007. The log at ~/ascend/log/debug/plog/plog-PID_yyyyMMddhhmmssxxx.log shows the following:
[ERROR] RUNTIME(PID,main):yyyy-MM-dd-hh:mm:ss.xxx.yyy [api_impl.cc:1777]9277 DeviceReset:DeviceReset context release failed, userDevId=0, retCode=0x7070003
[ERROR] RUNTIME(PID,main):yyyy-mm-dd-hh:mm:ss.xxx.yyy [api_impl.cc:1787]9277 DeviceReset:report error module_type=0, module_name=EE9999
[ERROR] RUNTIME(PID,main):yyyy-mm-dd-hh:mm:ss.xxx.yyy [api_impl.cc:1787]9277 DeviceReset:DeviceReset failed, deviceId=0, retCode=0x7070003
[ERROR] RUNTIME(PID,main):yyyy-mm-dd-hh:mm:ss.xxx.yyy [logger.cc:692]9277 DeviceReset:Device reset failed, device_id=0.
[ERROR] RUNTIME(PID,main):yyyy-mm-dd-hh:mm:ss.xxx.yyy [api_c.cc:1567]9277 rtDeviceReset:ErrCode=507007, desc=[context release error], InnerCode=0x7070003
[ERROR] RUNTIME(PID,main):yyyy-mm-dd-hh:mm:ss.xxx.yyy [error_message_manage.cc:49]9277 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(PID,main):yyyy-mm-dd-hh:mm:ss.xxx.yyy [error_message_manage.cc:49]9277 FuncErrorReason:rtDeviceReset execute failed, reason=[context release error]
[ERROR] ASCENDCL(PID,main):yyyy-mm-dd-hh:mm:ss.xxx.yyy [device.cpp:115]9277 aclrtResetDevice: reset device 0 failed, runtime result = 507007.
Cause:
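According to the flowchart in the Ascend documentation, if you did not call aclrtSetDevice() explicitly and instead created the Context and Stream manually, you should not call aclrtResetDevice().
The documentation does not spell out what happens when aclrtResetDevice() is called without a prior aclrtSetDevice(), but the consequence is serious: the reset fails with the error above, which suggests that Ascend CL does not handle this case internally.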
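One way to avoid the mismatch is to remember how the device was opened and mirror that at teardown. The following is only a minimal sketch of that bookkeeping: RecordingRuntime and DeviceSession are hypothetical names, and the runtime object is a stand-in that records call names instead of calling Ascend CL. The aclrt* names mirror the C API, but the argument lists here are simplified and should not be read as the real signatures.

```python
class RecordingRuntime:
    """Hypothetical stand-in for the ACL runtime: it only records call names."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any aclrt* name becomes a function that logs itself and returns 0,
        # which stands for success in this sketch.
        def call(*args):
            self.calls.append(name)
            return 0
        return call


class DeviceSession:
    """Remembers how the device was opened so that teardown can mirror it."""

    def __init__(self, rt, device_id):
        self.rt = rt
        self.device_id = device_id
        self.used_set_device = False
        self.has_explicit_context = False

    def open_with_set_device(self):
        self.rt.aclrtSetDevice(self.device_id)
        self.used_set_device = True

    def open_with_explicit_context(self):
        # Context and Stream are created by hand, without aclrtSetDevice().
        self.rt.aclrtCreateContext(self.device_id)
        self.rt.aclrtCreateStream()
        self.has_explicit_context = True

    def close(self):
        if self.has_explicit_context:
            self.rt.aclrtDestroyStream()
            self.rt.aclrtDestroyContext()
        # Reset only when the device was opened with aclrtSetDevice().
        if self.used_set_device:
            self.rt.aclrtResetDevice(self.device_id)


if __name__ == "__main__":
    rt = RecordingRuntime()
    session = DeviceSession(rt, 0)
    session.open_with_explicit_context()
    session.close()
    print(rt.calls)
```

With the explicit Context path, the recorded calls contain no aclrtResetDevice, which is the rule described above.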
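Python call into a C++ library returns error 107002
Error 107002 means the context is null.
The .so library was loaded from Python with ctypes. When Python entered interactive console mode and read input from the console, the conversation was handled on a different thread, so the next call from Python back into the .so library ran on a thread other than the one that had created the context.
Therefore, every time control returns to the C++ library, first set the device again with aclrtSetDevice() and then bind the context with aclrtSetCurrentContext(). Only then can the previously created stream be reused.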
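The effect can be reproduced without any Ascend hardware by modelling the "current context" as per-thread state. The sketch below is an illustration under that assumption: threading.local plays the role of the runtime's per-thread binding, and set_current_context, run_on_stream, call_in_thread and rebind_then_run are hypothetical helpers, not Ascend CL functions.

```python
import threading

# Models the runtime's per-thread "current context" (an assumption of this sketch).
_current = threading.local()

CONTEXT_NULL = 107002  # error code reported when no context is bound


def set_current_context(ctx):
    """Stand-in for aclrtSetDevice() followed by aclrtSetCurrentContext()."""
    _current.ctx = ctx


def run_on_stream():
    """Stand-in for any library call that needs a current context."""
    ctx = getattr(_current, "ctx", None)
    return CONTEXT_NULL if ctx is None else 0


def call_in_thread(fn):
    """Run fn on a fresh thread and return its result."""
    result = []
    worker = threading.Thread(target=lambda: result.append(fn()))
    worker.start()
    worker.join()
    return result[0]


def rebind_then_run(ctx):
    """Re-bind the context on the calling thread before doing any work."""
    set_current_context(ctx)
    return run_on_stream()


if __name__ == "__main__":
    set_current_context("ctx0")                              # bound on the main thread
    print(run_on_stream())                                   # same thread: succeeds
    print(call_in_thread(run_on_stream))                     # other thread: no context
    print(call_in_thread(lambda: rebind_then_run("ctx0")))   # re-bound first: succeeds
```

In the real library, the equivalent of rebind_then_run would sit at the top of every exported entry point that Python may call from an arbitrary thread.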
aclrtMemcpy returns error 507899
aclrtMemcpy() returns error 507899. The log at ~/ascend/log/debug/plog/plog-PID_yyyyMMddhhmmssxxx.log shows the following:
[ERROR] DRV(PID,main):yyyy-MM-dd-hh:mm:ss.xxx.yyy [drv_log_user.c:621][ascend][curpid:PID,PID][drv][devmm][share_log_read_in_single_module]Pcie fill bar2dma fail. (src_dev_id=6; dst_dev_id=7; ret=-22) Make dma node-size check fail, please check addr size. (total_len=0; count=8192; idx_dma=0; did=6; dst_did=7; src=0x12c080013000; dst=0x12c180016000; idx_src=0; from_num=1; idx_dst=0; to_num=1) Cp make dmanode list fail. (num=1; ret=-22; src=0x12c080013000; dst=0x12c180016000; count=8192) Memcpy error. (ret=-22; src=0x12c080013000; dst=0x12c180016000; count=8192; direction=3) Check alloced va. (hostpid=161983; va=0x12c080013000; start_va = 0x12c080000000; end_va = 0x12c080013fff) Check alloced va. (hostpid=161983; va=0x12c180016000; start_va = 0x12c180000000; end_va = 0x12c180016fff)
[ERROR] DRV(PID,main):yyyy-MM-dd-hh:mm:ss.xxx.yyy [devmm_svm.c:406][ascend][curpid:PID,PID][drv][devmm][devmm_copy_ioctl]<errno:22, 8> Ioctl error. (cmd=-1051177723; ret=8; dst=0x12c180016000; src=0x12c080013000; size=8192)
[ERROR] DRV(PID,main):yyyy-MM-dd-hh:mm:ss.xxx.yyy [devmm_virt_com_heap.c:1525][ascend][curpid:PID,PID][drv][devmm][devmm_print_svm_va_info]<errno:22, 8> Va info. (va=0x12c080013000; start=0x12c080013000; end=0x12c080015fff; module_name=APP; devid=0)
[ERROR] DRV(PID,main):yyyy-MM-dd-hh:mm:ss.xxx.yyy [devmm_virt_com_heap.c:1525][ascend][curpid:PID,PID][drv][devmm][devmm_print_svm_va_info]<errno:22, 8> Va info. (va=0x12c180016000; start=0x12c180016000; end=0x12c180018fff; module_name=APP; devid=1)
[ERROR] RUNTIME(PID,main):yyyy-MM-dd-hh:mm:ss.xxx.yyy [npu_driver.cc:2305]PID MemCopySync:[drv api] drvMemcpy failed: destMax=8192, size=8192(Byte), kind=3, devId=4294967295, drvRetCode=8!
[ERROR] RUNTIME(PID,main):yyyy-MM-dd-hh:mm:ss.xxx.yyy [api_error.cc:1199]PID MemCopySync:Memcopy sync failed, count=8192, kind=3.
[ERROR] RUNTIME(PID,main):yyyy-MM-dd-hh:mm:ss.xxx.yyy [api_c.cc:1193]PID rtMemcpy:ErrCode=507899, desc=[driver error:internal error], InnerCode=0x7020010
[ERROR] RUNTIME(PID,main):yyyy-MM-dd-hh:mm:ss.xxx.yyy [error_message_manage.cc:53]PID FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(PID,main):yyyy-MM-dd-hh:mm:ss.xxx.yyy [error_message_manage.cc:53]PID FuncErrorReason:rtMemcpy execute failed, reason=[driver error:internal error]
[ERROR] ASCENDCL(PID,main):yyyy-MM-dd-hh:mm:ss.xxx.yyy [memory.cpp:303]11110 aclrtMemcpy: synchronized memcpy failed, kind = 3, runtime result = 507899
If the two devices support copying to each other (aclrtDeviceCanAccessPeer() returns true), aclrtDeviceEnablePeerAccess() must be called on both devices, each naming the other as its peer. Otherwise the error above occurs.
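In the CANN 8.0 RC1 documentation, aclrtDeviceEnablePeerAccess() is called only once. The Ascend CL development documentation for the later 8.0 RC3 release has corrected this.
In actual testing, it was enough to switch only the Context before each aclrtDeviceEnablePeerAccess() call.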
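The call order for enabling access in both directions can be sketched as follows. This is an illustration only: RecordingRuntime is a hypothetical stand-in that logs calls, and its Python signatures are simplified (the C function aclrtDeviceCanAccessPeer() reports its result through an output parameter rather than a return value). The sketch makes each device current with aclrtSetDevice(); per the test result above, switching the Context should serve the same purpose.

```python
class RecordingRuntime:
    """Hypothetical stand-in for the ACL runtime that records every call."""

    def __init__(self, can_access=True):
        self.calls = []
        self.can_access = can_access

    def aclrtDeviceCanAccessPeer(self, device_id, peer_id):
        self.calls.append(("aclrtDeviceCanAccessPeer", device_id, peer_id))
        return self.can_access

    def aclrtSetDevice(self, device_id):
        self.calls.append(("aclrtSetDevice", device_id))
        return 0

    def aclrtDeviceEnablePeerAccess(self, peer_id, flags):
        self.calls.append(("aclrtDeviceEnablePeerAccess", peer_id, flags))
        return 0


def enable_peer_access_both_ways(rt, dev_a, dev_b):
    """Enable peer access from dev_a to dev_b and from dev_b to dev_a."""
    if not rt.aclrtDeviceCanAccessPeer(dev_a, dev_b):
        return False
    for current, peer in ((dev_a, dev_b), (dev_b, dev_a)):
        rt.aclrtSetDevice(current)                # make `current` the active device
        rt.aclrtDeviceEnablePeerAccess(peer, 0)   # flags passed as 0 in this sketch
    return True


if __name__ == "__main__":
    rt = RecordingRuntime()
    enable_peer_access_both_ways(rt, 0, 1)
    for call in rt.calls:
        print(call)
```

Enabling only one direction corresponds to the single call in the RC1 documentation, which is the setup that produced the 507899 error here.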