Systems Engineering and Electronics ›› 2025, Vol. 47 ›› Issue (10): 3168-3178. doi: 10.12305/j.issn.1001-506X.2025.10.05

• Electronic Technology •

Local feature encode-decoding based 3D target detection of autonomous driving

Kai SHAO1,2,3,*, Guang WU1, Yan LIANG1, Xingfa XI1, Linjia GAO1

  1. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  2. Chongqing Key Laboratory of Mobile Communications Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  3. Engineering Research Center of Mobile Communications of the Ministry of Education, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Received: 2024-10-15 Online: 2025-10-25 Published: 2025-10-23
  • Contact: Kai SHAO E-mail: shaokai@cqupt.edu.cn
  • About the authors: Guang WU (2000-), male, master's student; main research interests: deep learning and 3D target detection
    Yan LIANG (1977-), female, senior engineer, master's degree; main research interests: computer vision and AI for the Internet of Things
    Xingfa XI (2003-), male; main research interest: computer vision
    Linjia GAO (2003-), male; main research interest: computer vision

Abstract:

To address multi-level feature extraction and the contextual dependency of multi-scale features in three-dimensional (3D) target detection for autonomous driving, a local feature encode-decoding region-based convolutional neural network (LFED-RCNN) that integrates multiple techniques is proposed on top of a point-voxel detection framework. First, a convolutional and encode-decoding backbone (CED Backbone), which combines a convolutional network with a Transformer encode-decoding structure, is proposed for the 3D feature extraction stage. Within the CED Backbone, the deep extra downsampling convolutional network (EDSNet) extracts multi-level 3D features, while the local encode-decoding network models the correlations among features and fuses deep and shallow features, improving the model's ability to capture the features of foreground targets against complex backgrounds. Second, a position encoding module is designed to encode the positions of two-dimensional features in the bird's-eye view (BEV), establishing long-range dependencies and improving detection accuracy. LFED-RCNN is validated on the KITTI and ONCE datasets. At the hard difficulty level of the KITTI dataset, it reaches a mean average precision (mAP) of 82.95%, 57.48%, and 72.14% for cars, pedestrians, and cyclists, respectively. The experimental results show that the proposed method performs particularly well at the hard difficulty level.

Key words: three-dimensional (3D) target detection, point cloud, Transformer, encoder, decoder, receptive field
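
The abstract does not say what form the position encoding module takes. As a minimal sketch only, assuming a fixed 2D sinusoidal encoding in the style of the original Transformer, the idea of attaching position information to a bird's-eye-view feature map could look like the following. The function names, the `temperature` parameter, and the channel layout (first half for rows, second half for columns) are hypothetical choices for illustration and are not taken from the paper. Nested lists are used instead of tensors to keep the example dependency-free.

```python
import math


def sincos_position_encoding_2d(height, width, channels, temperature=10000.0):
    """Return a [height][width][channels] nested list of fixed 2D sinusoidal codes.

    Assumed layout: the first half of the channels encodes the row index (y),
    the second half encodes the column index (x), each as interleaved
    sine/cosine pairs with geometrically decreasing frequencies.
    """
    if channels % 4 != 0:
        raise ValueError("channels must be divisible by 4")
    half = channels // 2  # number of channels devoted to each axis
    # One frequency per (sin, cos) pair on an axis.
    freqs = [temperature ** (-2.0 * i / half) for i in range(half // 2)]

    def encode_axis(pos):
        code = []
        for f in freqs:
            code.append(math.sin(pos * f))
            code.append(math.cos(pos * f))
        return code

    row_codes = [encode_axis(y) for y in range(height)]
    col_codes = [encode_axis(x) for x in range(width)]
    return [
        [row_codes[y] + col_codes[x] for x in range(width)]
        for y in range(height)
    ]


def add_position_encoding(bev_features, temperature=10000.0):
    """Add the encoding element-wise to a [H][W][C] BEV feature map."""
    height = len(bev_features)
    width = len(bev_features[0])
    channels = len(bev_features[0][0])
    pe = sincos_position_encoding_2d(height, width, channels, temperature)
    return [
        [
            [f + p for f, p in zip(bev_features[y][x], pe[y][x])]
            for x in range(width)
        ]
        for y in range(height)
    ]
```

Because every grid cell receives a distinct code, an attention layer applied to the flattened feature map can distinguish cells by location, which is one common way to let a model relate distant regions. Whether the paper's module is fixed or learned is not stated in the abstract.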

CLC number: