Systems Engineering and Electronics ›› 2025, Vol. 47 ›› Issue (10): 3168-3178. doi: 10.12305/j.issn.1001-506X.2025.10.05

• Electronic Technology •

Local feature encode-decoding based 3D target detection for autonomous driving

Kai SHAO1,2,3,*, Guang WU1, Yan LIANG1, Xingfa XI1, Linjia GAO1

  1. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  2. Chongqing Key Laboratory of Mobile Communications Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  3. Engineering Research Center of Mobile Communications of the Ministry of Education, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Received: 2024-10-15 Online: 2025-10-25 Published: 2025-10-23
  • Contact: Kai SHAO E-mail: shaokai@cqupt.edu.cn

Abstract:

To address the issues of multi-level feature extraction and multi-scale feature context dependency in three-dimensional target detection for autonomous driving, a local feature encode-decoding region-based convolutional neural network (LFED-RCNN) is proposed, built on a point-voxel detection framework and integrating multiple techniques. Firstly, a convolutional and encode-decoding (CED) backbone is proposed for the 3D feature extraction stage, combining a convolutional network with a Transformer encoder-decoder structure. Within the CED backbone, a deep extra-downsampling convolutional network (EDSNet) is designed to extract multi-level 3D features, and a local encode-decoding network is designed to model feature correlations and integrate deep and shallow features, improving the model's ability to capture the features of complex foreground targets. Secondly, a position encoding module is designed to encode the positions of two-dimensional features in the bird's eye view (BEV), establishing long-range dependencies and improving detection accuracy. The proposed LFED-RCNN scheme is validated on the KITTI and ONCE datasets. Across the difficulty levels of the KITTI dataset, the mean average precision (mAP) for the three detection classes (cars, pedestrians, and cyclists) reaches 82.95%, 57.48%, and 72.14%, respectively, and the proposed method performs particularly well in the hard mode.
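As a rough illustration of the two mechanisms summarized in the abstract, the PyTorch sketch below shows a fixed sinusoidal position encoding for a 2D BEV feature grid and a small Transformer encoder-decoder in which shallow (fine) feature tokens attend to deep (coarse) ones. This is not the authors' LFED-RCNN implementation; all module names, token layouts, and sizes are illustrative assumptions.

import math
import torch
import torch.nn as nn


def bev_position_encoding(h: int, w: int, dim: int) -> torch.Tensor:
    """Fixed 2D sinusoidal encoding for an h x w BEV grid, returned as (h*w, dim)."""
    assert dim % 4 == 0, "dim must be divisible by 4 (sin/cos per axis)"
    pe = torch.zeros(h, w, dim)
    d = dim // 4  # channels per (sin|cos) per axis
    div = torch.exp(torch.arange(d) * (-math.log(10000.0) / d))
    ys = torch.arange(h).unsqueeze(1) * div  # (h, d)
    xs = torch.arange(w).unsqueeze(1) * div  # (w, d)
    pe[..., 0:d]     = torch.sin(ys)[:, None, :].expand(h, w, d)
    pe[..., d:2*d]   = torch.cos(ys)[:, None, :].expand(h, w, d)
    pe[..., 2*d:3*d] = torch.sin(xs)[None, :, :].expand(h, w, d)
    pe[..., 3*d:]    = torch.cos(xs)[None, :, :].expand(h, w, d)
    return pe.view(h * w, dim)


class LocalEncodeDecodeFusion(nn.Module):
    """Fuse deep (coarse) and shallow (fine) feature tokens with a small
    Transformer: deep tokens are self-encoded, shallow tokens attend to them."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=1)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=1)

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        memory = self.encoder(deep)           # model correlations among deep features
        return self.decoder(shallow, memory)  # shallow tokens query deep context


if __name__ == "__main__":
    B, dim = 2, 64
    deep = torch.randn(B, 16 * 16, dim)     # coarse level, 16x16 grid flattened
    shallow = torch.randn(B, 32 * 32, dim)  # fine level, 32x32 grid flattened
    shallow = shallow + bev_position_encoding(32, 32, dim)  # inject BEV positions
    fused = LocalEncodeDecodeFusion(dim)(shallow, deep)
    print(fused.shape)  # torch.Size([2, 1024, 64])

Decoder-style cross-attention is one plausible way to let fine-resolution tokens pull in long-range context from the coarse level; the actual fusion in LFED-RCNN may differ in detail.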

Key words: three-dimensional (3D) target detection, point cloud, Transformer, encoder, decoder, receptive field
