Systems Engineering and Electronics ›› 2022, Vol. 44 ›› Issue (2): 410-419. doi: 10.12305/j.issn.1001-506X.2022.02.07

• Electronic Technology •

Target tracking network based on dual-modal interactive fusion under attention mechanism

Yunxiang YAO, Ying CHEN*   

  1. School of Internet of Things Engineering, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Received: 2021-01-28 Online: 2022-02-18 Published: 2022-02-24
  • Contact: Ying CHEN
  • About the authors: Yunxiang YAO (1997—), male, master's student; research interests: infrared-visible fusion tracking. Ying CHEN (1976—), female, professor, Ph.D.; research interests: computer vision and pattern recognition.
  • Funding: National Natural Science Foundation of China (61573168)

Abstract:

To address challenges that current target tracking struggles with, such as low illumination, motion blur, and fast target motion, a dual-modal interactive fusion tracking network for infrared and visible images under spatial-channel attention is proposed. First, hierarchical features are extracted from the infrared and visible images through three backbone convolutional layers and reduced in dimension to a uniform resolution; the three layers of features are then concatenated to form each modality's feature. Next, the multi-modal features pass through the designed spatial-channel self-attention module and cross-modal interactive attention module, which make each modality focus on global spatial features and high-response channels, improving the complementarity of the dual-modal information; the interacted features are concatenated to obtain the fused feature. Finally, the fused feature is fed into three fully connected layers to complete the target tracking task. Experimental results on RGBT234, currently the largest RGB-Thermal (RGB-T) tracking dataset, show that the proposed network effectively extracts dual-modal interactive features and improves tracking accuracy: its precision/success rate exceed those of the baseline network by 5.3% and 4.2%, respectively.

Key words: RGB-Thermal (RGB-T), infrared and visible, object tracking, deep learning, attention fusion
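The pipeline the abstract describes (per-modality spatial and channel attention, cross-modal re-weighting, then concatenation of the two modality features) can be sketched roughly as below. This is a minimal illustrative sketch, not the paper's implementation: the function names are invented here, and simple softmax-based weightings stand in for the paper's learned attention modules and backbone convolutions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(feat):
    # feat: (C, H, W); weight each channel by its global-average response,
    # rescaled so the mean weight is ~1 (illustrative stand-in for a learned module)
    gap = feat.mean(axis=(1, 2))                      # (C,)
    w = softmax(gap) * feat.shape[0]
    return feat * w[:, None, None]

def spatial_attention(feat):
    # attention map over H*W positions from the channel-mean response
    m = feat.mean(axis=0)                             # (H, W)
    a = softmax(m.reshape(-1)).reshape(m.shape) * m.size
    return feat * a[None, :, :]

def self_attention(feat):
    # spatial-channel self-attention: channel weighting then spatial weighting
    return spatial_attention(channel_attention(feat))

def cross_modal_fuse(f_rgb, f_ir):
    # cross-modal interaction: each modality is re-weighted by the OTHER
    # modality's channel response, then the two are concatenated
    g_rgb = softmax(f_ir.mean(axis=(1, 2))) * f_rgb.shape[0]
    g_ir = softmax(f_rgb.mean(axis=(1, 2))) * f_ir.shape[0]
    return np.concatenate([f_rgb * g_rgb[:, None, None],
                           f_ir * g_ir[:, None, None]], axis=0)

# toy modality features standing in for concatenated backbone outputs
rng = np.random.default_rng(0)
f_rgb = rng.standard_normal((8, 5, 5))
f_ir = rng.standard_normal((8, 5, 5))
fused = cross_modal_fuse(self_attention(f_rgb), self_attention(f_ir))
print(fused.shape)  # → (16, 5, 5)
```

In the paper the fused feature would then be flattened and passed through three fully connected layers to produce the tracking output; here the sketch stops at the fused tensor, whose channel dimension is the sum of the two modality channels.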

CLC number: