Systems Engineering and Electronics ›› 2023, Vol. 45 ›› Issue (12): 3915-3923. doi: 10.12305/j.issn.1001-506X.2023.12.21

• Systems Engineering •

Fine-grained cross-modal retrieval algorithm for IETM with a fused attention mechanism

Yichen ZHAI, Jiaojiao GU, Fuqiang ZONG, Wenzhi JIANG

  1. Coastal Defense College, Naval Aviation University, Yantai 264001, Shandong, China
  • Received: 2022-04-11 Online: 2023-11-25 Published: 2023-12-05
  • Corresponding author: Jiaojiao GU
  • About the authors: ZHAI Yichen (b. 1998), male, M.S. candidate; research interests: deep learning
    GU Jiaojiao (b. 1986), male, lecturer, Ph.D.; research interests: deep learning
    ZONG Fuqiang (b. 1993), male, teaching assistant, M.S.; research interests: weapons and equipment informatization
    JIANG Wenzhi (b. 1964), male, professor, Ph.D.; research interests: weapon systems and their application



Abstract:

The interactive electronic technical manual (IETM) is one of the key technologies for improving the informatization and intelligence of equipment support. To address its single retrieval modality, a fine-grained cross-modal retrieval algorithm with a fused attention mechanism is proposed, taking the image-text descriptions in IETM data as the research object. Since the data contain many schematic drawings with limited color, the feature extraction module uses a Vision Transformer model and a Transformer encoder to obtain the global and local features of images and texts, respectively. An attention mechanism is then applied to mine fine-grained information both across and within the image and text modalities, adversarial training on the text is added to enhance the model's generalization ability, and a cross-modal joint loss function is used to constrain the model. Validated on the Pascal Sentence dataset and a self-built dataset, the proposed method achieves mean average precision (mAP) values of 0.964 and 0.959, which are 0.248 and 0.214 higher, respectively, than those of the baseline model, deep supervised cross-modal retrieval (DSCMR).
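To make the described pipeline concrete, the following is a minimal PyTorch-style sketch of such a dual-branch retrieval model. It is not the authors' implementation: the Vision Transformer patch features are stood in for by a linear projection over precomputed patch vectors, the dimensions, vocabulary size, class count, and the joint loss (label classification plus a ranking term) are illustrative assumptions, and the text adversarial training step is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalRetriever(nn.Module):
    """Dual-branch image/text encoder with cross-attention fusion (illustrative only)."""

    def __init__(self, img_dim=768, vocab_size=30000, embed_dim=512, n_heads=8, n_classes=20):
        super().__init__()
        # Image branch: in the paper a Vision Transformer supplies patch-level features;
        # here a linear projection of precomputed patch features stands in for it.
        self.img_proj = nn.Linear(img_dim, embed_dim)
        # Text branch: token embedding + Transformer encoder gives token-level (local) features.
        self.txt_embed = nn.Embedding(vocab_size, embed_dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads, batch_first=True)
        self.txt_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Cross-modal attention: text tokens attend to image patches for fine-grained alignment.
        self.cross_attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        # Shared classifier keeps both modalities consistent with the semantic labels.
        self.classifier = nn.Linear(embed_dim, n_classes)

    def forward(self, img_patches, txt_tokens):
        img = self.img_proj(img_patches)                      # (B, P, D) local image features
        txt = self.txt_encoder(self.txt_embed(txt_tokens))    # (B, T, D) local text features
        txt_fused, _ = self.cross_attn(txt, img, img)         # text queries image patches
        img_global = F.normalize(img.mean(dim=1), dim=-1)     # global image embedding
        txt_global = F.normalize(txt_fused.mean(dim=1), dim=-1)  # global text embedding
        return img_global, txt_global


def joint_loss(img_emb, txt_emb, labels, classifier, margin=0.2):
    """Toy stand-in for a cross-modal joint loss: label prediction in both
    modalities plus a hinge ranking term pulling matched image-text pairs together."""
    cls_loss = F.cross_entropy(classifier(img_emb), labels) + \
               F.cross_entropy(classifier(txt_emb), labels)
    sim = img_emb @ txt_emb.t()                               # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                             # similarity of matched pairs
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)    # ignore the matched diagonal
    rank_loss = (F.relu(margin + sim - pos) * mask).mean()
    return cls_loss + rank_loss
```

In this sketch, retrieval would simply rank candidates by cosine similarity between the normalized embeddings; the paper additionally mines attention within each modality and perturbs the text features adversarially, which are not shown here.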

Key words: interactive electronic technical manual, image-text retrieval, cross-modal, attention mechanism

CLC number: