Systems Engineering and Electronics ›› 2023, Vol. 45 ›› Issue (12): 3915-3923. doi: 10.12305/j.issn.1001-506X.2023.12.21

• Systems Engineering •

Fine-grained cross-modal retrieval algorithm for IETM with a fused attention mechanism

Yichen ZHAI, Jiaojiao GU, Fuqiang ZONG, Wenzhi JIANG

  1. Coastal Defense College, Naval Aviation University, Yantai 264001, Shandong, China
  • Received: 2022-04-11 Online: 2023-11-25 Published: 2023-12-05
  • Corresponding author: Jiaojiao GU
  • About the authors: ZHAI Yichen (b. 1998), male, M.S. candidate; research interests: deep learning
    GU Jiaojiao (b. 1986), male, lecturer, Ph.D.; research interests: deep learning
    ZONG Fuqiang (b. 1993), male, teaching assistant, M.S.; research interests: weapons and equipment informatization
    JIANG Wenzhi (b. 1964), male, professor, Ph.D.; research interests: weapon systems and their application



Abstract:

The interactive electronic technical manual (IETM) is one of the key technologies for improving the informatization and intelligence of equipment support. To address its single retrieval modality, a fine-grained cross-modal retrieval algorithm with a fused attention mechanism is proposed, taking the image-text descriptions in IETM data as the research object. Since the data contain many schematic drawings with limited color, the feature extraction module uses a Vision Transformer model and a Transformer encoder to obtain the global and local features of images and texts, respectively. An attention mechanism is then applied to mine fine-grained information both across and within the image and text modalities, adversarial training on the text is added to enhance the model's generalization ability, and a cross-modal joint loss function is used to constrain the model. Validated on the Pascal Sentence dataset and a self-built dataset, the proposed method achieves mean average precision (mAP) values of 0.964 and 0.959, which are 0.248 and 0.214 higher, respectively, than those of the baseline model, deep supervised cross-modal retrieval (DSCMR).
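To make the described pipeline concrete, the following is a minimal PyTorch-style sketch of such a dual-branch retrieval model. It is not the authors' implementation: the Vision Transformer patch features are stood in for by a linear projection over precomputed patch vectors, the dimensions, vocabulary size, class count, and the joint loss (label classification plus a ranking term) are illustrative assumptions, and the text adversarial training step is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalRetriever(nn.Module):
    """Dual-branch image/text encoder with cross-attention fusion (illustrative only)."""

    def __init__(self, img_dim=768, vocab_size=30000, embed_dim=512, n_heads=8, n_classes=20):
        super().__init__()
        # Image branch: in the paper a Vision Transformer supplies patch-level features;
        # here a linear projection of precomputed patch features stands in for it.
        self.img_proj = nn.Linear(img_dim, embed_dim)
        # Text branch: token embedding + Transformer encoder gives token-level (local) features.
        self.txt_embed = nn.Embedding(vocab_size, embed_dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads, batch_first=True)
        self.txt_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Cross-modal attention: text tokens attend to image patches for fine-grained alignment.
        self.cross_attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        # Shared classifier keeps both modalities consistent with the semantic labels.
        self.classifier = nn.Linear(embed_dim, n_classes)

    def forward(self, img_patches, txt_tokens):
        img = self.img_proj(img_patches)                      # (B, P, D) local image features
        txt = self.txt_encoder(self.txt_embed(txt_tokens))    # (B, T, D) local text features
        txt_fused, _ = self.cross_attn(txt, img, img)         # text queries image patches
        img_global = F.normalize(img.mean(dim=1), dim=-1)     # global image embedding
        txt_global = F.normalize(txt_fused.mean(dim=1), dim=-1)  # global text embedding
        return img_global, txt_global


def joint_loss(img_emb, txt_emb, labels, classifier, margin=0.2):
    """Toy stand-in for a cross-modal joint loss: label prediction in both
    modalities plus a hinge ranking term pulling matched image-text pairs together."""
    cls_loss = F.cross_entropy(classifier(img_emb), labels) + \
               F.cross_entropy(classifier(txt_emb), labels)
    sim = img_emb @ txt_emb.t()                               # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                             # similarity of matched pairs
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)    # ignore the matched diagonal
    rank_loss = (F.relu(margin + sim - pos) * mask).mean()
    return cls_loss + rank_loss
```

In this sketch, retrieval would simply rank candidates by cosine similarity between the normalized embeddings; the paper additionally mines attention within each modality and perturbs the text features adversarially, which are not shown here.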

Key words: interactive electronic technical manual, image-text retrieval, cross-modal, attention mechanism

CLC number: