Systems Engineering and Electronics ›› 2026, Vol. 48 ›› Issue (1): 34-43. doi: 10.12305/j.issn.1001-506X.2026.01.04

• Electronic Technology •

Text-to-image generation based on double-attention generative adversarial network

Zhenxing ZHANG, Rennong YANG, Yonglin LI, Jialiang ZUO, Liping HU, Shuangyan CHEN

  1. School of Air Traffic Control and Navigation, Air Force Engineering University, Xi’an 710051, China
  • Received: 2024-04-03  Online: 2026-01-25  Published: 2026-02-11
  • Contact: Yonglin LI  E-mail: 2207621676@qq.com
  • About the authors:
    ZHANG Zhenxing (b. 1993), male, lecturer, Ph.D.; research interests: artificial intelligence, machine learning, neural networks
    YANG Rennong (b. 1969), male, professor and doctoral supervisor, M.S.; research interests: intelligent algorithms, machine learning
    ZUO Jialiang (b. 1989), male, associate professor, Ph.D.; research interests: intelligent algorithms, machine learning
    HU Liping (b. 1988), male, lecturer, Ph.D.; research interests: intelligent algorithms, machine learning
    CHEN Shuangyan (b. 1989), female, lecturer, M.S.; research interests: intelligent algorithms, machine learning
  • Funding: National Natural Science Foundation of China Youth Fund (71501184, 62106284); Natural Science Foundation of Shaanxi Province (2021JQ370); Xi’an Association for Science and Technology Young Talent Support Program (0959202513098)


Abstract:

To address three prominent issues in multi-stage text-to-image generation methods (prolonged convergence time from training multiple neural networks, neglect of the image quality produced by early-stage generators, and the need to train multiple discriminators), a text-to-image generation model based on a double-attention generative adversarial network (DoubleGAN) is proposed. DoubleGAN incorporates both channel and pixel attention mechanisms, using sentence vectors to guide the generator to focus on the channels and pixels most closely associated with the textual content. A conditionally adaptive instance-level layer normalization method is also introduced, which adjusts the variation of shapes and textures according to linguistic information, thereby significantly enhancing visual-semantic alignment and stabilizing training. In addition, a visual loss is adopted to boost image resolution, ensuring that the generated images have vivid shapes and perceptually uniform color distributions. Experimental results show that DoubleGAN achieves excellent performance, raising the inception score (IS) on the Caltech-UCSD Birds-200-2011 (CUB) dataset from 4.75 to 4.97, which indicates practical application value.
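The sentence-conditioned channel and pixel attention described in the abstract can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' implementation: the projection matrices `Wc` and `Wp`, the sigmoid gating, and all tensor shapes are hypothetical stand-ins for learned layers that map a sentence vector to per-channel and per-pixel gates.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, sent, Wc):
    """Rescale each channel by a gate predicted from the sentence vector.

    feat: (C, H, W) feature map; sent: (D,) sentence embedding;
    Wc: (C, D) projection (a random stand-in for a learned layer).
    """
    gate = sigmoid(Wc @ sent)              # (C,) one gate per channel
    return feat * gate[:, None, None]

def pixel_attention(feat, sent, Wp):
    """Rescale each spatial location by its affinity to the sentence.

    Wp: (C, D) projects the sentence into feature space; each pixel's
    feature is scored against that projection and gated with a sigmoid.
    """
    C = feat.shape[0]
    q = Wp @ sent                          # (C,) sentence query
    scores = np.einsum('c,chw->hw', q, feat) / np.sqrt(C)
    gate = sigmoid(scores)                 # (H, W) one gate per pixel
    return feat * gate[None, :, :]

rng = np.random.default_rng(0)
feat = rng.standard_normal((64, 16, 16))   # generator feature map
sent = rng.standard_normal(256)            # sentence embedding
Wc = 0.01 * rng.standard_normal((64, 256))
Wp = 0.01 * rng.standard_normal((64, 256))
out = pixel_attention(channel_attention(feat, sent, Wc), sent, Wp)
print(out.shape)  # (64, 16, 16)
```

Because both gates lie in (0, 1), the two stages only attenuate features: channels and pixels weakly related to the sentence are suppressed while relevant ones pass through nearly unchanged.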

Key words: text-to-image synthesis, generative adversarial network, attention mechanism, single-stage architecture
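The inception score (IS) used to evaluate the model is computed from class posteriors p(y|x) produced by a pretrained Inception-v3 classifier over the generated images. The following is a minimal NumPy sketch of the metric itself; the classifier is omitted and the toy inputs are illustrative only.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS = exp( E_x [ KL( p(y|x) || p(y) ) ] ).

    probs: (N, K) class posteriors for N generated images; p(y) is the
    marginal over the set. Higher is better: sharp per-image posteriors
    (confident classes) plus a spread-out marginal (diverse classes).
    """
    p_y = probs.mean(axis=0, keepdims=True)                        # (1, K)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident one-hot posteriors spread evenly over K classes reach the
# maximum score K; identical flat posteriors give the minimum of 1.0.
print(inception_score(np.eye(10)))           # ~ 10.0
print(inception_score(np.full((10, 10), 0.1)))  # ~ 1.0
```

On this scale, the reported improvement from 4.75 to 4.97 on CUB means the generated birds are, on average, both more confidently classifiable and more varied across classes.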

CLC number: