Text-to-image generation based on double-attention generative adversarial network

doi:10.12305/j.issn.1001-506X.2026.01.04

Abstract

Multi-stage text-to-image generation methods have three prominent problems: training several neural networks prolongs convergence, the architecture neglects the quality of the images produced by early-stage generators, and multiple discriminators must be trained. To address these problems, a text-to-image generation model based on a double-attention generative adversarial network (DoubleGAN) is proposed. DoubleGAN incorporates both channel and pixel attention mechanisms, which use sentence vectors to guide the generator toward the channels and pixels most closely associated with the text. A conditional adaptive instance-layer normalization (CAdaILN) method is also introduced. It adjusts the amount of change in shape and texture according to the linguistic information, which strengthens visual-semantic alignment and stabilizes training. In addition, a novel visual loss is adopted to improve image resolution, so that the generated images have vivid shapes and perceptually uniform color distributions. Experimental results show that DoubleGAN raises the inception score (IS) from 4.75 to 4.97 on the Caltech-UCSD Birds-200-2011 (CUB Bird) dataset, indicating its practical application value.
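The abstract does not give the normalization formula, so the following is only a minimal sketch of how a conditional adaptive instance-layer normalization (listed as CAdaILN in the ablation table) might work. It assumes a blend of instance-normalized and layer-normalized activations, followed by a per-channel scale and shift that would be predicted from the sentence vector. The function name `cada_iln`, the mixing ratio `rho`, and the inputs `gamma` and `beta` are all hypothetical, not taken from the paper.

```python
import math


def _mean_var(values):
    """Return the mean and (biased) variance of a flat list of floats."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var


def cada_iln(feature, gamma, beta, rho=0.5, eps=1e-5):
    """Blend instance and layer normalization, then apply a conditional affine.

    feature: list of C channels, each a flat list of H*W activations.
    gamma, beta: per-channel scale and shift; ASSUMED to come from a small
        network applied to the sentence vector (not modeled here).
    rho: ASSUMED mixing ratio in [0, 1]; 1 is pure instance norm,
        0 is pure layer norm.
    """
    # Layer statistics: one mean/variance over all channels and pixels.
    all_values = [v for channel in feature for v in channel]
    l_mean, l_var = _mean_var(all_values)

    out = []
    for c, channel in enumerate(feature):
        # Instance statistics: one mean/variance per channel.
        i_mean, i_var = _mean_var(channel)
        normalized = []
        for v in channel:
            x_in = (v - i_mean) / math.sqrt(i_var + eps)
            x_ln = (v - l_mean) / math.sqrt(l_var + eps)
            mixed = rho * x_in + (1.0 - rho) * x_ln
            # Text-conditioned scale and shift.
            normalized.append(gamma[c] * mixed + beta[c])
        out.append(normalized)
    return out


feature_map = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
result = cada_iln(feature_map, gamma=[2.0, 2.0], beta=[1.0, 1.0], rho=1.0)
```

In this sketch, `rho` controls how much per-channel statistics (often associated with texture) versus whole-layer statistics (often associated with shape) are normalized away. That is one plausible reading of "adjusting the amount of change in shape and texture", not a confirmed description of the authors' layer.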

Key words: text-to-image synthesis, generative adversarial network, attention mechanism, single-stage architecture

CLC Number: V 247

Zhenxing ZHANG, Rennong YANG, Yonglin LI, Jialiang ZUO, Liping HU, Shuangyan CHEN. Text-to-image generation based on double-attention generative adversarial network[J]. Systems Engineering and Electronics, 2026, 48(1): 34-43.


References

1	LI B W, QI X J, LUKASIEWICZ T, et al. Controllable text-to-image generation[C]//Proc. of the Advances in Neural Information Processing Systems, 2019: 2065–2075.
2	MA S, FU J, CHEN C W, et al. DA-GAN: instance-level image translation by deep attention generative adversarial networks[C]//Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 5657–5666.
3	QIAO T T, ZHANG J, XU D Q, et al. MirrorGAN: learning text-to-image generation by redescription[C]//Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2019: 1505–1514.
4	XU T, ZHANG P C, HUANG Q Y, et al. AttnGAN: fine-grained text to image generation with attentional generative adversarial networks[C]//Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 1316–1324.
5	YUAN M, PENG Y. Text-to-image synthesis via symmetrical distillation networks[C]//Proc. of the 26th ACM International Conference on Multimedia, 2018: 1407–1415.
6	ZHANG H, XU T, LI H S, et al. StackGAN++: realistic image synthesis with stacked generative adversarial networks[J]. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2018, 41(8): 1947–1962.
7	TAO M, TANG H, WU S S, et al. DF-GAN: deep fusion generative adversarial networks for text-to-image synthesis[EB/OL]. [2025-03-01]. https://arxiv.org/abs/2008.05865v1.
8	ZHANG Z, SCHOMAKER L. DiverGAN: an efficient and effective single-stage framework for diverse text-to-image generation[J]. Neurocomputing, 2022, 473: 182–198. doi: 10.1016/j.neucom.2021.12.005
9	IOFFE S, SZEGEDY C. Batch normalization: accelerating deep network training by reducing internal covariate shift[C]//Proc. of the International Conference on Machine Learning, 2015: 448–456.
10	WAH C, BRANSON S, WELINDER P, et al. The Caltech-UCSD Birds-200-2011 dataset[J]. California Institute of Technology, 2011: 16119123.
11	LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]//Proc. of the European Conference on Computer Vision, 2014: 740–755.
12	LIU R X, WU L, XIE Z G, et al. Auxiliary situation analysis for air defense system based on generative adversarial network[J]. Systems Engineering and Electronics, 2022, 44(8): 2522–2529.
23	MA L, MENG S J, WU Z J. Intention mining for civil aviation radiotelephony communication based on BERT and generative adversarial[J]. Systems Engineering and Electronics, 2024, 46(2): 740–750.
24	TIAN X X, SHI Z Q. Simulation data generation algorithm based on evolutional generative adversarial networks for command information system[J]. Systems Engineering and Electronics, 2021, 43(1): 163–170.
25	SHAO K, ZHU M M, WANG G Y. Modulation recognition method based on generative adversarial and convolutional neural network[J]. Systems Engineering and Electronics, 2022, 44(3): 1036–1043.
26	HU T. Research on text-to-image generation based on generative adversarial networks[D]. Hefei: University of Science and Technology of China, 2021.
13	KINGMA D P, BA J. Adam: a method for stochastic optimization[C]//Proc. of the International Conference on Learning Representations, 2015.
14	CHEN L, FANG Z H, MEI L Q. DSS signal generation algorithm based on GAN[J]. Systems Engineering and Electronics, 2023, 45(5): 1544–1552.
15	SALIMANS T, GOODFELLOW I, ZAREMBA W, et al. Improved techniques for training GANs[C]//Proc. of the Advances in Neural Information Processing Systems, 2016: 2234–2242.
16	SZEGEDY C, VANHOUCKE V, IOFFE S, et al. Rethinking the inception architecture for computer vision[C]//Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 2818–2826.
17	LI W B, ZHANG P C, ZHANG L, et al. Object driven text-to-image synthesis via adversarial training[C]//Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2019: 12174–12182.
18	ZHU M F, PAN P B, CHEN W, et al. DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis[C]//Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2019: 5802–5810.
19	ZHANG Z X, SCHOMAKER L. Optimizing and interpreting the latent space of the conditional text-to-image GANs[J]. Neural Computing and Applications, 2024, 36: 2549–2572.
20	ZHANG Z X, SCHOMAKER L. Fusion-S2iGan: an efficient and effective single-stage framework for speech-to-image generation[J]. Neural Computing and Applications, 2024, 36: 10567–10584.
21	GAFNI O, POLYAK A, ASHUAL O, et al. Make-a-scene: scene-based text-to-image generation with human priors[C]//Proc. of the European Conference on Computer Vision, 2022: 89–106.
22	RAMESH A, DHARIWAL P, NICHOL A, et al. Hierarchical text-conditional image generation with CLIP latents[EB/OL]. [2025-03-01]. https://arxiv.org/pdf/2204.06125.

Algorithm	IS	Parameters (millions)
StackGAN [6]	3.70±0.04	—
StackGAN++ [6]	4.04±0.05	—
AttnGAN [4]	4.36±0.03	230
MirrorGAN [3]	4.56±0.05	280
ControlGAN [1]	4.58±0.09	—
SDGAN [5]	4.67±0.09	—
DM-GAN [18]	4.75±0.07	46
DoubleGAN	4.97±0.05	20
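The IS column reports the inception score, which is conventionally defined as exp(E_x[KL(p(y|x) ‖ p(y))]), where p(y|x) is a classifier's label distribution for one generated image and p(y) is the average of those distributions over all generated images. The following is a minimal sketch of that formula only. It assumes the per-image class probabilities have already been obtained from a pretrained Inception network, and the function name `inception_score` is illustrative. The page does not say how the ± values in the table were obtained, so the sketch does not reproduce them.

```python
import math


def inception_score(class_probs):
    """Compute exp(mean KL(p(y|x) || p(y))) from per-image class probabilities.

    class_probs: list of N probability vectors (each sums to 1), one per
        generated image; ASSUMED to come from a pretrained classifier.
    """
    n = len(class_probs)
    k = len(class_probs[0])

    # Marginal label distribution p(y): average over all images.
    marginal = [sum(p[j] for p in class_probs) / n for j in range(k)]

    total_kl = 0.0
    for p in class_probs:
        # KL divergence for one image; zero-probability terms contribute 0.
        total_kl += sum(
            pj * math.log(pj / marginal[j]) for j, pj in enumerate(p) if pj > 0.0
        )
    return math.exp(total_kl / n)


confident_and_diverse = [[1.0, 0.0], [0.0, 1.0]]
uninformative = [[0.5, 0.5], [0.5, 0.5]]
score_high = inception_score(confident_and_diverse)
score_low = inception_score(uninformative)
```

The score rises when each image is classified confidently and the set of images covers many classes, which is why a higher IS is read as better quality and diversity.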

Dataset	StackGAN++	AttnGAN	DoubleGAN
CUB bird	26.07	23.98	15.13
MSCOCO	51.62	35.49	20.30

Dataset	StackGAN++	AttnGAN	DoubleGAN
CUB bird	8.1	15.3	76.6
MSCOCO	6.2	11.2	82.6

Model	CAM	PAM	CAdaILN	VL	IS	FID
①	√	√	√	√	4.97±0.05	15.13
②	√	√	√	—	4.72±0.04	19.23
③	√	√	—	√	4.11±0.04	25.24
④	√	—	√	√	4.71±0.05	21.69
⑤	—	√	√	√	4.60±0.07	22.95
⑥	—	—	√	√	4.54±0.04	23.72
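The FID column is the Fréchet inception distance, the Fréchet distance ‖μ_r − μ_g‖² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}) between Gaussians fitted to Inception features of real and generated images. Lower is better. The sketch below is a deliberately simplified version that assumes diagonal covariance matrices, so the matrix square root reduces to an element-wise square root and the code needs only the standard library. A real FID computation uses full covariance matrices. The function name and inputs are illustrative.

```python
import math


def frechet_distance_diag(mu_r, var_r, mu_g, var_g):
    """Fréchet distance between two Gaussians with DIAGONAL covariances.

    mu_r, var_r: per-dimension mean and variance of real-image features.
    mu_g, var_g: per-dimension mean and variance of generated-image features.
    Simplifying assumption: off-diagonal covariance terms are ignored.
    """
    # Squared Euclidean distance between the two mean vectors.
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    # Trace term; for diagonal matrices sqrt(S_r S_g) is element-wise.
    cov_term = sum(
        vr + vg - 2.0 * math.sqrt(vr * vg) for vr, vg in zip(var_r, var_g)
    )
    return mean_term + cov_term


same = frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
shifted = frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [4.0, 1.0])
```

Identical feature statistics give a distance of zero, and both a shift in the mean and a mismatch in variance increase it.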

Value	IS	FID
0.05	4.74±0.05	18.15
0.10	4.97±0.05	15.13
0.15	4.82±0.06	16.75
0.20	4.59±0.04	20.91
0.30	4.70±0.06	20.28


Model	Architecture	IS	FID
1	Baseline	4.11±0.04	25.24
2	+BN-sent	4.67±0.07	19.76
3	+BN-word	4.68±0.04	19.46
4	+CAdaILN	4.97±0.05	15.13
5	+CAdaILN-word	4.71±0.07	19.08
