Robust multi-agent cooperative confrontation policy offline reinforcement learning

doi:10.12305/j.issn.1001-506X.2026.05.23

Systems Engineering and Electronics, 2026, Vol. 48, Issue (5): 1670-1681.

Huaqing ZHANG¹, Xiaofei ZHANG², Mingrui HAO¹, Jixiang JIANG¹,*, Shan LI³

1. National Key Laboratory of Complex System Control and Intelligent Agent Cooperation, Beijing Institute of Mechanical and Electrical Engineering, Beijing 100074, China
2. Beijing Institute of Computer Technology and Applications, Beijing 100854, China
3. School of Mathematics and Statistics, Hainan University, Haikou 570228, China

Received: 2025-01-14 Online: 2026-05-27 Published: 2026-05-27
Corresponding author: Jixiang JIANG
About the authors:
Huaqing ZHANG (born 1993), male, engineer, Ph.D. Main research interests: multi-agent game confrontation, multi-agent cooperative search.
Xiaofei ZHANG (born 1991), male, senior engineer, Ph.D. Main research interests: multi-agent systems, machine learning, reinforcement learning, large models, human-factor-based autonomous driving.
Mingrui HAO (born 1985), male, research fellow, Ph.D. Main research interests: intelligent decision-making, intelligent control.
Shan LI (born 1982), female, associate professor, Ph.D. Main research interests: multi-agent cooperation.
Funding:
Supported by the Hainan Provincial Natural Science Foundation (122QN215)

Abstract:

To address offline learning of multi-agent confrontation policies in dynamic scenarios, a robust multi-agent policy offline reinforcement learning (RMA-offlineRL) method is proposed to reduce the impact of dataset quality on offline policy learning. In the policy improvement step of RMA-offlineRL, the offline policy gradient is computed by applying a Box-Cox transformation to the log-probabilities, under the current policy, of historical state-action pairs sampled from the dataset. This not only limits extrapolation error but also allows the policy to be improved using low-quality datasets that contain a large amount of random behavior. In addition, in the policy evaluation step of RMA-offlineRL, a stable policy evaluation algorithm is designed for multi-step offline interaction data. To verify the effectiveness and advantages of the proposed method, it is validated comprehensively on offline datasets collected with the MiaoSuan wargame system and benchmark environments. The results show that the proposed method can efficiently learn cooperative confrontation policies from low-quality datasets that contain a large amount of random behavior or have a low state-action coverage index.

Key words: multi-agent, cooperative confrontation, offline reinforcement learning, confrontation scenarios, random behavior
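
The abstract names a Box-Cox transformation but does not give its exact form or parameters. As a rough illustration only, the Python sketch below implements the standard one-parameter Box-Cox transform, whose λ = 0 case is the natural logarithm, and evaluates it on a probability value. The function names `box_cox` and `box_cox_slope`, the choice λ = 0.5, and the idea of feeding it an action probability are assumptions made for this example, not the RMA-offlineRL loss itself.

```python
import math


def box_cox(x: float, lam: float) -> float:
    """Standard one-parameter Box-Cox transform, defined for x > 0.

    lam == 0 gives log(x); otherwise (x**lam - 1) / lam.
    """
    if x <= 0.0:
        raise ValueError("Box-Cox is defined only for positive inputs")
    if abs(lam) < 1e-12:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def box_cox_slope(x: float, lam: float) -> float:
    """Derivative of the transform with respect to x: x**(lam - 1)."""
    if x <= 0.0:
        raise ValueError("Box-Cox is defined only for positive inputs")
    return x ** (lam - 1.0)


if __name__ == "__main__":
    lam = 0.5  # hypothetical value, chosen only for illustration
    for p in (1.0, 0.1, 0.01, 1e-6):
        # log(p) is unbounded as p -> 0, while the lam > 0 transform
        # stays above -1 / lam.
        print(p, math.log(p), box_cox(p, lam), box_cox_slope(p, lam))
```

For λ > 0 the transformed value is bounded below by -1/λ and its slope grows more slowly than 1/p as p approaches 0, so samples to which the current policy assigns very low probability contribute less extreme values than they would under a plain logarithm. Whether this is the mechanism the authors rely on is not stated in the abstract.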

CLC number: TP 181

Cite this article:

Huaqing ZHANG, Xiaofei ZHANG, Mingrui HAO, Jixiang JIANG, Shan LI. Robust multi-agent cooperative confrontation policy offline reinforcement learning[J]. Systems Engineering and Electronics, 2026, 48(5): 1670-1681.



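Hyperparameter	Setting
Optimizer	Adam
Optimizer $\gamma$	0.99
Learning rate $l_r$	$3\times 10^{-4}$
Interaction step length $n$	8
Scaling coefficient $C_p$	0.01
Training batch size	15
Hidden layer units	300, 400
Units in the evaluation module's action expansion layer	84
Exponential moving average parameter $w_e$	0.9
Activation function	ReLU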
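
For readers who want to reuse these settings, the values in the table can be collected in a single configuration object. The Python sketch below is one possible way to do that. The class name `RMAOfflineRLConfig` and all field names are hypothetical labels chosen for this example, and the table itself does not say how each value is consumed by the training code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RMAOfflineRLConfig:
    """Hyperparameter values copied from the table; field names are hypothetical."""

    optimizer: str = "Adam"
    gamma: float = 0.99                        # listed as "Optimizer gamma"
    learning_rate: float = 3e-4                # l_r
    n_step: int = 8                            # interaction step length n
    scaling_coefficient: float = 0.01          # C_p
    batch_size: int = 15
    hidden_units: Tuple[int, ...] = (300, 400)
    action_expansion_units: int = 84           # evaluation module's action expansion layer
    ema_weight: float = 0.9                    # w_e
    activation: str = "ReLU"


if __name__ == "__main__":
    config = RMAOfflineRLConfig()
    print(config)
```

Freezing the dataclass keeps the published values from being changed by accident during an experiment; a variant run can be created with `dataclasses.replace(config, batch_size=...)`.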

Benchmark environment (full name)	Abbreviation
LunarLanderContinuous-v2	LunarLander
BipedalWalkerHardcore-v3	BipedalWalker
HopperPyBulletEnv-v0	Hopper
AntPyBulletEnv-v0	Ant
HalfCheetahPyBulletEnv-v0	HalfCheetah
Walker2DPyBulletEnv-v0	Walker2D
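
The abbreviation table can be kept as a small lookup so that result logs using the short names can be traced back to the full environment identifiers. The Python sketch below is only an illustration: `ENV_IDS` and `full_env_id` are hypothetical names, and the code deliberately does not create any environment, since the required simulator packages and versions are not stated here.

```python
# Abbreviation -> full environment identifier, copied from the table.
ENV_IDS = {
    "LunarLander": "LunarLanderContinuous-v2",
    "BipedalWalker": "BipedalWalkerHardcore-v3",
    "Hopper": "HopperPyBulletEnv-v0",
    "Ant": "AntPyBulletEnv-v0",
    "HalfCheetah": "HalfCheetahPyBulletEnv-v0",
    "Walker2D": "Walker2DPyBulletEnv-v0",
}


def full_env_id(abbreviation: str) -> str:
    """Return the full environment identifier for a table abbreviation."""
    try:
        return ENV_IDS[abbreviation]
    except KeyError:
        known = ", ".join(sorted(ENV_IDS))
        raise KeyError(f"unknown abbreviation {abbreviation!r}; known: {known}") from None


if __name__ == "__main__":
    for short_name in ENV_IDS:
        print(short_name, "->", full_env_id(short_name))
```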
