Systems Engineering and Electronics ›› 2026, Vol. 48 ›› Issue (5): 1670-1681. doi: 10.12305/j.issn.1001-506X.2026.05.23

• Systems Engineering •

Robust multi-agent cooperative confrontation policy offline reinforcement learning

Huaqing ZHANG1, Xiaofei ZHANG2, Mingrui HAO1, Jixiang JIANG1,*, Shan LI3

  1. National Key Laboratory of Complex System Control and Intelligent Agent Cooperation, Beijing Institute of Mechanical and Electrical Engineering, Beijing 100074, China
  2. Beijing Institute of Computer Technology and Applications, Beijing 100854, China
  3. School of Mathematics and Statistics, Hainan University, Haikou 570228, China
  • Received: 2025-01-14 Online: 2026-05-27 Published: 2026-05-27
  • Corresponding author: Jixiang JIANG
  • About the authors: Huaqing ZHANG (1993-), male, engineer, Ph.D.; research interests: multi-agent game confrontation and multi-agent cooperative search
    Xiaofei ZHANG (1991-), male, senior engineer, Ph.D.; research interests: multi-agent systems, machine learning, reinforcement learning, large models, and human-factors-based autonomous driving
    Mingrui HAO (1985-), male, research professor, Ph.D.; research interests: intelligent decision-making and intelligent control
    Shan LI (1982-), female, associate professor, Ph.D.; research interests: multi-agent cooperation
  • Funding:
    Supported by the Natural Science Foundation of Hainan Province (122QN215)

Abstract:

To address offline learning of multi-agent confrontation policies in dynamic scenarios, a robust multi-agent policy offline reinforcement learning (RMA-offlineRL) method is proposed that reduces the impact of dataset quality on offline policy learning. In the policy improvement step of RMA-offlineRL, the offline policy gradient is computed by applying a Box-Cox transformation to the log-probabilities, under the current policy, of historical state-action pairs sampled from the dataset. This not only limits extrapolation error but also allows the policy to be improved from low-quality datasets that contain a large amount of random behavior. In the policy evaluation step of RMA-offlineRL, a stable policy evaluation algorithm is designed for multi-step offline interaction data. To verify the effectiveness and advantages of the proposed method, it is validated on offline datasets collected with the MiaoSuan wargame system and with benchmark environments. The results show that the proposed method can efficiently learn cooperative confrontation policies from low-quality datasets that contain a large amount of random behavior or have a low state-action coverage index.
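The abstract does not give the exact form of the transformed objective, so the following is only a minimal sketch of how a Box-Cox transformation could enter a likelihood-based policy objective. It assumes the standard Box-Cox form (x^λ - 1)/λ applied to the action probability π(a|s) = exp(log π(a|s)), which reduces to log π(a|s) as λ approaches 0. The function names, the per-sample `weights` argument, and the choice of λ are hypothetical and are not taken from the paper.

```python
import math


def box_cox(x, lam):
    """Standard Box-Cox transform of a positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if abs(lam) < 1e-12:
        # Limit case: the transform becomes the natural logarithm.
        return math.log(x)
    return (x ** lam - 1.0) / lam


def transformed_objective(log_probs, weights, lam):
    """Weighted mean of Box-Cox-transformed action probabilities.

    log_probs: log pi(a|s) of dataset actions under the current policy.
    weights:   hypothetical per-sample weights (assumption, e.g. advantages).
    """
    total = 0.0
    for lp, w in zip(log_probs, weights):
        total += w * box_cox(math.exp(lp), lam)
    return total / len(log_probs)


def grad_scale(log_prob, lam):
    """Derivative of box_cox(pi, lam) with respect to log pi, i.e. pi**lam."""
    return math.exp(lam * log_prob)
```

Under these assumptions, λ = 0 gives a gradient scale of 1 for every sample (the ordinary log-likelihood gradient), while λ > 0 scales each sample's gradient by π(a|s)^λ, so dataset actions that the current policy considers unlikely contribute less. Whether this matches the weighting used in RMA-offlineRL cannot be determined from the abstract alone.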

Key words: multi-agent, cooperative confrontation, offline reinforcement learning, confrontation scenarios, random behavior

CLC number: