Systems Engineering and Electronics, 2025, Vol. 47, Issue (2): 535-543. doi: 10.12305/j.issn.1001-506X.2025.02.20

• Systems Engineering •

Uncertainty-based Bayesian policy reuse method

Ke FU, Hao CHEN, Yu WANG, Quan LIU, Jian HUANG

  1. College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China
  • Received: 2023-09-05  Online: 2025-02-25  Published: 2025-03-18
  • Contact: Jian HUANG
  • About the authors: Ke FU (1993—), female, Ph.D. candidate; research interests: multi-agent reinforcement learning, system simulation
    Hao CHEN (1993—), male, lecturer, Ph.D.; research interests: multi-agent reinforcement learning, system simulation
    Yu WANG (1998—), male, Ph.D. candidate; research interests: multi-agent reinforcement learning, system simulation
    Quan LIU (1985—), male, associate research fellow, Ph.D.; research interests: machine learning, wireless sensor networks
    Jian HUANG (1971—), female, research fellow, Ph.D.; research interests: system simulation, machine learning

Abstract:

To address the non-stationarity caused by opponent policy changes in multi-agent competition, this paper proposes an uncertainty-based Bayesian policy reuse algorithm that operates under the restriction that the opponent's actions are unavailable online. In the offline phase, while response policies are being learned, an autoencoder is used to model the relationship representation between the agent's trajectories and the opponent's actions, thereby building opponent models. In the online phase, the agent estimates the uncertainty of the opponent's policy type conditioned only on limited interaction information and the built opponent models, and then selects the optimal response policy and reuses it. Finally, experimental results on two competitive scenarios demonstrate that the proposed algorithm achieves higher recognition accuracy and faster recognition speed than three state-of-the-art baseline methods.
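To make the two phases concrete, below is a minimal Python sketch of a generic Bayesian policy reuse loop of the kind the abstract outlines. The opponent types, the Gaussian observation model, and the utility table are illustrative assumptions, not the paper's models; in particular, the paper derives the type likelihood from an autoencoder-based opponent model that relates the agent's own trajectories to opponent actions, so the opponent's actions never need to be observed online.

```python
import numpy as np

# Illustrative sketch of Bayesian policy reuse (BPR): maintain a belief over
# opponent types and reuse the pre-trained response policy that maximizes
# expected utility under that belief. All quantities below are assumptions.

opponent_types = ["type_0", "type_1", "type_2"]   # hypothetical offline opponent library
response_policies = ["pi_0", "pi_1", "pi_2"]      # one pre-trained response policy per type

# Assumed offline performance model: U[i, j] = expected return of response
# policy j against opponent type i, estimated during offline policy learning.
U = np.array([[0.9, 0.2, 0.4],
              [0.3, 0.8, 0.5],
              [0.4, 0.3, 0.9]])

# Assumed observation model: each opponent type induces a different mean of a
# scalar signal extracted from the agent's own trajectory. In the paper, this
# role is played by the autoencoder-based opponent model.
signal_means = np.array([0.0, 1.0, 2.0])

def likelihood(signal, type_idx, sigma=0.5):
    """Gaussian stand-in for p(signal | opponent type)."""
    return np.exp(-0.5 * ((signal - signal_means[type_idx]) / sigma) ** 2)

def update_belief(belief, signal):
    """Bayes rule over opponent types given one interaction signal."""
    post = belief * np.array([likelihood(signal, i) for i in range(len(belief))])
    return post / post.sum()

def select_policy(belief):
    """Reuse the response policy with the highest expected utility under the belief."""
    return response_policies[int(np.argmax(belief @ U))]

# Online phase: start from a uniform prior and refine it with each interaction.
belief = np.full(len(opponent_types), 1.0 / len(opponent_types))
for signal in [1.1, 0.9, 1.2]:   # toy stream of observed trajectory signals
    belief = update_belief(belief, signal)
    print(select_policy(belief), belief.round(3))
```

In this toy run, the belief concentrates on the type whose signal mean best explains the observations, and the matching response policy is selected; the paper's contribution lies in how that likelihood is obtained without access to opponent actions, and in weighting the selection by the estimated uncertainty of the type.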

Key words: multi-agent competition, Bayesian policy reuse, reinforcement learning, relationship representation

CLC Number: