基于不确定性的贝叶斯策略重用方法

doi:10.12305/j.issn.1001-506X.2025.02.20

摘要/Abstract

摘要：

针对多智能体对抗中因对手策略变化导致的非平稳性问题, 在对手动作不可获取的限制下, 提出一种基于不确定性的贝叶斯策略重用算法。在离线阶段, 在策略学习的同时, 通过自编码器建模智能体轨迹与对手动作之间的关系表征以构建对手模型。在在线阶段, 依据对手模型和有限交互信息, 估计对手策略类型的不确定性, 并基于此选择最优应对策略并重用。最后, 在两种对抗场景下的实验结果表明所提算法相比3种先进的基线方法识别精度更高, 且识别速度更快。

关键词: 多智能体对抗, 贝叶斯策略重用, 强化学习, 关系表征

Abstract:

To solve the non-stationarity problem caused by opponent policy changes in multi-agent competitions, this paper proposes an algorithm called uncertainty-based Bayesian policy reuse under the restriction of unavailability of the online opponent's actions. In the offline phase, use an autoencoder to model the relationship representation between agent trajectories and the opponent actions during policy learning. In the online phase, the agent evaluates the uncertainty of the opponent type only conditioning on limited interaction information and the built opponent models. Afterward, optimal response policy is selected for execution. The proposed algorithm on two scenarios and demonstrate that it has higher recognition accuracy and faster speed than three state-of-the-art baseline methods.

Key words: multi-agent competition, Bayesian policy reuse, reinforcement learning, relationship representation

中图分类号:

TP301.6

付可, 陈浩, 王宇, 刘权, 黄健. 基于不确定性的贝叶斯策略重用方法[J]. 系统工程与电子技术, 2025, 47(2): 535-543.

Ke FU, Hao CHEN, Yu WANG, Quan LIU, Jian HUANG. Uncertainty-based Bayesian policy reuse method[J]. Systems Engineering and Electronics, 2025, 47(2): 535-543.

图/表 15

图1

图2

图3

图4

图5

图6

图7

图8

图9

表1

图10

图11

图12

图13

图14

参考文献 30

1	ZHOU Z Y, LIU G J, TANG Y. Multi-agent reinforcement learning: methods, applications, visionary prospects, and cha-llenges[EB/OL]. [2023-09-05]. https://doi.org/10.48550/arXiv.2305.10091.
2	WEN M N , KUBA J , LIN R J , et al. Multi-agent reinforcement learning is a sequence modeling problem[J]. Advances in Neural Information Processing Systems, 2022, 35, 16509- 16521.
3	VINYALS O , BABUSCHKIN I , CZARNECKI W M , et al. Grandmaster level in StarCraft Ⅱ using multi-agent reinforcement learning[J]. Nature, 2019, 575 (7782): 350- 354. doi: 10.1038/s41586-019-1724-z
4	GAO Y M, LIU F Y, WANG L, et al. Towards effective and interpretable human-agent collaboration in MOBA games: a communication perspective[C]//Proc. of the 11th International Conference on Learning Representations, 2023.
5	张磊, 李姜, 侯进永, 等. 基于改进强化学习的多无人机协同对抗算法研究[J]. 兵器装备工程学报, 2023, 44 (5): 230- 238.
	ZHANG L , LI J , HOU J Y , et al. Research on multi-UAV cooperative confrontation algorithm based on improved reinforcement learning[J]. Journal of Ordnance Equipment Engineering, 2023, 44 (5): 230- 238.
6	POPE A P , IDE J S , MICOVIC D , et al. Hierarchical reinforcement learning for air combat at DARPA's Alpha dog fight trials[J]. IEEE Trans.on Artificial Intelligence, 2022, 4 (6): 1371- 1385.
7	ANDRIES S , HERMAN A E , WILLIE B , et al. Scaling multi-agent reinforcement learning to full 11 versus 11 simulated robotic football[J]. Autonomous Agents and Multi-Agent Systems, 2023, 37 (1): 30.
8	孙辉辉, 胡春鹤, 张军国. 基于主动风险防御机制的多机器人强化学习协同对抗策略[J]. 控制与决策, 2023, 38 (5): 1429- 1450.
	SUN H H , HU C H , ZHANG J G . Cooperative countermeasure strategy based on active risk defense multiagent reinforcement learning[J]. Control and Decision, 2023, 38 (5): 1429- 1450.
9	ZHANG T. Opponent modelling in multi-agent systems[D]. London: University College London, 2021.
10	HU H M, SHI D X, YANG H H, et al. Independent multi-agent reinforcement learning using common knowledge[C]//Proc. of the IEEE International Conference on Systems, Man, and Cybernetics, 2022: 2703-2708.
11	ROSMAN B , HAWASLY M , RAMAMOORTHY S . Bayesian policy reuse[J]. Machine Learning, 2016, 104, 99- 127. doi: 10.1007/s10994-016-5547-y
12	何立, 沈亮, 李辉, 等. 强化学习中的策略重用: 研究进展[J]. 系统工程与电子技术, 2022, 44 (3): 884- 899.
	HE L , SHEN L , LI H , et al. Survey on policy reuse in reinforcement learning[J]. Systems Engineering and Electronics, 2022, 44 (3): 884- 899.
13	HERNANDEZ-LEAL P, TAYLOR M E, ROSMAN B, et al. Identifying and tracking switching, non-stationary opponents: a Bayesian approach[C]//Proc. of the 30th Conference on Artificial Intelligence, 2016.
14	YANG T P, MENG Z P, HAO J Y, et al. Towards efficient detection and optimal response against sophisticated opponents[C]// Proc. of the 28th International Joint Conference on Artificial Intelligence, 2019: 623-629.
15	WEERD H D , VERBRUFFE R , VERHEIJ B . How much does it help to know what she knows you know? an agent-based simulation study[J]. Artificial Intelligence, 2013, 199, 67- 92.
16	HERNANDEZ-LEAL P , KARTAL B , TAYLOR M E . A survey and critique of multiagent deep reinforcement learning[J]. Autonomous Agents and Multi-Agent Systems, 2019, 33, 750- 797.
17	ZHENG Y, MENG Z P, HAO J Y, et al. A deep Bayesian policy reuse approach against non-stationary agents[C]//Proc. of the Advances in Neural Information Processing Systems, 2018.
18	BANK D , KOENIGSTEIN N , GIRYES R . Autoencoders[J]. Machine Learning for Data Science Handbook, 2023, doi: 10.1007/978-3-031-24628-9_16
19	ZHAI J H, ZHANG S F, CHEN J F, et al. Autoencoder and its various variants[C]//Proc. of the IEEE International Conference on Systems, Man, and Cybernetics, 2018: 415-419.
20	LI C J , ZHOU D , GU Q , et al. Learning two-player Markov games: neural function approximation and correlated equilibrium[J]. Advances in Neural Information Processing Systems, 2022, 35, 33262- 33274.
21	GUO W B, WU X, HUANG S, et al. Adversarial policy learning in two-player competitive games[C]//Proc. of the 38th International Conference on Machine Learning, 2021: 3910-3919.
22	SCHULMAN J, WOLSKI F, DHARIWAL P, et al. Proximal policy optimization algorithms[EB/OL]. [2023-09-05]. https://doi.org/10.48550/arXiv.1707.06347.
23	VOLODYMYR M, ADRIA P B, MEH D, et al. Asynchronous methods for deep reinforcement learning[C]//Proc. of the 33th International Conference on Machine Learning, 2016.
24	姜楠, 王健. 信息论与编码理论[M]. 北京: 清华大学出版社, 2010.
	JIANG N , WANG J . The theory of information and coding[M]. Beijing: Tsinghua University Press, 2020.
25	ZHANG T, YING W G, GONG Z C, et al. A regularized opponent model with maximum entropy objective[C]//Proc. of the 29th International Joint Conference on Artificial Intelligence, 2019.
26	WIMMER L, SALE Y, HOFMAN P, et al. Quantifying aleatoric and epistemic uncertainty in machine learning: are conditional entropy and mutual information appropriate measures?[C]//Proc. of the 39th Conference on Uncertainty in Artificial Intelligence, 2023: 2282-2292.
27	MURPHY K P . Probabilistic machine learning: an introduction[M]. Cambridge: Massachusetts Institute of Technology Press, 2022.
28	CRESCENZO D A , LONGOBARD M . On cumulative entropies[J]. Journal of Statistical Planning and Inference, 2009, 139 (12): 4072- 4087.
29	PAPOUDAKIS G , CHRISTIANOU F , ALBRECHT S . Agent modelling under partial observability for deep reinforcement learning[J]. Advances in Neural Information Processing Systems, 2021, 34, 19210- 19222.
30	LOWE R, WU Y I, TAMAR A, et al. Multi-agent actor-critic for mixed cooperative-competitive environments[C]//Proc. of the 31st International Conference on Neural Information Processing Systems, 2017: 6382-6393.

实验环境	BPR+	Beyes-ToMoP	Deep BPR+	Uncertainty-BPR
足球游戏	84.5±0.0	81.8±1.6	84.1±0.1	87.7±0.1
追捕游戏	90.7±0.1	89.7±1.1	91.6±0.2	92.5±0.2

[1]	闫循良, 王宽, 张子剑, 王培臣. 基于LSTM-DDPG的再入制导方法[J]. 系统工程与电子技术, 2025, 47(1): 268-279.
[2]	张庭瑜, 曾颖, 李楠, 黄洪钟. 基于深度强化学习的航天器功率-信号复合网络优化算法[J]. 系统工程与电子技术, 2024, 46(9): 3060-3069.
[3]	夏雨奇, 黄炎焱, 陈恰. 基于深度Q网络的无人车侦察路径规划[J]. 系统工程与电子技术, 2024, 46(9): 3070-3081.
[4]	杨志鹏, 陈子浩, 曾长, 林松, 毛金娣, 张凯. 复杂环境下的飞行器在线航路规划决策方法[J]. 系统工程与电子技术, 2024, 46(9): 3166-3175.
[5]	彭莉莎, 孙宇祥, 薛宇凡, 周献中. 融合三支多属性决策与SAC的兵棋推演智能决策技术[J]. 系统工程与电子技术, 2024, 46(7): 2310-2322.
[6]	郭宏达, 娄静涛, 徐友春, 叶鹏, 李永乐, 陈晋生. 基于MADDPG的多无人车协同事件触发通信[J]. 系统工程与电子技术, 2024, 46(7): 2525-2533.
[7]	张梦钰, 豆亚杰, 陈子夷, 姜江, 杨克巍, 葛冰峰. 深度强化学习及其在军事领域中的应用综述[J]. 系统工程与电子技术, 2024, 46(4): 1297-1308.
[8]	李彦铃, 罗飞舟, 葛致磊. 基于鲁棒观测器的深度强化学习垂直起降运载器姿态稳定研究[J]. 系统工程与电子技术, 2024, 46(3): 1038-1047.
[9]	唐恒, 孙伟, 吕磊, 贺若飞, 吴建军, 孙昌浩, 孙田野. 融合动态奖励策略的无人机编队路径规划方法[J]. 系统工程与电子技术, 2024, 46(10): 3506-3518.
[10]	冯路为, 刘松涛, 徐华志. 基于POMDP模型的智能雷达干扰决策方法[J]. 系统工程与电子技术, 2023, 45(9): 2755-2760.
[11]	马悦, 吴琳, 许霄. 基于多智能体强化学习的协同目标分配[J]. 系统工程与电子技术, 2023, 45(9): 2793-2801.
[12]	韦道知, 张曌宇, 谢家豪, 李宁. 基于改进Actor-Critic算法的多传感器交叉提示技术[J]. 系统工程与电子技术, 2023, 45(6): 1624-1632.
[13]	吴冯国, 陶伟, 李辉, 张建伟, 郑成辰. 基于深度强化学习算法的无人机智能规避决策[J]. 系统工程与电子技术, 2023, 45(6): 1702-1711.
[14]	李欣致, 董胜波, 崔向阳. 基于非对称不可观测状态的强化学习技术[J]. 系统工程与电子技术, 2023, 45(6): 1755-1761.
[15]	唐进, 梁彦刚, 白志会, 黎克波. 基于DQN的旋翼无人机着陆控制算法[J]. 系统工程与电子技术, 2023, 45(5): 1451-1460.