UAV intelligent attack strategy generation model based on multi-agent game reinforcement learning

doi:10.12305/j.issn.1001-506X.2023.10.21

Abstract

How to use new combat forces, represented by attack unmanned aerial vehicles (UAVs), to enhance combat effectiveness is one of the focuses of research on intelligent and unmanned warfare. This paper studies key techniques for intelligent UAV attack based on multi-agent game reinforcement learning. Starting from the basic concepts of Markov stochastic games, it establishes a model for generating intelligent UAV attack strategies based on multi-agent game reinforcement learning, and proposes an optimization method that draws on the "trembling hand perfect" idea from game theory to improve the strategy model. Simulation experiments show that the optimized algorithm improves on the original algorithm and that the trained model can generate a variety of real-time attack tactics, which is of practical significance for intelligent command and control.

Key words: multi-agent game reinforcement learning, Markov stochastic game, unmanned aerial vehicle (UAV), tactical strategy

CLC number: E917

Zhiruo ZHAO, Lei CAO, Xiliang CHEN, Jun LAI, Legui ZHANG. UAV intelligent attack strategy generation model based on multi-agent game reinforcement learning[J]. Systems Engineering and Electronics, 2023, 45(10): 3165-3171.



Table 1

Armament	Unit (Red/Blue side)	Initial position
AGM-114K Hellfire II anti-tank missile	UAV 1	Heading: 230°, speed: 129 km/h, altitude: 777 m, 43°56′37″ E, 33°52′11″ N
	UAV 2	Heading: 91°, speed: 129 km/h, altitude: 610 m, 43°58′06″ E, 32°54′23″ N
	UAV 3	Heading: 230°, speed: 248 km/h, altitude: 777 m, 45°09′16″ E, 33°48′30″ N
-	Tank platoon 1	44°09′09″ E, 33°45′09″ N
	Tank platoon 2	44°07′44″ E, 33°16′38″ N
	Tank platoon 3	44°19′35″ E, 33°24′24″ N
	Tank platoon 4	44°35′10″ E, 33°38′28″ N
	Tank platoon 5	44°35′04″ E, 33°24′24″ N
	Tank platoon 6	44°35′20″ E, 33°11′37″ N
	Tank platoon 7	44°49′43″ E, 33°24′35″ N
	Tank platoon 8	45°01′42″ E, 33°31′36″ N
	Tank platoon 9	44°59′43″ E, 33°04′46″ N
SA-22 "Greyhound" surface-to-air missile	SAM platoon 1	44°20′07″ E, 33°36′55″ N
	SAM platoon 2	44°20′45″ E, 33°12′18″ N
	SAM platoon 3	44°49′51″ E, 33°36′28″ N
	SAM platoon 4	44°50′46″ E, 33°12′28″ N

Table 2

Parameter	Value
Learning rate	0.0005
Discount factor	0.99
Experience replay buffer size	100,000
Activation function	ReLU
PPO epochs	15
Clip parameter	0.2
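To make the role of these hyperparameters concrete, the following is a minimal sketch, not the authors' implementation. It collects the values from the table above in a configuration object and shows the two places where the discount factor and the clip value typically enter a proximal policy optimization (PPO) update: the discounted return and the clipped surrogate objective. The names `PPOConfig`, `discounted_returns`, and `clipped_surrogate` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PPOConfig:
    """Hyperparameter values copied from the table; field names are assumptions."""
    learning_rate: float = 5e-4
    gamma: float = 0.99          # discount factor
    buffer_size: int = 100_000   # experience replay buffer size
    activation: str = "relu"
    ppo_epochs: int = 15         # optimization passes per batch of data
    clip_eps: float = 0.2        # PPO clip parameter


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns: List[float] = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def clipped_surrogate(ratio: float, advantage: float, clip_eps: float) -> float:
    """Standard PPO clipped objective for a single sample.

    ratio is pi_new(a|s) / pi_old(a|s); the objective is the minimum of the
    unclipped and clipped terms, which removes the incentive to move the
    ratio outside [1 - clip_eps, 1 + clip_eps].
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    cfg = PPOConfig()
    print(discounted_returns([1.0, 1.0, 1.0], cfg.gamma))
    # A ratio of 1.5 with positive advantage is capped at 1 + clip_eps.
    print(clipped_surrogate(1.5, 1.0, cfg.clip_eps))
```

With a clip value of 0.2, a sample whose probability ratio has already moved more than 20% in the direction favored by its advantage contributes no further gradient, which is what keeps the 15 optimization passes over the same batch from drifting too far from the data-collecting policy.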

Table 3

Iterations	UAV 1	UAV 2	UAV 3
0-500	-15	-18	-10
500-1,000	108	117	90
1,000-2,000	117	125	100
2,000-3,000	113	120	99

Table 4

Iterations	UAV 1	UAV 2	UAV 3
0-500	-17	-20	-11
500-1,000	121	131	100
1,000-2,000	162	178	130
2,000-3,000	167	185	131
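The two iteration tables above can be compared with a few lines of arithmetic. The sketch below rests on an assumption that this excerpt does not state explicitly: that the first table reports the original algorithm and the second the optimized one, as the abstract's claim of improvement suggests. It uses only the final rows (iterations 2,000 to 3,000); the variable and function names are hypothetical.

```python
# Final-row values (iterations 2,000 to 3,000) copied from the two tables.
# ASSUMPTION: first table = original algorithm, second = optimized algorithm.
baseline = {"UAV 1": 113, "UAV 2": 120, "UAV 3": 99}
optimized = {"UAV 1": 167, "UAV 2": 185, "UAV 3": 131}


def relative_gain(before: dict, after: dict) -> dict:
    """Relative change (after - before) / |before| for each shared key."""
    return {key: (after[key] - before[key]) / abs(before[key]) for key in before}


gains = relative_gain(baseline, optimized)
for name, gain in gains.items():
    print(f"{name}: {gain:+.1%}")
# UAV 1: +47.8%
# UAV 2: +54.2%
# UAV 3: +32.3%
```

Under that assumption, the second table's final values are roughly 32% to 54% higher than the first's, and the second table keeps rising after 1,000 iterations while the first levels off.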
