Systems Engineering and Electronics ›› 2021, Vol. 43 ›› Issue (2): 420-433. doi: 10.12305/j.issn.1001-506X.2021.02.17

• Systems Engineering •

Parallel priority experience replay mechanism of MADDPG algorithm

Ang GAO1, Zhiming DONG1,*, Liang LI1, Jinghua SONG1, Li DUAN2

  1. Military Exercise and Training Center, Army Academy of Armored Forces, Beijing 100072, China
    2. Unit 61516 of the PLA, Beijing 100076, China
  • Received: 2020-03-06 Online: 2021-02-01 Published: 2021-03-16
  • Contact: Zhiming DONG E-mail: 236211588@qq.com
  • About the authors: Ang GAO (1988-), male, Ph.D. candidate; research interests: equipment operations and support simulation, multi-agent deep reinforcement learning. E-mail: 15689783388@163.com | Liang LI (1982-), male, lecturer, Ph.D.; research interests: equipment requirements demonstration, test and evaluation. E-mail: liliang_zgy@163.com | Jinghua SONG (1976-), female, associate professor, master's supervisor, Ph.D.; research interests: military equipment science, equipment testing. E-mail: jhsong@sina.com | Li DUAN (1976-), female, senior engineer, M.S.; research interest: information systems. E-mail: 236211566@qq.com
  • Supported by:
    Military Scientific Research Program (41405030302); Military Scientific Research Program (41401020301)



Abstract:

The multi-agent deep deterministic policy gradient (MADDPG) algorithm is an important application of deep reinforcement learning to the multi-agent system (MAS) field. To improve its performance, a MADDPG algorithm based on a parallel priority experience replay mechanism is proposed. The algorithm framework and training method are first analyzed. Then, exploiting the algorithm's centralized-training, distributed-execution structure, sampling from the multi-agent experience replay pool is carried out in parallel, and a priority replay mechanism is introduced into the sampling process, so that experience data flow in parallel, the data processing models work in parallel, and experience data are replayed according to priority. Finally, the improved algorithm is compared with the baseline in two typical OpenAI multi-agent environments, competitive and cooperative, along two dimensions: the number of training episodes and the training time. The results show that introducing the parallel priority experience replay mechanism significantly improves the algorithm's performance.
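To make the mechanism above concrete, the following Python sketch illustrates one plausible form of priority-based sampling combined with parallel per-agent batch draws during centralized training. It is an illustration only, not the authors' code: the class PrioritizedReplayBuffer, the helper sample_for_agents, the alpha/beta hyperparameters (proportional prioritization in the style of Schaul et al.), and the thread-based parallelism are all assumptions.

import numpy as np
from concurrent.futures import ThreadPoolExecutor

class PrioritizedReplayBuffer:
    """Proportional prioritized replay over joint multi-agent transitions (illustrative)."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha                    # how strongly priority skews sampling
        self.storage = []                     # (obs, actions, rewards, next_obs) tuples
        self.priorities = np.zeros(capacity)  # one priority per stored transition
        self.pos = 0

    def add(self, transition):
        # New transitions get the current maximum priority so each one
        # is replayed at least once before being down-weighted.
        max_p = self.priorities.max() if self.storage else 1.0
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        # Sample indices with probability proportional to priority**alpha.
        scaled = self.priorities[:len(self.storage)] ** self.alpha
        probs = scaled / scaled.sum()
        idx = np.random.choice(len(self.storage), batch_size, p=probs)
        # Importance-sampling weights correct the bias of non-uniform draws.
        weights = (len(self.storage) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return idx, [self.storage[i] for i in idx], weights

    def update_priorities(self, idx, td_errors, eps=1e-6):
        # Priority is the TD-error magnitude (eps keeps it strictly positive).
        self.priorities[idx] = np.abs(td_errors) + eps

def sample_for_agents(buffer, n_agents, batch_size):
    # Draw one prioritized batch per agent concurrently, mirroring the idea
    # of parallel sampling for the centralized training of the n critics.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        jobs = [pool.submit(buffer.sample, batch_size) for _ in range(n_agents)]
        return [job.result() for job in jobs]

In a full MADDPG training loop, each agent's critic update would consume one of these batches and write its TD errors back through update_priorities, so that transitions with larger errors are replayed more often.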

Key words: multi-agent system (MAS), deep reinforcement learning, parallel method, priority experience replay, deep deterministic policy gradient

CLC number: