Systems Engineering and Electronics ›› 2025, Vol. 47 ›› Issue (11): 3739-3753. doi: 10.12305/j.issn.1001-506X.2025.11.22

• Systems Engineering •

Imbalanced data oversampling method based on DBSCAN and CGAN

Xi TANG, Wenhai LI, Zhenhao TANG, Ruifeng LI, Gen LI

  1. Academy of Aeronautical Operations Service, Naval Aviation University, Yantai 264001, Shandong, China
  • Received: 2025-04-24 Online: 2025-11-25 Published: 2025-12-08
  • Contact: Ruifeng LI E-mail: 910073134@qq.com
  • About the authors: Xi TANG (b. 1992), male, lecturer, doctoral candidate; main research interests: intelligent testing of avionics equipment, machine learning
    Wenhai LI (b. 1969), male, professor, Ph.D.; main research interest: aviation equipment support
    Zhenhao TANG (b. 1995), male, lecturer, M.S.; main research interests: software engineering, artificial-intelligence robots
    Gen LI (b. 1992), male, doctoral candidate; main research interests: machine learning, fault diagnosis
  • Funding:
    Supported by the Taishan Scholar Foundation of Shandong Province (TSTP20221146)

Abstract:

To improve classifier accuracy on imbalanced data, an oversampling method is proposed that combines density-based spatial clustering of applications with noise (DBSCAN) with a conditional generative adversarial network (CGAN). First, DBSCAN clusters the positive and negative samples separately. The sample set is then reconstructed using the cluster labels, and noise samples are identified and removed according to a safety-level criterion, which improves data quality. Next, the refined sample set is used to train a CGAN. To address training instability and mode collapse in the CGAN, the Wasserstein distance with a gradient penalty term is adopted as the loss function, and the Wasserstein distance is adapted to the classification setting so that high-quality minority-class samples can be generated. Finally, the proposed method is compared with five classical oversampling methods on three typical classifiers, using nine general-purpose imbalanced datasets and one dataset of measurements from an analog circuit. The results show that the proposed method outperforms the other oversampling algorithms on most datasets, and that its advantage grows as the class imbalance becomes more severe. The proposed method offers a new approach to handling imbalanced data.
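The first stage described in the abstract (per-class DBSCAN clustering, relabeling by cluster, and noise removal) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the safety-level criterion, so this sketch simply drops the points that DBSCAN itself marks as noise, and the function names `dbscan` and `cluster_and_filter`, the parameter values, and the toy data are all assumptions made for illustration.

```python
import math

NOISE = -1  # label used for points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one cluster label per point (NOISE for outliers)."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be re-labelled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nb)
        cluster += 1
    return labels


def cluster_and_filter(X, y, eps, min_pts):
    """Cluster each class separately and keep (sample, class, cluster) triples.

    ASSUMPTION: DBSCAN noise points are discarded as a stand-in for the
    paper's safety-level criterion, which the abstract does not specify.
    """
    kept = []
    for cls in sorted(set(y)):
        idx = [i for i, label in enumerate(y) if label == cls]
        labels = dbscan([X[i] for i in idx], eps, min_pts)
        for i, c in zip(idx, labels):
            if c != NOISE:
                kept.append((X[i], cls, c))
    return kept


# Toy data (hypothetical): class 0 is the majority, class 1 the minority,
# and (10, 10) is an isolated minority sample.
X = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0), (5.5, 5.5), (6.0, 5.5),
     (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

kept = cluster_and_filter(X, y, eps=1.5, min_pts=3)
```

In this sketch each kept sample carries a (class, cluster) pair, which is one plausible way to form the condition passed to a CGAN. How the paper actually encodes cluster labels as conditions is not stated in the abstract.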

Key words: imbalanced data, conditional generative adversarial network (CGAN), density-based spatial clustering of applications with noise (DBSCAN), oversampling
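For reference, the standard Wasserstein loss with gradient penalty, written for a conditional critic, is sketched below. This is the generic WGAN-GP formulation only; the abstract does not give the paper's classification-specific adaptation of the Wasserstein distance, so that modification is not reproduced here. The symbols are assumptions: $D$ is the critic, $G$ the generator, $c$ the condition (for example a class or cluster label), $P_r$ and $P_g$ the real and generated distributions, and $\lambda$ the penalty weight.

```latex
% Critic loss: Wasserstein estimate plus gradient penalty
L_D = \mathbb{E}_{\tilde{x} \sim P_g}\left[ D(\tilde{x} \mid c) \right]
    - \mathbb{E}_{x \sim P_r}\left[ D(x \mid c) \right]
    + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
      \left[ \left( \left\| \nabla_{\hat{x}} D(\hat{x} \mid c) \right\|_2 - 1 \right)^2 \right]

% Interpolated samples on which the penalty is evaluated
\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}, \qquad \epsilon \sim U[0, 1]

% Generator loss
L_G = - \mathbb{E}_{z \sim p(z)}\left[ D\big(G(z \mid c) \mid c\big) \right]
```

The penalty term pushes the critic's gradient norm toward 1 on points between real and generated samples, which is the usual way to enforce the Lipschitz constraint without weight clipping.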

CLC number: