Systems Engineering and Electronics ›› 2025, Vol. 47 ›› Issue (11): 3543-3550. doi: 10.12305/j.issn.1001-506X.2025.11.03

• Electronic Technology •

Acoustic scene classification based on cross-modal attention and gating fusion

Juan WEI1,*, Huiwen ZHOU1, Fangli NING2

  1. School of Communication Engineering, Xidian University, Xi’an 710071, Shaanxi, China
  2. School of Mechanical Engineering, Northwestern Polytechnical University, Xi’an 710072, Shaanxi, China
  • Received: 2025-04-10 Online: 2025-11-25 Published: 2025-12-08
  • Contact: Juan WEI E-mail: weijuan@xidian.edu.cn
  • About the authors: Huiwen ZHOU (born 1999), female, master's student; main research interest: acoustic scene classification
    Fangli NING (born 1974), male, professor, Ph.D.; main research interest: sound source localization
  • Supported by:
    the National Natural Science Foundation of China (52475132); the Key Research and Development Program of Shaanxi Province (2024GX-ZDCYL-01-16); the Aeronautical Science Foundation of China (20200015053001); the Xi’an Key Industrial Chain Technology Research Fund (23ZDCYJSGG0006-2023)

Abstract:

To address the insufficient capture of inter-modal correlations and the inefficient feature fusion in acoustic scene classification, an acoustic scene classification model based on cross-modal attention and gating fusion is proposed. The model uses a cross-modal attention module to enable bidirectional interaction between the acoustic and visual modalities and to dynamically capture the correlations between them. A gating fusion module dynamically adjusts the weights of the acoustic and visual modalities to fuse their features adaptively, and residual enhancement and a dual-path pooling strategy are introduced to improve feature robustness. The proposed model is compared with other methods for the same task along three dimensions: accuracy, frame rate, and number of model parameters. Simulation results show that the proposed model achieves better overall classification performance than the other methods while maintaining high accuracy, which demonstrates its effectiveness and practicality.

Key words: acoustic scene classification, cross-modal attention, dynamic gating, adaptive fusion
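
The abstract names the building blocks but gives no formulas, so the following is only a minimal sketch of how bidirectional cross-modal attention, a residual connection, dual-path pooling, and a gated fusion could fit together. Everything in it is an assumption made for illustration and is not the authors' implementation: the function names (`cross_attend`, `dual_pool`, `gated_fusion`) are hypothetical, attention uses plain scaled dot-product scores without learned projections, "residual enhancement" is read as adding the attended features back to the input, "dual-path pooling" is read as concatenated mean and max pooling over time, and the gate is an element-wise sigmoid with per-dimension weights.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    """Let every query frame attend over the frames of the other modality.

    Assumption: identity projections (no learned Q/K/V weights) and a
    residual connection that adds the attended vector back to the query.
    """
    dim = len(queries[0])
    enhanced = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in context]
        weights = softmax(scores)
        attended = [sum(w * k[j] for w, k in zip(weights, context))
                    for j in range(dim)]
        enhanced.append([qi + ai for qi, ai in zip(q, attended)])
    return enhanced

def dual_pool(frames):
    """Assumed dual-path pooling: mean and max over time, concatenated."""
    dim = len(frames[0])
    mean_part = [sum(f[j] for f in frames) / len(frames) for j in range(dim)]
    max_part = [max(f[j] for f in frames) for j in range(dim)]
    return mean_part + max_part

def gated_fusion(audio_vec, video_vec, w_audio, w_video, bias):
    """Assumed gate: g = sigmoid(w_a * a + w_v * v + b), element-wise.

    The fused feature is the convex combination g * a + (1 - g) * v.
    """
    fused = []
    for a, v, wa, wv, b in zip(audio_vec, video_vec, w_audio, w_video, bias):
        gate = 1.0 / (1.0 + math.exp(-(wa * a + wv * v + b)))
        fused.append(gate * a + (1.0 - gate) * v)
    return fused

# Toy inputs (made-up numbers): 3 audio frames and 2 video frames, dimension 2.
audio = [[0.2, 0.1], [0.4, 0.3], [0.1, 0.5]]
video = [[0.3, 0.2], [0.6, 0.1]]

audio_enh = cross_attend(audio, video)   # audio queries, video context
video_enh = cross_attend(video, audio)   # video queries, audio context
audio_vec = dual_pool(audio_enh)
video_vec = dual_pool(video_enh)
fused = gated_fusion(audio_vec, video_vec, [1.0] * 4, [1.0] * 4, [0.0] * 4)
print(fused)
```

In a trained model the gate weights and attention projections would be learned parameters, and the fused vector would feed a classifier head; the fixed weights here only show the data flow.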

Chinese Library Classification (CLC) number: