Systems Engineering and Electronics ›› 2025, Vol. 47 ›› Issue (11): 3543-3550. doi: 10.12305/j.issn.1001-506X.2025.11.03

• Electronic Technology •

Acoustic scene classification based on cross-modal attention and gating fusion

Juan WEI1,*, Huiwen ZHOU1, Fangli NING2

  1. School of Communication Engineering, Xidian University, Xi’an 710071, China
  2. School of Mechanical Engineering, Northwestern Polytechnical University, Xi’an 710072, China
  • Received: 2025-04-10 Online: 2025-11-25 Published: 2025-12-08
  • Contact: Juan WEI E-mail: weijuan@xidian.edu.cn

Abstract:

To address the insufficient capture of inter-modal correlation and the inefficient feature fusion in acoustic scene classification, an acoustic scene classification model based on cross-modal attention and gating fusion is proposed. The model enables bidirectional interaction between the acoustic and visual modalities through a cross-modal attention module, dynamically capturing their correlation. A gating fusion module dynamically adjusts the weights of the acoustic and visual modalities to achieve adaptive feature fusion, and residual enhancement and a dual-path pooling strategy are introduced to improve feature robustness. The proposed model is compared with leading methods for the same task in terms of accuracy, frame rate, and number of model parameters. The simulation results show that the proposed model achieves better overall classification performance than the other methods while maintaining high accuracy, which demonstrates its effectiveness and practicality.
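
As a rough illustration of the pipeline described in the abstract, the following minimal sketch wires together bidirectional cross-modal attention, residual enhancement, dual-path pooling, and gated fusion in plain Python. It is not the authors' implementation. Several details are assumptions made only for illustration: the attention has no learned projections, both modalities share one feature dimension, "dual-path pooling" is taken to mean mean pooling plus max pooling over time, and the gate is a per-dimension sigmoid with hypothetical parameters `w_a`, `w_v`, and `bias`. All function names are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context):
    # Scaled dot-product attention: each query frame (one modality)
    # attends over all frames of the other modality (context).
    # Assumption: no learned Q/K/V projections, equal feature size.
    scale = math.sqrt(len(queries[0]))
    dims = range(len(context[0]))
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in context]
        weights = softmax(scores)
        out.append([sum(w * c[d] for w, c in zip(weights, context)) for d in dims])
    return out

def add(x, y):
    # Element-wise sum of two frame sequences (residual connection).
    return [[a + b for a, b in zip(r, s)] for r, s in zip(x, y)]

def dual_pool(frames):
    # Assumed reading of "dual-path pooling": mean and max over time,
    # concatenated into one vector of twice the feature size.
    dims = range(len(frames[0]))
    mean = [sum(f[d] for f in frames) / len(frames) for d in dims]
    peak = [max(f[d] for f in frames) for d in dims]
    return mean + peak

def gated_fusion(a, v, w_a, w_v, bias):
    # Per-dimension sigmoid gate g weights audio against video:
    # fused = g * a + (1 - g) * v.
    fused = []
    for d in range(len(a)):
        g = 1.0 / (1.0 + math.exp(-(w_a[d] * a[d] + w_v[d] * v[d] + bias[d])))
        fused.append(g * a[d] + (1.0 - g) * v[d])
    return fused

def fuse(audio, video, w_a, w_v, bias):
    # Bidirectional interaction: audio queries video and vice versa,
    # each followed by a residual addition, pooling, then gating.
    audio_enh = add(audio, cross_attention(audio, video))
    video_enh = add(video, cross_attention(video, audio))
    return gated_fusion(dual_pool(audio_enh), dual_pool(video_enh), w_a, w_v, bias)

# Toy inputs: 3 audio frames and 2 video frames, feature size 2.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video = [[0.5, 0.5], [1.0, -1.0]]
zeros = [0.0] * 4  # pooled vectors have 2 * feature size entries
fused = fuse(audio, video, zeros, zeros, zeros)
```

With all gate parameters set to zero, the gate equals 0.5 in every dimension, so the fused vector is the plain average of the two pooled representations. In a trained model the gate parameters would be learned, which is what would make the weighting adaptive.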

Key words: acoustic scene classification, cross-modal attention, dynamic gating, adaptive fusion

