Acoustic scene classification based on adaptive multi-branch convolution

doi:10.12305/j.issn.1001-506X.2025.10.03

Abstract

To address the insufficient feature representation ability of models in acoustic scene classification, a network architecture based on adaptive multi-branch convolution is proposed. First, multiple branches extract features separately, and dynamic weights adaptively rebalance the branches to improve feature perception. Second, because existing models ignore the relationships among classes during classification, a coarse-grained classifier is introduced to assist in training the original classification model, and the classification process is strengthened by fusing the results. The method is trained and tested on the TUT2020 mobile development dataset. Experimental results show that the proposed model improves accuracy by 6.5 percentage points over the algorithm before optimization, indicating that the method effectively improves overall classification performance.

Key words: acoustic scene classification, convolutional neural network, adaptive feature fusion, hierarchical structure
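The adaptive balancing of branches described in the abstract can be illustrated with a minimal sketch. It assumes that each branch has one learnable score, that the dynamic weights are a softmax over those scores, and that branch outputs are combined by a weighted sum; none of this is specified in the abstract, and the names `softmax` and `fuse_branches` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_branches(branch_outputs, branch_scores):
    """Weighted sum of equally sized branch feature vectors.

    Assumption: weights = softmax(branch_scores). The weighting scheme
    actually used in the paper is not given in this excerpt.
    """
    weights = softmax(branch_scores)
    fused = [0.0] * len(branch_outputs[0])
    for w, feats in zip(weights, branch_outputs):
        for i, v in enumerate(feats):
            fused[i] += w * v
    return fused, weights

# Three hypothetical branches (e.g. different kernel sizes) with toy features.
branches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
fused, weights = fuse_branches(branches, [0.0, 0.0, 0.0])
```

With equal scores the weights are uniform and the fused vector is the plain average of the branches; during training, unequal scores would shift the balance toward the more useful branches.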

Juan WEI, Dehua HE, Fangli NING. Acoustic scene classification based on adaptive multi-branch convolution[J]. Systems Engineering and Electronics, 2025, 47(10): 3148-3154.

References

1	JIANG G, MA Z C, MAO Q R, et al. Multi-level distance embedding learning for robust acoustic scene classification with unseen devices[J]. Pattern Analysis and Applications, 2023, 26(3): 1089-1099. doi: 10.1007/s10044-023-01172-w
2	DING B Y, ZHANG T, WANG C, et al. Acoustic scene classification: a comprehensive survey[J]. Expert Systems with Applications, 2024, 238: 121902. doi: 10.1016/j.eswa.2023.121902
3	JATI A, NADARAJAN A, PERI R, et al. Temporal dynamics of workplace acoustic scenes: egocentric analysis and prediction[J]. IEEE/ACM Trans. on Audio, Speech, and Language Processing, 2021, 29: 756-769. doi: 10.1109/TASLP.2021.3050265
4	LIU L F, YANG H X, QI X G. Time-frequency domain feature extraction algorithm based on linear discriminant analysis[J]. Systems Engineering and Electronics, 2019, 41(10): 2184-2190. doi: 10.3969/j.issn.1001-506X.2019.10.05
5	SZEGEDY C, LIU W, JIA Y Q, et al. Going deeper with convolutions[C]//Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
6	LUO W J, LI Y J, URTASUN R, et al. Understanding the effective receptive field in deep convolutional neural networks[J]. Advances in Neural Information Processing Systems, 2016, 29: 4905-4913.
7	KOUTINI K, EGHBAL-ZADEH H, DORFER M, et al. The receptive field as a regularizer in deep convolutional neural networks for acoustic scene classification[C]//Proc. of the 27th European Signal Processing Conference, 2019.
8	SZEGEDY C, VANHOUCKE V, IOFFE S, et al. Rethinking the inception architecture for computer vision[C]//Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 2818-2826.
9	XIAN Y, SUN Y, WANG W W, et al. A multi-scale feature recalibration network for end-to-end single channel speech enhancement[J]. IEEE Journal of Selected Topics in Signal Processing, 2020, 15(1): 143-155.
10	SHIM H J, JUNG J W, KIM J H, et al. Attentive max feature map and joint training for acoustic scene classification[C]//Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2022: 1036-1040.
11	DONG X Y, YAN Y, TAN M K, et al. Late fusion via subspace search with consistency preservation[J]. IEEE Trans. on Image Processing, 2018, 28(1): 518-528.
12	PASEDDULA C, GANGASHETTY S V. Late fusion framework for acoustic scene classification using LPCC, SCMC, and log-Mel band energies with deep neural networks[J]. Applied Acoustics, 2021, 172: 107568. doi: 10.1016/j.apacoust.2020.107568
13	SUH S, PARK S, JEONG Y, et al. Designing acoustic scene classification models with CNN variants[R]. Tokyo: Detection and Classification of Acoustic Scenes and Events Challenge, 2020.
14	HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778.
15	DING B Y, ZHANG T, LIU G J, et al. Late fusion for acoustic scene classification using swarm intelligence[J]. Applied Acoustics, 2022, 192: 108698. doi: 10.1016/j.apacoust.2022.108698
16	ALAMIR M A. A novel acoustic scene classification model using the late fusion of convolutional neural networks and different ensemble classifiers[J]. Applied Acoustics, 2021, 175: 107829. doi: 10.1016/j.apacoust.2020.107829
17	CHEN C, LI B. A transform module to enhance lightweight attention by expanding receptive field[J]. Expert Systems with Applications, 2024, 248(8): 123359. doi: 10.1016/j.eswa.2024.123359
18	MOROCUTTI T, SCHMID F, KOUTINI K, et al. Device-robust acoustic scene classification via impulse response augmentation[C]//Proc. of the 31st European Signal Processing Conference, 2023: 176-180.
19	HU H, YANG C H H, XIA X, et al. A two-stage approach to device-robust acoustic scene classification[C]//Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021: 845-849.
20	CAI Y Q, LIN M Y, ZHU C Y, et al. DCASE2023 task1 submission: device simulation and time-frequency separable convolution for acoustic scene classification[R]. Tampere: Detection and Classification of Acoustic Scenes and Events Challenge, 2023.
21	PHAYE S S R, BENETOS E, WANG Y. SubSpectralNet – using sub-spectrogram based convolutional neural networks for acoustic scene classification[C]//Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2019: 825-829.
22	FEI H B, WU W G, LI P, et al. Acoustic scene classification method based on Mel-spectrogram separation and LSCNet[J]. Journal of Harbin Institute of Technology, 2022, 54(5): 124-130. doi: 10.11918/202104081
23	ZHANG B X, WANG Z R, LING Y G, et al. ShuffleTrans: patch-wise weight shuffle for transparent object segmentation[J]. Neural Networks, 2023, 167: 199-212. doi: 10.1016/j.neunet.2023.08.011
24	SCHMID F, MASOUDIAN S, KOUTINI K, et al. CPJKU submission to DCASE22: distilling knowledge for low complexity convolutional neural networks from a patchout audio transformer[R]. Nancy: Detection and Classification of Acoustic Scenes and Events Challenge, 2022.
25	LOSHCHILOV I, HUTTER F. Decoupled weight decay regularization[C]//Proc. of the International Conference on Learning Representations, 2017.
26	SHAO Y F, MA X X, MA Y, et al. Deep semantic learning for acoustic scene classification[J]. EURASIP Journal on Audio, Speech, and Music Processing, 2024, 2024: 1. doi: 10.1186/s13636-023-00323-5
27	ZHANG H, CISSE M, DAUPHIN Y N, et al. Mixup: beyond empirical risk minimization[EB/OL]. [2024-07-16]. https://arxiv.org/abs/1710.09412.
28	KIM B G, YANG S H, KIM J H, et al. Domain generalization with relaxed instance frequency-wise normalization for multi-device acoustic scene classification[EB/OL]. [2024-07-16]. https://arxiv.org/abs/2206.12513.
29	HEITTOLA T, MESAROS A, VIRTANEN T. Acoustic scene classification in DCASE 2020 challenge: generalization across devices and low complexity solutions[EB/OL]. [2024-07-16]. https://arxiv.org/abs/2005.14623.
30	PHAM L, NGO D, SALOVIC D, et al. Lightweight deep neural networks for acoustic scene classification and an effective visualization for presenting sound scene contexts[J]. Applied Acoustics, 2023, 211: 109489. doi: 10.1016/j.apacoust.2023.109489

Name	Fine-grained classifier architecture
Input	Input
Convolutional layer	$\begin{array}{l} 5 \times 5 @ 128 \\ \mathrm{BN}, \mathrm{ReLU} \end{array}$
Stage 1	$\left(\begin{array}{c} A(3 \times 3 @ 128), \mathrm{BN} \\ A(1 \times 1 @ 128), \mathrm{BN} \\ \mathrm{max}\;\mathrm{pool} \end{array}\right)$
Stage 1	$\left(\begin{array}{c} A(3 \times 3 @ 128), \mathrm{BN} \\ A(3 \times 3 @ 128), \mathrm{BN} \end{array}\right) \times 3$
Stage 2	$\left(\begin{array}{c} A(3 \times 3 @ 256), \mathrm{BN} \\ A(1 \times 1 @ 256), \mathrm{BN} \end{array}\right)$
Stage 2	$\left(\begin{array}{c} A(1 \times 1 @ 256), \mathrm{BN} \\ A(1 \times 1 @ 256), \mathrm{BN} \end{array}\right) \times 3$
Stage 3	$\left(\begin{array}{c} A(1 \times 1 @ 512), \mathrm{BN} \\ A(1 \times 1 @ 512), \mathrm{BN} \end{array}\right) \times 4$
Average pooling layer	Average pool
Fully connected layer	$1 \times 1 @ 10, \mathrm{BN}, \mathrm{softmax}$
Output layer	Output

Data augmentation method	Accuracy
No data augmentation	0.693
Mixup	0.702
Freq-Mixstyle	0.714
Mixup+Freq-Mixstyle	0.719
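For readers unfamiliar with Mixup [27], the following is a minimal sketch of the basic operation: two training examples and their one-hot labels are blended with a single coefficient drawn from a Beta(α, α) distribution. The value `alpha=0.3`, the toy feature vectors, and the function name `mixup` are assumptions for illustration; the settings used in the paper are not given here, and Freq-Mixstyle is not covered by this sketch.

```python
import random

def mixup(x_a, y_a, x_b, y_b, alpha=0.3, rng=random):
    """Blend two examples and their one-hot labels with one Beta-sampled weight.

    alpha is a hypothetical setting, not taken from the paper.
    """
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam

rng = random.Random(0)  # fixed seed so the sketch is reproducible
x_mix, y_mix, lam = mixup(
    [0.2, 0.8, 0.5], [1.0, 0.0],   # toy features and label of example A
    [0.6, 0.1, 0.9], [0.0, 1.0],   # toy features and label of example B
    rng=rng,
)
```

The mixed label remains a valid probability vector, so the usual cross-entropy loss can be applied to it unchanged.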

Model	A	B	C	S1	S2	S3	S4	S5	S6	Average accuracy	Prediction time/s
RFRNet	0.791	0.690	0.693	0.682	0.654	0.709	0.693	0.691	0.630	0.692	2.1
AMB_Net	0.809	0.748	0.735	0.724	0.679	0.715	0.700	0.694	0.686	0.719	2.5

Model	A	B	C	S1	S2	S3	S4	S5	S6	Average accuracy
Coarse-grained network	0.912	0.939	0.909	0.912	0.933	0.909	0.948	0.906	0.936	0.923
Fine-grained network	0.809	0.748	0.735	0.724	0.679	0.715	0.700	0.694	0.686	0.719
Two-stage classification system	0.824	0.781	0.787	0.761	0.727	0.748	0.748	0.733	0.703	0.757
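One plausible way to fuse coarse-grained and fine-grained outputs is sketched below: each fine-class probability is multiplied by the probability of its parent coarse class and the result is renormalized. This is an assumption, not the fusion rule confirmed by the source. The three-way grouping (indoor, outdoor, transportation) in `COARSE_OF`, the function name `fuse`, and all probability values are hypothetical.

```python
# Hypothetical grouping of the ten scene classes into three coarse classes.
COARSE_OF = {
    "airport": "indoor", "shopping_mall": "indoor", "metro_station": "indoor",
    "park": "outdoor", "public_square": "outdoor",
    "street_pedestrian": "outdoor", "street_traffic": "outdoor",
    "bus": "transportation", "metro": "transportation", "tram": "transportation",
}

def fuse(fine_probs, coarse_probs):
    """Scale each fine-class probability by its parent's coarse probability,
    then renormalize so the fused scores sum to 1 (assumed fusion rule)."""
    scores = {c: p * coarse_probs[COARSE_OF[c]] for c, p in fine_probs.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Toy case: the fine-grained classifier slightly prefers the wrong class.
fine = {c: 0.25 / 8 for c in COARSE_OF}
fine["shopping_mall"] = 0.40
fine["street_pedestrian"] = 0.35
coarse = {"indoor": 0.15, "outdoor": 0.80, "transportation": 0.05}

fused = fuse(fine, coarse)
fine_pred = max(fine, key=fine.get)     # prediction before fusion
fused_pred = max(fused, key=fused.get)  # prediction after fusion
```

In this toy case the fine-grained classifier alone picks `shopping_mall`, while the fused scores pick `street_pedestrian` because the coarse classifier is confident the scene is outdoors. A highly accurate coarse stage, such as the 0.923 average in the table above, is what makes this kind of correction worthwhile.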

Class	DCASE2020 baseline[29]	RFRNet	Fine-grained network	Two-stage network
Airport	0.450	0.628	0.676	0.780
Bus	0.629	0.874	0.909	0.912
Metro	0.535	0.789	0.801	0.838
Metro station	0.530	0.757	0.771	0.832
Park	0.713	0.732	0.764	0.764
Public square	0.449	0.488	0.552	0.596
Shopping mall	0.483	0.495	0.505	0.562
Street pedestrian	0.298	0.587	0.599	0.650
Street traffic	0.799	0.868	0.872	0.875
Tram	0.522	0.706	0.733	0.760
Average	0.541	0.692	0.719	0.757
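The per-class gains of the two-stage network over RFRNet can be checked directly from the table above. The short script below copies the two relevant columns and computes the gain per class and its mean; the variable names are illustrative.

```python
# Per-class accuracies copied from the table above: (RFRNet, two-stage network).
ACC = {
    "airport": (0.628, 0.780),
    "bus": (0.874, 0.912),
    "metro": (0.789, 0.838),
    "metro_station": (0.757, 0.832),
    "park": (0.732, 0.764),
    "public_square": (0.488, 0.596),
    "shopping_mall": (0.495, 0.562),
    "street_pedestrian": (0.587, 0.650),
    "street_traffic": (0.868, 0.875),
    "tram": (0.706, 0.760),
}

# Absolute accuracy gain per class and its unweighted mean.
gains = {c: two_stage - rfr for c, (rfr, two_stage) in ACC.items()}
mean_gain = sum(gains.values()) / len(gains)
best_class = max(gains, key=gains.get)

for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {gain:+.3f}")
print(f"mean gain: {mean_gain:+.4f}")
```

The mean gain is about 0.065, consistent with the difference between the reported averages (0.757 versus 0.692), and every class improves, with the largest gain on the airport class.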