Spectrogram extraction method for acoustic scene data based on neural network

doi:10.12305/j.issn.1001-506X.2021.12.06

Abstract

In complex acoustic scene classification (ASC) tasks, deep convolutional neural networks that take the Mel spectrum as input achieve good recognition performance. However, the Mel filter bank is designed around the physiological characteristics of the human ear and is not the optimal filter bank for ASC. To address this, a spectrogram extraction neural network (SENN) is proposed to replace the traditional Mel-spectrum extraction process; training this model automatically adapts the spectrogram to the acoustic scene data set. SENN is connected to ResNet50 to form the ASC architecture, and the DCASE2019 acoustic scene data set is used for training and testing. The experimental results show that this architecture achieves a higher recognition rate than traditional models and can effectively adjust the frequency curve, the amplitude of the filters, and the filter shape.
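For orientation, the traditional Mel-spectrum extraction that SENN is meant to replace can be sketched as below. This is a minimal NumPy illustration, not the authors' implementation: the function names, the FFT size, the hop length, and the 40-filter bank are all assumptions chosen for the example. In a learnable variant, the entries of `bank` would presumably be initialised this way and then updated by training, which is how filter amplitude and shape could drift away from the fixed Mel design; the training loop is omitted here.

```python
import numpy as np

def hz_to_mel(f):
    # Common Mel-scale formula (one of several conventions in use).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Triangular Mel filters, shape (n_filters, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    # n_filters + 2 edge frequencies, equally spaced on the Mel scale.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2))
    bank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        # Triangle: min of the two slopes, clipped at zero outside [lo, hi].
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def power_spectrogram(wave, n_fft, hop):
    """Hann-windowed short-time power spectrum, shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[t * hop: t * hop + n_fft] * window for t in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def log_filterbank_spectrogram(wave, bank, n_fft=1024, hop=512):
    # Fixed (non-trainable) projection of the power spectrum onto the filters.
    power = power_spectrogram(wave, n_fft, hop)
    return np.log(power @ bank.T + 1e-10)

# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
wave = np.sin(2.0 * np.pi * 440.0 * np.arange(sample_rate) / sample_rate)
bank = mel_filter_bank(n_filters=40, n_fft=1024, sample_rate=sample_rate)
spec = log_filterbank_spectrogram(wave, bank)
print(spec.shape)
```

The resulting array (frames by filters) is the kind of input a 2D network such as ResNet50 would consume after reshaping.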

Key words: acoustic scene classification (ASC), deep convolutional neural network (DCNN), spectrogram extraction neural network (SENN), Mel-spectrum

CLC Number: TN929.53

Juan WEI, Zhikai DING, Fangli NING. Spectrogram extraction method for acoustic scene data based on neural network[J]. Systems Engineering and Electronics, 2021, 43(12): 3462-3469.

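Table 1
Experiment No.	Recognition model	Model input	Structure	Recognition rate (%)	Number of parameters	Average prediction time (s)
1	1D convolutional network	Raw waveform	10conv1d+2fc	58.58	5,961,038	0.0352
2	CLDNN	Raw waveform	conv1d+conv2d+3lstm+4fc	59.62	16,232,220	0.0783
3	EnvNet	Raw waveform	3conv1d+resnet50+2fc	69.06	26,735,626	0.1124
4	ResNet50	MFCC	resnet50+2fc	80.13	24,635,658	0.1047
5	ResNet50	STFT	reshape+resnet50+2fc	85.07	24,635,658	0.1038
6	VGG19	MFSC	vgg19+2fc	85.24	20,291,018	0.0893
7	ResNet50	MFSC	resnet50+2fc	89.17	24,635,658	0.1042
8	Spectrogram extraction recognition architecture 1	Raw waveform	MFSCNN+resnet50+2fc (trained: filter amplitude)	90.59	28,965,388	0.1219
9	Spectrogram extraction recognition architecture 2	Raw waveform	MFFTNN+resnet50+2fc (trained: filter amplitude, filter shape)	91.81	28,834,203	0.1212
10	Spectrogram extraction recognition architecture 2	Raw waveform	MFFTNN+resnet50+2fc (trained: filter amplitude, filter shape, frequency curve)	93.40	28,834,203	0.1212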
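
To make the trade-off in the results table above concrete, the short calculation below compares experiment 10 (all three filter properties trained) against experiment 7 (ResNet50 on a fixed MFSC input). The numbers are copied from those two rows; the variable names are illustrative only.

```python
# Values copied from experiments 7 and 10 of the results table.
baseline = {"accuracy": 89.17, "params": 24_635_658, "time_s": 0.1042}
proposed = {"accuracy": 93.40, "params": 28_834_203, "time_s": 0.1212}

# Absolute gain in recognition rate, in percentage points.
accuracy_gain = proposed["accuracy"] - baseline["accuracy"]

# Cost of the learned front end: extra parameters and extra prediction time.
extra_params = proposed["params"] - baseline["params"]
param_overhead = extra_params / baseline["params"]
time_overhead = (proposed["time_s"] - baseline["time_s"]) / baseline["time_s"]

print(f"accuracy gain: {accuracy_gain:.2f} percentage points")
print(f"extra parameters: {extra_params} ({param_overhead:.1%})")
print(f"prediction time overhead: {time_overhead:.1%}")
```

Under these figures, the learned front end gains about 4.2 percentage points of recognition rate while adding roughly 17% more parameters and about 16% more prediction time.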
