Journal: Journal of Theoretical and Applied Information Technology
Print ISSN: 1992-8645
Electronic ISSN: 1817-3195
Publication year: 2018
Volume: 96
Issue: 4
Pages: 1114
Publisher: Journal of Theoretical and Applied
Abstract: Voice activity detection (VAD) is implemented in the preprocessing stage of various speech applications to identify speech and non-speech periods. Recently, deep neural networks (DNNs) have been utilized for VAD given their superior performance over other methods. To separate speech from non-speech periods, a DNN relies on its input features to discriminate speech from noise. Hence, different features have been used as input for DNN-based VAD. However, the contribution and effectiveness of such features have not been thoroughly evaluated. In this paper, we address these aspects by comparing five features that are widely used in speech processing, namely, log power spectra, filter bank, mel-frequency cepstral coefficients, relative spectral perceptual linear predictive analysis, and amplitude modulation spectrogram, and evaluating their performance in a DNN-based VAD. Experiments on the TIMIT speech corpus show that the amplitude modulation spectrogram performs best, retaining high accuracy even on speech data with a low signal-to-noise ratio. The next best feature is log power spectra, which can be considered a raw feature because it requires less calculation and processing than the other features. This suggests that raw features may be suitable inputs for DNN-based VAD. Moreover, limiting the number and processing of features for DNNs may improve the system performance, real-time applicability, and portability of VAD by reducing the computational cost and the required memory and storage.
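To illustrate why log power spectra can be regarded as a raw feature, the following is a minimal sketch of how such a feature could be computed per frame. It is not the authors' implementation: the function names, the Hann window, the frame length, the hop size, and the toy input signal are all assumptions made for illustration, and a naive DFT stands in for the FFT routine a real system would use.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def log_power_spectrum(frame, eps=1e-10):
    """Return the log power of the non-negative frequency bins of one frame."""
    n = len(frame)
    # Hann window to reduce spectral leakage (an assumption, not from the paper).
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        # Naive DFT for bin k; a practical system would call an FFT instead.
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        # eps avoids log(0) on silent frames.
        spectrum.append(math.log(abs(acc) ** 2 + eps))
    return spectrum


# Toy input (hypothetical values): a 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]
frames = frame_signal(signal, frame_len=64, hop=32)
features = [log_power_spectrum(f) for f in frames]
```

Each entry of `features` is one frame's feature vector, which could then be fed to a classifier. The only steps are framing, windowing, a Fourier transform, and a logarithm, with no filter bank or cepstral stage on top, which is the sense in which this feature needs less processing than the others compared.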