
Article Information

  • Title: Towards cross-modal pre-training and learning tempo-spatial characteristics for audio recognition with convolutional and recurrent neural networks
  • Authors: Shahin Amiriparian; Maurice Gerczuk; Sandra Ottl
  • Journal: EURASIP Journal on Audio, Speech, and Music Processing
  • Print ISSN: 1687-4714
  • Electronic ISSN: 1687-4722
  • Year: 2020
  • Volume: 2020
  • Issue: 1
  • Pages: 1-11
  • DOI:10.1186/s13636-020-00186-0
  • Publisher: Hindawi Publishing Corporation
  • Abstract: In this paper, we investigate the performance of two deep learning paradigms for the audio-based tasks of acoustic scene, environmental sound and domestic activity classification. In particular, a convolutional recurrent neural network (CRNN) and pre-trained convolutional neural networks (CNNs) are utilised. The CRNN is directly trained on Mel-spectrograms of the audio samples. For the pre-trained CNNs, the activations of one of the top layers of various architectures are extracted as feature vectors and used for training a linear support vector machine (SVM). Moreover, the predictions of the two models—the class probabilities predicted by the CRNN and the decision function of the SVM—are combined in a decision-level fusion to achieve the final prediction. For the pre-trained CNNs we use as feature extractors, we further evaluate the effects of a range of configuration options, including the choice of the pre-training corpus. The system is evaluated on the acoustic scene classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017) workshop, ESC-50 and the multi-channel acoustic recordings from DCASE 2018, task 5. We have refrained from additional data augmentation as our primary goal is to analyse the general performance of the proposed system on different datasets. We show that using our system, it is possible to achieve competitive performance on all datasets and demonstrate the complementarity of CRNNs and ImageNet pre-trained CNNs for acoustic classification tasks. We further find that in some cases, CNNs pre-trained on ImageNet can serve as more powerful feature extractors than AudioSet models. Finally, ImageNet pre-training is complementary to more domain-specific knowledge, either in the form of the CRNN trained directly on the target data or the AudioSet pre-trained models. In this regard, our findings indicate possible benefits of applying cross-modal pre-training of large CNNs to acoustic analysis tasks. (A schematic code sketch of the two-branch fusion described here follows this metadata list.)
  • Keywords: Domestic activity classification; Deep learning; Convolutional recurrent neural networks; Deep spectrum; Decision-level fusion
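
The abstract describes a two-branch system: activations of an ImageNet pre-trained CNN used as features for a linear SVM, a CRNN trained on Mel-spectrograms, and a weighted decision-level fusion of the two sets of scores. The Python sketch below illustrates that pipeline under stated assumptions; it is not the authors' released code, and the network sizes, fusion weight alpha, spectrogram shapes and the random tensors standing in for real DCASE/ESC-50 data are illustrative placeholders.

    # Minimal sketch of the two-branch system from the abstract (assumed details).
    import numpy as np
    import torch
    import torch.nn as nn
    import torchvision.models as models
    from sklearn.svm import LinearSVC

    def cnn_features(spec_batch: torch.Tensor, extractor: nn.Module) -> np.ndarray:
        """Top-layer activations of a pre-trained CNN (Deep-Spectrum style).
        spec_batch: (N, 3, 224, 224) spectrogram 'images'."""
        with torch.no_grad():
            feats = extractor(spec_batch)            # (N, 2048, 1, 1) for ResNet-50
        return feats.flatten(start_dim=1).numpy()    # (N, 2048)

    class CRNN(nn.Module):
        """Small convolutional recurrent network over log-Mel spectrograms (assumed sizes)."""
        def __init__(self, n_mels: int = 128, n_classes: int = 10):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
            )
            self.gru = nn.GRU(input_size=64 * (n_mels // 4), hidden_size=128,
                              batch_first=True)
            self.fc = nn.Linear(128, n_classes)

        def forward(self, x):                        # x: (N, 1, n_mels, T)
            h = self.conv(x)                         # (N, 64, n_mels/4, T/4)
            h = h.permute(0, 3, 1, 2).flatten(2)     # (N, T/4, 64 * n_mels/4)
            _, last = self.gru(h)                    # last hidden state of the GRU
            return self.fc(last.squeeze(0))          # class logits

    # Branch 1: ImageNet pre-trained CNN as a fixed feature extractor + linear SVM
    resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2).eval()
    extractor = nn.Sequential(*list(resnet.children())[:-1])   # drop the classifier head

    # Hypothetical data standing in for spectrogram images and labels of a real dataset
    X_img_train = torch.rand(10, 3, 224, 224)
    y_train = np.arange(10)                          # one example per assumed class
    svm = LinearSVC().fit(cnn_features(X_img_train, extractor), y_train)

    # Branch 2: CRNN over log-Mel spectrograms (assumed to be already trained)
    crnn = CRNN(n_mels=128, n_classes=10)
    X_mel_test = torch.rand(4, 1, 128, 256)
    X_img_test = torch.rand(4, 3, 224, 224)

    crnn_probs = torch.softmax(crnn(X_mel_test), dim=1).detach().numpy()
    svm_scores = svm.decision_function(cnn_features(X_img_test, extractor))
    # Map the SVM decision values to a probability-like scale before fusing.
    svm_probs = np.exp(svm_scores) / np.exp(svm_scores).sum(axis=1, keepdims=True)

    # Decision-level fusion: weighted sum of the two branches' scores.
    alpha = 0.5                                       # assumed fusion weight
    fused = alpha * crnn_probs + (1 - alpha) * svm_probs
    predictions = fused.argmax(axis=1)

In the paper, the fusion combines the CRNN class probabilities with the SVM decision function; in this sketch both branches are mapped to a probability-like scale before the weighted sum so that they contribute on a comparable footing.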