Ensemble of jointly trained deep neural network-based acoustic models for reverberant speech recognition

Moa Lee, Jeehye Lee, Joon-Hyuk Chang

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

Distant speech recognition is challenging, particularly because speech signals are corrupted by reverberation when the distance between the speaker and the microphone is large. To cope with a wide range of reverberation conditions in real-world situations, we present novel approaches to acoustic modeling: an ensemble of deep neural networks (DNNs) and an ensemble of jointly trained DNNs. First, multiple DNNs are designed in a setup step, each of which copes with a different reverberation time (RT60). Each model in the ensemble of DNN acoustic models is then further jointly trained, covering both feature mapping and acoustic modeling, where the feature mapping serves as a dereverberation front-end. In the testing phase, the outputs of the ensemble of DNNs are combined by weighted averaging, with the weights given by the RT60 prediction probabilities obtained from a convolutional neural network (CNN). In other words, the posterior probability outputs of the DNNs are combined as a weighted average using the CNN-based weights. Extensive experiments demonstrate that the proposed approach yields substantial improvements in speech recognition accuracy over conventional DNN baseline systems under diverse reverberant conditions. The experiments are performed on the Aurora-4 and CHiME-4 databases.
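The abstract describes combining the senone posteriors of RT60-specific acoustic models via a weighted average whose weights are the CNN's RT60 class probabilities. The following is a minimal sketch of that combination step only; the function name, array shapes, and NumPy implementation are illustrative assumptions, not the authors' code.

    import numpy as np

    def combine_ensemble(dnn_posteriors, rt60_weights):
        # dnn_posteriors: (num_models, num_frames, num_senones) array,
        # where dnn_posteriors[k] holds the frame-level senone posteriors
        # of the acoustic model trained for the k-th RT60 condition.
        # rt60_weights: (num_models,) CNN-estimated RT60 class
        # probabilities for the utterance, assumed to sum to 1.
        # Contract the model axis: p(s|x) = sum_k w_k * p_k(s|x).
        return np.tensordot(rt60_weights, dnn_posteriors, axes=1)

    # Toy usage: 3 RT60-specific models, 100 frames, 2000 senones.
    rng = np.random.default_rng(0)
    posts = rng.random((3, 100, 2000))
    posts /= posts.sum(axis=-1, keepdims=True)      # normalize per frame
    weights = np.array([0.7, 0.2, 0.1])             # hypothetical CNN output
    combined = combine_ensemble(posts, weights)
    assert np.allclose(combined.sum(axis=-1), 1.0)  # still a distribution

Because the weights sum to one and each model's output is a per-frame distribution, the combined output remains a valid posterior distribution that can be passed to the decoder unchanged.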

Original language: English
Pages (from-to): 1-9
Number of pages: 9
Journal: Digital Signal Processing: A Review Journal
Volume: 85
DOI: 10.1016/j.dsp.2018.11.005
State: Published - 2019 Feb 1

Keywords

  • Convolutional neural network
  • Deep neural network
  • Ensemble acoustic model
  • Joint training
  • Reverberant speech recognition

Cite this

Lee, Moa; Lee, Jeehye; Chang, Joon-Hyuk. Ensemble of jointly trained deep neural network-based acoustic models for reverberant speech recognition. In: Digital Signal Processing: A Review Journal, Vol. 85, 01.02.2019, p. 1-9. DOI: 10.1016/j.dsp.2018.11.005.
