Dynamic temporal residual learning for speech recognition

Jiaqi Xie, Ruijie Yan, Shanyu Xiao, Liangrui Peng, Michael T. Johnson, Wei Qiang Zhang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


Abstract

Long short-term memory (LSTM) networks have been widely used in automatic speech recognition (ASR). This paper proposes a novel dynamic temporal residual learning mechanism for LSTM networks to better capture temporal dependencies in sequential data. The mechanism is implemented by applying shortcut connections with dynamic weights to temporally adjacent LSTM outputs. Two dynamic weight generation methods are proposed: a secondary network and a random weight generator. Experimental results on the Wall Street Journal (WSJ) speech recognition dataset show that the proposed methods outperform the baseline LSTM network.
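The mechanism described in the abstract can be sketched in plain Python. Note that the function names, the single-sigmoid-unit gate, and all parameter values below are illustrative assumptions for exposition, not the paper's actual architecture; the only detail taken from the abstract is the overall scheme of a dynamically weighted shortcut between temporally adjacent LSTM outputs, with the weight produced either by a secondary network or by a random generator:

```python
import math
import random

def dynamic_temporal_residual(lstm_outputs, weight_fn):
    """Apply a dynamically weighted shortcut from each LSTM output vector
    to the previous time step's output: y_t = h_t + w_t * h_{t-1}."""
    outputs = [lstm_outputs[0]]  # first time step has no predecessor
    for t in range(1, len(lstm_outputs)):
        h_prev, h_curr = lstm_outputs[t - 1], lstm_outputs[t]
        w_t = weight_fn(h_prev, h_curr)  # scalar dynamic weight for this step
        outputs.append([c + w_t * p for c, p in zip(h_curr, h_prev)])
    return outputs

def secondary_network(h_prev, h_curr, w=None, b=0.0):
    # Variant 1 (hypothetical): a tiny learned gate -- here a single sigmoid
    # unit over the concatenated adjacent outputs stands in for the paper's
    # secondary network; weights `w` and bias `b` would be trained in practice.
    x = h_prev + h_curr
    w = w if w is not None else [0.1] * len(x)
    z = b + sum(xi * wi for xi, wi in zip(x, w))
    return 1.0 / (1.0 + math.exp(-z))

def random_weight(h_prev, h_curr):
    # Variant 2 (hypothetical): a random weight generator drawing w_t
    # uniformly; the adjacent outputs are ignored by construction.
    return random.uniform(0.0, 1.0)
```

As a usage sketch, `dynamic_temporal_residual(seq, secondary_network)` returns a sequence of the same length whose first vector is unchanged and whose later vectors each add a gated copy of their predecessor.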

Original language: English
Title of host publication: 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Proceedings
Pages: 7709-7713
Number of pages: 5
ISBN (Electronic): 9781509066315
DOIs
State: Published - May 2020
Event: 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Barcelona, Spain
Duration: May 4 2020 - May 8 2020

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2020-May
ISSN (Print): 1520-6149

Conference

Conference: 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020
Country/Territory: Spain
City: Barcelona
Period: 5/4/20 - 5/8/20

Bibliographical note

Publisher Copyright:
© 2020 Institute of Electrical and Electronics Engineers Inc. All rights reserved.

Keywords

  • ASR
  • Dynamic residual learning
  • LSTM

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering
