TY - JOUR
T1 - Speech Enhancement Algorithm Based on a Convolutional Neural Network Reconstruction of the Temporal Envelope of Speech in Noisy Environments
AU - Soleymanpour, Rahim
AU - Soleymanpour, Mohammad
AU - Brammer, Anthony J.
AU - Johnson, Michael T.
AU - Kim, Insoo
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Temporal modulation processing is a promising technique for improving the intelligibility and quality of speech in noise. We propose a speech enhancement algorithm that reconstructs the temporal envelope (TEV) in the time-frequency domain by means of an embedded convolutional neural network (CNN). To accomplish this, the input speech signals are divided into sixteen parallel frequency bands (subbands) with bandwidths approximately 1.5 times those of auditory filters. The corrupted TEV in each subband is extracted and then fed to a one-dimensional CNN (1-D CNN) model that restores the envelope distorted by noise. The method is evaluated using 2,700 words from nine talkers, mixed with speech-spectrum-shaped random noise (SSN) and babble noise at different signal-to-noise ratios. The Short-Time Objective Intelligibility (STOI) and Perceptual Evaluation of Speech Quality (PESQ) metrics are used to evaluate the performance of the 1-D CNN algorithm. Results suggest that the 1-D CNN model improves STOI scores on average by 27% and 34% for SSN and babble noise, respectively, and PESQ scores on average by 19% and 18%, respectively, compared to unprocessed speech. The 1-D CNN model is also shown to outperform a conventional TEV-based speech enhancement algorithm.
AB - Temporal modulation processing is a promising technique for improving the intelligibility and quality of speech in noise. We propose a speech enhancement algorithm that reconstructs the temporal envelope (TEV) in the time-frequency domain by means of an embedded convolutional neural network (CNN). To accomplish this, the input speech signals are divided into sixteen parallel frequency bands (subbands) with bandwidths approximately 1.5 times those of auditory filters. The corrupted TEV in each subband is extracted and then fed to a one-dimensional CNN (1-D CNN) model that restores the envelope distorted by noise. The method is evaluated using 2,700 words from nine talkers, mixed with speech-spectrum-shaped random noise (SSN) and babble noise at different signal-to-noise ratios. The Short-Time Objective Intelligibility (STOI) and Perceptual Evaluation of Speech Quality (PESQ) metrics are used to evaluate the performance of the 1-D CNN algorithm. Results suggest that the 1-D CNN model improves STOI scores on average by 27% and 34% for SSN and babble noise, respectively, and PESQ scores on average by 19% and 18%, respectively, compared to unprocessed speech. The 1-D CNN model is also shown to outperform a conventional TEV-based speech enhancement algorithm.
KW - Speech enhancement
KW - convolutional neural network (CNN)
KW - temporal envelope (TEV)
UR - http://www.scopus.com/inward/record.url?scp=85147440401&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85147440401&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2023.3236242
DO - 10.1109/ACCESS.2023.3236242
M3 - Article
AN - SCOPUS:85147440401
SN - 2169-3536
VL - 11
SP - 5328
EP - 5336
JO - IEEE Access
JF - IEEE Access
ER -