End-to-end speech recognition systems are effective, but training an end-to-end model requires a large amount of data. For applications such as dysarthric speech recognition, sufficient data is not available. In this paper, we propose a specialized data augmentation approach to improve the performance of an end-to-end dysarthric ASR system based on sub-word models. The approach combines two methods: prosodic transformation and time-feature masking. Prosodic transformation modifies the speaking rate and pitch of normal speech to control prosodic characteristics such as loudness, intonation, and rhythm. Time and feature masking applies masks to the Mel Frequency Cepstral Coefficients (MFCCs) for robustness-focused augmentation. Results show that augmenting normal speech with prosodic transformation plus masking decreases CER by 5.4% and WER by 5.6%, and that further adding dysarthric speech masking decreases CER by 11.3% and WER by 11.4%.
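The time and feature masking step can be illustrated with a minimal sketch. The abstract does not give the mask widths, the number of masks, or the fill value, so the function name `mask_time_and_features`, its parameters, the single mask per axis, and the zero fill below are all assumptions made for illustration, not the authors' implementation. The input is assumed to be an MFCC matrix of shape (frames, coefficients).

```python
import numpy as np


def mask_time_and_features(mfcc, max_time_width=20, max_feature_width=3, rng=None):
    """Return a copy of `mfcc` (frames x coefficients) with one time mask
    and one feature mask applied. Widths and zero fill are assumed values."""
    rng = np.random.default_rng() if rng is None else rng
    masked = mfcc.copy()
    n_frames, n_coeffs = masked.shape

    # Time mask: zero a random run of consecutive frames.
    t_width = int(rng.integers(0, min(max_time_width, n_frames) + 1))
    t_start = int(rng.integers(0, n_frames - t_width + 1))
    masked[t_start:t_start + t_width, :] = 0.0

    # Feature mask: zero a random run of consecutive cepstral coefficients.
    f_width = int(rng.integers(0, min(max_feature_width, n_coeffs) + 1))
    f_start = int(rng.integers(0, n_coeffs - f_width + 1))
    masked[:, f_start:f_start + f_width] = 0.0

    return masked


# Usage: synthetic stand-in for a 100-frame, 13-coefficient MFCC matrix.
features = np.random.default_rng(0).normal(size=(100, 13))
augmented = mask_time_and_features(features, rng=np.random.default_rng(1))
```

Each call draws new mask positions, so calling the function several times on the same utterance yields several distinct training examples. The original matrix is left unmodified.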
|Title of host publication||2021 11th International Conference on Speech Technology and Human-Computer Dialogue, SpeD 2021|
|Number of pages||5|
|State||Published - 2021|
|Event||11th International Conference on Speech Technology and Human-Computer Dialogue, SpeD 2021 - Virtual, Bucharest, Romania|
Duration: Oct 13 2021 → Oct 15 2021
Bibliographical note
Funding Information:
This work was supported by the National Institutes of Health under NIDCD R15 DC017296-01.
© 2021 IEEE.
Keywords
- Data augmentation
- Dysarthric ASR
- Speech recognition
ASJC Scopus subject areas
- Language and Linguistics
- Human-Computer Interaction
- Signal Processing
- Safety, Risk, Reliability and Quality