TY - GEN
T1 - Automatic lip-synchronized video-self-modeling intervention for voice disorders
AU - Shen, Ju
AU - Ti, Changpeng
AU - Cheung, Sen Ching S.
AU - Patel, Rita R.
PY - 2012
Y1 - 2012
N2 - Video self-modeling (VSM) is a behavioral intervention technique in which a learner models a target behavior by watching a video of him- or herself. In the field of speech-language pathology, VSM has been used successfully for language treatment in children with autism and for individuals with the fluency disorder of stuttering. Technical challenges remain in creating VSM content that depicts previously unseen behaviors. In this paper, we propose a novel system that synthesizes new video sequences for VSM treatment of patients with voice disorders. Starting with a video recording of a voice-disorder patient, the proposed system replaces the hoarse speech with clean, healthier speech that bears resemblance to the patient's original voice. The replacement speech is synthesized either with a text-to-speech engine or by selecting from a database of clean speech samples based on a voice-similarity metric. To realign the replacement speech with the original video, a novel audiovisual algorithm that combines audio segmentation with lip-state detection is proposed to identify corresponding time markers in the audio and video tracks. Lip synchronization is then accomplished with an adaptive video-resampling scheme that minimizes motion jitter and preserves spatial sharpness. Experimental evaluations on a dataset of 31 subjects demonstrate the effectiveness of the proposed techniques.
KW - audio-visual lip synchronization
KW - video self-modeling
KW - voice disorders
UR - http://www.scopus.com/inward/record.url?scp=84872024312&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84872024312&partnerID=8YFLogxK
U2 - 10.1109/HealthCom.2012.6379415
DO - 10.1109/HealthCom.2012.6379415
M3 - Conference contribution
AN - SCOPUS:84872024312
SN - 9781457720390
T3 - 2012 IEEE 14th International Conference on e-Health Networking, Applications and Services, Healthcom 2012
SP - 244
EP - 249
BT - 2012 IEEE 14th International Conference on e-Health Networking, Applications and Services, Healthcom 2012
T2 - 2012 IEEE 14th International Conference on e-Health Networking, Applications and Services, Healthcom 2012
Y2 - 10 October 2012 through 13 October 2012
ER -