Studies have shown that visual features extracted from the lips of a speaker (visemes) can be used to automatically classify the visual representation of phonemes. Different visual features were extracted from the audio-visual recordings of a set of phonemes and used to define Linear Discriminant Analysis (LDA) functions to classify the phonemes.
Audio-visual recordings from 18 speakers of Native American English for 12 Vowel-Consonant-Vowel (VCV) sounds were obtained using the consonants /b,v,w,ð,d,z/ and the vowels /α,i/. The visual features used in this study were related to the lip height, lip width, motion in upper lips and the rate at which lips move while producing the VCV sequences. Features extracted from half of the speakers were used to design the classifier and features extracted from the other half were used in testing the classifiers.
When each VCV sound was treated as an independent class, resulting in 12 classes, the percentage of correct recognition was 55.3% in the training set and 43.1% in the testing set. This percentage increased as classes were merged based on the level of confusion appearing between them in the results. When the same consonants with different vowels were treated as one class, resulting in 6 classes, the percentage of correct classification was 65.2% in the training set and 61.6% in the testing set. This is consistent with psycho-visual experiments in which subjects were unable to distinguish between visemes associated with VCV words with the same consonant but different vowels. When the VCV sounds were grouped into 3 classes, the percentage of correct classification in the training set was 84.4% and 81.1% in the testing set.
In the second part of the study, linear discriminant functions were developed for every speaker resulting in 18 different sets of LDA functions. For every speaker, five VCV utterances were used to design the LDA functions, and 3 different VCV utterances were used to test these functions. For the training data, the range of correct classification for the 18 speakers was 90-100% with an average of 96.2%. For the testing data, the range of correct classification was 50-86% with an average of 68%.
A step-wise linear discriminant analysis evaluated the contribution of different features towards the dissemination problem. The analysis indicated that classifiers using only the top 7 features in the analysis had a performance drop of 2-5%. The top 7 features were related to the shape of the mouth and the rate of motion of lips when the consonant in the VCV sequence was being produced. Results of this work showed that visual features extracted from the lips can separate the visual representation of phonemes into different classes.
|School:||University of Pittsburgh|
|School Location:||United States -- Pennsylvania|
|Source:||DAI-B 70/12, Dissertation Abstracts International|
|Subjects:||Audiology, Electrical engineering, Acoustics|
|Keywords:||Automated lip reading, Phonemes, Visemes, Visual cues|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved
The supplemental file or files you are about to download were provided to ProQuest by the author as part of a
dissertation or thesis. The supplemental files are provided "AS IS" without warranty. ProQuest is not responsible for the
content, format or impact on the supplemental file(s) on our system. in some cases, the file type may be unknown or
may be a .exe file. We recommend caution as you open such files.
Copyright of the original materials contained in the supplemental file is retained by the author and your access to the
supplemental files is subject to the ProQuest Terms and Conditions of use.
Depending on the size of the file(s) you are downloading, the system may take some time to download them. Please be