Dissertation/Thesis Abstract

Nonlinear discriminant analysis based feature dimensionality reduction for automatic speech recognition
by Hu, Hongbing, Ph.D., State University of New York at Binghamton, 2010, 145 pages; Publication No. 3423029
Abstract (Summary)

Automatic Speech Recognition (ASR) has advanced to the point where state-of-the-art speech recognition algorithms perform reasonably well even for large-vocabulary continuous speech recognition in practical environments. Among speech recognition problems, feature extraction, which compresses a speech signal into streams of acoustical feature vectors, has become even more important for ASR, since acoustical modeling methods are well established and language modeling depends largely on the nature of the target language. The focus of this dissertation is the determination of effective speech features that capture both spectral and temporal variations in speech in a low-dimensional representation for speech recognition tasks.

In this dissertation, a set of spectral-temporal features, namely Discrete Cosine Transform Coefficients (DCTCs) and Discrete Cosine Series Coefficients (DCSCs), is examined for the purpose of capturing both the spectral and temporal variations in speech. Experimental evaluations showed that temporal variations are also of great importance for speech recognition, especially when a long time context is used.
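
As a rough illustration of how such spectral-temporal features can be formed, the sketch below applies a cosine expansion along frequency within each frame, then a second cosine expansion along time across a block of frames. It assumes a plain, unwarped DCT-II basis and a synthetic block of log spectra. The function names, block sizes, and basis are illustrative assumptions, not the dissertation's actual definitions, which this abstract does not give.

```python
import math


def dct_basis(num_terms, length):
    """Rows of an unnormalized DCT-II basis: cos(pi * k * (n + 0.5) / length)."""
    return [[math.cos(math.pi * k * (n + 0.5) / length) for n in range(length)]
            for k in range(num_terms)]


def project(basis, values):
    """Inner product of every basis row with the given sequence."""
    return [sum(b * v for b, v in zip(row, values)) for row in basis]


def spectral_temporal_features(log_spectra, num_dctc, num_dcsc):
    """Reduce a block of frames to num_dctc * num_dcsc coefficients.

    log_spectra: list of frames, each a list of log-magnitude values.
    The plain cosine basis is an assumption made for this sketch.
    """
    num_frames = len(log_spectra)
    num_bins = len(log_spectra[0])
    freq_basis = dct_basis(num_dctc, num_bins)
    time_basis = dct_basis(num_dcsc, num_frames)

    # Spectral step: one short coefficient vector per frame.
    per_frame = [project(freq_basis, frame) for frame in log_spectra]

    # Temporal step: expand each coefficient's trajectory over the block.
    features = []
    for i in range(num_dctc):
        trajectory = [coeffs[i] for coeffs in per_frame]
        features.extend(project(time_basis, trajectory))
    return features


# Synthetic example: 10 frames of 8 log-spectral values each (made-up data).
block = [[math.sin(0.3 * t + 0.5 * f) for f in range(8)] for t in range(10)]
features = spectral_temporal_features(block, num_dctc=3, num_dcsc=2)
print(len(features))
```

In this sketch a longer block of frames corresponds to a longer time context, while the number of temporal terms controls how much of the trajectory detail is retained.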

Additionally, to reduce the limitations of acoustical modeling based on Hidden Markov Models (HMMs), a neural network is utilized as a feature transformer to maximize the discrimination and lessen the correlation of the DCTC/DCSC features. The transformed features lead to a large improvement in phoneme recognition on the TIMIT database, especially when a small number of states and Gaussian mixtures are used for the HMMs.

The neural network feature transforms are viewed as two types of Nonlinear Discriminant Analysis (NLDA) methods for nonlinear dimensionality reduction of speech features, since high-dimensional features considerably increase computation costs and greatly restrict performance improvement. The first method (NLDA1) uses the final outputs of the network, together with Principal Component Analysis (PCA) processing, to obtain dimensionality-reduced features, while the second method (NLDA2) uses the middle layer outputs. NLDA2 achieved a very high phone accuracy of 75.0% on the TIMIT database, using a large number of network training iterations with state-specific targets.
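
The difference between the two methods is mainly where the reduced features are tapped from the network. The sketch below shows only that distinction: an NLDA1-style path takes the final outputs and reduces them with PCA, and an NLDA2-style path reads the outputs of a narrow middle layer directly. Everything specific here is an assumption for illustration: the layer sizes, the tanh nonlinearity, the untrained random weights standing in for a trained network, the random input frames, and the power-iteration PCA. None of it is taken from the dissertation.

```python
import math
import random


def random_layer(rng, n_in, n_out):
    """Placeholder weights; a real system would learn these by training."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(network, features):
    """Run one feature vector through the network; return every layer's output."""
    outputs, current = [], features
    for weights, biases in network:
        current = [math.tanh(sum(w * x for w, x in zip(row, current)) + b)
                   for row, b in zip(weights, biases)]
        outputs.append(current)
    return outputs


def pca_project(rows, num_components, iterations=200):
    """Project rows onto leading principal components (power iteration)."""
    dim, count = len(rows[0]), len(rows)
    mean = [sum(r[j] for r in rows) / count for j in range(dim)]
    centered = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / count for j in range(dim)]
           for i in range(dim)]
    components = []
    for _component in range(num_components):
        vec = [1.0 + 0.1 * j for j in range(dim)]
        for _step in range(iterations):
            nxt = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
            norm = math.sqrt(sum(v * v for v in nxt)) or 1.0
            vec = [v / norm for v in nxt]
        eigenvalue = sum(vec[i] * cov[i][j] * vec[j]
                         for i in range(dim) for j in range(dim))
        components.append(vec)
        # Deflate so the next pass finds the next component.
        cov = [[cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(dim)]
               for i in range(dim)]
    return [[sum(c[j] * comp[j] for j in range(dim)) for comp in components]
            for c in centered]


def nlda1(network, frames, num_components):
    """NLDA1-style features: final network outputs followed by PCA."""
    final_outputs = [forward(network, f)[-1] for f in frames]
    return pca_project(final_outputs, num_components)


def nlda2(network, frames, middle_index):
    """NLDA2-style features: outputs of the narrow middle layer."""
    return [forward(network, f)[middle_index] for f in frames]


rng = random.Random(0)
sizes = [12, 20, 4, 20, 8]  # input, hidden, middle, hidden, targets (all assumed)
network = [random_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]
frames = [[rng.gauss(0.0, 1.0) for _ in range(sizes[0])] for _ in range(50)]

reduced1 = nlda1(network, frames, num_components=4)
reduced2 = nlda2(network, frames, middle_index=1)
print(len(reduced1[0]), len(reduced2[0]))
```

In the first path the reduced dimensionality is set by the number of PCA components kept, whereas in the second it is fixed by the width of the middle layer.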

Indexing (document details)
Advisor: Zahorian, Stephen A.
Committee: Fowler, Mark; Li, Xiaohua; Yin, Lijun
School: State University of New York at Binghamton
Department: Electrical Engineering
School Location: United States -- New York
Source: DAI-B 71/11, Dissertation Abstracts International
Subjects: Electrical engineering
Keywords: Automatic speech recognition, Dimensionality reduction, Neural networks, Nonlinear discriminant analysis, Spectral-temporal features
Publication Number: 3423029
ISBN: 9781124233598
Copyright © 2019 ProQuest LLC. All rights reserved.