Dissertation/Thesis Abstract

Speech Segregation in Background Noise and Competing Speech
by Hu, Ke, Ph.D., The Ohio State University, 2012, 147; 10631003
Abstract (Summary)

In real-world listening environments, speech reaching our ears is often accompanied by acoustic interference such as environmental sounds, music, or another voice. Noise distorts speech and poses a substantial difficulty for many applications, including hearing aid design and automatic speech recognition. Monaural speech segregation refers to the problem of separating speech from a single recording and is widely regarded as challenging. Significant progress has been made in recent decades, but the challenge remains.

This dissertation addresses monaural speech segregation from different types of interference. First, we study unvoiced speech segregation, which has received less attention than voiced speech segregation, probably owing to its difficulty. We propose to utilize segregated voiced speech to assist unvoiced speech segregation. Specifically, we remove all periodic signals, including voiced speech, from the noisy input and then estimate the noise energy in unvoiced intervals using noise-dominant time-frequency units in neighboring voiced intervals. The estimated interference is used in a subtraction stage to extract unvoiced segments, which are then grouped by either simple thresholding or classification. We demonstrate that the proposed system performs substantially better than speech enhancement methods.
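The two key steps above can be illustrated with a minimal numpy sketch: estimating per-channel noise energy from noise-dominant time-frequency units in voiced intervals, then thresholding unvoiced units against that estimate. The array shapes, the binary speech-dominance mask, and the threshold parameter `theta` are illustrative assumptions, not the dissertation's actual formulation.

```python
import numpy as np

def estimate_noise(cochleagram, speech_mask, voiced_frames):
    """Estimate per-channel noise energy by averaging the energy of
    noise-dominant (speech_mask == 0) T-F units in voiced frames.
    cochleagram: (channels, frames) energy array (assumed layout)."""
    noise = np.zeros(cochleagram.shape[0])
    for c in range(cochleagram.shape[0]):
        vals = [cochleagram[c, t] for t in voiced_frames if speech_mask[c, t] == 0]
        noise[c] = np.mean(vals) if vals else 0.0
    return noise

def unvoiced_mask(cochleagram, noise, unvoiced_frames, theta=2.0):
    """After subtracting the noise estimate, keep a unit in an unvoiced
    frame if its energy exceeds theta times the channel noise energy
    (the 'simple thresholding' grouping option)."""
    mask = np.zeros(cochleagram.shape, dtype=bool)
    for t in unvoiced_frames:
        mask[:, t] = cochleagram[:, t] > theta * noise
    return mask
```

In practice the noise estimate would come only from voiced intervals neighboring each unvoiced interval; the sketch pools all voiced frames for brevity.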

Interference can be a nonspeech signal or another voice. Cochannel speech refers to a mixture of two speech signals. Cochannel speech separation is often addressed by model-based methods, which assume known speaker identities and pretrained speaker models. To address this speaker-dependency limitation, we propose an unsupervised approach to cochannel speech separation. We employ a tandem algorithm to perform simultaneous grouping of speech and develop an unsupervised clustering method to group simultaneous streams across time. The proposed clustering objective measures the speaker difference of each hypothesized grouping and incorporates pitch constraints. For unvoiced speech, we employ an onset/offset-based analysis for segmentation and then divide the segments into unvoiced-voiced and unvoiced-unvoiced portions for separation. We show that this method achieves considerable SNR gains over a range of input SNR conditions and, despite its unsupervised nature, performs competitively with model-based and speaker-independent methods.
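The clustering idea — score every hypothesized two-way grouping of simultaneous streams by how different the two resulting speakers look, and pick the best — can be sketched as an exhaustive search. The per-stream feature vectors, the centroid-distance score, and the `pitch_ok` constraint hook are stand-ins for the dissertation's actual objective and pitch constraints.

```python
import itertools
import numpy as np

def cluster_streams(features, pitch_ok=None):
    """Exhaustively search two-speaker groupings of simultaneous streams.
    features: list of per-stream feature vectors (e.g., mean spectra;
    an illustrative choice). pitch_ok: optional predicate rejecting
    assignments that violate pitch constraints (hypothetical hook).
    Score = distance between the two group centroids; maximize it,
    i.e., prefer the grouping with the largest speaker difference."""
    n = len(features)
    best, best_score = None, -np.inf
    for bits in itertools.product([0, 1], repeat=n - 1):
        assign = (0,) + bits  # fix stream 0 to group 0 (break symmetry)
        if pitch_ok is not None and not pitch_ok(assign):
            continue
        g0 = [features[i] for i, a in enumerate(assign) if a == 0]
        g1 = [features[i] for i, a in enumerate(assign) if a == 1]
        if not g1:  # require both speakers to be present
            continue
        score = np.linalg.norm(np.mean(g0, axis=0) - np.mean(g1, axis=0))
        if score > best_score:
            best, best_score = assign, score
    return best
```

Exhaustive enumeration is exponential in the number of streams; it is shown here only to make the objective concrete.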

In cochannel speech separation, speaker identities are sometimes known and clean utterances of each speaker are readily available. We can thus describe the speakers with models to assist separation. One issue in model-based cochannel speech separation is generalization to different signal levels. We propose an iterative algorithm that jointly separates the speech signals and estimates the input SNR. We employ hidden Markov models to describe speaker acoustic characteristics and temporal dynamics. Initially, we use unadapted speaker models to segregate the two speech signals and then use the segregated signals to estimate the input SNR. The estimated SNR is in turn used to adapt the speaker models for re-estimating the speech signals. The two steps iterate until convergence. Systematic evaluations show that our iterative method improves segregation performance significantly and converges relatively quickly. Compared with related model-based methods, it is computationally simpler and performs better under a number of input SNR conditions, in terms of both SNR gains and hit minus false-alarm rates.
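The alternating structure of this algorithm — separate with the current gain-adapted models, re-estimate the input SNR from the separated signals, adapt, and repeat until the SNR estimate stabilizes — can be sketched as a generic loop. The `separate` callable stands in for the HMM-based separation stage, and the convergence tolerance is an assumed parameter; neither is from the dissertation itself.

```python
import numpy as np

def estimate_snr(speech1, speech2):
    """Input SNR (dB) estimated from the energies of the two
    separated signal estimates."""
    return 10 * np.log10(np.sum(speech1 ** 2) / np.sum(speech2 ** 2))

def iterative_separation(mixture, separate, max_iter=10, tol=0.1):
    """Alternate separation and SNR estimation until convergence.
    separate(mixture, snr_db) is a stand-in for HMM-based separation
    with speaker models gain-adapted to the current SNR estimate."""
    snr = 0.0  # start from unadapted (equal-gain) speaker models
    for _ in range(max_iter):
        s1, s2 = separate(mixture, snr)
        new_snr = estimate_snr(s1, s2)
        if abs(new_snr - snr) < tol:  # SNR estimate has stabilized
            return s1, s2, new_snr
        snr = new_snr
    return s1, s2, snr
```

With a well-behaved separation stage, each pass refines both the signal estimates and the SNR estimate, which is why the loop can converge in few iterations.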

Indexing (document details)
Advisor: Wang, DeLiang
Committee: Belkin, Mikhail; Fosler-Lussier, Eric; Wang, DeLiang
School: The Ohio State University
Department: Computer Science and Engineering
School Location: United States -- Ohio
Source: DAI-B 78/11(E), Dissertation Abstracts International
Subjects: Communication, Computer science
Keywords: CASA, Cochannel speech separation, Monaural speech separation, Nonspeech interference, Unsupervised clustering, Unvoiced speech
Publication Number: 10631003
ISBN: 978-0-355-01314-6