Dissertation/Thesis Abstract

Analysis of environmental sounds
by Lee, Keansub, Ph.D., Columbia University, 2009, 116; 3388467
Abstract (Summary)

Environmental sound archives (casual recordings of people's daily lives) are easily collected with MP3 players or camcorders at low cost and with high reliability, and are shared on websites. This thesis handles two kinds of user-generated recordings: continuous long-duration personal audio and soundtracks of short consumer video clips.

These environmental recordings contain much useful information (semantic concepts) related to activity, location, occasion and content. As a consequence, environmental archives present many new opportunities for the automatic extraction of information that can be used in intelligent browsing systems. This thesis proposes systems for detecting these concepts in collections of such real-world recordings.

The first system segments and labels personal audio archives (continuous recordings of an individual's everyday experiences) into 'episodes' (relatively consistent acoustic situations lasting a few minutes or more) using the Bayesian Information Criterion and spectral clustering.
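
As a rough illustration of how a Bayesian Information Criterion (BIC) test can place a segment boundary, the sketch below scores each candidate split of a one-dimensional feature track by comparing one Gaussian for the whole span against two Gaussians on either side of the split. It is a minimal sketch under stated assumptions, not the procedure used in the thesis: the scalar features, the helper names (`delta_bic`, `best_change_point`), the `margin` and `lam` parameters, and the synthetic data are all hypothetical, and the spectral clustering stage is not shown.

```python
import math
import random


def _log_var(xs):
    """Log of the maximum-likelihood variance of a list of floats."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return math.log(max(var, 1e-12))  # floor avoids log(0) on constant input


def delta_bic(xs, t, lam=1.0):
    """BIC gain from modelling xs[:t] and xs[t:] with separate 1-D Gaussians.

    Positive values favour a boundary at index t. `lam` is a hypothetical
    penalty weight; the two-segment model has 2 extra parameters
    (one mean and one variance).
    """
    n = len(xs)
    gain = 0.5 * (n * _log_var(xs)
                  - t * _log_var(xs[:t])
                  - (n - t) * _log_var(xs[t:]))
    penalty = lam * 0.5 * 2 * math.log(n)
    return gain - penalty


def best_change_point(xs, margin=5, lam=1.0):
    """Return (index, score) of the best boundary, or (None, score) if none is supported."""
    candidates = range(margin, len(xs) - margin)
    t = max(candidates, key=lambda i: delta_bic(xs, i, lam))
    score = delta_bic(xs, t, lam)
    return (t, score) if score > 0 else (None, score)


# Synthetic example (assumed data): the mean shifts at frame 100.
rng = random.Random(0)
signal = ([rng.gauss(0.0, 1.0) for _ in range(100)]
          + [rng.gauss(5.0, 1.0) for _ in range(100)])
t, score = best_change_point(signal)
print("estimated boundary:", t, "delta BIC:", round(score, 1))
```

In practice the features would be multi-dimensional and the test would be applied repeatedly over a sliding window; the scalar case is used here only to keep the arithmetic visible.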

The second system identifies regions of speech or music amid the energetic and highly variable noise present in these real-world recordings. Motivated by psychoacoustic evidence that pitch is crucial in the perception and organization of sound, we develop a noise-robust pitch detection algorithm to locate speech-like or music-like regions. To avoid false alarms caused by background noise with strong periodic components (such as air conditioning), a new scheme is added to suppress such noise in the autocorrelogram domain.
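
The basic idea of finding periodic (pitched) regions from autocorrelation can be sketched as follows. This is a plain single-band autocorrelation detector, not the noise-robust algorithm developed in the thesis, and it does not include the suppression of stationary periodic noise in the autocorrelogram. The function names, the 50-500 Hz search range, the 0.5 periodicity threshold, and the test signals are all assumptions chosen for illustration.

```python
import math
import random


def normalized_autocorr(frame, lag):
    """Autocorrelation of `frame` at `lag`, divided by the frame energy."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0
    acc = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
    return acc / energy


def estimate_pitch(frame, sample_rate, fmin=50.0, fmax=500.0, threshold=0.5):
    """Return (f0_in_hz, strength); f0 is None when the frame is not periodic enough."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag = max(range(min_lag, max_lag + 1),
                   key=lambda lag: normalized_autocorr(frame, lag))
    strength = normalized_autocorr(frame, best_lag)
    if strength < threshold:
        return None, strength
    return sample_rate / best_lag, strength


sample_rate = 8000
# Assumed test signals: a 200 Hz tone and white noise, 100 ms each.
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
rng = random.Random(1)
noise = [rng.gauss(0.0, 1.0) for _ in range(800)]

tone_f0, tone_strength = estimate_pitch(tone, sample_rate)
noise_f0, noise_strength = estimate_pitch(noise, sample_rate)
print("tone:", tone_f0, "noise:", noise_f0)
```

A detector this simple would also fire on a steady hum such as air conditioning, which is the false-alarm case the thesis addresses with its additional suppression scheme.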

The third system automatically detects a large set of interesting semantic concepts, which we chose for being both informative and useful to users as well as technically feasible. These 25 concepts are associated with people's activities, locations, occasions, objects, scenes and sounds, and are based on a large collection of consumer videos in conjunction with user studies. We model the soundtrack of each video, regardless of its original duration, as a fixed-size clip-level summary feature. For each concept, an SVM-based classifier is trained according to three distance measures (Kullback-Leibler, Bhattacharyya, and Mahalanobis distance).
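
To make the three distance measures concrete, the sketch below assumes that each clip is summarized by the mean and per-dimension variance of its feature frames (a diagonal Gaussian). The abstract does not specify the form of the summary feature, so this choice, the helper names, the symmetrised form of the Kullback-Leibler divergence, and the shared variance passed to the Mahalanobis distance are all assumptions. How such distances are turned into an SVM kernel is not shown.

```python
import math


def summarize(frames):
    """Fixed-size clip summary: per-dimension mean and variance of the frames."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / n for i in range(dim)]
    var = [max(sum((f[i] - mean[i]) ** 2 for f in frames) / n, 1e-9)
           for i in range(dim)]
    return mean, var


def kl_divergence(p, q):
    """KL(p || q) between two diagonal Gaussians given as (mean, var) pairs."""
    (mp, vp), (mq, vq) = p, q
    return 0.5 * sum(math.log(b / a) + (a + (m - k) ** 2) / b - 1.0
                     for m, a, k, b in zip(mp, vp, mq, vq))


def symmetric_kl(p, q):
    """Symmetrised KL divergence, usable as a distance-like score."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def bhattacharyya(p, q):
    """Bhattacharyya distance between two diagonal Gaussians."""
    (mp, vp), (mq, vq) = p, q
    total = 0.0
    for m, a, k, b in zip(mp, vp, mq, vq):
        avg = 0.5 * (a + b)
        total += 0.125 * (m - k) ** 2 / avg
        total += 0.5 * math.log(avg / math.sqrt(a * b))
    return total


def mahalanobis(p, q, shared_var):
    """Mahalanobis distance between clip means under a shared diagonal variance."""
    (mp, _), (mq, _) = p, q
    return math.sqrt(sum((m - k) ** 2 / v
                         for m, k, v in zip(mp, mq, shared_var)))


# Two tiny assumed clips with 2-dimensional frames.
clip_a = summarize([[0.0, 1.0], [2.0, 3.0]])
clip_b = summarize([[1.0, 1.0], [3.0, 3.0]])
d_kl = symmetric_kl(clip_a, clip_b)
d_bh = bhattacharyya(clip_a, clip_b)
d_ma = mahalanobis(clip_a, clip_b, shared_var=[1.0, 1.0])
print(d_kl, d_bh, d_ma)
```

Because every clip reduces to the same number of statistics, clips of very different durations can be compared directly, which is the point of a fixed-size summary.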

Detecting the time of occurrence of a local object (for instance, a cheering sound) embedded in a longer soundtrack is useful and important for applications such as search and retrieval in consumer video archives. Finally, we present a Markov-model-based clustering algorithm that identifies and segments consistent sets of temporal frames into regions associated with different ground-truth labels, while excluding a set of uninformative frames shared by all clips. The labels are provided at the clip level, so this refinement of the time axis represents a variant of Multiple-Instance Learning (MIL).

Quantitative evaluation shows that the proposed approaches, tested on 60 hours of personal audio archives or 1,900 YouTube video clips, perform significantly better than existing algorithms for detecting these useful concepts in real-world personal audio recordings.

Indexing (document details)
Advisor: Ellis, Daniel P. W.
School: Columbia University
School Location: United States -- New York
Source: DAI-B 70/12, Dissertation Abstracts International
Subjects: Electrical engineering, Acoustics
Keywords: Environmental sounds, Personal audio, Pitch detection, Semantic classification, Speech detection
Publication Number: 3388467
ISBN: 978-1-109-54817-4