Environmental sound archives - casual recordings of people's daily life - are easily collected by MPS players or camcorders with low cost and high reliability, and shared in the web-sites. There are two kinds of user generated recordings we would like to be able to handle in this thesis: Continuous long-duration personal audio and Soundtracks of short consumer video clips.
These environmental recordings contain a lot of useful information (semantic concepts) related with activity, location, occasion and content. As a consequence, the environment archives present many new opportunities for the automatic extraction of information that can be used in intelligent browsing systems. This thesis proposes systems for detecting these interesting concepts on a collection of these real-world recordings.
The first system is to segment and label personal audio archives - continuous recordings of an individual's everyday experiences - into 'episodes' (relatively consistent acoustic situations lasting a few minutes or more) using the Bayesian Information Criterion and spectral clustering.
The second system is for identifying regions of speech or music in the kinds of energetic and highly-variable noise present in this real-world sound. Motivated by psychoacoustic evidence that pitch is crucial in the perception and organization of sound, we develop a noise-robust pitch detection algorithm to locate speech or music-like regions. To avoid false alarms resulting from background noise with strong periodic components (such as air-conditioning), a new scheme is added in order to suppress these noises in the domain of autocorrelogram.
In addition, the third system is to automatically detect a large set of interesting semantic concepts; which we chose for being both informative and useful to users, as well as being technically feasible. These 25 concepts are associated with people's activities, locations, occasions, objects, scenes and sounds, and are based on a large collection of consumer videos in conjunction with user studies. We model the soundtrack of each video, regardless of its original duration, as a fixed-sized clip-level summary feature. For each concept, an SVM-based classifier is trained according to three distance measures (Kullback-Leibler, Bhattacharyya, and Mahalanobis distance).
Detecting the time of occurrence of a local object (for instance, a cheering sound) embedded in a longer soundtrack is useful and important for applications such as search and retrieval in consumer video archives. We finally present a Markov-model based clustering algorithm able to identify and segment consistent sets of temporal frames into regions associated with different ground-truth labels, and at the same time to exclude a set of uninformative frames shared in common from all clips. The labels are provided at the clip level, so this refinement of the time axis represents a variant of Multiple-Instance Learning (MIL).
Quantitative evaluation shows that the performance of our proposed approaches tested on the 60h personal audio archives or 1900 YouTube video clips is significantly better than existing algorithms for detecting these useful concepts in real-world personal audio recordings.
|Advisor:||Ellis, Daniel P. W.|
|School Location:||United States -- New York|
|Source:||DAI-B 70/12, Dissertation Abstracts International|
|Subjects:||Electrical engineering, Acoustics|
|Keywords:||Environmental sounds, Personal audio, Pitch detection, Semantic classification, Speech detection|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved
The supplemental file or files you are about to download were provided to ProQuest by the author as part of a
dissertation or thesis. The supplemental files are provided "AS IS" without warranty. ProQuest is not responsible for the
content, format or impact on the supplemental file(s) on our system. in some cases, the file type may be unknown or
may be a .exe file. We recommend caution as you open such files.
Copyright of the original materials contained in the supplemental file is retained by the author and your access to the
supplemental files is subject to the ProQuest Terms and Conditions of use.
Depending on the size of the file(s) you are downloading, the system may take some time to download them. Please be