The core focus of this work is the design of generalized, data-driven distance measures for both ordered and unordered set data. An ordered set is one in which the time-linear property is maintained when the distance between a pair of temporal segments is computed. One application on ordered sets is human gesture analysis from RGBD data. Human gestures are fast becoming a natural form of human-computer interaction, which motivates the modeling, analysis, and recognition of gestures. The large number of gesture categories, such as sign language, traffic signals, and everyday actions, together with subtle cultural variations in gesture classes, makes gesture recognition a challenging problem. As part of the generalization to unordered sets, an algorithm is also proposed for an overlap speech detection application.
Any gesture recognition task involves comparing an incoming (query) gesture against a training set of gestures. Having only one or a few samples per class rules out approaches that learn class statistics, because the full range of variation is not covered. Because of the large variability in gesture classes, temporally segmenting individual gestures is also hard. A matching algorithm in such scenarios must handle single-sample classes and be able to label multiple gestures without prior temporal segmentation.
Each gesture sequence is considered a class, and each class is a data point in an input space. The pattern of pairwise distances between two gesture frame sequences, conditioned on a third (anchor) sequence, is referred to as a warp vector, and distances computed in this way are called conditional distances. At the algorithmic core are two dynamic time warping processes: one computes the warp vectors with respect to the anchor sequences, and the other compares these warp vectors. We show that a class-dependent distance function can disambiguate classification when samples of different classes lie close to each other. When the model base is large (that is, the number of classes is large), the disadvantage of such a distance is its computational cost. A distributed version combined with sub-sampling of anchor gestures is proposed as a speedup strategy. To label multiple connected gestures in a query, we use a simultaneous segmentation and recognition matching algorithm called the level building algorithm, in its dynamic programming implementation. The core of this algorithm is a distance function that compares two gesture sequences. We propose replacing this distance function with the proposed conditional distances; the resulting version is called conditional level building (clb). We present results on a large dataset of 8000 RGBD sequences spanning over 200 gesture classes, extracted from the ChaLearn Gesture Challenge dataset. The results show that the conditional distance gives a significant improvement over the underlying distance from which it is computed.
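The two-stage use of dynamic time warping described above can be illustrated with a minimal Python sketch. The abstract does not give the exact definition of a warp vector, so the sketch makes a loud assumption: the warp vector of a sequence with respect to an anchor is taken to be, for each anchor frame, the mean frame-to-frame distance along the optimal alignment path. The function names `dtw`, `warp_vector`, and `conditional_distance` are hypothetical, and the Euclidean frame distance is an illustrative choice, not necessarily the one used in the dissertation.

```python
import numpy as np


def _as_frames(x):
    """Coerce input to a (num_frames, feature_dim) float array."""
    x = np.asarray(x, dtype=float)
    return x[:, None] if x.ndim == 1 else x


def dtw(a, b):
    """Classic DTW with Euclidean frame distance.

    Returns (total cost, alignment path, local cost matrix).
    """
    a, b = _as_frames(a), _as_frames(b)
    n, m = len(a), len(b)
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = local[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    # Backtrack from the end to recover the optimal alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return acc[n, m], path, local


def warp_vector(anchor, seq):
    """ASSUMED definition: per-anchor-frame mean alignment cost."""
    _, path, local = dtw(anchor, seq)
    sums = np.zeros(local.shape[0])
    counts = np.zeros(local.shape[0])
    for i, j in path:
        sums[i] += local[i, j]
        counts[i] += 1
    return sums / counts  # every anchor frame lies on the path at least once


def conditional_distance(x, y, anchor):
    """First DTW stage builds warp vectors; second DTW stage compares them."""
    cost, _, _ = dtw(warp_vector(anchor, x), warp_vector(anchor, y))
    return cost
```

Under these assumptions, both warp vectors have the length of the anchor, so the second DTW stage compares two one-dimensional sequences of equal length. Each anchor requires its own pair of first-stage alignments, which is consistent with the computational cost noted above for large model bases.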
As an application to unordered sets and non-visual data, an overlap speech segment detection algorithm is proposed. Speech recognition systems have a wide variety of applications but fail when overlapping speech is involved, especially in a meeting-room setting. The ability to recognize a speaker and localize him/her in the room is an important step toward a higher-level representation of the meeting dynamics. As in gesture recognition, a new distance function is defined; it serves as the core of the algorithm and distinguishes temporal segments of individual speech from those of overlap speech. The overlap speech detection problem is framed as an outlier detection problem. Incoming audio is broken into temporal segments based on the Bayesian Information Criterion (BIC). Each segment is treated as a node, and the conditional distances between nodes are computed. The underlying distance for the triples used in the conditional distances is the symmetric KL distance. As each node is modeled as a Gaussian, the distance between two segments (nodes) is given by a Monte Carlo estimate of the KL distance. An MDS-based global embedding is created from the pairwise distances between the nodes, and RANSAC is applied to identify the outliers. Experiments on overlap speech detection use the NIST meeting room data set. The conditional-distance-based approach achieves an improvement of more than 20% over a KL-distance-based approach.
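The underlying symmetric KL distance between two Gaussian-modeled segments can be sketched as follows. This is a minimal Python illustration under stated assumptions: the symmetric form is taken to be KL(p||q) + KL(q||p) (some authors include a factor of 1/2), the function names are hypothetical, and the closed-form Gaussian KL is included only as a sanity check on the Monte Carlo estimate; it is not claimed to be part of the dissertation's method.

```python
import numpy as np


def gaussian_logpdf(x, mean, cov):
    """Log density of N(mean, cov) evaluated at each row of x."""
    d = mean.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)


def kl_monte_carlo(mean_p, cov_p, mean_q, cov_q, n_samples, rng):
    """Estimate KL(p || q) by averaging log p(x) - log q(x) over x ~ p."""
    x = rng.multivariate_normal(mean_p, cov_p, size=n_samples)
    log_ratio = gaussian_logpdf(x, mean_p, cov_p) - gaussian_logpdf(x, mean_q, cov_q)
    return float(np.mean(log_ratio))


def symmetric_kl_monte_carlo(mean_p, cov_p, mean_q, cov_q, n_samples, rng):
    """ASSUMED symmetric form: KL(p || q) + KL(q || p)."""
    return kl_monte_carlo(mean_p, cov_p, mean_q, cov_q, n_samples, rng) + kl_monte_carlo(
        mean_q, cov_q, mean_p, cov_p, n_samples, rng
    )


def kl_closed_form(mean_p, cov_p, mean_q, cov_q):
    """Exact KL(p || q) for two Gaussians, used here only as a check."""
    d = mean_p.shape[0]
    inv_q = np.linalg.inv(cov_q)
    diff = mean_q - mean_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (
        np.trace(inv_q @ cov_p) + diff @ inv_q @ diff - d + logdet_q - logdet_p
    )
```

In a pipeline like the one described, each BIC-derived segment would contribute a mean and covariance estimated from its feature frames, and the resulting pairwise distances would feed the conditional distance and the MDS embedding. The Monte Carlo form is sketched because it is the estimator named in the text; it also extends to models without a closed-form KL.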
|Committee:||Kasturi, Rangachar; Raij, Andrew; Sanocki, Thomas; Sun, Yu|
|School:||University of South Florida|
|Department:||Computer Science and Engineering|
|School Location:||United States -- Florida|
|Source:||DAI-B 76/04(E), Dissertation Abstracts International|
|Keywords:||Conditional distance, Distance measure, Gesture recognition, Level building, One-shot, Warp vector|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved