Dissertation/Thesis Abstract

Low-Dimensional Embeddings for Symbolic Data Science
by Tillquist, Richard Carter, Ph.D., University of Colorado at Boulder, 2020, 102; 27833617
Abstract (Summary)

Symbolic data plays a prominent role in a variety of fields. This is particularly true in modern biology where sequence information has become an indispensable tool. In contrast, most analysis techniques, including many machine learning algorithms, assume their input resides in a Euclidean space. Furthermore, it is not always clear how to embed symbolic data in such a space to most effectively leverage the power of these algorithms. Conventional approaches to embedding, including k-mer count and binary vectors in computational biology, tend to be high-dimensional and may ignore relevant structure. More sophisticated techniques, such as those based on neural networks like BioVec, usually require a substantial amount of data for learning and must be rerun from scratch when faced with information not present during training.

In this work, I describe efforts to address these issues from a novel perspective based on the graph-theoretic notion of metric dimension. A discrete space analogue of trilateration, metric dimension allows every vertex in a graph to be uniquely represented using geodesic distances. Unfortunately, finding minimal representations is NP-hard in general. To overcome this difficulty, particular attention is paid to three topics. First, the recursive structure of Hamming graphs, a simple graph family well suited to representing biological sequences, can be taken advantage of to produce low-dimensional embeddings for sequences of arbitrary but fixed length. When applied to the problem of classifying DNA 20-mers centered at intron-exon boundaries, such embeddings result in performance comparable to that of other state-of-the-art embedding techniques such as Node2Vec. Second, by focusing on the adjacency information of graphs generated according to the Stochastic Block Model (SBM), an ensemble exhibiting simple community structure, probabilistic bounds on metric dimension can be determined. These bounds form the basis of efficient, scalable algorithms for embedding large graphs conforming to the SBM. Finally, in some applications distance measurements lose precision as distance increases. Truncated metric dimension, a novel variation of the traditional definition, compensates for this uncertainty and improves vertex representations. This work represents first steps toward making metric dimension a practical approach for embedding large symbolic data sets in real space.

Indexing (document details)
Advisor: Lladser, Manuel E
Commitee: Frongillo, Rafael, Dowell, Robin, Larremore, Daniel, Layer, Ryan
School: University of Colorado at Boulder
Department: Computer Science
School Location: United States -- Colorado
Source: DAI 81/11(E), Dissertation Abstracts International
Source Type: DISSERTATION
Subjects: Computer science, Mathematics, Bioinformatics
Keywords: Embeddings, Graph theory, Hamming graphs, Metric dimension, Multilateration, Stochastic block model
Publication Number: 27833617
ISBN: 9798645452360
Copyright © 2020 ProQuest LLC. All rights reserved. Terms and Conditions Privacy Policy Cookie Policy
ProQuest