Symbolic data plays a prominent role in a variety of fields. This is particularly true in modern biology where sequence information has become an indispensable tool. In contrast, most analysis techniques, including many machine learning algorithms, assume their input resides in a Euclidean space. Furthermore, it is not always clear how to embed symbolic data in such a space to most effectively leverage the power of these algorithms. Conventional approaches to embedding, including k-mer count and binary vectors in computational biology, tend to be high-dimensional and may ignore relevant structure. More sophisticated techniques, such as those based on neural networks like BioVec, usually require a substantial amount of data for learning and must be rerun from scratch when faced with information not present during training.
In this work, I describe efforts to address these issues from a novel perspective based on the graph-theoretic notion of metric dimension. A discrete space analogue of trilateration, metric dimension allows every vertex in a graph to be uniquely represented using geodesic distances. Unfortunately, finding minimal representations is NP-hard in general. To overcome this difficulty, particular attention is paid to three topics. First, the recursive structure of Hamming graphs, a simple graph family well suited to representing biological sequences, can be taken advantage of to produce low-dimensional embeddings for sequences of arbitrary but fixed length. When applied to the problem of classifying DNA 20-mers centered at intron-exon boundaries, such embeddings result in performance comparable to that of other state-of-the-art embedding techniques such as Node2Vec. Second, by focusing on the adjacency information of graphs generated according to the Stochastic Block Model (SBM), an ensemble exhibiting simple community structure, probabilistic bounds on metric dimension can be determined. These bounds form the basis of efficient, scalable algorithms for embedding large graphs conforming to the SBM. Finally, in some applications distance measurements lose precision as distance increases. Truncated metric dimension, a novel variation of the traditional definition, compensates for this uncertainty and improves vertex representations. This work represents first steps toward making metric dimension a practical approach for embedding large symbolic data sets in real space.
|Advisor:||Lladser, Manuel E|
|Commitee:||Frongillo, Rafael, Dowell, Robin, Larremore, Daniel, Layer, Ryan|
|School:||University of Colorado at Boulder|
|School Location:||United States -- Colorado|
|Source:||DAI 81/11(E), Dissertation Abstracts International|
|Subjects:||Computer science, Mathematics, Bioinformatics|
|Keywords:||Embeddings, Graph theory, Hamming graphs, Metric dimension, Multilateration, Stochastic block model|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved
The supplemental file or files you are about to download were provided to ProQuest by the author as part of a
dissertation or thesis. The supplemental files are provided "AS IS" without warranty. ProQuest is not responsible for the
content, format or impact on the supplemental file(s) on our system. in some cases, the file type may be unknown or
may be a .exe file. We recommend caution as you open such files.
Copyright of the original materials contained in the supplemental file is retained by the author and your access to the
supplemental files is subject to the ProQuest Terms and Conditions of use.
Depending on the size of the file(s) you are downloading, the system may take some time to download them. Please be