With PQDT Open, you can read the full text of open access dissertations and theses free of charge.
About PQDT Open
Search
Symbolic data plays a prominent role in a variety of fields. This is particularly true in modern biology where sequence information has become an indispensable tool. In contrast, most analysis techniques, including many machine learning algorithms, assume their input resides in a Euclidean space. Furthermore, it is not always clear how to embed symbolic data in such a space to most effectively leverage the power of these algorithms. Conventional approaches to embedding, including k-mer count and binary vectors in computational biology, tend to be high-dimensional and may ignore relevant structure. More sophisticated techniques, such as those based on neural networks like BioVec, usually require a substantial amount of data for learning and must be rerun from scratch when faced with information not present during training.
In this work, I describe efforts to address these issues from a novel perspective based on the graph-theoretic notion of metric dimension. A discrete space analogue of trilateration, metric dimension allows every vertex in a graph to be uniquely represented using geodesic distances. Unfortunately, finding minimal representations is NP-hard in general. To overcome this difficulty, particular attention is paid to three topics. First, the recursive structure of Hamming graphs, a simple graph family well suited to representing biological sequences, can be taken advantage of to produce low-dimensional embeddings for sequences of arbitrary but fixed length. When applied to the problem of classifying DNA 20-mers centered at intron-exon boundaries, such embeddings result in performance comparable to that of other state-of-the-art embedding techniques such as Node2Vec. Second, by focusing on the adjacency information of graphs generated according to the Stochastic Block Model (SBM), an ensemble exhibiting simple community structure, probabilistic bounds on metric dimension can be determined. These bounds form the basis of efficient, scalable algorithms for embedding large graphs conforming to the SBM. Finally, in some applications distance measurements lose precision as distance increases. Truncated metric dimension, a novel variation of the traditional definition, compensates for this uncertainty and improves vertex representations. This work represents first steps toward making metric dimension a practical approach for embedding large symbolic data sets in real space.
Advisor: | Lladser, Manuel E |
Commitee: | Frongillo, Rafael, Dowell, Robin, Larremore, Daniel, Layer, Ryan |
School: | University of Colorado at Boulder |
Department: | Computer Science |
School Location: | United States -- Colorado |
Source: | DAI 81/11(E), Dissertation Abstracts International |
Source Type: | DISSERTATION |
Subjects: | Computer science, Mathematics, Bioinformatics |
Keywords: | Embeddings, Graph theory, Hamming graphs, Metric dimension, Multilateration, Stochastic block model |
Publication Number: | 27833617 |
ISBN: | 9798645452360 |