Dinoflagellates are a diverse and ancient lineage of globally abundant algae that have adapted to fill a wide array of important ecological roles. Despite their importance, dinoflagellate genomes remain relatively poorly understood because of their enormous size. It is suspected that dinoflagellate genomes have expanded through rampant gene duplication, possibly using a lineage-specific mechanism that involves reinsertion of mature transcripts back into the genome, and that may rely on spliced leader trans-splicing for reactivation and processing of recycled transcripts. Draft genomes have recently been published for two extremely small endosymbiotic species. These genomes confirm expansion of nearly 10,000 gene families relative to other eukaryotes. In the more complete genome, evidence for transcript recycling based on relict spliced leader sequences was found in over 5,500 genes. Genomic efforts in larger dinoflagellates have focused instead on transcriptome sequencing, but transcriptomes assembled from short-read HTS data contain very little evidence for rampant gene duplication or for trans-splicing. I have shown that this apparent disagreement with hypotheses of ubiquitous trans-splicing and widespread gene duplication is the result of technological limitations. By leveraging the statistical power of high-throughput sequencing, I found that spliced leader suffixes as short as six nucleotides are sufficient for positive identification. I also found that isoform sequences from families of conserved paralogs are systematically collapsed into consensus sequences during assembly, but that many of these consensus sequences can be identified using a custom SNP-calling procedure. This procedure can be combined with traditional clustering based on pairwise sequence alignment to obtain a more complete picture of gene duplication in dinoflagellates.
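As an illustration of the spliced leader suffix idea described above, the following is a minimal sketch and not the procedure used in the dissertation. It assumes that a trans-spliced read begins with a suffix of the spliced leader, and it compares observed counts against a naive chance expectation under uniform base composition. The function names and the toy leader sequence `TOY_SL` are hypothetical placeholders, not the real dinoflagellate spliced leader.

```python
from collections import Counter

# Hypothetical placeholder leader, NOT the real dinoflagellate SL sequence.
TOY_SL = "ACGTTGCAAGCT"


def sl_suffix_match(read: str, sl: str, min_len: int = 6) -> int:
    """Return the length of the longest suffix of `sl` (at least `min_len`
    nucleotides) found at the start of `read`, or 0 if there is none."""
    for k in range(min(len(sl), len(read)), min_len - 1, -1):
        if read.startswith(sl[-k:]):
            return k
    return 0


def count_sl_reads(reads, sl, min_len=6):
    """Tally reads by the length of the SL suffix they start with."""
    counts = Counter()
    for read in reads:
        k = sl_suffix_match(read.upper(), sl.upper(), min_len)
        if k:
            counts[k] += 1
    return counts


def expected_by_chance(n_reads: int, k: int) -> float:
    """Expected number of reads starting with one specific k-mer by chance,
    assuming (as a simplification) uniform, independent base composition."""
    return n_reads * 0.25 ** k


if __name__ == "__main__":
    reads = ["CAAGCTGGGG", "TGCAAGCTAAAA", "AAGCTGGGG"]
    print(count_sl_reads(reads, TOY_SL))
    print(expected_by_chance(1_000_000, 6))
```

With millions of reads, a six-nucleotide suffix that appears at read starts far more often than this chance expectation would be consistent with genuine trans-splicing. A real analysis would need a background model that reflects actual base composition.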
Efficient, automated homology detection based on pairwise sequence alignment is an equally challenging problem for which there is much room for improvement. I explored alternative metrics for scoring alignments between sequences using a popular procedure based on BLAST and Markov clustering, and showed that simplified metrics perform as well as or better than more popular alternatives. I also found that Markov clustering of protein sequences suffers from a serious false positive problem when compared against manual curation, suggesting that it is more appropriate for pre-clustering of very large data sets than as a complete clustering solution.
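For readers unfamiliar with Markov clustering, the sketch below shows the core expansion and inflation loop on a small similarity graph. It is a simplified pure-Python illustration, not the software or settings used in the dissertation. The edge weights stand in for whatever alignment score is chosen (for example, a value derived from BLAST output), which is an assumption here, and the function names and default parameters are hypothetical.

```python
def normalize_columns(m):
    """Scale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product for small square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(adj, inflation=2.0, iterations=50, eps=1e-6):
    """Cluster a symmetric, non-negative similarity matrix `adj`."""
    n = len(adj)
    # Self-loops keep every column non-empty and stabilise the iteration.
    m = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)                                   # expansion
        m = [[x ** inflation for x in row] for row in m]   # inflation
        m = normalize_columns(m)
    # Rows that retain mass act as attractors; their non-zero columns
    # are the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > eps)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # Two unconnected pairs of sequences: 0-1 and 2-3.
    graph = [
        [0.0, 1.0, 0.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
        [0.0, 0.0, 1.0, 0.0],
    ]
    print(markov_cluster(graph))
```

Higher inflation values generally produce smaller, tighter clusters, so the choice of inflation and of edge weight both affect how many unrelated sequences end up grouped together.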
|Advisor:||Delwiche, Charles F.|
|Committee:||El-Sayed, Najib M.; Kingsford, Carl L.; Kocher, Thomas D.; Mount, Stephen M.|
|School:||University of Maryland, College Park|
|Department:||Cell Biology & Molecular Genetics|
|School Location:||United States -- Maryland|
|Source:||DAI-B 77/07(E), Dissertation Abstracts International|
|Keywords:||Clustering, Dinoflagellate, Illumina, Paralogy, Spliced leader, Transcriptomics|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved