This thesis focuses on high-dimensional variable selection and addresses the limitations of existing penalized likelihood-based prediction models, as well as multiple hypothesis testing issues in jump detection. In the first project, we proposed a weighted sparse network learning method that allows users to first estimate a sparse, data-driven network. The estimated network is then optimally combined, through a weighted approach, with a known or partially known network structure. We adapted ℓ1 penalties and proved the oracle property of the proposed model, which aims to improve the accuracy of parameter estimation and to achieve a parsimonious model in the high-dimensional setting. We further implemented a stability selection method for tuning the parameters and compared its performance with that of the cross-validation approach. We implemented the proposed framework for several generalized linear models, including the Gaussian, logistic, and Cox proportional hazards (partial likelihood) models. We carried out extensive Monte Carlo simulations and compared the performance of the proposed model with that of existing methods. Results showed that, in the absence of prior information for constructing a known network, our approach, using the data-driven estimated network structure, significantly improved on the elastic net models. When the prior network was correctly specified in advance, our prediction model significantly outperformed the other methods. Results further showed that the proposed method is robust to network misspecification and that the ℓ1 penalty improves prediction and variable selection regardless of the magnitude of the effect sizes. We also found that the stability selection method gave more robust parameter tuning results than the cross-validation approach for all three phenotypes (continuous, binary, and survival) considered in our simulation studies.
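The weighted combination of a known and a data-driven network described above can be illustrated with a minimal sketch. The abstract does not give the exact form of the combination or of the penalty, so everything here is an assumption made for illustration: a convex combination of two adjacency matrices with a hypothetical weight `weight`, a graph-Laplacian smoothness term, and an ℓ1 term. The function names are hypothetical and are not taken from the thesis.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def combine_networks(known: Matrix, estimated: Matrix, weight: float) -> Matrix:
    """Assumed form: weight * known + (1 - weight) * estimated, entry by entry."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [
        [weight * k + (1.0 - weight) * e for k, e in zip(k_row, e_row)]
        for k_row, e_row in zip(known, estimated)
    ]


def laplacian(adjacency: Matrix) -> Matrix:
    """Unnormalized graph Laplacian L = D - A for a symmetric adjacency matrix."""
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    return [
        [(degrees[i] if i == j else 0.0) - adjacency[i][j] for j in range(n)]
        for i in range(n)
    ]


def network_penalty(beta: Sequence[float], lap: Matrix) -> float:
    """Quadratic form beta^T L beta; small when linked coefficients are similar."""
    n = len(beta)
    return sum(beta[i] * lap[i][j] * beta[j] for i in range(n) for j in range(n))


def l1_penalty(beta: Sequence[float]) -> float:
    """Sum of absolute coefficients, the sparsity-inducing term."""
    return sum(abs(b) for b in beta)


# Toy example with three variables (all numbers are made up).
known = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
estimated = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.0], [0.5, 0.0, 0.0]]
combined = combine_networks(known, estimated, weight=0.5)
beta = [1.0, 1.0, 0.0]
penalty = network_penalty(beta, laplacian(combined)) + l1_penalty(beta)
```

In this toy example the first two coefficients are equal and strongly linked, so the network term only penalizes the weaker link between the first and third variables. In a full model these penalty terms would be added to a negative log-likelihood (Gaussian, logistic, or Cox partial likelihood) with tuning parameters chosen by stability selection or cross-validation.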
Case studies on proteomic ovarian cancer data and gene expression data for skin cutaneous melanoma further demonstrated that the proposed model achieved good operating characteristics in predicting response to platinum-based chemotherapy and survival risk. We further extended our work on statistical predictive learning to nonlinear prediction, where traditional generalized linear models are insufficient. Nonlinear methods such as kernel methods are powerful tools for mapping a nonlinear space to a linear space, and they can be easily incorporated into generalized linear models. This thesis demonstrates how to apply multiple kernel tricks to generalized linear models. Simulation results show that the proposed multiple kernel learning method can successfully identify nonlinear likelihood functions under various scenarios.
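As an illustration of the multiple kernel idea mentioned above, the sketch below builds a Gram matrix from a non-negative weighted sum of base kernels. This is an assumed, generic construction rather than the thesis's method: the choice of a linear and a Gaussian kernel, the fixed weights, and the function names are all hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def linear_kernel(x: Vector, z: Vector) -> float:
    """Inner product of the two inputs."""
    return sum(a * b for a, b in zip(x, z))


def gaussian_kernel(bandwidth: float) -> Kernel:
    """Return a Gaussian (RBF) kernel with the given bandwidth."""
    def kernel(x: Vector, z: Vector) -> float:
        squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
        return math.exp(-squared_distance / (2.0 * bandwidth ** 2))
    return kernel


def combined_gram(
    data: Sequence[Vector], kernels: Sequence[Kernel], weights: Sequence[float]
) -> List[List[float]]:
    """Gram matrix of the weighted sum of kernels; weights must be non-negative."""
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    if any(w < 0.0 for w in weights):
        raise ValueError("weights must be non-negative")
    return [
        [sum(w * k(x, z) for w, k in zip(weights, kernels)) for z in data]
        for x in data
    ]


# Toy data: two one-dimensional observations (made-up values).
data = [[0.0], [1.0]]
gram = combined_gram(data, [linear_kernel, gaussian_kernel(1.0)], [0.5, 0.5])
```

A non-negative weighted sum of valid kernels is itself a valid kernel, so the resulting Gram matrix can stand in for the linear predictor's design in a generalized linear model. In practice the weights would be learned from the data rather than fixed.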
The second project concerns jump detection in high-frequency financial data. Nonparametric tests are popular and efficient methods for detecting jumps in such data. Each method has its own advantages and disadvantages, and its performance can be affected by the underlying noise and dynamic structure. To address this, we proposed a robust p-value pooling method that aims to combine the advantages of the individual methods. We focused on model validation within a Monte Carlo framework to assess reproducibility and the false discovery rate. Reproducibility was analyzed across replicates via correspondence curves and the irreproducible discovery rate, in order to study local dependency and robustness. Extensive simulation studies of high-frequency trading data at the minute level were carried out, and the operating characteristics of the methods were compared within the false discovery rate (FDR) control framework. The proposed method was robust across all scenarios under both the reproducibility and the FDR analyses. Finally, we applied the method to minute-level data from the Limit Order Book System: the Efficient Reconstruction System (LOBSTER). An R package, JumpTest, implementing these methods is available on the Comprehensive R Archive Network (CRAN).
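The general workflow of pooling p-values from several tests and then controlling the FDR can be sketched as follows. The abstract does not state which pooling rule the thesis uses, so this sketch substitutes Fisher's method, which assumes independent tests and is therefore only a stand-in for a robust pooling rule; the FDR step is the standard Benjamini-Hochberg procedure. The function names are hypothetical and do not correspond to the JumpTest package interface.

```python
import math
from typing import List, Sequence


def fisher_pool(p_values: Sequence[float]) -> float:
    """Pool p-values with Fisher's method (stand-in rule; assumes independence)."""
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    # Fisher statistic X = -2 * sum(log p) follows chi-square with 2k degrees
    # of freedom under the null; half_stat is X / 2.
    half_stat = -sum(math.log(p) for p in p_values)
    # Closed-form survival function for even degrees of freedom:
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return a rejection flag per hypothesis under the Benjamini-Hochberg rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Made-up p-values: rows are time intervals, columns are three jump tests.
per_interval = [
    [0.001, 0.004, 0.020],
    [0.300, 0.450, 0.600],
    [0.040, 0.010, 0.030],
]
pooled = [fisher_pool(row) for row in per_interval]
jump_flags = benjamini_hochberg(pooled, alpha=0.05)
```

Because jump tests applied to the same price series are generally dependent, Fisher's independence assumption is unlikely to hold in this setting, which is one reason a robust pooling rule is of interest. The sketch only shows where such a rule would sit relative to the FDR step.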
|Advisor:||Kuan, Pei-Fen; Zhu, Wei|
|Committee:||Clouston, Sean; Wang, Xuefeng|
|School:||State University of New York at Stony Brook|
|Department:||Applied Mathematics and Statistics|
|School Location:||United States -- New York|
|Source:||DAI-B 80/08(E), Dissertation Abstracts International|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved