This dissertation investigates the impact of model misspecification on the discriminatory accuracy of regression-based classifiers and on the added value of new biomarkers. The dissertation is organized into three parts. Part one is the background section, which provides notation and develops the theoretical foundation for subsequent sections. As background, we introduce two widely studied measures of the improvement in classification performance between two nested models: the incremental area under the receiver operating characteristic curve (incremental AUC) and the net reclassification index (NRI). Their characteristics and the corresponding statistical tests are investigated. General and explicit formulas are derived for the AUC, the incremental AUC, and the NRI, assuming multivariate normal covariates. These formulas hold both with and without misspecification of the log odds ratio parameters. Part two considers model misspecification in the form of incorrect log odds ratios in logistic regression models. We also consider an extended case in which the population of interest is a mixture of several subpopulations, so that no "true" coefficients can be obtained from simple logistic regression models based on only one subpopulation. The equivalence of the null hypotheses of the approach that tests predictor significance and the approach that tests model performance is assessed accordingly. In part three, we investigate the situation where the outcome data (e.g., disease status) are classified inaccurately because of an imperfect diagnostic test (lack of a gold standard). In this case, a particular method is introduced to obtain consistent estimates of the coefficients (log odds ratios) that are not biased by the inaccurate outcome data. In addition, we propose estimates of the AUC and incremental AUC to assess the difference in predictive ability between two nested regression models.
Numerical studies are conducted to demonstrate the validity of the derived conclusions and the proposed methods in both simulations and real data applications.
Two skin cancer data sets are investigated in this thesis. One data set is a case-control subset of the population-based Minnesota Skin Health Study (SHS). This subset study evaluated the associations between cutaneous melanoma risk, melanocortin 1 receptor (MC1R) polymorphisms, and outdoor and indoor ultraviolet (UV) light exposure, adjusted for potential risk factors such as age, gender, skin color, and eye color. The second real-world data set comes from a prospectively collected database, the Interdisciplinary Melanoma Cooperative Group (IMCG), covering the period 2002 to 2009. Based on this data set, melanoma recurrence within the cohort period was shown to be statistically associated with tumor ulceration status, AJCC stage, the logarithm of tumor thickness, gender, and tumor site. Both data sets have been studied using logistic regression analysis to form multivariate risk prediction/classification models. The potential influences of logistic regression model misspecification and inaccurate outcome ascertainment are illustrated by applications to these two cancer data sets. (Abstract shortened by UMI.)
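The idea of a closed-form AUC under normal covariates, with and without misspecified coefficients, can be illustrated with a small sketch. It is not the dissertation's derivation. It uses the textbook binormal result under strong simplifying assumptions: cases are N(delta, I), controls are N(0, I), and the covariance is the identity, so the true log odds ratios equal delta. The values in `delta`, the weight vectors, and all function names are hypothetical and chosen only for illustration.

```python
import bisect
import math
import random


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def linear_score_auc(weights, mean_diff):
    """AUC of the score w'x when cases ~ N(mean_diff, I), controls ~ N(0, I).

    The case-minus-control score difference is N(w'd, 2 w'w), so
    AUC = Phi(w'd / sqrt(2 w'w)). Identity covariance is an assumption
    made here to keep the sketch short.
    """
    wd = sum(w * d for w, d in zip(weights, mean_diff))
    ww = sum(w * w for w in weights)
    return normal_cdf(wd / math.sqrt(2.0 * ww))


def empirical_auc(case_scores, control_scores):
    """Mann-Whitney estimate of P(case > control), counting ties as 1/2."""
    ordered = sorted(control_scores)
    total = 0.0
    for s in case_scores:
        below = bisect.bisect_left(ordered, s)
        ties = bisect.bisect_right(ordered, s) - below
        total += below + 0.5 * ties
    return total / (len(case_scores) * len(control_scores))


def simulate(mean_diff, n, rng):
    """Draw n cases and n controls with independent unit-variance covariates."""
    cases = [[rng.gauss(d, 1.0) for d in mean_diff] for _ in range(n)]
    controls = [[rng.gauss(0.0, 1.0) for _ in mean_diff] for _ in range(n)]
    return cases, controls


def score(rows, weights):
    """Linear predictor w'x for each row of covariates."""
    return [sum(w * x for w, x in zip(weights, row)) for row in rows]


delta = (1.0, 0.5)      # hypothetical mean shifts between cases and controls
full_w = delta          # correctly specified: log odds ratios equal delta here
reduced_w = (1.0, 0.0)  # nested model that drops the second covariate
wrong_w = (1.0, -0.5)   # misspecified sign on the second coefficient

auc_full = linear_score_auc(full_w, delta)
auc_reduced = linear_score_auc(reduced_w, delta)
auc_wrong = linear_score_auc(wrong_w, delta)
incremental_auc = auc_full - auc_reduced

# Monte Carlo check of the closed-form values with a fixed seed.
rng = random.Random(0)
cases, controls = simulate(delta, 2000, rng)
emp_full = empirical_auc(score(cases, full_w), score(controls, full_w))
emp_reduced = empirical_auc(score(cases, reduced_w), score(controls, reduced_w))

print("closed-form AUC (full, reduced, misspecified):",
      round(auc_full, 4), round(auc_reduced, 4), round(auc_wrong, 4))
print("incremental AUC:", round(incremental_auc, 4))
print("empirical AUC (full, reduced):", round(emp_full, 4), round(emp_reduced, 4))
```

In this toy setting the misspecified full model discriminates worse than the correctly specified reduced model, which is the kind of effect a comparison of nested models under misspecification has to account for.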
|Committee:||Fang, Yixin, Goldberg, Judith D., Jin, Zhezhen, Liu, Mengling|
|School:||New York University|
|Department:||Environmental Health Science|
|School Location:||United States -- New York|
|Source:||DAI-B 76/01(E), Dissertation Abstracts International|
|Keywords:||Classification evaluation, Logistic regression, Model misspecification, Nested models|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved