Dissertation/Thesis Abstract

Classification accuracy of logistic regression models under misspecifications
by Zhang, Enhan, Ph.D., New York University, 2014, 137; 3635317
Abstract (Summary)

This dissertation investigates the impact of model misspecification on the discriminatory accuracy of regression-based classifiers and on the added value of new biomarkers. The dissertation is organized into three parts. Part one is the background section, which provides notation and develops the theoretical foundation for subsequent sections. As background, we introduce two widely studied measures of the improvement in classification performance between two nested models: the incremental AUC and the net reclassification index (NRI). Their characteristics and the corresponding statistical tests are investigated. General and explicit formulas are derived for the AUC, the incremental AUC, and the NRI, assuming multivariate normal covariates. These formulas hold both with and without misspecification of the log odds ratio parameters. Part two considers model misspecification in the form of incorrect log odds ratios in logistic regression models. We also consider an extended case in which the population of interest is a mixture of several subpopulations, so that no "true" coefficients can be obtained from simple logistic regression models based on only one subpopulation. The equivalence of the null hypotheses of the approach that tests predictor significance and the approach that tests model performance is assessed accordingly. In part three, we investigate the situation where the outcome data (e.g., disease status) are classified inaccurately because of an imperfect diagnostic test (lack of a gold standard). In this case, a method is introduced to obtain consistent estimates of the coefficients (log odds ratios) that are not biased by the inaccurate outcome data. In addition, we propose estimates of the AUC and incremental AUC to assess the difference in predictive ability between two nested regression models.
Numerical studies are conducted to demonstrate the validity of the derived conclusions and the proposed methods in both simulations and real data applications.
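
As a rough illustration of the kind of closed-form expression the abstract refers to, the sketch below uses the standard binormal result, which may differ from the formulas actually derived in the dissertation. It assumes the covariates are multivariate normal in cases and in controls with a common covariance matrix, in which case the AUC of a linear score beta'X is Phi(beta'delta / sqrt(2 beta'Sigma beta)), where delta is the case-control mean difference. The function names, the two-marker setting, and all numeric values are hypothetical and chosen only for illustration.

```python
import math
import random


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def binormal_auc(beta, delta, sigma):
    """AUC of the score beta'X when X | case ~ N(mu1, sigma) and
    X | control ~ N(mu0, sigma), with delta = mu1 - mu0.

    beta need not equal the true log odds ratios, so the same
    expression also covers a misspecified coefficient vector.
    """
    p = len(beta)
    mean_gap = sum(b * d for b, d in zip(beta, delta))
    score_var = sum(beta[i] * sigma[i][j] * beta[j]
                    for i in range(p) for j in range(p))
    return normal_cdf(mean_gap / math.sqrt(2.0 * score_var))


def simulated_auc(beta, delta, n=500, seed=0):
    """Empirical (Mann-Whitney) AUC, assuming independent unit-variance
    covariates with control mean 0 and case mean delta."""
    rng = random.Random(seed)
    controls = [sum(b * rng.gauss(0.0, 1.0) for b in beta) for _ in range(n)]
    cases = [sum(b * rng.gauss(d, 1.0) for b, d in zip(beta, delta))
             for _ in range(n)]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (n * n)


# Hypothetical setting: two independent unit-variance markers.
delta = [1.0, 0.5]
sigma = [[1.0, 0.0], [0.0, 1.0]]

auc_full = binormal_auc([1.0, 0.5], delta, sigma)     # beta = sigma^{-1} delta
auc_reduced = binormal_auc([1.0, 0.0], delta, sigma)  # second marker dropped
auc_wrong = binormal_auc([1.0, -0.5], delta, sigma)   # wrong sign on marker 2

print("full model AUC:         ", round(auc_full, 4))
print("reduced model AUC:      ", round(auc_reduced, 4))
print("incremental AUC:        ", round(auc_full - auc_reduced, 4))
print("misspecified model AUC: ", round(auc_wrong, 4))
print("simulated full AUC:     ", round(simulated_auc([1.0, 0.5], delta), 4))
```

In this toy setting the larger model with a wrongly signed coefficient discriminates worse than the correctly specified reduced model, which is the kind of effect a study of misspecification would examine. The simulation is only a sanity check on the closed form under the stated independence assumption.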

Two skin cancer data sets are investigated in this thesis. The first is a case-control subset of the population-based Minnesota Skin Health Study (SHS). This subset study evaluated the associations between cutaneous melanoma risk, melanocortin-1 receptor (MC1R) polymorphisms, and outdoor and indoor ultraviolet (UV) light exposure, adjusted for potential risk factors such as age, gender, skin color, eye color, etc. The second data set comes from a prospectively collected database, that of the Interdisciplinary Melanoma Cooperative Group (IMCG), covering the period 2002-2009. Based on this data set, melanoma recurrence within the cohort period was shown to be statistically associated with tumor ulceration status, AJCC stage, the logarithm of tumor thickness, gender, and tumor site. Both data sets have been studied using logistic regression analysis to form multivariate risk prediction/classification models. The potential influences of logistic regression model misspecification and inaccurate outcome ascertainment are illustrated by applications to these two cancer data sets. (Abstract shortened by UMI.)

Indexing (document details)
Advisor: Shao, Yongzhao
Committee: Fang, Yixin, Goldberg, Judith D., Jin, Zhezhen, Liu, Mengling
School: New York University
Department: Environmental Health Science
School Location: United States -- New York
Source: DAI-B 76/01(E), Dissertation Abstracts International
Subjects: Biostatistics
Keywords: Classification evaluation, Logistic regression, Model misspecification, Nested models
Publication Number: 3635317
ISBN: 9781321163056
Copyright © 2019 ProQuest LLC. All rights reserved.