Dissertation/Thesis Abstract

New matching algorithm – Outlier First Matching (OFM) and its performance on Propensity Score Analysis (PSA) under new Stepwise Matching Framework (SMF)
by Sun, Yi, Ph.D., State University of New York at Albany, 2014, 101; 3633233
Abstract (Summary)

An observational study is an empirical investigation of a treatment effect when randomized experimentation is not ethical or feasible (Rosenbaum 2009). Observational studies are common in practice for the following reasons: a) randomization is not feasible for ethical or financial reasons; b) data are collected from surveys or other sources for which the objective and design of the study had not been determined in advance (e.g. a retrospective study using administrative records); c) little is known about the area under study, so preliminary studies of observational data are conducted to formulate hypotheses to be tested in subsequent experiments. When statistical analyses are based on observational studies, the following issues need to be considered: a) the lack of randomization may lead to selection bias; b) the sample may not be representative with respect to the problem under consideration (e.g. a study of factors influencing a rare disease that uses a survey which is nationally representative with respect to race, income, and gender but not with respect to the rare disease condition). We use the following example to illustrate the challenges of observational studies and possible mitigation measures.

Our example is based on the study by Lalonde (1986), which evaluated the impact of job training on the earnings of low-skilled workers in the 1970s (this data set is discussed in more detail in Paper 1, section 1.5.2). The treatment effect estimated from the observational study was quite different from the one obtained from the baseline randomized "National Supported Work (NSW) Experiment" carried out in the mid-1970s. Here the treatment effect is the impact of job training. Selection bias may contaminate the estimated treatment effect; in other words, workers who receive job training may be fundamentally different from those who do not. Furthermore, the control group that Lalonde selected for the observational study may not be representative of the control group in the original NSW experiment.

In this study, we address the lack of randomization by applying a new matching algorithm, Outlier First Matching (OFM), which can be used in conjunction with Propensity Score Analysis (PSA) or similar methods to obtain credible treatment effect estimates in observational studies.
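The abstract does not specify how OFM selects matches, so the sketch below does not attempt to reproduce it. It only illustrates the general idea of matching on propensity scores, using a conventional greedy 1:1 nearest-neighbor rule with a caliper. The function name `greedy_match`, the `caliper` parameter, and the choice to process treated subjects from highest to lowest score are all assumptions made for illustration, and the propensity scores are assumed to have been estimated beforehand.

```python
def greedy_match(treated, controls, caliper=0.1):
    """Greedy 1:1 nearest-neighbor matching on propensity scores.

    Hypothetical illustration only; this is NOT the OFM algorithm.
    treated, controls: dicts mapping subject id -> estimated propensity score.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(controls)  # controls not yet used (matching without replacement)
    pairs = []
    # Assumption: treated subjects are processed from highest to lowest score.
    for t_id, t_ps in sorted(treated.items(), key=lambda kv: -kv[1]):
        if not available:
            break
        # Closest remaining control on the propensity score scale.
        c_id = min(available, key=lambda c: abs(available[c] - t_ps))
        # Accept the match only if it falls within the caliper.
        if abs(available[c_id] - t_ps) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs


treated = {"t1": 0.80, "t2": 0.30}
controls = {"c1": 0.78, "c2": 0.35, "c3": 0.10}
print(greedy_match(treated, controls))
```

The order in which treated subjects are processed affects which controls remain available for later subjects, which is the kind of design choice a matching algorithm has to make explicit.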

This dissertation consists of three papers.

Paper 1 proposes a new "Stepwise Matching Framework (SMF)" and justifies its use in causal inference studies (especially PSA studies using observational data). Under the SMF, a new matching algorithm, Outlier First Matching (OFM), is introduced. Its performance is studied alongside that of other well-known matching algorithms using cross-sectional data.

Paper 2 extends the methods of Paper 1 to correlated data, especially longitudinal data. With correlated data, in addition to the selection bias present in cross-sectional observational data, repeated measures introduce between-subject and within-subject correlation. Repeated measures can also introduce missing values and rolling enrollment. All of these challenges complicate the data structure and need to be addressed with more complex models and methodology. Our methodology calculates the time-varying propensity score (p-score) of each control subject at each time point and computes the p-score difference between each control subject and every treatment subject at the treatment subject's time point. These p-score differences are then summarized to create the distance matrix used in the next step of the analysis. Once again, the performance of OFM and other well-established matching algorithms is compared side by side, and conclusions are drawn from simulation and real data applications.
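The distance matrix construction described above might be sketched as follows. This is a minimal illustration under stated assumptions: each treated subject has a single treatment time point and p-score, each control subject has a p-score at some set of time points, the distance is the absolute p-score difference, and a control with no p-score at the relevant time point is given infinite distance. The abstract does not say how the differences are summarized, so the function name `distance_matrix` and all of these choices are hypothetical.

```python
import math


def distance_matrix(treated, control_scores):
    """Build a treated-by-control distance matrix from time-varying p-scores.

    Hypothetical sketch; the summary rule used in the dissertation is not
    stated in the abstract.
    treated: list of (time_point, p_score), one entry per treated subject.
    control_scores: list of dicts {time_point: p_score}, one per control.
    Returns a list of rows, one row per treated subject.
    """
    matrix = []
    for t_time, t_ps in treated:
        row = []
        for scores in control_scores:
            # Control's p-score evaluated at the treated subject's time point.
            c_ps = scores.get(t_time)
            # Assumption: unobserved time points make the pair unmatchable.
            row.append(math.inf if c_ps is None else abs(t_ps - c_ps))
        matrix.append(row)
    return matrix


treated = [(2, 0.6), (3, 0.4)]
control_scores = [{1: 0.2, 2: 0.5, 3: 0.7}, {1: 0.3, 2: 0.65}]
for row in distance_matrix(treated, control_scores):
    print(row)
```

A matrix of this shape can be passed to any matching routine that accepts pairwise distances, which is what makes the side-by-side comparison of matching algorithms possible.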

Paper 3 handles the missing value problem in longitudinal data. As mentioned in Paper 2, the complex structure of longitudinal data often comes with missing data. Traditional imputation methods typically ignore between-subject and within-subject correlation, which may lead to biased or inefficient imputation of missing data. We adopt the multiple imputation (MI) strategy introduced by Schafer and Yucel (2002), implemented in the R package "pan", to account for both correlations. Each imputed complete data set is analyzed using methodology similar to that of Paper 2, and the MI results are then combined using Rubin's rules (1987). Conclusions are drawn from a simulation study and compared with the findings from the complete longitudinal data study in Paper 2.
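The pooling step can be illustrated with the standard form of Rubin's rules for a scalar estimate: the pooled estimate is the mean of the per-imputation estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance, where m is the number of imputations. The function name `pool_rubin` and the example numbers are hypothetical and are not results from the dissertation.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine m multiply-imputed results for a scalar quantity.

    estimates: point estimates, one per imputed data set.
    variances: squared standard errors, one per imputed data set.
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    q_bar = mean(estimates)   # pooled point estimate
    u_bar = mean(variances)   # average within-imputation variance
    b = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = u_bar + (1 + 1 / m) * b
    return q_bar, total


# Made-up treatment effect estimates from three imputed data sets.
pooled, total_var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total_var)
```

The (1 + 1/m) factor inflates the between-imputation component to account for using a finite number of imputations.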

In the last section, we conclude the dissertation with a discussion of the preliminary results as well as the strengths and limitations of the present research. We also point out directions for future study and offer suggestions for practice.

Indexing (document details)
Advisor: Yucel, Recai M., Pruzek, Robert M.
Committee: DiRienzo, Gregory A., Lu, Tao
School: State University of New York at Albany
Department: Biometry and Statistics
School Location: United States -- New York
Source: DAI-B 76/01(E), Dissertation Abstracts International
Source Type: DISSERTATION
Subjects: Biostatistics
Keywords: Causal inference, Generalized estimating equation, Longitudinal analysis, Matching, Missing data analysis, Propensity score analysis
Publication Number: 3633233
ISBN: 9781321128772