Dissertation/Thesis Abstract

A Comparative Study of IRT Models for Rater Effects and Double Scoring
by Song, Yoon Ah, Ph.D., The University of Iowa, 2019, 168; 22617490
Abstract (Summary)

In dealing with rater effects, double scoring is a popular method for controlling the quality of ratings on tests that include constructed-response (CR) items. Treating multiple ratings of the same response as independent observations violates the local independence assumption of item response theory (IRT). The typical way to fit standard IRT models to multiple ratings is to use a linear combination of the ratings, such as the sum or average, as the item score. However, these summed- or averaged-score approaches have limitations: they require adjusting the original item score categories, and the resulting item scores still contain rater effects. The purpose of this dissertation is to assess the effectiveness of using double ratings rather than single ratings in standard IRT models when rater effects are present, and to compare the performance of standard IRT models with newer IRT models for rater effects and multiple ratings, which are designed to remove rater effects from parameter estimation while preserving the original item score categories.
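
A worked illustration of the category adjustment (the notation here is mine, not the dissertation's): with two ratings per item,

    X_{j1},\, X_{j2} \in \{0, \dots, m\}
    \quad\Longrightarrow\quad
    X_j^{\mathrm{sum}} = X_{j1} + X_{j2} \in \{0, \dots, 2m\},

so an item originally scored in m + 1 categories must be refitted with 2m + 1 summed-score categories (a 0-4 item becomes a 0-8 item, for instance), and any rater severity embedded in the individual ratings carries over into the combined score.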

Two simulation studies examined the accuracy of the IRT models. The number of ratings and the choice of IRT model were the main factors: the number of ratings was either single or double, and the two IRT models were the generalized partial credit model (GPCM) and the hierarchical rater model (HRM), representing a standard IRT model and an IRT model for multiple ratings and rater effects, respectively. The HRM was used to generate ratings with rater effects, and both the GPCM and the HRM were then fitted to those ratings. All ratings were generated under combinations of the other study factors, including sample size, test length, rater effects, and number of score categories. Results were compared and interpreted relative to baseline conditions in which ratings were generated with no rater effects.
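
The data-generating logic can be sketched as follows. This is a minimal, hypothetical Python illustration of the design just described, not the dissertation's code: it assumes only numpy, uses the GPCM to produce an examinee's "ideal" category, and substitutes a crude severity-shift for the HRM's signal-detection rating process; all names and parameter values are illustrative.

    # Sketch of the rating-generation logic described above.
    # Assumptions (mine, not the dissertation's): numpy only; GPCM for
    # the ideal latent category; a simple severity-shift stands in for
    # the HRM's second-level rating process.
    import numpy as np

    rng = np.random.default_rng(2019)

    def gpcm_probs(theta, a, b):
        """GPCM category probabilities for one item.
        theta: (N,) abilities; a: discrimination; b: (m,) step parameters.
        Returns an (N, m+1) matrix of probabilities for categories 0..m."""
        steps = a * (theta[:, None] - b[None, :])           # (N, m)
        num = np.concatenate([np.zeros((theta.size, 1)),
                              np.cumsum(steps, axis=1)], axis=1)
        num = np.exp(num - num.max(axis=1, keepdims=True))  # stabilize
        return num / num.sum(axis=1, keepdims=True)

    def sample_categories(probs):
        """Draw one category per row from an (N, K) probability matrix."""
        cum = probs.cumsum(axis=1)
        u = rng.random((probs.shape[0], 1))
        return (u > cum).sum(axis=1)

    N, J, m = 500, 10, 3                     # examinees, items, max category
    theta = rng.normal(size=N)
    a = rng.uniform(0.8, 1.6, size=J)
    b = rng.normal(size=(J, m))

    ratings = np.empty((N, J, 2), dtype=int)  # two raters per item
    severity = np.array([0.0, 0.5])           # rater 2 is harsher
    for j in range(J):
        ideal = sample_categories(gpcm_probs(theta, a[j], np.sort(b[j])))
        for r, s in enumerate(severity):
            # Severity pushes some ratings down one category; a crude
            # stand-in for the HRM's discretized-normal rating process.
            drop = rng.random(N) < s * 0.4
            ratings[:, j, r] = np.clip(ideal - drop, 0, m)

    summed = ratings.sum(axis=2)              # item scores in 0..2m
    print("summed-score categories:", np.unique(summed))

Fitting the GPCM to the summed scores then requires treating each item as having 2m + 1 categories, which is exactly the category adjustment, with its residual rater effect, that the HRM is designed to avoid.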

The main findings of this dissertation were as follows: (1) using single ratings as item scores under rater effect conditions reduced the accuracy of proficiency estimation in the GPCM; (2) double scoring relieved the impact of rater effects on proficiency estimation and improved accuracy in the GPCM; (3) for double ratings, the HRM outperformed the GPCM fitted to summed item scores; (4) as more items and larger numbers of score categories were used, the accuracy of proficiency estimation generally improved.

Indexing (document details)
Advisors: LeBeau, Brandon C.; Lee, Won-Chan
Committee: Templin, Jonathan; Aloe, Ariel; Srivastava, Sanvesh
School: The University of Iowa
Department: Education
School Location: United States -- Iowa
Source: DAI-A 81/8(E), Dissertation Abstracts International
Source Type: DISSERTATION
Subjects: Educational tests & measurements
Keywords: Double scoring, Hierarchical rater model, Item response theory, Rater effects
Publication Number: 22617490
ISBN: 9781392495476