Multiple indices have been proposed to measure the amount of agreement between the ratings of two or more judges on a multi-item measure. Unfortunately, simulation work evaluating these indices is lacking, so we have little understanding of what should be expected of them and under what conditions they work. The present investigation seeks to bridge this gap in the literature by comparing several of the more commonly used measures of interrater agreement via an Item Response Theory (IRT) model. The goal is to identify which agreement indices best recover true agreement.
In this manuscript, several agreement indices are compared. Among these are the kappa coefficient κ_m (Fleiss, 1971); the intraclass correlation, ICC(2,1) (Shrout & Fleiss, 1979); several variants of the r_WG(J) index (James, Demaree, & Wolf, 1984; Lindell, Brandt, & Whitney, 1999); a measure of agreement for ordinal data (Stine, 1989); and an index derived from a Latent Trait Model (Terry, 2000). Results identify two measures of agreement that consistently recover true agreement. Implications and extensions to the measurement of agreement in multiple contexts are addressed.
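For readers unfamiliar with these indices, the sketch below illustrates how two of them are commonly computed from raw ratings. It is a minimal illustration, not the manuscript's own code: the function names are hypothetical, equal numbers of raters per subject are assumed for κ, and r_WG(J) is shown with the uniform (rectangular) null distribution, which is only one of the null variances considered in the literature.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for a subjects-by-categories count matrix, where
    counts[i, k] is the number of raters placing subject i in category k
    (Fleiss, 1971). Assumes the same number of raters for every subject."""
    n_subjects = counts.shape[0]
    n_raters = counts.sum(axis=1)[0]
    # Proportion of all assignments falling in each category
    p_j = counts.sum(axis=0) / (n_subjects * n_raters)
    # Observed agreement for each subject, then averaged
    P_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar = P_i.mean()
    # Chance agreement
    P_e = (p_j ** 2).sum()
    return (P_bar - P_e) / (1 - P_e)

def r_wg_j(ratings, n_options):
    """Multi-item r_WG(J) (James, Demaree, & Wolf, 1984) for a
    raters-by-items matrix of ratings on an A-point scale, using the
    uniform null variance sigma_E^2 = (A^2 - 1) / 12."""
    J = ratings.shape[1]
    mean_item_var = ratings.var(axis=0, ddof=1).mean()
    sigma_e_sq = (n_options ** 2 - 1) / 12.0
    ratio = mean_item_var / sigma_e_sq
    return (J * (1 - ratio)) / (J * (1 - ratio) + ratio)

# Example: four raters score a six-item measure on a 1-5 scale
ratings = np.array([[4, 5, 4, 4, 5, 4],
                    [4, 4, 4, 5, 5, 4],
                    [5, 4, 4, 4, 4, 5],
                    [4, 4, 5, 4, 5, 4]])
print(r_wg_j(ratings, n_options=5))
```

The ICC(2,1) of Shrout and Fleiss (1979) and the ordinal and IRT-based indices compared in the manuscript require additional modeling machinery (two-way ANOVA mean squares or latent trait estimation) and are not sketched here.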