Evaluating human scoring using generalizability theory

21 Jul 2020

By Yaw Bimpeh, William Pointer, Ben Smith, Elizabeth Harrison

Abstract

Many high-stakes examinations in the United Kingdom (UK) use both constructed-response items and selected-response items. We need to evaluate the inter-rater reliability for constructed-response items that are scored by humans. While there are a variety of methods for evaluating rater consistency across ratings in the psychometric literature, we apply generalizability theory (G theory) to data from routine monitoring of ratings to derive an estimate for inter-rater reliability.
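For orientation, the textbook one-facet crossed model (candidates by raters) that G theory starts from can be written as below. This is the standard formulation, not necessarily the exact one used in the paper, and the symbols (p for candidate, r for rater, n_r' for the number of raters per candidate in the intended scoring procedure) are notation assumed here for illustration.

```latex
% Observed score of candidate p from rater r, decomposed into effects
X_{pr} = \mu + \nu_p + \nu_r + \nu_{pr,e}

% Corresponding decomposition of observed-score variance
\sigma^2(X_{pr}) = \sigma^2_p + \sigma^2_r + \sigma^2_{pr,e}

% Generalizability coefficient (relative decisions)
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pr,e} / n_r'}

% Index of dependability (absolute decisions)
\Phi = \frac{\sigma^2_p}{\sigma^2_p + \left(\sigma^2_r + \sigma^2_{pr,e}\right) / n_r'}
```

With a single rater per response (n_r' = 1), either coefficient can be read as an inter-rater reliability estimate for that item: the share of observed-score variance attributable to real differences between candidates rather than to raters.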

UK examinations use a combination of double and multiple rating for routine monitoring, creating a more complex design in which raters are cross-paired and overlap across different groups of candidates or items. This sampling design is neither fully crossed nor nested. Each double- or multiple-scored item is sampled from a different set of candidates, and the number of sampled candidates per item varies.

Therefore, the standard G theory method, in any of its various forms for estimating inter-rater reliability, cannot be applied directly to the operational data. We propose a method that takes double or multiple rating data as given and analyzes the datasets at the item level in order to obtain more accurate and stable variance component estimates. We adapt the estimation of variance components in observed scores to an unbalanced one-facet crossed design with some missing observations. These estimates can be used to make inferences about the reliability of the entire scoring process. We illustrate the proposed method by applying it to real scoring data.
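To make the idea of item-level variance components concrete, here is a minimal sketch of the balanced, complete-data special case: every sampled candidate's response to one item is scored by the same set of raters. It uses the classical expected-mean-squares (ANOVA) estimators for a one-facet crossed design. It is not the authors' adaptation for unbalanced data with missing observations, and the function names and example marks are hypothetical.

```python
from statistics import mean


def variance_components(scores):
    """Estimate variance components for one item.

    scores[p][r] is the mark given to candidate p by rater r.
    Assumes a complete, balanced grid (no missing cells), which is a
    simplification of the operational design described in the abstract.
    """
    n_p, n_r = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    cand_means = [mean(row) for row in scores]
    rater_means = [mean(row[r] for row in scores) for r in range(n_r)]

    # Mean squares for candidates, raters and the residual (interaction + error)
    ms_p = n_r * sum((m - grand) ** 2 for m in cand_means) / (n_p - 1)
    ms_r = n_p * sum((m - grand) ** 2 for m in rater_means) / (n_r - 1)
    ss_res = sum(
        (scores[p][r] - cand_means[p] - rater_means[r] + grand) ** 2
        for p in range(n_p)
        for r in range(n_r)
    )
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Solve the expected-mean-square equations; truncate negatives at zero
    return {
        "candidate": max((ms_p - ms_res) / n_r, 0.0),
        "rater": max((ms_r - ms_res) / n_p, 0.0),
        "residual": ms_res,
    }


def reliability(vc, n_raters=1):
    """Return (generalizability coefficient, dependability index)."""
    rel_err = vc["residual"] / n_raters
    abs_err = (vc["rater"] + vc["residual"]) / n_raters
    g_coef = vc["candidate"] / (vc["candidate"] + rel_err)
    phi = vc["candidate"] / (vc["candidate"] + abs_err)
    return g_coef, phi


if __name__ == "__main__":
    # Hypothetical double-marked item: 5 candidates, 2 raters
    item_scores = [[4, 5], [2, 2], [6, 5], [3, 4], [1, 1]]
    vc = variance_components(item_scores)
    print(vc)
    print(reliability(vc, n_raters=1))
```

Setting `n_raters=1` corresponds to operational scoring in which each response is marked once. Handling items where different candidates are seen by different, partially overlapping raters requires estimators for unbalanced designs, which is the gap the paper addresses.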

How to cite

Bimpeh, Y., Pointer, W., Smith, B. A., & Harrison, L. (2020). Evaluating Human Scoring Using Generalizability Theory. Applied Measurement in Education, 33(3), 198–209. https://doi.org/10.1080/08957347.2020.1750403

Keywords

Reliability
