Introduction to Item Response Theory
All measurements inherently harbor errors. In fields like education and psychology, where numerous factors influence human responses, these errors become especially pronounced. Responses to items in an intelligence test, for instance, do not directly capture a person's intelligence. Rather, they reflect the products of a person utilizing their intelligence on the items. The responses are dominated by two factors – a person’s “intelligence” and item characteristics. Measurement, in this context, involves making inferences about the intelligence based on responses, introducing the potential for errors.
Since we cannot eliminate measurement error entirely, we need theories to conceptualize and handle it. Fortunately, decades of work have produced many psychometric theories and tools for this purpose. In this blog post, I introduce item response theory (IRT), one of the most widely used approaches for analyzing psychological measurements. It has proven to be a powerful tool for examining item quality, constructing person scores from multiple items, developing measures, and more.
Understanding IRT Models
Traditional approaches to measurement often sum or average the item scores of a measurement. An important assumption of these approaches is that the measurement errors of individual items cancel each other out in the summation. This is the view of Classical Test Theory (CTT). Despite its apparent simplicity, CTT has limitations that cannot be overlooked: it assumes the same error variance for all respondents, it assumes that measurement errors have an expected value of zero, it focuses exclusively on total scores, its item characteristics (such as difficulty) depend on the respondent group, and it does not examine item or person fit.
Item Response Theory (IRT), on the other hand, shifts the focus from total scores to responses for each item and further utilizes the item characteristics. The core idea of IRT is to describe how attributes (e.g., ability, attitude, personality) and item characteristics contribute to the probability of giving a response. Equation (1) below gives the simplest IRT model for binary responses, called the Rasch Model:
P(ypi = 1) = exp(θp - bi) / [1 + exp(θp - bi)]    (1)
Don’t be intimidated by this equation! Let’s break it down step by step.
- On the left-hand side, P(ypi = 1) states that we are modeling the probability that person p gets one point on item i.
- The right-hand side states that two factors determine this probability: person p’s attribute (denoted as θp) and item i’s difficulty (denoted as bi).
- exp is the exponential function. It appears because the model uses the logistic form, as in logistic regression, to transform the difference θp - bi into a probability.
In reality, a measurement almost always contains more than one item. Figure 1 shows five items under the Rasch model. The x-axis corresponds to θ, and the y-axis gives the probability of getting one point. The S-shaped curves in different colors represent different items.
So where are the b values? Note the horizontal line where the probability equals 0.5. Each item's b sits where its curve crosses this line. If we draw a vertical line from that crossing point down to the x-axis, the value at which it meets the x-axis is the item's b.
If we go back to Equation (1), we find that when θp = bi, θp - bi = 0, so exp(θp - bi) = exp(0) = 1. Equation (1) then becomes 1/(1+1), which is a probability of 0.50. This matches our intuition from the visualization.
Figure 1. Item characteristic curves in Rasch models
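For readers who prefer code to curves, here is a minimal Python sketch of the Rasch response function described above, where the probability of scoring one point is exp(θ - b) / (1 + exp(θ - b)). The function name `rasch_probability` and the five difficulty values are my own illustrative assumptions; they are not the values behind Figure 1.

```python
import math


def rasch_probability(theta: float, b: float) -> float:
    """Probability of scoring 1 under the Rasch model.

    Uses 1 / (1 + exp(-(theta - b))), which is algebraically the same as
    exp(theta - b) / (1 + exp(theta - b)).
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# Hypothetical difficulties for five items (illustrative values only).
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]

# When the attribute equals the difficulty, the probability is 0.5.
for b in difficulties:
    p = rasch_probability(theta=b, b=b)
    print(f"b = {b:+.1f}: P(y = 1 | theta = b) = {p:.2f}")

# One item characteristic curve evaluated on a coarse grid of theta values.
grid = [-3, -2, -1, 0, 1, 2, 3]
curve = [rasch_probability(theta, b=0.0) for theta in grid]
print([round(p, 3) for p in curve])
```

Every curve crosses 0.5 exactly at its own b, and the values along the grid rise with θ, which is the S-shape seen in the figure.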
IRT is not confined to binary responses. Many IRT models have been developed to accommodate polytomous responses, nominal categories, Likert scales, and other formats. Furthermore, person and item covariates can be incorporated into IRT models to investigate the factors that contribute to person attributes and item parameters.
Assumptions Underlying IRT
IRT rests on three assumptions: unidimensionality, monotonicity, and local independence. We can make sense of the first two using Figure 1.
- Unidimensionality: Only one dimension of a person’s attribute (latent trait) dominates the response probability. In Figure 1, there is only one axis for θ, and the other axis is for probability. If a person’s attribute had two dimensions, Figure 1 would need to be three-dimensional.
- Monotonicity: The probability of a higher-scoring response increases as the attribute increases. In Figure 1, as the value of θ increases, the probability goes up. This is an intuitive assumption in measurement, because we generally expect respondents with higher levels of the measured attribute to give higher-scoring responses.
- Local Independence: After controlling for the attribute, the correlation between responses to any two items disappears. In other words, the association among a person’s responses to different items is due solely to the attribute being measured.
Importantly, violations of these assumptions have motivated specialized extensions of IRT. First, when unidimensionality does not hold, multidimensional IRT models can be used. Second, ideal point models do not require monotonicity. Third, testlet models add a testlet effect to account for local dependence among items that share a common context.
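Local independence is the hardest of the three assumptions to see in Figure 1, so a small simulation may help. The sketch below generates responses to two Rasch items and compares their correlation across all simulated persons with their correlation among persons who share almost the same θ. The difficulties, sample size, seed, and the width of the θ band are illustrative assumptions of mine.

```python
import math
import random


def rasch_probability(theta, b):
    """Probability of scoring 1 under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def correlation(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(2023)   # fixed seed so the sketch is reproducible
b1, b2 = -0.5, 0.5          # hypothetical item difficulties
n_persons = 50_000

thetas, y1, y2 = [], [], []
for _ in range(n_persons):
    theta = rng.gauss(0.0, 1.0)  # assumed standard normal attribute
    thetas.append(theta)
    # Given theta, the two responses are drawn independently.
    y1.append(1 if rng.random() < rasch_probability(theta, b1) else 0)
    y2.append(1 if rng.random() < rasch_probability(theta, b2) else 0)

# Across everyone, the items correlate because both depend on theta.
overall_r = correlation(y1, y2)

# Among persons with nearly identical theta, the correlation should vanish.
band = [i for i, t in enumerate(thetas) if abs(t) < 0.1]
band_r = correlation([y1[i] for i in band], [y2[i] for i in band])

print(f"correlation across all persons:     {overall_r:.3f}")
print(f"correlation with theta held near 0: {band_r:.3f}")
```

The first correlation should be clearly positive and the second close to zero, up to sampling noise. With real data, a sizeable correlation that remains after conditioning on the attribute is the kind of pattern that would suggest local dependence.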
Advantages and Applications of IRT
IRT offers many advantages over Classical Test Theory:
- Precision and Granularity: IRT dissects responses at the item level via various item response functions, providing a granular understanding of how each question performs. This precision allows for pinpointing specific areas of strength and weakness within a test, offering valuable insights into individual respondents and items and reducing measurement errors (e.g., Xue & Chen, 2023).
- Separation of Person Attribute and Item Characteristics: The separation of θp and bi is considered an important advantage because, in IRT, item difficulties are properties of the items themselves and do not depend on the particular respondents. This differs from CTT, where item difficulties depend on the group of respondents. The separation enables a wide range of measurement possibilities. Moreover, the difference θp - bi implies that θp and bi are on the same scale, which offers a useful perspective for measurement development.
- Adaptability to Diverse Response Patterns: IRT's flexibility enables the analysis of diverse response patterns, accommodating various question formats such as multiple-choice, Likert scale, and open-ended questions. This adaptability supports a comprehensive assessment that captures the intricacies of human cognition and behavior.
- Individualized/Adaptive Assessment: Unlike in traditional methods, measurement error in IRT is a function of both the person's attribute level and the item parameters. For example, items with difficulties closer to a person's attribute level provide more information about that person's attribute than other items do. Assessments can therefore be tailored to respondents’ attribute levels to reduce measurement errors.
- Comprehensive Item Analysis: IRT can be used to conduct in-depth item analysis, evaluating parameters such as item discrimination, difficulty, and guessing. This scrutiny provides detailed information for item usage and improvement.
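Two of the points above lend themselves to a short Python sketch. First, the discrimination, difficulty, and guessing parameters appear together in the standard three-parameter logistic (3PL) response function, which reduces to the Rasch model when discrimination is 1 and guessing is 0. Second, under the Rasch model the information an item provides at a given θ is P(1 - P), which peaks when θ equals b; this is why adaptive tests favor items near the respondent's level. The item bank, the provisional θ, and all function names are hypothetical.

```python
import math


def prob_3pl(theta, a=1.0, b=0.0, c=0.0):
    """3PL response probability.

    a: discrimination, b: difficulty, c: guessing (lower asymptote).
    With a = 1 and c = 0 this is the Rasch response function.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def rasch_information(theta, b):
    """Item information under the Rasch model: P * (1 - P)."""
    p = prob_3pl(theta, b=b)
    return p * (1.0 - p)


def most_informative_item(theta, bank):
    """Name of the item in `bank` with the largest information at theta."""
    return max(bank, key=lambda name: rasch_information(theta, bank[name]))


# Hypothetical item bank: item name -> Rasch difficulty.
item_bank = {"item_1": -1.5, "item_2": -0.5, "item_3": 0.4, "item_4": 1.2}

# Hypothetical provisional estimate of a respondent's attribute.
current_theta = 0.5
best = most_informative_item(current_theta, item_bank)

for name, b in item_bank.items():
    info = rasch_information(current_theta, b)
    print(f"{name}: b = {b:+.1f}, information = {info:.3f}")
print("next item to administer:", best)
```

A real adaptive test would also re-estimate θ after each response and apply content and exposure constraints; the sketch only shows the selection step.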
Applications of IRT are widespread in the development and construction of measures. Wilson (2023) provides a series of examples. Readers may also have encountered IRT indirectly if they have taken standardized tests such as the TOEFL, SAT, ACT, or GRE. Beyond education and psychology, examples can be found in medical, health, social, and clinical research.
IRT in Practice: Software Solutions
Implementing Item Response Theory (IRT) in real-world scenarios demands robust and user-friendly software solutions. Both commercial and open-source options cater to the diverse needs of researchers and practitioners, offering sophisticated tools for IRT analyses.
Commercial software includes:
- ConQuest: A powerful tool specializing in Rasch family models and their extensions. ConQuest provides an extensive range of functionalities, ensuring precise analyses of complex data structures.
- flexMIRT: This software stands out for its capability to handle multilevel, multidimensional, and multiple-group IRT analyses. Its flexibility makes it ideal for intricate item analyses and accurate test scoring.
- WINSTEPS: A classical program tailored for Rasch models, WINSTEPS offers comprehensive item and person parameter estimation. Its user-friendly interface makes it accessible for both beginners and seasoned researchers.
There are also some R packages that provide free access to IRT analysis:
- TAM: Can be seen as an R alternative to ConQuest, focusing on Rasch family models and their extensions, e.g., many-facet models. It also provides many useful item and person fit functions.
- ltm: Enables analyses of multivariate binary and polytomous responses via IRT. For binary responses, the Rasch, Two-Parameter Logistic, and Birnbaum's Three-Parameter models are available; for polytomous data, Samejima's Graded Response model is implemented.
- IRTShiny: Provides an interactive interface for IRT analysis.
- mirt: A versatile R package offering an array of IRT models, including Rasch models, 2PL, 3PL, Nominal Response models, and multidimensional models. It also provides functions for item and person analysis.
It is worth noting that IRT packages in R are not limited to the four mentioned above. Readers who are interested in IRT analysis in R can find more of them on the psychometrics pages of the r-project website, which offer detailed information on R packages for psychometric research. These packages support tasks such as test construction, assessment validation, and the analysis of psychological measurements.
In psychological measurement, IRT is a powerful tool for analyzing and developing measures. It moves beyond the constraints of classical test theory by modeling how person attributes and item characteristics together determine the probability of each item response.
In this post, we have explored IRT models, their assumptions, their advantages, and software solutions. The aim is to provide a basic understanding for readers who are interested in IRT but not yet familiar with it. For more advanced use and understanding, readers can turn to IRT handbooks and the many resources available online. Numerous opportunities remain for harnessing the potential of IRT models in surveys and measurements.
Wilson, M. (2023). Constructing Measures: An Item Response Modeling Approach. Routledge.
Xue, M., & Chen, Y. (2023). A Stan tutorial on Bayesian IRTree models: Conventional models and explanatory extension. Behavior Research Methods, 1-21.