Recently, I was thinking about how to improve the accuracy of assessment tests for ESL learners and so I googled and found Computerized Adaptive Testing (CAT).
During the process, I stumbled upon the interesting theory behind it: Item Response Theory, or IRT for short.
So I’ve spent some time reading up on it and, in the process, picked up a few very useful bits about statistical hypothesis testing, which I’m very glad to have learned.
Below, I share the most important ideas about IRT that I’ve learned.
Before IRT: The Classical approach
We’ve all taken tests whose scores are something like 7/10: getting 7 correct answers over a total of 10 questions in the test.
Of course, no student is tested only once, because a single test of just 10 questions is simply unreliable. That’s why over a semester a student normally gets a string of scores: at the very least the mid-term and final exam scores, plus some project scores.
Combining these scores makes a student’s final grade more reliable, because it reduces the errors inherent in every single test. A 9/10 student may very well score just 7/10 on a given day, and if we test only once, the result isn’t going to be consistent.
Before the advent of IRT, the Classical Test Theory approach focuses on dealing with exactly this issue of reliability.
It does so by measuring the “internal consistency” between items (questions) in a test, usually via a metric called Cronbach’s alpha, which, in essence, is based on the covariance between items in the test.
Once the reliability of a test is available, a standard error of measurement (SEM) is calculated from it: the standard deviation of the test scores scaled by the square root of one minus the reliability. The SEM tells us the confidence interval for each student’s score. That is, if I got 5/10 and the SEM of the test is 1, my true score is roughly 5 ± 1 (from 4 to 6).
If 5 is the pass/fail cutoff, concluding that I passed the test is a mistake: within the 68% confidence interval (the interval within 1 SEM, assuming normally distributed errors), my true score could be anywhere in [4, 6], which means I might pass or might fail. In that case, more testing needs to be done to reliably conclude whether I should pass or fail the course.
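To make this concrete, here is a minimal sketch in Python of computing Cronbach’s alpha, the SEM, and the resulting score interval. The 6×10 matrix of 0/1 item scores is entirely made up for illustration, and this is just one common way to compute these quantities, not the procedure of any particular testing program.

```python
import numpy as np

# Made-up scores: 6 students x 10 items, 1 = correct, 0 = wrong (illustrative only).
scores = np.array([
    [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
    [1, 0, 1, 1, 0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 1],
])

k = scores.shape[1]                          # number of items
item_variances = scores.var(axis=0, ddof=1)  # variance of each item across students
total_scores = scores.sum(axis=1)            # each student's raw test score
total_variance = total_scores.var(ddof=1)

# Cronbach's alpha: internal consistency, based on item vs. total score variances.
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Standard error of measurement: score SD scaled by sqrt(1 - reliability).
sem = total_scores.std(ddof=1) * np.sqrt(1 - alpha)

observed = 5
print(f"alpha = {alpha:.2f}, SEM = {sem:.2f}")
print(f"~68% interval around a score of {observed}: "
      f"[{observed - sem:.1f}, {observed + sem:.1f}]")
```

On this toy data, alpha comes out around 0.67 and the SEM around 1.5, so a raw score of 5 really means “somewhere between roughly 3.5 and 6.5”.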
In everyday scenarios, whether the scores we get in class and at school can be trusted depends on whether the test itself is reliable, and this has remained a serious challenge for the traditional testing approach.
In high-stakes testing, such as college entrance exams, questions are usually carefully calibrated beforehand to make sure the results are statistically meaningful, reducing the “luck” factor. We certainly don’t want test day to resemble a lottery.
Item response theory
The main limitation of the classical approach is that it models ability via the percentage of correct answers on the test, but it has no concrete mathematical model expressing the relationship between a test taker’s ability and each question’s difficulty.
And that’s where IRT fits into the picture: it provides a framework to model:
- The difficulty of each item
- How likely a test taker answers an item correctly given their ability
- A way to measure the measurement error
As can be guessed from its name, item response theory is characterized by:
- Item response function: modelling the probability of answering an item correctly. This is where the core model resides.
- Item information function: how much information is obtained by giving some item to a test taker. It’s calculated based on the item response probability obtained above.
At any given time during the test, instead of using the function for each item, we’d have an equivalent function to represent the current state of the test:
- Test response function: sum of response functions of all items tested so far.
- Test information function: sum of information functions of all items tested so far.
- SEM: the model provides a way to calculate the standard error of measurement at any point in the test, as the reciprocal of the square root of the test information.
The item response function, which models the probability of answering an item correctly, comes in 3 versions (a small code sketch follows the list):
- The 1PL (1 param logistic) model: the probability depends on the item difficulty (b) & test taker’s ability (theta)
- The 2PL model: adds an additional parameter: the item discrimination (a). Two items can have the same difficulty (b), but different discrimination levels.
- The 3PL model: adds a guessing parameter (c) to model the guessing behavior of test takers.
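For concreteness, here is a small Python sketch of the 3PL response function (the 2PL and 1PL fall out by fixing c = 0, and additionally a = 1) together with the corresponding item information function. The parameter values in the example are arbitrary, picked only to show the shape of the curves.

```python
import math

def prob_correct(theta, a=1.0, b=0.0, c=0.0):
    """3PL item response function: probability of answering the item correctly.
    a = discrimination, b = difficulty, c = guessing (c=0 gives the 2PL,
    and additionally a=1 gives the 1PL)."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def item_information(theta, a=1.0, b=0.0, c=0.0):
    """3PL item information: how much the item tells us about an ability theta."""
    p = prob_correct(theta, a, b, c)
    return a ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

# Example item: fairly discriminating (a=1.5), moderately hard (b=0.5),
# with a 20% guessing floor (c=0.2). Information peaks near theta = b.
for theta in [-2, -1, 0, 0.5, 1, 2]:
    p = prob_correct(theta, a=1.5, b=0.5, c=0.2)
    info = item_information(theta, a=1.5, b=0.5, c=0.2)
    print(f"theta={theta:+.1f}  P(correct)={p:.2f}  information={info:.2f}")
```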
To understand these 3 models, as well as the response and information functions, in more detail, here is the free short book where I learned all of the above: Visual IRT.
Computerized Adaptive Testing (CAT)
Having understood the basics of how IRT works, now we’re ready to get to the process of how adaptive testing works.
The general structure of a CAT process looks like this (a runnable sketch follows the list):
- Start with a best guess about the test taker’s ability, given all information sources available before the test. If none is available, just start from the mean of the ability distribution.
- Call this guessed ability theta.
- Repeat:
  - Given the estimated ability theta, select the question that yields the highest value for the item information function.
  - Deliver the question to the test taker and get back the answer (and other info if needed).
  - Given the test taker’s answer to the current question, as well as the answers to all previous questions, use maximum likelihood estimation to update the ability estimate theta.
  - Calculate the SEM at this point of the test.
  - Based on the current theta and SEM, stop if the stopping criteria are satisfied (e.g. the SEM has dropped below a target threshold).
  - Alternatively, the stopping criteria can be time-based, a maximum number of questions, or any other criteria that make sense given the testing goal.
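To tie the pieces together, here is a self-contained sketch of that loop in Python. The six-item bank, its parameter values, the grid-search maximum likelihood estimate, and the SEM threshold are all invented for illustration; a real CAT would use a calibrated bank of hundreds of items and a more careful estimator.

```python
import numpy as np

def p_correct(theta, a, b, c):
    # 3PL item response function
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def item_info(theta, a, b, c):
    # 3PL item information function
    p = p_correct(theta, a, b, c)
    return a ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

# Invented item bank: (a, b, c) per item, assumed calibrated before the test.
bank = [(1.2, -1.5, 0.2), (1.0, -0.5, 0.2), (1.4, 0.0, 0.2),
        (0.9, 0.7, 0.2), (1.6, 1.5, 0.2), (1.1, 2.2, 0.2)]

def estimate_theta(answered, responses, grid=np.linspace(-4, 4, 801)):
    # Grid-search maximum likelihood estimate of the ability theta.
    log_lik = np.zeros_like(grid)
    for (a, b, c), u in zip(answered, responses):
        p = p_correct(grid, a, b, c)
        log_lik += u * np.log(p) + (1 - u) * np.log(1 - p)
    return grid[np.argmax(log_lik)]

def simulate_cat(true_theta, sem_target=0.5, max_items=6):
    theta, answered, responses, remaining = 0.0, [], [], list(bank)
    while remaining and len(answered) < max_items:
        # Pick the item that is most informative at the current theta estimate.
        item = max(remaining, key=lambda it: item_info(theta, *it))
        remaining.remove(item)
        # Simulate the test taker's answer from their true (hidden) ability.
        u = int(np.random.rand() < p_correct(true_theta, *item))
        answered.append(item)
        responses.append(u)
        theta = estimate_theta(answered, responses)
        sem = 1 / np.sqrt(sum(item_info(theta, *it) for it in answered))
        if sem < sem_target:  # stop once the estimate is precise enough
            break
    return theta, sem, len(answered)

print(simulate_cat(true_theta=0.8))
```

With a bank this small the estimate is crude; the point is only the shape of the select, administer, re-estimate, check-SEM loop.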
For a concrete example of IRT-based CAT, here’s an example by IACAT.
As we can see, the process itself is pretty straightforward, because most of the hard work has already been done before testing takes place: parameter estimation for the questions, the statistical software, and so on.
As such, the key challenge of organizing a CAT lies in the preparation, and specifically:
- The preparation of a battery of questions that can cover the entire target range of abilities.
- Parameter estimation for each question: difficulty, discrimination, and guessing.
Parameter estimation is discussed at greater length in the Visual IRT book, Chapters 8 & 9.
As an observation, this explains why in some tests, such as TOEFL iBT, SAT or GRE, there may be an unscored portion: my guess is that this is where the parameters of new test questions are estimated.
This is indeed a clever way to get around the challenges of parameter estimation. The test taker’s actual ability is determined from the scored questions, whose parameters have already been reliably calibrated. That ability, combined with their responses to the new questions, is all that’s needed to estimate the parameters of the new questions.
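As a rough illustration of that last idea (my own sketch, not the actual calibration procedure of TOEFL, SAT, or GRE), here is how one could fit a new item’s 2PL parameters by maximum likelihood, treating the abilities already estimated from the scored questions as known:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Abilities assumed already estimated from the scored (calibrated) questions.
thetas = rng.normal(0, 1, size=500)

# Simulated responses to one new, unscored item whose true parameters we pretend
# not to know (true a = 1.3, b = 0.4). In reality these come from real test takers.
true_a, true_b = 1.3, 0.4
p_true = 1 / (1 + np.exp(-true_a * (thetas - true_b)))
responses = (rng.random(500) < p_true).astype(int)

def neg_log_lik(params):
    # 2PL negative log-likelihood for the new item, with abilities held fixed.
    a, b = params
    p = 1 / (1 + np.exp(-a * (thetas - b)))
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

fit = minimize(neg_log_lik, x0=[1.0, 0.0])
print("estimated (a, b):", fit.x)  # should land near (1.3, 0.4)
```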
P.S.: Featured image by Mikael Blomkvist from Pexels