Even after an exam, how do we know whether that exam was a good one? It is obvious that any exam can only be as good as the items it comprises, but then what constitutes a good exam item? Our students seem to know, or at least believe they know. But are they correct when they claim that an item was too difficult, too tricky, or too unfair?
Lewis Aiken (1997), the author of a leading textbook on psychological and educational assessment, contends that a “postmortem” evaluation is just as necessary in classroom testing as it is in medicine. Indeed, just such a postmortem procedure for exams exists: item analysis, a group of procedures for assessing the quality of exam items. The purpose of an item analysis is to improve the quality of an exam by identifying items that are candidates for retention, revision, or removal. More specifically, an item analysis can not only identify both good and deficient items, it can also clarify which concepts the examinees have and have not mastered.
So, what procedures are involved in an item analysis? The specific procedures involved vary, but generally, they fall into one of two broad categories: qualitative and quantitative.
Item Difficulty Index (p)
The item difficulty statistic is an appropriate choice for achievement or aptitude tests when the items are scored dichotomously (i.e., correct vs. incorrect). Thus, it can be derived for true-false, multiple-choice, and matching items, and even for essay items, where the instructor can convert the range of possible point values into the categories “passing” and “failing.”
The item difficulty index, symbolized p, can be computed simply by dividing the number of test takers who answered the item correctly by the total number of students who answered the item. As a proportion, p can range between 0.00, obtained when no examinees answered the item correctly, and 1.00, obtained when all examinees answered the item correctly. Notice that a test item need not have a single p value. Not only may the p value vary with each class group that takes the test, but an instructor may also gain insight by computing the item difficulty level for different subgroups within a class, such as those who did well on the exam overall and those who performed more poorly.
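The computation described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original article; the function name and the class of 25 students are hypothetical.

```python
def item_difficulty(responses):
    """Item difficulty index p: the proportion of examinees who
    answered the item correctly.

    `responses` is a list of dichotomous item scores, coded
    1 (correct/pass) or 0 (incorrect/fail).
    """
    if not responses:
        raise ValueError("no responses recorded for this item")
    return sum(responses) / len(responses)

# Hypothetical class of 25 students, 20 of whom answered correctly:
scores = [1] * 20 + [0] * 5
print(item_difficulty(scores))  # 0.8
```

The same function can be applied to subgroups (say, the top- and bottom-scoring students on the exam overall) by passing in only those students' item scores, which is the comparison the discrimination analysis below formalizes.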
Although the computation of the item difficulty index p is quite straightforward, the interpretation of this statistic is not. To illustrate, consider an item with a difficulty level of 0.20. We do know that 20% of the examinees answered the item correctly, but we cannot be certain why they did so. Does this item difficulty level mean that the item was challenging for all but the best prepared of the examinees? Does it mean that the instructor failed in his or her attempt to teach the concept assessed by the item? Does it mean that the students failed to learn the material? Does it mean that the item was poorly written? To answer these questions, we must rely on other item analysis procedures, both qualitative and quantitative ones.
Item Discrimination Index (D)
Item discrimination analysis deals with the fact that different test takers will often answer a test item in different ways. As such, it addresses questions of considerable interest to most faculty, such as, “Does the test item differentiate those who did well on the exam overall from those who did not?” or “Does the test item differentiate those who know the material from those who do not?” In a more technical sense, then, item discrimination analysis addresses the validity of the items on a test, that is, the extent to which the items tap the attributes they were intended to assess. As with item difficulty, item discrimination analysis involves a family of techniques; which one to use depends on the type of testing situation and the nature of the items. I’m going to look at only one of those, the item discrimination index, symbolized D. The index parallels the difficulty index in that it can be used whenever items can be scored dichotomously, as correct or incorrect, and hence it is most appropriate for true-false, multiple-choice, and matching items, and for those essay items that the instructor can score as “pass” or “fail.”
We test because we want to find out if students know the material, but all we learn for certain is how they did on the exam we gave them. The item discrimination index tests the test in the hope of keeping the correlation between knowledge and exam performance as close as it can be in an admittedly imperfect system.
The item discrimination index is calculated by first dividing the examinees into an upper-scoring group and a lower-scoring group on the basis of overall exam performance. D is then the proportion of the upper group who answered the item correctly minus the proportion of the lower group who did so. D can therefore range from -1.00 to +1.00, with +1.00, a perfect positive discriminator, occurring when every student in the upper group passes the item and every student in the lower group fails it.
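A short Python sketch of that calculation follows, using the common upper-group-minus-lower-group formulation (D = p_upper − p_lower); the function name and the example groups of ten students are hypothetical.

```python
def discrimination_index(upper_correct, lower_correct):
    """Item discrimination index D = p_upper - p_lower.

    Each argument is a list of dichotomous item scores (1 = correct,
    0 = incorrect) for the upper- and lower-scoring groups on the
    exam overall.
    """
    p_upper = sum(upper_correct) / len(upper_correct)
    p_lower = sum(lower_correct) / len(lower_correct)
    return p_upper - p_lower

# Hypothetical item: 9 of 10 top scorers answer correctly,
# but only 3 of 10 bottom scorers do.
upper = [1] * 9 + [0] * 1
lower = [1] * 3 + [0] * 7
print(round(discrimination_index(upper, lower), 2))  # 0.6
```

By the rule of thumb cited below, an item with D around 0.60 would count as a strong discriminator and a clear candidate for retention.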
Though it’s not as unlikely as winning a million-dollar lottery, finding a perfect positive discriminator on an exam is relatively rare. Most psychometricians would say that items yielding positive discrimination index values of 0.30 and above are quite good discriminators and worthy of retention for future exams.
Finally, notice that difficulty and discrimination are not independent. If all the students in both the upper and lower groups either pass or fail an item, there is nothing in the data to indicate whether the item itself was good or not. Indeed, the value of the item discrimination index will be maximized when only half of the test takers overall answer an item correctly, that is, when p = 0.50. Once again, the ideal situation is one in which the half who passed the item were the students who all did well on the exam overall.
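The dependence between the two indices can be seen in a small worked example (hypothetical groups of ten students each): the perfect discriminator, where everyone in the upper group passes and everyone in the lower group fails, necessarily has an overall difficulty of exactly 0.50.

```python
# Perfect positive discriminator: all 10 upper-group students pass,
# all 10 lower-group students fail.
upper = [1] * 10
lower = [0] * 10

# Overall difficulty across both groups: 10 correct out of 20.
p_overall = sum(upper + lower) / len(upper + lower)

# Discrimination: p_upper - p_lower.
D = sum(upper) / len(upper) - sum(lower) / len(lower)

print(p_overall, D)  # 0.5 1.0
```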
Does this mean that it is never appropriate to retain items on an exam that are passed by all examinees, or by none of the examinees? Not at all. There are many reasons to include at least some such items. Very easy items can reflect the fact that some relatively straightforward concepts were taught well and mastered by all students. Similarly, an instructor may choose to include some very difficult items on an exam to challenge even the best-prepared students. The instructor should simply be aware that neither of these types of items functions well to make discriminations among those taking the test.
Copyright 1996-1999. Published by Oryx Press in conjunction with James Associates, Inc. (ISSN 1057-2880)
Aiken, L.R. (1997). Psychological testing and assessment (9th ed.). Boston, MA: Allyn and Bacon.
Cohen, R.J., Swerdlik, M.E., & Smith, D.K. (1992). Psychological testing and assessment: An introduction to tests and measurement (2nd ed.). Mountain View, CA: Mayfield Publishing Company.