30 May 2018

Can SAT Essay Graders Agree on What Good Writing Is?

Submitted by Karl Hagen
In my last post, I argued that the SAT essay might be inappropriately difficult for its intended purposes and that floor effects make it essentially worthless for distinguishing among lower-ability students. In this post, I'm going to argue that the SAT essay score appears similarly unhelpful for distinguishing among high-ability students. There are far fewer top scores than we would expect under reasonable assumptions about the distribution of student ability, and graders appear to disagree significantly when evaluating higher-quality work. Thus although the essay score scale creates the illusion of providing reasonably fine distinctions among ability, at best it can distinguish adequate from inadequate essays.

My data for this analysis comes mostly from Maine's statistical report for its April 2017 school-day administration. This data has obvious limitations: it comes from a single administration in one state and includes only juniors. The scores on this essay are also noticeably lower than the national mean. Still, the report covers nearly 13,000 graded essays and provides contingency tables breaking down how the different subscores relate to each other and how the two raters assigned scores. It's not adequate for a complete validity study, but there's enough here to do some exploratory analysis.

To see why I think there are too few top scores, consider an idealized situation involving a single subscore and see what distribution we would expect. In this scenario, we're going to assume that the graders make occasional mistakes but that they are unbiased. In other words, we assume that they do sometimes misclassify essays, but when they do so the direction of their error is random (if they can err in either direction).

Every essay is scored by two raters, each of whom evaluates three different dimensions of the essay (reading, analysis, and writing) on a 1-4 scale. For an explanation of the rubric, see here. You can think of each score point as reflecting a particular level of performance:

  1. inadequate
  2. limited ability
  3. adequate
  4. superior

If the two scores are no more than one point apart, they are added to form a 2-8 score. If the scores are more than a point apart, the essay is regraded by a third reader and that third reader's score is doubled to give the final score. You can also get a 0 if the essay is off-topic or otherwise unscorable (illegible handwriting, written in another language, etc.). About 2.4% of the essays in the Maine data set were scored as a 0. Since the 0 essays don't get that score from a single cause, and some of the reasons for getting a 0 are unrelated to the construct we're trying to measure, I omit them from my analysis.

Suppose that there is a "true" score distribution (on the 1-4 scale). In other words, each essay has a real level of achievement that, if we had perfect knowledge and ability, we could determine and correctly categorize. (You can think of this underlying ability as continuous if you want, but here we're assuming that ability can be placed into discrete categories that have sharp boundaries.) Suppose further that all of our essay graders are of equal ability and can score the essay correctly p percent of the time. In the remaining (100 - p) percent of cases, we'll assume they're off by one point. For the minimum and maximum scores, the direction of the error will always be away from the extreme. For the middle scores, there's an equal chance that the error can occur in either direction. (We might also posit that, since there's only one direction in which you can err at the extremes, the readers are more accurate there, but that assumption turns out to be an even poorer fit to reality than the one we're making here.)

With these assumptions, the distribution of observed scores will depend on two things: the underlying true score of the students and the accuracy of the graders. For p, we'll pick 80%. Not only is that intuitively sensible, but it implies that readers should agree on a score about 64% of the time, which is close to empirical reality.
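The 64% figure is the chance that both readers score an essay correctly (0.8 × 0.8). Under the error model described above, two readers can also agree while both being wrong, which nudges the expected agreement slightly higher. The following is a minimal sketch of that arithmetic; the function name is purely illustrative, and it assumes the two readers err independently.

```python
def agreement_given_true(p: float, is_extreme: bool) -> float:
    """Chance that two independent readers give the same score to one essay."""
    both_correct = p * p
    err = 1.0 - p
    if is_extreme:
        # A true 1 or a true 4 can only be missed in one direction,
        # so two erring readers always land on the same wrong score.
        both_wrong_same = err * err
    else:
        # A true 2 or 3 is missed high or low with equal probability,
        # so two erring readers match only half the time.
        both_wrong_same = err * err / 2
    return both_correct + both_wrong_same


middle = agreement_given_true(0.8, is_extreme=False)
extreme = agreement_given_true(0.8, is_extreme=True)
print(f"{middle:.2f} {extreme:.2f}")  # prints "0.66 0.68"
```

So with p = 80%, the model predicts exact agreement of roughly 66-68% depending on the mix of true scores, which is still in the neighborhood of the observed rate.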

For the true-score distribution, let's start by assuming that 25% of students produce an inadequate essay; 35%, an essay of limited ability; 30%, one of adequate ability; and 10%, one of superior ability. I've pulled these numbers out of thin air, but they seem roughly in line with the range of ability you might find in a typical class.

What would our score distribution look like in this case? It's simple to calculate for a single grader. The observed 1s will consist of the true 1s that were accurately observed and the true 2s that were graded one point too low. The observed 2s will consist of the accurately observed 2s, the true 1s that were graded one point too high, and the true 3s that were graded one point too low. We can continue in the same fashion to calculate the expected percentages for each score point.

Expected Observed Scores for One Reader Given an Underlying True-Score Distribution and Grader Accuracy of 80%

| Score | Assumed True Distribution | Expected Observed Distribution |
|-------|---------------------------|--------------------------------|
| 1     | 25%                       | 23.5%                          |
| 2     | 35%                       | 36%                            |
| 3     | 30%                       | 29.5%                          |
| 4     | 10%                       | 11%                            |
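The single-reader calculation described above can be sketched in a few lines. This is only an illustration of the stated assumptions (accuracy p, one-point errors, errors at the extremes pushed inward, middle errors split evenly); the function name is hypothetical.

```python
def observed_distribution(true_dist, p=0.8):
    """Expected share of each observed score (1-4) for a single reader.

    true_dist: four true-score proportions, for scores 1 through 4.
    p: probability that the reader assigns the true score.
    """
    n = len(true_dist)
    observed = [0.0] * n
    for i, share in enumerate(true_dist):
        observed[i] += share * p          # scored correctly
        miss = share * (1 - p)            # scored one point off
        if i == 0:
            observed[1] += miss           # a true 1 can only be scored too high
        elif i == n - 1:
            observed[n - 2] += miss       # a true 4 can only be scored too low
        else:
            observed[i - 1] += miss / 2   # half of the misses go low
            observed[i + 1] += miss / 2   # and half go high
    return observed


result = observed_distribution([0.25, 0.35, 0.30, 0.10])
for score, share in enumerate(result, start=1):
    print(score, f"{share:.1%}")
```

Passing a different list of proportions lets you try other assumed true-score distributions with the same grader accuracy.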

If we want to see the expected outcomes for the combined score of both readers, we can run a Monte Carlo simulation. I did so, scoring 10,000 essays per run for 10,000 runs. By adding the second reader, there's a chance that both readers can err in opposite directions, resulting in a 2-point difference that in the real world would trigger a third reading. If we assume that our third grader is a highly experienced super-grader who always determines the true score and doubles it, we wind up with the same result as just adding our first two readers' scores, since the errors cancel out. Of course, our observed score can be off from the true score by two points if both graders err in the same direction, but we're not concerned with that difference here, only with the pattern of observed scores.
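For readers who want to experiment, here is a rough sketch of such a simulation. It is not the original code: it uses a much smaller run (100,000 essays, a single pass), a fixed seed, and hypothetical function names, and it assumes the two readers err independently.

```python
import random
from collections import Counter


def read_essay(true_score, p, rng):
    """One reader's score for an essay whose true score is 1-4."""
    if rng.random() < p:
        return true_score
    if true_score == 1:
        return 2                      # errors at the bottom can only go up
    if true_score == 4:
        return 3                      # errors at the top can only go down
    return true_score + rng.choice((-1, 1))


def simulate(true_dist, p=0.8, n_essays=100_000, seed=0):
    """Return the simulated share of each combined score (2-8)."""
    rng = random.Random(seed)
    true_scores = rng.choices((1, 2, 3, 4), weights=true_dist, k=n_essays)
    counts = Counter()
    for true_score in true_scores:
        first = read_essay(true_score, p, rng)
        second = read_essay(true_score, p, rng)
        if abs(first - second) > 1:
            # Third reading: an assumed perfect reader whose score is doubled.
            combined = 2 * true_score
        else:
            combined = first + second
        counts[combined] += 1
    return {score: counts[score] / n_essays for score in range(2, 9)}


shares = simulate([0.25, 0.35, 0.30, 0.10])
for score, share in shares.items():
    print(score, f"{share:.2%}")
```

Because the run is small, the printed shares will wobble by a few tenths of a percentage point from one seed to the next.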

So how do the scores look? Scenario 1 reflects our hypothesized 25/35/30/10 breakdown of true scores, and it's nowhere close to either the national or the Maine distribution. The differences are particularly evident in the tails, which are much fatter than in reality.

Comparison Between Observed and Simulated Score Distributions of an Essay Reading Subscore

| Score | National | Maine 4/17 | Scenario 1 | Scenario 2 | Scenario 3 |
|-------|----------|------------|------------|------------|------------|
| 2     | 4.19%    | 6.14%      | 16.35%     | 4.91%      | 6.84%      |
| 3     | 7.78%    | 8.70%      | 13.61%     | 8.94%      | 10.24%     |
| 4     | 24.37%   | 26.60%     | 24.40%     | 28.50%     | 29.90%     |
| 5     | 26.32%   | 24.58%     | 10.40%     | 14.72%     | 14.40%     |
| 6     | 29.37%   | 30.00%     | 20.55%     | 33.47%     | 30.80%     |
| 7     | 6.59%    | 3.48%      | 8.00%      | 8.95%      | 7.40%      |
| 8     | 1.38%    | 0.49%      | 6.70%      | 1.11%      | 0.46%      |

Postulated True-Score Distribution for Monte Carlo Simulation Scenarios

| True Score | 1   | 2   | 3   | 4   |
|------------|-----|-----|-----|-----|
| Scenario 1 | 25% | 35% | 30% | 10% |
| Scenario 2 | 7%  | 44% | 48% | 1%  |
| Scenario 3 | 10% | 44% | 46% | 0%  |

So is it possible to come up with a scenario that is a closer fit to reality while maintaining the same grader accuracy? Scenario 2 attempts this, and the tails of the score scale are certainly closer to what we see in the national sample. To get these numbers, though, I had to set the true-score percentage for a 4 to 1%, and if we are to take that distribution seriously, that means our score categories are ridiculously unbalanced. The reason we have to make the true 4-scores so rare is that a certain number of true 3-scores will be miscategorized as 4s, and as there are so many true 3s, a certain number of essays will get 8s even if there are no students with true-4 scores. To show that, in Scenario 3 we assume that there are no true scores of 4, and it turns out that the resulting distribution gives essentially the same number of 8 scores as we see in the Maine results.
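The point about 8s appearing without any true 4s follows directly from the error model, and the arithmetic is short enough to sketch. Under the stated assumptions, a combined 8 requires both readers to award a 4, which happens either when both are right about a true 4 or when both overrate a true 3. The function name below is illustrative only.

```python
def share_of_eights(true_dist, p=0.8):
    """Expected share of combined scores of 8 under the unbiased-error model."""
    true_3, true_4 = true_dist[2], true_dist[3]
    both_correct_on_4 = true_4 * p * p
    # Each reader scores a true 3 one point too high with probability (1 - p) / 2.
    both_high_on_3 = true_3 * ((1 - p) / 2) ** 2
    return both_correct_on_4 + both_high_on_3


# Scenario 3: no true 4s at all, yet some essays still receive an 8.
print(f"{share_of_eights([0.10, 0.44, 0.46, 0.00]):.2%}")  # prints "0.46%"
```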

Notice as well that the score of 5 shows a dip in all three simulations, something the real data does not do. This is a consequence of our assumption that the direction of any grading error is unbiased for the middle scores. Instead, it seems likely that graders are biased towards the center.

To come up with a simulation that approximates the observed score distributions under the hypothesis that graders are unbiased requires us to postulate an implausibly low percentage of students in the superior score band. The results of these simulations suggest that the hypothesis is false and that the peculiarities of the score distribution are artifacts of the grading process rather than a consequence of the underlying true distribution of student writing ability. At this point, then, we need to turn to a careful examination of the graders' behavior. The national report gives no further data about this, but the Maine report calculates a number of statistics indicating how well the graders agree with each other (Tables 10-12).

The first of these (inter-rater agreement) gives intuitive but potentially misleading numbers for the percentage of essays on which the two graders agree, with an average of about 62.5% exact agreement and 35.7% adjacent agreement (scores 1 point apart). Our naive intuition will tell us that seems pretty good. Percentages can be misleading, though, because they do not correct for the possibility that the agreement occurred by chance. To account for that, the report also gives numbers for Cohen's kappa, a chance-corrected statistic, in both raw and weighted forms. The weighted version accounts for the fact that adjacent scores are not as bad a miss as non-adjacent scores, and is the more appropriate statistic to pay attention to. The weighted kappa values range from .46 to .51, and while the report itself doesn't interpret those numbers, values in that range are typically taken to indicate moderate consistency. If we only look at the numbers calculated for the report, we might shrug and decide that the raters in this case are adequate.
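To make the raw and weighted statistics concrete, here is a minimal sketch of Cohen's kappa computed from a cross-tabulation of two readers' scores. Two things are assumptions on my part: the weighted version uses linear weights (the report's weighting scheme may differ), and the example counts are invented for illustration rather than taken from the Maine report.

```python
def cohen_kappa(table, weighted=False):
    """Cohen's kappa for a square table of counts.

    Rows are reader 1's scores, columns are reader 2's scores.
    With weighted=True, a disagreement is penalized in proportion to its
    distance (linear weights, an assumed scheme).
    """
    k = len(table)
    total = sum(sum(row) for row in table)
    row_share = [sum(row) / total for row in table]
    col_share = [sum(table[i][j] for i in range(k)) / total for j in range(k)]

    def disagreement(i, j):
        if weighted:
            return abs(i - j) / (k - 1)
        return 0.0 if i == j else 1.0

    observed = sum(disagreement(i, j) * table[i][j] / total
                   for i in range(k) for j in range(k))
    expected = sum(disagreement(i, j) * row_share[i] * col_share[j]
                   for i in range(k) for j in range(k))
    # Kappa is one minus the ratio of observed to chance-expected disagreement.
    return 1.0 - observed / expected


# Hypothetical counts for illustration only; NOT data from the Maine report.
example = [
    [30, 15, 2, 0],
    [14, 90, 40, 1],
    [3, 42, 120, 8],
    [0, 2, 9, 1],
]
print(round(cohen_kappa(example), 3), round(cohen_kappa(example, weighted=True), 3))
```

The sketch does not guard against degenerate tables (for example, every essay in a single cell), where chance-expected disagreement is zero and kappa is undefined.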

But while Cohen's kappa is a traditional statistic, the one most commonly cited in publications, it does have notable weaknesses that can make it misleading, and it makes a number of assumptions about the distribution of observations that are potentially problematic. In the current case, the highly uneven frequency of the score categories suggests that graders' agreement might be conditioned upon the particular score level. In other words, graders might agree more frequently at some points on the score scale than others.

To see this effect in a fairly simple (i.e., naive but hopefully intuitive) way, we can use the data from the cross-tabulated score distributions (Tables 9a, 9b, and 9c of the Maine report) to ask what the conditional probability is, given that either reader gave a particular score, that the other reader will agree. For example, if either reader gave the essay a 3, what are the chances that the other reader also gave a 3?

Conditional Probability of Agreement by Score

| Score | Reading | Analysis | Writing |
|-------|---------|----------|---------|
| 1     | 36.29%  | 63.08%   | 43.64%  |
| 2     | 43.43%  | 40.04%   | 44.44%  |
| 3     | 50.16%  | 29.67%   | 50.43%  |
| 4     | 10.27%  | 2.98%    | 7.07%   |
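One way to compute a figure like this from a cross-tabulation is sketched below. It reads the definition above as "of all essays where at least one reader gave this score, in what share did both give it?" That reading is an assumption, as are the function name and the made-up example counts.

```python
def conditional_agreement(table):
    """For each score, P(both readers gave it | at least one reader gave it).

    Rows are reader 1's scores, columns are reader 2's scores.
    """
    k = len(table)
    result = []
    for s in range(k):
        both = table[s][s]
        row_total = sum(table[s])                       # reader 1 gave this score
        col_total = sum(table[i][s] for i in range(k))  # reader 2 gave this score
        either = row_total + col_total - both           # count agreements only once
        result.append(both / either if either else float("nan"))
    return result


# Hypothetical counts for illustration only; NOT data from the Maine report.
example = [
    [30, 15, 2, 0],
    [14, 90, 40, 1],
    [3, 42, 120, 8],
    [0, 2, 9, 1],
]
for score, prob in enumerate(conditional_agreement(example), start=1):
    print(score, f"{prob:.2%}")
```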

Although, as mentioned above, percentages aren't the best way to evaluate consistency, these differences are truly shocking. Agreement about scores of 4 is very poor. Indeed, for all subscores, the chances of the readers agreeing on a 4 are far less than chance. For the analysis subscore, readers are more than 11 times more likely to disagree by 2 or more points (33.62%) than they are to agree.

We can also calculate a conditional Cohen's kappa for each reader set, which shows similarly poor consistency for scores of 4. The unitary numbers in the Maine report conceal the terrible consistency at the top end of the scale.

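Conditional Cohen's kappa (simple)

| Score | Reading (Reader Set 1) | Reading (Reader Set 2) | Analysis (Reader Set 1) | Analysis (Reader Set 2) | Writing (Reader Set 1) | Writing (Reader Set 2) |
|-------|------------------------|------------------------|-------------------------|-------------------------|------------------------|------------------------|
| 1     | 0.463                  | 0.493                  | 0.603                   | 0.596                   | 0.542                  | 0.552                  |
| 2     | 0.313                  | 0.305                  | 0.300                   | 0.300                   | 0.323                  | 0.320                  |
| 3     | 0.407                  | 0.405                  | 0.346                   | 0.351                   | 0.445                  | 0.441                  |
| 4     | 0.160                  | 0.172                  | 0.046                   | 0.052                   | 0.109                  | 0.119                  |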
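For reference, a per-score kappa can be sketched as follows. This uses the standard conditional kappa formula (agreement among the essays one reader placed in a category, corrected for how often the other reader uses that category). Whether this matches the exact calculation behind the table above is an assumption, as is the mapping of the two reader sets to rows and columns.

```python
def conditional_kappa(table, by_rows=True):
    """Kappa for each score, conditioning on one reader having assigned it.

    Rows are reader 1's scores, columns are reader 2's scores.
    by_rows=True conditions on reader 1; by_rows=False conditions on reader 2.
    """
    k = len(table)
    total = sum(sum(row) for row in table)
    row_share = [sum(row) / total for row in table]
    col_share = [sum(table[i][j] for i in range(k)) / total for j in range(k)]
    kappas = []
    for s in range(k):
        agree = table[s][s] / total
        given = row_share[s] if by_rows else col_share[s]
        other = col_share[s] if by_rows else row_share[s]
        # Agreement rate among essays given this score, corrected for how
        # often the other reader would assign the score by chance.
        kappas.append((agree / given - other) / (1 - other))
    return kappas


# Hypothetical counts for illustration only; NOT data from the Maine report.
example = [
    [30, 15, 2, 0],
    [14, 90, 40, 1],
    [3, 42, 120, 8],
    [0, 2, 9, 1],
]
print([round(x, 3) for x in conditional_kappa(example)])
print([round(x, 3) for x in conditional_kappa(example, by_rows=False)])
```

The sketch assumes every score was used at least once by the conditioning reader; an unused category would cause a division by zero.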

These numbers show that the two reader sets behave similarly to each other. That's what we'd expect given that readers are assigned randomly from a pool, so there should not be order effects. However, the scores of 4, once again, show terrible agreement.

One might posit that we can account for at least some of the low agreement as the result of gamesmanship on the part of the readers. Because they are evaluated on their agreement with other readers, and disagreements of 2 or more points are penalized more than 1-point differences, it's plausible that some readers might decide to play it safe when confronted with an essay that might deserve an extreme score and award a 2 rather than a 1 or a 3 rather than a 4. That way it's less likely that the other reader will have a 2-point disagreement with them.

If readers are using game theory to score the essays, though, they're not applying it equally to both ends of the scale, because scores of 1 show reasonable agreement. Instead, it appears that while readers do award some 4 scores, there is virtually no consensus as to what qualities in an essay merit that score, particularly for the analysis subscore. That's a serious failure in training.

I want to emphasize that these conclusions are tentative. There's not much public data to go on, and the most detailed data comes from a single test administration, and so may not be completely representative of overall trends. Even so, essay scores are not scaled, so the raw numbers assigned by the readers are the scores reported. Even if the grading in the April 2017 Maine administration is an outlier, it directly affects those students. Discrepancies this large, even on a single test administration, suggest that there are serious training problems in establishing consistent grading of better-quality essays. It may well be that the entire scale for the SAT essay is ineffective for its intended purpose. If it is going to continue scoring the essay this way, College Board needs to produce an adequate validity study as soon as possible. These scores are being used for high-stakes admissions decisions, but there's no evidence that they will bear the weight of the interpretation that users will give them. For anyone using the scores to make decisions, it's probably best to say that scores of 6 or higher probably show the writer did an adequate job, but no more granular distinction can be justified. In other words, if you're ranking students by how many 6s, 7s, or 8s they have, there isn't evidence that these differences are anything more than random variation.