1 Jan 2020

Ambiguous questions and the limitations of statistics

Submitted by Karl Hagen
A student's question drew my attention to the following Writing item from the April 2017 SAT.

Here's the relevant paragraph. I've replaced the text of other questions with the correct answers so as not to introduce confusing irrelevancies:

Toward the end of the 1400s, as the Renaissance was reaching its height in Florence, Italy, members of the city's powerful Wool Guild were celebrating their recently completed city cathedral. It was a triumph that added to Florence's reputation [24] from sophistication and beauty, yet the guild members were eager to increase its splendor.

B) for
C) to
D) with

The keyed answer is B, and this is definitely the answer I would pick on first impression. But take a closer look at choice D. What's actually wrong with it? If you're like me, your first instinct is to say something along the lines of, "We say 'a reputation for something,' not 'a reputation with something.'" In other words, this is a straightforward usage question. And that's true so far as it goes, but we don't have to read the preposition phrase "with sophistication and beauty" as modifying "reputation." We can also see it as a manner adverbial that tells us how the cathedral added to Florence's reputation.

Expressed in terms of tree diagrams, we're talking about these two different ways of parsing the sentence. The intended reading is this:

[Tree diagram: PP as a modifier of 'reputation']

The alternative reading is this:

[Tree diagram: PP as a modifier of 'added']

The second reading isn't something that an experienced reader is likely to think of first, but I can see no reason why it's either syntactically or semantically implausible: the city's reputation was enhanced in a sophisticated and beautiful manner. It fits perfectly with the sentiment of the passage. If we're going to declare the "reputation for" reading correct because it's "better," we're essentially saying that the stereotypical reading is better than the unique one. That can't be a valid justification, unless we also want to say that cliches are preferable to new formulations. This question, therefore, is ambiguous.

Now, the fact that the SAT included a technically ambiguous question on an operational form isn't by itself earth-shattering news. It does happen from time to time. Writing good standardized test questions is very difficult, and mistakes occasionally slip through even the most rigorous checking. In most cases, College Board detects such problems while tests are being scored and omits the flawed item from the scoring. That didn't happen here. This test was released more than two years ago, and as far as I know, no one has noted this ambiguity before. That includes me, and I've reviewed this test with students multiple times.

The fact that this flaw remained undetected for so long highlights a weakness in how standardized test questions are developed. There are two general ways that College Board, or any test developer, screens for problematic questions: expert review and statistical analysis of student performance. What's entailed by an expert review should, I hope, be obvious. There are various statistical tests that test makers use to screen for problematic questions. One that plays a large role in SAT test development is biserial correlation, which you can think of as the correlation between the chances of a student getting a particular item right and the student's overall score on the test. If high-performing students have essentially the same chance of getting a question right as low-performing students, the question isn't doing its job. Low or negative biserial correlations are usually a sign of a flawed question. For example, better students may be noticing wording issues with the intended answer that drive them away from it, or there may be an unintended interpretation of a distractor that also makes sense.
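To make the statistic concrete, here is a minimal Python sketch. It computes the point-biserial correlation, which is simply the ordinary Pearson correlation between a right/wrong (1/0) item result and the total score. That is a close relative of the biserial correlation, and I'm not claiming it is the exact statistic College Board computes. The function name and the eight students' scores are invented for illustration.

```python
from math import sqrt


def point_biserial(item_correct, total_scores):
    """Pearson correlation between a 0/1 item result and a total score."""
    n = len(item_correct)
    if n != len(total_scores) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 students")

    mean_x = sum(item_correct) / n
    mean_y = sum(total_scores) / n

    # Sums of cross-products and squared deviations (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(item_correct, total_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_correct)
    var_y = sum((y - mean_y) ** 2 for y in total_scores)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when either variable is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: eight students, ordered from lowest to highest total score.
total_scores = [20, 25, 30, 35, 40, 45, 50, 55]

# An item that only the top half of the group gets right.
discriminating_item = [0, 0, 0, 0, 1, 1, 1, 1]

# An item that weak and strong students get right equally often.
flat_item = [1, 0, 0, 1, 1, 0, 0, 1]

print(round(point_biserial(discriminating_item, total_scores), 2))  # about 0.87
print(round(point_biserial(flat_item, total_scores), 2))            # 0.0
```

The first item separates strong students from weak ones, so its correlation is high. The second is answered correctly just as often at the bottom of the score range as at the top, so its correlation is zero, which is the kind of result that would get an item flagged for review.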

The reason this process is not an entirely circular one (good questions are those that high-performing students get right, and high-performing students are those who get a lot of questions right) is that the correctness of a question isn't simply a matter of majority opinion among the students. Content experts have the primary say, and so well-written items must both reflect the judgment of experts in the field and work in a practical sense to distinguish students who understand the material from those who don't.

I have no doubt that this question does fine in a statistical sense. There are more sophisticated measures of item quality than biserial correlation, but they're likely to wind up with similar findings here. A phrase like "reputation for" is conventional. Both the expert reviewers and better students are likely to be drawn to it because it's so familiar, and they are unlikely to spend much time thinking through possible alternative interpretations. But if you are not familiar with the idiom (perhaps because English isn't your native language), you might be more inclined to look for logical, compositional meanings of the phrase, and in this case you'll find one. Because nonnative speakers are likely to have lower overall scores, however, the statistics won't reflect this behavior as a flaw. It's indistinguishable from the case of a good question that nonnative speakers simply have trouble with.
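A toy tabulation shows why the numbers can't flag this. The sketch below uses ten invented responses (not real SAT data) and assumes, purely for illustration, that the students choosing D are mostly lower scorers. It reports the average total score of the students who picked each option, which is one simple way to look at how the distractors are behaving.

```python
from statistics import mean

# Hypothetical responses: (total score, option chosen on this item).
# Assumption: the students who read "with" compositionally and pick D
# happen to sit in the lower part of the score range.
responses = [
    (22, "D"), (25, "C"), (28, "D"), (31, "B"),
    (36, "D"), (40, "B"), (44, "B"), (47, "B"),
    (50, "B"), (52, "B"),
]


def mean_score_by_option(responses):
    """Average total score of the students who picked each option."""
    scores_by_option = {}
    for score, option in responses:
        scores_by_option.setdefault(option, []).append(score)
    return {option: mean(scores)
            for option, scores in scores_by_option.items()}


summary = mean_score_by_option(responses)
for option in sorted(summary):
    print(option, round(summary[option], 1))
```

In this made-up data, students who chose the keyed answer B average 44, while those who chose D average under 29. That is exactly the pattern a healthy distractor produces. Nothing in the table records why a student picked D, so a defensible alternative reading and a plain mistake look the same.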

I normally find the claims that standardized tests stifle creativity to be overblown, but a question like this is an instance where the complaint is legitimate.