4 Apr 2017

Most Test-Prep Material Sucks (Part 2)

Submitted by Karl Hagen
This is part two of a multipart series on the quality of commercial test-preparation material.

In part one, I made the rather unoriginal claim that most commercial test-prep material sucks, and suggested that one major reason it's so bad is that doing good work is hard and that there is insufficient financial incentive for publishers to put resources into doing it right.

In this second part, I will argue for a set of standards that all creators of commercial test-prep curriculum should be held accountable to. If you're going to pay money to a test-prep outfit, you ought to be given good value for your money. I'll also outline procedures that every test-prep developer ought to be following in order to achieve those standards. In the subsequent parts of this series, I will analyze several mock SATs from major publishers to see how well they measure up.

By "commercial" I mean material created by organizations that are unaffiliated with the actual creators of the standardized tests. I'm not evaluating the quality of the underlying tests which the test-prep books target. The tests themselves are subject to legitimate criticism for their design decisions, but if you decide to produce test-prep material, your job is to create something that will give students the most realistic and effective preparation possible for the actual test.

I know from first-hand experience that the development steps I suggest are practical to implement by companies of relatively modest size. I have been writing test-prep material for over two decades and implemented these procedures at several different companies whose curriculum development I oversaw. In each case, there were measurable benefits in both total score improvement on the real tests and in customer satisfaction. If you're rigorous about it, you can make tests that are extremely close to the real thing. (Once, I had the rather odd experience of finding a practice SAT that I had written being shared online with the assertion that it was an actual SAT from a specific test date.)

I have also seen the consequences of failing to follow these procedures. No matter how talented a curriculum developer you are, you will make mistakes. Some of the material you make will be terrible. No matter how intelligent you are, no matter how well you score when you take standardized tests, no matter how much experience you have teaching or writing test items, some of what you create will suck. You will write ambiguous and flawed questions. If you don't follow a systematic process that is designed to weed out those mistakes, not only will you embarrass yourself professionally, but you will also inflict crap on unwitting students who deserve much, much better.

So what should we expect from items written for commercial test-prep books?

1. The material must adhere to the public test specifications.

Every high-stakes test makes public information about the content and format of the test. Some tests make more of this information available than others, but there's never an excuse for ignoring it. You might think that this is such an obvious point that every publisher would get this right, but my experience is that mock practice tests routinely violate the public specifications in both large and small ways. It's true that they almost always have tests with the correct number of questions and answer choices for each question. Beyond that, however, it's common to see significant inaccuracies. The problem I cited in part one is one example, as it asked a question about a content area that was beyond the scope of the test specification. There are even more egregious examples to be found in other material. For example, if the public documentation says that all reading passages will be 500-750 words long, then you'd better make sure that the passages you use comport with that requirement. If the specification outlines a specific balance of genres for those passages, you also must get that right.

2. The material must, to the extent feasible, reflect the test's implied specifications.

Even where there's substantial public information about the test, as is the case for the most recent incarnation of the SAT, you won't find everything you need to know in those documents. Testing organizations create test forms by following an extremely detailed test specification that is kept confidential. No one is likely to spill the beans to you about exactly what's in those specifications, unless you're a Reuters reporter writing exposés on the testing industry, but you can infer a lot about the specification by carefully analyzing publicly released tests, and by reading the technical research literature.

In doing this analysis, it's just as important to notice what is not on the tests as what is. The public test specifications typically describe certain categories of questions very broadly, but actual tests feature problem types from only a particular subset of theoretically possible problems. Consider, for example, the topic of pronoun case on the Writing portion of the SAT.

The specifications for both the old version of the SAT that was retired in 2016 and the current one indicate that pronoun case is one of the topics tested. So you might think that sentence patterns which worked for the old test would work equally well for the new test. A careful examination of released tests, however, shows that there have been significant changes in the specifics of which kinds of case problems are tested, notably in the case of who vs. whom. On the old SAT, the who/whom distinction was never tested. There were lots of problems involving the distinction between nominative and objective case, but only with the personal pronouns. On the new SAT, however, the who/whom distinction is tested.

In this instance, the changes actually brought the test spec closer to what some mock tests were already doing. Many test-prep books for the old SAT drilled students on the who/whom distinction even though such problems never appeared. At least now, such drills aren't completely wasted effort. However, there remain other potential topics in pronoun case that are not tested, and it's a curriculum developer's responsibility to figure out what the range of legitimate questions actually is.

Another important reason to look carefully at actual tests is that, as an item writer, you need to assimilate the sound of real questions. Typically, real questions are written by developers who refer to templates showing model problems. These are patterns that experience has shown tend to produce better questions, and so they get reused a lot. Each test tends to have its own idiom, and part of your analysis must involve figuring it out.

It's also important to note that the idiom of these questions changes over time, most obviously when a test undergoes large-scale revisions. For example, if you look at the wording of SAT Reading questions over time, there are significant differences in question topics and phrasing in each iteration of the test spec. Among other changes, the SAT has changed how it asks about central ideas in the passage. Once upon a time, one way of getting at that concept was to ask what the best title for a passage would be. All the evidence, however, suggests that this question type was retired by the time of the revision that was done in 1994. I've been collecting released practice tests over that whole period, and the latest example of this question type that I have seen comes from a test given in 1991, twenty-six years ago. And yet, even in books published in 2016, supposedly updated for the newest version of the test, you will still find this sort of question. The authors of such books are more than a quarter of a century behind the times!

3. Questions should be written according to the accepted standards for item writing used by the producers of high-stakes tests.

Even for tests that have minimal public documentation, test-prep companies could go a long way towards creating better material if they simply followed the accepted techniques for creating good test questions. There are a number of good books available that explain how to do this. They not only codify the principles used by the writers of real test questions but they explain the research which underlies those recommendations. If anyone who's reading this is actually writing their own mock test questions, you need to read the work of Thomas Haladyna now.

One very important guideline you will find in these books, and one that I find routinely violated by commercial test-prep problems, is that there should be no trick questions. A trick question is one that uses misdirection to tempt a student into selecting an incorrect answer because of irrelevant features of the question. A good distractor (i.e., a wrong answer) will reflect predictable misinterpretations or mistakes that students are likely to make if they lack a full comprehension of the topic. An example of a trick question can be found on the Kaplan Literature test that I looked at in part one.

The following question asks about an excerpt from Thoreau's Walden. I provide both the question and the explanation, which shows clearly the sort of trickery that the author uses:

14. From the passage, one can infer that the
(A) geese are back.
(B) pond is melting.
(C) woodpile is well stocked.
(D) martins are singing.
(E) mornings are foggy.

14. B
The correct response must be inferred. (C) contradicts the passage, since the narrator says (lines 35-38) that looking at the woodpile can tell you winter is over—that is, the wood that was stocked up before winter has been used up. (A), (D), and (E) are actually statements of fact, which do not have to be inferred at all. They do, however [sic] add up to the overall, not-too-subtle inference that the pond is melting, (B).

Notice what the question writer assumes here: because this is an inference question, the student is supposed to distinguish between things that are "statements of fact," by which he presumably means something directly stated in the passage, and things that are inferred. And yet none of the supposed statements of fact are, in fact, direct quotes from Thoreau, and so strictly speaking, some degree of inference is required for every answer choice. That means that somehow we need to rank the options on a scale of how directly stated they are, an arbitrary exercise where there are no clear boundaries. At what point does mere paraphrase become inference? Consider in particular the parts of Thoreau's passage that reference the first two answer choices.

The returning geese are described here:

As it grew darker, I was startled by the honking of geese flying low over the woods, like weary travelers getting in late from the southern lakes,...

The melting pond is described here:

I looked out the window, and lo! where yesterday was cold gray ice there lay the transparent pond already calm and full of hope as in a summer evening,

Thoreau doesn't tell us that the geese were actually returning. He says they are like weary travelers. Yes, it takes very little for us to conclude, based on our knowledge of migratory birds and the time of year, that the geese actually are returning. But how is it any bigger a step, when Thoreau describes a pond as covered in ice one day and transparent the next, to conclude that the pond has melted?

Not only is the distinction in inferential status between these two choices arbitrary and ill-motivated, but there's a fundamental flaw in the question writer's intention. Some of the wrong answers, particularly (A), deliberately set out to trap students who are demonstrating one of the primary skills that the test is supposed to measure: the ability to find clear textual evidence supporting an assertion.

Authentic test questions do not indulge in such cheap tricks. Distractors on real inference questions are logical errors in reasoning, or they are extrapolations that would require additional, unprovided evidence to demonstrate.

(I'd also note that this sort of question reflects a shallow understanding of what is meant by the "best" answer to a question, but that's a topic for another day.)

4. All items must be carefully edited.

At a minimum, careful editing requires two distinct stages:

First, you need content experts who themselves are trained in item development to review all items.

When a test has significant errors, it gives the impression that no one other than the test's author has looked at it. Surely a place like Kaplan or Princeton Review can afford to have one or two other teachers in the subject sit down and solve these problems. In addition to expertise in the specific subject, these reviewers need to have an understanding of what makes a good test item for this particular test.

Second, you need an expert line editor who is not the question writer.

To be useful, standardized tests must adhere to a particularly high editorial standard, and for a published test-prep book to have more than one or two minor typos is inexcusable. The number of issues in some of these books convinces me that they were typeset directly from the author's document files, with perhaps a little work by a book designer but certainly no intervention from an actual copy or line editor. No one should try to be the final editor for anything that they've written. We're all lousy at correcting things that we have written because we see what we expect to be there, not what's actually there.

5. Items should reflect a similar range of difficulty and discrimination (i.e., the ability to distinguish more able from less able students) as authentic test items.

A company that follows steps 1-4 carefully will save itself from obvious embarrassment and will make practice material that looks superficially accurate, but one of the most important steps necessary to produce practice material that is actually helpful to students requires the developer to collect statistical data, analyze it appropriately, and use the results to drive revision of the material.

This is a step that very few test-prep companies perform. I implemented these procedures at the companies I worked for, but I'm aware of no other companies that do so. (I can't claim to have done detailed research on every test-prep company with its own material, but in the instances I've looked at closely, the vague claims that tests match the difficulty of real ones aren't supported by the evidence of the questions themselves.)

Real standardized tests are assembled to precise psychometric specifications that minimize large swings in difficulty from form to form, and which ensure that all individual items are functioning as intended. Mock practice tests typically differ wildly in difficulty, both from each other and from real tests. That's an inevitable outcome if you simply construct a test without seeing how the questions perform when given to actual students. Just as we are bad at editing our own material, we're also bad at estimating how difficult a question is with any precision, especially one we've written ourselves.
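To make this concrete, here is a minimal sketch (in Python, using invented sample data, not figures from any real test) of the kind of classical item analysis a developer could run after field-testing questions on real students: difficulty as the proportion of students answering correctly, and discrimination as the corrected item-total point-biserial correlation.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # item answered identically by everyone
    return cov / (vx * vy) ** 0.5

def item_stats(responses):
    """responses: one list of 0/1 item scores per student.
    Returns (difficulty, discrimination) for each item."""
    n_students = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    stats = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        # Difficulty: proportion of students who got the item right.
        p = sum(item) / n_students
        # Discrimination: correlate the item with the rest of the
        # test (item's own score excluded to avoid inflating r).
        rest = [totals[i] - item[i] for i in range(n_students)]
        stats.append((p, pearson(item, rest)))
    return stats

# Invented responses: 4 students x 3 items, scored 0/1.
data = [[1, 1, 1],
        [1, 1, 0],
        [1, 0, 0],
        [0, 0, 0]]
for p, r in item_stats(data):
    print(f"difficulty={p:.2f}  discrimination={r:.2f}")
```

An item with a near-zero or negative discrimination is exactly the kind of flawed question this step is meant to catch: strong students miss it as often as weak ones, so it adds noise rather than measurement.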

Sometimes, the result is that a mock test will be much easier than the real thing, unjustly inflating students' perceptions of their abilities. More often, these tests wind up being significantly harder. Even if a mock test's questions are all individually well written, the cumulative effect can be one of crushing difficulty. It's easy for this to happen because when item developers set out to write questions that are challenging or interesting to them, they turn out questions that are far too hard for the intended audience.

Sometimes, I hear test-prep folk try to justify the excessive difficulty of tests by invoking a weightlifting analogy. The idea is that if you work out with heavier weights, when you repeat the exercise with lighter weights it seems much easier.

While superficially appealing, this argument has a few problems. First of all, the companies that produce these super-hard tests aren't actually doing so because they've deliberately set out to make a slightly more challenging test. If that were the case, we'd expect all their tests to be systematically harder across the board. But what we actually find is tests of greatly differing difficulty. In other words, the specific difficulty is not the result of careful planning. Instead, this excuse is an ad hoc explanation that is invoked to encourage students when they complain. Further, the most common reason for excessively difficult practice tests is not a superabundance of hard but legitimate questions. Rather, these tests normally have many flawed items. They demonstrate what is called, in the jargon of test makers, construct-irrelevant difficulty. In other words, they are hard for reasons that have nothing to do with the academic skills that the test is supposed to measure.

Even if a practice test contains nothing but quality questions, there's still no good reason to create tests that differ significantly from authentic tests in their difficulty. There is a legitimate place for creating sets of challenging practice problems that focus on specific topics. Such problem sets can allow students, especially more able ones, to focus their attention on improving their performance in areas that may only be tested once per test, or even once every few tests. But the whole reason for creating a full-length mock test is to provide students with an equivalent experience to the real thing, one where they can practice doing a specific number of questions in a set amount of time. Tests that vary significantly in difficulty alter the amount of time it takes students to complete them in ways that can fool them. They also produce scoring results that might as well be random guesses. That means they give misleading feedback to the student as to how much progress he or she has made.