7 Apr 2017

Most Test-Prep Material Sucks (Part 3)

Submitted by Karl Hagen
In which I start to analyze the mock SATs from three books.

This is the third of a multipart series exploring the proposition that a large percentage of commercial test-preparation material is of scandalously poor quality. If you want to start at the beginning, part one is here.

In the earlier parts of the series, I focused on examples from a specific mock Subject Test in Literature from Kaplan. However, it's reasonable to ask if I was unfairly cherry-picking one egregious example. After all, the Literature Subject Test has fairly limited official public information available: one in-print test, a few additional practice problems, and a skimpy overview of the test specifications. That gives a developer very little to model new questions on. Additionally, comparatively few students take the SAT Literature Subject Test, and as a consequence book sales will be more modest. Perhaps, then, we should be more forgiving if publishers devote fewer resources to a rigorous development process.

Given that students' learning outcomes are at stake, I'm not inclined to give this objection much weight. If you publish a book or offer a class on a topic, you have certain professional and ethical obligations regardless of the number of students you serve. But even if you are persuaded by the economic argument, the objection is much less persuasive when we turn to a test like the SAT Reasoning Test. Large numbers of students take the test every year, and the test-preparation industry makes lots of money from SAT classes and books. The profits for even a relatively small company are more than adequate to fund a robust development effort. For the large players like Kaplan or Princeton Review, there is absolutely no excuse.

In addition, the specifications for the new SAT are more detailed than ever before, making easily available information that you once had to glean by reverse-engineering released tests or by scrounging in the minutiae of obscure research reports. Even during the roll-out of the new format, developers had four full-length sample tests, in addition to other problem sets, to use as models. College Board also regularly releases operational forms after several test dates each year, and so over time the sample of authentic models only gets larger.

I analyzed mock SAT Reasoning Tests from three major publishers: Barron's, Kaplan, and Princeton Review, as found in these books: Barron's 6 Practice Tests for the New SAT, 2nd ed. (2016), Kaplan 8 Practice Tests for the SAT 2017, and The Princeton Review 6 Practice Tests for the SAT, 2017 edition. I chose the first test in each book on the assumption that if the authors chose to put it first, it probably reflects what they regard as their best work. If there is a drop-off in quality because of developer fatigue (running out of good ideas for questions, etc.), it ought to show up towards the end of the book. Also, as students are most likely to do the tests in order, the first test is the one where the authors have the best chance to establish their authority as people who understand the intricacies of the test.

I will evaluate these tests by looking at how well they meet the test's specifications and whether they depart from any of the established standards for creating educational assessments. Not all flaws are equivalent. Indeed, a test with multiple small problems will still be better than one with a single fatal flaw. So in addition to noting the issue, I've categorized the severity according to the following levels:

  • OK: No violation
  • Minor: A small violation of the standard that likely has a negligible impact on overall validity. An example of a minor violation would be a passage that was slightly longer or shorter than the specification demands. Real tests sometimes exhibit minor violations.
  • Flawed: A mid-level violation of the standard in a way that is likely to have a measurable impact on overall validity if not corrected for, but which in isolation would only be a small concern. In other words, concerning but not fatal. An example of this sort of violation would be a question with more than one correct answer. An accumulation of these violations will likely result in an invalid test. We don't expect real tests to have such flaws. On the rare occasions they occur in operational forms, there will be significant fallout, ranging from dropping questions from the overall score to national controversy.
  • Major: A serious and obvious violation of the standard that utterly invalidates the test. An example of a major violation would be the use of a passage taken from a completely different genre from the ones that are supposed to be on the test. Even one violation at this level makes the test worthless as an equivalent measure to the real test. Real tests simply never do these things.

I did not attempt to evaluate the appropriateness of these tests' difficulty, even though, as I mentioned in part two, that's a key criterion if you're seriously considering subjecting a student to one of these tests. For one thing, College Board has not publicized its exact psychometric specification. Equally importantly, a serious evaluation would also require administering these tests to a large, representative sample of students under controlled conditions. An informal evaluation would amount to my subjective impression, and even experts are notoriously imprecise at evaluating the exact difficulty of multiple-choice items. That said, it's reasonable to infer that a test with many flawed questions will not come close to matching the specifications of the real test.

I'll examine the four main components of the test (Reading, Writing, Mathematics, and Essay) separately. Today, I'll begin by looking at the Reading Test. First, I'll cover the reading passages themselves, and next time I'll analyze the questions.

The basic requirements of an SAT Reading Section can be summarized as follows:

100% of the questions are about reading passages. There are four passages and one pair of related passages, each 500-750 words, and each with 10 or 11 questions. The passages fall into three genres: literature, history/social science, and natural science.

Genre                  | Passages        | Words      | Total Questions      | Graphics
Literature             | 1               | 500-750    | 10                   | None
History/Social Studies | 2 or 1 + 1 pair | 500-750    | 10-11 each, 21 total | 1-2 in 1 passage
Natural Science        | 2 or 1 + 1 pair | 500-750    | 10-11 each, 21 total | 1-2 in 1 passage
Total                  | 4 + 1 pair      | 3000-3500* | 52                   |

*See the discussion of total word count below.

Within the history/social science genre, one passage or pair involves some sort of political or philosophical examination of fundamental issues in public life such as justice, fairness, suffrage, etc. The College Board calls this the "Great Global Conversation." I'll abbreviate this as GGC.

One of the history/social science passages (in practice this is always the one that is not the GGC passage) and one natural science passage will contain visual information (graphs, charts, tables, etc.) which must be interpreted in conjunction with the passage.
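For readers who prefer a spec in checkable form, the passage-level requirements above can be sketched in Python. This is only an illustration, not anything College Board publishes: the `Passage` class, its field names, and `check_reading_section` are hypothetical names invented here, the word and question counts in `sample` are made up, and I'm assuming a pair is entered as one set with a combined word count.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    genre: str                 # "literature", "history", or "science"
    words: int                 # passage text only (a pair is counted as one set)
    questions: int
    is_pair: bool = False
    has_graphic: bool = False


def check_reading_section(passages):
    """Return a list of spec problems; an empty list means none were found."""
    problems = []
    if len(passages) != 5:
        problems.append(f"expected 5 passage sets, found {len(passages)}")
    if sum(p.is_pair for p in passages) != 1:
        problems.append("expected exactly one paired set")
    for genre, expected in (("literature", 1), ("history", 2), ("science", 2)):
        found = sum(p.genre == genre for p in passages)
        if found != expected:
            problems.append(f"expected {expected} {genre} set(s), found {found}")
    for i, p in enumerate(passages, start=1):
        if not 500 <= p.words <= 750:
            problems.append(f"set {i}: {p.words} words is outside 500-750")
        if p.questions not in (10, 11):
            problems.append(f"set {i}: {p.questions} questions, expected 10 or 11")
        if p.genre == "literature" and (p.is_pair or p.has_graphic):
            problems.append(f"set {i}: literature should be unpaired, with no graphic")
    if sum(p.questions for p in passages) != 52:
        problems.append("total questions should be 52")
    for genre in ("history", "science"):
        if sum(p.has_graphic for p in passages if p.genre == genre) != 1:
            problems.append(f"expected a graphic with exactly one {genre} set")
    return problems


# Hypothetical layout that satisfies the summary above (numbers are invented).
sample = [
    Passage("literature", 700, 10),
    Passage("history", 650, 10, has_graphic=True),
    Passage("science", 600, 11),
    Passage("history", 720, 11, is_pair=True),
    Passage("science", 640, 10, has_graphic=True),
]
print(check_reading_section(sample))
```

A check like this only covers the mechanical requirements. It says nothing about whether a history passage actually belongs to the GGC category or whether a passage is any good, which is where the judgment comes in.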

So do these three tests get it right? Here's a chart summarizing the passage-level elements:

Criterion             | Barron's | Kaplan | Princeton Review
Test Formatting       | flawed   | minor  | OK
Passage Count         | OK       | OK     | OK
Passage Length        | minor    | OK     | minor
Total Word Count      | OK       | OK     | OK
Genre Distribution    | major    | OK     | major
Passage Pair          | major    | OK     | OK
Passage Order         | minor    | minor  | OK
Passage Sources       | OK       | minor  | OK
Passage Quality       | major    | flawed | flawed
Passage Fairness      | major    | minor  | flawed
Graphics Distribution | flawed   | OK     | OK

If you want an executive summary, Barron's is the worst of the bunch by a long shot, with four major violations, followed by Princeton Review and then Kaplan. From the chart, Kaplan may look fairly good, but that's only a relative benchmark. All of these tests have non-trivial flaws. And none of this analysis yet takes into account the quality of individual questions.
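If you'd like to verify the ranking yourself, here is a small Python sketch that tallies the ratings from the chart. The ratings are copied from the chart; the tie-breaking rule (compare major counts first, then flawed, then minor) is my own assumption, chosen to reflect the idea that one fatal flaw outweighs several small ones. The names `CHART`, `tally`, and `rank_worst_first` are just illustrative.

```python
from collections import Counter

PUBLISHERS = ("Barron's", "Kaplan", "Princeton Review")

# One row per criterion, in the same column order as PUBLISHERS.
CHART = {
    "Test Formatting": ("flawed", "minor", "OK"),
    "Passage Count": ("OK", "OK", "OK"),
    "Passage Length": ("minor", "OK", "minor"),
    "Total Word Count": ("OK", "OK", "OK"),
    "Genre Distribution": ("major", "OK", "major"),
    "Passage Pair": ("major", "OK", "OK"),
    "Passage Order": ("minor", "minor", "OK"),
    "Passage Sources": ("OK", "minor", "OK"),
    "Passage Quality": ("major", "flawed", "flawed"),
    "Passage Fairness": ("major", "minor", "flawed"),
    "Graphics Distribution": ("flawed", "OK", "OK"),
}


def tally(chart):
    """Count how many times each rating appears for each publisher."""
    counts = {name: Counter() for name in PUBLISHERS}
    for row in chart.values():
        for name, rating in zip(PUBLISHERS, row):
            counts[name][rating] += 1
    return counts


def rank_worst_first(counts):
    """Order publishers by (major, flawed, minor) counts, worst first."""
    def severity(name):
        c = counts[name]
        return (c["major"], c["flawed"], c["minor"])
    return sorted(PUBLISHERS, key=severity, reverse=True)


counts = tally(CHART)
for name in rank_worst_first(counts):
    print(name, dict(counts[name]))
```

Under that ordering, Barron's comes out worst with four major ratings, then Princeton Review with one, then Kaplan with none.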

Test Formatting

A global issue, relevant to the entire test, is the format of the test material. Although this is not a serious issue by itself, it does give some indication of the editorial care put into the tests. The new SAT follows a significantly different visual layout than before, and we can infer how carefully the test writers have scrutinized the new format from the choices they make.

Barron's formatting is flawed. Apart from reducing the number of answer choices from 5 to 4, the authors appear to have made no effort to rework the test to its current layout. Even easy changes, such as changing the answer-choice labels from "(A)" to "A)" or putting sentence-final punctuation at the end of answer choices that complete an incomplete stem, are ignored. Nor is there editorial consistency in how individual problems are formatted.

Kaplan's formatting doesn't look much like the real test. It's true that they have updated the format of questions, apart from the way they're numbered. Their treatment of the introductory material at the start of passages, however, does not follow the SAT's format at all. These issues are noticeable but are unlikely to cause difficulty for students, so I rate it as a minor flaw.

Princeton Review's test most closely imitates the layout of real tests. They've gone to great lengths to imitate almost all important elements of an actual test. This is one feature where I'm inclined to say not merely "OK" but "good job."

Passage Layout

All three tests get the basic features of the passage length and quantity right. Each test has four passages and one pair, with the appropriate number of questions for each passage and overall.

Word count, surprisingly, requires some judgment to analyze. Does it include just the passages themselves or also the introductory material? What about the text in tables, etc.? I checked all permutations against the released College Board tests and found that counting passages only resulted in the fewest spec violations for the College Board itself, as well as for all three mock tests.

The Barron's and Princeton Review tests each have one passage that falls outside the specified lengths, but that's a minor variation. College Board itself has released some practice tests with similar divergences.

The test spec says that the total word count of all passages should be 3250 words. That's only mildly helpful, as it doesn't provide a range the way that it does for individual passages. I've inferred a range of 3000-3500 words based on the preponderance of College Board examples, and an educated guess based on numerical symmetry. All three mock tests fall within this range, although it should be noted that College Board itself occasionally runs over by a few hundred words.
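To make the counting rule concrete, here is a minimal Python sketch of a passages-only word count checked against my inferred 3000-3500 range. The tokenizing rule (a hyphenated or apostrophized word counts once) is an assumption of this sketch rather than anything in the spec, and `count_words` and `total_in_range` are hypothetical helper names.

```python
import re

# A word is a run of letters or digits, optionally joined by internal
# apostrophes or hyphens, so "well-known" and "author's" each count once.
WORD = re.compile(r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*")


def count_words(text):
    """Count words in the passage text only (no intro blurb, no table text)."""
    return len(WORD.findall(text))


def total_in_range(passage_texts, low=3000, high=3500):
    """Return the total word count and whether it falls inside [low, high]."""
    total = sum(count_words(t) for t in passage_texts)
    return total, low <= total <= high


print(count_words("The well-known author's book."))
```

Different tokenizing rules will shift the totals by a handful of words per passage, which is one more reason to treat small overruns as minor.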

Passage Genres

The Barron's test only has one passage in the natural sciences. In place of the second science passage is one that features a bit of literary criticism. This was a genre that was possible on the old SAT but is absent from the current spec. This is a major violation and by itself makes the test completely unrepresentative.

The Barron's test does have two history/social science passages, the first of which appears to be intended as the GGC passage. The second one, however, doesn't provide any account of research, as the non-GGC passage always does. Instead, it's a general political/historical piece. This passage is also flawed.

The Princeton Review test has two history/social science passages that are research articles, but nothing in the GGC category. This is a major omission.

The Kaplan test has the correct number and balance of genres.

Passage Pair

The Barron's test makes a double passage from two fiction pieces, but fiction is the one genre that does not permit double passages. This is a major violation.

The Kaplan and Princeton Review tests create double passages in the appropriate genres. I consider their quality below under passage appropriateness.

Passage Order

Although not explicitly declared in the public spec, the order of passages appears to be fixed by genre. The following order has been used in every released test so far:

  1. Literature
  2. History/Social Science #1
  3. Science #1
  4. History/Social Science #2
  5. Science #2

The only variation is whether the GGC passage appears in the first or second History/Social Science slot.

The Princeton Review test uses this order. Barron's and Kaplan do not. With Barron's, it's at least partly a consequence of the more serious genre problems described above. In the Kaplan test, the two social science passages are back to back, as are the two natural science passages. Even if a fixed order were not required, it's bad practice to have two long stimuli of the same type adjacent to each other, as it can induce test-taker fatigue. The developers should have known better, but I scored these violations as minor as they're unlikely to make a huge difference to the students' performance. They do, however, betray a lack of attention to detail.
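Both checks in this section are mechanical enough to sketch in Python. The genre labels and function names below are hypothetical, and the `back_to_back` list is an illustrative layout with same-genre passages adjacent, not a transcription of any publisher's actual test.

```python
# Order observed in the released tests, by genre.
EXPECTED_ORDER = ["literature", "history", "science", "history", "science"]


def follows_released_order(genres):
    """True if the genres appear in the order seen on released tests."""
    return list(genres) == EXPECTED_ORDER


def adjacent_same_genre(genres):
    """Return (index, genre) for each passage followed by one of the same genre.

    Indexes are zero-based positions in the list.
    """
    return [
        (i, g)
        for i, (g, nxt) in enumerate(zip(genres, genres[1:]))
        if g == nxt
    ]


# Illustrative layout with same-genre passages placed back to back.
back_to_back = ["literature", "history", "history", "science", "science"]
print(follows_released_order(back_to_back))
print(adjacent_same_genre(back_to_back))
```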

Passage Appropriateness

It's not enough for a passage to fit the assigned genres. It must also be the type that would actually appear on a real test. In considering the appropriateness of the passage's content, there are three main considerations. First, is it a high-quality piece of writing that can support a sufficient number of good questions? Second, does the passage require (or are students significantly advantaged by knowing) overly specific background knowledge? Third, is the content fair to use on the test?

Passage Quality

The test spec describes the first criterion in terms of wanting passages that are "high-quality," "intended to represent some of the best writing and thinking in the field," and "engaging."

As part of achieving that goal, the specification states that passages must be taken from previously published sources. This requirement reflects the fact that a passage deliberately written for a test is less likely to be high-quality writing, and it leaves you dependent on the talents and idiosyncratic preferences of the item writer. By using published material you can tap the talents of a much wider range of good writers who are experts in many more fields.

Barron's and Princeton Review use credited sources for all passages. Three of the passages in the Kaplan test are uncredited and appear to be written by Kaplan.

What constitutes a high-quality passage is, of course, a bit subjective. From the practical perspective of an item writer, however, a high-quality passage is one with enough thought and nuance to support a full set of good questions. Passages that are nothing but straightforward exposition are not only boring to read but also likely to result in low-quality problems. I've seen this pattern over and over again when I look at the psychometric data on questions. Interesting passages result in good questions. Dull passages result in questions that are either trivially easy or flawed.

When you're looking at pairs of passages, you need to consider not only their individual quality but also how they relate to each other. Pairs need to exhibit some sort of tension, not necessarily direct opposition but at least a mutual engagement with an overlapping set of ideas, one that reflects two distinct viewpoints.

When I search for passages, I'm always on the lookout for writing that has a certain sinuousness, where the writer's thought doesn't merely plod linearly from start to finish but which entertains alternative ideas, negotiating different perspectives. Because I'm not doing a psychometric evaluation of these questions, however, if my only objection to a passage is that it's boring, I won't score this as a flaw.

A more readily discernible quality issue involves the amount of background knowledge required or permitted on the test. The SAT Reasoning Test is a general academic test. It's not supposed to directly measure your knowledge of literature, history, science, etc. So passages need to be chosen such that students can answer all the questions without prior knowledge of the topic. Further, students who do have extra background knowledge should not receive a significant advantage. A red flag would be a question whose right answer is obvious to someone who hasn't read the passage.

All but one of the passages on the Barron's test have flaws. Two passages (the fiction pair and the literary criticism passage) are already major violations of the spec. I won't consider them further except to note that the fiction pair is a lousy pair even on its own terms. There's virtually nothing interesting to say about their juxtaposition, and there are only two very simple questions that deal with the two passages together. I've also mentioned that one of the two history/social science passages counts as flawed in context given the subject matter. The GGC passage (the first passage in the test) is somewhat confusing without some background knowledge. The single natural-science passage is on the bland side, but is the closest this test comes to an appropriate passage, so I'll give it a pass.

Two of Kaplan's passages are flawed. Kaplan's fiction passage is OK. Indeed, College Board actually used an overlapping portion of the same story on a 2008 SAT. Kaplan appears simply to have gone back to the source and taken a larger chunk. Kaplan's GGC passage (from President Wilson's 14-point program) is flawed because it gives an unacceptable advantage to students with background knowledge of this history. The social science double passage is flawed because it's a weak pairing, reflected in the fact that there's only one problem that requires students to synthesize the two passages. I rate both the science passages as OK, even though they're on the bland side.

Three of the passages on the Princeton Review test are flawed. The fiction passage comes from the middle of Mary Shelley's Frankenstein. The particular scene chosen makes repeated allusive references to earlier events in the story. The indirectness of these references makes this passage substantially harder for students who have not read Frankenstein than for those who have. Since many students read Frankenstein in high school, this creates a group with a distinct and unfair advantage. The first history/social science passage is flawed not so much for its own qualities but because it's here in place of a GGC passage. The paired science passage is flawed because it represents a weak pairing. These are both essentially reportage on the New Horizons space mission and don't engage with each other in any significant way. The final two passages are acceptable.

Passage Fairness

In a nutshell, fairness involves the attempt to avoid putting any particular group of test-takers at a disadvantage because of factors that are unrelated to the thing we're trying to measure. Real test material is subject to fairness reviews. The standards that are relevant to the SAT are those used by ETS, the company that creates the tests for College Board, but they are substantially the same as those used by makers of high-stakes tests around the country, and so creating material that conforms to these requirements should be considered a basic professional standard.

One essential component of a fair test is that the material should not arouse excessive emotions in the test takers. Those emotions interfere with the ability to answer questions accurately and are irrelevant to what we're trying to measure. By these standards, both of the Barron's test passages in history/social science are definitely inappropriate. The first one is a piece from a Native American perspective that deliberately works to provoke outrage. It also makes repeated references to various religions, including a slighting one towards Christianity. The second passage is a lengthy consideration of the morality of dropping the Hiroshima atomic bomb. It talks extensively about war, death, suicide—all topics that are strongly disfavored in anything other than a passing mention. I mark them both as flawed.

One of the passages in the Kaplan test's paired passages involves a biographical account of a living politician. This alone might cause issues because of partisan feelings, but the ETS fairness guide also discourages biographical pieces on living notable persons because of the possibility that they may be involved in a scandal after the material is written and cause the piece to become unsuitable. (Just imagine, for example, the consternation of a test developer who created a piece about Bill Cosby right before the accusations against him gained prominence.) These aren't huge issues, but they're still things you don't expect to see on a real test. I score this as a minor flaw.

In the Princeton Review test, the Frankenstein passage involves a character experiencing intense grief over the death of his brother, who was murdered. It is highly unlikely to pass muster in a fairness review, and so I mark it as flawed. I did not notice any obvious fairness problems in the remaining passages in this test.

Graphics Layout

The test spec requires one graphical/data stimulus with a social science/history passage and another one with a science passage.

The Barron's test has only one graphic, which accompanies the sole science passage. It's also a low-quality image with no quantitative data. It differs significantly from the most common kinds of graphics found on real tests.

The Kaplan and Princeton Review tests both have graphics that conform to the specifications.