Polysyllabic - Psychometrics
http://polysyllabic.com/?q=taxonomy/term/32
Changing the number of answer choices
http://polysyllabic.com/?q=node/287
<div class="field field-name-taxonomy-vocabulary-1 field-type-taxonomy-term-reference field-label-above"><div class="field-label">Topic: </div><div class="field-items"><div class="field-item even"><a href="/?q=taxonomy/term/32" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Psychometrics</a></div><div class="field-item odd"><a href="/?q=taxonomy/term/12" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">SAT</a></div></div></div><div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><div class="tex2jax">This is part three in my analysis of the changes to the SAT. <a href="http://www.polysyllabic.com/?q=node/285">Part 1</a>. <a href="http://www.polysyllabic.com/?q=node/286">Part 2</a>.
<p>Another forthcoming change to the SAT is the number of answer choices per question: there will be four rather than five options for all questions. This is another way in which the new SAT will more closely resemble the ACT, which already uses four-choice questions for all the tests except Mathematics.</p>
<p>I can already picture the complaints that this makes the SAT easier. And if you are approaching it from the perspective of guessing, it's clearly easier to guess correctly with fewer choices. Your odds for a single question are 25% rather than 20%. But over the course of the whole test, guessing is not a productive strategy, as <a href="http://what-if.xkcd.com/2/">Randall Munroe pointed out</a>.</p>
<p>The odds of guessing correctly on all the multiple-choice questions on the current test are</p>
<p>\[\frac{1}{5^{160}}\approx \frac{1}{6.84 \times 10^{111}}\]</p>
<p>[Note that Munroe got the number of writing questions wrong, so I corrected his figure. The actual odds are even worse than he calculated.]</p>
<p>The proposed test spec just released has fewer multiple-choice questions (141) and fewer options to choose from, so your odds rise to</p>
<p>$$\displaystyle\frac{1}{4^{141}} \approx \frac{1}{7.77 \times 10^{84}}$$</p>
<p>Of course, that's still a staggeringly remote probability. If we recalculate Munroe's computer-guessing scenario for the new test, the odds of correctly guessing all the math questions alone after 5 billion years rise to only 5.9%.</p>
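<p>The two all-correct figures are easy to check numerically. Here is a minimal sketch in Python; the function name is made up for illustration, and the question counts are simply the 160 and 141 quoted above.</p>

```python
import math

def all_correct_odds(choices, questions):
    """Return (mantissa, exponent) such that the chance of guessing every
    question correctly is 1 in mantissa * 10**exponent."""
    log_total = questions * math.log10(choices)
    exponent = math.floor(log_total)
    mantissa = 10 ** (log_total - exponent)
    return mantissa, exponent

for label, choices, questions in [("current", 5, 160), ("proposed", 4, 141)]:
    mantissa, exponent = all_correct_odds(choices, questions)
    print(f"{label}: 1 in {mantissa:.2f} x 10^{exponent}")
```

<p>Working with logarithms keeps the arithmetic in ordinary floating point, although Python's integers could also hold the exact values of 5<sup>160</sup> and 4<sup>141</sup>.</p>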
<p>There has been a lot of research about the optimal number of choices for multiple-choice questions, and it turns out that, <a href="http://en.wikipedia.org/wiki/The_Paradox_of_Choice">as in other areas of life</a>, having more choices is not necessarily better. In fact, as question writers know, it's very hard to come up with more than two or three plausible wrong answers for many questions. If you must come up with five options for every question, it's often the case that one or two are implausible fillers that few, if any, students will pick. The number of options you have does affect how many questions you need to put on the test to have a reliable measurement. For example, you can make a good test with only two options per question, but you need to have more questions on the test. On a test the length of the SAT, the choice between four or five options doesn't significantly affect the reliability, and it doesn't simplify the test for the student because the deleted option would probably have been an implausible choice in the first place.</p>
<p><a href="http://www.polysyllabic.com/?q=node/288">Part 4</a>.</p></div>
</div></div></div>
Thu, 17 Apr 2014 17:07:11 +0000
Karl Hagen
On Formula Scoring
http://polysyllabic.com/?q=node/286
<div class="field field-name-taxonomy-vocabulary-1 field-type-taxonomy-term-reference field-label-above"><div class="field-label">Topic: </div><div class="field-items"><div class="field-item even"><a href="/?q=taxonomy/term/32" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Psychometrics</a></div><div class="field-item odd"><a href="/?q=taxonomy/term/12" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">SAT</a></div></div></div><div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><div class="tex2jax">This is the second installment of my commentary on the changes to the SAT. Part 1 is <a href="http://www.polysyllabic.com/?q=node/285">here</a>
<p>There are a few changes to the new SAT that I know people will be talking a lot about but which actually matter less than you might think they would to the test taker, although they matter quite a bit to the people making the test. Of these, one has received much press attention since the initial announcement: no more deduction for wrong answers. </p>
<p>I know what most high school students are thinking: no more deduction for incorrect answers. Yay! I'll have a higher score! But hold on before you break out the sparkling cider. (If you're directly concerned about your score on the SAT you're too young for champagne.) It's true that your raw score will likely be higher. But so will everyone else's. And it's not the raw score that's reported to the colleges. It's the scaled score, and that will adapt to the new higher raw scores.</p>
<p>Dropping the "penalty" for incorrect answers, or <em>formula scoring</em> as it is technically known, will chafe a few people. There has been a long argument among specialists over whether or not it's appropriate to calculate raw scores this way. To understand what's at stake here, you need to know both how formula scoring works and what the motivation for introducing it was.</p>
<p>Formula scoring was an attempt to address a fundamental issue with multiple-choice questions: there's a chance that you can guess the right answer with no true understanding of the question at all. Imagine that we have a group of students. Some of them are risk-averse: they are reluctant to answer a question at all if they don't know the answer. These students leave tough questions blank. Others in the group are willing to take risks. When they don't know the answer, they guess and move on. With simple number-correct scoring, the risk-averse students are at a disadvantage, the argument goes, because the risk-taking students will guess, getting some additional number of points and raising their scores without merit. The correction for wrong answers in formula scoring is meant to create a disincentive for random guessing.</p>
<p>The amount of the correction is almost always chosen to be $-\dfrac{1}{k-1}$ where <em>k</em> is the number of answer choices. The logic for that number is that, mathematically, the expected value of random guessing is equal to the expected value of leaving the same number of problems blank. For the SAT, all multiple-choice problems on the current test have 5 choices, so the correction is -0.25. If you guess on 5 questions, the most likely outcome is 1 correct answer and 4 incorrect ones, for a raw score of 0, just as if you had omitted them.</p>
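<p>The arithmetic behind that correction can be spelled out in a few lines. This is only an illustrative sketch (the function names are my own, not anything from the College Board): it computes the expected raw-score change from one blind guess and the chance of each number of correct answers in five blind guesses.</p>

```python
from fractions import Fraction
from math import comb

def wrong_answer_penalty(k):
    """Deduction for a wrong answer on a k-choice question: 1/(k-1)."""
    return Fraction(1, k - 1)

def expected_blind_guess(k):
    """Expected raw-score change from one purely random guess."""
    p_right = Fraction(1, k)
    return p_right * 1 - (1 - p_right) * wrong_answer_penalty(k)

def prob_correct(n, k, right):
    """Chance of exactly `right` correct answers in n random guesses."""
    p = Fraction(1, k)
    return comb(n, right) * p**right * (1 - p)**(n - right)

print(expected_blind_guess(5))        # 0: guessing and omitting break even
for right in range(6):
    # Probability of each possible number of correct guesses out of five
    print(right, float(prob_correct(5, 5, right)))
```

<p>With five choices, exactly one correct guess out of five is the single most likely outcome (about 41%), but getting none right is not far behind (about 33%), so the break-even result holds only on average.</p>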
<p>The argument for formula scoring taps into our sense of fairness. We have an intuitive sense that you shouldn't get an unearned benefit. But in the real world, it's far from clear that formula scoring actually serves to protect anyone. For one thing, it makes the game theory about guessing more complex. The optimal strategy with formula scoring is to guess if you can eliminate one or more incorrect answers. But the notion that there is a "penalty" can bias students against guessing when it is to their advantage to do so. With a number-correct scheme, the game theory is simple: always guess rather than leave blank. As long as that strategy is clearly conveyed to all test-takers, there's no solid reason to think that anyone is actually at a disadvantage.</p>
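<p>To see why eliminating even one choice tips the balance, here is a small sketch of the expected value of a guess under each scoring scheme. The function and its parameters are hypothetical names chosen for illustration, and it assumes the student guesses uniformly among the choices that remain.</p>

```python
from fractions import Fraction

def expected_guess(k, eliminated, formula_scoring=True):
    """Expected raw-score change from guessing among the choices left
    after ruling out `eliminated` wrong answers on a k-choice question."""
    remaining = k - eliminated
    p_right = Fraction(1, remaining)
    # Formula scoring deducts 1/(k-1) per wrong answer; number-correct deducts nothing.
    penalty = Fraction(1, k - 1) if formula_scoring else Fraction(0)
    return p_right - (1 - p_right) * penalty

for eliminated in range(4):
    print(eliminated,
          expected_guess(5, eliminated),
          expected_guess(5, eliminated, formula_scoring=False))
```

<p>On a five-choice question, the expected gain under formula scoring is 0 with nothing eliminated, 1/16 of a point with one choice eliminated, 1/6 with two, and 3/8 with three. Under number-correct scoring the expected gain is positive in every case, which is why "always guess" is the whole strategy.</p>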
<p>Whether you use formula scoring or not, there is always an optimal strategy and suboptimal strategies. For the test-taker who pursues the optimal guessing strategy, the raw scores with and without formula scoring are just linear transformations of each other. And if we're concerned about fairness, picking a scheme that results in a simpler game theory is desirable because it minimizes differences among test takers with regard to their test wisdom.</p>
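<p>The linear relationship is easiest to see for a student who answers every question. With $R$ right answers out of $n$ questions and $k$ choices each, the formula score is $R - \frac{n-R}{k-1} = \frac{k}{k-1}R - \frac{n}{k-1}$. A quick check in Python, under the stated assumption that nothing is left blank (the function names are illustrative only):</p>

```python
from fractions import Fraction

def formula_score(right, wrong, k=5):
    """Raw score with a deduction of 1/(k-1) for each wrong answer."""
    return right - Fraction(wrong, k - 1)

def formula_from_number_correct(right, n, k=5):
    """Formula score written as a linear function of the number-correct
    score, for a test-taker who answers all n questions."""
    slope = Fraction(k, k - 1)
    intercept = Fraction(-n, k - 1)
    return slope * right + intercept

n = 160  # multiple-choice questions on the current test
for right in (40, 100, 160):
    direct = formula_score(right, n - right)
    linear = formula_from_number_correct(right, n)
    print(right, direct, linear)
```

<p>Because the slope is positive, the two raw scores rank such test-takers identically, and the scaling step absorbs the difference.</p>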
<p>From the test maker's point of view, the claims that formula scoring produces a more reliable test are tenuous. (Studies have indicated the effect is small, if it exists at all.) Formula scoring also adds to the mathematical complexity of the models used to calculate score scales. It's notable that the ACT, the GRE, and many other major standardized tests do not use formula scoring. If you're interested in a more detailed account, <a href="http://ncme.org/linkservid/65C63B8F-1320-5CAE-6EFB0D72A6C55DD8/showMeta/0/">this article</a>, albeit old, gives a good survey of the arguments on both sides.</p>
<p><a href="http://www.polysyllabic.com/?q=node/287">Part 3</a></p></div>
</div></div></div>
Thu, 17 Apr 2014 15:05:01 +0000
Karl Hagen
How much is an SAT essay worth to your score?
http://polysyllabic.com/?q=node/280
<div class="field field-name-taxonomy-vocabulary-1 field-type-taxonomy-term-reference field-label-above"><div class="field-label">Topic: </div><div class="field-items"><div class="field-item even"><a href="/?q=taxonomy/term/32" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">Psychometrics</a></div><div class="field-item odd"><a href="/?q=taxonomy/term/12" typeof="skos:Concept" property="rdfs:label skos:prefLabel" datatype="">SAT</a></div></div></div><div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><div class="tex2jax">As you probably know, the 200-800 score for the SAT Writing test is a composite score, based on a combination of an essay and multiple-choice questions. Students (and instructors) often ask me exactly how much the essay counts towards the overall score. Finding an answer to this question is rather tricky, particularly since the score that's reported to you is rounded to the nearest 10 points. (Internally, the ETS psychometricians use unrounded scales for their calculations. The scores are rounded before they are reported so people are less likely to place unwarranted significance on small differences in scores.)
<p>One practical consequence of rounding is that, depending on your multiple-choice raw score, a single point difference in your essay score can mean a difference to your scaled score of anywhere from 0 to 30 points. Such an answer is unsatisfying, however, so I set out to derive a more specific answer by inferring the unrounded contribution of particular essay scores based on the score scales released with publicly available tests.</p>
<p>[<strong>Update (12/6/2013):</strong> I've added more scales to my analysis (nearly doubling the data set) and refined my estimation algorithm so that it gives a more precise result for many scales. The numbers below reflect that revised analysis, which only refines the earlier analysis slightly.]</p>
<p>Here are average amounts, to the nearest tenth, that a particular essay score contributes to your overall composite writing score in comparison to an essay score of 0, which is given for omitted or off-topic essays (the normal range of scores is 2-12). These values were calculated from the score tables for 64 different essays on released tests from 2005-2013.</p>
<p>For reasons that I'll explain below, this breakdown likely is not exactly how ETS calculates things, but thinking of the essay as points added to a base score is conceptually more straightforward.</p>
<table><tr><th>Essay Score</th>
<th>Average Scaled-Score Contribution</th>
<th>Min</th>
<th>Max</th>
<th>Range</th>
<th>St. Dev.</th>
<th>Nominal Score</th>
</tr><tr><td>2</td><td>8.6</td><td>0</td><td>17.9</td><td>17.9</td><td>3.9</td><td>16.4</td>
</tr><tr><td>3</td><td>23.8</td><td>14.0</td><td>32.2</td><td>18.2</td><td>4.1</td><td>32.7</td>
</tr><tr><td>4</td><td>38.8</td><td>31.1</td><td>45.7</td><td>14.6</td><td>3.1</td><td>49.1</td>
</tr><tr><td>5</td><td>50.6</td><td>42.2</td><td>56.8</td><td>14.6</td><td>3.3</td><td>65.5</td>
</tr><tr><td>6</td><td>66.9</td><td>60.6</td><td>73.8</td><td>13.3</td><td>2.9</td><td>81.8</td>
</tr><tr><td>7</td><td>83.3</td><td>75.9</td><td>89.0</td><td>13.1</td><td>3.0</td><td>98.2</td>
</tr><tr><td>8</td><td>103.5</td><td>97.8</td><td>108.2</td><td>10.4</td><td>2.4</td><td>114.5</td>
</tr><tr><td>9</td><td>127.8</td><td>122.7</td><td>136.0</td><td>13.3</td><td>3.4</td><td>130.9</td>
</tr><tr><td>10</td><td>144.3</td><td>139.3</td><td>156.2</td><td>16.9</td><td>3.9</td><td>147.3</td>
</tr><tr><td>11</td><td>162.5</td><td>153.2</td><td>178.5</td><td>25.3</td><td>6.8</td><td>163.6</td>
</tr><tr><td>12</td><td>177.9</td><td>171.2</td><td>186.9</td><td>15.7</td><td>5.4</td><td>180</td>
</tr></table><p>The College Board states that the essay is worth about 30% of the writing score, a weighting that implies a 12 essay should translate to 180 scaled-score points [.3(800-200)=180], and the "nominal score" column shows how many scaled points an essay would be worth if it were simply given an equal portion of that total amount. As you can see, though, that's clearly not how the essay value is calculated, although these nominal amounts are within the observed range for most essay scores. The precise amount will be calculated based on the observed distribution of scores for that essay. </p>
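<p>For reference, the nominal column can be reproduced with a few lines of Python. This sketch assumes what the table values imply: that the 180 points are spread in eleven equal steps, so that a 12 is worth the full 180. The function name and parameters are my own.</p>

```python
def nominal_contribution(essay_score, weight_percent=30,
                         scale_min=200, scale_max=800):
    """Scaled-score points an essay would be worth if the essay's share of
    the scale were split into equal steps (assumption: 11 steps, 12 = full share)."""
    total = weight_percent * (scale_max - scale_min) / 100  # 180 points
    return total * (essay_score - 1) / 11

for score in range(2, 13):
    print(score, round(nominal_contribution(score), 1))
```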
<p>Notice that there's more variability in the extreme scores on both ends, but especially at the high end, than there is for the middle scores. The percentages of 11s and 12s seem to vary significantly more from test to test than any other score point.</p>
<p>If we look at the numbers in terms of incremental payoff, i.e., how many extra scaled-score points do you get, on average, for raising your essay score by one point, the biggest jump comes between the 8 and the 9 essay, which is worth on average 23.7 points to your score, followed next by the step between 7 and 8 (20.1 points). The smallest payoff comes in the step from 0 to 2 (8.6 points). Indeed, for some tests, there's essentially no practical difference, after rounding, between a 0 and a 2 on the essay. The next smallest payoff is the step between 4 and 5 (only 11.8 points). The other steps average about 15-16 scaled-score points for a one-point increase in the essay score.</p>
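<p>The step sizes can be recomputed from the average-contribution column of the table. Note that this sketch works only from the rounded averages printed above, so its results can differ somewhat from the figures quoted in the text; the ordering of the largest and smallest steps is the same.</p>

```python
# Average scaled-score contribution by essay score, copied from the table.
# A score of 0 is the baseline, so its contribution is 0 by definition.
average_contribution = {
    0: 0.0, 2: 8.6, 3: 23.8, 4: 38.8, 5: 50.6, 6: 66.9, 7: 83.3,
    8: 103.5, 9: 127.8, 10: 144.3, 11: 162.5, 12: 177.9,
}

scores = sorted(average_contribution)
steps = {
    (lo, hi): round(average_contribution[hi] - average_contribution[lo], 1)
    for lo, hi in zip(scores, scores[1:])
}

# List the steps from the biggest payoff to the smallest.
for (lo, hi), gain in sorted(steps.items(), key=lambda item: item[1], reverse=True):
    print(f"{lo} -> {hi}: {gain}")
```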
<p>So, how did I get these numbers?</p>
<p>I started by assuming that ETS would follow normal psychometric practice in creating a composite scale. That means that the composite writing score should be calculated in one of two ways: either by adding weighted raw scores to produce a composite raw score, which is then translated to a scaled score, or by assigning separate scales to the multiple-choice portion and the essay portion and adding the two to produce the final composite scale.</p>
<p>Based on the pattern of numbers in the scale tables, I strongly suspected that the second method was the one used, but I checked them both out to be sure. I could not find any set of numbers that could explain the observed score tables under the assumption that weighted raw scores were summed, but I found solutions for every score table I tried under the second method.</p>
<p>Under this method, we can think of a composite scaled score as the sum of a multiple-choice scaled score (NB: not the same as the multiple-choice subscore reported on the test report) and an essay scaled score. In other words, $S_{m,e} = S_m + S_e + 200$, where $S_m$ is the scaled-score contribution for a multiple-choice raw score of $m$, and $S_e$ is the scaled-score contribution for an essay score of $e$. Values of $S_m$ and $S_e$ need not be integers, or positive, but they must be monotonically increasing.</p>
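<p>As a concrete illustration of this additive model, the sketch below builds a reported score from the two contributions and rounds it to the nearest 10. The essay contributions are the table averages for an 8 and a 9; the multiple-choice values of 350 and 352 are purely hypothetical, and rounding halves upward is my assumption about the reporting rule. Clipping at 200 and 800 is ignored.</p>

```python
import math

def round_to_10(x):
    """Round to the nearest multiple of 10 (halves round up; an assumption)."""
    return 10 * math.floor(x / 10 + 0.5)

def composite_score(s_m, s_e):
    """Reported composite: multiple-choice part + essay part + 200, rounded."""
    return round_to_10(s_m + s_e + 200)

essay_8, essay_9 = 103.5, 127.8    # average contributions from the table
for s_m in (350.0, 352.0):         # hypothetical multiple-choice contributions
    low = composite_score(s_m, essay_8)
    high = composite_score(s_m, essay_9)
    print(s_m, low, high, high - low)
```

<p>With these inputs the same one-point essay improvement is worth 30 reported points in the first case and 20 in the second, which is the rounding effect described at the top of the post.</p>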
<p>I then wrote a routine in <a href="http://www.r-project.org/">R</a> to search for a set of values for $S_m$ and $S_e$ that produces the observed scores in the table. Typically, a range of values for each score point will work, so the routine was written to converge on a solution at the midpoint of the range of workable values. In other words, the specific numbers derived for a particular test are probably not exactly right, but they should be within a point or two of the true values. I'm not certain that my routine was the best way to do things, and it was modestly sensitive to the initial conditions, but I did get a solution for every scale that I tried, so it seemed to do the job adequately.</p>
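<p>The idea of a "range of workable values" can be made concrete. The sketch below is not the R routine itself, only one building block such a search could use, written in Python with made-up names and made-up numbers. It assumes reported scores are rounded to the nearest 10, so an observed cell constrains the unrounded sum to within 5 points. Given fixed essay contributions, it finds every multiple-choice contribution consistent with one row of a score table.</p>

```python
def workable_range(observed_row, essay_contributions):
    """Interval [lo, hi) of multiple-choice contributions that reproduce one
    row of observed composite scores, given fixed essay contributions.
    Returns None if no value fits every cell in the row."""
    lo = max(obs - 5 - 200 - s_e
             for obs, s_e in zip(observed_row, essay_contributions))
    hi = min(obs + 5 - 200 - s_e
             for obs, s_e in zip(observed_row, essay_contributions))
    return (lo, hi) if lo < hi else None

def midpoint(interval):
    """Point estimate at the middle of the workable range."""
    return (interval[0] + interval[1]) / 2

# Hypothetical row: reported scores for three essay scores at one raw score.
observed_row = [500, 510, 520]
essay_contributions = [0.0, 9.0, 24.0]   # hypothetical essay contributions

interval = workable_range(observed_row, essay_contributions)
print(interval, midpoint(interval))
```

<p>Each cell narrows the interval, so rows with many essay columns pin the value down to within a few points. A full search would also have to solve for the essay contributions, which this sketch takes as given.</p>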
<p>The scaled-score point differences between essay scores are not constant within a single test. In other words, there is no linear equation based on $e$ that will give workable values for $S_e$. For that reason, the scaled-score contribution for the essay can't be based directly on a simple linear transformation. </p>
<p>The exact procedure used to derive a specific $S_e$ for a particular test remains obscure to me. It's almost certainly based on the percentile ranks of essay scores for that test, but is the data smoothed? Is it, for example, transformed to a normal distribution, and if so, what are the parameters of the target distribution? Is the 70-30 weighting a nominal or an effective weighting? (If the latter, the actual weights of the two components will vary depending on their variance and covariance.) I can't answer any of these questions from the score tables alone, nor have I found any literature that answers them.</p>
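<p>To show what the nominal-versus-effective distinction means in practice, here is a sketch using the usual definition of an effective weight: each component's share of the composite variance, counting its own weighted variance plus its share of the covariance. The variances and covariance below are invented for illustration and have nothing to do with real SAT data.</p>

```python
def effective_weights(w1, w2, var1, var2, cov12):
    """Share of composite variance attributable to each of two components
    combined with nominal weights w1 and w2."""
    part1 = w1 * w1 * var1 + w1 * w2 * cov12
    part2 = w2 * w2 * var2 + w1 * w2 * cov12
    total = part1 + part2          # variance of the weighted composite
    return part1 / total, part2 / total

# Nominal 70-30 weighting, equal variances, hypothetical covariance of 0.5.
mc_share, essay_share = effective_weights(0.7, 0.3, 1.0, 1.0, 0.5)
print(round(mc_share, 3), round(essay_share, 3))
```

<p>With these made-up inputs the essay's effective weight comes out near 25% rather than 30%, so a 30% nominal weight need not mean the essay drives 30% of the variation in the composite.</p>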
<p>In the table above, I took the zero essay as the baseline, and for students it's probably most natural to conceive of non-zero essay scores as adding points. But it seems more likely that ETS calculates from the mean. In the terms I used above, then, $S_m$ would represent the multiple-choice scale at the mean essay score (which is normally around 7.2), and values of $S_e$ will be negative for essay scores below the mean.</p>
<p>If my supposition is right, it provides a natural explanation for an oddity that I observed occasionally when there are two or three different composite writing tables in a single test booklet. (A few different essays are typically used with the same multiple-choice questions. For example, on the Saturday test dates, different essays are used for the eastern and western halves of the United States. A separate composite table goes with each essay.) Occasionally, the columns for essays with a score of 0 will differ in a few cells for the same multiple-choice score. If $S_e$ is an amount calculated from the essay mean, the zero essay isn't special. It has a point value just like any other essay score, and if the distribution of zero essays differs enough among the different essays, the zero-essay column can vary. The fact that this is a relatively rare event suggests that the number of zero essays tends to be fairly stable.</p>
<p>This procedure leads to an apparently perverse result: you can get a different composite writing score based on which essay you <em>don't</em> write. If you think through the situation, however, keeping in mind that the purpose of a scaled score is to allow comparisons among students who took different versions of the test, this outcome can be justified.</p>
<p>Setting $S_{e0} = 0$ makes sense only if everyone who received that score would have been indifferent to the particular topic they saw. For example, perhaps they decided to skip the essay no matter what it was. </p>
<p>For many students who receive a 0 on the essay, though, this outcome will be affected by the prompt itself. If you consider the universe of potential essay topics, it seems likely that some essay topics may be more likely to provoke students not to respond at all, or to write off topic. If you are a student presented with a highly unusual topic, you might be more likely just to give up and skip the essay, or to write on a completely different subject, than if you received a more pedestrian topic. Under those circumstances, the act of omitting one essay rather than another actually could merit a different scaled score.
</p></div>
</div></div></div>
Sat, 30 Mar 2013 15:56:50 +0000
Karl Hagen