December 4, 2013

How does PISA really work? "Fix it in Post"

How can PISA claim to fairly test students in 65 countries in dozens of languages?

My vague hunch is that modern Item Response Theory testing, of which the PISA test's Rasch Model is an example, allows testers to say, much like movie directors of sloppy productions: "We'll fix it in Post." 

You tell me that during the big, expensive action scene I just shot, the leading man's fly was open and in the distant background two homeless guys got into a highly distracting shoving match? And you want to know whether we should do another take, even though we'd have to pay overtime to 125 people? 

"Eh, we'll fix it in Post."

Modern filmmakers have a lot of digital tricks up their sleeves for rescuing scenes, just as modern psychometricians have a lot of computing power available to rescue tests they've already given.

For example, how can the PISA people be sure ahead of time that their Portuguese translations are just as accurate as their Spanish translations? 

Well, verifying that ahead of time is expensive and raises security problems. But when the results come in, they can notice that, say, kids in both Brazil and Portugal who scored high overall did no better on Question 11 than kids who scored poorly on the other questions, which suggests the Portuguese translation of Question 11 might be ambiguous. Oh yeah, now that we think about it, there are two legitimately right answers to Question 11 in the Portuguese translation. So we'll drop #11 from the scoring in those two countries. But in the Spanish-speaking countries this anomaly doesn't show up in the results, so maybe we'll count Question 11 for those countries.
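
To make the mechanics concrete, here is a minimal sketch of that kind of check in Python (the 0.15 cutoff and the data layout are hypothetical illustrations, not PISA's actual procedure): within one country's sample, correlate each item with the score on the remaining items, and flag items where high scorers do no better than low scorers.

import numpy as np

def flag_suspect_items(responses, cutoff=0.15):
    # responses: 0/1 array, rows = students, columns = items
    responses = np.asarray(responses, dtype=float)
    suspects = []
    for j in range(responses.shape[1]):
        if responses[:, j].std() == 0:
            continue  # everyone answered alike; correlation is undefined
        rest = responses.sum(axis=1) - responses[:, j]  # total score excluding item j
        r = np.corrcoef(responses[:, j], rest)[0, 1]    # item-rest correlation
        if r < cutoff:
            suspects.append(j)  # high scorers do no better than low scorers here
    return suspects

Run that separately on, say, the Brazilian and Mexican samples: an item that lands on the suspect list only in the Portuguese-language countries points at the translation rather than the item itself.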

This kind of post-hoc flexibility allows PISA to wring a lot out of their data. On the other hand, it's also a little scary. 

9 comments:

Anonymous said...

There was some talk of how Finnish-speaking Finns get better scores than Swedish-speaking Finns despite similar social backgrounds.
The PISA test is a literacy test even in maths, if you look at the questions, so language plays a greater role than in TIMSS, where Asians still score about as well as they do on PISA while the Finns flounder.

Anonymous said...

still, item response theory lets you equate item difficulties - useful for anchoring - no need to give all the same items in all the countries if certain items act the same across countries. it's useful to find items that act differently in different languages (say it's an easy item for average-ability English speakers but a difficult item for high-ability Korean speakers - then it's a bad item in Korean, so you toss it out, but you still have enough items that act similarly to equate the test across countries). although, an item may be difficult in any language because it's truly difficult - OR it might simply be a bad item in any language. sorry, IRT is probably a comment-thread ender!
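
A rough sketch of that cross-language check in Python (the groups, proportions, and half-logit cutoff are hypothetical, and real IRT software estimates difficulty jointly with ability rather than from raw proportions like this): convert each group's proportion correct on an item into a logit difficulty and flag items where the two disagree.

import math

def logit_difficulty(p_correct):
    # crude difficulty estimate from one group's proportion correct
    return -math.log(p_correct / (1.0 - p_correct))

def acts_differently(p_english, p_korean, gap=0.5):
    # flag an item whose difficulty differs by more than half a logit
    # between two (hypothetical) language groups
    return abs(logit_difficulty(p_english) - logit_difficulty(p_korean)) > gap

print(acts_differently(0.80, 0.45))  # True: easy in English, hard in Korean

Items that pass this kind of screen in every language are the ones you can safely use as anchors for equating.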

David said...

A shot in the arm might be a question about "The Fast and the Furious."

Steve Sailer said...

panjoomby, thanks for clueing us in on Item Response Theory. Am I right in saying a big advance over Classical Test Theory is that you can now fix a lot of stuff after giving the test?

But, is there a danger of over-fixing the results? Are there standards for what not to do in post?

Anonymous said...

What's interesting is the interest in PISA among US elites, even though PISA seems like a relic, an anachronism from the heyday of the industrial nation-state, which is passé among Western/globalist elites. The industrial nation-state was all about meeting basic needs like education for the majority and raising the majority to basic standards of competence so they could be competent citizens for industrial enterprises, the national government, the military, the administrative apparatus, etc. PISA measures how competent and organized a country is at supplying basic education to its majority.

jody said...

observed problem difficulty between groups is one of the ways that test validity is established in research psychology. if a problem is difficult for all groups, regardless of how often each particular group tends to get it right, then the question is considered hard for everybody. if the problem is difficult only for one particular group, then it is probably not a useful problem for testing purposes and will likely be discarded.

add up a bunch of problems which all behave this way - every group encounters the same ramp-up in difficulty from question to question - and you have a pretty good test battery. problem 1 is easy for everybody, problem 2 is harder for everybody, problem 3 is harder still, and so forth. a sketch of that ordering check follows below.
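
A sketch of that ordering check in Python (the group labels and data layout are hypothetical): compute each group's proportion correct per item and confirm every group ranks the items in the same order of difficulty.

import numpy as np

def same_difficulty_order(groups):
    # groups: dict of group name -> 0/1 response matrix
    # (rows = test-takers, columns = items, same item order for all groups)
    orders = [tuple(np.argsort(-m.mean(axis=0))) for m in groups.values()]  # easiest item first
    return all(order == orders[0] for order in orders)

If one group's ordering disagrees, the disagreeing items are exactly the ones the paragraph above says get discarded.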

the difficulty of each problem can be decreased to create a hurdle that almost anybody from any group can clear - which is how the civil service exams have been written - or increased to a wall so high that only a few people from a few groups can climb it.

strictly from a testing perspective, what is not relevant are problems which almost nobody from some groups can solve. as long as everybody from every group still has a lot of trouble solving the problem, the problem is still a good one, and is useful for raising the ceiling to which the test can measure.

objections to those kinds of problem sets are generally raised by people outside the field of psychology, not by psychologists themselves.

Power Child said...

"We'll fix it in post" are known to production guys as the five most expensive words in filmmaking. Having worked in post, I can say they're still very expensive but getting less so.

Anonymous said...

yep, the problem is similar to deciding when an individual data point "qualifies" as an outlier - how outlying does it have to be before you decide it's screwed up, and how much "behind the scenes" info do you have to explain why it's so? fixing things in post is subjective. IRT lets you accurately predict what % of the population will get the next item right - say, at such-and-such an ability level a person has a 99% likelihood of getting the next item right. so when an item doesn't follow the prediction, you can't trust the item. but deciding how bad is bad - how far off the mark does it have to be before you yank it - well, that's subjective empiricism at its best!
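
For reference, the prediction described above comes from the Rasch item characteristic curve; a minimal sketch in Python (the specific numbers are illustrative - a 99% chance of success implies an ability-difficulty gap of log(99), about 4.6 logits):

import math

def p_correct(theta, b):
    # Rasch model: probability of a correct answer given
    # ability theta and item difficulty b, both in logits
    return 1.0 / (1.0 + math.exp(-(theta - b)))

print(p_correct(4.6, 0.0))  # ~0.99: ability 4.6 logits above item difficulty

When an item's observed proportions drift far from this curve, that's the misfit the comment says you end up judging subjectively.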

Anonymous said...

Wow, comparing IQ tests to filming a spectacular action sequence with explosions and airhead actors reading banter from a script. I take back everything I ever said about "psychometrics" lacking empirical rigor. Thanks, Steve.