## Thursday, January 1, 2009

As I argued in my previous blog on three principles of learning, paradoxes are a great way to sharpen one's intuition. The following paradox was brought to my attention by Hans Welling in Evora, Portugal (incidentally, my brother). Here is the most intuitive version I could think of.

Imagine two driving schools advertising by publishing their success rates. School A reports a success rate of 65% (720 of their 1100 students passed their driving exam), while school B reports 35% (only 380 out of 1100 students passed). Clear evidence that school A is better than school B, right? Or not?

Turns out school A had 1000 young students under the age of 30, of which 700 passed (70%), and only 100 old students over the age of 50, of which 20 passed (20%). School B, on the other hand, had only 100 young students, of which 80 passed (80%), and 1000 old students, of which 300 passed (30%). Here is the surprise: school B was performing better for both the young students (80% versus 70%) and the old students (30% versus 20%). Which school would you pick now? The paradox is resolved by realizing that school B had to deal with many more old students, who on average have much lower pass rates, so its overall rate is dragged down even though it does better within each age group.
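To check the arithmetic, here is a minimal Python sketch that recomputes the per-group and overall pass rates from the counts above. The dictionary layout and the function name `overall_rate` are my own choices for illustration, nothing more.

```python
# (passed, total) per school and age group, using the counts from the example above.
results = {
    "A": {"young": (700, 1000), "old": (20, 100)},
    "B": {"young": (80, 100), "old": (300, 1000)},
}


def overall_rate(groups):
    """Pass rate over all students of one school, ignoring age."""
    passed = sum(p for p, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return passed / total


for school, groups in results.items():
    # Pass rate within each age group.
    per_group = {name: p / t for name, (p, t) in groups.items()}
    print(school, per_group, "overall:", round(overall_rate(groups), 3))
```

School B comes out ahead within each age group, yet school A comes out ahead once the groups are lumped together, purely because of how the students are distributed over the two groups.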

This seems harmless as long as you know which subgroups you are dealing with. But now imagine doing a drug test: does drug A or drug B work better? How do you know you didn't accidentally have a high percentage of subjects in group A with gene X, which makes them react much better to drug A? You don't, and there seems to be an enormous number of possible subgroups that may randomly appear in your sample.

The best one can do is to make sure the subjects were chosen from the same population, with no hidden biases in selecting subjects for drug A or for drug B. For instance, combining two results from the literature is dangerous, because one group may be English while the other is American, or one group may have lived 20 years earlier than the other. But even so, to claim statistical significance for drug A being better than drug B (or vice versa), one would have to correct for the possibility of randomly selecting unbalanced subgroups that react differently to one of the drugs. Seems a pretty daunting task, and I am not so sure this is routinely done.
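When the relevant subgroup is known, as with age in the driving school example, one simple correction is to reweight each school's subgroup rates to a common mix of students (often called direct standardization). The sketch below is only an illustration under my own assumptions: the reference mix is the pooled age distribution of both schools, and the names `standardized_rate` and `reference_mix` are hypothetical. It does nothing about subgroups you don't know about, which is the hard part.

```python
# (passed, total) per school and age group, using the counts from the example above.
results = {
    "A": {"young": (700, 1000), "old": (20, 100)},
    "B": {"young": (80, 100), "old": (300, 1000)},
}


def standardized_rate(groups, weights):
    """Pass rate a school would have if its students followed the `weights` mix."""
    return sum(weights[g] * passed / total for g, (passed, total) in groups.items())


# Assumption: take the pooled age mix of both schools as the common reference.
group_sizes = {
    g: sum(results[school][g][1] for school in results) for g in ("young", "old")
}
n_students = sum(group_sizes.values())
reference_mix = {g: size / n_students for g, size in group_sizes.items()}

for school, groups in results.items():
    print(school, round(standardized_rate(groups, reference_mix), 3))
```

With both schools evaluated on the same half-young, half-old mix, school B scores about 55% against about 45% for school A, which agrees with the per-group comparison.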