Blog: Simpson’s paradox and causal inference
You’re studying a new medicine, but as you look at the data you notice something strange. When you only look at children, the medicine appears to be effective. When you only look at adults, the medicine also appears to be effective. But when you look at both groups together, the medicine appears to be harmful. What is going on? And how can we decide whether to use this medicine?
To better understand this paradox, let’s look at some numbers:
[Table: Simpson’s paradox illustrated with children and adults receiving medicine. Columns: untreated children, treated children, untreated adults, treated adults.]
We see that for children, 80% of untreated patients recovered, whereas 100% of treated patients recovered. For adults, 40% of untreated patients recovered, whereas 50% of treated patients recovered. However, for all patients, 60% of untreated patients recovered, whereas about 54% of treated patients recovered. If you look more closely at the data, you can see what is happening. Children have much higher recovery rates in general (treated or untreated), but far fewer children receive treatment than adults. When you put both groups together, the treated group has a higher percentage of adults than the untreated group, which makes it look like the medicine is harmful.
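The arithmetic behind the reversal can be checked with a short sketch. The counts below are hypothetical: they are not the post’s data and were chosen only so that they reproduce the percentages quoted above. The function name `recovery_rate` is likewise just an illustrative choice.

```python
# Hypothetical (recovered, total) counts per (age group, treatment arm).
# They are NOT the post's data; they were chosen only to reproduce the
# percentages quoted in the text.
counts = {
    ("children", "untreated"): (8, 10),
    ("children", "treated"): (2, 2),
    ("adults", "untreated"): (4, 10),
    ("adults", "treated"): (12, 24),
}


def recovery_rate(treatment, age=None):
    """Recovery rate for one treatment arm, optionally within one age group."""
    cells = [
        cell
        for (group, arm), cell in counts.items()
        if arm == treatment and (age is None or group == age)
    ]
    recovered = sum(r for r, _ in cells)
    total = sum(n for _, n in cells)
    return recovered / total


# Print the per-group rates first, then the pooled rates (age=None).
for age in ("children", "adults", None):
    label = age or "all patients"
    untreated = recovery_rate("untreated", age)
    treated = recovery_rate("treated", age)
    print(f"{label:>12}: untreated {untreated:.0%}, treated {treated:.0%}")
```

With these assumed counts, the treated arm is 24 adults out of 26 patients, while the untreated arm is split evenly. That imbalance alone is enough to flip the pooled comparison.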
This type of situation is more common than you might think, and is known as Simpson’s paradox or Simpson’s reversal. To solve the puzzle, we need to think about causality. We do not just want to know the raw association between taking the medicine and recovery; we want to know the causal effect of taking the medicine. For that, we need to think about the cause-and-effect relationships between all the variables in our analysis.
In this case, we know that treatment cannot cause a patient to be a child, so that cannot be why fewer children receive treatment. Instead, age must influence how likely someone is to receive the medicine. In other words, there is a cause-and-effect relationship between age (cause) and treatment (effect). We also see a causal relationship between age (cause) and recovery (effect). Because age affects both treatment and recovery, it is a confounder, and we need to control for it to estimate the effect of the medicine on recovery. Here that means comparing treated and untreated patients within each age group, where the medicine appears to be effective.
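One common way to control for age is to standardize: compute the recovery rate within each age group, then average those rates using the same age distribution for both arms. The sketch below does this under two loudly stated assumptions: age is the only variable that affects both treatment and recovery, and the counts are hypothetical values picked to match the percentages in the text, not the post’s data. The name `adjusted_rate` is illustrative.

```python
# Hypothetical (recovered, total) counts per (age group, treatment arm),
# chosen only to match the percentages quoted in the text.
counts = {
    ("children", "untreated"): (8, 10),
    ("children", "treated"): (2, 2),
    ("adults", "untreated"): (4, 10),
    ("adults", "treated"): (12, 24),
}

ages = ("children", "adults")
arms = ("untreated", "treated")
n_total = sum(n for _, n in counts.values())


def adjusted_rate(treatment):
    """Age-standardized recovery rate.

    Computes the sum over age groups of
    P(recovery | treatment, age) * P(age),
    where P(age) is the share of that age group among ALL patients.
    """
    rate = 0.0
    for age in ages:
        recovered, n = counts[(age, treatment)]
        n_age = sum(counts[(age, arm)][1] for arm in arms)
        rate += (recovered / n) * (n_age / n_total)
    return rate


effect = adjusted_rate("treated") - adjusted_rate("untreated")
print(f"adjusted untreated rate: {adjusted_rate('untreated'):.3f}")
print(f"adjusted treated rate:   {adjusted_rate('treated'):.3f}")
print(f"adjusted effect:         {effect:+.3f}")
```

Because both arms are weighted by the same age mix, the difference between them no longer reflects who happened to get treated, and the adjusted effect comes out positive for these assumed counts.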
Suppose we had the same data, but instead of the split being between children and adults, the split was between patients with low and high blood pressure, measured after treatment. Let’s assume that the medicine increases blood pressure, rather than high blood pressure making people more likely to receive treatment. (We must use additional data or outside knowledge to make this determination, because the data are compatible with either hypothesis.) Thus, there is a causal relationship between treatment (cause) and blood pressure (effect), as well as a causal relationship between blood pressure (cause) and recovery (effect). In this case, the proper analysis is the opposite of what we saw before: we must not control for blood pressure. Blood pressure lies on the causal path from treatment to recovery, so the medicine may be harming recovery by raising blood pressure, and controlling for blood pressure would mask that effect.
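To see why the pooled comparison is the right one in this scenario, here is a minimal sketch. It assumes that nothing else influences both treatment and recovery, and it uses hypothetical probabilities: the recovery rates reuse the percentages from the text with low and high blood pressure in place of children and adults, and the `p_high` values are invented to be consistent with those percentages. Each arm is averaged over the blood pressure distribution that the arm itself produces.

```python
# Hypothetical P(recovery | blood pressure, treatment), reusing the
# percentages from the text with blood pressure in place of age.
p_recover = {
    ("low", "untreated"): 0.8,
    ("low", "treated"): 1.0,
    ("high", "untreated"): 0.4,
    ("high", "treated"): 0.5,
}

# Hypothetical P(high blood pressure | treatment). These are assumed values
# encoding the scenario's premise that the medicine raises blood pressure.
p_high = {"untreated": 10 / 20, "treated": 24 / 26}


def total_rate(treatment):
    """Recovery rate in one arm, averaged over the blood pressure
    distribution that this arm itself produces."""
    high = p_high[treatment]
    low = 1 - high
    return low * p_recover[("low", treatment)] + high * p_recover[("high", treatment)]


total_effect = total_rate("treated") - total_rate("untreated")
print(f"untreated: {total_rate('untreated'):.3f}")
print(f"treated:   {total_rate('treated'):.3f}")
print(f"total effect of treatment: {total_effect:+.3f}")
```

Within each blood pressure level the treated rate is higher, yet the total effect is negative here, because under these assumptions treatment moves most patients into the high blood pressure group, where recovery is less likely.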
To properly estimate causal effects, you must understand and account for the causal relationships between variables, treatments, and outcomes. This will enable you to decide which variables to condition on in your model, and which variables to aggregate over or ignore. As you introduce more variables and the causal relationships get more complex, it can be difficult to intuitively decide how to use each variable. In our next post, we will look at a tool that automates this process for you.