Guest post: Judea Pearl on correlation, causation and the psychology of Simpson’s paradox
In a recent post I discussed Judea Pearl’s work on causal inference, and in particular on the causal calculus. The post contained a number of “problems for the author” (i.e., me, Michael Nielsen). Judea Pearl has been kind enough to reply with some enlightening comments on four of those problems, as well as contributing a stimulating mini-essay on the psychology of Simpson’s paradox. The following post contains both the text of my questions and Judea’s responses, which he’s agreed to make public (thanks also to Kaoru Mulvihill for helping make this possible!). The post covers five topics, which can be read independently: some remarks on the intuition behind causal probabilities, the intuition behind d-separation, the independence of the three rules of the causal calculus, practical considerations in applying the causal calculus, and the psychology of Simpson’s paradox. Over to Judea.
Thanks for giving me the opportunity to respond to your thoughtful post on correlation, causation and the causal calculus. I will first answer four of your questions, and then comment on Simpson’s paradox and its role in understanding causation.
Question (MN): We used causal models in our definition of causal conditional probabilities. But our informal definition (imagine a hypothetical world in which it’s possible to force a variable to take a particular value) didn’t obviously require the use of a causal model. Indeed, in a real-world randomized controlled experiment it may be that there is no underlying causal model. This leads me to wonder if there is some other way of formalizing the informal definition we have given?
Yes, there is.
But we need to distinguish between “there is no underlying causal model” and “we have no idea what the underlying causal model is.” The first notion leads to a chaotic dead end, because it may mean: “everything can happen,” or “every time I conduct an experiment the world may change,” or “God may play tricks on us,” etc. Even the simple idea that randomized experiments tell us something about policy making requires the assumption that there is some order in the universe, that the randomized coin depends on some things and not others, and that the coin affects some things and not others (e.g., the outcome). So, let us not deal with this total chaos theory.
Instead, let us consider the “no idea” interpretation. It is not hard to encode this state of ignorance in the language of graphs; we simply draw a big cloud of hidden variables, imagine that each of them can influence and be influenced by all the others, and we start from there. The calculus that you described nicely in the section “Causal conditional probabilities” now tells us precisely what conclusions we can still draw in this state of ignorance and what we can no longer draw. For example, every conclusion that relies on the identity and values of $\mathrm{pa}(X_j)$ (e.g., $p(x_j \mid \mathrm{pa}(X_j))$) will no longer be computable, because most of $X_j$’s parents will be unmeasured.
Remarkably, there are some conclusions that are still inferable even with such ignorance, and you mentioned the important ones: conclusions we draw from randomized controlled experiments. Why is that so? Because there are certain things we can assume about the experiment even when we do not have any idea about the causal relationships among variables that are not under our control. For example, we can safely assume that the outcome of the coin is independent of all factors that affect the outcome $Y$, hence those factors will be the same (in probability) for treatment and control group, and will remain invariant after the experiment is over. We also know that we can overrule whatever forces compelled subjects to smoke or not smoke before the experiment and, likewise, we can assume that the coin does not affect the outcome (cancer) directly but only through the treatment (smoking). All these assumptions can be represented in the graph by replacing the former parents of $X$ by a new parent, the coin, while keeping everything else the same. From this we can derive the basic theorem of experimental studies

$$p(y \mid do(x)) = p_{\text{exp}}(y \mid x),$$

which connects policy predictions to the experimental finding $p_{\text{exp}}(y \mid x)$, the conditional probability observed in the randomized experiment.
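The role of the coin can be illustrated with a small simulation. This is only a sketch under assumptions that do not appear in the text: a single hidden factor `u` that drives both smoking and cancer, a true effect of smoking of 0.10, and made-up probabilities throughout. The function name `simulate` is likewise hypothetical.

```python
import random

def simulate(n, randomized, seed=0):
    """Return the smoker-minus-non-smoker difference in cancer rates.

    Hypothetical model: a hidden factor u raises both the chance of
    smoking and the chance of cancer; smoking itself adds 0.10.
    """
    rng = random.Random(seed)
    counts = {0: [0, 0], 1: [0, 0]}  # x -> [subjects, subjects with cancer]
    for _ in range(n):
        u = rng.random() < 0.5  # hidden cause, never recorded
        if randomized:
            # The coin replaces the former parents of X.
            x = int(rng.random() < 0.5)
        else:
            # Without the coin, u pushes people toward smoking.
            x = int(rng.random() < (0.8 if u else 0.2))
        p_cancer = 0.1 + 0.1 * x + 0.3 * u  # assumed true effect of x: 0.10
        y = int(rng.random() < p_cancer)
        counts[x][0] += 1
        counts[x][1] += y
    return counts[1][1] / counts[1][0] - counts[0][1] / counts[0][0]

observational = simulate(200_000, randomized=False)
experimental = simulate(200_000, randomized=True)
print(f"observational difference: {observational:.3f}")
print(f"experimental difference:  {experimental:.3f}")
```

Under these assumed numbers the observational contrast is inflated by the hidden factor, while the randomized contrast stays close to the assumed effect of 0.10, even though `u` is never measured in either run.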
There is another way of representing black-box ignorance of the kind faced by a researcher who conducts a randomized experiment with zero knowledge about what affects what. This approach, called “potential outcome,” was pioneered by Neyman in the 1930s and further developed by Don Rubin since 1974. The idea is to define a latent variable $Y_x$ which stands for the value that $Y$ would attain if we were to manipulate $X$ to value $x$. The target of analysis is to estimate the expected value of the difference $Y_1 - Y_0$ from experimental data, where $Y_x$ is only partially observed: $Y_1$ is known for the treatment group and $Y_0$ for the control group. The analysis within this framework proceeds by assuming that, in a randomized trial, $Y_x$ is independent of $X$, an assumption called “ignorability.” The justification of this assumption demands some stretching of the imagination, because we really do not know much about the hypothetical $Y_x$, so how can we judge whether it is dependent on or independent of other variables, like $X$? The justification becomes compelling if we go back to the graph and ask ourselves: what could account for the statistical variations of $Y_x$ (i.e., the variations from individual to individual)? The answer is that $Y_x$ is none other than all factors that affect $Y$ when $X$ is held constant at $x$. With this understanding, it makes sense to assume that $Y_x$ is independent of $X$, since $X$ is determined by the coin and the coin is presumed independent of all factors that affect $Y$.
Here, again, the graph explicates an independence that the formal counterfactual approach takes for granted and thus provides the justification for inferences under CRT.
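A toy version of the potential-outcome bookkeeping may make this concrete. The sketch below assumes a binary treatment, a made-up latent variable `frailty` that fixes both potential outcomes `y0` and `y1` before the coin is tossed, and a fair coin. None of these names or numbers come from the text.

```python
import random

rng = random.Random(1)
n = 100_000
treated, control, unit_effects = [], [], []

for _ in range(n):
    # Latent factors determine BOTH potential outcomes in advance.
    frailty = rng.random()
    y0 = int(frailty < 0.2)  # outcome if X were set to 0
    y1 = int(frailty < 0.5)  # outcome if X were set to 1
    unit_effects.append(y1 - y0)

    # Fair coin, independent of frailty: this is the ignorability assumption.
    x = int(rng.random() < 0.5)

    # Only one of the two potential outcomes is ever observed per individual.
    if x:
        treated.append(y1)
    else:
        control.append(y0)

true_effect = sum(unit_effects) / n  # needs both outcomes, so unobservable in practice
estimate = sum(treated) / len(treated) - sum(control) / len(control)
print(f"true average effect: {true_effect:.3f}")
print(f"difference in means: {estimate:.3f}")
```

Because the coin ignores `frailty`, the two groups carry the same distribution of latent factors, and the difference in observed means tracks the average of `y1 - y0` even though no individual reveals both.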
In a typical application, however, we may have only partial knowledge of the model’s structure and, while the black-box approach fails to utilize what we do know, the graphical approach displays what we know and can be used to determine whether we know enough to infer what we need. More specifically, while the two approaches are logically equivalent, the potential outcome approach requires that we express what we know in terms of independencies among counterfactual variables, a cognitively formidable task, while the graphical approach permits us to express knowledge in the form of missing arrows. The latter is transparent, the former opaque.
Question (MN): The concept of $d$-separation plays a central role in the causal calculus. My sense is that it should be possible to find a cleaner and more intuitive definition that substantially simplifies many proofs. It’d be good to spend some time trying to find such a definition.
This was done, indeed, by Lauritzen et al. (1990) and is described on pages 426-427 of the Epilogue of my book. It involves “marrying” the parents of any node which is an ancestor of a variable on which we condition. This method may indeed be more convenient for theorem proving in certain circumstances. However, since the conditioning set varies from problem to problem, the graph too would vary from problem to problem and I find it more comfortable to work with $d$-separation. A gentle, no-tear introduction to $d$-separation is given on page 335 of Causality.
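The “marrying parents” test can be sketched in a few lines. What follows is the standard moralization procedure as it is usually stated (restrict to the ancestors of all variables involved, marry co-parents, drop arrow directions, delete the conditioning set, check connectivity), not a transcription of the book’s pages. The function name `d_separated` and the parent-dictionary representation are illustrative assumptions.

```python
from itertools import combinations

def d_separated(parents, xs, ys, zs):
    """Test whether xs and ys are d-separated by zs in a DAG.

    parents maps each node to the set of its parents; nodes that do not
    appear as keys are treated as having no parents.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)

    # 1. Keep only xs, ys, zs and all of their ancestors.
    keep, stack = set(), list(xs | ys | zs)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))

    # 2. Moralize: connect each node to its parents, "marry" every pair
    #    of parents sharing a child, and forget arrow directions.
    adj = {node: set() for node in keep}
    for child in keep:
        ps = list(parents.get(child, ()))
        for p in ps:
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(ps, 2):
            adj[p].add(q)
            adj[q].add(p)

    # 3. Delete zs and check whether any x can still reach any y.
    seen, stack = set(), [x for x in xs if x not in zs]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(m for m in adj[node] if m not in zs and m not in seen)
    return not (seen & ys)

collider = {"C": {"A", "B"}}          # A -> C <- B
chain = {"M": {"A"}, "B": {"M"}}      # A -> M -> B
print(d_separated(collider, {"A"}, {"B"}, set()))   # collider blocks the path
print(d_separated(collider, {"A"}, {"B"}, {"C"}))   # conditioning on C opens it
print(d_separated(chain, {"A"}, {"B"}, {"M"}))      # conditioning on M blocks the chain
```

The moral graph is rebuilt on every call because it depends on the conditioning set, which is the inconvenience mentioned above.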
Question (MN): Suppose the conditions of rules 1 and 2 [of the causal calculus] hold. Can we deduce that the conditions of rule 3 also hold?
No, these rules are both “independent” and “complete.” The independence can be seen in any graph in which $X$ does not affect $Y$. Regardless of other conditions, such a graph allows us to invoke rule 3 and conclude

$$p(y \mid do(x)) = p(y).$$

This makes sense: manipulating a later variable does not affect the earlier variable. Completeness means that we have not forgotten any rule; every correct sentence (given that the graph is correct) is derivable syntactically using our three rules. (This was proven only recently, in 2006.)
Update (MN): In the comments below, Judea notes that this needs amendment: Ilya Shpitser noted that my answer to Question 3 needs correction. I said “No, these rules are both ‘independent’ and ‘complete.’” In fact, Huang and Valtorta showed that rule 1 is implied by rules 2 and 3; their proof is given as Lemma 4 of their paper.
Question (MN): In real-world experiments there are many practical issues that must be addressed to design a realizable randomized, controlled experiment. These issues include selection bias, blinding, and many others. There is an entire field of experimental design devoted to addressing such issues. By comparison, my description of causal inference ignores many of these practical issues. Can we integrate the best thinking on experimental design with ideas such as causal conditional probabilities and the causal calculus?
All the issues you described have nice graphical representations in causal graphs and lend themselves to precise formulation and meaningful solution. Take selection bias, for instance: my student Elias Bareinboim just came up with a definitive criterion that tells us which graphs allow for the removal of selection bias and which do not.
The same can be said about measurement errors, effect decomposition (mediation) and other frills of experimental design. One would be hard pressed today to find a journal of epidemiology without causal diagrams. The last breakthrough I can report is a solution to the century-old problem of “transportability” or “external validity,” that is, under what conditions it is possible to generalize experimental results from one population to another, which may differ in several aspects from the first.
This is truly an incredible breakthrough (see http://ftp.cs.ucla.edu/pub/stat_ser/r372.pdf) and the only reason we have not heard the splash in journals like Science or Nature is that the people who can benefit from it are still living in the Arrow-phobic age, when graphs were considered “too seductive” to have scientific value. Hopefully, some of your readers will help change that.
A full understanding of Simpson’s paradox should explain why an innocent arithmetic reversal of an association, not uncommon, came to be regarded as “paradoxical,” and why it has captured the fascination of statisticians, mathematicians and philosophers for over a century (it was first labelled “paradox” by Blyth (1972)). The arithmetic of proportions has its share of peculiarities, no doubt, but these tend to become objects of curiosity once demonstrated and explained away by vivid examples. For example, naive students of probability expect the average of a product to equal the product of the averages but quickly learn to guard against such expectations, given a few counterexamples. Likewise, students expect an association measured in a mixture distribution to equal a weighted average of the source associations. They are surprised, therefore, when ratios of sums, $(a+b)/(c+d)$, are found to be ordered differently than individual ratios, $a/c$ and $b/d$. Again, such arithmetic peculiarities are quickly accommodated by seasoned students as reminders against simplistic reasoning.
The sign reversal we see in Simpson’s paradox is of a different nature because, even after seeing the numbers and checking the arithmetic, people are left with a sense of “impossibility.” In other words, the ramifications of reversal seem to clash so intensely with deeply held convictions that people are willing to proclaim the reversal impossible. This is seen particularly vividly in the Kidney Stone treatment example that you mentioned, because it challenges our deeply held beliefs about rational decision making. In this example, it is clear that if one is diagnosed with “Small Stones” or “Large Stones,” Treatment $A$ would be preferred to Treatment $B$. But what if a patient is not diagnosed, and the size of the stone is not known; would it be appropriate then to consult the aggregated data, according to which Treatment $B$ is preferred?
This would stand contrary to common sense; a treatment that is preferred both under one condition and under its negation should also be preferred when the condition is unknown.
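The reversal itself is plain arithmetic and easy to verify. The counts below are the widely quoted kidney stone figures (Charig et al., 1986). They are not reproduced in this post, so treat them as an assumption about which numbers the example refers to. Exact fractions are used so that the comparison does not depend on rounding.

```python
from fractions import Fraction

# (successes, patients) per treatment and stone size; assumed figures.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(treatment, size):
    """Success rate within one stone-size group."""
    successes, patients = data[treatment][size]
    return Fraction(successes, patients)

def pooled(treatment):
    """Success rate after adding up both groups (a ratio of sums)."""
    successes = sum(s for s, _ in data[treatment].values())
    patients = sum(n for _, n in data[treatment].values())
    return Fraction(successes, patients)

for size in ("small", "large"):
    print(f"{size:5s}: A {float(rate('A', size)):.3f}  B {float(rate('B', size)):.3f}")
print(f"all  : A {float(pooled('A')):.3f}  B {float(pooled('B')):.3f}")
```

With these counts, one treatment has the higher rate in each stone-size group while the other has the higher pooled rate, which is exactly the ordering flip between individual ratios and ratios of sums.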
Or perhaps we should always prefer the partitioned data over the aggregate? This leads us to another blind alley. What prevents one from partitioning the data into arbitrary sub-categories (say based on eye color or post-treatment pain), each advertising a different treatment preference? Which among the many possible ways of partitioning the data should give us the correct decision?
The last problem, fortunately, was given a complete solution using $d$-separation: A good partition is one that $d$-separates all paths between $X$ and $Y$ that contain an arrow into $X$. All such partitions will lead to the same decision and the same $p(y \mid do(x))$. So, here we have an instance where a century-old paradox is reduced to graph theory and resolved by algorithmic routine, a very gratifying experience for mathematicians and computer scientists.
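Once a good partition is chosen, the causal quantity is a weighted average of the within-group rates, with weights taken from the whole population rather than from each treatment arm. The sketch below assumes that stone size is such a partition in the kidney stone example (it influences both the choice of treatment and recovery) and again uses the widely quoted Charig et al. (1986) counts, which are not given in this post. The function name `adjusted` is illustrative.

```python
from fractions import Fraction

# (successes, patients) per treatment and stone size; assumed figures.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def adjusted(treatment):
    """Sum over sizes of p(success | treatment, size) * p(size)."""
    total = sum(n for groups in data.values() for _, n in groups.values())
    result = Fraction(0)
    for size in ("small", "large"):
        successes, patients = data[treatment][size]
        # Weight by how common this stone size is in the whole population,
        # not by how often it was given this particular treatment.
        in_size = sum(data[t][size][1] for t in data)
        result += Fraction(successes, patients) * Fraction(in_size, total)
    return result

for treatment in ("A", "B"):
    print(f"adjusted success rate for {treatment}: {float(adjusted(treatment)):.4f}")
```

Under that assumption the adjusted rates agree with the within-group preference, so the aggregate table is the misleading one in this example.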
But I want to include psychologists and philosophers of mind in this excitement by asking if they can identify the origin of the principle we invoked earlier: “a treatment that is preferred both under one condition and under its negation should also be preferred when the condition is unknown.” Whence comes this principle? Is it merely strong intuition or an experimentally verifiable proposition? Can a drug be invented that is beneficial to males and females but harmful to the population as a whole? Whatever its origin, how is this conviction encoded and processed in the mind? The fact that people share this conviction and are willing to bet that a miracle drug with such characteristics is simply impossible tells us that we share a calculus through which the conclusion is derived swiftly and unambiguously. What is this calculus? It is certainly not probability calculus, because we have seen that probability does sanction reversals, so there is nothing in probability calculus that would rule out the miracle drug that people deem impossible.
I hate to hold readers in suspense, but I am proposing that the calculus we use in our head is none other than the $do$-calculus described in your post. Indeed, the impossibility of such a miracle drug follows from a theorem in $do$-calculus (Pearl, 2009, p. 181), which reads:
An action $A$ that increases the probability of an event $B$ in each subpopulation $C_i$ of $C$ must also increase the probability of $B$ in the population as a whole, provided that the action does not change the distribution of the subpopulations.
Thus, regardless of whether effect size is measured by differences or ratios, regardless of whether $C$ is a confounder or not, and regardless of whether we have the correct causal structure in hand, our intuition should be offended by any effect reversal due to combining subpopulations.
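The argument behind the theorem is short. The following is a sketch in notation chosen here for illustration: $A$ is the action, $A'$ its alternative, $B$ the event, and $C_i$ the subpopulations. It uses the two premises in turn: first that the action leaves the subpopulation distribution unchanged, $p(C_i \mid do(A)) = p(C_i \mid do(A')) = p(C_i)$, and then that the action raises the probability of $B$ within every $C_i$.

```latex
\begin{aligned}
p(B \mid do(A))
  &= \sum_i p(B \mid do(A), C_i)\, p(C_i \mid do(A)) \\
  &= \sum_i p(B \mid do(A), C_i)\, p(C_i) \\
  &> \sum_i p(B \mid do(A'), C_i)\, p(C_i) \\
  &= \sum_i p(B \mid do(A'), C_i)\, p(C_i \mid do(A')) \\
  &= p(B \mid do(A')).
\end{aligned}
```

The same steps fail for ordinary conditioning, because $p(C_i \mid A)$ and $p(C_i \mid A')$ may differ, and that difference is what permits the arithmetic reversal.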
This is fairly general, and it is hard to imagine another calculus ruling out effect reversal with comparable assertiveness. I dare to conjecture, therefore, that it is $do$-calculus that drives our intuition about cause and effect relations.