In a recent post I discussed Judea Pearl’s work on causal inference, and in particular on the causal calculus. The post contained a number of “problems for the author” (i.e., me, Michael Nielsen). Judea Pearl has been kind enough to reply with some enlightening comments on four of those problems, as well as contributing a stimulating mini-essay on the psychology of Simpson’s paradox. The following post contains both the text of my questions, as well as Judea’s responses, which he’s agreed to make public (thanks also to Kaoru Mulvihill for helping make this possible!). The post covers five topics, which can be read independently: some remarks on the intuition behind causal probabilities, the intuition behind d-separation, the independence of the three rules of the causal calculus, practical considerations in applying the causal calculus, and the psychology of Simpson’s paradox. Over to Judea.

Dear Michael,

Thanks for giving me the opportunity to respond to your thoughtful post on correlation, causation and the causal calculus. I will first answer four of your questions, and then comment on the Simpson’s paradox and its role in understanding causation.

## Question 1: On the intuition behind causal probabilities

Question (MN): We used causal models in our definition of causal conditional probabilities. But our informal definition — imagine a hypothetical world in which it’s possible to force a variable to take a particular value didn’t obviously require the use of a causal model. Indeed, in a real-world randomized controlled experiment it may be that there is no underlying causal model. This leads me to wonder if there is some other way of formalizing the informal definition we have given?

Yes, there is.

But we need to distinguish between “there is no underlying causal model” and “we have no idea what the underlying causal model is.” The first notion leads to chaotic dead-end, because it may mean: “everything can happen,” or “every time I conduct an experiment the world may change,” or “God may play tricks on us” etc. Even the simple idea that randomized experiments tell us something about policy making requires the assumption that there is some order in the universe, that the randomized coin depends on some things and not others, and that the coin affects some things and not others (e.g.,the outcome). So, let us not deal with this total chaos theory.

Instead, let us consider the “no idea” interpretation. It is not hard to encode this state of ignorance in the language of graphs; we simply draw a big cloud of hidden variables, imagine that each of them can influence and be influenced by all the others and we start from there. The calculus that you described nicely in the section “Causal conditional probabilities” now tells us precisely what conclusions we can still draw in this state of ignorance and what we can no longer draw. For example, every conclusion that relies on the identity and values of $pa(X_j)p$ (e.g., $P(X_i|do(x_j))$) will no longer be computable, because most of $X_j$‘s parents will be unmeasured.

Remarkably, there are some conclusions that are still inferrable even with such ignorance, and you mentioned the important ones — conclusions we draw from randomized controlled experiments. Why is that so? Because there are certain things we can assume about the experiment even when we do not have any idea about the causal relationships among variables that are not under our control. For example, we can safely assume that the outcome of the coin is independent on all factors that affect the outcome $Y$, hence those factors will be the same (in probability) for treatment and control group, and will remain invariant after the experiment is over. We also know that we can overrule whatever forces compelled subjects to smoke or not smoke before the experiment and, likewise, we can assume that the coin does not affect the outcome (cancer) directly but only through the treatment (smoking). All these assumptions can be represented in the graph by replacing the former parents of $X$ by a new parent, the coin, while keeping everything else the same. From this we can derive the basic theorem of experimental studies

$P(y|do(x)) = P(y|x)$

which connects policy predictions $P(y|do(x))$ to experimental finding $P(y|x)$.

There is another way of representing black-box ignorance of the kind faced by a researcher who conducts randomized experiment with zero knowledge about what affects what. This approach, called “potential outcome” was pioneered by Neyman in the 1930’s and further developed by Don Rubin since 1974. The idea is to define a latent variable $Y_x$ which stands for the value that $Y$ would attain if we were to manipulate $X$ to value $x$. The target of analysis is to estimate the expected value of the difference $Y_1 - Y_0$ from experimental data, where $Y_x$ is only partially observed. $Y_1$ is known for the treatment group and $Y_0$ for the control group. The analysis within this framework proceeds by assuming that, in a randomized trial, $Y_x$ is independent on $X$, an assumption called “ignorability.” The justification of this assumption demands some stretching of imagination — because we really do not know much about the hypothetical $Y_x$, so how can we judge whether it is dependent or independent of other variables, like $X$. The justification becomes compelling if we go back to the graph and ask ourselves: what could account for the statistical variations of $Y_x$ (i.e., the variations from individual to individual). The answer is that $Y_x$ is none other but all factors that affect $Y$ when $X$ is held constant at $X=x$. With this understanding, it makes sense to assume that $Y_x$ is independent of $X$, since $X$ is determined by the coin and the coin is presumed independent of all factors that affect $Y$.

Here, again, the graph explicates an independence that the formal counterfactual approach takes for granted and thus provides the justification for inferences under CRT.

In a typical application however, we may have only partial knowledge of the model’s structure and, while the black-box approach fails to utilize what we do know, the graphical approach displays what we know and can be used to determine
whether we know enough to infer what we need. More specifically, while the two approaches are logically equivalent, the potential outcome approach requires that we express what we know in terms of independencies among counterfactual variables, a cognitively formidable task, while the graphical approach permit us to express knowledge in the form of missing arrows. The latter is transparent, the former opaque.

## Question 2: the intuition behind d-separation

Question (MN): The concept of $d$-separation plays a central role in the causal calculus. My sense is that it should be possible to find a cleaner and more intuitive definition that substantially simplifies many proofs. It’d be good to spend some time trying to find such a definition.

This was done, indeed, by Lauritzen et al. (1990) and is described on pages 426-427 of the Epilogue of my book. It involves “marrying” the parents of any node which is an ancestor of a variable on which we condition. This method may indeed be more convenient for theorem proving in certain circumstances. However, since the conditioning set varies from problem to problem, the graph too would vary from problem to problem and I find it more comfortable to work with $d$-separation. A gentle, no-tear introduction to $d$-separation is given on page 335 of Causality.

## Question 3: the independence of the rules of the causal calculus

Question (MN): Suppose the conditions of rules 1 and 2 [of the causal calculus] hold. Can we deduce that the conditions of rule 3 also hold?

No, these rules are both “independent” and “complete.” The independence can be seen in any graph in which $X$ does not affect $Z$. Regardless of other conditions, such a graph allows us to invoke rule 3 and conclude

$P(z|do(x)) = P(z).$

This makes sense, manipulating a later variable does not affect the earlier variable. Completeness means that we have not forgotten any rule, every correct sentence (given that the graph is correct) is derivable syntactically using our three rules. (This was proven only recently, in 2006)

Update (MN) In the comments, below, Judea notes that this needs amendment: Ilya Shpitser noted that my answer to Question 3 needs correction. I said “No, these rules are both ‘independent’ and ‘complete.'” In fact, Huang and Valtorta showed that rule 1 is implied by rules 2 and 3, their proof can be found here: (Lemma 4, link).

## Question 4: practical issues in applying the causal calculus

Question (MN): In real-world experiments there are many practical issues that must be addressed to design a realizable randomized, controlled experiment. These issues include selection bias, blinding, and many others. There is an entire field of experimental design devoted to addressing such issues. By comparison, my description of causal inference ignores many of these practical issues. Can we integrate the best thinking on experimental design with ideas such as causal conditional probabilities and the causal calculus?

Of course.

All the issues you described have nice graphical representations in causal graphs and lend themselves to precise formulation, and meaningful solution. Take selection bias, for instance, my student Elias Bareinboim just came up with a definitive criterion that tells us which graphs allow for the removal of selection bias and which do not.
The same can be said about measurement errors, effect decomposition (mediation) and other frills of experimental design. One would be hard pressed today to find a journal of epidemiology without causal diagrams. The last breakthrough I can report is a solution to the century-old problem of “transportability” or “external validity,” that is, under what conditions it is possible to generalize experimental results from one population to another, which may differ in several aspects from the first.

This is truly an incredible breakthrough (see http://ftp.cs.ucla.edu/pub/stat_ser/r372.pdf) and the only reason we have not heard the splash in journals like Science or Nature is that the people who can benefit from it are still living in the Arrow-phobic age, when graphs were considered “too seductive” to have scientific value. Hopefully, some of your readers will help change that.

## The Psychology of Simpson’s Paradox

A full understanding of Simpson’s paradox should explain why an innocent arithmetic reversal of an association, not uncommon, came to be regarded as “paradoxical,” and why it has captured the fascination of statisticians, mathematicians and philosophers for over a century (it was first labelled “paradox” by Blyth (1972)). The arithmetics of proportions has its share of peculiarities, no doubt, but these tend to become objects of curiosity once demonstrated and explained away by vivid examples. For example, naive students of probability expect the average of a product to equal the product of the averages but quickly learn to guard against such expectations, given a few counterexamples. Likewise, students expect an association measured in a mixture distribution to equal a weighted average of the source associations. They are surprised, therefore, when ratios of sums, $\frac{(a+b)}{(c+d)}$, are found to be ordered differently than individual ratios, $\frac{a}{c}$ and $\frac{b}{d}$. Again, such arithmetic peculiarities are quickly accommodated by seasoned students as reminders against simplistic reasoning.

The sign reversal we see in Simpson’s paradox is of different nature because, even after seeing the numbers and checking the arithmetics people are left with a sense of “impossibility.” In other words, the ramifications of reversal seem to clash so intensely with deeply held convictions that people are willing to proclaim the reversal impossible. This is seen particularly vivid in the Kidney Stone treatment example that you mentioned, because it challenges our deeply held beliefs about rational decision making. In this example, it is clear that if one is diagnosed with “Small Stones” or “Large Stones” Treatment $A$ would be preferred to Treatment $B$ But what if a patient is not diagnosed, and the size of the stone is not known; would it be appropriate then to consult the aggregated data, according to which Treatment $B$ is preferred?
This would stand contrary to commonsense; a treatment that is preferred both under one condition and under its negation should also be preferred when the condition is unknown.

Or perhaps we should always prefer the partitioned data over the aggregate? This leads us to another dead alley. What prevents one from partitioning the data into arbitrary sub-categories (say based on eye color or post-treatment pain) each advertising a different treatment preference? Which among the many possible ways of partitioning the data should give us the correct decision?

The last problem, fortunately, was given a complete solution using $d$-separation: A good partition is one that $d$-separates all paths between $X$ and $Y$ that contain
an arrow into $X$. All such partitions will lead to the same decision and the same $P(y|do(x))$. So, here we have an instance where a century-old paradox is reduced to graph theory and resolved by algorithmic routine — a very gratifying experience for mathematicians and computer scientists.

But I want to include psychologists and philosophers of mind in this excitement by asking if they can identify the origin of the principle we invoked earlier: “a treatment that is preferred both under one condition and under its negation should also be preferred when the condition is unknown.” Whence comes this principle? Is it merely
strong intuition or an experimentally verifiable proposition. Can a drug be invented that is beneficial to males and females but harmful to the population as a whole? Whatever its origin, how is this conviction encoded and processed in the mind? The fact that people share this conviction and are willing to bet that a miracle drug with such characteristics is simply impossible tells us that we share a calculus through which the conclusion is derived swiftly and unambiguously. What is this calculus? It is certainly not probability calculus, because we have seen that probability does sanction reversals so there is nothing in probability calculus that would rule out the miracle drug that people deem impossible.

I hate to hold readers in suspense, but I am proposing that the calculus we use in our head is none other but the $do$-calculus described in your post. Indeed the impossibility of such a miracle drug follows from a theorem in $do$-calculus (Pearl, 2009, p. 181): which reads:

An action $A$ that increases the probability of an event $B$ in each subpopulation $C_i$ of $C$ must also increase the probability of $B$ in the population as a whole, provided that the action does not change the distribution of the subpopulations.

Thus, regardless of whether effect size is measured by differences or ratios, regardless of whether $C$ is a confounder or not, and regardless of whether we have the correct causal structure on our hand, our intuition should be offended by any effect reversal due to combining subpopulations.

This is fairly general, and it is hard to imagine another calculus ruling out effect reversal with comparable assertiveness. I dare to conjecture therefore that it is $do$-calculus that drives our intuition about cause and effect relations.

In the last post I described how to use support vector machines (SVMs) to combine multiple notions of search relevance. After posting I realized I could greatly simplify my discussion of an important subject in the final section of that post. What I can simplify is this: once you’ve found the SVM parameters, how should you use them to rank a long list of documents for a particular query, in practice?

Here’s the setup. We suppose we’ve already taken a bunch of training data, as in the last post, and used that training data to find a SVM with soft margin. That SVM can be used to rank any query and document. In particular, the SVM is specified by a vector $\vec w$, and we rank a document $d$ above a document $d'$ for query $q$ if:

$[*] \,\,\,\, \vec w \cdot \vec \phi(q,d,d') \geq 0,$

where $\vec \phi(q,d,d')$ is the feature difference vector. (In the last post a parameter $b$ also appeared in this condition, but as I noted in one of the exercises in that post, for the particular problem of search $b=0$, which is why it doesn’t appear here.)

In the final section of the last post I did lots of perambulations with the condition [*] to figure out how to construct a linear order ranking documents, given a particular query $q$. We can simplify life quite a bit by recalling that $\vec \phi(q,d,d') = \vec \psi(q,d)-\vec \psi(q,d'),$ where $\vec \psi(q,d)$ is the feature vector for query $q$ and document $d$. The condition above can then be rewritten as:

$[**] \,\,\,\, \vec w \cdot \vec \psi(q,d) \geq \vec w \cdot \vec \psi(q,d')$

This condition suggests a much easier way of ranking documents: simply treat $\vec w \cdot \vec \psi(q,d)$ as a score of how good document $d$ is when the query is $q$, so that we rank $d$ above $d'$ whenever the score for $d$ is better than $d'$. We then just construct a linear ordering of the documents, based on the score. Or, if we prefer, we can find just the top 10 (or 20, or whatever) documents by iterating linearly over the documents, and keeping a running tally of the best 10 (or 20) found to date.

This is quite a bit simpler than the discussion at the end of my last post. It only works, though, because our classifer (the SVM) is a linear function of the feature difference vectors. It’s not difficult to construct other types of classifer for which it really wouldn’t be obvious how to construct this kind of score. For those classifiers you could still fall back on the analysis I did in the last post, so it wasn’t all wasted effort.

Incidentally, one thing I like about the equation [**] is that it makes it a lot clearer why $\vec w$ is called the weight vector. It means that the score for page $d$ on query $q$ is just a linear combination of the different feature values, with weights given by the weight vector. That makes the notation $\vec w$ and nomenclature “weight vector” much clearer, as well as making it a lot clearer what the vector $\vec w$ means!

In earlier posts I’ve described two different ways we can assess how relevant a given webpage is to a search query: (1) the cosine similarity measure; and (2) the PageRank, which is a query-independent measure of the importance of a page. While it’s good that we have multiple insights into what makes a webpage relevant, it also gives rise to a problem: how should we combine these two measures to determine the ranking of a particular webpage for a particular search query?

In fact, in practice the problem gets much more complex than this, because there are more than two useful notions of search relevance. For instance, we’d surely also wish to incorporate a relevance measure that quantifies how likely or unlikely a given page is to be spam. By doing that we could ensure that pages which are likely to be spam receive a much lower ranking. We might also wish to incorporate a measure which ranks a page based on how close together the words in the search query are on that page. Once you start to think about it, we humans combine a very large number of factors when assessing the usefulness of a webpage. This is reflected in the fact that, according to Google, their search engine combines not just two or three measures of relevance but more than 200 measures of relevance. How should we best combine all these multiple measures in order to determine how relevant a page is to a given query?

You might think that the right approach would be to think hard about the meaning of measures like cosine similarity and PageRank, and then on the basis of that understanding, to figure out optimal ways of combining those measures. This approach is certainly worth pursuing, but it suffers from a problem: it doesn’t scale very well. Even if you come up with a good way of combining cosine similarity and PageRank, how would you combine 200 different measures? It’s not so obvious. And if you decide to trial the addition of a 201st measure of relevance, how exactly should you incorporate it into your algorithm, and how should you check to see whether or not it improves search results?

In this post, I’ll describe an approach to combining multiple measures of relevance that doesn’t require us to consider the details of the individual measures. Instead, the procedure I describe lets the machine automatically learn how to combine different measures. It does this with the help of a set of training data, where humans have ranked some set of webpages according to their relevance to some set of training queries. The idea is to figure out the best way of combining the measures of relevance in order to reproduce the results of the training data. Whatever method of combination is found is then applied more broadly, to all queries, and all webpages. The big advantage of this machine learning approach is that it lets us easily combine many different notions of search relevance. But it also has some drawbacks, as we’ll see.

The post is based principally on Chapter 15 of the book about information retrieval by Manning, Raghavan, and Sch\”utze. The book is also available for free on the web.

Originally, I intended this post to be a mix of theory and working code to illustrate how the theory works in practice. This is the style I’ve used for many earlier posts, and is the style I intend to use whenever possible. However, I ran into a problem when I attempted to do that for the current post. The problem was that if I wanted to construct interesting examples (and ask interesting questions about those examples), I needed to add a lot of extra context in order for things to make sense. It would have tripled (or more) the length of an already long post. I ultimately decided the extra overhead wasn’t worth the extra insight. Instead, the post focuses on the theory. However, I have included a few pointers to libraries which make it easy to construct your own working code, if you’re so inclined. At some point I expect I’ll come back to this question, in a context where it makes much more sense to include working code.

As usual, I’ll finish the introduction with the caveat that I’m not an expert on any of this. I’m learning as I go, and there may be mistakes or misunderstandings in the post. Still, I hope the post is useful. At some point, I’ll stop adding this caveat to my posts, but I’m still a long way from being expert enough to do that! Also as per usual, the post will contain some simple exercises for the reader, some slightly harder problems, and also some problems for the author, which are things I’d like to understand better.

### General approach

In this section, I’ll work through some simple hypothetical examples, using them to build up a set of heuristics about how to rank webpages. These heuristics will ultimately suggest a complete algorithm for learning how to rank webpages from a set of training data.

One small point about nomenclature: I’m going to switch from talking about “webpages” (or “pages”), and start referring instead to “documents”. In part this is because the document terminology is more standard. But it’s also because the techniques apply more broadly than the web.

To keep things simple and concrete in our hypothetical examples, we’ll assume that for any given query and document we have just two different measures of relevance, let’s say the cosine similarity and the PageRank (or some other similar measure, if we’re working with documents not from the web). We’ll call these features. And so for any given query and document pair $(q,d)$ we have a feature vector:

$\vec \psi(q,d) = \left[ \begin{array}{c} \mbox{PageRank}(q,d) \\ \mbox{cosine similarity}(q,d) \end{array} \right].$

Actually, the PageRank of a document doesn’t depend on the query $q$, but in general we’ll allow the features in the feature vector to depend on the query. Also, in general there will be more than two features, and so the feature vector might have quite a few components. But it turns out that the generalization beyond the two-dimensional case is straightforward, so we’ll stick with two dimensions for now.

Our broad goal can now be restated in the language of feature vectors. What we want is to find an algorithm which combines the different components of the feature vector $\vec \psi(q,d)$ in order to rank the document $d$ for any particular query $q$. Our task is to find an algorithm which does a pretty good job doing this ranking.

To get some insight into how we should solve this problem, let’s suppose we have an extremely simple set of training data. We’ll suppose a human operator has ranked three documents $d_1, d_2, d_3$ for just a single query $q$. Of course, real training data will need to involve many more documents and queries, but we can learn a lot by starting with this. We have three feature vectors for these three cases, $\vec \psi(q,d_1), \vec \psi(q,d_2), \vec \psi(q,d_3)$. I’ve illustrated these in the following diagram – to avoid clutter, rather than write $\vec \psi(q,d_j)$ repeatedly in the diagram, I’ve labelled each vector simply by its document number, $d_j$, as well as (in parentheses) by its order of relevance as ranked by a human operator:

There are a few reasonable observations:

• If a feature vector $\vec \psi(q,d)$ is up and to the right of $\vec \psi(q,d')$ then it’s better along both axes (PageRank and cosine similarity). It seems reasonable to conclude that $d$ is a better result than $d'$.
• Coversely, if $\vec \psi(q,d)$ is down and to the left of $\vec \psi(q,d')$ then $d$ should always be ranked below $d'$.
• The hard cases are when $\vec \psi(q,d)$ is up and to the left of $\vec \psi(q,d')$, or down and to the right, in which case it’s not entirely clear

Note, by the way, that I don’t want to claim that that these observations are “proveable” in any way: they’re just reasonable observations, at least for the particular features (cosine similarity and PageRank) that we’re using. The idea here is simply to figure out some reasonable heuristics which we will eventually combine to suggest an algorithm for learning how to rank webpages.

With the observations above as motivation we’ll adopt the heuristic that it is the vector of feature differences $\vec \phi(q,d,d') := \vec \psi(q,d)-\vec \psi(q,d')$ that determines whether $d$ should be ranked above $d'$, or vice versa. In other words, we’re going to require that the relative ranking of $d$ and $d'$ is a function of the components of $\vec \phi(q,d,d')$, and not of either individual feature vector alone. So the problem now becomes: how should we use $\vec \phi(q,d,d')$ to rank $d$ and $d'$? A clue is provided by looking at the vectors of feature differences for our training data:

I haven’t labelled the vectors explicitly with $q, d$ and $d'$, but it’s pretty easy to figure out which is which, if you look carefully. Instead, I’ve labelled each feature difference vector $\vec \phi(q,d,d')$ with a filled in oval when $d$ is ranked better than $d'$, and with a star when $d'$ is ranked better than $d$. Examining the vectors carefully, we see that the vectors with an oval all “point the same way” as one another, while the vectors with a star also all point the same way as one another, but in the opposite direction. We can make this intuition more precise by saying that the two sets of vectors are separated into two half-spaces by a line:

So one way we could determine whether a feature difference vector is labelled by an oval or a cross is simply by determining which half-space it is in, i.e., by determing which side of the line it’s on. Or to reformulate it in our original terms: we can tell whether a webpage $d$ ranks higher than a webpage $d'$ simply by determining which half-space the feature difference vector $\vec \phi(q,d,d')$ is in.

In higher-dimensional feature spaces this idea generalizes to separating the feature difference vectors into two half-spaces which are on opposite sides of a separating hyperplane. A convenient way of specifying this separating hyperplane, in any number of dimensions, is to introduce a normal vector $\vec w$ to the hyperplane. I’ve illustrated such a normal vector below for two dimensions, but you should imagine that we’re working in higher dimensions:

The condition for a feature difference vector to be in (say) the upper half-space is that $\vec \phi(q,d,d') \cdot \vec w > 0$. To be in the lower half-space the condition is that $\vec \phi(q,d,d') \cdot \vec w < 0[/latex]. Summing everything up, we can check whether the document $d$ should be ranked above or below $d'$ simply by computing the sign of $\vec \phi(q,d,d') \cdot \vec w$. The above observations suggest an algorithm for using the feature difference vectors to determine search relevance. It's a pretty obvious generalization of what I've just described, but at the risk of repeating myself I'll write it all out explicitly. To start, we assume that a set of training data has been provided by human operators. Those operators have been given a set of training queries $q_1,q_2,\ldots,q_m$ and training documents $d_1,d_2,\ldots,d_n$. For each training query they've ranked each training document in order of relevance to that query. They might decide, for example, that for the query $q_1$, the document $d_{17}$ should be the top-ranked query, $d_5$ the second-ranked query, and so on. This training data provides us with a whole lot of feature difference vectors $\vec \phi(q_i,d_j,d_k)$. These vectors can be divided up into two sets. The first set, which we'll call training data set $A$, contains those feature difference vectors for which $d_j$ has been ranked more highly than $d_k$. The second set, which we'll call training data set $B$, contains those feature difference vectors for which $d_k$ has been ranked more highly than $d_j$. We then find a separating hyperplane that separates these two data sets, i.e., with the upper half-space containing data set $A$ (where $d_j$ is ranked more highly than $d_k$ for $q_i$), and a lower half-space containing data set $B$. Suppose now that $q$ is any query – not necessarily a training query. And suppose $d$ and $d'$ are any two documents, not just training documents. We rank $d$ above $d'$ for query $q$ if the feature difference vector $\vec \phi(q,d,d')$ lies in the upper half-space. And we rank $d$ below $d'$ if it lies in the lower half-space. This completes the description of the algorithm. There are many problems with this basic algorithm. One problem becomes evident by going back to our primitive training data and adding an extra document: Inspecting the feature difference vectors reveals a problem: I’ve simplified the picture by showing only the feature difference vectors $\phi(q,d,d')$ in data set $A$, i.e., those for which $d$ ranks more highly than $d'$; data set $B$ contains the negation of those vectors. It should be clear that there is no half-space which can be used to divide up the vectors into two sets. It’s not geometrically possible. The way we’ll deal with this is by modifying the algorithm in such a way as to find a half-space which approximately divides the two sets of feature difference vectors into half-spaces. I’ll explain how to do this approximate division later in the post. Before getting to the approximate division, though, we’ll warm up by figuring out much more explicitly how to do the division into two half-spaces when it is possible. It’s all very well for me to glibly say that we should “figure out which half-space the vector $\phi(q,d,d')$ is in”, but how can we actually do this in practice? In the next section I’ll introduce support vector machines, which are a way of doing this kind of division explicitly. Once we’ve understood the basics of support vector machines, we’ll come back to the question of how to divide two sets of vectors into approximate half-spaces. ### Problems • Suppose the basic algorithm that I’ve described works, i.e., a division into half-spaces is possible. Suppose that for a particular query $q$ the document $d$ is ranked above $d'$ and $d'$ is ranked above $d''$. Prove that the algorithm will rank $d$ above $d''$. ### Support vector machines Support vector machines are a technique for partitioning two sets of vectors (data set $A$ and data set $B$) into half-spaces separated by a separating hyperplane. The technique is guaranteed to work whenever such a partitioning is possible, and whatsmore the partitioning is optimal, in a sense I’ll make precise. In this section I’ll describe briefly how support vector machines work. As an aside, you’ll note that the notion of search (and related topics) wasn’t mentioned anywhere in the last paragraph. That’s because support vector machines aren’t about search. Instead, they’re a general technique for dividing sets of vectors into half-spaces, a technique which can be applied to many different problems in machine learning and artificial intelligence, not just search. So support vector machines are a useful technique to understand, even if you’re not especially interested in search. End of aside. A priori if someone just gives you two sets of vectors, $A$ and $B$, it’s not always going to be clear whether a partitioning into two half-spaces is possible. But let’s assume that such a partitioning is possible, and we’ll return later to what happens when it’s not. What we’d like to do is to find an explicit way of specifying a separating hyperplane: Note that in the last section our separating hyperplanes passed through the origin. But for support vector machines we’ll also allow hyperplanes which don’t pass through the origin. One way to explicitly specify such a hyperplane is as the set of vectors $\vec x$ satisfying the equation: $\vec w \cdot \vec x + b = 0,$ where $\vec w$ is a vector normal to the separating hyperplane, and $b$ is a constant. Together, $\vec w$ and $b$ specify the hyperplane. The goal of the support vector machine is to take data sets $A$ and $B$, and use them to construct values for $\vec w$ and $b$ which specify a separating hyperplane. Whatsmore, we’ll try to do this in such a way that we maximize the size of the “wedge” between the two data sets, where we require that the two edges of the wedge (called support or supporting hyperplanes) be parallel to the separating hyperplane itself. In other words, the goal is to choose the separating hyperplane (i.e., to choose $\vec w$ and $b$) in such a way that these two supporting hyperplanes are as far apart from one another as possible. Let me mention, by the way, that I’m not mad keen on the notations we’ve been using, such as $\vec w$ and $b$. Unfortunately, it’s not going to get any better – we have some $\xi$‘s and $C$‘s in our near future, and I’m sorry to say that the origin of those notations won’t be much more transparent than the origins of $\vec w$ and $b$. I’m using what seems to be standard notation in the support vector machine literature, but it’s not very good notation, in my opinion: it has little or no mnemonic value, and is very non-uniform to boot. While I’m complaining, I’ll mention one more thing: the vector $\vec w$ is sometimes known as the weight vector, again, for reasons which seem obscure. End of rant. Let’s get back to the problem of choosing $\vec w$ and $b$ so that the two supporting hyperplanes are as far apart from one another as possible. To do that, we need to figure out a way of expressing the distance between the two supporting hyperplanes in terms of $\vec w$ and $b$. Without any loss of generality, we’ll suppose that the separating hyperplane is halfway between the two supporting hyperplanes: Let’s fix our attention on one of the two supporting hyperplanes, say, the one that is “higher up” in the picture above. We’ll suppose, also without any loss of generality, that this is the hyperplane supporting data set $A$. By rescaling $\vec w$ and $b$ if necessary, we can ensure that this support hyperplane has the equation $\vec w \cdot \vec x + b = 1,$ and so the constraint that this hyperplane support data set $A$ may be expressed as $\vec w \cdot \vec x + b \geq 1$. This is an important constraint: it’s the constraint on data set $A$. ### Exercises • If you’re not comfortable with hyperplanes you may not immediately see why the kind of rescaling of $\vec w$ and $b$ which I did in the last paragraph is always possible. If this is the case, then write out an explicit proof. We can now determine the distance between the supporting and separating hyperplane by introducing a vector $\Delta$ which is: (a) normal to both hyperplanes, and (b) connects the two hyperplanes. A bit more precisely, $\vec \Delta := \vec x'-\vec x$, where $\vec x'$ is in the supporting hyperplane and $\vec x$ is the projection of $\vec x'$ onto the separating hyperplane: Our goal is to find the length of $\vec \Delta$. Taking the difference of the equations $\vec w \cdot \vec x' + b = 1$ and $\vec w \cdot \vec x + b = 0$, we have $\vec w \cdot \vec \Delta = 1$. Since $\vec \Delta$ and $\vec w$ are both normal to the separating hyperplane, and thus parallel, it follows that $\Delta w = 1$, where I am omitting the $\vec \cdot$ in order to denote length. It follows that the distance from the separating to the supporting hyperplane is $\Delta = 1/w$. We chose the separating hyperplane to be halfway between the two supporting hyperplanes, and so the total size of the wedge is $2\Delta = 2/w$. As a result, maximizing the size of the wedge is equivalent to maximizing $2/w$. This is subject to the constraint on data set $A$, namely that $\vec w \cdot \vec x_j + b \geq 1$ for all the vectors $\vec x_j$ in data set $A$. There’s also a similar constraint for data set $B$, $\vec w \cdot \vec x_j + b \leq -1$ for all the vectors $\vec x_j$ in data set $B$. The supporting hyperplane for data set $B$ satisfies this condition with equality. ### Exercises • Prove that $\vec w \cdot x + b = -1$ is the equation for the supporting hyperplane for data set $B$. There is a reasonably nice way of combining the two sets of constraint inequalities for data sets $A$ and $B$. That’s to label each vector $\vec x_j$ with a value $y_j = +1$ if $\vec x_j$ is in data set $A$, and with $y_j = -1$ if $\vec x_j$ is in data set $B$. With these labels, we can now express the constraints on $\vec w$ and $b$ in a single set of inequalities $y_j (\vec w \cdot \vec x_j + b) \geq 1,$ with equality holding if the vector in question is on a supporting hyperlane (a so-called support vector). This single set of inequalities represents the constraint that $\vec w$ and $b$ really do define separating and supporting hyperplanes for data sets $A$ and $B$, and that the distance between the supporting hyperplanes is $2/w$. Rather than maximize $2/w$ subject to the above constraints, it’s conventional to instead minimize $w^2/2 = \vec w \cdot \vec w /2$, subject to the same constraints. This is obviously equivalent, and the reason we make this change will become clear shortly. Summing up, our problem is to find $\vec w$ and $b$ which minimize the value of: $\frac{w^2}{2}$ subject to the constraints $y_j (\vec w \cdot \vec x_j + b) \geq 1,$ where $y_j = +1$ if $\vec x_j$ is in data set $A$, and $y_j = -1$ if $\vec x_j$ is in data set $B$. This formulation of our problem represents real progress. The reason is that the problem just described is an instance of a very well-known type of problem in mathematics and computer science, called a quadratic programming problem. Although quadratic programs don’t in general have simple, closed-form solutions, they’ve been studied for decades and are well understood, and there are good algorithms and libraries for solving them. It’s because quadratic programs are so well understood that we changed from maximizing $2/w$ to minimizing $w^2/2$ (the $1/2$ is conventional, since it simplifies certain other expressions in the theory of quadratic programming). Indeed, quadratic programs are so well understood that not only are there many libraries for solving quadratic programs, some of those libraries have been specially designed to solve the quadratic programs which arise from support vector machines. Here’s a rather lengthy list of libraries and packages which are designed for support vector machines, and you may wish to experiment with those libraries. Incidentally, one potentially confusing thing about the quadratic programming problem posed above – and one of the reasons I complain about the notation – is to be clear on exactly what is being solved for. The $y_j$ and $\vec x_j$ are known input data, which should be regarded as fixed, and the task is to solve for $\vec w$ and $b$. This is obviously rather different from the traditional use of $x$ and $y$ to denote what we’re solving for, and it’s good to keep this clearly in mind. We can now explain what a support vector machine actually is. We have some problem – like the search ranking problem – which generate sets of training data, $A$ and $B$, in a vector space. The support vector machine will find the parameters $\vec w$ and $b$ for the separating hyperplane, so that the “wedge” between these two data sets is maximized, i.e., the distance between the supporting hyperplanes is maximized. We can then use the parameters $\vec w$ and $b$ to classify new problem instances $\vec x$ as being of type $A$ or type $B$. In particular, we classify $\vec x$ as being of type $A$ if $\vec w \cdot \vec x + b \geq 0$, and of type $B$ if $\vec w \cdot \vec x + b < 0$. A support vector machine is an example of a linear classifier: a decision procedure which makes a classification decision (Is page $d$ ranked higher than page $d'$“) using some linear function of the feature vector describing the problem instance. It’s a binary linear classifier, since the classification is into two types. Such classifiers are useful because they have such generality. Have a bunch of photos, and want to classify which are photos of fish? Spend some time (or money) generating training data, and some more time coming up with a few features that seem like they might be helpful. Then find a support vector machine which can be used to classify more problem instances, and see how well it works! Of course, nothing guarantees that it’ll work well, but with good feature selection it seems that if often really does work well in practice. I won’t go any deeper into the theory or practice of support vector machines in this post. Instead, we’ll return to the question of how to deal with the fact that sometimes it’s not possible to separate the training data into two half-spaces. What we’ll see is that it is possible to do an approximate separation, using the same techniques of quadratic programming. This is known as a support vector machine with a soft margin. Once we’ve understood how that works I’ll come back to what this all means in the context of search. ### Support vector machines with a soft margin In our earlier discussion of search, we saw an example where the feature difference vectors in a set of training data cannot be partitioned into two half-spaces (again, as earlier, this diagram only shows feature difference vectors $\vec \phi(q,d,d')$ where $d$ should be ranked higher than $d'$, i.e., it shows only data set $A$): What can we do in this situation? There is a way of modifying the support vector machine idea to find an approximate way of partitioning two sets of vectors into half-spaces. The idea is to allow some of the vectors to slightly violate the half-space constraints. We compensate for these violations by paying a penalty in the function being minimized. This will ensure that any violations are quite small. The way this idea is implemented is as follows. We introduce some slack variables $\xi_j$, one for each vector $\vec x_j$. The size of the slack variable $\xi_j$ will determine how much violation of the half-space constraint is allowed for the vector $\vec x_j$. In particular, the constraints on the vectors are now relaxed to be $y_j (\vec w \cdot \vec x_j + b) \geq 1-\xi_j.$ We also impose the constraint on the slack variables that $\xi_j \geq 0,$ simply so as to ensure that we’re not over-constraining our vectors by picking a negative value for $\xi_j$. The larger a given $\xi_j$ the more the vector $\vec x_j$ can violate the partitioning into half-spaces. For that reason we introduce a new term in the function we want to minimize, in order to penalize excursions from the half-space: $\frac{w^2}{2} + C \sum_j \xi_j.$ Here $C > 0$ is called the soft margin parameter. If $C$ is large, then we pay a substantial penalty even for small excursions from the half-spaces. On the other hand, if $C$ is small then we pay only a small penalty, and so we can expect larger excursions from the half-spaces. The problem defined in the last paragraph is (again) a quadratic program, and standard algorithms and libraries can be used to solve the program. Also as in the last section, given a solution to the program we can use that solution as a binary linear classifier. In particular, suppose $\vec w$ and $b$ appear in such a solution. Then given a new problem instance $\vec x$ to be classified, we classify it as being of type $A$ if $\vec w \cdot \vec x + b \geq 0$, and of type $B$ if $\vec w \cdot \vec x + b < 0$. An issue that I haven't addressed - and don't yet know how to solve - is how to choose the soft margin parameter, $C$. As a starting point for understanding this choice, let me at least state the following: Theorem: A solution to the quadratic program above always exists. This is a major improvement over the situation without the soft margin, where it's not even guaranteed that a solution will exist. It means that no matter what the training data or soft margin parameter, we can always build a binary linear classifier using the support vector machine with that soft margin. I won't give the proof of the theorem, but the intuition is simple enough: for any choice of $\vec w$ and $b$, by choosing $\xi_j$ sufficiently large we can ensure that $\vec x_j$ satisfies the constraints. A fairly standard continuity and compactness argument should do the trick (modulo some reasonable assumptions, e.g., that we’re working in finite dimensions and with finite data sets). While the theorem above is an encouraging start, it’s not what we really want. What we’d like is to understand how best to choose $C$. Ideally, this means understanding how different choices of $C$ impact the solutions which are found, and being more precise about what sorts of excursions from the half-spaces we are willing to allow, and what impact this has on the quality of classifications. This is, obviously, a challenging theoretical program! However, for the very practical problem of combining different relevancy factors in search, I think a great deal of progress could be made simply by trying out different values of $C$, and testing to see which value empirically gives the best results. Trial-and-error is often better than being smart. ### Problems • Above, I talked about the soft margin parameter $C$ being “large” or “small”, but didn’t explain exactly what I meant. How could we make these notions more precise? • Suppose the data sets have the property that data set $A$ is the negation of data set $B$, i.e., for each vector $\vec x$ in $A$, there is a corresponding vector $-\vec x$ in $B$, and vice versa. (This is the case in the search problem, for example.) Prove that if a value for $b$ appears in a solution to the quadratic program, then there is also a solution to the quadratic program with value $-b$. In fact, it’s possible to prove (though we won’t do so) that the solution to the quadratic program is unique, and so we must have $b=-b$, and therefore $b=0$. It follows that for the search problem we can assume $b=0$ in the solution to the quadratic program. ### Problems for the author • The modifications to the support vector machine model made in this section are quite ad hoc. In particular, choosing the penalty to be $C \sum_j \xi_j$ seems to me to be quite unmotivated. Is there some natural geometric way of deriving this model? Even better, is there a principled way of deriving the model? ### Search revisited Let’s return to the problem of search. As we saw in the last section, we can use training data to build a support vector machine (with soft margin) that, given a query, $q$, will tell us whether a document $d$ is more relevant than $d'$, or vice versa. Of course, this ranking decision won’t in any sense be “provably” optimal – it’s not even clear what that would mean, exactly. But the support vector machine has the advantage that it takes into account all the different relevancy information we have (all the different features), and it does so in a way that reflects the training data. ### Problems • Suppose that the relevancy ranking algorithm just described ranks the document $d$ above $d'$, and $d'$ is ranked above $d''$, for some query $q$. Prove that it will also rank $d$ above $d''$. How can we use this support vector machine in practice? One way is as follows. Suppose we have $n$ documents. A user enters a query $q$. Then because we can easily compare any two documents in the collection, we can use a sort algorithm such as quicksort to produce a rank ordering of the $n$ documents, using $O(n \log n)$ comparisons. This approach isn’t really the best, though. Suppose, for comparison, that you had $n$ documents and wanted to figure out the top $10$ documents (say), as ranked by PageRank. You could do a single pass over all $n$ documents, keeping a running tally of the best $10$ documents found to date. That’d take $O(n)$ running time, which is significantly faster. Of course, we can apply the same idea with our support vector machine. Simply do a single pass over all $n$ documents, maintaining a running tally of the best $10$ documents found to date. More precisely, each time you examine a new document, simply compare it to the current list of the top $10$ documents, using the support vector machine, and if the new document is better than any of those, update the list. Quite aside from being significantly faster than ranking all documents, a major advantage of the running tally method is that it is easily run on a large cluster. Each machine simply computes the top $10$ documents stored on that machine, and then those short lists are merged at a central location. This post has introduced a simple technique which can be used to combine different notions of search relevancy. It’s very much an introduction, and there are a huge number of problems I have not addressed. Perhaps the biggest one is this: how well does this procedure work in practice? That is, does it give a satisfying search experience, one which is significantly improved by the addition of new relevancy factors? I’d love to test this out, but I don’t have good enough trial data (yet) to do a really good test. Maybe someone can point to some real data along these lines. At this point in the original drafting of this post, I began enumerating problems one might want to think about in applying these ideas to improve a real search engine. The list quickly became very long: there’s a lot to think about! So rather than list all those problems, I’ll conclude with just a couple of problems which you may ponder. Enjoy! ### Problems for the author • Suppose we’re running a real-life search engine, and were thinking of introducing an additional feature. How could we use A/B testing to determine whether or not that additional feature does a little or a lot to help improve relevancy ranking? • At the beginning of this post I proposed figuring out (by thinking hard!) a principled way of combining cosine similarity and PageRank to come up with a composite measure of relevance. Find such a principled way of making the combination. I expect that the value in attacking this problem will lie not so much in whatever explicit combination I find, but rather in better understanding how to make such combinations. It may also shed some light on when we expect the machine learning approach to work well, and when we’d expect it to work poorly. Interested in more? Please follow me on Twitter. You may also enjoy reading my new book about open science, Reinventing Discovery. It is a commonplace of scientific discussion that correlation does not imply causation. Business Week recently ran an spoof article pointing out some amusing examples of the dangers of inferring causation from correlation. For example, the article points out that Facebook’s growth has been strongly correlated with the yield on Greek government bonds: (credit) Despite this strong correlation, it would not be wise to conclude that the success of Facebook has somehow caused the current (2009-2012) Greek debt crisis, nor that the Greek debt crisis has caused the adoption of Facebook! Of course, while it’s all very well to piously state that correlation doesn’t imply causation, it does leave us with a conundrum: under what conditions, exactly, can we use experimental data to deduce a causal relationship between two or more variables? The standard scientific answer to this question is that (with some caveats) we can infer causality from a well designed randomized controlled experiment. Unfortunately, while this answer is satisfying in principle and sometimes useful in practice, it’s often impractical or impossible to do a randomized controlled experiment. And so we’re left with the question of whether there are other procedures we can use to infer causality from experimental data. And, given that we can find more general procedures for inferring causal relationships, what does causality mean, anyway, for how we reason about a system? It might seem that the answers to such fundamental questions would have been settled long ago. In fact, they turn out to be surprisingly subtle questions. Over the past few decades, a group of scientists have developed a theory of causal inference intended to address these and other related questions. This theory can be thought of as an algebra or language for reasoning about cause and effect. Many elements of the theory have been laid out in a famous book by one of the main contributors to the theory, Judea Pearl. Although the theory of causal inference is not yet fully formed, and is still undergoing development, what has already been accomplished is interesting and worth understanding. In this post I will describe one small but important part of the theory of causal inference, a causal calculus developed by Pearl. This causal calculus is a set of three simple but powerful algebraic rules which can be used to make inferences about causal relationships. In particular, I’ll explain how the causal calculus can sometimes (but not always!) be used to infer causation from a set of data, even when a randomized controlled experiment is not possible. Also in the post, I’ll describe some of the limits of the causal calculus, and some of my own speculations and questions. The post is a little technically detailed at points. However, the first three sections of the post are non-technical, and I hope will be of broad interest. Throughout the post I’ve included occasional “Problems for the author”, where I describe problems I’d like to solve, or things I’d like to understand better. Feel free to ignore these if you find them distracting, but I hope they’ll give you some sense of what I find interesting about the subject. Incidentally, I’m sure many of these problems have already been solved by others; I’m not claiming that these are all open research problems, although perhaps some are. They’re simply things I’d like to understand better. Also in the post I’ve included some exercises for the reader, and some slightly harder problems for the reader. You may find it informative to work through these exercises and problems. Before diving in, one final caveat: I am not an expert on causal inference, nor on statistics. The reason I wrote this post was to help me internalize the ideas of the causal calculus. Occasionally, one finds a presentation of a technical subject which is beautifully clear and illuminating, a presentation where the author has seen right through the subject, and is able to convey that crystalized understanding to others. That’s a great aspirational goal, but I don’t yet have that understanding of causal inference, and these notes don’t meet that standard. Nonetheless, I hope others will find my notes useful, and that experts will speak up to correct any errors or misapprehensions on my part. ### Simpson’s paradox Let me start by explaining two example problems to illustrate some of the difficulties we run into when making inferences about causality. The first is known as Simpson’s paradox. To explain Simpson’s paradox I’ll use a concrete example based on the passage of the Civil Rights Act in the United States in 1964. In the US House of Representatives, 61 percent of Democrats voted for the Civil Rights Act, while a much higher percentage, 80 percent, of Republicans voted for the Act. You might think that we could conclude from this that being Republican, rather than Democrat, was an important factor in causing someone to vote for the Civil Rights Act. However, the picture changes if we include an additional factor in the analysis, namely, whether a legislator came from a Northern or Southern state. If we include that extra factor, the situation completely reverses, in both the North and the South. Here’s how it breaks down: North: Democrat (94 percent), Republican (85 percent) South: Democrat (7 percent), Republican (0 percent) Yes, you read that right: in both the North and the South, a larger fraction of Democrats than Republicans voted for the Act, despite the fact that overall a larger fraction of Republicans than Democrats voted for the Act. You might wonder how this can possibly be true. I’ll quickly state the raw voting numbers, so you can check that the arithmetic works out, and then I’ll explain why it’s true. You can skip the numbers if you trust my arithmetic. North: Democrat (145/154, 94 percent), Republican (138/162, 85 percent) South: Democrat (7/94, 7 percent), Republican (0/10, 0 percent) Overall: Democrat (152/248, 61 percent), Republican (138/172, 80 percent) One way of understanding what’s going on is to note that a far greater proportion of Democrat (as opposed to Republican) legislators were from the South. In fact, at the time the House had 94 Democrats, and only 10 Republicans. Because of this enormous difference, the very low fraction (7 percent) of southern Democrats voting for the Act dragged down the Democrats’ overall percentage much more than did the even lower fraction (0 percent) of southern Republicans who voted for the Act. (The numbers above are for the House of Congress. The numbers were different in the Senate, but the same overall phenomenon occurred. I’ve taken the numbers from Wikipedia’s article about Simpson’s paradox, and there are more details there.) If we take a naive causal point of view, this result looks like a paradox. As I said above, the overall voting pattern seems to suggest that being Republican, rather than Democrat, was an important causal factor in voting for the Civil Rights Act. Yet if we look at the individual statistics in both the North and the South, then we’d come to the exact opposite conclusion. To state the same result more abstractly, Simpson’s paradox is the fact that the correlation between two variables can actually be reversed when additional factors are considered. So two variables which appear correlated can become anticorrelated when another factor is taken into account. You might wonder if results like those we saw in voting on the Civil Rights Act are simply an unusual fluke. But, in fact, this is not that uncommon. Wikipedia’s page on Simpson’s paradox lists many important and similar real-world examples ranging from understanding whether there is gender-bias in university admissions to which treatment works best for kidney stones. In each case, understanding the causal relationships turns out to be much more complex than one might at first think. I’ll now go through a second example of Simpson’s paradox, the kidney stone treatment example just mentioned, because it helps drive home just how bad our intuitions about statistics and causality are. Imagine you suffer from kidney stones, and your Doctor offers you two choices: treatment A or treatment B. Your Doctor tells you that the two treatments have been tested in a trial, and treatment A was effective for a higher percentage of patients than treatment B. If you’re like most people, at this point you’d say “Well, okay, I’ll go with treatment A”. Here’s the gotcha. Keep in mind that this really happened. Suppose you divide patients in the trial up into those with large kidney stones, and those with small kidney stones. Then even though treatment A was effective for a higher overall percentage of patients than treatment B, treatment B was effective for a higher percentage of patients in both groups, i.e., for both large and small kidney stones. So your Doctor could just as honestly have said “Well, you have large [or small] kidney stones, and treatment B worked for a higher percentage of patients with large [or small] kidney stones than treatment A”. If your Doctor had made either one of these statements, then if you’re like most people you’d have decided to go with treatment B, i.e., the exact opposite treatment. The kidney stone example relies, of course, on the same kind of arithmetic as in the Civil Rights Act voting, and it’s worth stopping to figure out for yourself how the claims I made above could possibly be true. If you’re having trouble, you can click through to the Wikipedia page, which has all the details of the numbers. Now, I’ll confess that before learning about Simpson’s paradox, I would have unhesitatingly done just as I suggested a naive person would. Indeed, even though I’ve now spent quite a bit of time pondering Simpson’s paradox, I’m not entirely sure I wouldn’t still sometimes make the same kind of mistake. I find it more than a little mind-bending that my heuristics about how to behave on the basis of statistical evidence are obviously not just a little wrong, but utterly, horribly wrong. Perhaps I’m alone in having terrible intuition about how to interpret statistics. But frankly I wouldn’t be surprised if most people share my confusion. I often wonder how many people with real decision-making power – politicians, judges, and so on – are making decisions based on statistical studies, and yet they don’t understand even basic things like Simpson’s paradox. Or, to put it another way, they have not the first clue about statistics. Partial evidence may be worse than no evidence if it leads to an illusion of knowledge, and so to overconfidence and certainty where none is justified. It’s better to know that you don’t know. ### Correlation, causation, smoking, and lung cancer As a second example of the difficulties in establishing causality, consider the relationship between cigarette smoking and lung cancer. In 1964 the United States’ Surgeon General issued a report claiming that cigarette smoking causes lung cancer. Unfortunately, according to Pearl the evidence in the report was based primarily on correlations between cigarette smoking and lung cancer. As a result the report came under attack not just by tobacco companies, but also by some of the world’s most prominent statisticians, including the great Ronald Fisher. They claimed that there could be a hidden factor – maybe some kind of genetic factor – which caused both lung cancer and people to want to smoke (i.e., nicotine craving). If that was true, then while smoking and lung cancer would be correlated, the decision to smoke or not smoke would have no impact on whether you got lung cancer. Now, you might scoff at this notion. But derision isn’t a principled argument. And, as the example of Simpson’s paradox showed, determining causality on the basis of correlations is tricky, at best, and can potentially lead to contradictory conclusions. It’d be much better to have a principled way of using data to conclude that the relationship between smoking and lung cancer is not just a correlation, but rather that there truly is a causal relationship. One way of demonstrating this kind of causal connection is to do a randomized, controlled experiment. We suppose there is some experimenter who has the power to intervene with a person, literally forcing them to either smoke (or not) according to the whim of the experimenter. The experimenter takes a large group of people, and randomly divides them into two halves. One half are forced to smoke, while the other half are forced not to smoke. By doing this the experimenter can break the relationship between smoking and any hidden factor causing both smoking and lung cancer. By comparing the cancer rates in the group who were forced to smoke to those who were forced not to smoke, it would then be possible determine whether or not there is truly a causal connection between smoking and lung cancer. This kind of randomized, controlled experiment is highly desirable when it can be done, but experimenters often don’t have this power. In the case of smoking, this kind of experiment would probably be illegal today, and, I suspect, even decades into the past. And even when it’s legal, in many cases it would be impractical, as in the case of the Civil Rights Act, and for many other important political, legal, medical, and econonomic questions. ### Causal models To help address problems like the two example problems just discussed, Pearl introduced a causal calculus. In the remainder of this post, I will explain the rules of the causal calculus, and use them to analyse the smoking-cancer connection. We’ll see that even without doing a randomized controlled experiment it’s possible (with the aid of some reasonable assumptions) to infer what the outcome of a randomized controlled experiment would have been, using only relatively easily accessible experimental data, data that doesn’t require experimental intervention to force people to smoke or not, but which can be obtained from purely observational studies. To state the rules of the causal calculus, we’ll need several background ideas. I’ll explain those ideas over the next three sections of this post. The ideas are causal models (covered in this section), causal conditional probabilities, and d-separation, respectively. It’s a lot to swallow, but the ideas are powerful, and worth taking the time to understand. With these notions under our belts, we’ll able to understand the rules of the causal calculus To understand causal models, consider the following graph of possible causal relationships between smoking, lung cancer, and some unknown hidden factor (say, a hidden genetic factor): This is a quite general model of causal relationships, in the sense that it includes both the suggestion of the US Surgeon General (smoking causes cancer) and also the suggestion of the tobacco companies (a hidden factor causes both smoking and cancer). Indeed, it also allows a third possibility: that perhaps both smoking and some hidden factor contribute to lung cancer. This combined relationship could potentially be quite complex: it could be, for example, that smoking alone actually reduces the chance of lung cancer, but the hidden factor increases the chance of lung cancer so much that someone who smokes would, on average, see an increased probability of lung cancer. This sounds unlikely, but later we’ll see some toy model data which has exactly this property. Of course, the model depicted in the graph above is not the most general possible model of causal relationships in this system; it’s easy to imagine much more complex causal models. But at the very least this is an interesting causal model, since it encompasses both the US Surgeon General and the tobacco company suggestions. I’ll return later to the possibility of more general causal models, but for now we’ll simply keep this model in mind as a concrete example of a causal model. Mathematically speaking, what do the arrows of causality in the diagram above mean? We’ll develop an answer to that question over the next few paragraphs. It helps to start by moving away from the specific smoking-cancer model to allow a causal model to be based on a more general graph indicating possible causal relationships between a number of variables: Each vertex in this causal model has an associated random variable, $X_1,X_2,\ldots$. For example, in the causal model above $X_2$ could be a two-outcome random variable indicating the presence or absence of some gene that exerts an influence on whether someone smokes or gets lung cancer, $X_3$ indicates “smokes” or “does not smoke”, and $X_4$ indicates “gets lung cancer” or “doesn’t get lung cancer”. The other variables $X_1$ and $X_5$ would refer to other potential dependencies in this (somewhat more complex) model of the smoking-cancer connection. A notational convention that we’ll use often is to interchangeably use $X_j$ to refer to a random variable in the causal model, and also as a way of labelling the corresponding vertex in the graph for the causal model. It should be clear from context which is meant. We’ll also sometimes refer interchangeably to the causal model or to the associated graph. For the notion of causality to make sense we need to constrain the class of graphs that can be used in a causal model. Obviously, it’d make no sense to have loops in the graph: We can’t have $X$ causing $Y$ causing $Z$ causing $X$! At least, not without a time machine. Because of this we constrain the graph to be a directed acyclic graph, meaning a (directed) graph which has no loops in it. By the way, I must admit that I’m not a fan of the term directed acyclic graph. It sounds like a very complicated notion, at least to my ear, when what it means is very simple: a graph with no loops. I’d really prefer to call it a “loop-free graph”, or something like that. Unfortunately, the “directed acyclic graph” nomenclature is pretty standard, so we’ll go with it. Our picture so far is that a causal model consists of a directed acyclic graph, whose vertices are labelled by random variables $X_1,X_2,\ldots$. To complete our definition of causal models we need to capture the allowed relationships between those random variables. Intuitively, what causality means is that for any particular $X_j$ the only random variables which directly influence the value of $X_j$ are the parents of $X_j$, i.e., the collection $X_{\mbox{pa}(j)}$ of random variables which are connected directly to $X_j$. For instance, in the graph shown below (which is the same as the complex graph we saw a little earlier), we have $X_{\mbox{pa}(4)} = (X_2,X_3)$: Now, of course, vertices further back in the graph – say, the parents of the parents – could, of course, influence the value of $X_4$. But it would be indirect, an influence mediated through the parent vertices. Note, by the way, that I’ve overloaded the $X$ notation, using $X_{\mbox{pa}(4)}$ to denote a collection of random variables. I’ll use this kind of overloading quite a bit in the rest of this post. In particular, I’ll often use the notation $X$ (or $W$, $Y$ or $Z$) to denote a subset of random variables from the graph. Motivated by the above discussion, one way we could define causal influence would be to require that $X_j$ be a function of its parents: $X_j = f_j(X_{\mbox{pa}(j)}),$ where $f_j(\cdot)$ is some function. In fact, we’ll allow a slightly more general notion of causal influence, allowing $X_j$ to not just be a deterministic function of the parents, but a random function. We do this by requiring that $X_j$ be expressible in the form: $X_j = f_j(X_{\mbox{pa}(j)},Y_{j,1},Y_{j,2},\ldots),$ where $f_j$ is a function, and $Y_{j,\cdot}$ is a collection of random variables such that: (a) the $Y_{j,\cdot}$ are independent of one another for different values of $j$; and (b) for each $j$, $Y_{j,\cdot}$ is independent of all variables $X_k$, except when $X_k$ is $X_j$ itself, or a descendant of $X_j$. The intuition is that the $Y_{j,\cdot}$ are a collection of auxiliary random variables which inject some extra randomness into $X_j$ (and, through $X_j$, its descendants), but which are otherwise independent of the variables in the causal model. Summing up, a causal model consists of a directed acyclic graph, $G$, whose vertices are labelled by random variables, $X_j$, and each $X_j$ is expressible in the form $X_j = f_j(X_{\mbox{pa}(j)},Y_{j,\cdot})$ for some function $f_j$. The $Y_{j,\cdot}$ are independent of one another, and each $Y_{j,\cdot}$ is independent of all variables $X_k$, except when $X_k$ is $X_j$ or a descendant of $X_j$. In practice, we will not work directly with the functions $f_j$ or the auxiliary random variables $Y_{j,\cdot}$. Instead, we’ll work with the following equation, which specifies the causal model’s joint probability distribution as a product of conditional probabilities: $p(x_1,x_2,\ldots) = \prod_j p(x_j | \mbox{pa}(x_j)).$ I won’t prove this equation, but the expression should be plausible, and is pretty easy to prove; I’ve asked you to prove it as an optional exercise below. ### Exercises • Prove the above equation for the joint probability distribution. ### Problems • (Simpson’s paradox in causal models) Consider the causal model of smoking introduced above. Suppose that the hidden factor is a gene which is either switched on or off. If on, it tends to make people both smoke and get lung cancer. Find explicit values for conditional probabilities in the causal model such that $p(\mbox{cancer} | \mbox{smokes}) > p(\mbox{cancer} | \mbox{doesn't smoke})$, and yet if the additional genetic factor is taken into account this relationship is reversed. That is, we have both $p(\mbox{cancer} | \mbox{smokes, gene on}) \,\, \textless \,\, p(\mbox{cancer} | \mbox{does not smoke, gene on})$ and $p(\mbox{cancer} | \mbox{smokes, gene off}) \,\, \textless \,\, p(\mbox{cancer} | \mbox{doesn't smoke, gene off})$. ### Problems for the author • An alternate, equivalent approach to defining causal models is as follows: (1) all root vertices (i.e., vertices with no parents) in the graph are labelled by independent random variables. (2) augment the graph by introducing new vertices corresponding to the $Y_{j,k}$. These new vertices have single outgoing edges, pointing to $X_j$. (3) Require that non-root vertices in the augmented graph be deterministic functions of their parents. The disadvantage of this definition is that it introduces the overhead of dealing with the augmented graph. But the definition also has the advantage of cleanly separating the stochastic and deterministic components, and I wouldn’t be surprised if developing the theory of causal inference from this point of view was stimulating, at the very least, and may possibly have some advantages compared to the standard approach. So the problem I set myself (and anyone else who is interested!) is to carry the consequences of this change through the rest of the theory of causal inference, looking for advantages and disadvantages. I’ve been using terms like “causal influence” somewhat indiscriminately in the discussion above, and so I’d like to pause to discuss a bit more carefully about what is meant here, and what nomenclature we should use going forward. All the arrows in a causal model indicate are the possibility of a direct causal influence. This results in two caveats on how we think about causality in these models. First, it may be that a child random variable is actually completely independent of the value of one (or more) of its parent random variables. This is, admittedly, a rather special case, but is perfectly consistent with the definition. For example, in a causal model like it is possible that the outcome of cancer might be independent of the hidden causal factor or, for that matter, that it might be independent of whether someone smokes or not. (Indeed, logically, at least, it may be independent of both, although of course that’s not what we’ll find in the real world.) The second caveat in how we think about the arrows and causality is that the arrows only capture the direct causal influences in the model. It is possible that in a causal model like $X_1$ will have a causal influence on $X_5$ through its influence on $X_2$ and $X_3$. This would be an indirect causal influence, mediated by other random variables, but it would still be a causal influence. In the next section I’ll give a more formal definition of causal influence that can be used to make these ideas precise. ### Causal conditional probabilities In this section I’ll explain what I think is the most imaginative leap underlying the causal calculus. It’s the introduction of the concept of causal conditional probabilities. The notion of ordinary conditional probabilities is no doubt familiar to you. It’s pretty straightforward to do experiments to estimate conditional probabilities such as $p(\mbox{cancer}| \mbox{smoking})$, simply by looking at the population of people who smoke, and figuring out what fraction of those people develop cancer. Unfortunately, for the purpose of understanding the causal relationship between smoking and cancer, $p(\mbox{cancer}| \mbox{smoking})$ isn’t the quantity we want. As the tobacco companies pointed out, there might well be a hidden genetic factor that makes it very likely that you’ll see cancer in anyone who smokes, but that wouldn’t therefore mean that smoking causes cancer. As we discussed earlier, what you’d really like to do in this circumstance is a randomized controlled experiment in which it’s possible for the experimenter to force someone to smoke (or not smoke), breaking the causal connection between the hidden factor and smoking. In such an experiment you really could see if there was a causal influence by looking at what fraction of people who smoked got cancer. In particular, if that fraction was higher than in the overall population then you’d be justified in concluding that smoking helped cause cancer. In practice, it’s probably not practical to do this kind of randomized controlled experiment. But Pearl had what turns out to be a very clever idea: to imagine a hypothetical world in which it really is possible to force someone to (for example) smoke, or not smoke. In particular, he introduced a conditional causal probability $p(\mbox{cancer}| \mbox{do(smoking)})$, which is the conditional probability of cancer in this hypothetical world. This should be read as the (causal conditional) probability of cancer given that we “do” smoking, i.e., someone has been forced to smoke in a (hypothetical) randomized experiment. Now, at first sight this appears a rather useless thing to do. But what makes it a clever imaginative leap is that although it may be impossible or impractical to do a controlled experiment to determine $p(\mbox{cancer}|\mbox{do}(\mbox{smoking}))$, Pearl was able to establish a set of rules – a causal calculus – that such causal conditional probabilities should obey. And, by making use of this causal calculus, it turns out to sometimes be possible to infer the value of probabilities such as $p(\mbox{cancer}|\mbox{do}(\mbox{smoking}))$, even when a controlled, randomized experiment is impossible. And that’s a very remarkable thing to be able to do, and why I say it was so clever to have introduced the notion of causal conditional probabilities. We’ll discuss the rules of the causal calculus later in this post. For now, though, let’s develop the notion of causal conditional probabilities. Suppose we have a causal model of some phenomenon: Now suppose we introduce an external experimenter who is able to intervene to deliberately set the value of a particular variable $X_j$ to $x_j$. In other words, the experimenter can override the other causal influences on that variable. This is equivalent to having a new causal model: In this new causal model, we’ve represented the experimenter by a new vertex, which has as a child the vertex $X_j$. All other parents of $X_j$ are cut off, i.e., the edges from the parents to $X_j$ are deleted from the graph. In this case that means the edge from $X_2$ to $X_3$ has been deleted. This represents the fact that the experimenter’s intervention overrides the other causal influences. (Note that the edges to the children of $X_j$ are left undisturbed.) In fact, it’s even simpler (and equivalent) to consider a causal model where the parents have been cut off from $X_j$, and no extra vertex added: This model has no vertex explicitly representing the experimenter, but rather the relation $X_j = f_j(X_{{\rm pa}(j},Y_{j,\cdot})$ is replaced by the relation $X_j = x_j$. We will denote this graph by $G_{\overline X_j}$, indicating the graph in which all edges pointing to $X_j$ have been deleted. We will call this a perturbed graph, and the corresponding causal model a perturbed causal model. In the perturbed causal model the only change is to delete the edges to $X_j$, and to replace the relation $X_j = f_j(X_{{\rm pa}(j},Y_{j,\cdot})$ by the relation $X_j = x_j$. Our aim is to use this perturbed causal model to compute the conditional causal probability $p(x_1,\ldots,\hat x_j, \ldots, x_n | \mbox{do}(x_j))$. In this expression, $\hat x_j$ indicates that the $x_j$ term is omitted before the $|$, since the value of $x_j$ is set on the right. By definition, the causal conditional probability $p(x_1,\ldots,\hat x_j, \ldots, x_n | \mbox{do}(x_j))$ is just the value of the probability distribution in the perturbed causal model, $p'(x_1,\ldots,x_n)$. To compute the value of the probability in the perturbed causal model, note that the probability distribution in the original causal model was given by $p(x_1,\ldots,x_n) = \prod_k p(x_k| \mbox{pa}(x_k)),$ where the product on the right is over all vertices in the causal model. This expression remains true for the perturbed causal model, but a single term on the right-hand side changes: the conditional probability for the $x_j$ term. In particular, this term gets changed from $p(x_j| \mbox{pa}(x_j))$ to $1$, since we have fixed the value of $X_j$ to be $x_j$. As a result we have: $p(x_1,\ldots,\hat x_j,\ldots,x_n | \mbox{do}(x_j)) = \frac{p(x_1,\ldots,x_n)}{p(x_j|\mbox{pa}(x_j))}.$ This equation is a fundamental expression, capturing what it means for an experimenter to intervene to set the value of some particular variable in a causal model. It can easily be generalized to a situation where we partition the variables into two sets, $X$ and $Y$, where $X$ are the variables we suppose have been set by intervention in a (possibly hypothetical) randomized controlled experiment, and $Y$ are the remaining variables: $[1] \,\,\,\, p(Y=y| \mbox{do}(X=x)) = \frac{p(X=x,Y=y)}{\Pi_j p(X_j = x_j| \mbox{pa}(X_j))}.$ Note that on the right-hand side the values for $\mbox{pa}(X_j)$ are assumed to be given by the appropriate values from $x$ and $y$. The expression [1] can be viewed as a definition of causal conditional probabilities. But although this expression is fundamental to understanding the causal calculus, it is not always useful in practice. The problem is that the values of some of the variables on the right-hand side may not be known, and cannot be determined by experiment. Consider, for example, the case of smoking and cancer. Recall our causal model: What we’d like is to compute $p(\mbox{cancer}| \mbox{do(smoking)})$. Unfortunately, we immediately run into a problem if we try to use the expression on the right of equation [1]: we’ve got no way of estimating the conditional probabilities for smoking given the hidden common factor. So we can’t obviously compute $p(\mbox{cancer}| \mbox{do(smoking)})$. And, as you can perhaps imagine, this is the kind of problem that will come up a lot whenever we’re worried about the possible influence of some hidden factor. All is not lost, however. Just because we can’t compute the expression on the right of [1] directly doesn’t mean we can’t compute causal conditional probabilities in other ways, and we’ll see below how the causal calculus can help solve this kind of problem. It’s not a complete solution – we shall see that it doesn’t always make it possible to compute causal conditional probabilities. But it does help. In particular, we’ll see that although it’s not possible to compute $p(\mbox{cancer}| \mbox{do(smoking)})$ for this causal model, it is possible to compute $p(\mbox{cancer}| \mbox{do(smoking)})$ in a very similar causal model, one that still has a hidden factor. With causal conditional probabilities defined, we’re now in position to define more precisely what we mean by causal influence. Suppose we have a causal model, and $X$ and $Y$ are distinct random variables (or disjoint subsets of random variables). Then we say $X$ has a causal influence over $Y$ if there are values $x_1$ and $x_2$ of $X$ and $y$ of $y$ such that $p(y|\mbox{do}(x_1)) \neq p(y|\mbox{do}(x_2))$. In other words, an external experimenter who can intervene to change the value of $X$ can cause a corresponding change in the distribution of values at $Y$. The following exercise gives an information-theoretic justification for this definition of causal influence: it shows that an experimenter who can intervene to set $X$ can transmit information to $Y$ if and only if the above condition for causal inference is met. ### Exercises • (The causal capacity) This exercise is for people with some background in information theory. Suppose we define the causal capacity between $X$ and $Y$ to be $\max_{p(\hat x)} H(\hat X: \hat Y)$, where $H(\cdot:\cdot)$ is the mutual information, the maximization is over possible distributions $p(\hat x)$ for $\hat X$ (we use the hat to indicate that the value of $X$ is being set by intervention), and $\hat Y$ is the corresponding random variable at $Y$, with distribution $p(\hat y) = \sum_{\hat x} p(\hat y|\mbox{do}(\hat x)) p(\hat x)$. Shannon’s noisy channel coding theorem tells us that an external experimenter who can intervene to set the value of $X$ can transmit information to an observer at $Y$ at a maximal rate set by the causal capacity. Show that the causal capacity is greater than zero if and only if $X$ has a causal influence over $Y$. We’ve just defined a notion of causal influence between two random variables in a causal model. What about when we say something like “Event A” causes “Event B”? What does this mean? Returning to the smoking-cancer example, it seems that we would say that smoking causes cancer provided $p(\mbox{cancer} | \mbox{do}(\mbox{smoking})) > p(\mbox{cancer})$, so that if someone makes the choice to smoke, uninfluenced by other causal factors, then they would increase their chance of cancer. Intuitively, it seems to me that this notion of events causing one another should be related to the notion of causal influence just defined above. But I don’t yet see quite how to do that. The first problem below suggests a conjecture in this direction: ### Problems for the author • Suppose $X$ and $Y$ are random variables in a causal model such that $p(Y=y | \mbox{do}(X=x)) > p(Y=y)$ for some pair of values $x$ and $y$. Does this imply that $X$ exerts a causal influence on $Y$? • (Sum-over-paths for causal conditional probabilities?) I believe a kind of sum-over-paths formulation of causal conditional probabilities is possible, but haven’t worked out details. The idea is as follows (the details may be quite wrong, but I believe something along these lines should work). Supose $X$ and $Y$ are single vertices (with corresponding random variables) in a causal model. Then I would like to show first that if $X$ is not an ancestor of $Y$ then $p(y|\mbox{do}(x)) = p(y)$, i.e., intervention does nothing. Second, if $X$ is an ancestor of $Y$ then $p(y|\mbox{do}(x))$ may be obtained by summing over all directed paths from $X$ to $Y$ in $G_{\overline X}$, and computing for each path a contribution to the sum which is a product of conditional probabilities along the path. (Note that we may need to consider the same path multiple times in the sum, since the random variables along the path may take different values). • We used causal models in our definition of causal conditional probabilities. But our informal definiton – imagine a hypothetical world in which it’s possible to force a variable to take a particular value – didn’t obviously require the use of a causal model. Indeed, in a real-world randomized controlled experiment it may be that there is no underlying causal model. This leads me to wonder if there is some other way of formalizing the informal definition we’ve given? • Another way of framing the last problem is that I’m concerned about the empirical basis for causal models. How should we go about constructing such models? Are they fundamental, representing true facts about the world, or are they modelling conveniences? (This is by no means a dichotomy.) It would be useful to work through many more examples, considering carefully the origin of the functions $f_j(\cdot)$ and of the auxiliary random variables $Y_{j,\cdot}$. ### d-separation In this section we’ll develop a criterion that Pearl calls directional separation (d-separation, for short). What d-separation does is let us inspect the graph of a causal model and conclude that a random variable $X$ in the model can’t tell us anything about the value of another random variable $Y$ in the model, or vice versa. To understand d-separation we’ll start with a simple case, and then work through increasingly complex cases, building up our intuition. I’ll conclude by giving a precise definition of d-separation, and by explaining how d-separation relates to the concept of conditional independence of random variables. Here’s the first simple causal model: Clearly, knowing $X$ can in general tell us something about $Y$ in this kind of causal model, and so in this case $X$ and $Y$ are not d-separated. We’ll use the term d-connected as a synonym for “not d-separated”, and so in this causal model $X$ and $Y$ are d-connected. By contrast, in the following causal model $X$ and $Y$ don’t give us any information about each other, and so they are d-separated: A useful piece of terminology is to say that a vertex like the middle vertex in this model is a collider for the path from $X$ to $Y$, meaning a vertex at which both edges along the path are incoming. What about the causal model: In this case, it is possible that knowing $X$ will tell us something about $Y$, because of their common ancestry. It’s like the way knowing the genome for one sibling can give us information about the genome of another sibling, since similarities between the genomes can be inferred from the common ancestry. We’ll call a vertex like the middle vertex in this model a fork for the path from $X$ to $Y$, meaning a vertex at which both edges are outgoing. ### Exercises • Construct an explicit causal model demonstrating the assertion of the last paragraph. For example, you may construct a causal model in which $X$ and $Y$ are joined by a fork, and where $Y$ is actually a function of $X$. • Suppose we have a path from $X$ to $Y$ in a causal model. Let $c$ be the number of colliders along the path, and let $f$ be the number of forks along the path. Show that $|f-c|$ can only take the values $0, 1$ or $-1$, i.e., the number of forks and colliders is either the same or differs by at most one. We’ll say that a path (of any length) from $X$ to $Y$ that contains a collider is a blocked path. By contrast, a path that contains no colliders is called an unblocked path. (Note that by the above exercise, an unblocked path must contain either one or no forks.) In general, we define $X$ and $Y$ to be d-connected if there is an unblocked path between them. We define them to be d-separated if there is no such unblocked path. It’s worth noting that the concepts of d-separation and d-connectedness depend only on the graph topology and on which vertices $X$ and $Y$ have been chosen. In particular, they don’t depend on the nature of the random variables $X$ and $Y$, merely on the identity of the corresponding vertices. As a result, you can determine d-separation or d-connectdness simply by inspecting the graph. This fact – that d-separation and d-connectdness are determined by the graph – also holds for the more sophisticated notions of d-separation and d-connectedness we develop below. With that said, it probably won’t surprise you to learn that the concept of d-separation is closely related to whether or not the random variables $X$ and $Y$ are independent of one another. This is a connection you can (optionally) develop through the following exercises. I’ll state a much more general connection below. ### Exercises • Suppose that $X$ and $Y$ are d-separated. Show that $X$ and $Y$ are independent random variables, i.e., that $p(x,y) = p(x)p(y)$. • Suppose we have two vertices which are d-connected in a graph $G$. Explain how to construct a causal model on that graph such that the random variables $X$ and $Y$ corresponding to those two vertices are not independent. • The last two exercises almost but don’t quite claim that random variables $X$ and $Y$ in a causal model are independent if and only if they are d-separated. Why does this statement fail to be true? How can you modify the statement to make it true? So far, this is pretty simple stuff. It gets more complicated, however, when we extend the notion of d-separation to cases where we are conditioning on already knowing the value of one or more random variables in the causal model. Consider, for example, the graph: (Figure A.) Now, if we know $Z$, then knowing $X$ doesn’t give us any additional information about $Y$, since by our original definition of a causal model $Y$ is already a function of $Z$ and some auxiliary random variables which are independent of $X$. So it makes sense to say that $Z$ blocks this path from $X$ to $Y$, even though in the unconditioned case this path would not have been considered blocked. We’ll also say that $X$ and $Y$ are d-separated, given $Z$. It is helpful to give a name to vertices like the middle vertex in Figure A, i.e., to vertices with one ingoing and one outgoing edge. We’ll call such vertices a traverse along the path from $X$ to $Y$. Using this language, the lesson of the above discussion is that if $Z$ is in a traverse along a path from $X$ to $Y$, then the path is blocked. By contrast, consider this model: In this case, knowing $X$ will in general give us additional information about $Y$, even if we know $Z$. This is because while $Z$ blocks one path from $X$ to $Y$ there is another unblocked path from $X$ to $Y$. And so we say that $X$ and $Y$ are d-connected, given $Z$. Another case similar to Figure A is the model with a fork: Again, if we know $Z$, then knowing $X$ as well doesn’t give us any extra information about $Y$ (or vice versa). So we’ll say that in this case $Z$ is blocking the path from $X$ to $Y$, even though in the unconditioned case this path would not have been considered blocked. Again, in this example $X$ and $Y$ are d-separated, given $Z$. The lesson of this model is that if $Z$ is located at a fork along a path from $X$ to $Y$, then the path is blocked. A subtlety arises when we consider a collider: (Figure B.) In the unconditioned case this would have been considered a blocked path. And, naively, it seems as though this should still be the case: at first sight (at least according to my intuition) it doesn’t seem very likely that $X$ can give us any additional information about $Y$ (or vice versa), even given that $Z$ is known. Yet we should be cautious, because the argument we made for the graph in Figure A breaks down: we can’t say, as we did for Figure A, that $Y$ is a function of $Z$ and some auxiliary independent random variables. In fact, we’re wise to be cautious because $X$ and $Y$ really can tell us something extra about one another, given a knowledge of $Z$. This is a phenomenon which Pearl calls Berkson’s paradox. He gives the example of a graduate school in music which will admit a student (a possibility encoded in the value of $Z$) if either they have high undergraduate grades (encoded in $X$) or some other evidence that they are exceptionally gifted at music (encoded in $Y$). It would not be surprising if these two attributes were anticorrelated amongst students in the program, e.g., students who were admitted on the basis of exceptional gifts would be more likely than otherwise to have low grades. And so in this case knowledge of $Y$ (exceptional gifts) would give us knowledge of $X$ (likely to have low grades), conditioned on knowledge of $Z$ (they were accepted into the program). Another way of seeing Berkson’s paradox is to construct an explicit causal model for the graph in Figure B. Consider, for example, a causal model in which $X$ and $Y$ are independent random bits, $0$ or $1$, chosen with equal probabilities $1/2$. We suppose that $Z = X \oplus Y$, where $\oplus$ is addition modulo $2$. This causal model does, indeed, have the structure of Figure B. But given that we know the value $Z$, knowing the value of $X$ tells us everything about $Y$, since $Y = Z \oplus X$. As a result of this discussion, in the causal graph of Figure B we’ll say that $Z$ unblocks the path from $X$ to $Y$, even though in the unconditioned case the path would have been considered blocked. And we’ll also say that in this causal graph $X$ and $Y$ are d-connected, conditional on $Z$. The immediate lesson from the graph of Figure B is that $X$ and $Y$ can tell us something about one another, given $Z$, if there is a path between $X$ and $Y$ where the only collider is at $Z$. In fact, the same phenomenon can occur even in this graph: (Figure C.) To see this, suppose we choose $X$ and $Y$ as in the example just described above, i.e., independent random bits, $0$ or $1$, chosen with equal probabilities $1/2$. We will let the unlabelled vertex be $W = X \oplus Y$. And, finally, we choose $Z = W$. Then we see as before that $X$ can tell us something about $Y$, given that we know $Z$, because $X = Y \oplus Z$. The general intuition about graphs like that in Figure C is that knowing $Z$ allows us to infer something about the ancestors of $Z$, and so we must act as though those ancestors are known, too. As a result, in this case we say that $Z$ unblocks the path from $X$ to $Y$, since $Z$ has an ancestor which is a collider on the path from $X$ to $Y$. And so in this case $X$ is d-connected to $Y$, given $Z$. Given the discussion of Figure C that we’ve just had, you might wonder why forks or traverses which are ancestors of $Z$ can’t block a path, for similar reasons? For instance, why don’t we consider $X$ and $Y$ to be d-separated, given $Z$, in the following graph: The reason, of course, is that it’s easy to construct examples where $X$ tells us something about $Y$ in addition to what we already know from $Z$. And so we can’t consider $X$ and $Y$ to be d-separated, given $Z$, in this example. These examples motivate the following definition: Definition: Let $X$, $Y$ and $Z$ be disjoint subsets of vertices in a causal model. Consider a path from a vertex in $X$ to a vertex in $Y$. We say the path is blocked by $Z$ if the path contains either: (a) a collider which is not an ancestor of $Z$, or (b) a fork which is in $Z$, or (c) a traverse which is in $Z$. We say the path is unblocked if it is not blocked. We say that $X$ and $Y$ are d-connected, given $Z$, if there is an unblocked path between some vertex in $X$ and some vertex in $Y$. $X$ and $Y$ are d-separated, given $Z$, if they are not d-connected. Saying “$X$ and $Y$ are d-separated given $Z$” is a bit of a mouthful, and so it’s helpful to have an abbreviated notation. We’ll use the abbreviation $(X \perp Y|Z)_G$. Note that this notation includes the graph $G$; we’ll sometimes omit the graph when the context is clear. We’ll write $(X \perp Y)_G$ to denote unconditional d-separation. As an aside, Pearl uses a similar but slightly different notation for d-separation, namely $(X \perp \! \! \perp Y|Z)_G$. Unfortunately, while the symbol $\perp \! \! \perp$ looks like a LaTeX symbol, it’s not, but is most easily produced using a rather dodgy LaTeX hack. Instead of using that hack over and over again, I’ve adopted a more standard LaTeX notation. While I’m making asides, let me make a second: when I was first learning this material, I found the “d” for “directional” in d-separation and d-connected rather confusing. It suggested to me that the key thing was having a directed path from one vertex to the other, and that the complexities of colliders, forks, and so on were a sideshow. Of course, they’re not, they’re central to the whole discussion. For this reason, when I was writing these notes I considered changing the terminology to i-separated and i-connected, for informationally-separated and informationally-connected. Ultimately I decided not to do this, but I thought mentioning the issue might be helpful, in part to reassure readers (like me) who thought the “d” seemed a little mysterious. Okay, that’s enough asides, let’s get back to the main track of discussion. We saw earlier that (unconditional) d-separation is closely connected to the independence of random variables. It probably won’t surprise you to learn that conditional d-separation is closely connected to conditional independence of random variables. Recall that two sets of random variables $X$ and $Y$ are conditionally independent, given a third set of random variables $Z$, if $p(x,y|z) = p(x|z)p(y|z)$. The following theorem shows that d-separation gives a criterion for when conditional independence occurs in a causal model: Theorem (graphical criterion for conditional independence): Let $G$ be a graph, and let $X$, $Y$ and $Z$ be disjoint subsets of vertices in that graph. Then $X$ and $Y$ are d-separated, given $Z$, if and only if for all causal models on $G$ the random variables corresponding to $X$ and $Y$ are conditionally independent, given $Z$. (Update: Thanks to Rob Spekkens for pointing out an error in my original statement of this theorem.) I won’t prove the theorem here. However, it’s not especially difficult if you’ve followed the discussion above, and is a good problem to work through: ### Problems • Prove the above theorem. ### Problems for the author • The concept of d-separation plays a central role in the causal calculus. My sense is that it should be possible to find a cleaner and more intuitive definition that substantially simplifies many proofs. It’d be good to spend some time trying to find such a definition. ### The causal calculus We’ve now got all the concepts we need to state the rules of the causal calculus. There are three rules. The rules look complicated at first, although they’re easy to use once you get familiar with them. For this reason I’ll start by explaining the intuition behind the first rule, and how you should think about that rule. Having understood how to think about the first rule it’s easy to get the hang of all three rules, and so after that I’ll just outright state all three rules. In what follows, we have a causal model on a graph $G$, and $W, X, Y, Z$ are disjoint subsets of the variables in the causal model. Recall also that $G_{\overline X}$ denotes the perturbed graph in which all edges pointing to $X$ from the parents of $X$ have been deleted. This is the graph which results when an experimenter intervenes to set the value of $X$, overriding other causal influences on $X$. Rule 1: When can we ignore observations: I’ll begin by stating the first rule in all its glory, but don’t worry if you don’t immediately grok the whole rule. Instead, just take a look, and try to start getting your head around it. What we’ll do then is look at some simple special cases, which are easily understood, and gradually build up to an understanding of what the full rule is saying. Okay, so here’s the first rule of the causal calculus. What it tells us is that when $(Y \perp Z|W,X)_{G_{\overline X}}$, then we can ignore the observation of $Z$ in computing the probability of $Y$, conditional on both $W$ and an intervention to set $X$: $p(y|w,\mbox{do}(x),z) = p(y|w,\mbox{do}(x))$ To understand why this rule is true, and what it means, let’s start with a much simpler case. Let’s look at what happens to the rule when there are no $X$ or $W$ variables in the mix. In this case, our starting assumption simply becomes that $Y$ is d-separated from $Z$ in the original (unperturbed) graph $G$. There’s no need to worry about $G_{\overline X}$ because there’s no $X$ variable whose value is being set by intervention. In this circumstance we have $(Y \perp Z)_G$, so $Y$ is independent of $Z$. But the statement of the rule in this case is merely that $p(y|z) = p(y)$, which is, indeed, equivalent to the standard definition of $Y$ and $Z$ being independent. In other words, the first rule is simply a generalization of what it means for $Y$ and $Z$ to be independent. The full rule generalizes the notion of independence in two ways: (1) by adding in an extra variable $W$ whose value has been determined by passive observation; and (2) by adding in an extra variable $X$ whose value has been set by intervention. We’ll consider these two ways of generalizing separately in the next two paragraphs. We begin with generalization (1), i.e., there is no $X$ variable in the mix. In this case, our starting assumption becomes that $Y$ is d-separated from $Z$, given $W$, in the graph $G$. By the graphical criterion for conditional independence discussed in the last section this means that $Y$ is conditionally independent of $Z$, given $W$, and so $p(y|z,w) = p(y|w)$, which is exactly the statement of the rule. And so the first rule can be viewed as a generalization of what it means for $Y$ and $Z$ to be independent, conditional on $W$. Now let’s look at the other generalization, (2), in which we’ve added an extra variable $X$ whose value has been set by intervention, and where there is no $W$ variable in the mix. In this case, our starting assumption becomes that $Y$ is d-separated from $Z$, given $X$, in the perturbed graph $G_{\overline X}$. In this case, the graphical criterion for conditional indepenence tells us that $Y$ is independent from $Z$, conditional on the value of $X$ being set by experimental intervention, and so $p(y|\mbox{do}(x),z) = p(y|\mbox{do}(x))$. Again, this is exactly the statement of the rule. The full rule, of course, merely combines both these generalizations in the obvious way. It is really just an explicit statement of the content of the graphical criterion for conditional independence, in a context where $W$ has been observed, and the value of $X$ set by experimental intervention. The rules of the causal calculus: All three rules of the causal calculus follow a similar template to the first rule: they provide ways of using facts about the causal structure (notably, d-separation) to make inferences about conditional causal probabilities. I’ll now state all three rules. The intuition behind rules 2 and 3 won’t necessarily be entirely obvious, but after our discussion of rule 1 the remaining rules should at least appear plausible and comprehensible. I’ll have bit more to say about intuition below. As above, we have a causal model on a graph $G$, and $W, X, Y, Z$ are disjoint subsets of the variables in the causal model. $G_{\overline X}$ denotes the perturbed graph in which all edges pointing to $X$ from the parents of $X$ have been deleted. $G_{\underline X}$ denotes the graph in which all edges pointing out from $X$ to the children of $X$ have been deleted. We will also freely use notations like $G_{\overline W, \overline X, \underline Z}$ to denote combinations of these operations. Rule 1: When can we ignore observations: Suppose $(Y \perp Z|W,X)_{G_{\overline X}}$. Then: $p(y|w,\mbox{do}(x),z) = p(y|w,\mbox{do}(x)).$ Rule 2: When can we ignore the act of intervention: Suppose $(Y \perp Z|W,X)_{G_{\overline X,\underline Z}}$. Then: $p(y|w,\mbox{do}(x),\mbox{do}(z)) = p(y|w,\mbox{do}(x),z).$ Rule 3: When can we ignore an intervention variable entirely: Let $Z(W)$ denote the set of nodes in $Z$ which are not ancestors of $W$. Suppose $(Y \perp Z|W,X)_{G_{\overline X, \overline{Z(W)}}}$. Then: $p(y|w,\mbox{do}(x),\mbox{do}(z)) = p(y|w,\mbox{do}(x)).$ In a sense, all three rules are statements of conditional independence. The first rule tells us when we can ignore an observation. The second rule tells us when we can ignore the act of intervention (although that doesn’t necessarily mean we can ignore the value of the variable being intervened with). And the third rule tells us when we can ignore an intervention entirely, both the act of intervention, and the value of the variable being intervened with. I won’t prove rule 2 or rule 3 – this post is already quite long enough. (If I ever significantly revise the post I may include the proofs). The important thing to take away from these rules is that they give us conditions on the structure of causal models so that we know when we can ignore observations, acts of intervention, or even entire variables that have been intervened with. This is obviously a powerful set of tools to be working with in manipulating conditional causal probabilities! Indeed, according to Pearl there’s even a sense in which this set of rules is complete, meaning that using these rules you can identify all causal effects in a causal model. I haven’t yet understood the proof of this result, or even exactly what it means, but thought I’d mention it. The proof is in papers by Shpitser and Pearl and Huang and Valtorta. If you’d like to see the proofs of the rules of the calculus, you can either have a go at proving them yourself, or you can read the proof. ### Problems for the author • Suppose the conditions of rules 1 and 2 hold. Can we deduce that the conditions of rule 3 also hold? ### Using the causal calculus to analyse the smoking-lung cancer connection We’ll now use the causal calculus to analyse the connection between smoking and lung cancer. Earlier, I introduced a simple causal model of this connection: The great benefit of this model was that it included as special cases both the hypothesis that smoking causes cancer and the hypothesis that some hidden causal factor was responsible for both smoking and cancer. It turns out, unfortunately, that the causal calculus doesn’t help us analyse this model. I’ll explain why that’s the case below. However, rather than worrying about this, at this stage it’s more instructive to work through an example showing how the causal calculus can be helpful in analysing a similar but slightly modified causal model. So although this modification looks a little mysterious at first, for now I hope you’ll be willing to accept it as given. The way I’m going to modify the causal model is by introducing an extra variable, namely, whether someone has appreciable amounts of tar in their lungs or not: (By tar, I don’t mean “tar” literally, but rather all the material deposits found as a result of smoking.) This causal model is a plausible modification of the original causal model. It is at least plausible to suppose that smoking causes tar in the lungs and that those deposits in turn cause cancer. But if the hidden causal factor is genetic, as the tobacco companies argued was the case, then it seems highly unlikely that the genetic factor caused tar in the lungs, except by the indirect route of causing those people to smoke. (I’ll come back to what happens if you refuse to accept this line of reasoning. For now, just go with it.) Our goal in this modified causal model is to compute probabilities like $p(\mbox{cancer}|\mbox{do}(smoking)) = p(y| \mbox{do}(x))$. What we’ll show is that the causal calculus lets us compute this probability entirely in terms of probabilities like $p(y|z), p(z|y)$ and other probabilities that don’t involve an intervention, i.e., that don’t involve $\mbox{do}$. This means that we can determine $p(\mbox{cancer}|\mbox{do}(smoking))$ without needing to know anything about the hidden factor. We won’t even need to know the nature of the hidden factor. It also means that we can determine $p(\mbox{cancer}|\mbox{do}(smoking))$ without needing to intervene to force someone to smoke or not smoke, i.e., to set the value for $X$. In other words, the causal calculus lets us do something that seems almost miraculous: we can figure out the probability that someone would get cancer given that they are in the smoking group in a randomized controlled experiment, without needing to do the randomized controlled experiment. And this is true even though there may be a hidden causal factor underlying both smoking and cancer. Okay, so how do we compute $p(\mbox{cancer}|\mbox{do}(smoking)) = p(y| \mbox{do}(x))$? The obvious first question to ask is whether we can apply rule 2 or rule 3 directly to the conditional causal probability $p(y|\mbox{do}(x))$. If rule 2 applies, for example, it would say that intervention doesn’t matter, and so $p(y|\mbox{do}(x)) = p(y|x)$. Intuitively, this seems unlikely. We’d expect that intervention really can change the probability of cancer given smoking, because intervention would override the hidden causal factor. If rule 3 applies, it would say that $p(y|\mbox{do}(x)) = p(y)$, i.e., that an intervention to force someone to smoke has no impact on whether they get cancer. This seems even more unlikely than rule 2 applying. However, as practice and a warm up, let’s work through the details of seeing whether rule 2 or rule 3 can be applied directly to $p(y|\mbox{do}(x))$. For rule 2 to apply we need $(Y\perp X)_{G_{\underline X}}$. To check whether this is true, recall that $G_{\underline X}$ is the graph with the edges pointing out from $X$ deleted: Obviously, $Y$ is not d-separated from $X$ in this graph, since $X$ and $Y$ have a common ancestor. This reflects the fact that the hidden causal factor indeed does influence both $X$ and $Y$. So we can’t apply rule 2. What about rule 3? For this to apply we’d need $(Y \perp X)_{G_{\overline X}}$. Recall that $G_{\overline X}$ is the graph with the edges pointing toward $X$ deleted: Again, $Y$ is not d-separated from $X$, in this case because we have an unblocked path directly from $X$ to $Y$. This reflects our intuition that the value of $X$ can influence $Y$, even when the value of $X$ has been set by intervention. So we can’t apply rule 3. Okay, so we can’t apply the rules of the causal calculus directly to determine $p(y|\mbox{x})$. Is there some indirect way we can determine this probability? An experienced probabilist would at this point instinctively wonder whether it would help to condition on the value of $z$, writing: $[2] \,\,\,\, p(y| \mbox{do}(x)) = \sum_z p(y|z,\mbox{do}(x)) p(z|\mbox{do}(x)).$ Of course, saying an experienced probabilist would instinctively do this isn’t quite the same as explaining why one should do this! However, it is at least a moderately obvious thing to do: the only extra information we potentially have in the problem is $z$, and so it’s certainly somewhat natural to try to introduce that variable into the problem. As we shall see, this turns out to be a wise thing to do. ### Exercises • I used without proof the equation $p(y| \mbox{do}(x)) = \sum_z p(y|z,\mbox{do}(x)) p(z|\mbox{do}(x))$. This should be intuitively plausible, but really requires proof. Prove that the equation is correct. To simplify the right-hand side of equation [2], we first note that we can apply rule 2 to the second term on the right-hand side, obtaining $p(z|\mbox{do}(x)) = p(z|x)$. To check this explicitly, note that the condition for rule 2 to apply is that $(Z \perp X)_{G_{\underline X}}$. We already saw the graph $G_{\underline X}$ above, and, indeed, $Z$ is d-separated from $X$ in that graph, since the only path from $Z$ to $X$ is blocked at $Y$. As a result, we have: $[3] \,\,\,\, p(y| \mbox{do}(x)) = \sum_z p(y|z,\mbox{do}(x)) p(z|x).$ At this point in the presentation, I’m going to speed the discussion up, telling you what rule of the calculus to apply at each step, but not going through the process of explicitly checking that the conditions of the rule hold. (If you’re doing a close read, you may wish to check the conditions, however.) The next thing we do is to apply rule 2 to the first term on the right-hand side of equation [3], obtaining $p(y|z,\mbox{do}(x)) = p(y|\mbox{do}(z),\mbox{do}(x))$. We then apply rule 3 to remove the $\mbox{do}(x)$, obtaining $p(y|z,\mbox{do}(x)) = p(y|\mbox{do}(z))$. Substituting back in gives us: $[4] \,\,\,\, p(y| \mbox{do}(x)) = \sum_z p(y|\mbox{do}(z)) p(z|x).$ So this means that we’ve reduced the computation of $p(y|\mbox{do}(x))$ to the computation of $p(y|\mbox{do}(z))$. This doesn’t seem terribly encouraging: we’ve merely substituted the computation of one causal conditional probability for another. Still, let us continue plugging away, and see if we can make progress. The obvious first thing to try is to apply rule 2 or rule 3 to simplify $p(y|\mbox{do}(z)$. Unfortunately, though not terribly surprisingly, neither rule applies. So what do we do? Well, in a repeat of our strategy above, we again condition on the other variable we have available to us, in this case $x$: $p(y|\mbox{do}(z)) = \sum_x p(y|x,\mbox{do}(z)) p(x|\mbox{do}(z)).$ Now we’re cooking! Rule 2 lets us simplify the first term to $p(y|x,z)$, while rule 3 lets us simplify the second term to $p(x)$, and so we have $p(y|\mbox{do}(z)) = \sum_x p(y|x,z) p(x)$. To substitute this expression back into equation [4] it helps to change the summation index from $x$ to $x'$, since otherwise we would have a duplicate summation index. This gives us: $[5] \,\,\,\, p(y| \mbox{do}(x)) = \sum_{x'z} p(y|x',z) p(z|x) p(x') .$ This is the promised expression for $p(y|\mbox{do}(x))$ (i.e., for probabilities like $p(\mbox{cancer}| \mbox{do(smoking)})$, assuming the causal model above) in terms of quantities which may be observed directly from experimental data, and which don’t require intervention to do a randomized, controlled experiment. Once $p(\mbox{cancer}| \mbox{do(smoking)})$ is determined, we can compare it against $p(\mbox{cancer})$. If $p(\mbox{cancer}| \mbox{do(smoking)})$ is larger than $p(\mbox{cancer})$ then we can conclude that smoking does, indeed, play a causal role in cancer. Something that bugs me about the derivation of equation [5] is that I don’t really know how to “see through” the calculations. Yes, it all works out in the end, and it’s easy enough to follow along. Yet that’s not the same as having a deep understanding. Too many basic questions remain unanswered: Why did we have to condition as we did in the calculation? Was there some other way we could have proceeded? What would have happeed if we’d conditioned on the value of the hidden variable? (This is not obviously the wrong thing to do: maybe the hidden variable would ultimately drop out of the calculation). Why is it possible to compute causal probabilities in this model, but not (as we shall see) in the model without tar? Ideally, a deeper understanding would make the answers to some or all of these questions much more obvious. ### Problems for the author • Why is it so much easier to compute $p(y|\mbox{do}(z))$ than $p(y|\mbox{do}(x))$ in the model above? Is there some way we could have seen that this would be the case, without needing to go through a detailed computation? • Suppose we have a causal model $G$, with $S$ a subset of vertices for which all conditional probabilities are known. Is it possible to give a simple characterization of for which subsets $X$ and $Y$ of vertices it is possible to compute $p(y|\mbox{do}(x))$ using just the conditional probabilities from $S$? Unfortunately, I don’t know what the experimentally observed probabilities are in the smoking-tar-cancer case. If anyone does, I’d be interested to know. In lieu of actual data, I’ll use some toy model data suggested by Pearl; the data is quite unrealistic, but nonetheless interesting as an illustration of the use of equation [5]. The toy model data is as follows: (1) 47.5 percent of the population are nonsmokers with no tar in their lungs, and 10 percent of these get cancer. (2) 2.5 percent are smokers with no tar, and 90 percent get cancer. (3) 2.5 percent are nonsmokers with tar, and 5 percent get cancer. (4) 47.5 percent are smokers with tar, and 85 percent get cancer. In this case, we get: $p(\mbox{cancer} | \mbox{do}(smoking)) = 45.25 \$ By contrast, $p(\mbox{cancer}) = 47.5$ percent, and so if this data was correct (obviously it’s not even close) it would show that smoking actually somewhat reduces a person’s chance of getting lung cancer. This is despite the fact that $p(\mbox{cancer} | \mbox{smoking}) = 85.25$ percent, and so a naive approach to causality based on correlations alone would suggest that smoking causes cancer. In fact, in this imagined world smoking might actually be useable as a preventative treatment for cancer! Obviously this isn’t truly the case, but it does illustrate the power of this method of analysis. Summing up the general lesson of the smoking-cancer example, suppose we have two competing hypotheses for the causal origin of some effect in a system, A causes C or B causes C, say. Then we should try to construct a realistic causal model which includes both hypotheses, and then use the causal calculus to attempt to distinguish the relative influence of the two causal factors, on the basis of experimentally accessible data. Incidentally, the kind of analysis of smoking we did above obviously wasn’t done back in the 1960s. I don’t actually know how causality was established over the protestations that correlation doesn’t impy causation. But it’s not difficult to think of ways you might have come up with truly convincing evidence that smoking was a causal factor. One way would have been to look at the incidence of lung cancer in populations where smoking had only recently been introduced. Suppose, for example, that cigarettes had just been introduced into the (fictional) country of Nicotinia, and that this had been quickly followed by a rapid increase in rates of lung cancer. If this pattern was seen across many new markets then it would be very difficult to argue that lung cancer was being caused solely by some pre-existing factor in the population. ### Exercises • Construct toy model data where smoking increases a person’s chance of getting lung cancer. Let’s leave this model of smoking and lung cancer, and come back to our original model of smoking and lung cancer: What would have happened if we’d tried to use the causal calculus to analyse this model? I won’t go through all the details, but you can easily check that whatever rule you try to apply you quickly run into a dead end. And so the causal calculus doesn’t seem to be any help in analysing this problem. This example illustrates some of the limitations of the causal calculus. In order to compute $p(\mbox{cancer}| \mbox{do}(smoking))$ we needed to assume a causal model with a particular structure: While this model is plausible, it is not beyond reproach. You could, for example, criticise it by saying that it is not the presence of tar deposits in the lungs that causes cancer, but maybe some other factor, perhaps something that is currently unknown. This might lead us to consider a causal model with a revised structure: So we could try instead to use the causal calculus to analyse this new model. I haven’t gone through this exercise, but I strongly suspect that doing so we wouldn’t be able to use the rules of the causal calculus to compute the relevant probabilities. The intuition behind this suspicion is that we can imagine a world in which the tar may be a spurious side-effect of smoking that is in fact entirely unrelated to lung cancer. What causes lung cancer is really an entirely different mechanism, but we couldn’t distinguish the two from the statistics alone. The point of this isn’t to say that the causal calculus is useless. It’s remarkable that we can plausibly get information about the outcome of a randomized controlled experiment without actually doing anything like that experiment. But there are limitations. To get that information we needed to make some presumptions about the causal structure in the system. Those presumptions are plausible, but not logically inevitable. If someone questions the presumptions then it may be necessary to revise the model, perhaps adopting a more sophisticated causal model. One can then use the causal calculus to attempt to analyse that more sophisticated model, but we are not guaranteed success. It would be interesting to understand systematically when this will be possible and when it will not be. The following problems start to get at some of the issues involved. ### Problems for the author • Is it possible to make a more precise statement than “the causal calculus doesn’t seem to be any help” for the original smoking-cancer model? • Given a probability distribution over some random variables, it would be useful to have a classification theorem describing all the causal models in which those random variables could appear. • Extending the last problem, it’d be good to have an algorithm to answer questions like: in the space of all possible causal models consistent with a given set of observed probabilities, what can we say about the possible causal probabilities? It would also be useful to be able to input to the algorithm some constraints on the causal models, representing knowledge we’re already sure of. • In real-world experiments there are many practical issues that must be addressed to design a realiable randomized, controlled experiment. These issues include selection bias, blinding, and many others. There is an entire field of experimental design devoted to addressing such issues. By comparison, my description of causal inference ignores many of these practical issues. Can we integrate the best thinking on experimental design with ideas such as causal conditional probabilities and the causal calculus? • From a pedagogical point of view, I wonder if it might have been better to work fully through the smoking-cancer example before getting to the abstract statement of the rules of the causal calculus. Those rules can all be explained and motivated quite nicely in the context of the smoking-cancer example, and that may help in understanding. ### Conclusion I’ve described just a tiny fraction of the work on causality that is now going on. My impression as an admittedly non-expert outsider to the field is that this is an exceptionally fertile field which is developing rapidly and giving rise to many fascinating applications. Over the next few decades I expect the theory of causality will mature, and be integrated into the foundations of disciplines ranging from economics to medicine to social policy. Causal discovery: One question I’d like to understand better is how to discover causal structures inside existing data sets. After all, human beings do a pretty good (though far from perfect) job at figuring out causal models from their observation of the world. I’d like to better understand how to use computers to automatically discover such causal models. I understand that there is already quite a literature on the automated discovery of causal models, but I haven’t yet looked in much depth at that literature. I may come back to it in a future post. I’m particularly fascinated by the idea of extracting causal models from very large unstructured data sets. The KnowItAll group at the University of Washington (see Oren Etzioni on Google Plus) have done fascinating work on a related but (probably) easier problem, the problem of open information extraction. This means taking an unstructured information source (like the web), and using it to extract facts about the real world. For instance, using the web one would like computers to be able to learn facts like “Barack Obama is President of the United States”, without needing a human to feed it that information. One of the things that makes this task challenging is all the misleading and difficult-to-understand information out on the web. For instance, there are also webpages saying “George Bush is President of the United States”, which was probably true at the time the pages were written, but which is now misleading. We can find webpages which state things like “[Let’s imagine] Steve Jobs is President of the United States“; it’s a difficult task for an unsupervised algorithm to figure out how to interpret that “Let’s imagine”. What the KnowItAll team have done is made progress on figuring out how to learn facts in such a rich but uncontrolled environment. What I’m wondering is whether such techniques can be adapted to extract causal models from data? It’d be fascinating if so, because of course humans don’t just reason with facts, they also reason with (informal) causal models that relate those facts. Perhaps causal models or a similar concept may be a good way of representing some crucial part of our knowledge of the world. ### Problems for the author • What systematic causal fallacies do human beings suffer from? We certainly often make mistakes in the causal models we extract from our observations of the world – one example is that we often do assume that correlation implies causation, even when that’s not true – and it’d be nice to understand what systematic biases we have. • Humans aren’t just good with facts and causal models. We’re also really good at juggling multiple causal models, testing them against one another, finding problems and inconsistencies, and making adjustments and integrating the results of those models, even when the results conflict. In essence, we have a (working, imperfect) theory of how to deal with causal models. Can we teach machines to do this kind of integration of causal models? • We know that in our world the sun rising causes the rooster to crow, but it’s possible to imagine a world in which it is the rooster crowing that causes the sun to rise. This could be achieved in a suitably designed virtual world, for example. The reason we believe the first model is correct in our world is not intrinsic to the data we have on roosters and sunrise, but rather depends on a much more complex network of background knowledge. For instance, given what we know about roosters and the sun we can easily come up with plausible causal mechanisms (solar photons impinging on the rooster’s eye, say) by which the sun could cause the rooster to crow. There do not seem to be any similarly plausible causal models in the other direction. How do we determine what makes a particular causal model plausible or not? How do we determine the class of plausible causal models for a given phenomenon? Can we make this kind of judgement automatically? (This is all closely related to the last problem). Continuous-time causality: A peculiarity in my post is that even though we’re talking about causality, and time is presumably important, I’ve avoided any explicit mention of time. Of course, it’s implicitly there: if I’d been a little more precise in specifying my models they’d no doubt be conditioned on events like “smoked at least a pack a day for 10 or more years”. Of course, this way of putting time into the picture is rather coarse-grained. In a lot of practical situations we’re interested in understanding causality in a much more temporally fine-grained way. To explain what I mean, consider a simple model of the relationship between what we eat and our insulin levels: This model represents the fact that what we eat determines our insulin levels, and our insulin levels in turn play a part in determining how hungry we feel, and thus what we eat. But as a model, it’s quite inadequate. In fact, there’s a much more complex feedback relationship going on, a constant back-and-forth between what we eat at any given time, and our insulin levels. Ideally, this wouldn’t be represented by a few discrete events, but rather by a causal model that reflects the continual feedback between these possibilities. What I’d like to see developed is a theory of continuous-time causal models, which can address this sort of issue. It would also be useful to extend the calculus to continuous spaces of events. So far as I know, at present the causal calculus doesn’t work with these kinds of ideas. ### Problems for the author • Can we formulate theories like electromagnetism, general relativity and quantum mechanics within the framework of the causal calculus (or some generalization)? Do we learn anything by doing so? Other notions of causality: A point I’ve glossed over in the post is how the notion of causal influence we’ve been studying relates to other notions of causality. The notion we’ve been exploring is based on the notion of causality that is established by a (hopefully well-designed!) randomized controlled experiment. To understand what that means, think of what it would mean if we used such an experiment to establish that smoking does, indeed, cause cancer. All this means is that in the population being studied, forcing someone to smoke will increase their chance of getting cancer. Now, for the practical matter of setting public health policy, that’s obviously a pretty important notion of causality. But nothing says that we won’t tomorrow discover some population of people where no such causal influence is found. Or perhaps we’ll find a population where smoking actively helps prevent cancer. Both these are entirely possible. What’s going on is that while our notion of causality is useful for some purposes, it doesn’t necessarily say anything about the details of an underlying causal mechanism, and it doesn’t tell us how the results will apply to other populations. In other words, while it’s a useful and important notion of causality, it’s not the only way of thinking about causality. Something I’d like to do is to understand better what other notions of causality are useful, and how the intervention-based approach we’ve been exploring relates to those other approaches. ### Acknowledgments Thanks to Jen Dodd, Rob Dodd, and Rob Spekkens for many discussions about causality. Especial thanks to Rob Spekkens for pointing me toward the epilogue of Pearl’s book, which is what got me hooked on causality! ### Principal sources and further reading A readable and stimulating overview of causal inference is the epilogue to Judea Pearl’s book. The epilogue, in turn, is based on a survey lecture by Pearl on causal inference. I highly recommend getting a hold of the book and reading the epilogue; if you cannot do that, I suggest looking over the survey lecture. A draft copy of the first edition of the entire book is available on Pearl’s website. Unfortunately, the draft does not include the full text of the epilogue, only the survey lecture. The lecture is still good, though, so you should look at it if you don’t have access to the full text of the epilogue. I’ve also been told good things about the book on causality by Spirtes, Glymour and Scheines, but haven’t yet had a chance to have a close look at it. An unfortunate aspect of the current post is that it gives the impression that the theory of causal inference is entirely Judea Pearl’s creation. Of course that’s far from the case, a fact which is quite evident from both Pearl’s book, and the Spirtes-Glymour-Scheines book. However, the particular facets I’ve chosen to focus on are due principally to Pearl and his collaborators: most of the current post is based on chapter 3 and chapter 1 of Pearl’s book, as well as a 1994 paper by Pearl, which established many of the key ideas of the causal calculus. Finally, for an enjoyable and informative discussion of some of the challenges involved in understanding causal inference I recommend Jonah Lehrer’s recent article in Wired. Interested in more? Please follow me on Twitter. You may also enjoy reading my new book about open science, Reinventing Discovery. In this post I describe a simple, single-machine web crawler that I’ve written, and do some simple profiling and benchmarking. In the next post I intend to benchmark it against two popular open source crawlers, the scrapy and Nutch crawlers. I’m doing this as part of an attempt to answer a big, broad question: if you were trying to build a web-scale crawler, does it make most sense to start from scratch (which gives you a lot of flexibility), or would it make more sense to start from an existing project, like Nutch? Of course, there are many aspects to answering this question, but obviously one important aspect is speed: how fast can we download pages? I’m especially interested in understanding where the bottlenecks are in my code. Is it the fact that I’ve used Python? Is it download speed over the network? Is it access to the database server? Is it parsing content? Are we CPU-bound, network-bound, or disk-bound? The answers to these questions will help inform decisions about whether to work on improving the crawler, or perhaps to work starting from an existing crawler. The code for my test crawler is at GitHub. The crawler uses as a set of seed urls a listing of some of the top blogs from Technorati. Only urls from within the corresponding domains are crawled. I won’t explicitly show the code for getting the seed urls, but it’s in the same GitHub repository (link). Here’s the code for the crawler: """crawler.py crawls the web. It uses a domain whitelist generated from Technorati's list of top blogs. USEAGE python crawler.py & The crawler ingests input from external sources that aren't under centralized control, and so needs to deal with many potential errors. By design, there are two broad classes of error, which we'll call anticipated errors and unanticipated errors. Anticipated errors are things like a page failing to download, or timing out, or a robots.txt file disallowing crawling of a page. When anticipated errors arise, the crawler writes the error to info.log, and continues in an error-appropriate manner. Unanticipated errors are, not surprisingly, errors which haven't been anticipated and designed for. Rather than the crawler falling over, we log the error and continue. At the same time, we also keep track of how many unanticipated errors have occurred in close succession. If many unanticipated errors occur rapidly in succession it usually indicates that some key piece of infrastructure has failed --- maybe the network connection is down, or something like that. In that case we shut down the crawler entirely.""" import cPickle import json import logging import logging.handlers import os from Queue import Queue import re import robotparser from StringIO import StringIO import sys import threading import time import traceback import urllib import urllib2 from urlparse import urlparse # Third party libraries from lxml import etree import MySQLdb from redis import Redis # Configuration parameters # NUM_THREADS is the number of crawler threads: setting this parameter # requires some experimentation. A rule of thumb is that the speed of # the crawler should scale roughly proportionally to NUM_THREADS, up # to a point at which performance starts to saturate. That's the # point at which to stop. NUM_THREADS = 15 # MAX_LENGTH is the largest number of bytes to download from any # given url. MAX_LENGTH = 100000 # NUM_PAGES is the number of pages to crawl before halting the crawl. # Note that the exact number of pages crawled will be slightly higher, # since each crawler thread finishes its current downloads before # exitting. At the moment, NUM_PAGES is set quite modestly. To # increase it dramatically --- say up to 10 million --- would require # moving away from Redis to maintain the URL queue. NUM_PAGES = 5000 # Global variables to keep track of the number of unanticipated # errors, and a configuration parameter --- the maximum number of # close unanticipated errors in a row that we'll tolerate before # shutting down. count_of_close_unanticipated_errors = 0 time_of_last_unanticipated_error = time.time() MAX_CLOSE_UNANTICIPATED_ERRORS = 5 # Counter to keep track of the number of pages crawled. r = Redis() r.set("count",0) # total_length: the total length of all downloaded files. # Interpreting it depends on total_length = 0 def main(): create_logs() establish_url_queues() get_seed_urls() get_domain_whitelist() establish_mysql_database() start_time = time.time() for j in range(NUM_THREADS): crawler = Crawler() crawler.setName("thread-%s" % j) print "Launching crawler %s" % (j,) crawler.start() crawler.join() end_time = time.time() elapsed_time = end_time-start_time r = Redis() num_pages = int(r.get("count")) print "%s pages downloaded in %s seconds." % (num_pages,elapsed_time) print "That's %s pages per second." % (num_pages/elapsed_time) print "\nTotal length of download is %s." % total_length print "Assuming UTF-8 encoding (as for most English pages) that's the # of bytes downloaded." print "Bytes per second: %s" % (total_length/elapsed_time) def create_logs(): """Set up two logs: (1) info_logger logs routine events, including both pages which have been crawled, and also anticipated errors; and (2) critical_logger records unanticipated errors.""" global info_logger global critical_logger info_logger = logging.getLogger('InfoLogger') info_logger.setLevel(logging.INFO) info_handler = logging.handlers.RotatingFileHandler( 'info.log', maxBytes=1000000, backupCount=5) info_logger.addHandler(info_handler) critical_logger = logging.getLogger('CriticalLogger') critical_logger.setLevel(logging.CRITICAL) critical_handler = logging.handlers.RotatingFileHandler( 'critical.log',maxBytes=100000,backupCount=5) critical_logger.addHandler(critical_handler) def establish_url_queues(): """Checks whether the Redis database has been set up. If not, set it up.""" r = Redis() # Strictly, we should check that the lists for all threads are empty. # But this works in practice. if r.llen("thread-0") == 0: get_seed_urls() def get_seed_urls(): """Puts the seed urls generated by get_seed_urls_and_domains.py into the crawl queues.""" f = open("seed_urls.json") urls = json.load(f) f.close() append_urls(urls) def append_urls(urls): """Appends the contents of urls to the crawl queues. These are implemented as Redis lists, with names "thread-0", "thread-1", and so on, corresponding to the different threads.""" r = Redis() for url in urls: thread = hash(domain(url)) % NUM_THREADS r.rpush("thread-%s" % str(thread),url) def domain(url): """A convenience method to return the domain associated to a url.""" return urlparse(url).netloc def get_domain_whitelist(): global domain_whitelist f = open("domain_whitelist.json") domain_whitelist = set(json.load(f)) f.close() def establish_mysql_database(): """Checks whether the tables in the MySQL database "crawl" have been set up. If not, then set them up. Note that this routine assumes that the "crawl" database has already been created. """ conn = MySQLdb.connect("localhost","root","","crawl") cur = conn.cursor() if int(cur.execute("show tables")) == 0: create_tables(conn) conn.close() def create_tables(conn): """Creates the MySQL tables and indices for the crawl. We create two tables: (1) robot_parser, which is used to store the (parsed) robots.txt file for each domain; and (2) pages, which stores urls and the corresponding title and content text.""" cur = conn.cursor() cur.execute( 'create table robot_parser (domain text, robot_file_parser text)') cur.execute('create table pages (url text, title text, content mediumtext)') cur.execute('create index domain_idx on robot_parser(domain(255))') cur.execute('create index url_pages_idx on pages(url(255))') class Crawler(threading.Thread): def run(self): global total_length r = Redis() conn = MySQLdb.connect("localhost","root","","crawl") parser = etree.HTMLParser() while int(r.get("count")) < NUM_PAGES: try: urls = self.pop_urls() new_urls = [] except: self.error_handler() continue for url in urls: try: if not self.is_crawling_allowed(conn,url): continue try: request = urllib2.urlopen(url,timeout=5) except urllib2.URLError: info_logger.info("%s: could not open" % url) continue headers = request.headers length = get_length(url,headers) try: content = request.read(length) except urllib2.URLError: info_logger.info("%s: could not download" % url) continue try: tree = etree.parse(StringIO(content),parser) except: info_logger.info("%s: lxml could not parse" % url) continue self.add_to_index(conn,url,content,tree) r.incr("count") total_length += len(content) info_logger.info("Crawled %s" % url) for url in tree.xpath("//a/@href"): if not(domain(url) in domain_whitelist): pass else: new_urls.append(url) except: self.error_handler() continue try: append_urls(new_urls) conn.commit() except: self.error_handler() continue def pop_urls(self): """Returns 10 urls from the current thread's url queue.""" urls = [] r = Redis() for j in range(10): urls.append(r.lpop(self.name)) return urls def error_handler(self): """Logs unanticipated errors to critical_logger. Also checks whether or not the error occurred close to (within 3 seconds) another unanticipated error, and if that occurs too many times in a row, shuts down the crawler.""" global time_of_last_unanticipated_error global count_of_close_unanticipated_errors critical_logger.critical( "Error getting URLs at %s:\n\n%s" % (time.asctime(), traceback.format_exc())) # Check whether the error is close (within 3 seconds) to the # last unanticipated error if (time.time() < time_of_last_unanticipated_error + 3.0): critical_logger.critical( "\nThis error occurred close to another.\n") # Not threadsafe, but shouldn't cause major problems count_of_close_unanticipated_errors += 1 else: count_of_close_unanticipated_errors = 0 # Shut down if we have too many close unanticipated errors if (count_of_close_unanticipated_errors >= MAX_CLOSE_UNANTICIPATED_ERRORS): critical_logger.critical( "\nExit: too many close unanticipated errors.") sys.exit(1) time_of_last_unanticipated_error = time.time() def is_crawling_allowed(self,conn,url): """Checks that url can be crawled. At present, this merely checks that robots.txt allows crawling, and that url signifies a http request. It could easily be extended to do other checks --- on the language of the page, for instance.""" return (self.is_url_a_http_request(url) and self.does_robots_txt_allow_crawling(conn,url)) def is_url_a_http_request(self,url): """Checks that url is for a http request, and not (say) ftp.""" u = urlparse(url) if u.scheme == "http": return True else: info_logger.info("%s: not a http request" % url) return False def does_robots_txt_allow_crawling(self,conn,url): """Check that robots.txt allows url to be crawled.""" cur = conn.cursor() cur.execute( "select robot_file_parser from robot_parser where domain='%s'" \ % domain(url)) rfp_db = cur.fetchone() if rfp_db: # we've crawled this domain before rfp = cPickle.loads(str(rfp_db[0])) else: # we've never crawled this domain rfp = robotparser.RobotFileParser() try: rfp.set_url("http://"+domain(url)+"/robots.txt") rfp.read() except: info_logger.info("%s: couldn't read robots.txt" % url) return False rfp_pickle = cPickle.dumps(rfp) cur = conn.cursor() cur.execute( "insert into robot_parser(domain,robot_file_parser) values(%s,%s)", (domain,rfp_pickle)) conn.commit() if rfp.can_fetch("*",url): return True else: info_logger.info("%s: robots.txt disallows fetching" % url) return False def add_to_index(self,conn,url,content,tree): title = get_title(tree) cur = conn.cursor() try: cur.execute("insert into pages (url,title,content) values(%s,%s,%s)", (url,title,content)) except UnicodeEncodeError: info_logger.info("%s: Couldn't store title %s" % (url,title)) conn.commit() def get_length(url,headers): """Attempts to find the length of a page, based on the "content-length" header. If the information is not available, then sets the length to MAX_LENGTH. Otherwise, sets it to the minimum of the "content-length" header and MAX_LENGTH.""" try: length = int(headers["content-length"]) except (KeyError,ValueError): info_logger.info( "%s: could not retrieve content-length" % url) length = MAX_LENGTH if length > MAX_LENGTH: info_logger.info( "%s: had length %s, truncating to %s" % (url,length,MAX_LENGTH)) length = MAX_LENGTH return length def get_title(tree): """Takes an lxml etree for a HTML document, and returns the text from the first (if any) occurrence of the title tag.""" x = tree.xpath("//title") if len(x) > 0 and x[0].text: return x[0].text else: return "" if __name__ == "__main__": main()  (Feedback on the code is welcome. I’m not an experienced Python programmer, and I know I could learn a lot from people with more experience. Aspects of this code that I suspect could be substantially improved with advice from the right person include the way I do error-handling, and logging.) I used this code on a Slicehost 512 slice – a very lightweight Xen virtual private server, definitely not heavy machinery! Download speed maxed out at around 15 crawler threads, and so that’s the number of threads I used. The bottleneck appeared to be CPU (which was routinely in the range 50-80 percent), although, as we’ll see below, we were also using a substantial fraction of network capacity. Neither memory nor disk speed seemed to be an issue. The crawler downloaded 5043 pages in 229 seconds, which is 22 pages per second. Another way of looking at it is that that’s about 2 million pages per day. The total length of the downloaded pages was 386 million characters – around 370 megabytes, assuming UTF-8 encoding, which gives a sustained download rate of about 1.7 megabytes per second. Using wget alone I’m ordinarily able to get several times that (3-6 megabytes per second), and sometimes up to peak speeds over 10 Mb/sec. This suggests that the bottleneck is not network speed, although, as we’ll see below, a substantial fraction of the program’s time is spent downloading. I profiled the crawler using yappi, a multi-threaded Python profiler. (The standard Python profiler only copes with a single thread.) Here are some things I learnt (note that all numbers are approximate): • Networking – both downloading data and making connections – consumes 40 percent of time. • A surprising fraction of that time (6-7 percent) seems to be spent on just opening the connection, using urllib2.urlopen. I’m not sure what’s taking the time – DNS lookup maybe? • Another surprising aspect of networking is that dealing with urllib2 errors takes 5 percent of the time. • Redis consumes about 20 percent of time. Well over half of that is spent in client.py._execute_command. • The parser for robots.txt is suprisingly CPU intensive, consuming about 5 percent of time. • urlparse consumes about 4 percent of the time. • For reasons I don’t understand, lxml and etree don’t show up in the profiling results. I have no idea why this is. • append_urls consumes only 3-4 percent of the time. • MySQL and the logging were both about 1/4 of a percent. I must admit to being suspicious about the first of these, when compared to the Redis results. It’s true that the program makes far more calls to Redis than MySQL – half a million or so calls to Redis, versus about 16,000 to MySQL – but this still seems wrong. Other possibilities that deserve further consideration: (1) I’ve got Redis poorly configured; or (2) Redis lists are slower than I thought. Many actions suggest themselves. Most significantly, I’ll probably eliminate Redis, and store the queue of urls to be crawled (the url frontier) using a combination of MySQL (for persistence) and in-memory Python queues (for speed). This is the right thing to do regardless of performance. The reason is that Redis stores the url frontier in-memory, and that severely limits the size of the url frontier, unless you spend a lot of money on memory. The obvious thing to do is to move to a disk-based solution to store the url frontier. With some alterations it should be easy to make this crawler scalable, i.e., to make it run on a cluster, rather than a single machine. In particular, because the crawler is based on whitelisting of domains, all that’s needed is to start the crawl off by allocating different parts of the whitelist to different machines. Those sites can then be crawled in parallel, with no necessity for inter-machine communication. Many natural follow-on questions suggest themselves. Are any of these top blogs serving spam, or otherwise compromised in some way? What fraction ban crawling by unknown crawlers (like this one)? What fraction of pages from the crawl are duplicates, or near-duplicates? Do any of the webpages contain honeypots, i.e., url patterns designed to trap or otherwise mislead crawlers? I plan to investigate these questions in future posts. Interested in more? Please follow me on Twitter. My new book about open science, Reinventing Discovery will be published in October, and can be pre-ordered at Amazon. Working notes ahead! This post is different to my last two posts. Those posts were broad reviews of topics of general interest (at least if you’re interested in data-driven intelligence) – the Pregel graph framework, and the vector space model of documents. This post is not a review or distillation of a topic in the style of those posts. Rather, it’s my personal working notes – me, thinking out loud – as I try to understand a particular question in some depth. You can think of it as a variation on open notebook science. As such, the notes go into much more detail than previous posts, as I try to wrap my head around all the details of the problem at hand. In writing this blog I expect to alternate between these two formats (working notes and broader reviews). When it’s my personal working notes, I’ll make that clear in the first paragraph or so, so people who just want carefully distilled highlights can skip the more specialized posts. With that out of the way, let’s get on with the main problem. In the last post I described a way of measuring how relevant a document is to a given search query. That notion of relevance was based on the cosine similarity (essentially, the angle) between the document vector and the query vector. Those vectors, in turn, were defined based on a notion of the importance $\mbox{imp}(t,d)$ of a term $t$ in a document $d$. In the post I used what I called the standard importance function (I’ll remind you how that’s defined in a moment). But although the standard importance function is widely used, that doesn’t mean it’s without problems, and in this post I describe a problem caused by the standard importance function. The strategy in the post is to first describe a specific example that illustrates the problem with the standard importance function. The example is somewhat artificial, but, once we understand it, we’ll see that it’s part of a broader problem that is very real, and not at all artificial. To describe the example, let’s suppose our search query contains two terms, “hobbit baggins”. We can represent this by a query vector $\vec q$ whose components include a term for “hobbit” and a term for “baggins”. Recall from the last post that, if we’re using the standard importance function, then those components are given by the inverse document frequency for each term, $\log(N/{\rm df}_t)$. Here, logarithms are taken to base two, $N$ is the total number of documents in the search corpus, $t$ is the term (“hobbit” and “baggins” being the relevant terms in this example), and ${\rm df}_t$ is the document frequency for term $t$, i.e., the number of documents in the corpus in which $t$ occurs. Now, both “hobbit” and “baggins” are likely to be pretty unusual terms in the corpus, and so the inverse document frequency will be pretty large. Suppose, for example, that approximately one in a thousand documents contains “hobbit”, and similarly for “baggins”. Then the inverse document frequency will be $\log(1000) \approx 10$. So to a good approximation the query vector will be: $\vec q = [10 10 0 0 \ldots]^T,$ where I’ve written the query vector so the first component is the component for “hobbit”, the second component is the component for “baggins”, and the other components are for other terms – all those components are $0$ in this case, since those other terms don’t appear in the query. I’ve also written the query vector as a transposed row vector, because row vectors display better on the blog. Now suppose we have a document that contains ten occurrences of “hobbit”, and ten occurrences of “baggins”. Using the standard importance function, its document vector will have the form $\vec d_1 = [100 100 \vec d_r^T]^T.$ Here, I’ve used the fact that the component of $\vec d_1$ in the “direction” of term $t$ is given by the standard importance function: $\mbox{imp}(t,d) \equiv \mbox{tf}_{t,d} \log(N/\mbox{df}_t),$ where $\mbox{tf}_{t,d}$ is the term frequency of $t$ in document $d$. I’ve also used $\vec d_r$ to denote the document vector for the remainder of the document, i.e., for all the terms not involving “hobbit” or “baggins”. Suppose, now, that we change the document, so that for some reason we replace all occurrences of “baggins” by “hobbit”, so there are now $20$ occurrences of “hobbit”, and none of “baggins”. Intuitively, it seems to me that the resulting modified document is much less relevant to the query “baggins hobbit”. The reason is that although it contains twice as many occurrences of “hobbit”, those extra occurrences don’t compensate for the complete absence of the term “baggins”. Put another way, I think a small decrease in the frequency of “baggins” should only be compensated for by a much larger increase in the frequency of “hobbit”. As we’ll see, though, the cosine similarity doesn’t reflect this intuition very well. To compute the cosine similarity, note that the document vector for the modified document is: $\vec d_2 = [200 0 \vec d_r^T]^T.$ Recall also that the cosine similarity for an arbitrary document $d$ is defined to be $\cos \theta \equiv (\vec q \cdot \vec d) / q d$, where I use the convention that $q \equiv \|\vec q\|, d \equiv \|\vec d\|$. In principle this notation could cause confusion, since we now use $q$ (or $d$) to refer to the query (or document), as well as the length of the corresponding query (or document) vector. In practice it should be clear from context which is meant. We get a hint of the problem with the standard importance function when we observe that $\vec q \cdot \vec d_1 = \vec q \cdot \vec d_2 = 2000$. I.e., changing all occurrences of “hobbit” to “baggins” has no effect at all on the inner product, simply moving weight in the inner product from one term to the other. And so the only difference in the cosine similarity is due to the change in length between $d_1$ and $d_2$. One way of expressing this is to observe that: $\frac{\cos \theta_1}{\cos \theta_2} = \frac{\vec q \cdot \vec d_1}{q d_1} \frac{q d_2}{\vec q \cdot \vec d_2} = \frac{d_2}{d_1}.$ Note furthermore that $d_1 = \sqrt{20000+d_r^2}; \,\,\,\, d_2 = \sqrt{40000+d_r^2},$ and so we have: $\frac{\cos \theta_1}{\cos \theta_2} = \sqrt{\frac{40000+d_r^2}{20000+d_r^2}} > 1.$ The good news is that this gives us the right ordering – the document $d_1$ is more parallel to $q$ than is $d_2$, since $\cos \theta_1 > \cos \theta_2$, and so $\theta_1 < \theta_2$. That's what we hoped for, and so at least the ordering of relevance will be correct. The bad news is that, for complex documents, the difference in cosine similarity is tiny. The reason is that for a long, complex document containing many unusual terms ("Gandalf", "Galadriel", "Saruman"), the value of $d_r^2$ may be enormous - let us say, conservatively, billions. In this approximation, we see that $\theta_1 \approx \theta_2$. By itself this isn't a problem. But it means that the ordering of $\theta_1$ and $\theta_2$ is enormously sensitive to changes in $\vec d_r$. If, when we modified the first document, we changed not just "hobbit" to "baggins", but also other terms, then that might cause the value of $d_r^2$ to decrease. If it decreased by more than $20000$ - easy to imagine, with just a few minor changes - then it would reverse the ordering of the two terms. And that doesn't seem like such a good thing. The example I have given may be artificial, but the broader phenomenon is not. More broadly, if we use the standard importance function when computing cosine similarity, then it becomes much too easy to trade off query terms in a document. For two-term queries, $q = t_1 t_2$, an increasing number of occurrences of $t_1$ can be traded off almost exactly by a corresponding decrease in the number of occurrences of $t_2$. So, for example, doubling the number of occurrences of $t_1$ allows us to (approximately) completely wipe out the occurrences of $t_2$, while making only a tiny difference in cosine similarity. In broad terms it's pretty clear how to address this problem - change the scoring function so that we can't trade off query terms in this way. It's also easy to see how we might go about changing the scoring function to do that. In fact, it's a bit too easy - there are so many ways we could go about doing it that it's not at all clear which will give the best results. Here's one trivial way: redefine the importance function to be: $\mbox{imp}(t,d) \equiv \sqrt{\mbox{tf}_{t,d}} \log(N/\mbox{df}_t).$ That is, modify the importance function so it has the square root of the term frequency in it rather than the term frequency. If we do this for the hobbit''-baggins'' example, then increasing the frequency of hobbit'' to [latex]20$ occurrences (from $10$) will require there to be no fewer than $4$ occurrences of “baggins” for the score to remain as high (assuming the rest of the document is complex, i.e., $d_r \gg 40000$). This matches our intuition much better. Unfortunately, it’s by no means clear that it preserves whatever other properties of cosine similarity made it a good measure of relevance in the first place. That’s a question that most likely would require empirical user testing to resolve.

To wrap up, my takeaways from this are that: (1) the standard importance function makes it too easy to trade off query terms against one another in a document; (2) it’s easy to make ad hoc modifications to the standard importance function that remove this problem; but (3) it’s not clear which of those modifications would preserve the other qualities we want. The obvious thing to do is some empirical testing to see how satisfied users are with different measures. It’d also be good to think more deeply about the foundation of the problem: how should we define the relevance of a document to a given user query?

Interested in more? Please follow me on Twitter. My new book about open science, Reinventing Discovery will be published in October, and can be pre-ordered at Amazon.

When we type a query into a search engine – say “Einstein on relativity” – how does the search engine decide which documents to return? When the document is on the web, part of the answer to that question is provided by the PageRank algorithm, which analyses the link structure of the web to determine the importance of different webpages. But what should we do when the documents aren’t on the web, and there is no link structure? How should we determine which documents most closely match the intent of the query?

In this post I explain the basic ideas of how to rank different documents according to their relevance. The ideas used are very beautiful. They are based on the fearsome-sounding vector space model for documents. Although it sounds fearsome, the vector space model is actually very simple. The key idea is to transform search from a linguistic problem into a geometric problem. Instead of thinking of documents and queries as strings of letters, we adopt a point of view in which both documents and queries are represented as vectors in a vector space. In this point of view, the problem of determining how relevant a document is to a query is just a question of determining how parallel the query vector and the document vector are. The more parallel the vectors, the more relevant the document is.

This geometric way of treating documents turns out to be very powerful. It’s used by most modern web search engines, including (most likely) web search engines such as Google and Bing, as well as search libraries such as Lucene. The ideas can also be used well beyond search, for problems such as document classification, and for finding clusters of related documents. What makes this approach powerful is that it enables us to bring the tools of geometry to bear on the superficially very non-geometric problem of understanding text.

I’ve based my treatment on chapters 6 and 7 of a very good book about information retrieval, by Manning, Raghavan, and Schuetze. The book is also available for free on the web here.

It’s worth mentioning that although this post is a mixture of mathematics and algorithms, search is not a purely mathematical or algorithmic problem. That is, search is not like solving a quadratic equation or sorting a list: there is no right answer, no well-defined notion of what it means to “solve” the search problem. To put it another way, as pointed out by Peter Norvig in an interview in Peter Seibel’s book Coders at Work, it makes no sense to think about “proving” that Google is correct, as one might perhaps do for a sorting algorithm. And yet we would no doubt all agree that some search algorithms are better, and others are worse.

### The vector space model explained

Suppose $d$ is a document in the corpus of documents we’d like to search. As a trivial example of a corpus, later in this post I’ll use a corpus containing just four documents, which I’ve chosen to be product descriptions for the books Lord of the Rings, The Hobbit, The Silmarillion and Rainbows End (all taken from Amazon). A real corpus would be much larger. In any case, what we’re going to do is to associate with $d$ a document vector, for which we’ll use the notation $\vec d$. You can think of the document vector $\vec d$ as a geometric way of representing the document $d$.

To explain how $\vec d$ is defined, we need first to define the vector space $\vec d$ lives in. To do that we need to go through a few preparatory steps. Starting from our corpus, we’ll extract all the words in that corpus, and we’ll call them the terms that make up the corpus dictionary. To each term $t$ in the dictionary we will associate a different unit vector $\vec e_t$ in the vector space. So, for instance, if the word “Einstein” appears in the corpus, then there is a unit vector $\vec e_{\rm Einstein}$. We’ll further stiuplate that the unit vectors for different terms are orthogonal to one another. And so the vector space we’re working in – our geometric arena – is, by definition, the vector space spanned by all the unit vectors $\vec e_t$ corresponding to the different terms in the dictionary.

Somewhat oddly, so far as I know this vector space doesn’t have a name. I think it should: it is, after all, the geometric arena in which the vector space model for documents works. I will call it document space.

With all the above in mind, we can now define the document vector:

$\vec d \equiv \sum_{t} \mbox{imp}(t,d) \vec e_t.$

Here, $\mbox{imp}(t,d)$ is some function that measures the importance of term $t$ in document $d$ (more on this function below). The key point to take away for now is that the document vector $\vec d$ is just the vector whose component in the direction $\vec e_t$ of term $t$ is a measure of the importance of term $t$ in the document $d$. That’s a pretty simple and natural geometric notion. It means that if some term is very important in a document, say “Einstein”, then $\vec d$ will have a large component in the $\vec e_{\rm Einstein}$ direction. But if some term isn’t very important – say “turnips” – then the component in the $\vec e_{\rm turnips}$ direction will be small.

How should we define the importance function $\mbox{imp}(t,d)$? There is, of course, no unique way of defining such a function, and below we’ll discuss several different concrete approaches to defining $\mbox{imp}(t,d)$. But as a simple example, one way you could define importance is to set $\mbox{imp}(t,d)$ equal to the frequency of the term $t$ in the document $d$. Later we’ll come up with an even better measure, but this illustrates the flavour of how $\mbox{imp}(t,d)$ is defined.

A caveat about the document vector $\vec d$ is that typically doesn’t preserve all the information about the document. For instance, if we use frequency as a measure of importance, then the document vector for the document “She sells sea shells by the sea shore” will be the same as the document vector for the document “Sea shells by the sea shore she sells”, because the term frequencies are the same in both documents. So the document vector is only an imperfect representation of the original document. Still, provided you choose a reasonable function for importance, the document vector preserves quite a lot of information about the original document.

### Finding documents relevant to a query

Having defined document vectors, we can now say how to find the document most relevant to a given search query, $q$. We do it by defining a query vector, $\vec q$, exactly as we defined the document vector. That is, we regard the query text $q$ as a document, and construct the corresponding vector in document space:

$\vec q \equiv \sum_{t} \mbox{imp}(t,q) \vec e_t$

Now, how should me measure how relevant a document $d$ is to a particular query $q$? One way is to simply consider the angle $\theta$ between the query vector $\vec q$ and the document vector $\vec d$, as defined by the equation:

$\cos \theta = \frac{\vec q \cdot \vec d}{\| \vec q \| \| \vec d \|}.$

This is called the cosine similarity between the document and the query. Our notion of relevance, then, is that the smaller the angle $\theta$ is, the more relevant the document $d$ is to the query $q$.

What makes the cosine similarity so beautiful is that it lets us use our geometric intuition to think about the problem of ranking documents. Want the top $K$ ranked documents for the query? Just compute the angle $\theta$ for every document in the corpus, and return the $K$ documents whose angles are smallest. Want to find documents that are similar to a document $d$ (“find documents like this one”)? Just look for the other documents whose document vectors have the smallest angle to $\vec d$.

Of course, using the angle is a very ad hoc choice. The justification is not theoretical, it is empirical: search engines which use cosine similarity on the whole leave their users pretty satisfied. If you can find another measure which leaves users more satisfied, then by all means, feel free to use that measure instead.

### Problems

• Is there a simple mathematical model of user behaviour and satisfication in which it’s possible to prove that cosine similarity is optimal for satisfaction, for some particular choice of the importance function? Alternately, can you find another measure which is optimal? Note: As with many good research problems, this problem is quite ill posed. Making it simultaneously well posed and soluble is part of the problem.

### The importance function

We’re now in position to come back and think more carefully about how to define the importance function. Earlier, I mentioned that one way of definining $\mbox{imp}(t,d)$ is as the frequency of term $t$ in the document $d$. This is a pretty good measure, but has the problem that it over-emphasizes common words, like “the”, “and”, “but”, and so on. If I’m interested in comparing a document and a query, it’s not so useful to know that both contain the word “the”, even if the document contains “the” twenty times. On the other hand, if both document and query contain the word “hobbit”, that’s a much more pertinent fact, even if the document only contains “hobbit” once. The reason is that in most search corpi the term “hobbit” is a relatively rare term.

There’s an improved notion of importance that takes this difference between common and uncommon words in the corpus into account. To define this improved notion of importance, first let $N$ be the total number of documents in the corpus. Let ${\rm df}_t$ be the number of documents the term $t$ appears in, known as the document frequency for term $t$. And finally, let ${\rm tf}_{t,d}$ be the numbe of times $t$ appears in document $d$, known as the term frequency. One common choice for the importance function is (logarithm to base $2$):

$\mbox{imp}(t,d) \equiv {\rm tf}_{t,d} * \log(N/{\rm df}_t)$

While this isn’t the only possible choice for the importance function, it is the one we’ll focus on the most, and so we’ll call it the standard importance function.

How should we think about the standard importance function? Perhaps the strangest looking part of the function is the $\log(N/{\rm df}_t)$ term, often called the inverse document frequency for $t$. In fact, the inverse document frequency isn’t so strange. Suppose that we’re trying to identify a document in the corpus. Initially, we know nothing about the identity of the document. Then we learn that the document contains the term $t$. The inverse document frequency $\log(N/{\rm df}_t)$ is just the number of bits of information we acquire when we learn this information. If $t$ is a very common term – say “the” or “and” – then this number will be tiny, because the document frequency ${\rm df}_t$ is very close to the total number of documents, $N$. But if $t$ is an uncommon term in the corpus – as the term “hobbit” is likely to be (unless the corpus is on Tolkien!) – then the document frequency ${\rm df}_t$ will be small compared to $N$, and we’ll get quite a lot of information when we learn that the document contains $t$.

With this interpretation of $\log(N/{\rm df}_t)$, we can now think of $\mbox{imp}(t,d)$ as simply being the frequency of $t$ in $d$, but rescaled according to a measure $\log(N/{\rm df}_t)$ of the importance of $t$ in the overall corpus. Of course, as with cosine similarity, this is a somewhat ad hoc choice; in no sense have we mathematically derived this measure from a mathematical model of user behaviour. Yet in practice it works pretty well as a measure of importance. Intuitively, with this importance function, two document vectors $\vec d_1$ and $\vec d_2$ will be nearly parallel when they make use of a very similar vocabulary, and, in particular, when they use a similar vocabulary of unusual words, and the unusual words occur with similar (relative) frequencies in the two documents.

### Exercises

• Suppose the document $d_2$ is just the document $d_1$ concatenated with itself $3$ times. Show that $d_1$ and $d_2$ are parallel (i.e., $\theta = 0$) when the standard importance function is used.

### A toy Python implementation of the vector space model

Let’s now take a look at a toy Python (v2.6) implementation of a search engine that uses the vector space model. The code is straightforward – just a few dozen lines of non-boilerplate. Perhaps the most interesting bits are the data structures used, which are mostly set up in the header of the program, and in initalize_terms_and_postings, initialize_document_frequencies and initialize_lengths. The computation of cosine similarity is done in similarity. Here’s the source code (see also the source at GitHub):

"""vsm.py implements a toy search engine to illustrate the vector
space model for documents.

It asks you to enter a search query, and then returns all documents
matching the query, in decreasing order of cosine similarity,
according to the vector space model."""

from collections import defaultdict
import math
import sys

# We use a corpus of four documents.  Each document has an id, and
# these are the keys in the following dict.  The values are the
# corresponding filenames.
document_filenames = {0 : "documents/lotr.txt",
1 : "documents/silmarillion.txt",
2 : "documents/rainbows_end.txt",
3 : "documents/the_hobbit.txt"}

# The size of the corpus
N = len(document_filenames)

# dictionary: a set to contain all terms (i.e., words) in the document
# corpus.
dictionary = set()

# postings: a defaultdict whose keys are terms, and whose
# corresponding values are the so-called "postings list" for that
# term, i.e., the list of documents the term appears in.
#
# The way we implement the postings list is actually not as a Python
# list.  Rather, it's as a dict whose keys are the document ids of
# documents that the term appears in, with corresponding values equal
# to the frequency with which the term occurs in the document.
#
# As a result, postings[term] is the postings list for term, and
# postings[term][id] is the frequency with which term appears in
# document id.
postings = defaultdict(dict)

# document_frequency: a defaultdict whose keys are terms, with
# corresponding values equal to the number of documents which contain
# the key, i.e., the document frequency.
document_frequency = defaultdict(int)

# length: a defaultdict whose keys are document ids, with values equal
# to the Euclidean length of the corresponding document vector.
length = defaultdict(float)

# The list of characters (mostly, punctuation) we want to strip out of
# terms in the document.
characters = " .,!#\$%^&*();:\n\t\\\"?!{}[]<>"

def main():
initialize_terms_and_postings()
initialize_document_frequencies()
initialize_lengths()
while True:
do_search()

def initialize_terms_and_postings():
"""Reads in each document in document_filenames, splits it into a
list of terms (i.e., tokenizes it), adds new terms to the global
dictionary, and adds the document to the posting list for each
term, with value equal to the frequency of the term in the
document."""
global dictionary, postings
for id in document_filenames:
f = open(document_filenames[id],'r')
f.close()
terms = tokenize(document)
unique_terms = set(terms)
dictionary = dictionary.union(unique_terms)
for term in unique_terms:
postings[term][id] = terms.count(term) # the value is the
# frequency of the
# term in the
# document

def tokenize(document):
"""Returns a list whose elements are the separate terms in
document.  Something of a hack, but for the simple documents we're
using, it's okay.  Note that we case-fold when we tokenize, i.e.,
we lowercase everything."""
terms = document.lower().split()
return [term.strip(characters) for term in terms]

def initialize_document_frequencies():
"""For each term in the dictionary, count the number of documents
it appears in, and store the value in document_frequncy[term]."""
global document_frequency
for term in dictionary:
document_frequency[term] = len(postings[term])

def initialize_lengths():
"""Computes the length for each document."""
global length
for id in document_filenames:
l = 0
for term in dictionary:
l += imp(term,id)**2
length[id] = math.sqrt(l)

def imp(term,id):
"""Returns the importance of term in document id.  If the term
isn't in the document, then return 0."""
if id in postings[term]:
return postings[term][id]*inverse_document_frequency(term)
else:
return 0.0

def inverse_document_frequency(term):
"""Returns the inverse document frequency of term.  Note that if
term isn't in the dictionary then it returns 0, by convention."""
if term in dictionary:
return math.log(N/document_frequency[term],2)
else:
return 0.0

def do_search():
"""Asks the user what they would like to search for, and returns a
list of relevant documents, in decreasing order of cosine
similarity."""
query = tokenize(raw_input("Search query >> "))
if query == []:
sys.exit()
# find document ids containing all query terms.  Works by
# intersecting the posting lists for all query terms.
relevant_document_ids = intersection(
[set(postings[term].keys()) for term in query])
if not relevant_document_ids:
print "No documents matched all query terms."
else:
scores = sorted([(id,similarity(query,id))
for id in relevant_document_ids],
key=lambda x: x[1],
reverse=True)
print "Score: filename"
for (id,score) in scores:
print str(score)+": "+document_filenames[id]

def intersection(sets):
"""Returns the intersection of all sets in the list sets. Requires
that the list sets contains at least one element, otherwise it
raises an error."""
return reduce(set.intersection, [s for s in sets])

def similarity(query,id):
"""Returns the cosine similarity between query and document id.
Note that we don't bother dividing by the length of the query
vector, since this doesn't make any difference to the ordering of
search results."""
similarity = 0.0
for term in query:
if term in dictionary:
similarity += inverse_document_frequency(term)*imp(term,id)
similarity = similarity / length[id]
return similarity

if __name__ == "__main__":
main()


Try it out and see how it works. Here’s some typical output:

Search query >> lord of the rings
Score: filename
0.28036158811: documents/lotr.txt
0.0723574605292: documents/the_hobbit.txt


Not exactly unexpected, but gratifying: it correctly identifies The Lord of the Rings and The Hobbit as relevant, in the correct order, and omits The Silmarillion and Rainbows End. About the only surprise is the omission of The Silmarillion, which could plausibly have been in the list of results, but would presumably have ranked last. All in all, it’s about what we’d expect.

### Improvements and variations

There are many improvements and variations possible for the toy program I presented above. I won’t describe them in detail here, but will briefly mention a few.

Performance: The toy algorithm I’ve described doesn’t work very well for large numbers of documents. The problem is that it computes the cosine similarity for every single document in the corpus. If you have ten billion documents in your document corpus, then the last thing you want to be doing is computing ten billion inner products. Fortunately, you don’t have to. It’s possible to take a lot of shortcuts that dramatically speed up the process. I won’t describe these in detail here, but here’s a simple idea that should start to stimulate more ideas: for each term $t$ in the dictionary, it’s straightforward to precompute the top 1,000 (say) documents for that term. Then for a given multi-term query it’s pretty likely that the top search results will come from one of the pre-computed lists of top documents for the terms in the query.

Find what the user wants, not necessarily what they say they want: A user searching for “automobiles” probably also wants documents which talk about “cars”, one searching for “elefants” probably wants documents about “elephants”, and one looking for “moviemaking” might well be interested in documents about “moviemakers”. The system I described can and should be extended to deal with synonyms, with misspellings, and to ensure that we look for different forms of a word. I hope to deal with these topics in future posts.

Combine with information about a document’s importance: The model I’ve described treats all documents on an equal footing. In practice, we need to combine a notion of relevance like cosine similarity with some notion of the intrinsic importance of the document. A webpage from the New York Times shoud appear higher in search results than a webpage from a spammer, even if the spam page appears more relevant according to cosine similarity. So we need to extend the approach to incorporate some notion of relevance, such as PageRank.

The ultimate search engine: These are just a few of the many issues that must be dealt with by a search engine. Ultimately, the goal is to build a search engine that tells us what we need to know. That’s a very open-ended goal, and we can break it down in many different ways; I’ve just presented one analytical frame for thinking about it. In future posts I’ll return to investiage some of the issues raised above in more depth, and also to explore other ways of thinking about the problem of search.

Interested in more? Please subscribe to this blog, or follow me on Twitter. My new book about open science, Reinventing Discovery will be published in October, and can be pre-ordered at Amazon now.

In this post, I describe a simple but powerful framework for distributed computing called Pregel. Pregel was developed by Google, and is described in a 2010 paper written by seven Googlers. In 2009, the Google Research blog announced that the Pregel system was being used in dozens of applications within Google.

Pregel is a framework oriented toward graph-based algorithms. I won’t formally define graph-based algorithms here – we’ll see an example soon enough – but roughly speaking a graph-based algorithm is one which can be easily expressed in terms of the vertices of a graph, and their adjacent edges and vertices. Examples of problems which can be solved by graph-based algorithms include determining whether two vertices in a graph are connected, where there are clusters of connected vertices in a graph, and many other well-known graph problems. As a concrete example, in this post I describe how Pregel can be used to determine the PageRank of a web page.

What makes Pregel special is that it’s designed to scale very easily on a large-scale computer cluster. Typically, writing programs for clusters requires the programmer to get their hands dirty worrying about details of the cluster architecture, communication between machines in the cluster, considerations of fault-tolerance, and so on. The great thing about Pregel is that Pregel programs can be scaled (within limits) automatically on a cluster, without requiring the programmer to worry about the details of distributing the computation. Instead, they can concentrate on the algorithm they want to implement. In this, Pregel is similar to the MapReduce framework. Like MapReduce, Pregel gains this ability by concentrating on a narrow slice of problems. What makes Pregel interesting and different to MapReduce is that it is well-adapted to a somewhat different class of problems.

### Using Pregel to compute PageRank

A Pregel program takes as input a graph, with many vertices and (directed) edges. The graph might, for example, be the link graph of the web, with the vertices representing web pages, and the edges representing links between those pages. Each vertex is also initialized with a value. For our PageRank example, the value will just be an initial guess for the PageRank.

Since I’m using PageRank as an example, let me briefly remind you how PageRank works. Imagine a websurfer surfing the web. A simple model of surfing behaviour might involve them either: (1) following a random link from the page they’re currently on; or (2) deciding they’re bored, and “teleporting” to a completely random page elsewhere on the web. Furthermore, the websurfer chooses to do the first type of action with probability $0.85$, and the second type of action with probability $0.15$. If they repeat this random browsing behaviour enough times, it turns out that the probability they’re on a given webpage eventually converges to a steady value, regardless of where they started. This probability is the PageRank for the page, and, roughly speaking, it’s a measure of the importance of the page: the higher the PageRank, the more important the page.

I won’t get into any more of the ins-and-outs of PageRank, but will assume that you’re happy enough with the description above. If you’re looking for more details, I’ve written an extended introduction.

We’re going to use Pregel to compute PageRank. During the initialization stage for the graph we’ll start each vertex (i.e., webpage) off with an estimate for its PageRank. It won’t make any difference to the final result what that starting estimate it, so let’s just use a very easy-to-compute probability distribution, say, initializing each vertex with a value that is just one over the total number of web pages (i.e., the number of vertices).

Once the input graph is initialized, the Pregel computation proceeds through a series of supersteps. During each superstep each vertex does two things: (1) it updates its own value; and (2) it can send messages to adjacent vertices. The way it updates its own value is by computing a user-specified function which depends on the value of the vertex at the end of the previous superstep, as well as the messages sent to the vertex during the last superstep. Similarly, the messages the vertex sends can depend both on its value and the messages it was sent during the last superstep.

To make this more concrete, here’s some python-like pseudocode showing how Pregel can be used to compute the PageRank of a link graph:

import pregel # includes pregel.Vertex class which we'll subclass

class PageRankVertex(pregel.Vertex):

def update(self):
if self.superstep < 50:
# compute a new estimate of PageRank, based on the most recent
# estimated PageRank of adjacent vertices
self.value = 0.15 * 1/num_vertices + 0.85 * sum(self.messages)
# send the new estimated PageRank to adjacent vertices, dividing
# it equally between vertices
else:
# stop after 50 supersteps
self.active = False

initialize link structure for a 10-vertex graph
num_vertices = 10
for each vertex: # initialize values in the graph
vertex.value = 1/num_vertices
run pregel
output values for all vertices


The code should be self-explanatory: we initialize the graph, with each vertex starting with an initial estimate for its own PageRank. We update that by imagining the websurfer making a single step, either teleporting to a new random vertex, or else randomly following a link to one of the adjacent vertices. This is repeated many times - I've arbitrarily chosen 50 supersteps - before halting.

With the PageRank pseudocode under our belts, let's return to the bigger picture and summarize the basic Pregel model a bit more formally.

Input phase: The input to a Pregel computation is a set of vertices. Each vertex has an initial value, and may also have an associated set of outgoing edges.

Computation phase: This is split up into supersteps. In any given superstep each vertex can update its value and also its outgoing edges. It can also emit messages, which are sent to adjacent vertices. The updated value, outgoing edges and messages are determined by a user-defined function of the vertice's value, edges, and incoming messages at the start of the superstep.

Note that Google's Pregel system allows messages to be sent to any other vertex, not just adjacent vertices, although the paper implies that in most cases messages are usually sent only to adjacent vertices. Also, Google's Pregel system allows both vertices and edges to have values. I've omitted both these for simplicity in the code below, although both are easily restored.

Halting: Each vertex has an attribute which determines whether it is active or not. Vertices start active, but can change to inactive at any superstep. The computation halts when every vertex is inactive. The paper notes that inactive vertices can be reactivated, although it is a little vague on when this happens.

Pregel's synchronous nature makes Pregel programs easy to think about. Although updates are done at many vertices in any single superstep, it doesn't matter in what order the updates are done, or if they're done in parallel, because the update at any specific vertex doesn't affect the result of updates at other vertices. That means there's no possibility of race conditions arising.

### Single-machine Pregel library

I'll now describe a toy single-machine Pregel library, written in Python (v 2.6). The main additional feature beyond the description of Pregel given above is that this library partitions the vertices and assigns the different parts of the partition to workers, which in this implementation are separate Python threads. As we'll see below, on a cluster this idea is extended so the partitioned vertices aren't just assigned to different threads, but may be assigned to different machines. Here's the code (see also GitHub):

"""pregel.py is a python 2.6 module implementing a toy single-machine
version of Google's Pregel system for large-scale graph processing."""

import collections

class Vertex():

def __init__(self,id,value,out_vertices):
# This is mostly self-explanatory, but has a few quirks:
#
# self.id is included mainly because it's described in the
# Pregel paper.  It is used briefly in the pagerank example,
# but not in any essential way, and I was tempted to omit it.
#
# Each vertex stores the current superstep number in
# self.superstep.  It's arguably not wise to store many copies
# of global state in instance variables, but Pregel's
# synchronous nature lets us get away with it.
self.id = id
self.value = value
self.out_vertices = out_vertices
self.incoming_messages = []
self.outgoing_messages = []
self.active = True
self.superstep = 0

class Pregel():

def __init__(self,vertices,num_workers):
self.vertices = vertices
self.num_workers = num_workers

def run(self):
"""Runs the Pregel instance."""
self.partition = self.partition_vertices()
while self.check_active():
self.superstep()
self.redistribute_messages()

def partition_vertices(self):
"""Returns a dict with keys 0,...,self.num_workers-1
representing the worker threads.  The corresponding values are
lists of vertices assigned to that worker."""
partition = collections.defaultdict(list)
for vertex in self.vertices:
partition[self.worker(vertex)].append(vertex)
return partition

def worker(self,vertex):
"""Returns the id of the worker that vertex is assigned to."""
return hash(vertex) % self.num_workers

def superstep(self):
"""Completes a single superstep.

Note that in this implementation, worker threads are spawned,
and then destroyed during each superstep.  This creation and
destruction causes some overhead, and it would be better to
make the workers persistent, and to use a locking mechanism to
synchronize.  The Pregel paper suggests that this is how
workers = []
for vertex_list in self.partition.values():
worker = Worker(vertex_list)
workers.append(worker)
worker.start()
for worker in workers:
worker.join()

def redistribute_messages(self):
"""Updates the message lists for all vertices."""
for vertex in self.vertices:
vertex.superstep +=1
vertex.incoming_messages = []
for vertex in self.vertices:
for (receiving_vertix,message) in vertex.outgoing_messages:
receiving_vertix.incoming_messages.append((vertex,message))

def check_active(self):
"""Returns True if there are any active vertices, and False
otherwise."""
return any([vertex.active for vertex in self.vertices])

def __init__(self,vertices):
self.vertices = vertices

def run(self):
self.superstep()

def superstep(self):
"""Completes a single superstep for all the vertices in
self."""
for vertex in self.vertices:
if vertex.active:
vertex.update()


Here's the Python code for a computation of PageRank using both the Pregel library just given and, as a test, a more conventional matrix-based approach. You should not worry too much about the test code (at least initially), and concentrate on the bits related to Pregel.

"""pagerank.py illustrates how to use the pregel.py library, and tests
that the library works.

It illustrates pregel.py by computing the PageRank for a randomly
chosen 10-vertex web graph.

It tests pregel.py by computing the PageRank for the same graph in a
different, more conventional way, and showing that the two outputs are
near-identical."""

from pregel import *

# The next two imports are only needed for the test.
from numpy import *
import random

num_workers = 4
num_vertices = 10

def main():
vertices = [PageRankVertex(j,1.0/num_vertices,[])
for j in range(num_vertices)]
create_edges(vertices)
pr_test = pagerank_test(vertices)
print "Test computation of pagerank:\n%s" % pr_test
pr_pregel = pagerank_pregel(vertices)
print "Pregel computation of pagerank:\n%s" % pr_pregel
diff = pr_pregel-pr_test
print "Difference between the two pagerank vectors:\n%s" % diff
print "The norm of the difference is: %s" % linalg.norm(diff)

def create_edges(vertices):
"""Generates 4 randomly chosen outgoing edges from each vertex in
vertices."""
for vertex in vertices:
vertex.out_vertices = random.sample(vertices,4)

def pagerank_test(vertices):
"""Computes the pagerank vector associated to vertices, using a
standard matrix-theoretic approach to computing pagerank.  This is
used as a basis for comparison."""
I = mat(eye(num_vertices))
G = zeros((num_vertices,num_vertices))
for vertex in vertices:
num_out_vertices = len(vertex.out_vertices)
for out_vertex in vertex.out_vertices:
G[out_vertex.id,vertex.id] = 1.0/num_out_vertices
P = (1.0/num_vertices)*mat(ones((num_vertices,1)))
return 0.15*((I-0.85*G).I)*P

def pagerank_pregel(vertices):
"""Computes the pagerank vector associated to vertices, using
Pregel."""
p = Pregel(vertices,num_workers)
p.run()
return mat([vertex.value for vertex in p.vertices]).transpose()

class PageRankVertex(Vertex):

def update(self):
# This routine has a bug when there are pages with no outgoing
# links (never the case for our tests).  This problem can be
# solved by introducing Aggregators into the Pregel framework,
# but as an initial demonstration this works fine.
if self.superstep < 50:
self.value = 0.15 / num_vertices + 0.85*sum(
[pagerank for (vertex,pagerank) in self.incoming_messages])
outgoing_pagerank = self.value / len(self.out_vertices)
self.outgoing_messages = [(vertex,outgoing_pagerank)
for vertex in self.out_vertices]
else:
self.active = False

if __name__ == "__main__":
main()


You might observe that in the PageRank example the matrix-based code is no more complex than the Pregel-based code. So why bother with Pregel? What's different is that the Pregel code can, in principle, be automatically scaled on a cluster. And that's not so easy for the matrix-based approach.

The PageRank example also illustrates another point made in the Pregel paper. The authors note that users of Pregel at Google find it easy to program using Pregel once they begin to "think like a vertex". In a sense, the update method for instances of PageRankVertex is simply the vertex asking itself "What should I do in this superstep?"

There is, incidentally, a slight bug in the PageRank program. The bug arises because of a quirk in PageRank. Sometimes the random websurfer may come to a webpage that has no outgoing links. When that happens, they can't simply choose a random outgoing link. Instead, the PageRank algorithm is modified in this instance so that they always teleport to a page chosen completely at random.

A standard way of modelling this is to model a page with no outgoing links as a page that is linked to every single other page. It sounds counterintuitive, but it works - if it is connected to every single other page, then a random websurfer will, indeed, teleport to a completely random page.

Unfortunately, while this modification is fine in principle, it's not so good for Pregel. When Pregel is put on a cluster, it requires a lot of network overhead to send messages to large numbers of vertices. But we'll see below that there's a way of modifying Pregel so it can deal with this.

### Problems

• Write a Pregel program to determine whether two vertices in a graph are connected.

### Distributed implementation of Pregel

To implement Pregel on a cluster, we need a way of assigning vertices to different machines and threads in the cluster. This can be done using a hashing scheme, as was done in the code above to assign vertices to different worker threads. It can also be done using other approaches, if desired, such as consistent hashing.

Between supersteps, messages between vertices on different machines are passed over the network. For efficiency this is done (in part) asynchronously, during the superstep, as batches of vertices finish being updated. Provided the messages are short, and the graph is relatively sparse, as in the PageRank example, then the network overhead will be relatively small. Pregel will incur much more network overhead for highly-connected graphs, or if the messages are very large, or both.

In the case of PageRank, the network overhead can be reduced by changing the vertex-assignment procedure so that pages from the same domain are assigned to the same machine in the cluster. This reduces overhead because most links on the web are between pages from the same domain.

### Notes on implementation

Combiners: In some cases, network communication can be substantially reduced by combining messages. For instance, in computing PageRank, suppose vertices $v_1,v_2$ and $v_3$ on one machine all send messages to a vertex $v$ on another machine. It would cut down on bandwidth to simply add all those messages together before they're sent. Google's Pregel implementation allows the user to define a way to combine messages in this way. Note that in general there is no way for this combining to be done automatically - the user must specify how it is to be done.

Aggregators: These are a modification of the basic Pregel framework described above. The idea is that during each superstep each vertex can provide a value to a single central aggegrator. The aggregator combines all the values using a reducing function, and makes the result available to all vertices at the beginning of the next superstep. Aggregators could, for example, be used to check that the computation of PageRank is converging.

### Exercises

• Modify the pregel.py program so that Pregel instances can include aggregators, as described above.
• How could an aggregator be used to fix the bug in the PageRank example above, so webpages with no outgoing links are dealt with correctly?
• How could an aggregator be used to detect whether or not the computation of PageRank has converged, instead of doing 50 supersteps?

Fault-tolerance: For very large clusters, Pregel must be able to recover if one or more machines in the cluster die. To do this, Pregel uses a checkpointing procedure. The idea is that between supersteps Pregel occasionally saves the entire state of the graph to persistent storage. If a machine dies, Pregel returns to the most recent checkpoint, repartitions the vertices, reloads the state of the graph, and continues the computation. Note that there is a tradeoff between how frequently checkpoints are done and how often machines die. According to the paper the implementation tries to determine a checkpointing frequency that ensures minimal average cost.

Miscellanea:

• Google implement Pregel in C++.
• The state of the graph is held in-memory, although the authors note that buffered messages are sometimes held on local disk.
• Everything is co-ordinated by a single master machine.
• The authors note that they've scaled Pregel to run on clusters containing hundreds of machines, and with graphs containing billions of vertices.
• The authors compare Pregel to other graph-based frameworks, and note that theirs seems to be the only available framework which has been tested on clusters containing hundreds of machines.
• Pregel is based on the Bulk Synchronous Parallel model of programming developed by Les Valiant.
• There is a potential for bottlenecks when the computation at some vertex takes far longer than at most other vertices, stopping the superstep from completing. This may require careful design and profiling to avoid.
• The paper is a bit handwavy on why and when Pregel is superior to MapReduce, although it is quite clear that the authors believe this is often the case for graph-based algorithms.

Concluding thought: Pregel can be used to do much more than just computing PageRank. In the Pregel paper, the authors describe how Pregel can be used to solve other problems, including finding shortest paths in a graph, and finding clusters in social networks. I'll conclude by noting a fun problem that they don't address, but which I believe could be addressed using Pregel or a similar system:

### Problems

• Can you use Pregel to compute (say) book recommendations, based on a set of user-supplied rating data for books?

I did a little experiment to see what people discuss on the most popular blogs.

I downloaded Technorati’s list of the top 1,000 blogs, and crawled 50,000 pages from those blogs. Then I determined the percentage of pages containing words such as “sex”, “Einstein” and “Gaga”. As a baseline for comparison and sanity check, I figured out the percentage of pages containing the top ten most frequently used nouns in English:

1. time 94%
2. person 52%
3. year 71%
4. way 83%
5. day 91%
6. thing 78%
7. man 96%
8. world 70%
9. life 59%
10. hand 56%

To get these numbers I did the simplest possible pattern matching, ignoring case completely and just looking for strings anywhere inside a page’s html. This will throw up some false positives, especially for words like “day” and “man” which are part of longer words (“daydream”, “manual”), and such words will have inflated numbers (less so words like “person” and “year”).

Some stats for celebrities:

• Gaga 9%
• Bieber 5%
• LeBron 3%
• Tiger 5%

Note that I don’t distinguish between “tiger woods” and “tiger” the animal. Here’s a couple of scientists:

• Einstein 2%
• Feynman 0.1%

Einstein isn’t quite at the level of Lady Gaga et al, but he clearly remains a cultural icon. Feynman, by contrast, seems to be more an icon of geek culture. Indeed, I’d guess most of that 0.1% comes from the fact that quite a few of the top blogs have a geeky focus. Speaking of geeky focus, let’s take a look at a couple of technology companies:

• Apple 38%

The number for Google sounds high, but in fact a huge fraction of blog pages include things like Google Analytics, Feedburner, and so on. One day I may redo this on just the non-boilerplate plain text from a page. The numbers for Apple surprised me, even accounting for the fact that this includes mentions of apple-the-fruit. I can’t think of anything analogous to Google Analytics to account for the high numbers. Maybe people who read blogs really are obsessed with Apple.

Athletes like Tiger Woods and LeBron James to some extent transcend their sport. I was curious to see how they compare to the top athletes within a sport, and chose tennis: