Skip to content
Sep 26 12

Why Bloom filters work the way they do

by Michael Nielsen

Imagine you’re a programmer who is developing a new web browser. There are many malicious sites on the web, and you want your browser to warn users when they attempt to access dangerous sites. For example, suppose the user attempts to access http://domain/etc. You’d like a way of checking whether domain is known to be a malicious site. What’s a good way of doing this?

An obvious naive way is for your browser to maintain a list or set data structure containing all known malicious domains. A problem with this approach is that it may consume a considerable amount of memory. If you know of a million malicious domains, and domains need (say) an average of 20 bytes to store, then you need 20 megabytes of storage. That’s quite an overhead for a single feature in your web browser. Is there a better way?

In this post I’ll describe a data structure which provides an excellent way of solving this kind of problem. The data structure is known as a Bloom filter. Bloom filter are much more memory efficient than the naive “store-everything” approach, while remaining extremely fast. I’ll describe both how Bloom filters work, and also some extensions of Bloom filters to solve more general problems.

Most explanations of Bloom filters cut to the chase, quickly explaining the detailed mechanics of how Bloom filters work. Such explanations are informative, but I must admit that they made me uncomfortable when I was first learning about Bloom filters. In particular, I didn’t feel that they helped me understand why Bloom filters are put together the way they are. I couldn’t fathom the mindset that would lead someone to invent such a data structure. And that left me feeling that all I had was a superficial, surface-level understanding of Bloom filters.

In this post I take an unusual approach to explaining Bloom filters. We won’t begin with a full-blown explanation. Instead, I’ll gradually build up to the full data structure in stages. My goal is to tell a plausible story explaining how one could invent Bloom filters from scratch, with each step along the way more or less “obvious”. Of course, hindsight is 20-20, and such a story shouldn’t be taken too literally. Rather, the benefit of developing Bloom filters in this way is that it will deepen our understanding of why Bloom filters work in just the way they do. We’ll explore some alternative directions that plausibly could have been taken – and see why they don’t work as well as Bloom filters ultimately turn out to work. At the end we’ll understand much better why Bloom filters are constructed the way they are.

Of course, this means that if your goal is just to understand the mechanics of Bloom filters, then this post isn’t for you. Instead, I’d suggest looking at a more conventional introduction – the Wikipedia article, for example, perhaps in conjunction with an interactive demo, like the nice one here. But if your goal is to understand why Bloom filters work the way they do, then you may enjoy the post.

A stylistic note: Most of my posts are code-oriented. This post is much more focused on mathematical analysis and algebraic manipulation: the point isn’t code, but rather how one could come to invent a particular data structure. That is, it’s the story behind the code that implements Bloom filters, and as such it requires rather more attention to mathematical detail.

General description of the problem: Let’s begin by abstracting away from the “safe web browsing” problem that began this post. We want a data structure which represents a set S of objects. That data structure should enable two operations: (1) the ability to add an extra object x to the set; and (2) a test to determine whether a given object x is a member of S. Of course, there are many other operations we might imagine wanting – for example, maybe we’d also like to be able to delete objects from the set. But we’re going to start with just these two operations of adding and testing. Later we’ll come back and ask whether operations such as deleteing objects are also possible.

Idea: store a set of hashed objects: Okay, so how can we solve the problem of representing S in a way that’s more memory efficient than just storing all the objects in S? One idea is to store hashed versions of the objects in S, instead of the full objects. If the hash function is well chosen, then the hashed objects will take up much less memory, but there will be little danger of making errors when testing whether an object is an element of the set or not.

Let’s be a little more explicit about how this would work. We have a set S of objects x_0, x_1, \ldots, x_{|S|-1}, where |S| denotes the number of objects in S. For each object we compute an m-bit hash function h(x_j) – i.e., a hash function which takes an arbitrary object as input, and outputs m bits – and the set S is represented by the set \{ h(x_0), h(x_1), \ldots, h(x_{|S|-1}) \}. We can test whether x is an element of S by checking whether h(x) is in the set of hashes. This basic hashing approach requires roughly m |S| bits of memory.

(As an aside, in principle it’s possible to store the set of hashed objects more efficiently, using just m|S|-\log(|S|!) bits, where \log is to base two. The -\log(|S|!) saving is possible because the ordering of the objects in a set is redundant information, and so in principle can be eliminated using a suitable encoding. However, I haven’t thought through what encodings could be used to do this in practice. In any case, the saving is likely to be minimal, since \log(|S|!)  \approx |S| \log |S|, and m will usually be quite a bit bigger than \log |S| – if that weren’t the case, then hash collisions would occur all the time. So I’ll ignore the terms -\log(|S|!) for the rest of this post. In fact, in general I’ll be pretty cavalier in later analyses as well, omitting lower order terms without comment.)

A danger with this hash-based approach is that an object x outside the set S might have the same hash value as an object inside the set, i.e., h(x) = h(x_j) for some j. In this case, test will erroneously report that x is in S. That is, this data structure will give us a false positive. Fortunately, by choosing a suitable value for m, the number of bits output by the hash function, we can reduce the probability of a false positive as much as we want. To understand how this works, notice first that the probability of test giving a false positive is 1 minus the probability of test correctly reporting that x is not in S. This occurs when h(x) \neq h(x_j) for all j. If the hash function is well chosen, then the probability that h(x) \neq h(x_j) is (1-1/2^m) for each j, and these are independent events. Thus the probability of test failing is:

   p = 1-(1-1/2^m)^{|S|}.

This expression involves three quantities: the probability p of test giving a false positive, the number m of bits output by the hash function, and the number of elements in the set, |S|. It’s a nice expression, but it’s more enlightening when rewritten in a slightly different form. What we’d really like to understand is how many bits of memory are needed to represent a set of size |S|, with probability p of a test failing. To understand that we let \# be the number of bits of memory used, and aim to express \# as a function of p and |S|. Observe that \# = |S|m, and so we can substitute for m = \#/|S| to obtain

   p = 1-\left(1-1/2^{\#/|S|}\right)^{|S|}.

This can be rearranged to express \# in term of p and |S|:

   \# = |S|\log \frac{1}{1-(1-p)^{1/|S|}}.

This expression answers the question we really want answered, telling us how many bits are required to store a set of size |S| with a probability p of a test failing. Of course, in practice we’d like p to be small – say p = 0.01 – and when this is the case the expression may be approximated by a more transparent expression:

   \# \approx |S|\log \frac{|S|}{p}.

This makes intuitive sense: test failure occurs when x is not in S, but h(x) is in the hashed version of S. Because this happens with probability p, it must be that S occupies a fraction p of the total space of possible hash outputs. And so the size of the space of all possible hash outputs must be about |S|/p. As a consequence we need \log(|S|/p) bits to represent each hashed object, in agreement with the expression above.

How memory efficient is this hash-based approach to representing S? It’s obviously likely to be quite a bit better than storing full representations of the objects in S. But we’ll see later that Bloom filters can be far more memory efficient still.

The big drawback of this hash-based approach is the false positives. Still, for many applications it’s fine to have a small probability of a false positive. For example, false positives turn out to be okay for the safe web browsing problem. You might worry that false positives would cause some safe sites to erroneously be reported as unsafe, but the browser can avoid this by maintaining a (small!) list of safe sites which are false positives for test.

Idea: use a bit array: Suppose we want to represent some subset S of the integers 0, 1, 2, \ldots, 999. As an alternative to hashing or to storing S directly, we could represent S using an array of 1000 bits, numbered 0 through 999. We would set bits in the array to 1 if the corresponding number is in S, and otherwise set them to 0. It’s obviously trivial to add objects to S, and to test whether a particular object is in S or not.

The memory cost to store S in this bit-array approach is 1000 bits, regardless of how big or small |S| is. Suppose, for comparison, that we stored S directly as a list of 32-bit integers. Then the cost would be 32 |S| bits. When |S| is very small, this approach would be more memory efficient than using a bit array. But as |S| gets larger, storing |S| directly becomes much less memory efficient. We could ameliorate this somewhat by storing elements of S using only 10 bits, instead of 32 bits. But even if we did this, it would still be more expensive to store the list once |S| got beyond one hundred elements. So a bit array really would be better for modestly large subsets.

Idea: use a bit array where the indices are given by hashes: A problem with the bit array example described above is that we needed a way of numbering the possible elements of S, 0,1,\ldots,999. In general the elements of S may be complicated objects, not numbers in a small, well-defined range.

Fortunately, we can use hashing to number the elements of S. Suppose h(\cdot) is an m-bit hash function. We’re going to represent a set S = \{x_0,\ldots,x_{|S|-1}\} using a bit array containing 2^m elements. In particular, for each x_j we set the h(x_j)th element in the bit array, where we regard h(x_j) as a number in the range 0,1,\ldots,2^m-1. More explicitly, we can add an element x to the set by setting bit number h(x) in the bit array. And we can test whether x is an element of S by checking whether bit number h(x) in the bit array is set.

This is a good scheme, but the test can fail to give the correct result, which occurs whenever x is not an element of S, yet h(x) = h(x_j) for some j. This is exactly the same failure condition as for the basic hashing scheme we described earlier. By exactly the same reasoning as used then, the failure probability is

   [*] \,\,\,\, p = 1-(1-1/2^m)^{|S|}.

As we did earlier, we’d like to re-express this in terms of the number of bits of memory used, \#. This works differently than for the basic hashing scheme, since the number of bits of memory consumed by the current approach is \# = 2^m, as compared to \# = |S|m for the earlier scheme. Using \# = 2^m and substituting for m in Equation [*], we have:

   p = 1-(1-1/\#)^{|S|}.

Rearranging this to express \# in term of p and |S| we obtain:

   \# = \frac{1}{1-(1-p)^{1/|S|}}.

When p is small this can be approximated by

   \# \approx \frac{|S|}{p}.

This isn’t very memory efficient! We’d like the probability of failure p to be small, and that makes the 1/p dependence bad news when compared to the \log(|S|/p) dependence of the basic hashing scheme described earlier. The only time the current approach is better is when |S| is very, very large. To get some idea for just how large, if we want p = 0.01, then 1/p is only better than \log(|S|/p) when |S| gets to be more than about 1.27 * 10^{28}. That’s quite a set! In practice, the basic hashing scheme will be much more memory efficient.

Intuitively, it’s not hard to see why this approach is so memory inefficient compared to the basic hashing scheme. The problem is that with an m-bit hash function, the basic hashing scheme used m|S| bits of memory, while hashing into a bit array uses 2^m bits, but doesn’t change the probability of failure. That’s exponentially more memory!

At this point, hashing into bit arrays looks like a bad idea. But it turns out that by tweaking the idea just a little we can improve it a lot. To carry out this tweaking, it helps to name the data structure we’ve just described (where we hash into a bit array). We’ll call it a filter, anticipating the fact that it’s a precursor to the Bloom filter. I don’t know whether “filter” is a standard name, but in any case it’ll be a useful working name.

Idea: use multiple filters: How can we make the basic filter just described more memory efficient? One possibility is to try using multiple filters, based on independent hash functions. More precisely, the idea is to use k filters, each based on an (independent) m-bit hash function, h_0, h_1, \ldots, h_{k-1}. So our data structure will consist of k separate bit arrays, each containing 2^m bits, for a grand total of \# = k 2^m bits. We can add an element x by setting the h_0(x)th bit in the first bit array (i.e., the first filter), the h_1(x)th bit in the second filter, and so on. We can test whether a candidate element x is in the set by simply checking whether all the appropriate bits are set in each filter. For this to fail, each individual filter must fail. Because the hash functions are independent of one another, the probability of this is the kth power of any single filter failing:

   p = \left(1-(1-1/2^m)^{|S|}\right)^k.

The number of bits of memory used by this data structure is \# = k 2^m and so we can substitute 2^m = \#/k and rearrange to get

   [**] \,\,\,\, \# = \frac{k}{1-(1-p^{1/k})^{1/|S|}}.

Provided p^{1/k} is much smaller than 1, this expression can be simplified to give

   \# \approx \frac{k|S|}{p^{1/k}}.

Good news! This repetition strategy is much more memory efficient than a single filter, at least for small values of k. For instance, moving from k = 1 repetitions to k = 2 repititions changes the denominator from p to \sqrt{p} – typically, a huge improvement, since p is very small. And the only price paid is doubling the numerator. So this is a big win.

Intuitively, and in retrospect, this result is not so surprising. Putting multiple filters in a row, the probability of error drops exponentially with the number of filters. By contrast, in the single filter scheme, the drop in the probability of error is roughly linear with the number of bits. (This follows from considering Equation [*] in the limit where 1/2^m is small.) So using multiple filters is a good strategy.

Of course, a caveat to the last paragraph is that this analysis requires that p^{1/k} \ll 1, which means that k can’t be too large before the analysis breaks down. For larger values of k the analysis is somewhat more complicated. In order to find the optimal value of k we’d need to figure out what value of k minimizes the exact expression [**] for \#. We won’t bother – at best it’d be tedious, and, as we’ll see shortly, there is in any case a better approach.

Overlapping filters: This is a variation on the idea of repeating filters. Instead of having k separate bit arrays, we use just a single array of 2^m bits. When adding an object x, we simply set all the bits h_0(x), h_1(x),\ldots, h_{k-1}(x) in the same bit array. To test whether an element x is in the set, we simply check whether all the bits h_0(x), h_1(x),\ldots, h_{k-1}(x) are set or not.

What’s the probability of the test failing? Suppose x \not \in S. Failure occurs when h_0(x) = h_{i_0}(x_{j_0}) for some i_0 and j_0, and also h_1(x) = h_{i_1}(x_{j_1}) for some i_1 and j_1, and so on for all the remaining hash functions, h_2, h_3,\ldots, h_{k-1}. These are independent events, and so the probability they all occur is just the product of the probabilities of the individual events. A little thought should convince you that each individual event will have the same probability, and so we can just focus on computing the probability that h_0(x) = h_{i_0}(x_{j_0}) for some i_0 and j_0. The overall probability p of failure will then be the kth power of that probability, i.e.,

   p = p(h_0(x) = h_{i_0}(x_{j_0}) \mbox{ for some } i_0,j_0)^k

The probability that h_0(x) = h_{i_0}(x_{j_0}) for some i_0 and j_0 is one minus the probability that h_0(x) \neq h_{i_0}(x_{j_0}) for all i_0 and j_0. These are independent events for the different possible values of i_0 and j_0, each with probability 1-1/2^m, and so

   p(h_0(x) = h_{i_0}(x_{j_0}) \mbox{ for some } i_0,j_0) = 1-(1-1/2^m )^{k|S|},

since there are k|S| different pairs of possible values (i_0, j_0). It follows that

   p = \left(1-(1-1/2^m )^{k|S|}\right)^k.

Substituting 2^m = \# we obtain

   p = \left(1-(1-1/\# )^{k|S|}\right)^k

which can be rearranged to obtain

   \# = \frac{1}{1-(1-p^{1/k})^{1/k|S|}}.

This is remarkably similar to the expression [**] derived above for repeating filters. In fact, provided p^{1/k} is much smaller than 1, we get

   \# \approx \frac{k|S|}{p^{1/k}},

which is exactly the same as [**] when p^{1/k} is small. So this approach gives quite similar outcomes to the repeating filter strategy.

Which approach is better, repeating or overlapping filters? In fact, it can be shown that

   \frac{1}{1-(1-p^{1/k})^{1/k|S|}} \leq \frac{k}{1-(1-p^{1/k})^{1/|S|}},

and so the overlapping filter strategy is more memory efficient than the repeating filter strategy. I won’t prove the inequality here – it’s a straightforward (albeit tedious) exercise in calculus. The important takeaway is that overlapping filters are the more memory-efficient approach.

How do overlapping filters compare to our first approach, the basic hashing strategy? I’ll defer a full answer until later, but we can get some insight by choosing p = 0.0001 and k=4. Then for the overlapping filter we get \# \approx 40|S|, while the basic hashing strategy gives \# = |S| \log( 10000 |S|). Basic hashing is worse whenever |S| is more than about 100 million – a big number, but also a big improvement over the 1.27 * 10^{28} required by a single filter. Given that we haven’t yet made any attempt to optimize k, this ought to encourage us that we’re onto something.

Problems for the author

  • I suspect that there’s a simple intuitive argument that would let us see upfront that overlapping filters will be more memory efficient than repeating filters. Can I find such an argument?

Bloom filters: We’re finally ready for Bloom filters. In fact, Bloom filters involve only a few small changes to overlapping filters. In describing overlapping filters we hashed into a bit array containing 2^m bits. We could, instead, have used hash functions with a range 0,\ldots,M-1 and hashed into a bit array of M (instead of 2^m) bits. The analysis goes through unchanged if we do this, and we end up with

   p = \left(1-(1-1/\# )^{k|S|}\right)^k

and

   \# = \frac{1}{1-(1-p^{1/k})^{1/k|S|}},

exactly as before. The only reason I didn’t do this earlier is because in deriving Equation [*] above it was convenient to re-use the reasoning from the basic hashing scheme, where m (not M) was the convenient parameter to use. But the exact same reasoning works.

What’s the best value of k to choose? Put another way, what value of k should we choose in order to minimize the number of bits, \#, given a particular value for the probability of error, p, and a particular sizek |S|? Equivalently, what value of k will minimize p, given \# and |S|? I won’t go through the full analysis here, but with calculus and some algebra you can show that choosing

   k \approx \frac{\#}{|S|} \ln 2

minimizes the probability p. (Note that \ln denotes the natural logarithm, not logarithms to base 2.) By choosing k in this way we get:

   [***] \,\,\,\, \# = \frac{|S|}{\ln 2} \log \frac{1}{p}.

This really is good news! Not only is it better than a bit array, it’s actually (usually) much better than the basic hashing scheme we began with. In particular, it will be better whenever

   \frac{1}{\ln 2} \log \frac{1}{p} \leq \log \frac{|S|}{p},

which is equivalent to requiring

   |S| \geq p^{1-1/\ln 2} \approx \frac{1}{p^{0.44}}.

If we want (say) p=0.01 this means that Bloom filter will be better whenever |S| \geq 8, which is obviously an extremely modest set size.

Another way of interpreting [***] is that a Bloom filter requires \frac{1}{\ln 2} \log \frac{1}{p} \approx 1.44 \log \frac{1}{p} bits per element of the set being represented. In fact, it’s possible to prove that any data structure supporting the add and test operations will require at least \log \frac{1}{p} bits per element in the set. This means that Bloom filters are near-optimal. Futher work has been done finding even more memory-efficient data structures that actually meet the \log \frac{1}{p} bound. See, for example, the paper by Anna Pagh, Rasmus Pagh, and S. Srinivasa Rao.

Problems for the author

  • Are the more memory-efficient algorithms practical? Should we be using them?

In actual applications of Bloom filters, we won’t know S in advance, nor |S|. So the way we usually specify a Bloom filter is to specify the maximum size n of set that we’d like to be able to represent, and the maximal probability of error, p, that we’re willing to tolerate. Then we choose

   \# = \frac{n}{\ln 2} \log \frac{1}{p}

and

   k = \ln \frac{1}{p}.

This gives us a Bloom filter capable of representing any set up to size n, with probability of error guaranteed to be at most p. The size n is called the capacity of the Bloom filter. Actually, these expressions are slight simplifications, since the terms on the right may not be integers – to be a little more pedantic, we choose

   \# = \lceil \frac{n}{\ln 2} \log \frac{1}{p} \rceil

and

   k = \lceil \ln \frac{1}{p} \rceil.

One thing that still bugs me about Bloom filters is the expression for the optimal value for k. I don’t have a good intuition for it – why is it logarithmic in 1/p, and why does it not depend on |S|? There’s a tradeoff going on here that’s quite strange when you think about it: bit arrays on their own aren’t very good, but if you repeat or overlap them just the right number of times, then performance improves a lot. And so you can think of Bloom filters as a kind of compromise between an overlap strategy and a bit array strategy. But it’s really not at all obvious (a) why choosing a compromise strategy is the best; or (b) why the right point at which to compromise is where it is, i.e., why k has the form it does. I can’t quite answer these questions at this point – I can’t see that far through Bloom filters. I suspect that understanding the k = 2 case really well would help, but haven’t put in the work. Anyone with more insight is welcome to speak up!

Summing up Bloom filters: Let’s collect everything together. Suppose we want a Bloom filter with capacity n, i.e., capable of representing any set S containing up to n elements, and such that test produces a false positive with probability at most p. Then we choose

   k = \lceil \ln \frac{1}{p} \rceil

independent hash functions, h_0, h_1, \ldots, h_{k-1}. Each hash function has a range 0,\ldots,\#-1, where \# is the number of bits of memory our Bloom filter requires,

   \# = \lceil \frac{n}{\ln 2} \log \frac{1}{p} \rceil.

We number the bits in our Bloom filter from 0,\ldots,\#-1. To add an element x to our set we set the bits h_0(x), h_1(x), \ldots, h_{\#-1}(x) in the filter. And to test whether a given element x is in the set we simply check whether bits h_0(x), h_1(x), \ldots, h_{\#-1}(x) in the bit array are all set.

That’s all there is to the mechanics of how Bloom filters work! I won’t give any sample code – I usually provide code samples in Python, but the Python standard library lacks bit arrays, so nearly all of the code would be concerned with defining a bit array class. That didn’t seem like it’d be terribly illuminating. Of course, it’s not difficult to find libraries implementing Bloom filters. For example, Jason Davies has written a javascript Bloom filter which has a fun and informative online interactive visualisation. I recommend checking it out. And I’ve personally used Mike Axiak‘s fast C-based Python library pybloomfiltermmap – the documentation is clear, it took just a few minutes to get up and running, and I’ve generally had no problems.

Problems

  • Suppose we have two Bloom filters, corresponding to sets S_1 and S_2. How can we construct the Bloom filters corresponding to the sets S_1 \cup S_2 and S_1 \cap S_2?

Applications of Bloom filters: Bloom filters have been used to solve many different problems. Here’s just a few examples to give the flavour of how they can be used. An early idea was Manber and Wu’s 1994 proposal to use Bloom filters to store lists of weak passwords. Google’s BigTable storage system uses Bloom filters to speed up queries, by avoiding disk accesses for rows or columns that don’t exist. Google Chrome uses Bloom filters to do safe web browsing – the opening example in this post was quite real! More generally, it’s useful to consider using Bloom filters whenever a large collection of objects needs to be stored. They’re not appropriate for all purposes, but at the least it’s worth thinking about whether or not a Bloom filter can be applied.

Extensions of Bloom filters: There’s many clever ways of extending Bloom filters. I’ll briefly describe one, just to give you the flavour, and provide links to several more.

A delete operation: It’s possible to modify Bloom filters so they support a delete operation that lets you remove an element from the set. You can’t do this with a standard Bloom filter: it would require unsetting one or more of the bits h_0(x), h_1(x), \ldots in the bit array. This could easily lead us to accidentally delete other elements in the set as well.

Instead, the delete operation is implemented using an idea known as a counting Bloom filter. The basic idea is to take a standard Bloom filter, and replace each bit in the bit array by a bucket containing several bits (usually 3 or 4 bits). We’re going to treat those buckets as counters, initially set to 0. We add an element x to the counting Bloom filter by incrementing each of the buckets numbered h_0(x), h_1(x), \ldots. We test whether x is in the counting Bloom filter by looking to see whether each of the corresponding buckets are non-zero. And we delete x by decrementing each bucket.

This strategy avoids the accidental deletion problem, because when two elements of the set x and y hash into the same bucket, the count in that bucket will be at least 2. deleteing one of the elements, say x, will still leave the count for the bucket at least 1, so y won’t be accidentally deleted. Of course, you could worry that this will lead us to erroneously conclude that x is still in the set after it’s been deleted. But that can only happen if other elements in the set hash into every single bucket that x hashes into. That will only happen if |S| is very large.

Of course, that’s just the basic idea behind counting Bloom filters. A full analysis requires us to understand issues such as bucket overflow (when a counter gets incremented too many times), the optimal size for buckets, the probability of errors, and so on. I won’t get into that, but you there’s details in the further reading, below.

Other variations and further reading: There are many more variations on Bloom filters. Just to give you the flavour of a few applications: (1) they can be modified to be used as lookup dictionaries, associating a value with each element added to the filter; (2) they can be modified so that the capacity scales up dynamically; and (3) they can be used to quickly approximate the number of elements in a set. There are many more variations as well: Bloom filters have turned out to be a very generative idea! This is part of why it’s useful to understand them deeply, since even if a standard Bloom filter can’t solve the particular problem you’re considering, it may be possible to come up with a variation which does. You can get some idea of the scope of the known variations by looking at the Wikipedia article. I also like the survey article by Andrei Broder and Michael Mitzenmacher. It’s a little more dated (2004) than the Wikipedia article, but nicely written and a good introduction. For a shorter introduction to some variations, there’s also a recent blog post by Matthias Vallentin. You can get the flavour of current research by looking at some of the papers citing Bloom filters here. Finally, you may enjoy reading the original paper on Bloom filters, as well as the original paper on counting Bloom filters.

Understanding data structures: I wrote this post because I recently realized that I didn’t understand any complex data structure in any sort of depth. There are, of course, a huge number of striking data structures in computer science – just look at Wikipedia’s amazing list! And while I’m familiar with many of the simpler data structures, I’m ignorant of most complex data structures. There’s nothing wrong with that – unless one is a specialist in data structures there’s no need to master a long laundry list. But what bothered me is that I hadn’t thoroughly mastered even a single complex data structure. In some sense, I didn’t know what it means to understand a complex data structure, at least beyond surface mechanics. By trying to reinvent Bloom filters, I’ve found that I’ve deepened my own understanding and, I hope, written something of interest to others.

Interested in more? Please subscribe to this blog, or follow me on Twitter. You may also enjoy reading my new book about open science, Reinventing Discovery.

Aug 10 12

How to crawl a quarter billion webpages in 40 hours

by Michael Nielsen

More precisely, I crawled 250,113,669 pages for just under 580 dollars in 39 hours and 25 minutes, using 20 Amazon EC2 machine instances.

I carried out this project because (among several other reasons) I wanted to understand what resources are required to crawl a small but non-trivial fraction of the web. In this post I describe some details of what I did. Of course, there’s nothing especially new: I wrote a vanilla (distributed) crawler, mostly to teach myself something about crawling and distributed computing. Still, I learned some lessons that may be of interest to a few others, and so in this post I describe what I did. The post also mixes in some personal working notes, for my own future reference.

What does it mean to crawl a non-trivial fraction of the web? In fact, the notion of a “non-trivial fraction of the web” isn’t well defined. Many websites generate pages dynamically, in response to user input – for example, Google’s search results pages are dynamically generated in response to the user’s search query. Because of this it doesn’t make much sense to say there are so-and-so many billion or trillion pages on the web. This, in turn, makes it difficult to say precisely what is meant by “a non-trivial fraction of the web”. However, as a reasonable proxy for the size of the web we can use the number of webpages indexed by large search engines. According to this presentation by Googler Jeff Dean, as of November 2010 Google was indexing “tens of billions of pages”. (Note that the number of urls is in the trillions, apparently because of duplicated page content, and multiple urls pointing to the same content.) The now-defunct search engine Cuil claimed to index 120 billion pages. By comparison, a quarter billion is, obviously, very small. Still, it seemed to me like an encouraging start.

Code: Originally I intended to make the crawler code available under an open source license at GitHub. However, as I better understood the cost that crawlers impose on websites, I began to have reservations. My crawler is designed to be polite and impose relatively little burden on any single website, but could (like many crawlers) easily be modified by thoughtless or malicious people to impose a heavy burden on sites. Because of this I’ve decided to postpone (possibly indefinitely) releasing the code.

There’s a more general issue here, which is this: who gets to crawl the web? Relatively few sites exclude crawlers from companies such as Google and Microsoft. But there are a lot of crawlers out there, many of them without much respect for the needs of individual siteowners. Quite reasonably, many siteowners take an aggressive approach to shutting down activity from less well-known crawlers. A possible side effect is that if this becomes too common at some point in the future, then it may impede the development of useful new services, which need to crawl the web. A possible long-term solution may be services like Common Crawl, which provide access to a common corpus of crawl data.

I’d be interested to hear other people’s thoughts on this issue.

Architecture: Here’s the basic architecture:

The master machine (my laptop) begins by downloading Alexa’s list of the top million domains. These were used both as a domain whitelist for the crawler, and to generate a starting list of seed urls.

The domain whitelist was partitioned across the 20 EC2 machine instances in the crawler. This was done by numbering the instances 0, 1, 2, \ldots, 19 and then allocating the domain domain to instance number hash(domain) % 20, where hash is the standard Python hash function.

Deployment and management of the cluster was handled using Fabric, a well-documented and nicely designed Python library which streamlines the use of ssh over clusters of machines. I managed the connection to Amazon EC2 using a set of Python scripts I wrote, which wrap the boto library.

I used 20 Amazon EC2 extra large instances, running Ubuntu 11.04 (Natty Narwhal) under the ami-68ad5201 Amazon machine image provided by Canonical. I used the extra large instance after testing on several instance types; the extra large instances provided (marginally) more pages downloaded per dollar spent. I used the US East (North Virginia) region, because it’s the least expensive of Amazon’s regions (along with the US West, Oregon region).

Single instance architecture: Each instance further partitioned its domain whitelist into 141 separate blocks of domains, and launched 141 Python threads, with each thread responsible for crawling the domains in one block. Here’s how it worked (details below):

The reason for using threads is that the Python standard library uses blocking I/O to handle http network connections. This means that a single-threaded crawler would spend most of its time idling, usually waiting on the network connection of the remote machine being crawled. It’s much better to use a multi-threaded crawler, which can make fuller use of the resources available on an EC2 instance. I chose the number of crawler threads (141) empirically: I kept increasing the number of threads until the speed of the crawler started to saturate. With this number of threads the crawler was using a considerable fraction of the CPU capacity available on the EC2 instance. My informal testing suggested that it was CPU which was the limiting factor, but that I was not so far away from the network and disk speed becoming bottlenecks; in this sense, the EC2 extra large instance was a good compromise. Memory useage was never an issue. It’s possible that for this reason EC2′s high-CPU extra large instance type would have been a better choice; I only experimented with this instance type with early versions of the crawler, which were more memory-limited.

How domains were allocated across threads: The threads were numbered 0, 1, \ldots, 140, and domains were allocated on the basis of the Python hash function, to thread number hash(domain) % 141 (similar to the allocation across machines in the cluster). Once the whitelisted domains / seed urls were allocated to threads, the crawl was done in a simple breadth-first fashion, i.e., for each seed url we download the corresponding web page, extract the linked urls, and check each url to see: (a) whether the extracted url is a fresh url which has not already been seen and added to the url frontier; and (b) whether the extracted url is in the same seed domain as the page which has just been crawled. If both these conditions are met, the url is added to the url frontier for the current thread, otherwise the url is discarded. With this architecture we are essentially carrying out a very large number of independent crawls of the whitelisted domains obtained from Alexa.

Note that this architecture also ensures that if, for example, we are crawling a page from TechCrunch, and extract from that page a link to the Huffington Post, then the latter link will be discarded, even though the Huffington Post is in our domain whitelist. The only links added to the url frontier will be those that point back to TechCrunch itself. The reason we avoid adding dealing with (whitelisted) external links is because: (a) it may require communication between different EC2 instances, which would substantially complicate the crawler; and, more importantly, (b) in practice, most sites have lots of internal links, and so it’s unlikely that this policy means the crawler is missing much.

One advantage of allocating all urls from the same domain to the same crawler thread is that it makes it much easier to crawl politely, since no more than one connection to a site will be open at any given time. In particular, this ensures that we won’t be hammering any given domain with many simultaneous connections from different threads (or different machines).

Problems for the author

  • For some very large and rapidly changing websites it may be necessary to open multiple simultaneous connections in order for the crawl to keep up with the changes on the site. How can we decide when that is appropriate?

How the url frontiers work: A separate url frontier file was maintained for each domain. This was simply a text file, with each line containing a single url to be crawled; initially, the file contains just a single line, with the seed url for the domain. I spoke above of the url frontier for a thread; that frontier can be thought of as the combination of all the url frontier files for domains being crawled by that thread.

Each thread maintained a connection to a redis server. For each domain being crawled by the thread a redis key-value pair was used to keep track of the current position in the url frontier file for that domain. I used redis (and the Python bindings) to store this information in a fashion that was both persistent and fast to look up. The persistence was important because it meant that the crawler could be stopped and started at will, without losing track of where it was in the url frontier.

Each thread also maintained a dictionary whose keys were the (hashed) domains for that thread. The corresponding values were the next time it would be polite to crawl that domain. This value was set to be 70 seconds after the last time the domain was crawled, to ensure that domains weren’t getting hit too often. The crawler thread simply iterated over the keys in this dictionary, looking for the next domain it was polite to crawl. Once it found such a domain it then extracted the next url from the url frontier for that domain, and went about downloading that page. If the url frontier was exhausted (some domains run out of pages to crawl) then the domain key was removed from the dictionary. One limitation of this design was that when restarting the crawler each thread had to identify again which domains had already been exhausted and should be deleted from the dictionary. This slowed down the restart a little, and is something I’d modify if I were to do further work with the crawler.

Use of a Bloom filter: I used a Bloom filter to keep track of which urls had already been seen and added to the url frontier. This enabled a very fast check of whether or not a new candidate url should be added to the url frontier, with only a low probability of erroneously adding a url that had already been added. This was done using Mike Axiak‘s very nice C-based pybloomfiltermmap.

Anticipated versus unanticipated errors: Because the crawler ingests input from external sources, it needs to deal with many potential errors. By design, there are two broad classes of error: anticipated errors and unanticipated errors.

Anticipated errors are things like a page failing to download, or timing out, or containing unparseable input, or a robots.txt file disallowing crawling of a page. When anticipated errors arise, the crawler writes the error to a (per-thread) informational log (the “info log” in the diagram above), and continues in whatever way is appropriate. For example, if the robots.txt file disallows crawling then we simply continue to the next url in the url frontier.

Unanticipated errors are errors which haven’t been anticipated and designed for. Rather than the crawler falling over, the crawler simply logs the error (to the “critical log” in the diagram above), and moves on to the next url in the url frontier. At the same time, the crawler tracks how many unanticipated errors have occurred in close succession. If many unanticipated errors occur in close succession it usually indicates that some key piece of infrastructure has failed. Because of this, if there are too many unanticipated errors in close succession, the crawler shuts down entirely.

As I was developing and testing the crawler, I closely followed the unanticipated errors logged in the critical log. This enabled me to understand many of the problems faced by the crawler. For example, early on in development I found that sometimes the html for a page would be so badly formed that the html parser would have little choice but to raise an exception. As I came to understand such errors I would rewrite the crawler code so such errors become anticipated errors that were handled as gracefully as possible. Thus, the natural tendency during development was for unanticipated errors to become anticipated errors.

Domain and subdomain handling: As mentioned above, the crawler works by doing lots of parallel intra-domain crawls. This works well, but a problem arises because of the widespread use of subdomains. For example, if we start at the seed url http://barclays.com and crawl only urls within the barclays.com domain, then we quickly run out of urls to crawl. The reason is that most of the internal links on the barclays.com site are actually to group.barclays.com, not barclays.com. Our crawler should also add urls from the latter domain to the url frontier for barclays.com.

We resolve this by stripping out all subdomains, and working with the stripped domains when deciding whether to add a url to the url frontier. Removing subdomains turns out to be a surprisingly hard problem, because of variations in the way domain names are formed. Fortunately, the problem seems to be well solved using John Kurkowski’s tldextract library.

On the representation of the url frontier: I noted above that a separate url frontier file was maintained for each domain. In an early version of the code, each crawler thread had a url frontier maintained as a single flat text file. As a crawler thread read out lines in the file, it would crawl those urls, and append any new urls found to the end of the file.

This approach seemed natural to me, but organizing the url frontier files on a per-thread (rather than per-domain) basis caused a surprising number of problems. As the crawler thread moved through the file to find the next url to crawl, the crawler thread would encounter urls belonging to domains that were not yet polite to crawl because they’d been crawled too recently. My initial strategy was simply to append such urls to the end of the file, so they would be found again later. Unfortunately, there were often a lot of such urls in a row – consecutive urls often came from the same domain (since they’d been extracted from the same page). And so this strategy caused the file for the url frontier to grow very rapidly, eventually consuming most disk space.

Exacerbating this problem, this approach to the url frontier caused an unforseen “domain clumping problem”. To understand this problem, imagine that the crawler thread encountered (say) 20 consecutive urls from a single domain. It might crawl the first of these, extracting (say) 20 extra urls to append to the end of the url frontier. But the next 19 urls would all be skipped over, since it wouldn’t yet be polite to crawl them, and they’d also be appended to the end of the url frontier. Now we have 39 urls from the same domain at the end of the url frontier. But when the crawler thread gets to those, we may well have the same process repeat – leading to a clump of 58 urls from the same domain at the end of the file. And so on, leading to very long runs of urls from the same domain. This consumes lots of disk space, and also slows down the crawler, since the crawler thread may need to examine a large number of urls before it finds a new url it’s okay to crawl.

These problems could have been solved in various ways; moving to the per-domain url frontier file was how I chose to address the problems, and it seemed to work well.

Choice of number of threads: I mentioned above that the number of crawler threads (141) was chosen empirically. However, there is an important constraint on that number, and in particular its relationship to the number (20) of EC2 instances being used. Suppose that instead of 141 threads I’d used (say) 60 threads. This would create a problem. To see why, note that any domain allocated to instance number 7 (say) would necessarily satisfy hash(domain) % 20 = 7. This would imply that hash(domain) % 60 = 7 or 27 or 47, and as a consequence all the domains would be allocated to just one of three crawler threads (thread numbers 7, 27 and 47), while the other 57 crawler threads would lie idle, defeating the purpose of using multiple threads.

One way to solve this problem would be to use two independent hash functions to allocate domains to EC2 instances and crawler threads. However, an even simpler way of solving the problem is to choose the number of crawler threads to be coprime to the number of EC2 instances. This coprimality ensures that domains will be allocated reasonably evenly across both instance and threads. (I won’t prove this here, but it can be proved with a little effort). It is easily checked that 141 and 20 are coprime.

Note, incidentally, that Python’s hash is not a true hash function, in the sense that it doesn’t guarantee that the domains will be spread evenly across EC2 instances. It turns out that Python’s hash takes similar key strings to similar hash values. I talk more about this point (with examples) in the fifth paragraph of this post. However, I found empirically that hash seems to spread domains evenly enough across instances, and so I didn’t worry about using a better (but slower) hash function, like those available through Python’s hashlib library.

Use of Python: All my code was written in Python. Initially, I wondered if Python might be too slow, and create bottlenecks in the crawling. However, profiling the crawler showed that most time was spent either (a) managing network connections and downloading data; or (b) parsing the resulting webpages. The parsing of the webpages was being done using lxml, a Python binding to fast underlying C libraries. It didn’t seem likely to be easy to speed that up, and so I concluded that Python was likely not a particular bottleneck in the crawling.

Politeness: The crawler used Python’s robotparser library in order to observe the robots exclusion protocol. As noted above, I also imposed an absolute 70-second minimum time interval between accesses to any given domain. In practice, the mean time between accesses was more like 3-4 minutes.

In initial test runs of the crawler I got occasional emails from webmasters asking for an explanation of why I was crawling their site. Because of this, in the crawler’s User-agent I included a link to a webpage explaining the purpose of my crawler, how to exclude it from a site, and what steps I was taking to crawl politely. This was (I presume) both helpful to webmasters and also helpful to me, for it reduced the number of inquiries. A handful of people asked me to exclude their sites from the crawl, and I complied quickly.

Problems for the author

  • Because my crawl didn’t take too long, the robots.txt file was downloaded just once for each domain, at the beginning of the crawl. In a longer crawl, how should we decide how long to wait between downloads of robots.txt?

Truncation: The crawler truncates large webpages rather than downloading the full page. It does this in part because it’s necessary – it really wouldn’t surprise me if someone has a terabyte html file sitting on a server somewhere – and in part because for many applications it will be of more interest to focus on earlier parts of the page.

What’s a reasonable threshold for truncation? According to this report from Google, as of May 2010 the average network size of a webpage from a top site is 312.04 kb. However, that includes images, scripts and stylesheets, which the crawler ignores. If you ignore the images and so on, then the average network size drops to just 33.66 kb.

However, that number of 33.66 kb is for content which may be served compressed over the network. Our truncation will be based on the uncompressed size. Unfortunately, the Google report doesn’t tell us what the average size of the uncompressed content is. However, we can get an estimate of this, since Google reports that the average uncompressed size of the total page (including images and so on) is 477.26 kb, while the average network size is 312.04 kb.

Assuming that this compression ratio is typical, we estimate that the average uncompressed size of the content the crawler downloads is 51 kb. In the event, I experimented with several truncation settings, and found that a truncation threshold of 200 kilobytes enabled me to download the great majority of webpages in their entirety, while addressing the problem of very large html files mentioned above. (Unfortunately, I didn’t think to check what the actual average uncompressed size was, my mistake.)

Storage: I stored all the data using EC2′s built-in instance storage – 1.69 Terabytes for the extra-large instances I was using. This storage is non-persistent, and so any data stored on an instance will vanish when that instance is terminated. Now, for many kinds of streaming or short-term analysis of data this would be adequate – indeed, it might not even be necessary to store the data at all. But, of course, for many applications of a crawl this approach is not appropriate, and the instance storage should be supplemented with something more permanent, such as S3. For my purposes using the instance storage seemed fine.

Price: The price broke down into two components: (1) 512 dollars for the use of the 20 extra-large EC2 instances for 40 hours; and (2) about 65 dollars for a little over 500 gigabytes of outgoing bandwidth, used to make http requests. Note that Amazon does not charge for incoming bandwidth (a good thing, too!) It would be interesting to compare these costs to the (appropriately amortized) costs of using other cloud providers, or self-hosting.

Something I didn’t experiment with is the use of Amazon’s spot instances, where you can bid to use Amazon’s unused EC2 capacity. I didn’t think of doing this until just as I was about to launch the crawl. When I went to look at the spot instance pricing history, I discovered to my surprise that the spot instance prices are often a factor of 10 or so lower than the prices for on-demand instances! Factoring in the charges for outgoing bandwidth, this means it may be possible to use spot instances to do a similar crawl for 120 dollars or so, a factor of five savings. I considered switching, but ultimately decided against it, thinking that it might take 2 or 3 days work to properly understand the implications of switching, and to get things working exactly as I wanted. Admittedly, it’s possible that it would have taken much less time, in which case I missed an opportunity to trade some money for just a little extra time.

Improvements to the crawler architecture: Let me finish by noting a few ways it’d be interesting to improve the current crawler:

  • For many long-running applications the crawler would need a smart crawl policy so that it knows when and how to re-crawl a page. According to a presentation from Jeff Dean, Google’s mean time to index a new page is now just minutes. I don’t know how that works, but imagine that notification protocols such as pubsubhubbub play an important role. It’d be good to change the crawler so that it’s pubsubhubbub aware.
  • The crawler currently uses a threaded architecture. Another quite different approach is to use an evented architecture. What are the pros and cons of a multi-threaded versus an evented architecture?
  • The instances in the cluster are configured using fabric and shell scripts to install programs such as redis, pybloomfilter, and so on. This is slow and not completely reliable. Is there a better way of doing this? Creating my own EC2 AMI? Configuration management software such as Chef and Puppet? I considered using one of the latter, but deferred it because of the upfront cost of learning the systems.
  • Logging is currently done using Python’s logging module. Unfortunately, I’m finding this is not well-adapted to Python’s threading. Is there a better solution?
  • The crawler was initially designed for crawling in a batch environment, where it is run and then terminates. I’ve since modified it so that it can be stopped, modifications made, and restarted. It’d be good to add instrumentation so it can be modified more dynamically, in real time.
  • Many interesting research papers have been published about crawling. I read or skimmed quite a few while writing my crawler, but ultimately used only a few of the ideas; just getting the basics right proved challenging enough. In future iterations it’d be useful to look at this work again and to incorporate the best ideas. Good starting points include a chapter in the book by Manning, Raghavan and Sch\”utze, and a survey paper by Olston and Najork. Existing open source crawlers such as Heritrix and Nutch would also be interesting to look at in more depth.
Jun 30 12

Using evaluation to improve our question-answering system

by Michael Nielsen

It’s tempting to think IBM’s Jeopardy-playing machine Watson must have relied on some huge algorithmic advance or silver bullet idea in order to beat top human Jeopardy players. But the researchers behind Watson have written a very interesting paper about how Watson works, and a different picture emerges. It’s not that they found any super-algorithm for answering questions. Instead, Watson combined a large number of different algorithms, most of them variations on standard algorithms from natural language processing and machine learning. Individually, none of these algorithms was particularly good at solving Jeopardy puzzles, or the sub-problems that needed to be solved along the way. But by integrating a suite of not-very-good algorithms in just the right way, the Watson team got superior performance. As they write in the paper:

rapid integration and evaluation of new ideas and new components against end-to-end metrics [were] essential to our progress… [Question answering benefits from] a single extensible architecture that [allows] component results to be consistently evaluated in a common technical context against a growing variety of… “Challenge Problems.”… Our commitment to regularly evaluate the effects of specific techniques on end-to-end-performance, and to let that shape our research investment, was necessary for our rapid progress.

In other words, they built an extensive evaluation environment that gave detailed and clear metrics that let them see how well their system was doing. In turn, they used these metrics to determine where to invest time and money in improving their system and, equally important, to determine what ideas to abandon. We will call this style of development evaluation-driven development. It’s not a new idea, of course – anyone who’s ever run an A/B test is doing something similar – but the paper implies that Watson took the approach to a remarkable extreme, and that it was this approach which was responsible for much of Watson’s success.

In the last post I developed a very simple question-answering system based on the AskMSR system developed at Microsoft Research in 2001. I’ve since been experimenting – in a very small-scale way! – with the use of evaluation-driven development to improve the performance of that system. In this post I describe some of my experiments. Of course, my experiments don’t compare in sophistication to what the Watson team did, nor with what is done in many other machine learning systems. Still, I hope that the details are of interest to at least a few readers. The code for the experiments may be found on GitHub here. Note that you don’t need to have read the last post to understand this one, although of course it wouldn’t hurt!

First evaluation: The system described in my last post took questions like “Who is the world’s richest person?” and rewrote those questions as search queries for Google. E.g., for the last question it might use Google to find webpages matching “the world’s richest person is *”, and then look for name-like strings in the wildcard position. It would extract possible strings, and score them based on a number of factors, such as whether the strings were capitalized, how likely the search query was to give the correct results, and so on. Finally, it returned the highest-scoring strings as candidate answers.

To evaluate this system, I constructed a list of 100 questions, with each question having a list of one or more acceptable answers. Example questions and acceptable answers included:

(1) Who married the Prince of Wales in 2005?: Camilla Parker Bowles

(2) Who was the first woman to climb the mountain K2?: Wanda Rutkiewicz

(3) Who was the bass guitarist for the band Nirvana?: Krist Novoselic

(4) Who wrote ‘The Waste Land’?: T. S. Eliot or Thomas Stearns Eliot

(5) Who won the 1994 Formula One Grand Prix championship?: Michael Schumacher

The system described in my last post got 41 of the 100 questions exactly right, i.e., it returned an acceptable answer as its top-scoring answer. Furthemore, the system returned a correct answer as one of its top 20 scored responses for 75 of the 100 questions. Among these, the average rank was 3.48: i.e., when Google could find the answer at all, it tended to rank it very highly.

(Actually, to be precise, I used a slightly modified version of the system in the last post: I fixed a bug in how the program parsed strings. The bug is conceptually unimportant, but fixing it increased performance slightly.)

Comparison to Wolfram Alpha: As a comparison, I tried submitting the evaluation questions also to Wolfram Alpha, and determining if the answers returned by Wolfram were in the list of acceptable answers. Now, Wolfram doesn’t always return an answer. But for 27 of those questions it did return an answer. And 20 of those answers were correct.

Hybrid system: Two things are notable about Wolfram’s performance versus the Google-based system: (1) Wolfram gets answers correct much less often than the Google-based system; and (2) When Wolfram returns an answer, it is much more likely to be correct (20 / 27, or 74 percent) than the Google-based system. This suggests using a hybrid system which applies the following procedure:

if Wolfram Alpha returns an answer:
    return Wolfram Alpha's answer
else:
    return the Google-based answer

Let’s see if we can guess how well this system will work. One possible assumption is that the correctness of the Google-based system and the Wolfram-based system are uncorrelated. If that’s the case, then we’d expect that the hybrid system would get 20 / 27 questions correct, for those questions answered by Wolfram, and the Google-based system would get 41 percent of the remaining 73 questions correct. That gives a total of 50 questions correct.

Now, in practice, when I ran the hybrid procedure it only gave 45 correct answers. That’s a considerable improvement (4 questions) over the Google-based system alone. On the other hand, it’s less than half the improvement we’d expect if the Google-based system and the Wolfram-based system were uncorrelated. What’s going on is that the evaluation questions which Wolfram Alpha is good at answering are, for the most part, questions which Google is also good at answering. In other words, what we learn from the evaluation is that there is considerable overlap (or correlation) in the type of questions these two systems are good at answering, and so we get less improvement than we might have thought.

Problems for the author

  • In developing the hybrid algorithm we relied on the fact that when Wolfram Alpha returned an answer it was highly likely to be correct – much more likely than the Google-based system. But suppose the Google-based system says that there is a very big gap in score between the top-ranked and second-ranked answer. Might that signal that we can have high confidence in the Google-based answer? Determine where it is possible to gain better performance by returning the Google-based answer preferentially when the ratio of the top-ranked and second-ranked score is sufficiently large.

  • Are there other problem features that would enable us to determine whether a question is more or less likely to be answered correctly by the Google-based system or by the Wolfram-based system? Maybe, for example, long questions are more likely to be answered correctly by Google. Modify the code to determine whether this is the case.

Improving the scoring of proper nouns in the Google-based system: In the Google-based system, the scoring mechanism increase the scores of candidate answers by a factor for each word in the answer which is capitalized. The idea is to score more highly candidate answers which are likely to be proper nouns.

Originally, I chose the capitalization factor to be 3.0, based on nothing more than an intuitive guess. Let’s use the evaluation system to see if we can improve this guess. The results are below; you may wish to skip straight to the list of lessons learned, however, as an entree into the results.

Factor: 4.0
Top-ranking answer is correct: 33
A correct answer in the top 20: 74
Average rank for correct answers in the top 20: 4.46

Factor: 3.0 (original parameter choice)
Top-ranking answer is correct: 41
A correct answer in the top 20: 75
Average rank for correct answers in the top 20: 3.48

Factor: 2.5
Top-ranking answer is correct: 42
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.41

Factor: 2.4
Top-ranking answer is correct:  47
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.38

Factor: 2.3
Top-ranking answer is correct: 47
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.25

Factor: 2.2
Top-ranking answer is correct: 47
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.21

Factor: 2.1
Top-ranking answer is correct: 47
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.24

Factor: 2.0
Top-ranking answer is correct: 42
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.13

Factor: 1.9
Top-ranking answer is correct: 44
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.13

Factor: 1.8
Top-ranking answer is correct: 43
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.12

Factor: 1.7
Top-ranking answer is correct: 38
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.39

Factor: 1.6
Top-ranking answer is correct: 34
A correct answer in the top 20: 74
Average rank for correct answers in the top 20: 3.22

Factor: 1.5
Top-ranking answer is correct: 29
A correct answer in the top 20: 73
Average rank for correct answers in the top 20: 3.38

Factor: 1.0
Top-ranking answer is correct: 1
A correct answer in the top 20: 65
Average rank for correct answers in the top 20: 7.74

Factor: 0.8
Top-ranking answer is correct: 1
A correct answer in the top 20: 52
Average rank for correct answers in the top 20: 10.25

Looking at the above, we learn a great deal:

(1) Increasing the capitalization factor to much above my original guess at a value (3.0) starts to significantly degrade the system.

(2) Removing the capitalization factor (i.e., setting it to 1.0) makes the system much, much worse. It still does okay at getting things approximately correct — for 65 of the questions a correct answer is in the top 20 returned responses. But for only 1 of the 100 questions is the top-ranked answer actually correct. In other words, we learn that paying attention to capitalization makes a huge difference to how likely our algorithm is to return a correct answer as the top result. It’s also interesting to note that even ignoring capitalization the algorithm does pretty well at returning a high-ranked answer that is correct.

(3) Actually penalizing capitalized answers (by choosing a capitalization factor of 0.8) makes things even worse. This is not surprising, but it’s useful to see explicitly, and further confirmation that paying attention to capitalization matters.

(4) The optimal parameter range for the capitalization factor depends on whether you want to maximize the number of perfect answers, in which case the capitalization factor should be 2.1-2.4, or to get the best possible rank, in which case the capitalization factor should be 1.8-2.0.

(5) Performance only varies a small amount over the range of capitalization factors 1.8-2.3. Even taking the capitalization factor out to its original value (3.0) decreases performance only a small amount.

Improving the scoring of proper nouns in the hybrid system: We’ve just used our evaluation suite to improve the value of the capitalization factor for the Google-based question-answering system. What happens when go through the same procedure for the hybrid question-answering system? A priori the optimal value for the capitalization factor may be somewhat different. I ran the evaluation algorithm for many different values of the capitaliation factor. Here are the results:

Factor: 2.6
Top-ranking answer is correct: 47

Factor: 2.5
Top-ranking answer is correct: 46

Factor: 2.4
Top-ranking answer is correct: 51

Factor: 2.3
Top-ranking answer is correct: 51

Factor: 2.2
Top-ranking answer is correct: 51

Factor: 2.1
Top-ranking answer is correct: 51

Factor: 2.0
Top-ranking answer is correct: 47

Factor: 1.9
Top-ranking answer is correct: 50

Factor: 1.8
Top-ranking answer is correct: 49

Factor: 1.7
Top-ranking answer is correct: 47

The pattern is quite similar to the Google-based system, with the highest scores being obtained for capitalization factors in the range 2.1-2.4. As a result, I’ve change the value for the capitalization factor from its original value of 3.0 to a somewhat lower value, 2.2.

Varying the weights of the rewrite rules: Recall from the last post that questions of the form “Who VERB w1 w2 w3…” got rewritten as Google queries:

"VERB w1   w2   w3 ... "
"w1   VERB w2   w3 ... "
"w1   w2   VERB w3 ... "
... 

The candidate answers extracted from the search result for these rewritten queries were given a score of 5, following the scoring used in the original AskMSR research paper. Another rewrite rule formulated a single Google query which omitted both the quotes and VERB, i.e., it was just:

w1   w2   w3 ... 

The candidate answers extracted from the search results for these rewritten queries were given a score of 2.

The scores 5 and 2 seem pretty arbitrary. Can we improve our system by changing the scores? Let’s experiment, and find out. To do the experiment, let me first mention that all that matters to the final ranking is the ratio of the two scores (initially, 5/2 = 2.5). So I’ll give the results of running our evaluation suite for different choices of that ratio:

Ratio: infinite
Top-ranking answer is correct: 50

Ratio: 10.0
Top-ranking answer is correct: 50

Ratio: 6.0
Top-ranking answer is correct: 50

Ratio: 5.0
Top-ranking answer is correct: 50

Ratio: 4.0
Top-ranking answer is correct: 50

Ratio: 3.5
Top-ranking answer is correct: 51

Ratio: 3.0
Top-ranking answer is correct: 51

Ratio: 2.5
Top-ranking answer is correct: 51

Ratio: 2.0
Top-ranking answer is correct: 51

Ratio: 1.5
Top-ranking answer is correct: 51

Ratio: 1.0
Top-ranking answer is correct: 50

Ratio: 0.75
Top-ranking answer is correct: 49

Ratio: 0.0
Top-ranking answer is correct: 37

Again, we learn some interesting things:

(1) The performance is surprisingly insensitive to changes in this ratio.

(2) Even removing the quoted rewritten queries entirely (ratio = 0.0) only drops the performance from 51 to 37 questions answered correctly.

(3) Dropping the unquoted rewritten queries has almost no effect at all, dropping the number of correct answers from 51 to 50.

(4) Choosing the ratio to be 2.5, as was done in the original AskMSR paper, seems to provide good performance.

Two problems in the way we’ve used evaluation questions to improve the system

(1) It’s not clear that the evaluation questions are truly a random sample of well-posed “Who…?” questions. They’re simply a set of questions that happened to occur to me while I was working on the program, and they naturally reflect my biases and interests. In fact, the problem is worse than that: it’s not even clear what it would mean for the evaluation questions to be a random sample. A random sample from what natural population?

(2) There’s something fishy about using the same data repeatedly to improve the parameters in our system. It can lead to the problem of overfitting, where the parameters we end up choosing reflect accidental features of the evaluation set. A standard solution to this problem is cross-validation, dividing evaluation questions up into training data (which are used to estimate parameters) and validation data (which are used to check that overfitting hasn’t occurred).

Problem (2) could be solved with a much larger set of evaluation questions, and that’s a good goal for the future. If anyone knows of a good source of evaluation questions, I’d like to hear about it! It’s not at all clear to me how to solve problem (1), and I’d be interested to hear people’s ideas.

Automating: What we’ve done here is very similar to what people routinely do with statistical models or machine learning, where you use training data to find the best possible parameters for your model. What’s interesting, though, is that we’re not just varying parameters, we’re changing the algorithm. It’s interesting to learn that Wolfram Alpha and Google overlap more than you might a priori expect in the types of questions they’re good (and bad) at answering. And it tells you that effort invested in improving one or the other approach might have less impact than you expect, unless you’re careful to aim your improvements at questions not already well-answered by the other system.

Problems

  • Find three ways of modifying the hybrid algorithm, and use the evaluation algorithm to see how well the modifications work. Can you improve the algorithm so it gets more than 51 of the evaluation questions exactly right?

Acknowledgements: Thanks to Nathan Hoffman for gently suggesting that it might be a good idea to cache results from my screen-scraping!

Interested in more? Please subscribe to this blog, or follow me on Twitter. You may also enjoy reading my new book about open science, Reinventing Discovery.

Jun 13 12

How to answer a question: a simple system

by Michael Nielsen

In 2011, IBM got a lot of publicity when it demonstrated a computer named Watson, which was designed to answer questions on the game show Jeopardy. Watson was good enough to be competitive with (and ultimately better than) some of the best ever Jeopardy players.

Playing Jeopardy perhaps seems like a frivolous task, and it’s tempting to think that Watson was primarily a public relations coup for IBM. In fact, Watson is a remarkable technological achievement. Jeopardy questions use all sorts of subtle hints, humour and obscure references. For Watson to compete with the best humans it needed to integrate an enormous range of knowledge, as well as many of the best existing ideas in natural language processing and artificial intelligence.

In this post I describe one of Watson’s ancestors, a question-answering system called AskMSR built in 2001 by researchers at Microsoft Research. AskMSR is much simpler than Watson, and doesn’t work nearly as well, but does teach some striking lessons about question-answering systems. My account of AskMSR is based on a paper by Brill, Lin, Banko, Dumais and Ng, and a followup paper by Brill, Dumais and Banko.

The AskMSR system was developed for a US-Government-run workshop called TREC. TREC is an annual multi-track workshop, with each track concentrating on a different information retrieval task. For each TREC track the organizers pose a challenge to participants in the run-up to the workshop. Participants build systems to solve those challenges, the systems are evaluated by the workshop organizers, and the results of the evaluation are discussed at the workshop.

For many years one of the TREC tracks was question answering (QA), and the AskMSR system was developed for the QA track at the 2002 TREC workshop. At the time, many of the systems submitted to TREC’s QA track relied on complex linguistic analysis to understand as much as possible about the meaning of the questions being asked. The researchers behind AskMSR had a different idea. Instead of doing sophisticated linguistic analysis of questions they decided to do a much simpler analysis, but to draw on a rich database of knowledge to find answers – the web. As they put it:

In contrast to many question answering systems that begin with rich linguistic resources (e.g., parsers, dictionaries, WordNet), we begin with data and use that to drive the design of our system. To do this, we first use simple techniques to look for answers to questions on the Web.

As I describe in detail below, their approach was to take the question asked, to rewrite it in the form of a search engine query, or perhaps several queries, and then extract the answer by analysing the Google results for those queries.

The insight they had was that a large and diverse document collection (such as the web) would contain clear, easily extractable answers to a very large number of questions. Suppose, for example, that your QA system is trying to answer the question “Who killed Abraham Lincoln?” Suppose also that the QA system only received limited training data, including a text which read “Abraham Lincoln’s life was ended by John Wilkes Booth”. It requires a sophisticated analysis to understand that this means Booth killed Lincoln. If that were the only training text related to Lincoln and Booth, the system might not make the correct inference. On the other hand, with a much large document collection it would be likely that somewhere in the documents it would plainly say “John Wilkes Booth killed Abraham Lincoln”. Indeed, if the document collection were large enough this phrase (and close variants) might well be repeated many times. At that point it takes much less sophisticated analysis to figure out that “John Wilkes Booth” is a good candidate answer. As Brill et al put it:

[T]he greater the answer redundancy in the source, the more likely it is that we can find an answer that occurs in a simple relation to the question, and therefore, the less likely it is that we will need to resort to solving the aforementioned difficulties facing natural language processing systems.

To sum it up even more succinctly, the idea behind AskMSR was that: unsophisticated linguistic algorithms + large amounts of data \geq sophisticated linguistic algorithms + only a small amount of data.

I’ll now describe how AskMSR worked. I limit my discussion to just a single type of question, questions of the form “Who…?” The AskMSR system could deal with several different question types (“When…?”, etc), but the ideas used for the other question types were similar. More generally, my description is an abridgement and simplification of the original AskMSR system, and you should refer to the original paper for more details. The benefit of my simplified approach is that we’ll be able to create a short Python implementation of the ideas.

How AskMSR worked

Rewriting: Suppose we’re asked a question like “Who is the richest person in the world?” If we’re trying to use Google ourselves to answer the question, then we might simply type the question straight in. But Google isn’t (so far as I know) designed with question-answering in mind. So if we’re a bit more sophisticated we’ll rewrite the text of the question to turn it into a query more likely to turn up answers, queries like richest person world or world's richest person. What makes these queries better is that they omit terms like “Who” which are unlikely to appear in answers to the question we’re interested in.

AskMSR used a similar idea of rewriting, based on very simple rules. Consider “Who is the richest person in the world?” Chances are pretty good that we’re looking for a web page with text of the form “the richest person in the world is *”, where * denotes the (unknown) name of that person. So a good strategy is to search for pages matching the text “the richest person in the world is” exactly, and then to extract whatever name comes to the right.

There are a few things to note about how we rewrote the question to get the search query.

Most obviously, we eliminated “Who” and the question mark.

A little less obviously, we moved the verb “is” to the end of the phrase. Moving the verb in this way is common when answering questions. Unfortunately, we don’t always move the verb in the same way. Consider a question such as “Who is the world’s richest person married to?” The best way to move the verb is not to the end of the sentence, but rather to move it between “the world’s richest person” and “married to”, so we are searching for pages matching the text “the world’s richest person is married to *”.

More generally, we’ll rewrite questions of the form Who w0 w1 w2 w3 ... (where w0 w1 w2 ... are words) as the following search queries (note that the quotes matter, and are part of the query):

"w0 w1 w2 w3 ... "
"w1 w0 w2 w3 ... "
"w1 w2 w0 w3 ... "
... 

It’s a carpet-bombing strategy, trying all possible options for the correct place for the verb in the text. Of course, we end up with many ridiculous options, like "the world's is richest person married to". However, according to Brill et al, “While such an approach results in many nonsensical rewrites… these very rarely result in the retrieval of bad pages, and the proper movement position is guaranteed to be found via exhaustive search.”

Why does this very rarely result in the retrieval of bad pages? Ultimately, the justification must be empirical. However, a plausible story is that rewrites such as "the world's is richest person married to" don’t make much sense, and so are likely to match only a very small collection of webpages compared to (say) "world's richest person is married to". Because of this, we will see few repetitions in any erroneous answers extracted from the search results for the nonsensical query, and so the nonsensical query is unlikely to result in incorrect answers.

Problems for the author

  • Contrary to the last paragraph, I think it’s likely that sometimes the rewritten questions do make sense as search queries, but not as ways of searching for answers to the original question. What are some examples of such questions?

Another feature of the queries above is that they are all quoted. Google uses this syntax to specify that an exact phrase match is required. Below we’ll introduce an unquoted rewrite rule, meaning that an exact phrase match isn’t required.

With the rules I’ve described above we rewrite “Who is the world’s richest person?” as both the query “is the world’s richest person” and the query “the world’s richest person is”. There are several other variations, but for now we’ll focus on just these two. For the first query, “is the world’s richest person”, notice that if we find this text in a page then it’s likely to read something like “Bill Gates is the world’s richest person” (or perhaps “Carlos Slim is…”, since Slim has recently overtaken Gates in wealth). If we search on the second query, then returned pages are more likely to read “the world’s richest person is Bill Gates”.

What this suggests is that accompanying the rewritten query, we should also specify whether we expect the answer to appear on the left (L) or the right (R). Brill et al adopted the rule to expect the answer on the left when the verb starts the query, and on the right when the verb appears anywhere else. With this change the specification of the rewrite rules becomes:

["w0 w1 w2 w3 ... ", L]
["w1 w0 w2 w3 ... ", R]
["w1 w2 w0 w3 ... ", R]
... 

Brill et al don’t justify this choice of locations. I haven’t thought hard about it, but the few examples I’ve tried suggest that it’s not a bad convention, although I’ll bet you could come up with counterexamples.

For some rewritten queries Brill allow the answer to appear on either (E) side. In particular, they append an extra rewrite rule to the above list which is:

[w1 w2 w3..., E]

This differs in two ways to the earlier rewrite rules. First, the search query is unquoted, so it doesn’t require an exact match, or for the order of the words to be exactly as specified. Second, the verb is entirely omitted. For these reasons it makes sense to look for answers on either side of the search text.

Of course, if we extract a candidate answer from this less precise search then we won’t be as confident in the answer. For that reason we also assign a score to each rewrite rule. Here’s the complete set of rewrite rules we use, including scores. (I’ll explain how we use the scores a little later: for now all that matters is that higher scores are better.)

["w0 w1 w2 w3 ... ", L, 5]
["w1 w0 w2 w3 ... ", R, 5]
["w1 w2 w0 w3 ... ", R, 5]
... 
[w1 w2 w3..., E, 2]

Problems for the author

  • At the moment we’ve implicitly assumed that the verb is a single word after “Who”. However, sometimes the verb will be more complex. For example, in the sentence “Who shot and killed President Kennedy?” the verb is “shot and killed”. How can we identify such complex compound verbs?

Extracting candidate answers: We submit each rewritten query to Google, and extract Google’s document summaries for the top 10 search results. We then split the summaries into sentences, and each sentence is assigned a score, which is just the score of the rewrite rule the sentence originated from.

How should we extract candidate answers from these sentences? First, we remove text which overlaps the query itself – we’re looking for terms near the query, not the same as the query! This ensures that we don’t answer questions such as “Who killed John Fitzgerald Kennedy?” with “John Fitzgerald Kennedy”. We’ll call the text that remains after deletion a truncated sentence.

With the sentence truncated, the idea then is to look for 1-, 2-, and 3-word strings (n-grams) which recur often in the truncated sentences. To do this we start by listing every n-gram that appears in at least one truncated sentence. This is our list of candidate answers. For each such n-gram we’ll assign a total score. Roughly speaking, the total score is the sum of the scores of all the truncated sentences that the n-gram appears in.

In fact, we can improve the performance of the system by modifying this scoring procedure. We do this by boosting the score of a sentence if one or more words in the n-gram are capitalized. We do this because capitalization likely indicates a proper noun, which means the word is an especially good candidate to be part of the correct answer. The system outputs the n-gram which has the highest total score as its preferred answer.

Toy program: Below is a short Python 2.7 program which implements the algorithm described above. (It omits the left-versus-right-versus-either distinction, though.) The code is also available on GitHub, together with a small third-party library (via Mario Vilas) that’s used to access Google search results. Here’s the code:

#### mini_qa.py
#
# A toy question-answering system, which uses Google to attempt to
# answer questions of the form "Who... ?"  An example is: "Who
# discovered relativity?"
#
# The design is a simplified version of the AskMSR system developed by
# researchers at Microsoft Research.  The original paper is:
#
# Brill, Lin, Banko, Dumais and Ng, "Data-Intensive Question
# Answering" (2001).
#
# I've described background to this program here:
#
# http://michaelnielsen.org/ddi/how-to-answer-a-question-v1/

#### Copyright and licensing
#
# MIT License - see GitHub repo for details:
#
# https://github.com/mnielsen/mini_qa/blob/blog-v1/mini_qa.py
#
# Copyright (c) 2012 Michael Nielsen

#### Library imports

# standard library
from collections import defaultdict
import re

# third-party libraries
from google import search

def pretty_qa(question, num=10):
    """
    Wrapper for the `qa` function.  `pretty_qa` prints the `num`
    highest scoring answers to `question`, with the scores in
    parentheses.
    """
    print "\nQ: "+question
    for (j, (answer, score)) in enumerate(qa(question)[:num]):
        print "

def qa(question):
    """
    Return a list of tuples whose first entry is a candidate answer to
    `question`, and whose second entry is the score for that answer.
    The tuples are ordered in decreasing order of score.  Note that
    the answers themselves are tuples, with each entry being a word.
    """
    answer_scores = defaultdict(int)
    for query in rewritten_queries(question):
        for summary in get_google_summaries(query.query):
            for sentence in sentences(summary):
                for ngram in candidate_answers(sentence, query.query):
                    answer_scores[ngram] += ngram_score(ngram, 
                                                        query.score)
    return sorted(answer_scores.iteritems(), 
                  key=lambda x: x[1], 
                  reverse=True)

def rewritten_queries(question):
    """
    Return a list of RewrittenQuery objects, containing the search
    queries (and corresponding weighting score) generated from
    `question`.  
    """
    rewrites = [] 
    tq = tokenize(question)
    verb = tq[1] # the simplest assumption, something to improve
    rewrites.append(
        RewrittenQuery("\"
    for j in range(2, len(tq)):
        rewrites.append(
            RewrittenQuery(
                "\"
                    " ".join(tq[2:j+1]), verb, " ".join(tq[j+1:])),
                5))
    rewrites.append(RewrittenQuery(" ".join(tq[2:]), 2))
    return rewrites

def tokenize(question):
    """
    Return a list containing a tokenized form of `question`.  Works by
    lowercasing, splitting around whitespace, and stripping all
    non-alphanumeric characters.  
    """
    return [re.sub(r"\W", "", x) for x in question.lower().split()]

class RewrittenQuery():
    """
    Given a question we rewrite it as a query to send to Google.
    Instances of the RewrittenQuery class are used to store these
    rewritten queries.  Instances have two attributes: the text of the
    rewritten query, which is sent to Google; and a score, indicating
    how much weight to give to the answers.  The score is used because
    some queries are much more likely to give highly relevant answers
    than others.
    """

    def __init__(self, query, score):
        self.query = query
        self.score = score

def get_google_summaries(query):
    """
    Return a list of the top 10 summaries associated to the Google
    results for `query`.  Returns all available summaries if there are
    fewer than 10 summaries available.  Note that these summaries are
    returned as BeautifulSoup.BeautifulSoup objects, and may need to
    be manipulated further to extract text, links, etc.
    """
    return search(query)

def sentences(summary):
    """
    Return a list whose entries are the sentences in the
    BeautifulSoup.BeautifulSoup object `summary` returned from Google.
    Note that the sentences contain alphabetical and space characters
    only, and all punctuation, numbers and other special characters
    have been removed.
    """
    text = remove_spurious_words(text_of(summary))
    sentences = [sentence for sentence in text.split(".") if sentence]
    return [re.sub(r"[^a-zA-Z ]", "", sentence) for sentence in sentences]

def text_of(soup):
    """
    Return the text associated to the BeautifulSoup.BeautifulSoup
    object `soup`.
    """
    return ".join(str(soup.findAll(text=True)))

def remove_spurious_words(text):
    """
    Return `text` with spurious words stripped.  For example, Google
    includes the word "Cached" in many search summaries, and this word
    should therefore mostly be ignored.
    """
    spurious_words = ["Cached", "Similar"]
    for word in spurious_words:
        text = text.replace(word, "")
    return text

def candidate_answers(sentence, query):
    """
    Return all the 1-, 2-, and 3-grams in `sentence`.  Terms appearing
    in `query` are filtered out.  Note that the n-grams are returned
    as a list of tuples.  So a 1-gram is a tuple with 1 element, a
    2-gram is a tuple with 2 elements, and so on.
    """
    filtered_sentence = [word for word in sentence.split() 
                         if word.lower() not in query]
    return sum([ngrams(filtered_sentence, j) for j in range(1,4)], [])

def ngrams(words, n=1):
    """
    Return all the `n`-grams in the list `words`.  The n-grams are
    returned as a list of tuples, each tuple containing an n-gram, as
    per the description in `candidate_answers`.
    """
    return [tuple(words[j:j+n]) for j in xrange(len(words)-n+1)]

def ngram_score(ngram, score):
    """
    Return the score associated to `ngram`.  The base score is
    `score`, but it's modified by a factor which is 3 to the power of
    the number of capitalized words.  This biases answers toward
    proper nouns.
    """
    num_capitalized_words = sum(
        1 for word in ngram if is_capitalized(word)) 
    return score * (3**num_capitalized_words)

def is_capitalized(word):
    """
    Return True or False according to whether `word` is capitalized.
    """
    return word == word.capitalize()

if __name__ == "__main__":
    pretty_qa("Who ran the first four-minute mile?")
    pretty_qa("Who makes the best pizza in New York?")
    pretty_qa("Who invented the C programming language?")
    pretty_qa("Who wrote the Iliad?")
    pretty_qa("Who caused the financial crash of 2008?")
    pretty_qa("Who caused the Great Depression?")
    pretty_qa("Who is the most evil person in the world?")
    pretty_qa("Who wrote the plays of Wiliam Shakespeare?")
    pretty_qa("Who is the world's best tennis player?")
    pretty_qa("Who is the richest person in the world?")

If you run the program you’ll see that the results are a mixed bag. When I tested it, it knew that Roger Bannister ran the first four-minute mile, that Dennis Ritchie invented the C programming language, and that Homer wrote the Iliad. On the other hand, the program sometimes gives answers that are either wrong or downright nonsensical. For example, it thinks that Occupy Wall Street caused the financial crash of 2008 (Ben Bernanke also scores highly). And it replies “Laureus Sports Awards” when asked who is the world’s best tennis player. So it’s quite a mix of good and bad results.

While developing the program I used some questions repeatedly to figure out how to improve perfomance. For example, I often asked it the questions “Who invented relativity?” and “Who killed Abraham Lincoln?” Unsuprisingly, the program now answers both questions correctly! So to make things fairer the questions used as examples in the code aren’t ones I tested while developing the program. They’re still far from a random sample, but at least the most obvious form of bias has been removed.

Much can be done to improve this program. Here are a few ideas:

Problems

  • To solve many of the problems I describe below it would help to have a systematic procedure to evaluate the performance of the system, and, in particular, to compare the performance of different versions of the system. How can we build such a systematic evaluation procedure?
  • The program does no sanity-checking of questions. For example, it simply drops the first word of the question. As a result, the question “Foo killed Abraham Lincoln?” is treated identically to “Who killed Abraham Lincoln?” Add some basic sanity checks to ensure the question satisfies standard constraints on formatting, etc.
  • More generally, the program makes no distinction between sensible and nonsensical questions. You can ask it “Who coloured green ideas sleeping furiously?” and it will happily answer. (Perhaps appropriately, “Chomsky” is one of the top answers.) How might the program figure out that the question doesn’t make sense? (This is a problem shared by many other systems. Here is to the question, and also Google’s response.)
  • The program boosts the score of n-grams containing capitalized words. This makes sense most of the time, since a capitalized word is likely to be a proper noun, and is thus a good candidate to be the answer to the question. But it makes less sense when the word is the first word of the sentence. What should be done then?
  • In the sentences extracted from Google we removed terms which appeared in the original query. This means that, for example, we won’t answer “Who killed John Fitzgerald Kennedy?” with “John Fitzgerald Kennedy”. It’d be good to take this further by also eliminating synonyms, so that we don’t answer “JFK”. How can this be done?
  • There are occasional exceptions to the rule that answers don’t repeat terms in the question. For example, the correct answer to the question “Who wrote the plays of William Shakespeare?” is, most likely, “William Shakespeare”. How can we identify questions where this is likely to be the case?
  • How does it change the results to use more than 10 search summaries? Do the results get better or worse?
  • An alternative approach to using Google would be to use Bing, Yahoo! BOSS, Wolfram Alpha, or Cyc as sources. Or we could build our own source using tools such as Nutch and Solr. How do the resulting systems compare to one another? Is it possible to combine the sources to do better than any single source alone?
  • Some n-grams are much more common than others: the phrase “he was going” occurs far more often in English text than the phrase “Lee Harvey Oswald”, for example. Because of this, chances are that repetitions of the phrase “Lee Harvey Oswald” in search results are more meaningful than repetitions of more common phrases, such as “he was going”. It’d be natural to modify the program to give less common phrases a higher score. What’s the right way of doing this? Does using inverse document frequency give better performance, for example?
  • Question-answering is really three problems: (1) understanding the question; (2) figuring out the answer; and (3) explaining the answer. In this post I’ve concentrated on (2) and (to some extent) (1), but not (3). How might we go about explaining the answers generated by our system?

Discussion

Let’s come back to the idea I described in the opening: unsophisticated algorithms + large amounts of data \geq sophisticated linguistic algorithms + only a small amount of data.

Variations on this idea have become common recently. In an influential 2009 paper, Halevy, Norvig, and Pereirawrote that:

[I]nvariably, simple models and a lot of data trump more elaborate models based on less data.

They were writing in the context of machine translation and speech recognition, but this point of view has become commonplace well beyond machine translation and speech recognition. For example, at the 2012 O’Reilly Strata California conference on big data, there was a debate on the idea that “[i]n data science, domain expertise is more important than machine learning skill.” The people favouring the machine learning side won the debate, at least in the eyes of the audience. Admittedly, a priori you’d expect this audience to strongly favour machine learning. Still, I expect that just a few years earlier even a highly sympathetic audience would have tilted the other way.

An interesting way of thinking about this idea that data trumps the quality of your models and algorithms is in terms of a tradeoff curve:

The curve shown is one of constant performance: as you increase the amount of training data, you decrease the quality of the algorithm required to get the same performance. The tradeoff between the two is determined by the slope of the curve.

A difficulty with the “more data is better” point of view is that it’s not clear how to determine what the tradeoffs are in practice: is the slope of the curve very shallow (more data helps more than better algorithms), or very steep (better algorithms help more than more data). To put it another way, it’s not obvious whether to focus on acquiring more data, or on improving your algorithms. Perhaps the correct moral to draw is that this is a key tradeoff to think about when deciding how to allocate effort. At least in the case of the AskMSR system, taking the more data idea seriously enabled the team to very quickly build a system that was competitive with other systems which had taken much longer to develop.

A second difficulty with the more data is better point of view is, of course, that sometimes more data is exremely expensive to generate. IBM has said that it intends to use Watson to help doctors answer medical questions. Yet for such highly specialized questions it’s not so clear that it will be easy to find data sources which meet the criteria outlined by Brill et al:

[T]he greater the answer redundancy in the source, the more likely it is that we can find an answer that occurs in a simple relation to the question, and therefore, the less likely it is that we will need to resort to solving the aforementioned difficulties facing natural language processing systems.

Data versus understanding: I’ll finish by briefly discussing one possible criticism of the data-driven approach to question-answering. That criticism is that it rejects detailed understanding in favour of a simplistic data-driven associative model. It’s tempting to think that this must therefore be the wrong track, a blind alley that may pay off in the short term, but which will ultimately be unfruitful. I think that point of view is shortsighted. We don’t yet understand how human cognition works, but we do know that we often use after-the-fact rationalization and confabulation, suggesting that the basis for many of our decisions isn’t a detailed understanding, but something more primitive. This isn’t to say that rationalization and confabulation are desirable. But it does mean that it’s worth pushing beyond narrow conceptions of understanding, and trying to determine how much of our intelligence can be mimicked using simple data-driven associative techniques, and by combining such techniques with other ideas.

Acknowledgements: Thanks to Thomas Ballinger, Dominik Dabrowski, Nathan Hoffman, and Allan O’Donnell for comments that improved my code.

Interested in more? Please subscribe to this blog, or follow me on Twitter. You may also enjoy reading my new book about open science, Reinventing Discovery.

Apr 11 12

Lisp as the Maxwell’s equations of software

by Michael Nielsen

On my first day of physics graduate school, the professor in my class on electromagnetism began by stepping to the board, and wordlessly writing four equations:

 \nabla \cdot E = \frac{\rho}{\epsilon_0} \hspace{2cm} \nabla \times E+\frac{\partial B}{\partial t} = 0   \nabla \cdot B = 0 \hspace{1.1cm} \nabla \times B-\mu_0 \epsilon_0 \frac{\partial E}{\partial t} = \mu_0 j

He stepped back, turned around, and said something like [1]: “These are Maxwell’s equations. Just four compact equations. With a little work it’s easy to understand the basic elements of the equations – what all the symbols mean, how we can compute all the relevant quantities, and so on. But while it’s easy to understand the elements of the equations, understanding all their consequences is another matter. Inside these equations is all of electromagnetism – everything from antennas to motors to circuits. If you think you understand the consequences of these four equations, then you may leave the room now, and you can come back and ace the exam at the end of semester.”

Alan Kay has famously described Lisp as the “Maxwell’s equations of software”. He describes the revelation he experienced when, as a graduate student, he was studying the LISP 1.5 Programmer’s Manual and realized that “the half page of code on the bottom of page 13… was Lisp in itself. These were “Maxwell’s Equations of Software!” This is the whole world of programming in a few lines that I can put my hand over.”

Here’s the half page of code that Kay saw in that manual:

What we’re going to do in this essay is understand what that half page of code means, and what it means that Lisp is the Maxwell’s equations of software. However, we won’t literally work through the half page of code above. Instead, we’ll do something much more informative: we’ll create a modern, fully executable equivalent of the code above. Furthermore, to make this essay accessible, I won’t assume that you know Lisp. Instead, I’ll teach you the basic elements of Lisp.

That perhaps sounds over-ambitious, but the good news is that it’s easy to learn the basic elements of Lisp. Provided you have a little facility with computer programming and comfort with mathematics, you can learn how Lisp works in just a few minutes. Frankly, it’s much easier than understanding the elements of Maxwell’s equations! And so I’ll start by explaining a subset of the Lisp programming language, and getting you to write some Lisp code.

But I won’t stop with just showing you how to write some Lisp. Once we’ve done that we’re going to write an interpreter for Lisp code. In particular, we’ll create a interpreter based on a beautiful Lisp interpreter written by Peter Norvig, which contains just 90 lines of Python code. Our interpreter will be a little more complex, due mostly to the addition of a few conveniences absent from Norvig’s interpreter. The code is still simple and easy to understand, provided you’re comfortable reading Python code. As we’ll see, the benefit of writing the interpreter is not just that it gives us a running interpreter (although that’s no small thing). It’s that writing an interpreter also deepens our understanding of Lisp. It does that by taking what would otherwise be some rather abstract concepts in our description of Lisp, and giving them concrete, tangible representations in terms of Python code and data structures. By making concrete what was formerly abstract, the code for our Lisp interpreter gives us a new way of understanding how Lisp works.

With our Python Lisp interpreter up and running, we’ll then write a modern equivalent to the code on the bottom of page 13 of the LISP 1.5 Programmer’s Manual. But while our code will be essentially the same as the code from page 13, it will have the considerable advantage that it’s also executable. We can, if we wish, play with the code, modify it, and improve it. In other words, it’s a living version of the Maxwell’s equations of software! Furthermore, with our new understanding it becomes an easy and fun exercise to understand all the details on page 13 of the LISP Manual.

This second part of the essay is based primarily on two sources: the first chapter of the LISP 1.5 Manual, of course, but also an essay by Paul Graham (postscript) in which he explains some of the early ideas behind Lisp. Incidentally, “LISP” is the capitalization used in the LISP Manual, but otherwise I’ll use the modern capitalization convention, and write “Lisp”.

The great Norwegian mathematician Niels Henrik Abel was once asked how he had become so good at mathematics. He replied that it was “by studying the masters, not their pupils”. The current essay is motivated by Abel’s admonishment. As a programmer, I’m a beginner (and almost completely new to Lisp), and so this essay is a way for me to work in detail through ideas from masters such as Alan Kay, Peter Norvig, and Paul Graham. Of course, if one takes Abel at his word, then you should stop reading this essay, and instead go study the works of Kay, Norvig, Graham, and the like! I certainly recommend taking the time to study their work, and at the end of the essay I make some recommendations for further reading. However, I hope that this essay has a distinct enough point of view to be of interest in its own right. Above all, I hope the essay makes thinking about Lisp (and programming) fun, and that it raises some interesting fundamental questions. Of course, as a beginner, the essay may contain some misunderstandings or errors (perhaps significant ones), and I’d welcome corrections, pointers, and discussion.

Some elements of Lisp

In this and the next two sections we’ll learn the basic elements of Lisp – enough to develop our own executable version of what Alan Kay saw on page 13 of the LISP Manual. We’ll call the dialect of Lisp we develop “tiddly Lisp”, or just tiddlylisp for short. Tiddlylisp is based on a subset of the programming language Scheme, which is one of the most popular modern dialects of Lisp.

While I can show you a bunch of examples of Lisp, you’ll learn much more if you type in the examples yourself, and then play around with them, modifying them and trying your own ideas. So I want you to download the file tiddlylisp.py to your local machine. Alternately, if you’re using git, you can just clone the entire code repository associated to this essay. The file tiddlylisp.py is the Lisp interpreter whose design and code we’ll work through later in the essay. On Linux and Mac you can start the tiddlylisp interpreter by typing python tiddlylisp.py from the command line. You should see a prompt:

tiddlylisp>

That’s where you’re going to type in the code from the examples – it’s an interactive Lisp interpreter. You can exit the interpreter at any time by hitting Ctrl-C. Note that the interpreter isn’t terribly complete – as we’ll see, it’s only 153 lines! – and one way it’s incomplete is that the error messages aren’t very informative. Don’t spend a lot of time worrying about the errors, just try again.

If you’re on Windows, and don’t already have Python installed, you’ll need to download it (I suggest Python 2.7 for this code), before starting the interpreter by running python tiddlylisp.py. Note that I’ve only tested the interpreter on Ubuntu Linux, not on Windows or the Mac.

As our first example, type the following code into the tiddlylisp interpreter:

tiddlylisp> (+ 2 3)
5

In the expression you typed in, + is a built-in procedure, which represents the addition operation. It’s being applied to two arguments, in this case the numbers 2 and 3. The interpreter evaluates the result of applying + to 2 and 3, i.e., of adding 2 and 3 together, returning the value 5, which it then prints.

That first example was extremely simple, but it contains many of the concepts we need to understand Lisp: expressions, procedures, arguments, evaluation, returning a value, and printing. We’ll see many more illustrations of these ideas below.

Here’s a second example, illustrating another built-in procedure, this time the multiplication procedure, *:

tiddlylisp> (* 3 4)
12

It’s the same basic story: * is a built-in procedure, this time representing multiplication, and is here being applied to the numbers 3 and 4. The interpreter evaluates the expression, and prints the value returned, which is 12.

One potentially confusing thing about the first two examples is that I’ve called + and * “procedures”, yet in many programming languages procedures don’t return a value, only functions do. I’m using this terminology because it’s the standard terminology in the programming language Scheme, which is the dialect of Lisp that tiddlylisp is based on. In fact, in some modern dialects of Lisp – such as Common Lisp – operations such as + and * would be called “functions”. But we’ll stick with Scheme’s useage, and talk only of procedures.

Another example, illustrating a slightly different type of procedure:

tiddlylisp> (< 10 20)
True

Here, the built-in procedure < represents the comparison operator. And so tiddlylisp prints the result of evaluating an expression which compares the constants 10 and 20. The result is True, since 10 is less than 20. By contrast, we have

tiddlylisp> (< 17 12)
False

since 17 is not less than 12.

Many Lisp beginners are initially confused by this way of writing basic numerical operations. We're so familiar with expressions such as 2 + 3 that the changed order in (+ 2 3) appears strange and unfamiliar. And yet our preference for the infix notation 2 + 3 over the prefix notation (+ 2 3) is more a historical accident than a consequence of anything fundamental about arithmetic. Unfortunately, this has the consequence that some people back away from Lisp, simply because they dislike thinking in this unfamiliar way.

In this essay we won't get deep enough into Lisp to see in a lot of concrete detail why the prefix notation is a good idea. However, we will get one hint: our tiddlylisp interpreter will be much simpler for using the same (prefix) style for all procedures. If you really strongly dislike the prefix notation, then I challenge you to rewrite tiddlylisp so that it uses infix notation for some operations, and prefix notation for others. What you'll find is that the interpreter becomes quite a bit more complex. And so there is a sense in which using the same prefix style everywhere makes Lisp simpler.

Another useful thing Lisp allows us to do is to nest expressions:

tiddlylisp> (< (* 11 19) (* 9 23))
False

To compute the value of this expression, tiddlylisp evaluates the nested expressions, returning 209 from (* 11 19), and 207 from (* 9 23). The output from the outer expression is then just the result of evaluating (< 209 207), which of course is False.

We can use tiddlylisp to define variables:

tiddlylisp> (define x 3)
tiddlylisp> (* x 2)
6

define is used to define new variables; we can also assign a value to a variable which has previously been defined:

tiddlylisp> (set! x 4)
tiddlylisp> x
4

One question you may have here is about the slightly unusual syntax: set!. You might wonder whether the exclamation mark means something special - maybe set! is some type of hybrid procedure or something. Actually, there's no such complexity: set! is just a single keyword, just like define, nothing special about it at all. It's just that tiddlylisp allows exclamation marks in keyword names. So there's nothing special going on indicated by the exclamation mark.

A deeper question is why we don't simply eleminate define, and make it so set! checks whether a variable has been defined already, and if not, does so. I'll explain a little bit later why we don't do this; for now, you should just note the distinction.

We can use define to define new procedures in a way similar to how we use it to define variables. Here's an example showing how to define a procedure named square. I'll unpack what's going on below, but first, here's the code,

tiddlylisp> (define square (lambda (x) (* x x)))
tiddlylisp> (square 7)
49

Ignore the first of the lines above for now. You can see from the second line what the procedure square does: it takes a single number as input, and returns the square of the number.

What about the first line of code above? We already know enough to guess quite a bit about how this line works: a procedure named square is being defined, and is assigned the value of the expression (lambda (x) (* x x)), whatever that value might be. What's new, and what we need to understand, is what the value of the expression (lambda (x) (* x x)) is. To understand this, let's break the expression into three pieces: lambda, (x), and (* x x). We'll understand the three pieces separately, and then put them back together.

The first piece of the expression - the lambda - simply tells the tiddlylisp interpreter that this expression is defining a procedure. I must admit that when I first encountered the lambda notation I found this pretty confusing [2] - I thought that lambda must be a variable, or an argument, or something like that. But no, it's just a big fat red flag to the tiddlylisp interpreter saying "Hey, this is a procedure definition". That's all lambda is.

The second part of the expression - the (x) - tells tiddlylisp that this is a procedure with a single argument, and that we're going to use the temporary name x for that argument, for the purposes of defining the procedure. If the procedure definition had started instead with (lambda (x y) ...) that would have meant that the procedure had two arguments, temporarily labelled x and y for the purposes of defining the procedure.

The third part of the expression - the (* x x) - is the meat of the procedure definition. It's what we evaluate and return when the procedure is called, with the actual values for the arguments of the procedure substituted in place of x.

Taking it all together, then, the value of the expresion (lambda (x) (* x x)) is just a procedure with a single input, and returning the square of that input. This procedure is anonymous, i.e., it doesn't have a name. But we can give it a name by using define, and so the line

tiddlylisp> (define square (lambda (x) (* x x)))

tells tiddlylisp to define something called square, whose value is a procedure (because of the lambda) with a single argument (because of the (x)), and what that procedure returns is the square of its argument (because of the (* x x)).

An important point about the variables used in defining procedures is that they're dummy variables. Suppose we wanted to define a procedure area which would return the area of a triangle, with arguments the base of the triangle and the height. We could do this using the following procedure definition:

tiddlylisp> (define area (lambda (b h) (* 0.5 (* b h))))
tiddlylisp> (area 3 5)
7.5

But what would have happened if we'd earlier defined a variable h, e.g.:

tiddlylisp> (define h 11)
tiddlylisp> (define area (lambda (b h) (* 0.5 (* b h))))

It probably won't surprise you to learn that inside the procedure definition, i.e., immediately after the lambda (b h), the h is treated as a dummy variable, and is entirely different from the h outside the procedure definition. It has what's called a different scope. And so we can continue the above with the following

tiddlylisp> (area 3 h)
16.5

that is, the area is just 0.5 times 3 times the value of h set earlier, outside the procedure definition. At that point the value of h was 11, and so (area 3 h) returns 16.5.

There's a variation on the above that you might wonder about, which is what happens when you use variables defined outside a procedure inside that procedure definition? For instance, suppose we have:

tiddlylisp> (define x 3)
tiddlylisp> (define foo (lambda (y) (* x y)))

What happens now if we evaluate foo? Well, tiddlylisp does the sensible thing, and interprets x as it was defined outside the procedure definition, so we have:

tiddlylisp> (foo 4)
12

What happens to our procedure if we next change the value of x? In fact, this changes foo:

tiddlylisp> (set! x 5)
tiddlylisp> (foo 4)
20

In other words, in the procedure definition lambda (y) (* x y) the x really does refer to the variable x, and not to the particular value x might have at any given point in time.

Let's dig down a bit more into how tiddlylisp handles scope and dummy variables. Let's look at what happens when we define a variable:

tiddlylisp> (define x 5)
tiddlylisp> x
5

The way the interpreter handles this internally is that it maintains what's called an environment: a dictionary whose keys are the variable names, and whose values are the corresponding variable values. So what the interpeter does when it sees the first line above is add a new key to the environment, "x", with value 5. When the interpreter sees the second line, it consults the environment, looks up the key x, and returns the corresponding value. You can, if you like, think of the environment as the interpreter's memory or data store, in which it stores all the details of the variables defined to date.

All this is pretty simple. We can go along defining and changing variables, and the interpreter just keeps consulting and modifying the environment as necessary. But when you define a new procedure using lambda the interpreter treats the variables used in the definition slightly differently. Let's look at an example:

tiddlylisp> (define h 5)
tiddlylisp> (define area (lambda (b) (* b h)))
tiddlylisp> (area 2)
10

What I want to concentrate on here is the procedure definition (lambda (b) (* b h)). Up to this point the interpreter had been chugging along, modifying its environment as appropriate. What happens when it sees the (lambda ...), though, is that the interpreter creates a new environment, an environment called an inner environment, as distinguished from the outer environment, which is what the interpreter has been operating in until it reached the (lambda ...) statement. The inner environment is a new dictionary, whose keys initially are just the arguments to the procedure - in this case a single key, b - and whose values will be supplied when the procedure is called.

To recap, what the interpreter does when it sees (lambda (b) (* b h)) is create a new inner environment, with a key b whose value will be set when the procedure is called. When evaluating the expression (* b h) (which defines the result returned from the procedure) what the interpreter does is first consult the inner environment, where it finds the key b, but not the key h. When it fails to find h, it looks instead for h in the outer environment, where the key h has indeed been defined, and retrieves the appropriate value.

I've described a simple example showing how environments work, but tiddlylisp also allows us to have procedure definitions nested inside procedure definitions, nested inside procedure definitions (and so on). To deal with this, the top-level tiddlylisp interpreter operates inside a global environment, and each procedure definition creates a new inner environment, perhaps nested inside a previously created inner environment, if that's appropriate.

If all this talk about inner and outer environments has left you confused, fear not. At this stage it really is important to have gotten the gist of how environments work, but you shouldn't worry if the details still seem elusive. In my opinion, the best way to understand those details is not through abstract discussion, but instead by looking at the working code for tiddlylisp. We'll get to that shortly, but for now can move onwards armed with a general impression of how environments work.

The procedure definitions I've described so far evaluate and return the value from just a single expression. We can use such expressions to achieve suprisingly complex things, because of the ability to nest expressions. Still, it would be convenient to have a way of chaining together expressions that doesn't involve nesting. A way of doing this is to use the begin keyword. begin is especially helpful in defining complex procedures, and so I'll give an example in that context:

tiddlylisp> (define area (lambda (r) (begin (define pi 3.14) (* pi (* r r)))))

This line is defining a procedure called area, with a single argument r, and where the value of (area r) is just the value of the expression

(begin (define pi 3.14) (* pi (* r r)))

with the appropriate value for r substituted. The way tiddlylisp evaluates the begin expression above is that it evaluates all the separate sub-expressions, consecutively in the order they appear, and then returns the value of the final sub-expression, in this case the sub-expression (* pi (* r r)). So, for example, we get

tiddlylisp> (area 3)
28.26

which is just the area of a circle with radius 3.

Now, in a simple example such as this you might argue that it makes more sense to avoid defining pi, and just put 3.14 straight into the later expression. However, doing so will make the intent of the code less clear, and I trust you will find it easy to imagine more complex cases where it makes even more sense to use multi-part begin expressions. This is especially the case since it is permissible to split Lisp expressions over multiple lines, so the above procedure definition could have been written:

(define area (lambda (r)
  (begin (define pi 3.14)
         (* pi (* r r)))))

Now, I should mention that if you enter the above into the tiddlylisp interpreter, you'll get errors. This is because the tiddlylisp interpreter is so stripped down that it doesn't allow multi-line inputs. However, tiddlylisp will allow you to load multi-line expressions from a file - try saving the above three lines in a file named area.tl, and then running that file with python tiddlylisp.py area.tl. Tiddlylisp will execute the code, defining the area procedure, and then leave you in the interpreter, where you can type things like:

tiddlylisp> (area 3)
28.26

A second advantage of using begin in the above code is that the variable pi is only defined in the inner environment associated to the lambda expression. If, for some reason, you want to define pi differently outside that scope, you can do so, and it won't be affected by the definition of area. Consider, for example, the following sequence of expressions:

tiddlylisp> (define pi 3)
tiddlylisp> pi
3
tiddlylisp> (define area (lambda (r) (begin (define pi 3.14) (* pi (* r r)))))
tiddlylisp> (area 1)
3.14
tiddlylisp> pi
3

In other words, the value of the final pi is returned from the outer (global) environment, not the inner environment created during the definition of the procedure area.

Earlier, we discussed define and set!, and wondered whether there was really any essential difference. Consider now the following example, where we modify area by using set! instead of define:

tiddlylisp> (define pi 3)
tiddlylisp> pi
3
tiddlylisp> (define area (lambda (r) (begin (set! pi 3.14) (* pi (* r r)))))
tiddlylisp> (area 1)
3.14
tiddlylisp> pi
3.14

As you can see by comparing the final line of this example to our earlier example, there really is a significant difference between define and set!. In particular, when we define area in this example, what set! does is check to see whether the inner environment contains a variable named pi. Since it doesn't, it checks the outer environment, where it does find such a variable, and that's what set! updates. If we'd used define instead it would have created a completely new variable named pi in the inner environment, while leaving pi in the outer environment untouched, so the final line would have returned 3 instead. So having both define and set! gives us quite a bit of flexibility and control over which environment is being used, at the expense of a complication in syntax.

As our final piece of Lisp before putting what we've learnt together to construct a nontrivial example, tiddlylisp includes a keyword if that can be used to test conditions, and conditionally return the value of different expressions. Here's an example showing the use of if to evaluate the absolute value of a variable:

tiddlylisp> (define x -2)
tiddlylisp> (if (> x 0) x (- 0 x))
2

We can use the same idea to define a procedure abs which returns the absolute value of a number:

tiddlylisp> (define abs (lambda (x) (if (> x 0) x (- 0 x))))
tiddlylisp> (abs -2)
2

The general form for if expressions is (if cond result alt), where we evaluate the expression cond, and if the value is True, return the value of result, otherwise return the value of alt. Note that in the expression (if cond result alt), the cond, result and alt of course shouldn't be read literally. Rather, they are placeholders for other expressions, as we saw in the absolute value example above. Through the remainder of this essay I will use use italics to indicate such placeholder expressions.

Let me conclude this section by briefly introducing two important Lisp concepts. The first is the concept of a special form. In this section we discussed several built-in procedures, such as + and *, and also some user-defined procedures. Calls to such procedures always have the form (proc exp1 exp2...), where proc is the procedure name, and the other arguments are expressions. Such expressions are evaluated by evaluating each individual expression exp1, exp2..., and then passing those values to the procedure. However, not all Lisp expressions are procedure calls. Consider the expression (define pi 3.14). This isn't evaluated by calling some procedure define with arguments given by the value of pi and the value of 3.14. It can't be evaluated this way because pi doesn't have a value yet! So define isn't a procedure. Instead, define is an example of what is known as a special form. Other examples of special forms include lambda, begin, and if, and a few more which we'll meet later. Like define, none of these is a procedure, but instead each special forms has its own special rule for evaluation.

The second concept I want to briefly introduce is that of a list. The list is one of the basic data structures used in Lisp, and even gives the language its name - Lisp is short for "list processing". In fact, most of the Lisp expressions we've seen up to now are lists: an expression such as (abs 2) is a two-element list, with elements abs and 2, delimited by spaces and parentheses. A more complex expression such as (define abs (lambda (x) (if (> x 0) x (- 0 x)))) is also a list, in this case with the first two elements being define and abs. The third element, (lambda (x) (if (> x 0) x (- 0 x))), is a sublist which in turn has sublists (and subsublists) of its own. Later we'll see how to use Lisp to do manipulations with such lists.

A nontrivial example: square roots using only elementary arithmetic

Let's put together the ideas above to do something nontrivial. We'll write a short tiddlylisp program to compute square roots, using only the elementary arithmetical operations (addition, subtraction, multiplication, and division). The idea behind the program is to use Newton's method. Although Newton's method is interesting in its own right, I'm not including this example because of its algorithmic elegance. Instead, I'm including it as a simple and beautiful example of a Lisp program. The example comes from Abelson and Sussman's book on the Structure and Interpretation of Computer Programs.

Here's how Newton's method for computing square roots works. Suppose we have a (positive) number x whose square root we wish to compute. Then we start by making a guess at the square root, guess. We're going to start by arbitrarily choosing 1.0 as our initial value for guess; in principle, any positive number will do. According to Newton's method, we'll get an improved guess by computing (guess+x/guess)/2, i.e., by taking the average of guess and x/guess. If we repeat this averaging process enough times, we'll converge to a good estimate of the square root.

Let's express these ideas in Lisp code. We'll do it from the top down, starting at a high level, and gradually filling in the details of all the procedures we need. We start at the absolute top level,

(define sqrt (lambda (x) (sqrt-iter 1.0 x)))

Here, (sqrt-iter guess x) is the value of a to-be-defined procedure sqrt-iter that takes a guess guess at the square root of x, and keeps improving that guess over and over until it's a good enough estimate of the square root. As I mentioned above, we start by arbitrarily choosing guess = 1.0.

The core of the algorithm is the definition of sqrt-iter:

(define sqrt-iter 
    (lambda (guess x) 
      (if (good-enough? guess x) guess (sqrt-iter (improve guess x) x))))

What sqrt-iter does is takes the guess and checks to see whether it's yet a good-enough? approximation to the square root of x, in a sense to be made precise below. If it is, then sqrt-iter is done, and returns guess, otherwise it applies sqrt-iter to the improved guess (improve guess x) supplied by applying Newton's method.

Incidentally, the way I've written sqrt-iter above it's another example of a multi-line tiddlylisp expression, and so it's not possible to enter in the tiddlylisp interpreter. However, you can enter it as part of a file, sqrt.tl, along with the definition of sqrt, and other procedures which we'll define below. We'll later use tiddlylisp to execute the file sqrt.tl.

To fill out the details of the above, we need to understand how good-enough? and improve work. Let's start with good-enough?, which simply checks whether the absolute value of the square of guess is within 0.00001 of x:

 
(define good-enough? 
    (lambda (guess x) (< (abs (- x (square guess))) 0.00001)))

Here, we have defined the absolute value procedure abs and the squaring procedure square as:

 
(define abs (lambda (x) (if (< 0 x) x (- 0 x))))
(define square (lambda (x) (* x x)))

That's all the code we need for good-enough?. For improve we simply implement Newton's method,

(define improve (lambda (guess x) (average guess (/ x guess))))

where average is defined in the obvious way:

(define average (lambda (x y) (* 0.5 (+ x y))))

We can write out the whole program as follows:

(define average (lambda (x y) (* 0.5 (+ x y))))
(define improve (lambda (guess x) (average guess (/ x guess))))
(define square (lambda (x) (* x x)))
(define abs (lambda (x) (if (< 0 x) x (- 0 x))))
(define good-enough? 
    (lambda (guess x) (< (abs (- x (square guess))) 0.00001)))
(define sqrt-iter 
    (lambda (guess x) 
      (if (good-enough? guess x) guess (sqrt-iter (improve guess x) x))))
(define sqrt (lambda (x) (sqrt-iter 1.0 x)))

Save these lines to the file sqrt.tl, and then run it using python tiddlylisp.py sqrt.tl. Tiddlylisp will execute the code, defining all the above procedures, and then leave you in an interpreter, where you can type:

tiddlylisp> (sqrt 2.0)
1.41421568627

This is, indeed, a pretty good approximation to the true square root: 1.4142135....

In writing out the overall program sqrt.tl, I've reversed the order of the lines, compared to my initial explanation. The reason I've done this is worth discussing. You see, the program sqrt.tl was actually the first piece of Lisp code I ever worked though in detail. The natural way to think through the problem of computing square roots with Newton's method is in the order I explained, working in a top-down fashion, starting from the broad problem, and gradually breaking our attack on that problem down into parts. But the first time I wrote the code out I assumed that I needed to explain these ideas to Lisp in a bottom-up fashion, defining procedures like average before improve and so on, so that procedure definitions only include previously defined procedures. That meant reversing the order of the code, as I've shown above.

Taking this approach bugged me. It doesn't seem like the most natural way to think about implementing Newton's method. At least for this problem a top-down approach seems more natural, and if you were just doing exploratory programming you'd start by defining sqrt, then sqrt-iter, and so on. As an experiment, I decided to reverse the order of the progam, so it reflects the natural "thinking order":

(define sqrt (lambda (x) (sqrt-iter 1.0 x)))
(define sqrt-iter 
    (lambda (guess x) 
      (if (good-enough? guess x) guess (sqrt-iter (improve guess x) x))))
(define good-enough? 
    (lambda (guess x) (< (abs (- x (square guess))) 0.00001)))
(define abs (lambda (x) (if (< 0 x) x (- 0 x))))
(define square (lambda (x) (* x x)))
(define improve (lambda (guess x) (average guess (/ x guess))))
(define average (lambda (x y) (* 0.5 (+ x y))))

Somewhat to my surprise, this ran just fine ! (Actually, my code was slightly different, since I was using Common Lisp at that point, not tiddlylisp, but tiddlylisp works as well.) It's apparently okay to define a procedure in terms of some other procedure which isn't defined until later. Now, upon close examination of the code for tiddlylisp it turns out that this makes perfect sense. But it came as a surprise to me. Later, when we examine the code for tiddlylisp, I'll set a problem asking you to explain why it's okay to re-order the code above.

I think this program for the square root is a very beautiful program. Of course, in most programming languages the square root is built in, and so it's not exactly notable to have an elegant program for it! Indeed, the square root is built in to most version of Lisp, and it's trivially possible to add the square root to tiddlylisp. But what I like about the above is that it's such a simple and natural expression of Newton's method. As Peter Deutsch has said, Lisp programs come close to being "executable mathematics".

Problems for the author

  • As a general point about programming language design it seems like it would often be helpful to be able to define procedures in terms of other procedures which have not yet been defined. Which languages make this possible, and which do not? What advantages does it bring for a programming language to be able to do this? Are there any disadvantages?

A few more elements of Lisp

I began my introduction to Lisp by focusing on elementary arithmetical operations, and how to put them together. I did that so I could give you a concrete example of Lisp in action - the square root example in the last section - which is a good way of getting a feel for how Lisp works. But if we're going to understand Lisp as "the Maxwell's equations of software" then we also need to understand more about how Lisp deals with expressions and with lists. I'll describe those operations in this section.

Before I describe those operations, I want to discuss a distinction that I haven't drawn much attention to until now. That's the distinction between an expression such as (+ 1 2) and the value of that expression, which in this case is 3. When we feed a (valid) tiddlylisp expression to the tiddlylisp interpreter, it evaluates the expression, and returns its value.

When I was first learning Lisp I'd often get myself in trouble because I failed to keep myself clear on the distinction between an expression and its value. Let me give you an example of the kind of confusion I'd get myself into. There is a built-in Lisp procedure called atom?, defined so that (atom? exp) returns True if the value of the expression exp is atomic - a number, for example, or anything which isn't a list - and returns False otherwise. (Note that exp is a placeholder expression, as we discussed earlier, and shouldn't be read literally as the identifier exp.) Now, I'd look at an example like (atom? 5) and have no trouble figuring out that it evaluated to True. But where I'd get into trouble is with an expression like (atom? (+ 1 2)). I'd look at it and think it must return False, because the expression (+ 1 2) is not atomic, it's a list. Unfortunately, while it's true that the expression (+ 1 2) is not atomic, it's irrelevant, because the value of (atom? exp) is determined by whether the value of exp is atomic - which it is, since the value of (+ 1 2)) is 3. You'll be a much happier Lisper if you always keep very clear on the distinction between expressions and the value of expressions.

Now, of course, this distinction between expressions and their values appears in most other programming languages. For this reason, I felt foolish when I understood what was causing my confusion. But what makes it so easy to make this kind of mistake is that in Lisp both expressions and the value of those expressions can be lists. And that's why when I write something like (atom? (+ 1 2)) there's a real question: is atom? checking whether the expression (+ 1 2) is atomic (no, it's not, it's a list), or whether the value of the expression - 3 - is atomic (yes, it is). So you need to be pretty careful to keep clear on which is meant. To help with this, I'll be pretty pedantic about the distinction in what follows, writing out "the value of exp" explicitly, to distinguish the value from the expression itself. Note, however, that it's quite common for people to be less pedantic and more informal, so that something like (+ x 2) may well be described as "adding the variable x to the variable y", not the more explicit "adding the value of the variable x to the value of the variable y."

Alright, with that admonishment out of the way, let's turn our attention to defining the final set of elementary operations we'll need in tiddlylisp.

Returning an expression as a value: (quote exp) returns the expression exp as its value, without evaluating exp. This is most clearly demonstrated by an example,

tiddlylisp> (quote (+ 1 2))
(+ 1 2)

as opposed to the value of (+ 1 2), which of course is 3. Another example, just to reinforce the point that quote returns an expression literally, without evaluating it,

tiddlylisp> (define x 3)
tiddlylisp> (quote x)
x

not the value of x, which of course is 3. Note also that quoting doesn't evaluate nested subexpressions, e.g.,

tiddlylisp> (quote ((+ 1 2) 3))
((+ 1 2) 3)

quote can also be used to return lists which aren't valid Lisp expressions at all, e.g.,

tiddlylisp> (quote (1 2 3))
(1 2 3)

It's worth emphasizing that (1 2 3) isn't a valid Lisp expression, since 1 is neither a procedure nor a special form - if you enter (1 2 3) into the tiddlylisp interpreter, it will give an error. But what quote lets us do is use the list (1 2 3) as data. For example, we could use it to store the (1 2 3) in a variable,

tiddlylisp> (define x (quote (1 2 3)))
tiddlylisp> x
(1 2 3)

In this way, quote lets us work with lists as data structures.

Why does Lisp have quote? Most other computer programming languages don't have anything like it. The reason Lisp has it is because Lisp allows you to treat code as data, and data as code. So, for example, an object such as (+ 1 2) can be treated as code, i.e., as a Lisp expression to be evaluated, which is what we've been doing up until the current discussion. But (+1 2) can also potentially be treated as data, i.e., as a list of three objects. This ability to treat code and data on the same footing is a wonderful thing, because it means you can write programs to manipulate programs. But it also creates a problem, which is that we need to be able to distinguish when an expression should be treated as data, and when it should be treated as code. quote is a way of distinguishing between the two. It's similar to the way most languages use escape characters to deal with special strings such as ". quote is a way of saying "Hey, what follows should be treated as data, not evaluated as Lisp code". And so quote lets us escape Lisp code, so we can take an expression such as (+ 1 2) and turn it into data: (quote (+ 1 2). In this way, quote allows us to define Lisp expressions whose values are arbitrary Lisp expressions.

Up until now we've been focused on using Lisp to do simple arithmetic operations, and as a result we haven't needed quote. But when we get to Lisp-as-Maxwell's-equations we're going to be increasingly focused on using Lisp to manipulate Lisp code, and as a result we'll be making frequent use of quote. For this reason it's helpful to introduce shorthand for quote. In most Lisps, the conventional shorthand is to use 'exp to denote (quote exp), i.e. if exp is some Lisp expression, then the value of 'exp is just the expression exp itself. Unfortunately, using the shorthand 'exp complicated the parsing in the tiddlylisp interpreter more than I wanted. So I decided to instead use a different (and, I emphasize, unconventional) shorthand for (quote exp), namely (q exp). So our examples of quote above can be shortened to,

tiddlylisp> (q (+ 1 2))
(+ 1 2)
tiddlylisp> (define x 3)
tiddlylisp> (q x)
x
tiddlylisp> (q ((+ 1 2) 3))
((+ 1 2) 3)

Testing whether the value of an expression is atomic: As I noted above, (atom? exp) returns True if the value of the expression exp is atomic, and otherwise returns False. We've already discussed the following example,

tiddlylisp> (atom? (+ 1 2))
True

but it's illuminating to see it in tandem with the following example, which illustrates also the use of quote,

tiddlylisp> (atom? (q (+ 1 2)))
False

As above, the first example returns True because while (+ 1 2) is not atomic, its value, 3, is. But the second example returns False because the value of (q (+ 1 2)) is just (+ 1 2), which is a list, and thus not atomic.

By the way, I should mention that atom? is not a built-in procedure in the dialect of Lisp that tiddlylisp based on, Scheme. I've built atom? into tiddlylisp because, as we'll see, the analogous operation is used many times in the code on page 13 of the LISP 1.5 Programmer's Manual. Of course, an atom? procedure is easily defined in Scheme, but for the purposes of understanding the code on page 13 it seemed most direct to simply include atom? as a built-in in tiddlylisp.

Testing whether two expressions both evaluate to the same atom (or the empty list): (eq? exp1 exp2) returns True if the values of exp1 and exp2 are both the same atom, or both the empty list, (). (eq? exp1 exp2) returns False otherwise. Note that if exp1 and exp2 have the same value, but are not atomic or the empty list, then (eq? exp1 exp2) returns False. For example,

tiddlylisp> (eq? 2 (+ 1 1))
True
tiddlylisp> (eq? 3 (+ 1 1))
False
tiddlylisp> (eq? (q (1 2)) (q (1 2)))
False

As for atom? my explanation of eq? is not quite the same as in standard Scheme, but instead more closely matches the function eq defined in the LISP 1.5 Programmer's Manual.

Getting the first item of a list: (car exp) returns the first element of the value of exp, provided the value of exp is a list. Otherwise (car exp) is undefined. For example,

tiddlylisp> (car (q (+ 2 3)))
+
tiddlylisp> (car (+ 2 3))

The first of these two behaves as expected: the value of (q (+ 2 3) is just the list (+ 2 3), and so car returns the first element, which is +. In the second, though, car is not defined, and tiddlylisp returns an error message, which I've elided. The reason it returns an error message is because the value of (+ 2 3) is 5, which is not a list, and so car is undefined. In a similar way, suppose we try

tiddlylisp> (car (1 2 3))

Again, we get an error message. The reason is that car is being applied to the value of (1 2 3), considered as a Lisp expression. And, of course, that value is undefined, since 1 is neither a procedure nor a special form. The right way to do the above is to quote the list first,

tiddlylisp> (car (q (1 2 3)))
1

Once again, we see how quote is used to make it clear to tiddlylisp that we're dealing with data, not code to be evaluated.

Getting the rest of a list: (cdr exp) returns a list containing all but the first element of the value of exp. Of course, the value of exp must be a list, otherwise (cdr exp) is undefined. For example,

tiddlylisp> (cdr (q (1 2 3)))
(2 3)

According to Wikipedia the names car and cdr have their origin in some pretty esoteric facts about the early implementations of Lisp. The details don't much matter here - car stands for "contents of address part of register", while cdr stands for "contents of decrement part of register" - I just wanted to make the point that you could reasonably be wondering "Where on Earth did those names come from?!" A mnemonic I find useful in distinguishing the two is to focus on the difference between the names of the two procedures, which of course is just the middle letter - a or d - and to keep in mind that a comes before d in the alphabet, just as car extracts the element of a list that comes before the remainder of the list, as given to us by cdr. Your taste in mnemonics may vary - if you don't like mine, it's still worth taking a minute or two to come up with some trick for remembering and distinguishing car and cdr. Of course, after a little practice you get used to them, and you won't need the mnemonic any more, but at first it's helpful.

Appending an item at the start of a list: Provided the value of exp2 is a list, then (cons exp1 exp2) returns a list containing the value of exp1 as its first element, followed by all the elements of the value of exp2. For example,

tiddlylisp> (cons 1 (q (2 3)))
(1 2 3)

Testing whether an expression evaluates to the empty list: (null? exp) returns True if exp evaluates to the empty list, (), and otherwise returns False. For example,

tiddlylisp> (null? (cdr (q (1))))
True

since (cdr (q (1))) returns the empty list.

Conditionals: (cond (p1 e1)...(pn en)): This starts by evaluating the expression p1. If p1 evaluates to True, then evaluate the expression e1, and return that value. If not, evaluate p2, and if it is True, return the value of e2, and so on. If none of the pj evaluates to True, then the (cond ...) expression is undefined.

That's all the Lisp we're going to need to write our version of Lisp-as-Maxwell's-equations, i.e., our version of the code on page 13 of the LISP Manual! In fact, as we'll see shortly, it's more than we need - I included a few extra features so that we could work through examples like sqrt, and also simply for fun, to make tiddlylisp a richer and more interesting language to play with. Of course, what I've described is merely a subset (with some variations) of our chosen dialect of Lisp (Scheme), and there are important missing ideas. To learn more about Lisp or Scheme, please consult the suggestions for further reading at the end of this essay.

An interpreter for Lisp

Now that we've worked through the basic elements of Lisp, let's write a simple Lisp interpreter, using Python. The interpreter we'll write is based on Peter Norvig's lispy interpreter, and I highly recommend you read through his explanation. I've given the program we'll discuss a separate name - tiddlylisp - so as to make it easy to refer to separately from Norvig's interpreter, but please keep in mind that most of the code we're discussing is Norvig's. However, we're going to examine the code from a different angle than Norvig: we're going to take a bit more of a bottom-up computer's-eye view, looking at how the code executes.

We'll look at the code piece by piece, before putting it all together. Let's start with the interactive interpreter, which is run when you start up. We implement this with a Python procedure called repl, meaning to read some input, evaluate the expression, print the result of the evaluation, and then loop back to the beginning. This is also know as the read-eval-print loop, or REPL. Here's the repl procedure, together with a procedure for handling errors when they occur:

import traceback

def repl(prompt='tiddlylisp> '):
    "A prompt-read-eval-print loop."
    while True:
        try:
            val = eval(parse(raw_input(prompt)))
            if val is not None: print to_string(val)
        except KeyboardInterrupt:
            print "\nExiting tiddlylisp\n"
            sys.exit()
        except:
            handle_error()

def handle_error():
    """
    Simple error handling for both the repl and load.
    """
    print "An error occurred.  Here's the Python stack trace:\n"
    traceback.print_exc()

The core of repl is in the try clause, and we'll get to how that works shortly. Before we look at that, note that if the user presses Ctrl-C, it raises the KeyboardInterrupt exception, which causes the program to exit. If an error occurs during the try block -- say, due to a syntax error in the Lisp expression being parsed, or due to a bug in tiddlylisp itself - then some other exception will be raised. Tiddlylisp doesn't deal very well with errors - it simply announces that an error has occurred, and prints the Python stack trace, which may give you a few hints about what's gone wrong, but which is obviously a long way short of truly informative error handling! After printing the stack trace, tiddlylisp simply returns to the prompt. This type of error handling could easily be improved, but we're not going to invest any effort in this direction.

Let's look at the try clause. It begins by taking input at the prompt, and then passing it to the function parse, which converts the string entered by the user into an internal representation, i.e., a data structure that's more convenient for our Python program to work with than a string. Here's an example which shows how it works:

parse("(* (+ 7 12) (- 8 6))") = ["*", ["+", 7, 12], ["-", 8, 6]]

In other words, nested Lisp lists become Python sublists, and things like procedures and numbers become elements in a Python list.

We'll look shortly at how parse is implemented. For now, let's finish understanding how repl works. The output of parse is passed to the function eval, which evaluates the internal representation of the expression entered by the user. Provided no error occurs, the result is returned in val. However, val is in the format of the internal representation, and so we need to convert it from that internal representation back into a printable Lisp expression, using to_string.

At this point, we've got three functions to understand the details of: parse, eval and to_string. I'll explain them out of order, starting with parse and to_string, since they're extremely similar. Then we'll get to eval.

Alright, let's understand how parse works. Without further ado, here's the code; for the explanation, see below:

Symbol = str

def parse(s):
    "Parse a Lisp expression from a string."
    return read_from(tokenize(s))

def tokenize(s):
    "Convert a string into a list of tokens."
    return s.replace('(',' ( ').replace(')',' ) ').split()

def read_from(tokens):
    "Read an expression from a sequence of tokens."
    if len(tokens) == 0:
        raise SyntaxError('unexpected EOF while reading')
    token = tokens.pop(0)
    if '(' == token:
        L = []
        while tokens[0] != ')':
            L.append(read_from(tokens))
        tokens.pop(0) # pop off ')'
        return L
    elif ')' == token:
        raise SyntaxError('unexpected )')
    else:
        return atom(token)

def atom(token):
    "Numbers become numbers; every other token is a symbol."
    try: return int(token)
    except ValueError:
        try: return float(token)
        except ValueError:
            return Symbol(token)

The parsing is easy to understand. tokenize first inserts spaces on either side of any parentheses, and then splits the string around spaces, returning a list of tokens, i.e., the non-space substrings. read_from takes that list and removes the parentheses, instead nesting sublists as indicated by the original parenthesis structure. And, finally, atom turns tokens into Python ints, floats or Symbols (strings, by definition), as appropriate.

That's all there is to parse. If tiddlylisp were a little more powerful then parse would need to be more complex. For example, if we allowed strings as first-class objects in the language, then it would not work to tokenize by splitting around spaces, since that would risk splitting a string into separate tokens. This is the kind of thing that'd be fun to include in an extended version of tiddlylisp (and I've included it as a problem later in this section), but which we don't need in a first version.

Let's look now at how to_string works. It's much simpler, quickly undoing the steps taken in parsing:

isa = isinstance

def to_string(exp):
    "Convert a Python object back into a Lisp-readable string."
    if not isa(exp, list):
        return str(exp)
    else:
        return '('+' '.join(map(to_string, exp))+')'

In other words, if the internal representation of the expression is not a list, then return an appropriate stringified version (this takes care of the fact that we may have ints, floats and, as we shall see, Booleans in the internal representation). If it is a list, then return whatever we get by applying to_string to all the elements of that list, with appropriate delimiting by whitespace and parentheses.

At this point, the main thing we need to complete tiddlylisp is the eval function. Actually, that's not quite true: as we discussed earlier, tiddlylisp also keeps track of a global environment (and possibly one or more inner environments), to store variable and procedure names, and their values. eval is going to make heavy use of the environments, and so it helps to look first at how environments are defined. They're pretty simple: an environment has a bunch of keys, representing the names of the variables and procedures in that environment, and corresponding values for those keys, which are just the values for the variables or procedures. An environment also keeps track of its outer environment, with the caveat that the global environment has an outer environment set to Python's None. Such environments are easily implemented as a subclass of Python dictionaries:

class Env(dict):
    "An environment: a dict of {'var':val} pairs, with an outer Env."

    def __init__(self, params=(), args=(), outer=None):
        self.update(zip(params, args))
        self.outer = outer

    def find(self, var):
        "Find the innermost Env where var appears."
        return self if var in self else self.outer.find(var)

As you can see, the only modifications of the dictionary class are that: (1) an environment also keeps track of its own outer environment; and (2) an environment can determine whether a variable or procedure name appears in its list of keys, and if it doesn't, then it looks to see if it's in its outer environment. As a result, the find method returns the innermost environment where the variable or procedure name appears.

Note, incidentally, that the environment doesn't distinguish between variable and procedure names. Indeed, as we'll see, tiddlylisp treats user-defined procedures and variables in the same way: a procedure is a variable which just happens to take the value of a lambda expression as its value.

Tiddlylisp starts off operating in a particular global environment, and this too must be defined by our program. We'll do this by creating an instance of the class Env, and calling a function to add some built-in procedure definitions and variables:

def add_globals(env):
    "Add some built-in procedures and variables to the environment."
    import operator
    env.update(
        {'+': operator.add,
         '-': operator.sub, 
         '*': operator.mul, 
         '/': operator.div, 
         '>': operator.gt, 
         '<': operator.lt, 
         '>=': operator.ge, 
         '<=': operator.le, 
         '=': operator.eq
         })
    env.update({'True': True, 'False': False})
    return env

global_env = add_globals(Env())

Incidentally, in tiddlylisp's version of add_globals I decided to strip out many of the built-in procedures which Norvig includes in lispy's global environment - it's instructive to look at Norvig's code for add_globals to see just how easy it is to add more built-in procedures to tiddlylisp. If you want to do some exploratory programming with tiddlylisp then you should probably copy some of Norvig's additional built-in procedures (and perhaps add some of your own). For us, though, the above procedures are enough.

One notable feature of the global environment is the variables named True and False, which evaluate to Python's Boolean True and False, respectively. This isn't standard in Scheme (or most other Lisps), but I've done it because it ensures that we can use the strings True and False, and get the appropriate internal representation.

With the global environment set up, we can now understand how eval works. The code is extremely simple, simply enumerating the different types of expressions we might be evaluating, and reading from or modifying the environment, as appropriate. It's worth reading (and rereading) the code in detail, until you understand exactly how eval works. I also have a few comments at the end. Here's the code:

isa = isinstance

def eval(x, env=global_env):
    "Evaluate an expression in an environment."
    if isa(x, Symbol):              # variable reference
        return env.find(x)[x]
    elif not isa(x, list):          # constant literal
        return x                
    elif x[0] == 'quote' or x[0] == 'q': # (quote exp), or (q exp)
        (_, exp) = x
        return exp
    elif x[0] == 'atom?':           # (atom? exp)
        (_, exp) = x
        return not isa(eval(exp, env), list)
    elif x[0] == 'eq?':             # (eq? exp1 exp2)
        (_, exp1, exp2) = x
        v1, v2 = eval(exp1, env), eval(exp2, env)
        return (not isa(v1, list)) and (v1 == v2)
    elif x[0] == 'car':             # (car exp)
        (_, exp) = x
        return eval(exp, env)[0]
    elif x[0] == 'cdr':             # (cdr exp)
        (_, exp) = x
        return eval(exp, env)[1:]
    elif x[0] == 'cons':            # (cons exp1 exp2)
        (_, exp1, exp2) = x
        return [eval(exp1, env)]+eval(exp2,env)
    elif x[0] == 'cond':            # (cond (p1 e1) ... (pn en))
        for (p, e) in x[1:]:
            if eval(p, env): 
                return eval(e, env)
    elif x[0] == 'null?':           # (null? exp)
        (_, exp) = x
        return eval(exp,env) == []
    elif x[0] == 'if':              # (if test conseq alt)
        (_, test, conseq, alt) = x
        return eval((conseq if eval(test, env) else alt), env)
    elif x[0] == 'set!':            # (set! var exp)
        (_, var, exp) = x
        env.find(var)[var] = eval(exp, env)
    elif x[0] == 'define':          # (define var exp)
        (_, var, exp) = x
        env[var] = eval(exp, env)
    elif x[0] == 'lambda':          # (lambda (var*) exp)
        (_, vars, exp) = x
        return lambda *args: eval(exp, Env(vars, args, env))
    elif x[0] == 'begin':           # (begin exp*)
        for exp in x[1:]:
            val = eval(exp, env)
        return val
    else:                           # (proc exp*)
        exps = [eval(exp, env) for exp in x]
        proc = exps.pop(0)
        return proc(*exps)

Mostly this is self-explanatory. But allow me to draw your attention to how Norvig deals with anonymous procedure definitions using lambda. When I first examined his code I wondered how he'd cope with this, and expected it would be quite complex. But as you can see, it is extremely simple: lambda expressions evaluate to the appropriate anonymous Python function, with a new environment modified by the addition of the appropriate variable keys, and their values. Beautiful!

Tiddlylisp is essentially complete at this point. It's convenient to finish off the program by providing two ways of running tiddlylisp: either in an interactive interpreter mode, i.e., the REPL, or by loading a tiddlylisp program stored in a separate file. To start the REPL, we'll simply run python tiddlylisp.py. To load and execute a file, we'll run python tiddlylisp.py filename. After execution, we'd like to be dropped into the REPL so we can inspect results and do further experiments. The main complication in doing this is the need to load tiddlylisp code which is split over multiple lines. We do this by merging lines until the number of opening and closing parentheses match. Here's the code - it's best to start at the bottom, with the code immediately after if __name__ == "__main__":

import sys

def load(filename):
    """
    Load the tiddlylisp program in filename, execute it, and start the
    repl.  If an error occurs, execution stops, and we are left in the
    repl.  Note that load copes with multi-line tiddlylisp code by
    merging lines until the number of opening and closing parentheses
    match.
    """
    print "Loading and executing 
    f = open(filename, "r")
    program = f.readlines()
    f.close()
    rps = running_paren_sums(program)
    full_line = ""
    for (paren_sum, program_line) in zip(rps, program):
        program_line = program_line.strip()
        full_line += program_line+" "
        if paren_sum == 0 and full_line.strip() != "":
            try:
                val = eval(parse(full_line))
                if val is not None: print to_string(val)
            except:
                handle_error()
                print "\nThe line in which the error occurred:\n
                break
            full_line = ""
    repl()

def running_paren_sums(program):
    """
    Map the lines in the list program to a list whose entries contain
    a running sum of the per-line difference between the number of '('
    parentheses and the number of ')' parentheses.
    """
    count_open_parens = lambda line: line.count("(")-line.count(")")
    paren_counts = map(count_open_parens, program)
    rps = []
    total = 0
    for paren_count in paren_counts:
        total += paren_count
        rps.append(total)
    return rps

if __name__ == "__main__":
    if len(sys.argv) > 1: 
        load(sys.argv[1])
    else: 
        repl()

That completes the code for tiddlylisp! A grand total of 153 lines of non-comment, non-whitespace code. Here it all is, in one big block (commented and slightly reordered), so you can see how the pieces fit together:

#### tiddlylisp.py
#
# Based on Peter Norvig's lispy (http://norvig.com/lispy.html),
# copyright by Peter Norvig, 2010.
#
# Adaptations by Michael Nielsen.  See
# http://michaelnielsen.org/ddi/lisp-as-the-maxwells-equations-of-software/

import sys
import traceback

#### Symbol, Env classes

Symbol = str

class Env(dict):
    "An environment: a dict of {'var':val} pairs, with an outer Env."

    def __init__(self, params=(), args=(), outer=None):
        self.update(zip(params, args))
        self.outer = outer

    def find(self, var):
        "Find the innermost Env where var appears."
        return self if var in self else self.outer.find(var)

def add_globals(env):
    "Add some built-in procedures and variables to the environment."
    import operator
    env.update(
        {'+': operator.add,
         '-': operator.sub, 
         '*': operator.mul, 
         '/': operator.div, 
         '>': operator.gt, 
         '<': operator.lt, 
         '>=': operator.ge, 
         '<=': operator.le, 
         '=': operator.eq
         })
    env.update({'True': True, 'False': False})
    return env

global_env = add_globals(Env())

isa = isinstance

#### eval

def eval(x, env=global_env):
    "Evaluate an expression in an environment."
    if isa(x, Symbol):              # variable reference
        return env.find(x)[x]
    elif not isa(x, list):          # constant literal
        return x                
    elif x[0] == 'quote' or x[0] == 'q': # (quote exp), or (q exp)
        (_, exp) = x
        return exp
    elif x[0] == 'atom?':           # (atom? exp)
        (_, exp) = x
        return not isa(eval(exp, env), list)
    elif x[0] == 'eq?':             # (eq? exp1 exp2)
        (_, exp1, exp2) = x
        v1, v2 = eval(exp1, env), eval(exp2, env)
        return (not isa(v1, list)) and (v1 == v2)
    elif x[0] == 'car':             # (car exp)
        (_, exp) = x
        return eval(exp, env)[0]
    elif x[0] == 'cdr':             # (cdr exp)
        (_, exp) = x
        return eval(exp, env)[1:]
    elif x[0] == 'cons':            # (cons exp1 exp2)
        (_, exp1, exp2) = x
        return [eval(exp1, env)]+eval(exp2,env)
    elif x[0] == 'cond':            # (cond (p1 e1) ... (pn en))
        for (p, e) in x[1:]:
            if eval(p, env): 
                return eval(e, env)
    elif x[0] == 'null?':           # (null? exp)
        (_, exp) = x
        return eval(exp,env) == []
    elif x[0] == 'if':              # (if test conseq alt)
        (_, test, conseq, alt) = x
        return eval((conseq if eval(test, env) else alt), env)
    elif x[0] == 'set!':            # (set! var exp)
        (_, var, exp) = x
        env.find(var)[var] = eval(exp, env)
    elif x[0] == 'define':          # (define var exp)
        (_, var, exp) = x
        env[var] = eval(exp, env)
    elif x[0] == 'lambda':          # (lambda (var*) exp)
        (_, vars, exp) = x
        return lambda *args: eval(exp, Env(vars, args, env))
    elif x[0] == 'begin':           # (begin exp*)
        for exp in x[1:]:
            val = eval(exp, env)
        return val
    else:                           # (proc exp*)
        exps = [eval(exp, env) for exp in x]
        proc = exps.pop(0)
        return proc(*exps)

#### parsing

def parse(s):
    "Parse a Lisp expression from a string."
    return read_from(tokenize(s))

def tokenize(s):
    "Convert a string into a list of tokens."
    return s.replace('(',' ( ').replace(')',' ) ').split()

def read_from(tokens):
    "Read an expression from a sequence of tokens."
    if len(tokens) == 0:
        raise SyntaxError('unexpected EOF while reading')
    token = tokens.pop(0)
    if '(' == token:
        L = []
        while tokens[0] != ')':
            L.append(read_from(tokens))
        tokens.pop(0) # pop off ')'
        return L
    elif ')' == token:
        raise SyntaxError('unexpected )')
    else:
        return atom(token)

def atom(token):
    "Numbers become numbers; every other token is a symbol."
    try: return int(token)
    except ValueError:
        try: return float(token)
        except ValueError:
            return Symbol(token)

def to_string(exp):
    "Convert a Python object back into a Lisp-readable string."
    if not isa(exp, list):
        return str(exp)
    else:
        return '('+' '.join(map(to_string, exp))+')'         

#### Load from a file and run

def load(filename):
    """
    Load the tiddlylisp program in filename, execute it, and start the
    repl.  If an error occurs, execution stops, and we are left in the
    repl.  Note that load copes with multi-line tiddlylisp code by
    merging lines until the number of opening and closing parentheses
    match.
    """
    print "Loading and executing 
    f = open(filename, "r")
    program = f.readlines()
    f.close()
    rps = running_paren_sums(program)
    full_line = ""
    for (paren_sum, program_line) in zip(rps, program):
        program_line = program_line.strip()
        full_line += program_line+" "
        if paren_sum == 0 and full_line.strip() != "":
            try:
                val = eval(parse(full_line))
                if val is not None: print to_string(val)
            except:
                handle_error()
                print "\nThe line in which the error occurred:\n
                break
            full_line = ""
    repl()

def running_paren_sums(program):
    """
    Map the lines in the list program to a list whose entries contain
    a running sum of the per-line difference between the number of '('
    parentheses and the number of ')' parentheses.
    """
    count_open_parens = lambda line: line.count("(")-line.count(")")
    paren_counts = map(count_open_parens, program)
    rps = []
    total = 0
    for paren_count in paren_counts:
        total += paren_count
        rps.append(total)
    return rps

#### repl

def repl(prompt='tiddlylisp> '):
    "A prompt-read-eval-print loop."
    while True:
        try:
            val = eval(parse(raw_input(prompt)))
            if val is not None: print to_string(val)
        except KeyboardInterrupt:
            print "\nExiting tiddlylisp\n"
            sys.exit()
        except:
            handle_error()

#### error handling

def handle_error():
    """
    Simple error handling for both the repl and load.
    """
    print "An error occurred.  Here's the Python stack trace:\n"
    traceback.print_exc()

#### on startup from the command line

if __name__ == "__main__":
    if len(sys.argv) > 1: 
        load(sys.argv[1])
    else: 
        repl()

Problems

  • Modify tiddlylisp so that the + procedure can be applied to any number of arguments, e.g., so that (+ 1 2 3) evaluates to 6.

  • Earlier we implemented a square root procedure in tiddlylisp. Can you add it directly to tiddlylisp, using the Python math module's sqrt function?
  • In our earlier implementation of the sqrt procedure we discussed the ordering of the lines of code, and whether it's okay to define a procedure in terms of some other yet-to-be-defined procedure. Examine the code for tiddlylisp, and explain why it's okay for a procedure such as sqrt to be defined in terms of a procedure such as sqrt-iter which isn't defined until later. Try doing the same thing with variables, e.g., try running (define x y) followed by (define y 1). Does this work? If so, why? If not, why not?
  • Modify tiddlylisp so that when applied to one argument the - procedure simply negates it, e.g., (- 2) returns -2, while - still computes differences when applied to two arguments.
  • Is it possible to write a pure tiddlylisp procedure minus so that (minus x) returns -x, while (minus x y) returns x-y?
  • In the discussion where we introduced cond I stated that (cond (p1 e1)...(pn en)) is undefined when none of the expressions p1...pn evaluate to True. What does Tiddlylisp return in this case? Can you think of a better way of dealing with this situation?
  • Can you add support for strings to tiddlylisp?

When I first examined Norvig's code for lispy, I was surprised by just how much I learned from his code. Of course, I expected to learn quite a bit - I am just a beginner at Lisp - but what I learned greatly exceeded my expectations. Why might writing an interpreter deepen our understanding of a programming language? I think the answer has to do with how we understand abstractions. Consider the way I first explained the concept of Lisp environments, early in this essay: I gave a general discussion of the concept, and then related it to several of the examples we were working through. This is the usual way we cope with abstractions when learning (or teaching) a language: we make those abstractions concrete by working through code examples that illustrates the consequences of those abstractions. The problem is that although I can show you examples, the abstraction itself remains ephemeral.

Writing an interpreter is a way of making a programming language's abstractions concrete. I can show you a million examples illustrating consequences of the Lisp environment, but none will have quite the same concrete flavour as the code for our Python Lisp interpreter. That code shows explicitly how the environment can be represented as a data structure, how it is manipulated by commands such as define, and so on. And so writing an interpreter is a way of reifying abstractions in the programming language being interpreted.

Problems

  • I gave the example of the environment as a Lisp abstraction which is made more concrete when you understand the code for the interpreter. Another example of an abstraction is errors in code. Can you improve tiddlylisp's error handling so that we get something more informative than a Python stack trace when something goes wrong? One suggestion for how to do this is to identify two (or more) classes of error that may occur in tiddlylisp programs, and to modify the interpreter so it catches and gracefully handles those error classes, printing informative error messages.

On Peter Norvig's webpage describing his interpreter, a few commenters take him to task for writing his interpreter in Python. Here's an example to give you the flavour of these comments:

This code looks very nice, but i think that implementing a Lisp Interpreter in Python is some kind of cheating. Python is a high-level language, so you get very much for free.

Norvig replies:

You are right -- we are relying on many features of Python: call stack, data types, garbage collection, etc. The next step would be to show a compiler to some sort of assembly language. I think either the Java JVM or the Python byte code would be good targets. We'd also need a runtime system with GC. I show the compiler in my PAIP [Paradigms of Artificial Intelligence] book.

The commenter and Norvig are right, in some sense. But there's also a sense in which the Python interpreter achieves something that would not be achieved by a program that compiled Lisp to the Java JVM, or to assembler, or some other target closer to the bare metal. That's because of all programming languages, Python is one of the closest to an ordinary human language. And so writing a Lisp interpreter in Python is an exceptionally clear way of explaining how Lisp works to a human who doesn't yet understand the core concepts of Lisp. Insofar as I can guess at Norvig's intention, I believe the code for his interpreter is primarily intended to be read by humans, and the fact that it can also be read by a computer is a design constraint, not the fundamental purpose [3].

It seems to me that the kind of comment above arises because there are really three natural variations on the question "How to explain Lisp?". All three variations are interesting, and worth answering; all three have different answers. The first variation is how to explain Lisp to a person who doesn't yet know Lisp. As I've just argued, a good answer to this question is to work through some examples, and then to write a simple Python interpreter. The second variation is how to explain Lisp to a machine. That's the question the commenter on Norvig's blog is asking, and to answer that question nothing beats writing a Lisp interpreter (or compiler) that works close to the bare metal, say in assembler, requiring you to deal explicitly with memory allocation, garbage collection, and so on.

But there's also a third variation on the question. And that's how best to explain Lisp to someone who already understands the core concepts of Lisp. That sounds paradoxical: doesn't such a person, by definition, already understand Lisp? But it's not paradoxical at all. Consider the following experience which many people have when learning (or teaching) mathematics. The best way to explain a mathematical idea to someone new to the idea is using their old language and their old way of looking at the world. This is like explaining Lisp by writing a Python interpreter. But once the person has grasped a transformative new mathematical idea, they can often deepen their understanding by re-examining that idea within their new way of looking at the world. That re-examination can help crystallize a deeper understanding. In a similar way, while writing a Lisp interpreter in Python may be a good way of explaining Lisp to a person who doesn't yet understand Lisp, someone who grasps the core ideas of Lisp may find the Python interpreter a little clunky. How should we explain Lisp within the framework of Lisp itself? One answer to that question is to use Lisp to write a Lisp interpreter. It's to that task that we now turn.

Lisp in Lisp

How should we write a Lisp interpreter in Lisp? Let's think back to what Alan Kay saw at the bottom of page 13 of the LISP Manual:

Although it's written in a different notation than we've used, this is Lisp code. In fact, it's the core of a Lisp interpreter written in Lisp: the procedure evalquote takes a Lisp expression as input, and then returns the value of that expression. In this section we're going to use tiddlylisp to write an analogue to evalquote (we'll change the name to eval). Of course, such a procedure is not really a full interpreter - we won't have a read-eval-print loop, for one thing - but it's not difficult to extend our code to a full interpreter (it requires a few additions to tiddlylisp, too). For this reason, in what follows I'll refer to our eval procedure as an "interpreter", even though it's more accurate to say that it's the core of an interpreter. I haven't made the extension to a full interpreter here, partly because I don't want to lengthen an already long essay, but mostly because I want to stick to the theme of the "Maxwell's equations of software". For the same reasons, I've also limited our eval to interpreting only a subset of tiddlylisp, omitting the arithmetical operations and concentrating instead on procedures for manipulating lists.

My treatment in this section is based on a beautiful essay (postscript) by Paul Graham, in which he explains what the original designer of Lisp, John McCarthy, was up to in the paper where he introduced Lisp. In his essay, Graham writes a fully executable Lisp interpreter in one of the modern dialects of Lisp, Common Lisp, and I've based much of my code on Graham's. Perhaps the main difference in my treatment is that while Graham's eval is written to be run under Common Lisp, our eval is executable in tiddlylisp, an interpreter for Lisp that we've written ourselves (with lots of help from Peter Norvig!) So even though the code is very similar, the perspective is quite diferent, and I think we gain something from this different perspective.

The code we'll write is longer than what you see on page 13 of the LISP Manual. The reason is that the code on page 13 was not actually self-contained, but made use of several procedures defined earlier in the LISP Manual, and we need to include those procedures. The final result is still only a little over a page of code. Let's start by defining a few of those helper procedures.

(not exp) returns True if the expression exp evaluates to False, and otherwise returns False. For example,

tiddlylisp> (not (atom? (q (1 2))))
True
tiddlylisp> (not (eq? 1 (- 2 1)))
False

Here's the tiddlylisp code for not:

(define not (lambda (x) (if x False True)))

(append exp1 exp2) takes expressions exp1 and exp2 both of whose values are lists, and returns the list formed by concatenating those lists. For example,

tiddlylisp> (append (q (1 2 3)) (q (4 5)))
(1 2 3 4 5)

Here's the tiddlylisp code for append:

(define append (lambda (x y)
		 (if (null? x) y (cons (car x) (append (cdr x) y)))))

(pair exp1 exp2) returns a two-element list whose elements are the value of exp1 and the value of exp2:

tiddlylisp> (pair 1 2)
(1 2)
tiddlylisp> (pair (+ 1 2) 1)
(3 1)

Here's the tiddlylisp code for pair:

(define pair (lambda (x y) (cons x (cons y (q ()) ))))

Note that my use of pair is somewhat unconventional - the more usual approach in Lisp is to use (list exp1 exp2 exp3...) to construct a list whose values are just the values of the respective expressions. The reason I haven't done this is because tiddlylisp doesn't allow us to define Lisp procedures with a variable number of arguments. Note also that the procedure pair that I've defined should not be confused with one of Scheme's standard procedures, pair?, which has a different purpose, and which we won't use in the current essay.

Problems

  • Can you modify tiddlylisp so that (list exp1 exp2 exp3...) does indeed return a list whose values are just the values of the respective expressions?

I'll now introduce a class of helper procedures which are concatenations of two or more applications of car or cdr. An example is the procedure cdar, which applies car first, followed by cdr, that is, (cdar exp) has the same value as (cdr (car exp)). The notation cdar is a mnemonic, whose key elements are the middle two letters, d and a, indicating that cdar is what you get when you apply (in reverse order) cdr and car. You might wonder why it's reverse order - the answer is that reverse order corresponds to the visual syntactic order, that is, the order from left-to-right that the procedures appear in the expression (cdr (car exp)).

As another example, the procedure caar is defined so that (caar exp) has the same value as (car (car exp)). In our eval it'll be helpful to use several such procedures:

(define caar (lambda (x) (car (car x))))
(define cadr (lambda (x) (car (cdr x))))
(define cadar (lambda (x) (cadr (car x))))
(define caddr (lambda (x) (cadr (cdr x))))
(define caddar (lambda (x) (caddr (car x))))

Our next helper procedure is called pairlis. (pairlis exp1 exp2) takes expressions exp1 and exp2 whose values are lists of the same length, and returns a list which is formed by pairing the values of corresponding elements. For example,

tiddlylisp> (pairlis (q (1 2 3)) (q (4 5 6)))
((1 4) (2 5) (3 6))

Here's the tiddlylisp code for pairlis:

(define pairlis 
    (lambda (x y)
      (if (null? x)
	  (q ())
	  (cons (pair (car x) (car y)) (pairlis (cdr x) (cdr y))))))

We'll call a list of pairs such as that produced by pairlis an association list. It gets this name from our final helper procedure, the assoc procedure, which takes an association list and treats it as a lookup dictionary. The easiest way to explain what this means is through an example,

tiddlylisp> (define a (pairlis (q (1 2 3)) (q (4 5 6))))
tiddlylisp> a
((1 4) (2 5) (3 6))
tiddlylisp> (assoc 2 a)
5

In other words, assoc looks for the key 2 as the first entry in one of the pairs in the list which is the value of a. Once it finds such a pair, it returns the second element in the pair.

Stated more abstractly, suppose the expression exp1 has a value which appears as the first entry in one of the pairs in the association list which is the value of exp2. Then (assoc exp1 exp2) returns the second entry of that pair.

After all that explanation, the code for assoc is extremely simple, simpler even than pairlis:

(define assoc (lambda (x y)
		(if (eq? (caar y) x) (cadar y) (assoc x (cdr y)))))

I won't explain how assoc works, but if you're looking for a good exercise in applying caar and similar procedures, then it's worth spending some time to carefully understand how assoc works.

With all these helper procedures in place, we can now write our equivalent to the code on page 13 of the LISP Manual. This includes both the core procedure, eval, together with a couple of extra helper procedures, evcon and evlis. Here's the code:

(define eval 
    (lambda (e a)
      (cond
	((atom? e) (assoc e a))
	((atom? (car e))
	 (cond
	   ((eq? (car e) (q car))   (car (eval (cadr e) a)))
	   ((eq? (car e) (q cdr))   (cdr (eval (cadr e) a)))
	   ((eq? (car e) (q cons))  (cons (eval (cadr e) a) (eval (caddr e) a)))
	   ((eq? (car e) (q atom?)) (atom? (eval (cadr e) a)))
	   ((eq? (car e) (q eq?))   (eq? (eval (cadr e) a) (eval (caddr e) a)))
	   ((eq? (car e) (q quote)) (cadr e))
	   ((eq? (car e) (q q))     (cadr e))
	   ((eq? (car e) (q cond))  (evcon (cdr e) a))
	   (True                   (eval (cons (assoc (car e) a) (cdr e)) a))))
	((eq? (caar e) (q lambda))
	 (eval (caddar e) (append (pairlis (cadar e) (evlis (cdr e) a)) a))))))

(define evcon 
    (lambda (c a)
      (cond ((eval (caar c) a) (eval (cadar c) a))
	    (True              (evcon (cdr c) a)))))

(define evlis 
    (lambda (m a)
      (cond ((null? m) (q ()))
	    (True     (cons (eval (car m) a) (evlis (cdr m) a))))))

Before we examine how eval works, I want to give you some examples of eval in action. If you want, you can follow along with the examples by first loading the program defining eval into tiddlylisp (the full source is below), and then typing the examples into the interpreter.

To understand how to use eval in examples, we need to be clear about the meaning of its arguments. e is a Lisp expression whose value is the Lisp expression that we want to evaluate with eval. And a is a Lisp expression whose value is an association list, representing the environment. In particular, the first element of each pair in a is the name of a variable or procedure, and the second element is the value of that variable or procedure. I'll often refer to a just as the environment.

Suppose, for example, that we wanted to use eval to evaluate the expression (car (q (1 2))). We'll assume that we're evaluating it in the empty environment, that is, no variables or extra procedures have been defined. Then we'd need to pass eval expressions with values (car (q (1 2))) and (). We can do this by quoting those values:

tiddlylisp> (eval (q (car (q (1 2)))) (q ()))
1

As you can see, we get the right result: 1.

I explained in detail how to build up the expression (eval (q (car...) evaluated above. But if we hadn't gone through that explanation, then the expression would have appeared quite of complicated, with lots of quoting going on. The reason is that eval is evaluating an expression which is itself the value of another expression. With so much evaluation going on it's no wonder there's many q's floating around! But after working carefully through a few examples it all becomes transparent.

Here's an example showing how to use variables in the environment:

tiddlylisp> (eval (q (cdr x)) (q ((x (1 2 3)))))
(2 3)

Unpacking the quoting, we see that it's evaluating the expression (cdr x) in an environment with a variable x whose value is (1 2 3). The result is, of course, (2 3).

Here's an example showing how to use a procedure which has been defined in the environment:

tiddlylisp> (eval (q (cddr (q (1 2 3 4 5)))) (q ((cddr (lambda (x) (cdr (cdr x)))))))
(3 4 5)

In other words, the environment stores a procedure cddr whose value is (lambda (x) (cdr (cdr x))), and eval returns the result of applying cddr to an expression whose value is (1 2 3 4 5). Of course, this is just (3 4 5).

We can also use eval to define and evaluate an anonymous procedure, in this case one that has the same effect as cadr:

tiddlylisp> (eval (q ((lambda (x) (car (cdr x))) (q (1 2 3 4)))) (q ()))
2

A significant drawback of eval is that it has a pretty limited Lisp vocabulary. You can see this by running:

tiddlylisp> (eval (q (eq? 1 1)) (q (())))

The first line looks like perfectly valid Lisp - in fact, it is perfectly valid Lisp. The problem is that eval doesn't recognize 1 - at the level of sophistication we're working it really only understands lists, variables, and procedures. So what it tries to do is treat 1 as a variable or procedure to look up in the environment, a. But 1 isn't in the environment, which is why there's an error message.

Fixing this problem by modifying eval isn't terribly difficult [4]. However, to stay close to the LISP Manual, I'll leave this as is. A kludge to get around this issue is to add 1 as a key in the environment. For example, we can use:

tiddlylisp> (eval (q (eq? 1 1)) (q ((1 1))))
True
tiddlylisp> (eval (q (eq? 1 2)) (q ((1 1) (2 2))))
False

This is exactly as expected. We didn't see this problem in our earlier examples of eval, simply because they involved list manipulations which didn't require us to evaluate numbers such as 1. Incidentally, here's an amusing variation on the above kludge:

tiddlylisp> (eval (q (eq? 1 2)) (q ((1 1) (2 1))))
True

In other words, if we tell our interpreter emphatically enough that 1 = 2 then it will start to believe it!

Just to put eval through its paces, let's add a bundle of tests of basic functionality. It's not an exhaustive test suite, but at least checks that the basic procedures are working as we expect. You don't need to read through the following test code in exhaustive detail, although you should read at least the first few lines, to get a feeling for what's going on. Note that in a few of the lines we need to add something like 1 or 2 to the environment, in order that eval be able to evaluate it, as occurred in the example just above.

(define assert-equal (lambda (x y) (= x y)))

(define assert-not-equal (lambda (x y) (not (assert-equal x y))))

(assert-equal (eval (q x) (q ((x test-value))))
	      (q test-value))
(assert-equal (eval (q y) (q ((y (1 2 3)))))
	      (q (1 2 3)))
(assert-not-equal (eval (q z) (q ((z ((1) 2 3)))))
		  (q (1 2 3)))
(assert-equal (eval (q (quote 7)) (q ()))
	      (q 7))
(assert-equal (eval (q (atom? (q (1 2)))) (q ()))
	      False)
(assert-equal (eval (q (eq? 1 1)) (q ((1 1))))
	      True)
(assert-equal (eval (q (eq? 1 2)) (q ((1 1) (2 2))))
	      False)
(assert-equal (eval (q (eq? 1 1)) (q ((1 1))))
	      True)
(assert-equal (eval (q (car (q (3 2)))) (q ()))
	      (q 3))
(assert-equal (eval (q (cdr (q (1 2 3)))) (q ()))
	      (q (2 3)))
(assert-not-equal (eval (q (cdr (q (1 (2 3) 4)))) (q ()))
		  (q (2 3 4)))
(assert-equal (eval (q (cons 1 (q (2 3)))) (q ((1 1)(2 2)(3 3))))
	      (q (1 2 3)))
(assert-equal (eval (q (cond ((atom? x) (q x-atomic)) 
			     ((atom? y) (q y-atomic)) 
			     ((q True) (q nonatomic)))) 
		    (q ((x 1)(y (3 4)))))
	      (q x-atomic))
(assert-equal (eval (q (cond ((atom? x) (q x-atomic)) 
			     ((atom? y) (q y-atomic)) 
			     ((q True) (q nonatomic)))) 
		    (q ((x (1 2))(y 3))))
	      (q y-atomic))
(assert-equal (eval (q (cond ((atom? x) (q x-atomic)) 
			     ((atom? y) (q y-atomic)) 
			     ((q True) (q nonatomic)))) 
		    (q ((x (1 2))(y (3 4)))))
	      (q nonatomic))
(assert-equal (eval (q ((lambda (x) (car (cdr x))) (q (1 2 3 4)))) (q ()))
	      2)

In tiddlylisp, perhaps the easiest way to use this test code is to append it at the bottom of the file where we define eval. Then, when we load that file into memory, the tests run automatically. If everything is working properly, then all the tests should evaluate to True.

How does eval work? Looking back at the code, we see that it's just a big cond statement, whose value is determined by which of various conditions evaluate to True. The cond statement starts off:

      (cond
	((atom? e) (assoc e a))
        ...

To understand what this accomplishes, it is helpful to remember that what we're most interested in is the value of e, not e itself. Let's use e' to denote the value of e, i.e., e' is the Lisp expression that we actually want to evaluate using eval. Then what the condition above does is check whether e' is atomic, and if so it returns the value of the corresponding variable or procedure in the environment, exactly as we'd expect.

Let's look at the next line in the big outer conditional statement:

	((atom? (car e))

At this stage, we know that e' isn't atomic, since we already checked for that, and so e' must be a list. This line checks to see whether the first element of e' is itself an atom. If it is, then there are multiple possibilities: it could be a special form, such as quote, or a built-in procedure, such as car, or else a procedure that's defined in the environment. To check which of these possibilities is the case, we evaluate another (nested) conditional statement. This just checks off the different cases, for instance the first line of the nested conditional checks to see if we're applying the procedure car, and if so proceeds appropriately,

	   ((eq? (car e) (q car))   (car (eval (cadr e) a)))

In other words, if the first symbol in e' is car, then extract whatever expression is being passed to car, using (cadr e), then evaluate that expression using (eval (cadr e) a), and finally extract the first element, using (car (eval...)). That's exactly what we'd expect car to do. Most of the rest of this nested conditional statement works along similar lines, as you can check yourself. The final line is interesting, and deserves comment:

	   (True                   (eval (cons (assoc (car e) a) (cdr e)) a))))

This line is evaluated when the expression e' does not start with a special form or built-in procedure, but instead starts with the name of a procedure defined in the environment. To understand what is returned, note that (car e) retrieves the name of the procedure, so (assoc (car e) a) can retrieve the procedure from the environment, and then (cons (assoc (car e) a) (cdr e)) appends the arguments to the procedure. The whole thing is then evaluated. It's all quite simple and elegant!

Moving back into the outer cond statement, the final condition is as follows:

	((eq? (caar e) (q lambda))
	 (eval (caddar e) (append (pairlis (cadar e) (evlis (cdr e) a)) a))))))

This occurs when evaluating a quoted expression of the form ((lambda (x...|) exp). The first line simply checks that we are, indeed, seeing a lambda expression. The caddar e extracts the expression exp from the body of the lambda expression. We evaluate this in the context of an environment which has modified by appending some new variable names (extracted with cadar e), using pairlis to pair them with their values, which are evaluated using evlis (which you can work through yourself). Once again, it's all quite simple and neat - a fact which speaks to the marvellous elegance of the design presented in the LISP Manual (and, ultimately, due to John McCarthy).

It won't have escaped your attention that our Lisp eval is very similar to the eval we wrote earlier in Python. Tiddlylisp is somewhat different to the dialect of Lisp our eval interprets, but the implementation is recognizably similar. It is a matter of taste, but I think the Lisp implementation is more elegant. It's true that the Lisp code is superficially a little more complex - it relies more on concepts outside our everyday experience, such as the procedures caar, cadar, and so on. But it makes up for that by possessing a greater conceptual economy, in that we are using concepts such as car, cdr and cond to write an interpreter which understands those very same concepts.

Here's the full code for our Lisp interpreter in tiddlylisp. You should append the test code given above, and save it all as a single file, eval.tl.

(define caar (lambda (x) (car (car x))))
(define cadr (lambda (x) (car (cdr x))))
(define cadar (lambda (x) (cadr (car x))))
(define caddr (lambda (x) (cadr (cdr x))))
(define caddar (lambda (x) (caddr (car x))))

(define not (lambda (x) (if x False True)))

(define append (lambda (x y)
		 (if (null? x) y (cons (car x) (append (cdr x) y)))))

(define pair (lambda (x y) (cons x (cons y (q ()) ))))

(define pairlis 
    (lambda (x y)
      (if (null? x)
	  (q ())
	  (cons (pair (car x) (car y)) (pairlis (cdr x) (cdr y))))))

(define assoc (lambda (x y)
		(if (eq? (caar y) x) (cadar y) (assoc x (cdr y)))))

(define eval 
    (lambda (e a)
      (cond
	((atom? e) (assoc e a))
	((atom? (car e))
	 (cond
	   ((eq? (car e) (q car))   (car (eval (cadr e) a)))
	   ((eq? (car e) (q cdr))   (cdr (eval (cadr e) a)))
	   ((eq? (car e) (q cons))  (cons (eval (cadr e) a) (eval (caddr e) a)))
	   ((eq? (car e) (q atom?)) (atom? (eval (cadr e) a)))
	   ((eq? (car e) (q eq?))   (eq? (eval (cadr e) a) (eval (caddr e) a)))
	   ((eq? (car e) (q quote)) (cadr e))
	   ((eq? (car e) (q q))     (cadr e))
	   ((eq? (car e) (q cond))  (evcon (cdr e) a))
	   (True                   (eval (cons (assoc (car e) a) (cdr e)) a))))
	((eq? (caar e) (q lambda))
	 (eval (caddar e) (append (pairlis (cadar e) (evlis (cdr e) a)) a))))))

(define evcon 
    (lambda (c a)
      (cond ((eval (caar c) a) (eval (cadar c) a))
	    (True              (evcon (cdr c) a)))))

(define evlis 
    (lambda (m a)
      (cond ((null? m) (q ()))
	    (True     (cons (eval (car m) a) (evlis (cdr m) a))))))

It's instructive to compare our eval to what Kay saw on page 13 of the LISP 1.5 Programmer's Manual:

Obviously, what we've written is longer than that half-page! However, as I mentioned earlier, that half-page omitted the code for helper procedures such as caar, append, and so on, which were defined earlier in the LISP Manual. A more direct comparison is to our code for the eval, evcon and evlis procedures.

If you compare our code to the LISP Manual, a few differences jump out. The most obvious is that the LISP Manual's evalquote, apply and eval have all been combined into one procedure. This is a form of organization I adopted from Paul Graham's eval, and it makes it much easier to see what is going on, in the outer cond. In particular, the outer cond has a very simple structure: (1) if the expression we're tring to evaluate is an atom, return its value; otherwise (2) the expression must be a list, so check to see if the first element is an atom, in which case it must be a special form or procedure, and should be evaluated appropriately (this is the inner cond); and otherwise (3) we must be dealing with a lambda expression.

Condition (3) is interesting. With the syntax we're using, the condition in step (3) could simply be expressed as True, not (eq? (caar e) (q lambda), since it's the only remaining possibility. This would, in some sense, simplify (and speed up) the code. However, it would also make it harder to understand the intent of the code.

Something that you may note was present in the LISP Manual but which is missing from our eval is the special form label. label was used in the LISP Manual to give names to procedures, and so that procedure definitions could refer recursively to themselves. It's only a couple of lines to add back in, but I haven't done so. If you'd like, it's a fun challenge to add this functionality back in, and so I've given this as a problem below.

Problems

  • How could you add a facility to eval so that procedure definitions can refer to themselves? If you're having trouble with this problem, you can get a hint by looking at the code from page 13 of the LISP Manual. A complete solution to the problem may be found in Paul Graham's essay about the roots of Lisp.

This is a nice little interpreter. However, it has many limitations, even when compared to tiddlylisp. It can't do basic arithmetic, doesn't cope with integers, much less more complicated data types, it doesn't even have a way of defineing variables (after all, it doesn't return a modified environment). Still, it already contains many of the core concepts of Lisp, and it really is an executable counterpart to what Alan Kay saw on page 13 of the LISP 1.5 Manual.

What can we learn from this interpreter for Lisp in Lisp? As an intellectual exercise, it's cute, but beyond that, so what? Let's think about the analogous question for Python, i.e., writing a Python function that can return the value of Python expressions. In some sense, solving this problem is a trivial one-liner, since Python has a built-in function called eval that is capable of evaluating Python expressions.

What if, however, we eliminated eval (and similar functions) from Python? What then? Well, a Python version of eval would be much more complicated than our Lisp eval. Python is much less regular language than Lisp, and this makes it much more complicated for it to deal with Python code. By contrast, Lisp has an extremely simple syntax, and is designed to manipulate its own code as data. This is all reflected in the simplicity of the interpreter above.

Beyond this, the code for eval is a beautiful expression of the core ideas of Lisp, written in Lisp. It's true that our eval implements a very incomplete version of Lisp, but with just a little elaboration we can add support for arithmetic, more advanced control structures, and so on - everything needed to make this an essentially complete basic Lisp. And so we need only a little poetic license to say that, just as with Maxwell's equations and electromagnetism, there is a sense in which if you can look at this compact little program and understand all its consequences, then you understand all that Lisp can do. And because Lisp is universal, that means that inside these few lines of code is all a computer can do - everything from Space Invaders to computer models of climate to the Google search engine. In that sense this elegant little program really is the Maxwell's equations of software.

Problems

  • Outline a proof that (eval (q (exp)) a) returns the value of exp in the environment a for all expressions exp and environments a if and only if the underlying Lisp interpreter is correct. This little theorem can be considered a formal way of stating that eval contains all of Lisp. The reason I ask for an outline proof only is that various elements in the statement aren't defined as well as they need to be to make this a rigorous result; still, a compelling outline proof is possible.

  • Extend the code given for eval so that you can implement a full read-eval-print loop. This will require you to extend tiddlylisp so that it can cope with input and output, and (perhaps) some sort of looping.
  • Having worked through eval.tl, it should now be easy to work through the first chapter of the LISP 1.5 Programmer's Manual. Download the LISP manual and work through the first chapter, including the code on page 13.

Problems for the author

  • Is it possible to modify the above Lisp-in-Lisp so that it interprets all of tiddlylisp? Note that this will require modification of tiddlylisp.

Acknowledgements

Thanks to Jen Dodd for many helpful discussions.

Footnote

[1] I'm paraphrasing, since this was 17 years ago, but I believe I've reported the essence of the comments correctly. I've taken one liberty, which is in supplying my own set of examples (antennas, motors, and circuits), since I don't recall the examples he gave. Incidentally, his comments contain a common error that took me several years to sort out in my own thinking: Maxwell's equations actually don't completely specify electromagnetism. For that, you need to augment them with one extra equation, the Lorentz force law. It is, perhaps, unfair to characterize this as an error, since it's a common useage to equate Maxwell's equations with electromagnetism, and I've no doubt my professor was aware of the nuance. Nonetheless, while the useage is common, it's not correct, and you really do need the Lorentz force law as well. This nuance creates a slight quandary for me in this essay. As the title of the essay suggests, it explores a famous remark made by Alan Kay about what he called "the Maxwell's equations of software", and I presume that in making this remark Kay was following the common useage of equating the consequences of Maxwell's equations with the whole of electromagnetism. My quandary is this: on the one hand I don't wish to perpetuate this useage; on the other hand I think Kay's formulation is stimulating and elegant. So I'll adopt the same useage, but with the admonishment that you should read "Maxwell's equations" as synonomous with "Maxwell's equations plus the Lorentz force law". That set of five equations really does specify the whole of (classical) electromagnetism.

[2] I first encountered lambda in Python, not Lisp, but I believe I would have been perplexed for the same reason even if I'd first encountered it in Lisp.

[3] With a hat tip to Abelson and Sussman, who famously wrote "programs must be written for people to read, and only incidentally for machines to execute".

[4] The simplest solutions I can think of are: (1) to give eval the ability to determine when some key is not in the environment; or (2) to give eval the ability to recognize numbers. Both approaches seem to also require making some modifications to tiddlylisp.

Further reading

Much of Alan Kay's writing may be found at the website of the Viewpoints Research Institute. I also recommend browsing his list of recommended reading.

Lisp enjoys a plethora of insightful and beautifully written books and essays, many of them freely available online. This essay is, of course, based principally on Peter Norvig's essay on his basic Lisp interpreter, lispy. I've also drawn a few ideas from a followup essay of Norvig's which describes a more sophisticated Lisp interpreter. Both essays (and the accompanying code) are marvellously elegant, and well worth working through. Norvig's other works are also worth your time. The first three chapters of Norvig's book Paradigms of Artificial Intelligence Programming are an excellent introduction to Common Lisp.

The other principal inspiration for the current essay is Paul Graham's essay The Roots of Lisp (postscript file), where he explains John McCarthy's early ideas about Lisp. My essay may be viewed as an attempt to remix the ideas in Norvig's and Graham's essays, in an attempt to better understand Alan Kay's remark about Lisp-as-Maxwell's-Equations. I also recommend Graham's book "On Lisp", which contains an excellent discussion of Lisp macros and many other subjects. The book seems to be out of print, but thanks to Graham and the publisher Prentice Hall the text of the entire book is freely available online. Note that I am still working through "On Lisp", I have not yet read it to completion. The same is true of the books I mention below, by Seibel, and by Abelson and Sussmann.

Although "On Lisp" is a marvellous book, it's not written for people new to Lisp. To gain familiarity, I suggest working through the first three chapters of Norvig's book, mentioned above. If that's not available, then you should take a look at Peter Seibel's book Practical Common Lisp. It's freely available at the link, and gives an easily readable introduction to Common Lisp.

Finally, I must recommend the wonderful book by Abelson and Sussman on the Structure and Interpretation of Computer Programs. Among other things, it's an introduction to the Scheme dialect of Lisp, but it's about much more than that; it's about how to think about programming. It's a famous book, but for a long time I avoided looking at it, because I'd somehow picked up the impression that it was a little dry. I started reading, and found this impression was completely false: I was utterly gripped. Much of "On Lisp" and "Paradigms of Artificial Intelligence Programming" also have this quality. Abelson and Sussmann's book is freely available online.

Interested in more? Please subscribe to this blog, or follow me on Twitter. You may also enjoy reading my new book about open science, Reinventing Discovery.

Mar 5 12

How changing the structure of the web changes PageRank

by Michael Nielsen

Suppose I add a hyperlink from a webpage w to a webpage w'. In principle, adding that single hyperlink changes the PageRank not just of those two pages, but potentially of nearly every other page on the web. For instance, if the CNN homepage http://cnn.com adds a link to the homepage of my site, http://michaelnielsen.org, then not only will that increase the PageRank of my homepage, it will also increase (though in a smaller way) the PageRank of pages my homepage links to, and then the pages they link to, and so on, a cascading network of ever-smaller changes to PageRank.

In this post I investigate how PageRank changes when the structure of the web changes. We’ll look at the impact of adding and removing multiple links from the web, and also at what happens when entirely new webpages are created. In each case we’ll derive quantitative bounds on the way in which PageRank can change.

I’ve no doubt that most or all of the results I describe here are already known – PageRank has been studied in great detail in the academic literature, and presumably the bounds I describe (or better) have been obtained by others. This post is for my own pleasure (and to improve my own understanding) in attacking these questions from first principles; it is, essentially, a cleaned up version of my notes on the problem, with the hope that the notes may also be of interest to others. For an entree into the academic literature on these questions, see here.

The post requires a little undergraduate-level linear algebra to follow, as well as some basic familiarity with how PageRank is defined and calculated. I’ve written an introduction to PageRank here (see also here, here, here and here). However, I’ve included a short summary below that should be enough to follow this post.

One final note before we get into the analysis: I understand that there are some people reading my blog who are interested in search engine optimization (SEO). If you’re from that community, then I should warn you in advance that this post doesn’t address questions like how to link in such a way as to boost the PageRank of a particular page. Rather, the results I obtain are general bounds on global changes in PageRank as a result of local changes to the structure of the web. Understand such global changes is of great interest if you’re trying to understand how a search engine using PageRank might work (which is my concern), but probably of less interest if your goal is to boost the PageRank of individual pages (SEO’s concern).

PageRank summary

PageRank is defined by imagining a crazy websurfer, who surfs the web by randomly following links on any given page. They interrupt this random websurfing occasionally to “teleport” to a page chosen completely at random from the web. The probability of teleporting t in any given step is assumed to be around t = 0.15, in line with the original PageRank paper. The PageRank of a given webpage w is defined to be the long-run probability p_w of the surfer being on the page. A high PageRank means the page is important; a low PageRank means the page is not important.

One minor wrinkle in this definition of PageRank is what our websurfer should do when they arrive at a dangling webpage, i.e., a webpage with no outgoing links. Obviously, they can’t just select a link to follow at random – there’s nothing to select! So, instead, in this situation what they do is always teleport to a randomly chosen webpage. Another, equivalent way of thinking about this is to imagine that we add outgoing links from the dangling page to every single other page. In this imagined world, our crazy websurfer simply chooses a webpage at random, regardless of whether they are teleporting or not.

With the above definition of the PageRank of a page in mind, the PageRank vector is defined to be the vector p whose components are the PageRanks of all the different webpages. If we number the pages 0, 1, 2, \ldots, N-1 then the component p_0 is the PageRank of page number 0, p_1 is the PageRank of page number 1, and so on. The PageRank vector satisfies the PageRank equation, p = Mp, where M is an N \times N matrix which we’ll call the PageRank matrix. The element M_{jk} of M is just the probability that a crazy websurfer on page j will go to page k. This probability is t/N if there is no link between pages j and k, i.e., it’s just the teleporation probability. The probability is t/N + (1-t)/l if there is a link between page j and k, where l is the number of outbound links from page j. Note that in writing these formulae I’ve assumed that j is not a dangling page. If it is, then we will use the convention that l = N, i.e., the page is linked to every other page, and so M_{jk} = 1/N.

In this post I’ll consider several scenarios where we imagine altering the structure of the web in some way – by adding a link, deleting a link, adding a page, and so on. I’ll use p to denote the PageRank vector before the alteration, and q to denote the PageRank vector after the alteration. What we’re interested in is understanding the change from p to q. The main quantity we’ll study is the total change \sum_j |q_j-p_j| in PageRank across all webpages. This quantity is derived from the l_1 norm for vectors, \|v\|_1 \equiv \sum_j |v_j|. We’ll drop the subscript 1 in the norm notation \|\cdot\|_1, since we’ll only be using the l_1 norm, never the Euclidean norm, which is more conventionally denoted \|\cdot\|.

A key fact about the l_1 norm and the PageRank matrix is that when we apply the PageRank matrix to any two probability vectors, r and s, they are guaranteed to get closer by a factor of (1-t):

 \|M(r-s)\| \leq (1-t) \|r-s\|.

Intuitively, applying the PageRank matrix is like our crazy websurfer doing a single step. And so if we think of r and s as possible starting distributions for the crazy websurfer, then this inequality shows that these distributions gradually get closer and closer, before finally converging to PageRank. We’ll call this inequality the contractivity property for PageRank. I discuss the contractivity property and its consequences in much more detail in this post.

What happens when we add a single link to the web?

Suppose we add a single new link to the web, from a page w to a page w'. How does the PageRank vector change? In this section we’ll obtain a bound on the total change, \|q-p\|. We’ll obtain this bound under the assumption that w isn’t a dangling page, i.e., a page with no outgoing links. As we’ll see, dangling pages create some complications, which we’ll deal with in a later section.

To obtain the bound on \|p-q\|, let M be the PageRank matrix before the new link is inserted, so the PageRank equation is p = Mp, and let M' be the PageRank matrix after the new link is inserted, when the PageRank equation becomes q = M'q. Note that we have q-p = M'q -Mp. If we introduce a new matrix \Delta \equiv M'-M, then this equation may be rewritten:

 q-p = M'q-(M'-\Delta)p = M'(q-p) + \Delta p.

Taking norms on both sides, and using the triangle inequality:

   \|q-p\| \leq \|M'(q-p)\| + \| \Delta p\|.

Using the contractivity property, \|M'(q-p)\| \leq (1-t) \|q-p\|, and so:

   \|q-p\| \leq (1-t) \|q-p\| + \| \Delta p\|.

Rearranging, we obtain:

   [*] \,\,\,\, \|q-p\| \leq \frac{\| \Delta p \|}{t}.

Up to this point we haven’t used in any way the fact that we’re adding merely a single link to the web. The inequality [*] holds no matter how we change the structure of the web, and so in later sections we’ll use [*] (or an analagous inequality) to analyse more complex situations.

To analyse what happens in the specific case when we add a single new link, we will compute \| \Delta p\|. Recall that \Delta = M'-M is the change between the two PageRank matrices. Suppose that w starts with l outgoing hyperlinks (where l > 0, reflecting the fact that w is not a dangling page). By numbering pages appropriately, we can assume that w is page number 0, that the new link is from page 0 to page l+1, and that before the link was inserted, the page 0 was linked to pages 1,2,\ldots,l. Under this numbering of pages, the only column of M which changes in M' is the first column, corresponding to the new link from page 0 to page l+1. Before the link was added, this column had entries t/N + (1-t)/l for the rows corresponding to pages 1 through l, i.e., for the outgoing links, and t/N everywhere else. After the link is added, this column has entries t/N+(1-t)/(l+1) for the rows corresponding to pages 1 through l+1, and t/N everywhere else. Combining these facts and doing a little algebra we find that the entries in the first column of \Delta are:

   (1-t) \left[ \begin{array}{c}       0 \\       \frac{-1}{l(l+1)} \\       \vdots \\        \frac{-1}{l(l+1)} \\       \frac{1}{l+1} \\       0 \\       \vdots \\       0       \end{array} \right].

The other columns of \Delta are all zero, because M and M' don’t change in those other columns. Substituting for \Delta in [*] and simplifying, we see that [*] becomes:

   [**] \,\,\,\, \|q-p\| \leq \frac{(1-t)}{t} \frac{2p_0}{l}

This is a strong result. The inequality [**] tells us that the total PageRank vector changes at most in proportion to the PageRank p_0 of the page to which the outbound link is being added. Ordinarily, p_0 will be tiny – it’s a probability in an N-element probability distribution, and typically such probabilities are of size 1/N. As a result, we learn that the total variation in PageRank will be miniscule in the typical case. [**] also tells us that the total variation in PageRank scales inversely with the number of outbound links. So adding an extra outbound link to a page which already has a large number of links will have little effect on the overall PageRank.

Incidentally, going carefully over the derivation of [**] we can see why we needed to assume that the page w was not a dangling page. We get a hint that something must be wrong from the final result: the right-hand side of [**] diverges for a dangling page, since l = 0. In fact, during the derivation of [**] we assumed that M‘s first column had entries t/N + (1-t)/l for the rows corresponding to pages 1 through l, and t/N everywhere else. In fact, M‘s first column has entries 1/N everywhere in the case of a dangling page. We could fix this problem right now by redoing the analysis for the l=0 case, but I’ll skip it, because we’ll get the result for free as part of a later analysis we need to do anyway.

One final note: in the derivation of [**] I assumed that the existing links from page 0 are to pages 1,\ldots,l, and that the new link is to page l+1. This presumes that all the links are to pages other than the original page, i.e., it’s not self-linking. Of course, in practice many pages are self-linking, and there’s no reason that couldn’t be the case here. However, it’s easy to redo the analysis for the case when the page is self-linking, and if we do so it turns out that we arrive at the same result, [**].

Problems

  • Verify the assertion in the last paragraph.

  • The inequality [**] bounds the change \|q-p\| in terms of the initial PageRank of the linking page, p_0. A very similar derivation can be used to bound it in terms of the final PageRank, q_0. Prove that \|p-q\| \leq \frac{(1-t)}{t} \frac{2q_0}{l}.

Problems for the author

  • Is it possible to saturate the bound [**]? My guess is yes (or close to, maybe within a constant factor), but I haven’t explicitly proved this. What is the best possible bound of this form?

We’ve obtained a bound on the variation in l_1 norm produced by adding a link to the web. We might also wonder whether we can say anything about the change in the PageRank of individual pages, i.e., whether we can bound |q_j-p_j|? This is the type of question that is likely to be of interest to people who run webpages, or to people interested in search engine optimization. Note that it’s really three separate questions: we’d like bounds on |q_j-p_j| when (1) j is the source page for the new link; (2) j is the target page for the new link; and (3) j is neither the source nor the target. Unfortunately, I don’t have any such bounds to report, beyond the obvious observation that in all three cases, |q_j-p_j| \leq \|q-p\|, and so at least the bound on the right-hand side of [**] always applies. But it’d obviously be nice to have a stronger result!

Problems for the author

  • Can I find bounds on |q_j-p_j| for the three scenarios described above? (Or perhaps for some different way of specifying the properties of j?, e.g., when j is another page linked to by the source page.)

What happens when we add and remove multiple links from the same page?

Suppose now that instead of adding a single link from a page, we remove m existing links outbound from that page, and add n new links outbound from the page, for a total of l' = l-m+n links. How does the PageRank change? Exactly as in our earlier analysis, we can show:

   \|q-p\| \leq \frac{\| \Delta p \|}{t},

where \Delta is the change in the PageRank matrix caused by the modified link structure. Assuming as before that the page is not a dangling page, we number the pages so that: (1) the links are outbound from page 0; (2) the links to pages 1,\ldots, l-m are preserved, before and after; (3) the links to pages l-m+1,\ldots,l are removed; and (4) the links to pages l+1,\ldots,l+n are new links. Then all the columns of \Delta are zero, except the first column, which has entries:

   (1-t) \left[ \begin{array}{c}       0 \\       \frac{l-l'}{ll'} \\       \vdots \\       \frac{l-l'}{ll'} \\       \frac{-1}{l} \\       \vdots \\       \frac{-1}{l} \\       \frac{1}{l'} \\       \vdots \\       \frac{1}{l'} \\       0 \\       \vdots \\       0       \end{array} \right]

As a result, we have

   \|q-p\| \leq \frac{(1-t)p_0}{t} \left[ \frac{|l-l'|(l-m)}{ll'}+ \frac{m}{l}     + \frac{n}{l'} \right].

To simplify the quantity on the right we analyse two cases separately: what happens when l' \geq l, and what happens when l' <l. In the case l' \geq l we get after some algebra

   \|q-p\| \leq \frac{(1-t)}{t} \frac{2p_0n}{l'},

which generalizes our earlier inequality [**]. In the case l' < l we get

   \|q-p\| \leq \frac{(1-t)}{t} \frac{2 p_0m}{l}.

Observing that l' \geq l is equivalent to n \geq m, we may combine these two inequalities into a single unified inequality, generalizing our earlier result [**]:

   [***] \,\,\,\, \|q-p\| \leq \frac{(1-t)}{t} \frac{2p_0 \max(m,n)}{\max(l,l')}.

What about adding links to a dangling page?

Let’s come back to the question we ducked earlier, namely, how PageRank changes when we add a single link to a dangling page. Recall that in this case the crazy websurfer model of PageRank is modified so it’s as though there are really N outgoing links from the dangling page. Because of this, the addition of a single link to a dangling page is equivalent to the deletion of m = N-1 links from a page that started with l = N outgoing links. From the inequality [***], we see that:

   \|q-p\| \leq \frac{(1-t)}{t} \frac{2p_0 (N-1)}{N}.

To generalize further, suppose instead that we’d added n links to a dangling page. Then the same method of analysis shows:

   \|q-p\| \leq \frac{(1-t)}{t} \frac{2p_0 (N-n)}{N}.

In general, we can always deal with the addition of n links to a dangling page as being equivalent to the deletion of m = N-n links from a page with N outgoing links. Because of this, in the remainder of this post we will use the convention of always treating dangling pages as though they have l=N outgoing links.

What happens when we add and remove links from different pages?

In the last section we assumed that the links were all being added or removed from the same page. In this section we generalize these results to the case where the links aren’t all from the same page. To do this, we’ll start by considering the case where just two pages are involved, which we label 0 and 1. We suppose m_j outbound links are added to page j (j = 0 or 1), and n_j outbound links are removed. Observe that M' = M + \Delta_0+\Delta_1, where \Delta_0 and \Delta_1 are the changes in the PageRank matrix due to the modifications to page 0 and page 1, respectively. Then a similar argument to before shows that:

   \|q-p\| \leq \frac{\|(\Delta_0+\Delta_1)p\|}{t}.

Applying the triangle inequality we get:

   \|q-p\| \leq \frac{\|\Delta_0p\|}{t} + \frac{\|\Delta_1p\|}{t}.

And so we can reuse our earlier analysis to conclude that:

   \|q-p\| \leq \frac{2(1-t)}{t} \left[ \frac{\max(m_0,n_0)p_0}{\max(l_0,l_0')}     + \frac{\max(m_1,n_1)p_1}{\max(l_1,l_1')} \right],

where l_j and l_j' are just the number of links outbound from page j, before and after the modifications to the link structure, respectively. This expression is easily generalized to the case where we are modifying more than two pages, giving:

   \|q-p\| \leq \frac{2(1-t)}{t} \sum_j \frac{\max(m_j,n_j)p_j}{\max(l_j,l_j')}.

This inequality is a quite general bound which can be applied even when very extensive modifications are made to the structure of the web. Note that, as discussed in the last section, if any of the pages are dangling pages then we treat the addition of n_j links to that page as really being deleting m = N-n_j links from a page with l_j = N outgoing links, and so the corresponding term in the sum is (N-n_j)p_j/N. Indeed, we can rewrite the last inequality, splitting up the right-hand side into a sum \sum_{j:{\rm nd}} over pages j which are not dangling, and a sum \sum_{j:{\rm d}} over dangling pages j to obtain:

   \|q-p\| \leq \frac{2(1-t)}{t} \left( \sum_{j:{\rm nd}} \frac{\max(m_j,n_j)p_j}{\max(l_j,l_j')}     + \sum_{j:{\rm d}} \frac{(N-n_j)p_j}{N} \right).

Modified teleportation step

The original PageRank paper describes a way of modifying PageRank to use a teleportation step which doesn’t take us to a page chosen uniformly at random. Instead, some non-uniform probability distribution is used for teleportation. I don’t know for sure whether Google uses this idea, but I wouldn’t be suprised if they use it as a way of decreasing the PageRank of suspected spam and link-farm pages, by making them less likely to be the target of teleportation. It’s possible to redo all the analyses to date, and to prove that the bounds we’ve obtained hold even with this modified teleportation step.

Problems

  • Prove the assertion in the last paragraph.

What happens when we add a new page to the web?

So far, our analysis has involved only links added between existing pages. How about when a new page is added to the web? How does the PageRank change then? To analyse this question we’ll begin by focusing on understanding how PageRank changes when a single page is created, with no incoming or outgoing links. Obviously, this is rather unrealistic, but by understanding this simple case first we’ll be in better position to understand more realistic cases.

It’s perhaps worth commenting on why we’d expect the PageRank to change at all when we create a new page with no incoming or outgoing links. After all, it seems as though the random walk taken by our crazy websurfer won’t be changed by the addition of this new and completely isolated page. In fact, things do change. The reason is because the teleportation step is modified. Instead of going to a page chosen uniformly at random from N pages, each with probability 1/N, teleportation now takes us to a page chosen uniformly at random from the full set of N+1 pages, each with probability 1/(N+1). It is this change which causes a change in the PageRank.

A slight quirk in analysing the change in the PageRank vector is that while the initial PageRank vector p is an N-dimensional vector, the new PageRank vector q is an N+1-dimensional vector. Because of the different dimensionalities, the quantity \|q-p\| is not even defined! We resolve this problem by extending p into the N+1-dimensional space, adding a PageRank of 0 for the new page, page number N. We’ll denote this extended version of p by p_e. We also need to extend M so that it becomes an N+1 by N+1 matrix M_e satisfying the PageRank equation M_ep_e = p_e for the extended p. One way of doing this is to extend M by adding a row of zeros at the bottom, and 1/N everywhere in the final column, except the final entry:

   M_e =   \left[ \begin{array}{cccc}          &        &   & \frac{1}{N} \\         & M      &   & \vdots \\         &        &   & \frac{1}{N} \\       0 & \ldots & 0 & 0       \end{array} \right]

With the extended p and M it is easy to verify that the PageRank equation M_ep_e = p_e is satisfied. As in our earlier discussions, we have the inequality

   \|q-p_e\| \leq \frac{\|\Delta p_e \|}{t},

where now \Delta = M'-M_e. Computing \Delta, we have:

   \Delta = \left[ \begin{array}{cccc}       \frac{t}{N+1}-\frac{t}{N} & \ldots & \frac{t}{N+1}-\frac{t}{N} & \frac{1}{N+1}-\frac{1}{N} \\       \vdots                    & \ddots & \vdots                    & \vdots \\       \frac{t}{N+1}-\frac{t}{N} & \ldots & \frac{t}{N+1}-\frac{t}{N} & \frac{1}{N+1}-\frac{1}{N} \\       \frac{t}{N+1}             & \ldots & \frac{t}{N+1}             & \frac{1}{N+1}     \end{array} \right].

Substituting and simplifying, we obtain the bound:

   [****] \,\,\,\, \|q-p_e\| \leq \frac{2}{N+1}.

Not surprisingly, this bound shows that creating a new and completely isolated page makes only a very small difference to PageRank. Still, it’s good to have this intuition confirmed and precisely quantified.

Exercises

  • Suppose we turn the question asked in this section around. Suppose we have a web of N pages, and one of those pages is an isolated page which is deleted. Can you prove an analogous bound to [****] in this situation?

Problems for the author

  • How does the bound [****] change if we use a non-uniform probability distribution to teleport? The next problem may assist in addressing this question.

  • An alternative approach to analysing the problem in this section is to use the formula p = t (I-(1-t)G)^{-1} e, where e is the vector whose entries are the uniform probability distribution (or, more generally, the vector of teleportation probabilities), and G is the matrix whose jk entry is 1/l_j if j links to k, and 0 otherwise. This formula is well-suited to analysing the problem considered in the current section; without going into details, the essential reason is that the modified matrix I-(1-t)G' has a natural block-triangular structure, and such block-triangular matrices are easy to invert. What is the outcome of doing such an analysis?

Adding multiple pages to the web

Suppose now that instead of adding a single page to the web we add multiple pages to the web. How does the PageRank change? Again, we’ll analyse the case where no new links are created, on the understanding that by analysing this simple case first we’ll be in better position to understand more realistic cases.

We’ll consider first the case of just two new pages. Suppose p is the original N-dimensional PageRank vector. Then we’ll use p_e as before to denote the extension of p by an extra 0 entry, and p_{ee} to denote the extension of p by two 0 entries. We’ll use q to denote the N+1-dimensional PageRank vector after adding a single webpage, and use q_e to denote the extension of q by appending an extra 0 entry. Finally, we’ll use r to denote the N+2-dimensional PageRank vector after adding two new webpages.

Our interest is in analysing \|r-p_{ee}\|. To do this, note that

 \|r-p_{ee}\| = \|r-q_{e}+q_e-p_{ee}\|.

Applying the triangle inequality gives

   \|r-p_{ee} \| \leq \|r-q_{e}\| + \|q_e-p_{ee}\|.

Observe that \|q_e-p_{ee}\| = \|q-p_e\|, since the extra 0 appended to the end of both vectors makes no difference to the l_1 norm. And so the previous inequality may be rewritten

   \|r-p_{ee} \| \leq \|r-q_{e}\| + \|q-p_e\|.

Now we can apply the results of last section twice on the right-hand side to obtain

   \|r-p_{ee} \| \leq \frac{2}{N+2} + \frac{2}{N+1}.

More generally, suppose we add \Delta N new pages to the web. Denote the final N+\Delta N-dimensional PageRank vector by q. Let p denote the initial N-dimensional PageRank vector, and let p_e denote the N+\Delta N-dimensional extension obtained by appending \Delta N 0 entries to p. Then applying an argument similar to that above, we obtain:

   \|q-p_e \| \leq \sum_{j=1}^{\Delta N} \frac{2}{N+j}.

Adding multiple pages, adding and removing multiple links

Suppose now that we add many new pages to the web, containing links to existing webpages. How does the PageRank change? Not surprisingly, answering this kind of question can be done in a straightforward way using the techniques described already in this post. To see this, suppose we add a single new webpage, which contains n new links back to existing webpages. As before, we use q to denote the new PageRank vector, p to denote the old PageRank vector, and p_e to denote the extension of p obtained by appending a 0 entry. We have

   \|q-p_e\| = \|M'q-M_e p_e\|,

where M_e is the (N+1) \times (N+1) matrix obtained by adding a row of 0s to the bottom of M, and a column (except the bottom right entry) of 1/N values to the right of M. We can write M' = M_e+\Delta_1+\Delta_2, where \Delta_1 is the change in M_e due to the added page, and \Delta_2 is the change in M_e due to the added links. Applying a similar analysis to before we obtain

   \|q-p_e\| \leq \frac{\|\Delta_1 p_e\| + \| \Delta_2 p_e \|}{t}.

We can bound the first term on the right-hand side by 2/(N+1), by our earlier argument. The second term vanishes since \Delta_2 is zero everywhere except in the last column, and p_e‘s final entry is zero. As a result, we have

   \|q-p_e\| \leq \frac{2}{N+1},

i.e., the bound is exactly as if we had added a new page with no links.

Exercises

  • There is some ambiguity in the specification of \Delta_1 and \Delta_2 in the above analysis. Resolve this ambiguity by writing out explicit forms for \Delta_1 and \Delta_2, and then verifying that the remainder of the proof of the bound goes through.

Suppose, instead, that the new link had been added from an existing page. Then the term \| \Delta_2 p_e\| above would no longer vanish, and the bound would become

   \|q-p_e\| \leq \frac{2}{N+1} + \frac{2(1-t)}{t} \frac{p}{(l+1)},

where p was the initial PageRank for the page to which the link was added, and l was the initial number of links from that page.

Generalizing this analysis, it’s possible to write a single inequality that unites all our results to date. In particular, suppose we add \Delta N new pages to the web. Denote the final N+\Delta N-dimensional PageRank vector by q. Let p denote the initial N-dimensional PageRank vector, and let p_e denote the N+\Delta N-dimensional extension obtained by appending \Delta N 0 entries to p. Suppose we add m_j outbound links to page j, and remove n_j outbound links from page j. Then an analysis similar to that above shows that:

   \|q-p_e \| \leq \sum_{j=1}^{\Delta N} \frac{2}{N+j}+ \frac{2(1-t)}{t} \left( \sum_{j:{\rm nd}} \frac{\max(m_j,n_j)p_j}{\max(l_j,l_j')}     + \sum_{j:{\rm d}} \frac{(N-n_j)p_j}{N} \right),

where, as before, the sum \sum_{j:{\rm nd}} is over those pages j which are non-dangling, and the sum \sum_{j:{\rm d}} is over dangling pages j. Note that we can omit all the new pages j from this latter sum, for the same reason the quantity \| \Delta_2 p_e \| vanished earlier in this section. This inequality generalizes all our earlier results.

Problems

  • Fill in the details of the above proof. To do so it helps to consider the change in PageRank in two steps: (1) the change due to the addition of new links to existing pages; and (2) the change due to the addition of new webpages, and links on those webpages.

Recomputing PageRank on the fly

Part of my reason for being interested in the questions discussed in this post is that I’m interested in how to quickly update PageRank during a web crawl. The most obvious way to compute the PageRank of a set of crawled webpages is as a batch job, doing the computation maybe once a week or once a month. But if you’re operating a web crawler, and constantly adding new pages to an index, you don’t want to have to wait a week or a month before computing the PageRank (or similar measure) of a newly crawled page. You’d like to compute – or at least estimate – the PageRank immediately upon adding the page to the index, without having to do an enormous matrix computation. Is there a way this can be done?

Unfortunately, I don’t know how to quickly update the PageRank. However, the bounds in this post do help at least a little in figuring out how to update the PageRank as a batch job. Suppose we had an extant multi-billion page crawl, and knew the corresponding values for PageRank. Suppose then that we did a mini-crawl to update the index, perhaps with a few million new pages and links. Suppose p was the (known value of) the PageRank vector before the mini-crawl, and q is the PageRank after the mini-crawl, which we are now trying to compute. Suppose also that M' is the new PageRank matrix. One natural way of updating our estimate for PageRank would be to apply M' repeatedly to p (more precisely, its extension, p_e), in the hopes that it will quickly converge to q. The bounds in this post can help establish how quickly this convergence will happen. To see this, observe that

   \|q-(M')^np_e\| = \| (M')^n(q-p_e) \| \leq (1-t)^n \|q-p_e\|,

where we used the PageRank equation M'q = q to get the first equality, and the contractivity of PageRank to get the second inequality. This equation tells us that we can compute an estimate for q by applying M' repeatedly to p_e, and it will converge exponentially quickly. This is all standard stuff about PageRank. However, the good news is that the results in this post tell us something about how to bound \|q-p_e\|, and frequently this quantity will be very small to start with, and so the convergence will occur much more quickly than we would a priori expect.

Problems for the author

  • Is there a better way of updating PageRank on the fly, rather than applying powers of M'?

Interested in more? Please follow me on Twitter. You may also enjoy reading my new book about open science, Reinventing Discovery.

Feb 8 12

Guest post: Judea Pearl on correlation, causation and the psychology of Simpson’s paradox

by Michael Nielsen

In a recent post I discussed Judea Pearl’s work on causal inference, and in particular on the causal calculus. The post contained a number of “problems for the author” (i.e., me, Michael Nielsen). Judea Pearl has been kind enough to reply with some enlightening comments on four of those problems, as well as contributing a stimulating mini-essay on the psychology of Simpson’s paradox. The following post contains both the text of my questions, as well as Judea’s responses, which he’s agreed to make public (thanks also to Kaoru Mulvihill for helping make this possible!). The post covers five topics, which can be read independently: some remarks on the intuition behind causal probabilities, the intuition behind d-separation, the independence of the three rules of the causal calculus, practical considerations in applying the causal calculus, and the psychology of Simpson’s paradox. Over to Judea.

Dear Michael,

Thanks for giving me the opportunity to respond to your thoughtful post on correlation, causation and the causal calculus. I will first answer four of your questions, and then comment on the Simpson’s paradox and its role in understanding causation.

Question 1: On the intuition behind causal probabilities

Question (MN): We used causal models in our definition of causal conditional probabilities. But our informal definition — imagine a hypothetical world in which it’s possible to force a variable to take a particular value didn’t obviously require the use of a causal model. Indeed, in a real-world randomized controlled experiment it may be that there is no underlying causal model. This leads me to wonder if there is some other way of formalizing the informal definition we have given?

Answer (JP)

Yes, there is.

But we need to distinguish between “there is no underlying causal model” and “we have no idea what the underlying causal model is.” The first notion leads to chaotic dead-end, because it may mean: “everything can happen,” or “every time I conduct an experiment the world may change,” or “God may play tricks on us” etc. Even the simple idea that randomized experiments tell us something about policy making requires the assumption that there is some order in the universe, that the randomized coin depends on some things and not others, and that the coin affects some things and not others (e.g.,the outcome). So, let us not deal with this total chaos theory.

Instead, let us consider the “no idea” interpretation. It is not hard to encode this state of ignorance in the language of graphs; we simply draw a big cloud of hidden variables, imagine that each of them can influence and be influenced by all the others and we start from there. The calculus that you described nicely in the section “Causal conditional probabilities” now tells us precisely what conclusions we can still draw in this state of ignorance and what we can no longer draw. For example, every conclusion that relies on the identity and values of pa(X_j)p (e.g., P(X_i|do(x_j))) will no longer be computable, because most of X_j‘s parents will be unmeasured.

Remarkably, there are some conclusions that are still inferrable even with such ignorance, and you mentioned the important ones — conclusions we draw from randomized controlled experiments. Why is that so? Because there are certain things we can assume about the experiment even when we do not have any idea about the causal relationships among variables that are not under our control. For example, we can safely assume that the outcome of the coin is independent on all factors that affect the outcome Y, hence those factors will be the same (in probability) for treatment and control group, and will remain invariant after the experiment is over. We also know that we can overrule whatever forces compelled subjects to smoke or not smoke before the experiment and, likewise, we can assume that the coin does not affect the outcome (cancer) directly but only through the treatment (smoking). All these assumptions can be represented in the graph by replacing the former parents of X by a new parent, the coin, while keeping everything else the same. From this we can derive the basic theorem of experimental studies

 P(y|do(x)) = P(y|x)

which connects policy predictions P(y|do(x)) to experimental finding P(y|x).

There is another way of representing black-box ignorance of the kind faced by a researcher who conducts randomized experiment with zero knowledge about what affects what. This approach, called “potential outcome” was pioneered by Neyman in the 1930′s and further developed by Don Rubin since 1974. The idea is to define a latent variable Y_x which stands for the value that Y would attain if we were to manipulate X to value x. The target of analysis is to estimate the expected value of the difference Y_1 - Y_0 from experimental data, where Y_x is only partially observed. Y_1 is known for the treatment group and Y_0 for the control group. The analysis within this framework proceeds by assuming that, in a randomized trial, Y_x is independent on X, an assumption called “ignorability.” The justification of this assumption demands some stretching of imagination — because we really do not know much about the hypothetical Y_x, so how can we judge whether it is dependent or independent of other variables, like X. The justification becomes compelling if we go back to the graph and ask ourselves: what could account for the statistical variations of Y_x (i.e., the variations from individual to individual). The answer is that Y_x is none other but all factors that affect Y when X is held constant at X=x. With this understanding, it makes sense to assume that Y_x is independent of X, since X is determined by the coin and the coin is presumed independent of all factors that affect Y.

Here, again, the graph explicates an independence that the formal counterfactual approach takes for granted and thus provides the justification for inferences under CRT.

In a typical application however, we may have only partial knowledge of the model’s structure and, while the black-box approach fails to utilize what we do know, the graphical approach displays what we know and can be used to determine
whether we know enough to infer what we need. More specifically, while the two approaches are logically equivalent, the potential outcome approach requires that we express what we know in terms of independencies among counterfactual variables, a cognitively formidable task, while the graphical approach permit us to express knowledge in the form of missing arrows. The latter is transparent, the former opaque.

Question 2: the intuition behind d-separation

Question (MN): The concept of d-separation plays a central role in the causal calculus. My sense is that it should be possible to find a cleaner and more intuitive definition that substantially simplifies many proofs. It’d be good to spend some time trying to find such a definition.

Answer (JP)

This was done, indeed, by Lauritzen et al. (1990) and is described on pages 426-427 of the Epilogue of my book. It involves “marrying” the parents of any node which is an ancestor of a variable on which we condition. This method may indeed be more convenient for theorem proving in certain circumstances. However, since the conditioning set varies from problem to problem, the graph too would vary from problem to problem and I find it more comfortable to work with d-separation. A gentle, no-tear introduction to d-separation is given on page 335 of Causality.

Question 3: the independence of the rules of the causal calculus

Question (MN): Suppose the conditions of rules 1 and 2 [of the causal calculus] hold. Can we deduce that the conditions of rule 3 also hold?

Answer (JP)

No, these rules are both “independent” and “complete.” The independence can be seen in any graph in which X does not affect Z. Regardless of other conditions, such a graph allows us to invoke rule 3 and conclude

P(z|do(x)) = P(z).

This makes sense, manipulating a later variable does not affect the earlier variable. Completeness means that we have not forgotten any rule, every correct sentence (given that the graph is correct) is derivable syntactically using our three rules. (This was proven only recently, in 2006)

Update (MN) In the comments, below, Judea notes that this needs amendment: Ilya Shpitser noted that my answer to Question 3 needs correction. I said “No, these rules are both ‘independent’ and ‘complete.’” In fact, Huang and Valtorta showed that rule 1 is implied by rules 2 and 3, their proof can be found here: (Lemma 4, link).

Question 4: practical issues in applying the causal calculus

Question (MN): In real-world experiments there are many practical issues that must be addressed to design a realizable randomized, controlled experiment. These issues include selection bias, blinding, and many others. There is an entire field of experimental design devoted to addressing such issues. By comparison, my description of causal inference ignores many of these practical issues. Can we integrate the best thinking on experimental design with ideas such as causal conditional probabilities and the causal calculus?

Answer (JP)

Of course.

All the issues you described have nice graphical representations in causal graphs and lend themselves to precise formulation, and meaningful solution. Take selection bias, for instance, my student Elias Bareinboim just came up with a definitive criterion that tells us which graphs allow for the removal of selection bias and which do not.
The same can be said about measurement errors, effect decomposition (mediation) and other frills of experimental design. One would be hard pressed today to find a journal of epidemiology without causal diagrams. The last breakthrough I can report is a solution to the century-old problem of “transportability” or “external validity,” that is, under what conditions it is possible to generalize experimental results from one population to another, which may differ in several aspects from the first.

This is truly an incredible breakthrough (see http://ftp.cs.ucla.edu/pub/stat_ser/r372.pdf) and the only reason we have not heard the splash in journals like Science or Nature is that the people who can benefit from it are still living in the Arrow-phobic age, when graphs were considered “too seductive” to have scientific value. Hopefully, some of your readers will help change that.

The Psychology of Simpson’s Paradox

A full understanding of Simpson’s paradox should explain why an innocent arithmetic reversal of an association, not uncommon, came to be regarded as “paradoxical,” and why it has captured the fascination of statisticians, mathematicians and philosophers for over a century (it was first labelled “paradox” by Blyth (1972)). The arithmetics of proportions has its share of peculiarities, no doubt, but these tend to become objects of curiosity once demonstrated and explained away by vivid examples. For example, naive students of probability expect the average of a product to equal the product of the averages but quickly learn to guard against such expectations, given a few counterexamples. Likewise, students expect an association measured in a mixture distribution to equal a weighted average of the source associations. They are surprised, therefore, when ratios of sums, \frac{(a+b)}{(c+d)}, are found to be ordered differently than individual ratios, \frac{a}{c} and \frac{b}{d}. Again, such arithmetic peculiarities are quickly accommodated by seasoned students as reminders against simplistic reasoning.

The sign reversal we see in Simpson’s paradox is of different nature because, even after seeing the numbers and checking the arithmetics people are left with a sense of “impossibility.” In other words, the ramifications of reversal seem to clash so intensely with deeply held convictions that people are willing to proclaim the reversal impossible. This is seen particularly vivid in the Kidney Stone treatment example that you mentioned, because it challenges our deeply held beliefs about rational decision making. In this example, it is clear that if one is diagnosed with “Small Stones” or “Large Stones” Treatment A would be preferred to Treatment B But what if a patient is not diagnosed, and the size of the stone is not known; would it be appropriate then to consult the aggregated data, according to which Treatment B is preferred?
This would stand contrary to commonsense; a treatment that is preferred both under one condition and under its negation should also be preferred when the condition is unknown.

Or perhaps we should always prefer the partitioned data over the aggregate? This leads us to another dead alley. What prevents one from partitioning the data into arbitrary sub-categories (say based on eye color or post-treatment pain) each advertising a different treatment preference? Which among the many possible ways of partitioning the data should give us the correct decision?

The last problem, fortunately, was given a complete solution using d-separation: A good partition is one that d-separates all paths between X and Y that contain
an arrow into X. All such partitions will lead to the same decision and the same P(y|do(x)). So, here we have an instance where a century-old paradox is reduced to graph theory and resolved by algorithmic routine — a very gratifying experience for mathematicians and computer scientists.

But I want to include psychologists and philosophers of mind in this excitement by asking if they can identify the origin of the principle we invoked earlier: “a treatment that is preferred both under one condition and under its negation should also be preferred when the condition is unknown.” Whence comes this principle? Is it merely
strong intuition or an experimentally verifiable proposition. Can a drug be invented that is beneficial to males and females but harmful to the population as a whole? Whatever its origin, how is this conviction encoded and processed in the mind? The fact that people share this conviction and are willing to bet that a miracle drug with such characteristics is simply impossible tells us that we share a calculus through which the conclusion is derived swiftly and unambiguously. What is this calculus? It is certainly not probability calculus, because we have seen that probability does sanction reversals so there is nothing in probability calculus that would rule out the miracle drug that people deem impossible.

I hate to hold readers in suspense, but I am proposing that the calculus we use in our head is none other but the do-calculus described in your post. Indeed the impossibility of such a miracle drug follows from a theorem in do-calculus (Pearl, 2009, p. 181): which reads:

An action A that increases the probability of an event B in each subpopulation C_i of C must also increase the probability of B in the population as a whole, provided that the action does not change the distribution of the subpopulations.

Thus, regardless of whether effect size is measured by differences or ratios, regardless of whether C is a confounder or not, and regardless of whether we have the correct causal structure on our hand, our intuition should be offended by any effect reversal due to combining subpopulations.

This is fairly general, and it is hard to imagine another calculus ruling out effect reversal with comparable assertiveness. I dare to conjecture therefore that it is do-calculus that drives our intuition about cause and effect relations.

Feb 6 12

Addendum on search and support vector machines

by Michael Nielsen

In the last post I described how to use support vector machines (SVMs) to combine multiple notions of search relevance. After posting I realized I could greatly simplify my discussion of an important subject in the final section of that post. What I can simplify is this: once you’ve found the SVM parameters, how should you use them to rank a long list of documents for a particular query, in practice?

Here’s the setup. We suppose we’ve already taken a bunch of training data, as in the last post, and used that training data to find a SVM with soft margin. That SVM can be used to rank any query and document. In particular, the SVM is specified by a vector \vec w, and we rank a document d above a document d' for query q if:

   [*] \,\,\,\, \vec w \cdot \vec \phi(q,d,d') \geq 0,

where \vec \phi(q,d,d') is the feature difference vector. (In the last post a parameter b also appeared in this condition, but as I noted in one of the exercises in that post, for the particular problem of search b=0, which is why it doesn’t appear here.)

In the final section of the last post I did lots of perambulations with the condition [*] to figure out how to construct a linear order ranking documents, given a particular query q. We can simplify life quite a bit by recalling that \vec \phi(q,d,d') = \vec \psi(q,d)-\vec \psi(q,d'), where \vec \psi(q,d) is the feature vector for query q and document d. The condition above can then be rewritten as:

   [**] \,\,\,\, \vec w \cdot \vec \psi(q,d) \geq \vec w \cdot \vec \psi(q,d')

This condition suggests a much easier way of ranking documents: simply treat \vec w \cdot \vec \psi(q,d) as a score of how good document d is when the query is q, so that we rank d above d' whenever the score for d is better than d'. We then just construct a linear ordering of the documents, based on the score. Or, if we prefer, we can find just the top 10 (or 20, or whatever) documents by iterating linearly over the documents, and keeping a running tally of the best 10 (or 20) found to date.

This is quite a bit simpler than the discussion at the end of my last post. It only works, though, because our classifer (the SVM) is a linear function of the feature difference vectors. It’s not difficult to construct other types of classifer for which it really wouldn’t be obvious how to construct this kind of score. For those classifiers you could still fall back on the analysis I did in the last post, so it wasn’t all wasted effort.

Incidentally, one thing I like about the equation [**] is that it makes it a lot clearer why \vec w is called the weight vector. It means that the score for page d on query q is just a linear combination of the different feature values, with weights given by the weight vector. That makes the notation \vec w and nomenclature “weight vector” much clearer, as well as making it a lot clearer what the vector \vec w means!

Feb 4 12

How to combine multiple notions of relevance in search?

by Michael Nielsen

In earlier posts I’ve described two different ways we can assess how relevant a given webpage is to a search query: (1) the cosine similarity measure; and (2) the PageRank, which is a query-independent measure of the importance of a page. While it’s good that we have multiple insights into what makes a webpage relevant, it also gives rise to a problem: how should we combine these two measures to determine the ranking of a particular webpage for a particular search query?

In fact, in practice the problem gets much more complex than this, because there are more than two useful notions of search relevance. For instance, we’d surely also wish to incorporate a relevance measure that quantifies how likely or unlikely a given page is to be spam. By doing that we could ensure that pages which are likely to be spam receive a much lower ranking. We might also wish to incorporate a measure which ranks a page based on how close together the words in the search query are on that page. Once you start to think about it, we humans combine a very large number of factors when assessing the usefulness of a webpage. This is reflected in the fact that, according to Google, their search engine combines not just two or three measures of relevance but more than 200 measures of relevance. How should we best combine all these multiple measures in order to determine how relevant a page is to a given query?

You might think that the right approach would be to think hard about the meaning of measures like cosine similarity and PageRank, and then on the basis of that understanding, to figure out optimal ways of combining those measures. This approach is certainly worth pursuing, but it suffers from a problem: it doesn’t scale very well. Even if you come up with a good way of combining cosine similarity and PageRank, how would you combine 200 different measures? It’s not so obvious. And if you decide to trial the addition of a 201st measure of relevance, how exactly should you incorporate it into your algorithm, and how should you check to see whether or not it improves search results?

In this post, I’ll describe an approach to combining multiple measures of relevance that doesn’t require us to consider the details of the individual measures. Instead, the procedure I describe lets the machine automatically learn how to combine different measures. It does this with the help of a set of training data, where humans have ranked some set of webpages according to their relevance to some set of training queries. The idea is to figure out the best way of combining the measures of relevance in order to reproduce the results of the training data. Whatever method of combination is found is then applied more broadly, to all queries, and all webpages. The big advantage of this machine learning approach is that it lets us easily combine many different notions of search relevance. But it also has some drawbacks, as we’ll see.

The post is based principally on Chapter 15 of the book about information retrieval by Manning, Raghavan, and Sch\”utze. The book is also available for free on the web.

Originally, I intended this post to be a mix of theory and working code to illustrate how the theory works in practice. This is the style I’ve used for many earlier posts, and is the style I intend to use whenever possible. However, I ran into a problem when I attempted to do that for the current post. The problem was that if I wanted to construct interesting examples (and ask interesting questions about those examples), I needed to add a lot of extra context in order for things to make sense. It would have tripled (or more) the length of an already long post. I ultimately decided the extra overhead wasn’t worth the extra insight. Instead, the post focuses on the theory. However, I have included a few pointers to libraries which make it easy to construct your own working code, if you’re so inclined. At some point I expect I’ll come back to this question, in a context where it makes much more sense to include working code.

As usual, I’ll finish the introduction with the caveat that I’m not an expert on any of this. I’m learning as I go, and there may be mistakes or misunderstandings in the post. Still, I hope the post is useful. At some point, I’ll stop adding this caveat to my posts, but I’m still a long way from being expert enough to do that! Also as per usual, the post will contain some simple exercises for the reader, some slightly harder problems, and also some problems for the author, which are things I’d like to understand better.

General approach

In this section, I’ll work through some simple hypothetical examples, using them to build up a set of heuristics about how to rank webpages. These heuristics will ultimately suggest a complete algorithm for learning how to rank webpages from a set of training data.

One small point about nomenclature: I’m going to switch from talking about “webpages” (or “pages”), and start referring instead to “documents”. In part this is because the document terminology is more standard. But it’s also because the techniques apply more broadly than the web.

To keep things simple and concrete in our hypothetical examples, we’ll assume that for any given query and document we have just two different measures of relevance, let’s say the cosine similarity and the PageRank (or some other similar measure, if we’re working with documents not from the web). We’ll call these features. And so for any given query and document pair (q,d) we have a feature vector:

   \vec \psi(q,d) = \left[ \begin{array}{c} \mbox{PageRank}(q,d) \\        \mbox{cosine similarity}(q,d) \end{array} \right].

Actually, the PageRank of a document doesn’t depend on the query q, but in general we’ll allow the features in the feature vector to depend on the query. Also, in general there will be more than two features, and so the feature vector might have quite a few components. But it turns out that the generalization beyond the two-dimensional case is straightforward, so we’ll stick with two dimensions for now.

Our broad goal can now be restated in the language of feature vectors. What we want is to find an algorithm which combines the different components of the feature vector \vec \psi(q,d) in order to rank the document d for any particular query q. Our task is to find an algorithm which does a pretty good job doing this ranking.

To get some insight into how we should solve this problem, let’s suppose we have an extremely simple set of training data. We’ll suppose a human operator has ranked three documents d_1, d_2, d_3 for just a single query q. Of course, real training data will need to involve many more documents and queries, but we can learn a lot by starting with this. We have three feature vectors for these three cases, \vec \psi(q,d_1), \vec \psi(q,d_2), \vec \psi(q,d_3). I’ve illustrated these in the following diagram – to avoid clutter, rather than write \vec \psi(q,d_j) repeatedly in the diagram, I’ve labelled each vector simply by its document number, d_j, as well as (in parentheses) by its order of relevance as ranked by a human operator:

There are a few reasonable observations:

  • If a feature vector \vec \psi(q,d) is up and to the right of \vec \psi(q,d') then it’s better along both axes (PageRank and cosine similarity). It seems reasonable to conclude that d is a better result than d'.

  • Coversely, if \vec \psi(q,d) is down and to the left of \vec \psi(q,d') then d should always be ranked below d'.
  • The hard cases are when \vec \psi(q,d) is up and to the left of \vec \psi(q,d'), or down and to the right, in which case it’s not entirely clear

Note, by the way, that I don’t want to claim that that these observations are “proveable” in any way: they’re just reasonable observations, at least for the particular features (cosine similarity and PageRank) that we’re using. The idea here is simply to figure out some reasonable heuristics which we will eventually combine to suggest an algorithm for learning how to rank webpages.

With the observations above as motivation we’ll adopt the heuristic that it is the vector of feature differences \vec \phi(q,d,d') := \vec \psi(q,d)-\vec \psi(q,d') that determines whether d should be ranked above d', or vice versa. In other words, we’re going to require that the relative ranking of d and d' is a function of the components of \vec \phi(q,d,d'), and not of either individual feature vector alone. So the problem now becomes: how should we use \vec \phi(q,d,d') to rank d and d'? A clue is provided by looking at the vectors of feature differences for our training data:

I haven’t labelled the vectors explicitly with q, d and d', but it’s pretty easy to figure out which is which, if you look carefully. Instead, I’ve labelled each feature difference vector \vec \phi(q,d,d') with a filled in oval when d is ranked better than d', and with a star when d' is ranked better than d. Examining the vectors carefully, we see that the vectors with an oval all “point the same way” as one another, while the vectors with a star also all point the same way as one another, but in the opposite direction. We can make this intuition more precise by saying that the two sets of vectors are separated into two half-spaces by a line:

So one way we could determine whether a feature difference vector is labelled by an oval or a cross is simply by determining which half-space it is in, i.e., by determing which side of the line it’s on. Or to reformulate it in our original terms: we can tell whether a webpage d ranks higher than a webpage d' simply by determining which half-space the feature difference vector \vec \phi(q,d,d') is in.

In higher-dimensional feature spaces this idea generalizes to separating the feature difference vectors into two half-spaces which are on opposite sides of a separating hyperplane. A convenient way of specifying this separating hyperplane, in any number of dimensions, is to introduce a normal vector \vec w to the hyperplane. I’ve illustrated such a normal vector below for two dimensions, but you should imagine that we’re working in higher dimensions:

The condition for a feature difference vector to be in (say) the upper half-space is that \vec \phi(q,d,d') \cdot \vec w > 0. To be in the lower half-space the condition is that \vec \phi(q,d,d') \cdot \vec w < 0. Summing everything up, we can check whether the document d should be ranked above or below d' simply by computing the sign of \vec \phi(q,d,d') \cdot \vec w.

The above observations suggest an algorithm for using the feature difference vectors to determine search relevance. It's a pretty obvious generalization of what I've just described, but at the risk of repeating myself I'll write it all out explicitly.

To start, we assume that a set of training data has been provided by human operators. Those operators have been given a set of training queries q_1,q_2,\ldots,q_m and training documents d_1,d_2,\ldots,d_n. For each training query they've ranked each training document in order of relevance to that query. They might decide, for example, that for the query q_1, the document d_{17} should be the top-ranked query, d_5 the second-ranked query, and so on. This training data provides us with a whole lot of feature difference vectors \vec \phi(q_i,d_j,d_k). These vectors can be divided up into two sets. The first set, which we'll call training data set A, contains those feature difference vectors for which d_j has been ranked more highly than d_k. The second set, which we'll call training data set B, contains those feature difference vectors for which d_k has been ranked more highly than d_j. We then find a separating hyperplane that separates these two data sets, i.e., with the upper half-space containing data set A (where d_j is ranked more highly than d_k for q_i), and a lower half-space containing data set B.

Suppose now that q is any query – not necessarily a training query. And suppose d and d' are any two documents, not just training documents. We rank d above d' for query q if the feature difference vector \vec \phi(q,d,d') lies in the upper half-space. And we rank d below d' if it lies in the lower half-space. This completes the description of the algorithm.

There are many problems with this basic algorithm. One problem becomes evident by going back to our primitive training data and adding an extra document:

Inspecting the feature difference vectors reveals a problem:

I’ve simplified the picture by showing only the feature difference vectors \phi(q,d,d') in data set A, i.e., those for which d ranks more highly than d'; data set B contains the negation of those vectors. It should be clear that there is no half-space which can be used to divide up the vectors into two sets. It’s not geometrically possible. The way we’ll deal with this is by modifying the algorithm in such a way as to find a half-space which approximately divides the two sets of feature difference vectors into half-spaces. I’ll explain how to do this approximate division later in the post.

Before getting to the approximate division, though, we’ll warm up by figuring out much more explicitly how to do the division into two half-spaces when it is possible. It’s all very well for me to glibly say that we should “figure out which half-space the vector \phi(q,d,d') is in”, but how can we actually do this in practice? In the next section I’ll introduce support vector machines, which are a way of doing this kind of division explicitly. Once we’ve understood the basics of support vector machines, we’ll come back to the question of how to divide two sets of vectors into approximate half-spaces.

Problems

  • Suppose the basic algorithm that I’ve described works, i.e., a division into half-spaces is possible. Suppose that for a particular query q the document d is ranked above d' and d' is ranked above d''. Prove that the algorithm will rank d above d''.

Support vector machines

Support vector machines are a technique for partitioning two sets of vectors (data set A and data set B) into half-spaces separated by a separating hyperplane. The technique is guaranteed to work whenever such a partitioning is possible, and whatsmore the partitioning is optimal, in a sense I’ll make precise. In this section I’ll describe briefly how support vector machines work.

As an aside, you’ll note that the notion of search (and related topics) wasn’t mentioned anywhere in the last paragraph. That’s because support vector machines aren’t about search. Instead, they’re a general technique for dividing sets of vectors into half-spaces, a technique which can be applied to many different problems in machine learning and artificial intelligence, not just search. So support vector machines are a useful technique to understand, even if you’re not especially interested in search. End of aside.

A priori if someone just gives you two sets of vectors, A and B, it’s not always going to be clear whether a partitioning into two half-spaces is possible. But let’s assume that such a partitioning is possible, and we’ll return later to what happens when it’s not. What we’d like to do is to find an explicit way of specifying a separating hyperplane:

Note that in the last section our separating hyperplanes passed through the origin. But for support vector machines we’ll also allow hyperplanes which don’t pass through the origin. One way to explicitly specify such a hyperplane is as the set of vectors \vec x satisfying the equation:

   \vec w \cdot \vec x + b = 0,

where \vec w is a vector normal to the separating hyperplane, and b is a constant. Together, \vec w and b specify the hyperplane. The goal of the support vector machine is to take data sets A and B, and use them to construct values for \vec w and b which specify a separating hyperplane. Whatsmore, we’ll try to do this in such a way that we maximize the size of the “wedge” between the two data sets,

where we require that the two edges of the wedge (called support or supporting hyperplanes) be parallel to the separating hyperplane itself. In other words, the goal is to choose the separating hyperplane (i.e., to choose \vec w and b) in such a way that these two supporting hyperplanes are as far apart from one another as possible.

Let me mention, by the way, that I’m not mad keen on the notations we’ve been using, such as \vec w and b. Unfortunately, it’s not going to get any better – we have some \xi‘s and C‘s in our near future, and I’m sorry to say that the origin of those notations won’t be much more transparent than the origins of \vec w and b. I’m using what seems to be standard notation in the support vector machine literature, but it’s not very good notation, in my opinion: it has little or no mnemonic value, and is very non-uniform to boot. While I’m complaining, I’ll mention one more thing: the vector \vec w is sometimes known as the weight vector, again, for reasons which seem obscure. End of rant.

Let’s get back to the problem of choosing \vec w and b so that the two supporting hyperplanes are as far apart from one another as possible. To do that, we need to figure out a way of expressing the distance between the two supporting hyperplanes in terms of \vec w and b. Without any loss of generality, we’ll suppose that the separating hyperplane is halfway between the two supporting hyperplanes:

Let’s fix our attention on one of the two supporting hyperplanes, say, the one that is “higher up” in the picture above. We’ll suppose, also without any loss of generality, that this is the hyperplane supporting data set A. By rescaling \vec w and b if necessary, we can ensure that this support hyperplane has the equation

   \vec w \cdot \vec x + b = 1,

and so the constraint that this hyperplane support data set A may be expressed as \vec w \cdot \vec x + b \geq 1. This is an important constraint: it’s the constraint on data set A.

Exercises

  • If you’re not comfortable with hyperplanes you may not immediately see why the kind of rescaling of \vec w and b which I did in the last paragraph is always possible. If this is the case, then write out an explicit proof.

We can now determine the distance between the supporting and separating hyperplane by introducing a vector \Delta which is: (a) normal to both hyperplanes, and (b) connects the two hyperplanes. A bit more precisely, \vec \Delta := \vec x'-\vec x, where \vec x' is in the supporting hyperplane and \vec x is the projection of \vec x' onto the separating hyperplane:

Our goal is to find the length of \vec \Delta. Taking the difference of the equations \vec w \cdot \vec x' + b = 1 and \vec w \cdot \vec x + b = 0, we have \vec w \cdot \vec \Delta = 1. Since \vec \Delta and \vec w are both normal to the separating hyperplane, and thus parallel, it follows that \Delta w = 1, where I am omitting the \vec \cdot in order to denote length. It follows that the distance from the separating to the supporting hyperplane is \Delta = 1/w.

We chose the separating hyperplane to be halfway between the two supporting hyperplanes, and so the total size of the wedge is 2\Delta = 2/w. As a result, maximizing the size of the wedge is equivalent to maximizing 2/w. This is subject to the constraint on data set A, namely that \vec w \cdot \vec x_j + b \geq 1 for all the vectors \vec x_j in data set A. There’s also a similar constraint for data set B,

   \vec w \cdot \vec x_j + b \leq -1

for all the vectors \vec x_j in data set B. The supporting hyperplane for data set B satisfies this condition with equality.

Exercises

  • Prove that \vec w \cdot x + b = -1 is the equation for the supporting hyperplane for data set B.

There is a reasonably nice way of combining the two sets of constraint inequalities for data sets A and B. That’s to label each vector \vec x_j with a value y_j = +1 if \vec x_j is in data set A, and with y_j = -1 if \vec x_j is in data set B. With these labels, we can now express the constraints on \vec w and b in a single set of inequalities

   y_j (\vec w \cdot \vec x_j + b) \geq 1,

with equality holding if the vector in question is on a supporting hyperlane (a so-called support vector). This single set of inequalities represents the constraint that \vec w and b really do define separating and supporting hyperplanes for data sets A and B, and that the distance between the supporting hyperplanes is 2/w.

Rather than maximize 2/w subject to the above constraints, it’s conventional to instead minimize w^2/2 = \vec w \cdot \vec w /2, subject to the same constraints. This is obviously equivalent, and the reason we make this change will become clear shortly.

Summing up, our problem is to find \vec w and b which minimize the value of:

   \frac{w^2}{2}

subject to the constraints

   y_j (\vec w \cdot \vec x_j + b) \geq 1,

where y_j = +1 if \vec x_j is in data set A, and y_j = -1 if \vec x_j is in data set B.

This formulation of our problem represents real progress. The reason is that the problem just described is an instance of a very well-known type of problem in mathematics and computer science, called a quadratic programming problem. Although quadratic programs don’t in general have simple, closed-form solutions, they’ve been studied for decades and are well understood, and there are good algorithms and libraries for solving them. It’s because quadratic programs are so well understood that we changed from maximizing 2/w to minimizing w^2/2 (the 1/2 is conventional, since it simplifies certain other expressions in the theory of quadratic programming). Indeed, quadratic programs are so well understood that not only are there many libraries for solving quadratic programs, some of those libraries have been specially designed to solve the quadratic programs which arise from support vector machines. Here’s a rather lengthy list of libraries and packages which are designed for support vector machines, and you may wish to experiment with those libraries.

Incidentally, one potentially confusing thing about the quadratic programming problem posed above – and one of the reasons I complain about the notation – is to be clear on exactly what is being solved for. The y_j and \vec x_j are known input data, which should be regarded as fixed, and the task is to solve for \vec w and b. This is obviously rather different from the traditional use of x and y to denote what we’re solving for, and it’s good to keep this clearly in mind.

We can now explain what a support vector machine actually is. We have some problem – like the search ranking problem – which generate sets of training data, A and B, in a vector space. The support vector machine will find the parameters \vec w and b for the separating hyperplane, so that the “wedge” between these two data sets is maximized, i.e., the distance between the supporting hyperplanes is maximized. We can then use the parameters \vec w and b to classify new problem instances \vec x as being of type A or type B. In particular, we classify \vec x as being of type A if \vec w \cdot \vec x + b \geq 0, and of type B if \vec w \cdot \vec x + b < 0.

A support vector machine is an example of a linear classifier: a decision procedure which makes a classification decision (“Is page d ranked higher than page d'“) using some linear function of the feature vector describing the problem instance. It’s a binary linear classifier, since the classification is into two types. Such classifiers are useful because they have such generality. Have a bunch of photos, and want to classify which are photos of fish? Spend some time (or money) generating training data, and some more time coming up with a few features that seem like they might be helpful. Then find a support vector machine which can be used to classify more problem instances, and see how well it works! Of course, nothing guarantees that it’ll work well, but with good feature selection it seems that if often really does work well in practice.

I won’t go any deeper into the theory or practice of support vector machines in this post. Instead, we’ll return to the question of how to deal with the fact that sometimes it’s not possible to separate the training data into two half-spaces. What we’ll see is that it is possible to do an approximate separation, using the same techniques of quadratic programming. This is known as a support vector machine with a soft margin. Once we’ve understood how that works I’ll come back to what this all means in the context of search.

Support vector machines with a soft margin

In our earlier discussion of search, we saw an example where the feature difference vectors in a set of training data cannot be partitioned into two half-spaces (again, as earlier, this diagram only shows feature difference vectors \vec \phi(q,d,d') where d should be ranked higher than d', i.e., it shows only data set A):

What can we do in this situation? There is a way of modifying the support vector machine idea to find an approximate way of partitioning two sets of vectors into half-spaces. The idea is to allow some of the vectors to slightly violate the half-space constraints. We compensate for these violations by paying a penalty in the function being minimized. This will ensure that any violations are quite small.

The way this idea is implemented is as follows. We introduce some slack variables \xi_j, one for each vector \vec x_j. The size of the slack variable \xi_j will determine how much violation of the half-space constraint is allowed for the vector \vec x_j. In particular, the constraints on the vectors are now relaxed to be

   y_j (\vec w \cdot \vec x_j + b) \geq 1-\xi_j.

We also impose the constraint on the slack variables that

   \xi_j \geq 0,

simply so as to ensure that we’re not over-constraining our vectors by picking a negative value for \xi_j. The larger a given \xi_j the more the vector \vec x_j can violate the partitioning into half-spaces. For that reason we introduce a new term in the function we want to minimize, in order to penalize excursions from the half-space:

   \frac{w^2}{2} + C \sum_j \xi_j.

Here C > 0 is called the soft margin parameter. If C is large, then we pay a substantial penalty even for small excursions from the half-spaces. On the other hand, if C is small then we pay only a small penalty, and so we can expect larger excursions from the half-spaces.

The problem defined in the last paragraph is (again) a quadratic program, and standard algorithms and libraries can be used to solve the program. Also as in the last section, given a solution to the program we can use that solution as a binary linear classifier. In particular, suppose \vec w and b appear in such a solution. Then given a new problem instance \vec x to be classified, we classify it as being of type A if \vec w \cdot \vec x + b \geq 0, and of type B if \vec w \cdot \vec x + b < 0.

An issue that I haven't addressed - and don't yet know how to solve - is how to choose the soft margin parameter, C. As a starting point for understanding this choice, let me at least state the following:

Theorem: A solution to the quadratic program above always exists.

This is a major improvement over the situation without the soft margin, where it’s not even guaranteed that a solution will exist. It means that no matter what the training data or soft margin parameter, we can always build a binary linear classifier using the support vector machine with that soft margin.

I won’t give the proof of the theorem, but the intuition is simple enough: for any choice of \vec w and b, by choosing \xi_j sufficiently large we can ensure that \vec x_j satisfies the constraints. A fairly standard continuity and compactness argument should do the trick (modulo some reasonable assumptions, e.g., that we’re working in finite dimensions and with finite data sets).

While the theorem above is an encouraging start, it’s not what we really want. What we’d like is to understand how best to choose C. Ideally, this means understanding how different choices of C impact the solutions which are found, and being more precise about what sorts of excursions from the half-spaces we are willing to allow, and what impact this has on the quality of classifications. This is, obviously, a challenging theoretical program! However, for the very practical problem of combining different relevancy factors in search, I think a great deal of progress could be made simply by trying out different values of C, and testing to see which value empirically gives the best results. Trial-and-error is often better than being smart.

Problems

  • Above, I talked about the soft margin parameter C being “large” or “small”, but didn’t explain exactly what I meant. How could we make these notions more precise?

  • Suppose the data sets have the property that data set A is the negation of data set B, i.e., for each vector \vec x in A, there is a corresponding vector -\vec x in B, and vice versa. (This is the case in the search problem, for example.) Prove that if a value for b appears in a solution to the quadratic program, then there is also a solution to the quadratic program with value -b. In fact, it’s possible to prove (though we won’t do so) that the solution to the quadratic program is unique, and so we must have b=-b, and therefore b=0. It follows that for the search problem we can assume b=0 in the solution to the quadratic program.

Problems for the author

  • The modifications to the support vector machine model made in this section are quite ad hoc. In particular, choosing the penalty to be C \sum_j \xi_j seems to me to be quite unmotivated. Is there some natural geometric way of deriving this model? Even better, is there a principled way of deriving the model?

Search revisited

Let’s return to the problem of search. As we saw in the last section, we can use training data to build a support vector machine (with soft margin) that, given a query, q, will tell us whether a document d is more relevant than d', or vice versa. Of course, this ranking decision won’t in any sense be “provably” optimal – it’s not even clear what that would mean, exactly. But the support vector machine has the advantage that it takes into account all the different relevancy information we have (all the different features), and it does so in a way that reflects the training data.

Problems

  • Suppose that the relevancy ranking algorithm just described ranks the document d above d', and d' is ranked above d'', for some query q. Prove that it will also rank d above d''.

How can we use this support vector machine in practice? One way is as follows. Suppose we have n documents. A user enters a query q. Then because we can easily compare any two documents in the collection, we can use a sort algorithm such as quicksort to produce a rank ordering of the n documents, using O(n \log n) comparisons.

This approach isn’t really the best, though. Suppose, for comparison, that you had n documents and wanted to figure out the top 10 documents (say), as ranked by PageRank. You could do a single pass over all n documents, keeping a running tally of the best 10 documents found to date. That’d take O(n) running time, which is significantly faster.

Of course, we can apply the same idea with our support vector machine. Simply do a single pass over all n documents, maintaining a running tally of the best 10 documents found to date. More precisely, each time you examine a new document, simply compare it to the current list of the top 10 documents, using the support vector machine, and if the new document is better than any of those, update the list.

Quite aside from being significantly faster than ranking all documents, a major advantage of the running tally method is that it is easily run on a large cluster. Each machine simply computes the top 10 documents stored on that machine, and then those short lists are merged at a central location.

This post has introduced a simple technique which can be used to combine different notions of search relevancy. It’s very much an introduction, and there are a huge number of problems I have not addressed. Perhaps the biggest one is this: how well does this procedure work in practice? That is, does it give a satisfying search experience, one which is significantly improved by the addition of new relevancy factors? I’d love to test this out, but I don’t have good enough trial data (yet) to do a really good test. Maybe someone can point to some real data along these lines.

At this point in the original drafting of this post, I began enumerating problems one might want to think about in applying these ideas to improve a real search engine. The list quickly became very long: there’s a lot to think about! So rather than list all those problems, I’ll conclude with just a couple of problems which you may ponder. Enjoy!

Problems for the author

  • Suppose we’re running a real-life search engine, and were thinking of introducing an additional feature. How could we use A/B testing to determine whether or not that additional feature does a little or a lot to help improve relevancy ranking?

  • At the beginning of this post I proposed figuring out (by thinking hard!) a principled way of combining cosine similarity and PageRank to come up with a composite measure of relevance. Find such a principled way of making the combination. I expect that the value in attacking this problem will lie not so much in whatever explicit combination I find, but rather in better understanding how to make such combinations. It may also shed some light on when we expect the machine learning approach to work well, and when we’d expect it to work poorly.

Interested in more? Please follow me on Twitter. You may also enjoy reading my new book about open science, Reinventing Discovery.

Jan 23 12

If correlation doesn’t imply causation, then what does?

by Michael Nielsen

It is a commonplace of scientific discussion that correlation does not imply causation. Business Week recently ran an spoof article pointing out some amusing examples of the dangers of inferring causation from correlation. For example, the article points out that Facebook’s growth has been strongly correlated with the yield on Greek government bonds: (credit)

Despite this strong correlation, it would not be wise to conclude that the success of Facebook has somehow caused the current (2009-2012) Greek debt crisis, nor that the Greek debt crisis has caused the adoption of Facebook!

Of course, while it’s all very well to piously state that correlation doesn’t imply causation, it does leave us with a conundrum: under what conditions, exactly, can we use experimental data to deduce a causal relationship between two or more variables?

The standard scientific answer to this question is that (with some caveats) we can infer causality from a well designed randomized controlled experiment. Unfortunately, while this answer is satisfying in principle and sometimes useful in practice, it’s often impractical or impossible to do a randomized controlled experiment. And so we’re left with the question of whether there are other procedures we can use to infer causality from experimental data. And, given that we can find more general procedures for inferring causal relationships, what does causality mean, anyway, for how we reason about a system?

It might seem that the answers to such fundamental questions would have been settled long ago. In fact, they turn out to be surprisingly subtle questions. Over the past few decades, a group of scientists have developed a theory of causal inference intended to address these and other related questions. This theory can be thought of as an algebra or language for reasoning about cause and effect. Many elements of the theory have been laid out in a famous book by one of the main contributors to the theory, Judea Pearl. Although the theory of causal inference is not yet fully formed, and is still undergoing development, what has already been accomplished is interesting and worth understanding.

In this post I will describe one small but important part of the theory of causal inference, a causal calculus developed by Pearl. This causal calculus is a set of three simple but powerful algebraic rules which can be used to make inferences about causal relationships. In particular, I’ll explain how the causal calculus can sometimes (but not always!) be used to infer causation from a set of data, even when a randomized controlled experiment is not possible. Also in the post, I’ll describe some of the limits of the causal calculus, and some of my own speculations and questions.

The post is a little technically detailed at points. However, the first three sections of the post are non-technical, and I hope will be of broad interest. Throughout the post I’ve included occasional “Problems for the author”, where I describe problems I’d like to solve, or things I’d like to understand better. Feel free to ignore these if you find them distracting, but I hope they’ll give you some sense of what I find interesting about the subject. Incidentally, I’m sure many of these problems have already been solved by others; I’m not claiming that these are all open research problems, although perhaps some are. They’re simply things I’d like to understand better. Also in the post I’ve included some exercises for the reader, and some slightly harder problems for the reader. You may find it informative to work through these exercises and problems.

Before diving in, one final caveat: I am not an expert on causal inference, nor on statistics. The reason I wrote this post was to help me internalize the ideas of the causal calculus. Occasionally, one finds a presentation of a technical subject which is beautifully clear and illuminating, a presentation where the author has seen right through the subject, and is able to convey that crystalized understanding to others. That’s a great aspirational goal, but I don’t yet have that understanding of causal inference, and these notes don’t meet that standard. Nonetheless, I hope others will find my notes useful, and that experts will speak up to correct any errors or misapprehensions on my part.

Simpson’s paradox

Let me start by explaining two example problems to illustrate some of the difficulties we run into when making inferences about causality. The first is known as Simpson’s paradox. To explain Simpson’s paradox I’ll use a concrete example based on the passage of the Civil Rights Act in the United States in 1964.

In the US House of Representatives, 61 percent of Democrats voted for the Civil Rights Act, while a much higher percentage, 80 percent, of Republicans voted for the Act. You might think that we could conclude from this that being Republican, rather than Democrat, was an important factor in causing someone to vote for the Civil Rights Act. However, the picture changes if we include an additional factor in the analysis, namely, whether a legislator came from a Northern or Southern state. If we include that extra factor, the situation completely reverses, in both the North and the South. Here’s how it breaks down:

North: Democrat (94 percent), Republican (85 percent)

South: Democrat (7 percent), Republican (0 percent)

Yes, you read that right: in both the North and the South, a larger fraction of Democrats than Republicans voted for the Act, despite the fact that overall a larger fraction of Republicans than Democrats voted for the Act.

You might wonder how this can possibly be true. I’ll quickly state the raw voting numbers, so you can check that the arithmetic works out, and then I’ll explain why it’s true. You can skip the numbers if you trust my arithmetic.

North: Democrat (145/154, 94 percent), Republican (138/162, 85 percent)

South: Democrat (7/94, 7 percent), Republican (0/10, 0 percent)

Overall: Democrat (152/248, 61 percent), Republican (138/172, 80 percent)

One way of understanding what’s going on is to note that a far greater proportion of Democrat (as opposed to Republican) legislators were from the South. In fact, at the time the House had 94 Democrats, and only 10 Republicans. Because of this enormous difference, the very low fraction (7 percent) of southern Democrats voting for the Act dragged down the Democrats’ overall percentage much more than did the even lower fraction (0 percent) of southern Republicans who voted for the Act.

(The numbers above are for the House of Congress. The numbers were different in the Senate, but the same overall phenomenon occurred. I’ve taken the numbers from Wikipedia’s article about Simpson’s paradox, and there are more details there.)

If we take a naive causal point of view, this result looks like a paradox. As I said above, the overall voting pattern seems to suggest that being Republican, rather than Democrat, was an important causal factor in voting for the Civil Rights Act. Yet if we look at the individual statistics in both the North and the South, then we’d come to the exact opposite conclusion. To state the same result more abstractly, Simpson’s paradox is the fact that the correlation between two variables can actually be reversed when additional factors are considered. So two variables which appear correlated can become anticorrelated when another factor is taken into account.

You might wonder if results like those we saw in voting on the Civil Rights Act are simply an unusual fluke. But, in fact, this is not that uncommon. Wikipedia’s page on Simpson’s paradox lists many important and similar real-world examples ranging from understanding whether there is gender-bias in university admissions to which treatment works best for kidney stones. In each case, understanding the causal relationships turns out to be much more complex than one might at first think.

I’ll now go through a second example of Simpson’s paradox, the kidney stone treatment example just mentioned, because it helps drive home just how bad our intuitions about statistics and causality are.

Imagine you suffer from kidney stones, and your Doctor offers you two choices: treatment A or treatment B. Your Doctor tells you that the two treatments have been tested in a trial, and treatment A was effective for a higher percentage of patients than treatment B. If you’re like most people, at this point you’d say “Well, okay, I’ll go with treatment A”.

Here’s the gotcha. Keep in mind that this really happened. Suppose you divide patients in the trial up into those with large kidney stones, and those with small kidney stones. Then even though treatment A was effective for a higher overall percentage of patients than treatment B, treatment B was effective for a higher percentage of patients in both groups, i.e., for both large and small kidney stones. So your Doctor could just as honestly have said “Well, you have large [or small] kidney stones, and treatment B worked for a higher percentage of patients with large [or small] kidney stones than treatment A”. If your Doctor had made either one of these statements, then if you’re like most people you’d have decided to go with treatment B, i.e., the exact opposite treatment.

The kidney stone example relies, of course, on the same kind of arithmetic as in the Civil Rights Act voting, and it’s worth stopping to figure out for yourself how the claims I made above could possibly be true. If you’re having trouble, you can click through to the Wikipedia page, which has all the details of the numbers.

Now, I’ll confess that before learning about Simpson’s paradox, I would have unhesitatingly done just as I suggested a naive person would. Indeed, even though I’ve now spent quite a bit of time pondering Simpson’s paradox, I’m not entirely sure I wouldn’t still sometimes make the same kind of mistake. I find it more than a little mind-bending that my heuristics about how to behave on the basis of statistical evidence are obviously not just a little wrong, but utterly, horribly wrong.

Perhaps I’m alone in having terrible intuition about how to interpret statistics. But frankly I wouldn’t be surprised if most people share my confusion. I often wonder how many people with real decision-making power – politicians, judges, and so on – are making decisions based on statistical studies, and yet they don’t understand even basic things like Simpson’s paradox. Or, to put it another way, they have not the first clue about statistics. Partial evidence may be worse than no evidence if it leads to an illusion of knowledge, and so to overconfidence and certainty where none is justified. It’s better to know that you don’t know.

Correlation, causation, smoking, and lung cancer

As a second example of the difficulties in establishing causality, consider the relationship between cigarette smoking and lung cancer. In 1964 the United States’ Surgeon General issued a report claiming that cigarette smoking causes lung cancer. Unfortunately, according to Pearl the evidence in the report was based primarily on correlations between cigarette smoking and lung cancer. As a result the report came under attack not just by tobacco companies, but also by some of the world’s most prominent statisticians, including the great Ronald Fisher. They claimed that there could be a hidden factor – maybe some kind of genetic factor – which caused both lung cancer and people to want to smoke (i.e., nicotine craving). If that was true, then while smoking and lung cancer would be correlated, the decision to smoke or not smoke would have no impact on whether you got lung cancer.

Now, you might scoff at this notion. But derision isn’t a principled argument. And, as the example of Simpson’s paradox showed, determining causality on the basis of correlations is tricky, at best, and can potentially lead to contradictory conclusions. It’d be much better to have a principled way of using data to conclude that the relationship between smoking and lung cancer is not just a correlation, but rather that there truly is a causal relationship.

One way of demonstrating this kind of causal connection is to do a randomized, controlled experiment. We suppose there is some experimenter who has the power to intervene with a person, literally forcing them to either smoke (or not) according to the whim of the experimenter. The experimenter takes a large group of people, and randomly divides them into two halves. One half are forced to smoke, while the other half are forced not to smoke. By doing this the experimenter can break the relationship between smoking and any hidden factor causing both smoking and lung cancer. By comparing the cancer rates in the group who were forced to smoke to those who were forced not to smoke, it would then be possible determine whether or not there is truly a causal connection between smoking and lung cancer.

This kind of randomized, controlled experiment is highly desirable when it can be done, but experimenters often don’t have this power. In the case of smoking, this kind of experiment would probably be illegal today, and, I suspect, even decades into the past. And even when it’s legal, in many cases it would be impractical, as in the case of the Civil Rights Act, and for many other important political, legal, medical, and econonomic questions.

Causal models

To help address problems like the two example problems just discussed, Pearl introduced a causal calculus. In the remainder of this post, I will explain the rules of the causal calculus, and use them to analyse the smoking-cancer connection. We’ll see that even without doing a randomized controlled experiment it’s possible (with the aid of some reasonable assumptions) to infer what the outcome of a randomized controlled experiment would have been, using only relatively easily accessible experimental data, data that doesn’t require experimental intervention to force people to smoke or not, but which can be obtained from purely observational studies.

To state the rules of the causal calculus, we’ll need several background ideas. I’ll explain those ideas over the next three sections of this post. The ideas are causal models (covered in this section), causal conditional probabilities, and d-separation, respectively. It’s a lot to swallow, but the ideas are powerful, and worth taking the time to understand. With these notions under our belts, we’ll able to understand the rules of the causal calculus

To understand causal models, consider the following graph of possible causal relationships between smoking, lung cancer, and some unknown hidden factor (say, a hidden genetic factor):

This is a quite general model of causal relationships, in the sense that it includes both the suggestion of the US Surgeon General (smoking causes cancer) and also the suggestion of the tobacco companies (a hidden factor causes both smoking and cancer). Indeed, it also allows a third possibility: that perhaps both smoking and some hidden factor contribute to lung cancer. This combined relationship could potentially be quite complex: it could be, for example, that smoking alone actually reduces the chance of lung cancer, but the hidden factor increases the chance of lung cancer so much that someone who smokes would, on average, see an increased probability of lung cancer. This sounds unlikely, but later we’ll see some toy model data which has exactly this property.

Of course, the model depicted in the graph above is not the most general possible model of causal relationships in this system; it’s easy to imagine much more complex causal models. But at the very least this is an interesting causal model, since it encompasses both the US Surgeon General and the tobacco company suggestions. I’ll return later to the possibility of more general causal models, but for now we’ll simply keep this model in mind as a concrete example of a causal model.

Mathematically speaking, what do the arrows of causality in the diagram above mean? We’ll develop an answer to that question over the next few paragraphs. It helps to start by moving away from the specific smoking-cancer model to allow a causal model to be based on a more general graph indicating possible causal relationships between a number of variables:

Each vertex in this causal model has an associated random variable, X_1,X_2,\ldots. For example, in the causal model above X_2 could be a two-outcome random variable indicating the presence or absence of some gene that exerts an influence on whether someone smokes or gets lung cancer, X_3 indicates “smokes” or “does not smoke”, and X_4 indicates “gets lung cancer” or “doesn’t get lung cancer”. The other variables X_1 and X_5 would refer to other potential dependencies in this (somewhat more complex) model of the smoking-cancer connection.

A notational convention that we’ll use often is to interchangeably use X_j to refer to a random variable in the causal model, and also as a way of labelling the corresponding vertex in the graph for the causal model. It should be clear from context which is meant. We’ll also sometimes refer interchangeably to the causal model or to the associated graph.

For the notion of causality to make sense we need to constrain the class of graphs that can be used in a causal model. Obviously, it’d make no sense to have loops in the graph:

We can’t have X causing Y causing Z causing X! At least, not without a time machine. Because of this we constrain the graph to be a directed acyclic graph, meaning a (directed) graph which has no loops in it.

By the way, I must admit that I’m not a fan of the term directed acyclic graph. It sounds like a very complicated notion, at least to my ear, when what it means is very simple: a graph with no loops. I’d really prefer to call it a “loop-free graph”, or something like that. Unfortunately, the “directed acyclic graph” nomenclature is pretty standard, so we’ll go with it.

Our picture so far is that a causal model consists of a directed acyclic graph, whose vertices are labelled by random variables X_1,X_2,\ldots. To complete our definition of causal models we need to capture the allowed relationships between those random variables.

Intuitively, what causality means is that for any particular X_j the only random variables which directly influence the value of X_j are the parents of X_j, i.e., the collection X_{\mbox{pa}(j)} of random variables which are connected directly to X_j. For instance, in the graph shown below (which is the same as the complex graph we saw a little earlier), we have X_{\mbox{pa}(4)} = (X_2,X_3):

Now, of course, vertices further back in the graph – say, the parents of the parents – could, of course, influence the value of X_4. But it would be indirect, an influence mediated through the parent vertices.

Note, by the way, that I’ve overloaded the X notation, using X_{\mbox{pa}(4)} to denote a collection of random variables. I’ll use this kind of overloading quite a bit in the rest of this post. In particular, I’ll often use the notation X (or W, Y or Z) to denote a subset of random variables from the graph.

Motivated by the above discussion, one way we could define causal influence would be to require that X_j be a function of its parents:

 X_j = f_j(X_{\mbox{pa}(j)}),

where f_j(\cdot) is some function. In fact, we’ll allow a slightly more general notion of causal influence, allowing X_j to not just be a deterministic function of the parents, but a random function. We do this by requiring that X_j be expressible in the form:

 X_j = f_j(X_{\mbox{pa}(j)},Y_{j,1},Y_{j,2},\ldots),

where f_j is a function, and Y_{j,\cdot} is a collection of random variables such that: (a) the Y_{j,\cdot} are independent of one another for different values of j; and (b) for each j, Y_{j,\cdot} is independent of all variables X_k, except when X_k is X_j itself, or a descendant of X_j. The intuition is that the Y_{j,\cdot} are a collection of auxiliary random variables which inject some extra randomness into X_j (and, through X_j, its descendants), but which are otherwise independent of the variables in the causal model.

Summing up, a causal model consists of a directed acyclic graph, G, whose vertices are labelled by random variables, X_j, and each X_j is expressible in the form X_j = f_j(X_{\mbox{pa}(j)},Y_{j,\cdot}) for some function f_j. The Y_{j,\cdot} are independent of one another, and each Y_{j,\cdot} is independent of all variables X_k, except when X_k is X_j or a descendant of X_j.

In practice, we will not work directly with the functions f_j or the auxiliary random variables Y_{j,\cdot}. Instead, we’ll work with the following equation, which specifies the causal model’s joint probability distribution as a product of conditional probabilities:

   p(x_1,x_2,\ldots) = \prod_j p(x_j | \mbox{pa}(x_j)).

I won’t prove this equation, but the expression should be plausible, and is pretty easy to prove; I’ve asked you to prove it as an optional exercise below.

Exercises

  • Prove the above equation for the joint probability distribution.

Problems

  • (Simpson’s paradox in causal models) Consider the causal model of smoking introduced above. Suppose that the hidden factor is a gene which is either switched on or off. If on, it tends to make people both smoke and get lung cancer. Find explicit values for conditional probabilities in the causal model such that p(\mbox{cancer} | \mbox{smokes}) > p(\mbox{cancer} | \mbox{doesn't     smoke}), and yet if the additional genetic factor is taken into account this relationship is reversed. That is, we have both p(\mbox{cancer} | \mbox{smokes, gene on}) < p(\mbox{cancer} |   \mbox{doesn't smoke, gene on}) and p(\mbox{cancer} | \mbox{smokes,     gene off}) < p(\mbox{cancer} | \mbox{doesn't smoke, gene off}).

Problems for the author

  • An alternate, equivalent approach to defining causal models is as follows: (1) all root vertices (i.e., vertices with no parents) in the graph are labelled by independent random variables. (2) augment the graph by introducing new vertices corresponding to the Y_{j,k}. These new vertices have single outgoing edges, pointing to X_j. (3) Require that non-root vertices in the augmented graph be deterministic functions of their parents. The disadvantage of this definition is that it introduces the overhead of dealing with the augmented graph. But the definition also has the advantage of cleanly separating the stochastic and deterministic components, and I wouldn’t be surprised if developing the theory of causal inference from this point of view was stimulating, at the very least, and may possibly have some advantages compared to the standard approach. So the problem I set myself (and anyone else who is interested!) is to carry the consequences of this change through the rest of the theory of causal inference, looking for advantages and disadvantages.

I’ve been using terms like “causal influence” somewhat indiscriminately in the discussion above, and so I’d like to pause to discuss a bit more carefully about what is meant here, and what nomenclature we should use going forward. All the arrows in a causal model indicate are the possibility of a direct causal influence. This results in two caveats on how we think about causality in these models. First, it may be that a child random variable is actually completely independent of the value of one (or more) of its parent random variables. This is, admittedly, a rather special case, but is perfectly consistent with the definition. For example, in a causal model like

it is possible that the outcome of cancer might be independent of the hidden causal factor or, for that matter, that it might be independent of whether someone smokes or not. (Indeed, logically, at least, it may be independent of both, although of course that’s not what we’ll find in the real world.) The second caveat in how we think about the arrows and causality is that the arrows only capture the direct causal influences in the model. It is possible that in a causal model like

X_1 will have a causal influence on X_5 through its influence on X_2 and X_3. This would be an indirect causal influence, mediated by other random variables, but it would still be a causal influence. In the next section I’ll give a more formal definition of causal influence that can be used to make these ideas precise.

Causal conditional probabilities

In this section I’ll explain what I think is the most imaginative leap underlying the causal calculus. It’s the introduction of the concept of causal conditional probabilities.

The notion of ordinary conditional probabilities is no doubt familiar to you. It’s pretty straightforward to do experiments to estimate conditional probabilities such as p(\mbox{cancer}| \mbox{smoking}), simply by looking at the population of people who smoke, and figuring out what fraction of those people develop cancer. Unfortunately, for the purpose of understanding the causal relationship between smoking and cancer, p(\mbox{cancer}| \mbox{smoking}) isn’t the quantity we want. As the tobacco companies pointed out, there might well be a hidden genetic factor that makes it very likely that you’ll see cancer in anyone who smokes, but that wouldn’t therefore mean that smoking causes cancer.

As we discussed earlier, what you’d really like to do in this circumstance is a randomized controlled experiment in which it’s possible for the experimenter to force someone to smoke (or not smoke), breaking the causal connection between the hidden factor and smoking. In such an experiment you really could see if there was a causal influence by looking at what fraction of people who smoked got cancer. In particular, if that fraction was higher than in the overall population then you’d be justified in concluding that smoking helped cause cancer. In practice, it’s probably not practical to do this kind of randomized controlled experiment. But Pearl had what turns out to be a very clever idea: to imagine a hypothetical world in which it really is possible to force someone to (for example) smoke, or not smoke. In particular, he introduced a conditional causal probability p(\mbox{cancer}| \mbox{do(smoking)}), which is the conditional probability of cancer in this hypothetical world. This should be read as the (causal conditional) probability of cancer given that we “do” smoking, i.e., someone has been forced to smoke in a (hypothetical) randomized experiment.

Now, at first sight this appears a rather useless thing to do. But what makes it a clever imaginative leap is that although it may be impossible or impractical to do a controlled experiment to determine p(\mbox{cancer}|\mbox{do}(\mbox{smoking})), Pearl was able to establish a set of rules – a causal calculus – that such causal conditional probabilities should obey. And, by making use of this causal calculus, it turns out to sometimes be possible to infer the value of probabilities such as p(\mbox{cancer}|\mbox{do}(\mbox{smoking})), even when a controlled, randomized experiment is impossible. And that’s a very remarkable thing to be able to do, and why I say it was so clever to have introduced the notion of causal conditional probabilities.

We’ll discuss the rules of the causal calculus later in this post. For now, though, let’s develop the notion of causal conditional probabilities. Suppose we have a causal model of some phenomenon:

Now suppose we introduce an external experimenter who is able to intervene to deliberately set the value of a particular variable X_j to x_j. In other words, the experimenter can override the other causal influences on that variable. This is equivalent to having a new causal model:

In this new causal model, we’ve represented the experimenter by a new vertex, which has as a child the vertex X_j. All other parents of X_j are cut off, i.e., the edges from the parents to X_j are deleted from the graph. In this case that means the edge from X_2 to X_3 has been deleted. This represents the fact that the experimenter’s intervention overrides the other causal influences. (Note that the edges to the children of X_j are left undisturbed.) In fact, it’s even simpler (and equivalent) to consider a causal model where the parents have been cut off from X_j, and no extra vertex added:

This model has no vertex explicitly representing the experimenter, but rather the relation X_j = f_j(X_{{\rm pa}(j},Y_{j,\cdot}) is replaced by the relation X_j = x_j. We will denote this graph by G_{\overline X_j}, indicating the graph in which all edges pointing to X_j have been deleted. We will call this a perturbed graph, and the corresponding causal model a perturbed causal model. In the perturbed causal model the only change is to delete the edges to X_j, and to replace the relation X_j = f_j(X_{{\rm     pa}(j},Y_{j,\cdot}) by the relation X_j = x_j.

Our aim is to use this perturbed causal model to compute the conditional causal probability p(x_1,\ldots,\hat x_j, \ldots, x_n | \mbox{do}(x_j)). In this expression, \hat x_j indicates that the x_j term is omitted before the |, since the value of x_j is set on the right. By definition, the causal conditional probability p(x_1,\ldots,\hat x_j, \ldots, x_n | \mbox{do}(x_j)) is just the value of the probability distribution in the perturbed causal model, p'(x_1,\ldots,x_n). To compute the value of the probability in the perturbed causal model, note that the probability distribution in the original causal model was given by

   p(x_1,\ldots,x_n) = \prod_k p(x_k| \mbox{pa}(x_k)),

where the product on the right is over all vertices in the causal model. This expression remains true for the perturbed causal model, but a single term on the right-hand side changes: the conditional probability for the x_j term. In particular, this term gets changed from p(x_j| \mbox{pa}(x_j)) to 1, since we have fixed the value of X_j to be x_j. As a result we have:

 p(x_1,\ldots,\hat x_j,\ldots,x_n | \mbox{do}(x_j))  = \frac{p(x_1,\ldots,x_n)}{p(x_j|\mbox{pa}(x_j))}.

This equation is a fundamental expression, capturing what it means for an experimenter to intervene to set the value of some particular variable in a causal model. It can easily be generalized to a situation where we partition the variables into two sets, X and Y, where X are the variables we suppose have been set by intervention in a (possibly hypothetical) randomized controlled experiment, and Y are the remaining variables:

 [1] \,\,\,\, p(Y=y| \mbox{do}(X=x)) = \frac{p(X=x,Y=y)}{\Pi_j p(X_j = x_j|   \mbox{pa}(X_j))}.

Note that on the right-hand side the values for \mbox{pa}(X_j) are assumed to be given by the appropriate values from x and y. The expression [1] can be viewed as a definition of causal conditional probabilities. But although this expression is fundamental to understanding the causal calculus, it is not always useful in practice. The problem is that the values of some of the variables on the right-hand side may not be known, and cannot be determined by experiment. Consider, for example, the case of smoking and cancer. Recall our causal model:

What we’d like is to compute p(\mbox{cancer}| \mbox{do(smoking)}). Unfortunately, we immediately run into a problem if we try to use the expression on the right of equation [1]: we’ve got no way of estimating the conditional probabilities for smoking given the hidden common factor. So we can’t obviously compute p(\mbox{cancer}| \mbox{do(smoking)}). And, as you can perhaps imagine, this is the kind of problem that will come up a lot whenever we’re worried about the possible influence of some hidden factor.

All is not lost, however. Just because we can’t compute the expression on the right of [1] directly doesn’t mean we can’t compute causal conditional probabilities in other ways, and we’ll see below how the causal calculus can help solve this kind of problem. It’s not a complete solution – we shall see that it doesn’t always make it possible to compute causal conditional probabilities. But it does help. In particular, we’ll see that although it’s not possible to compute p(\mbox{cancer}| \mbox{do(smoking)}) for this causal model, it is possible to compute p(\mbox{cancer}| \mbox{do(smoking)}) in a very similar causal model, one that still has a hidden factor.

With causal conditional probabilities defined, we’re now in position to define more precisely what we mean by causal influence. Suppose we have a causal model, and X and Y are distinct random variables (or disjoint subsets of random variables). Then we say X has a causal influence over Y if there are values x_1 and x_2 of X and y of y such that p(y|\mbox{do}(x_1)) \neq p(y|\mbox{do}(x_2)). In other words, an external experimenter who can intervene to change the value of X can cause a corresponding change in the distribution of values at Y. The following exercise gives an information-theoretic justification for this definition of causal influence: it shows that an experimenter who can intervene to set X can transmit information to Y if and only if the above condition for causal inference is met.

Exercises

  • (The causal capacity) This exercise is for people with some background in information theory. Suppose we define the causal capacity between X and Y to be \max_{p(\hat     x)} H(\hat X: \hat Y), where H(\cdot:\cdot) is the mutual information, the maximization is over possible distributions p(\hat   x) for \hat X (we use the hat to indicate that the value of X is being set by intervention), and \hat Y is the corresponding random variable at Y, with distribution p(\hat y) = \sum_{\hat x}   p(\hat y|\mbox{do}(\hat x)) p(\hat x). Shannon’s noisy channel coding theorem tells us that an external experimenter who can intervene to set the value of X can transmit information to an observer at Y at a maximal rate set by the causal capacity. Show that the causal capacity is greater than zero if and only if X has a causal influence over Y.

We’ve just defined a notion of causal influence between two random variables in a causal model. What about when we say something like “Event A” causes “Event B”? What does this mean? Returning to the smoking-cancer example, it seems that we would say that cancer causes smoking provided p(\mbox{cancer} | \mbox{do}(\mbox{smoking})) > p(\mbox{cancer}), so that if someone makes the choice to smoke, uninfluenced by other causal factors, then they would increase their chance of cancer. Intuitively, it seems to me that this notion of events causing one another should be related to the notion of causal influence just defined above. But I don’t yet see quite how to do that. The first problem below suggests a conjecture in this direction:

Problems for the author

  • Suppose X and Y are random variables in a causal model such that p(Y=y | \mbox{do}(X=x)) > p(Y=y) for some pair of values x and y. Does this imply that X exerts a causal influence on Y?
  • (Sum-over-paths for causal conditional probabilities?) I believe a kind of sum-over-paths formulation of causal conditional probabilities is possible, but haven’t worked out details. The idea is as follows (the details may be quite wrong, but I believe something along these lines should work). Supose X and Y are single vertices (with corresponding random variables) in a causal model. Then I would like to show first that if X is not an ancestor of Y then p(y|\mbox{do}(x)) = p(y), i.e., intervention does nothing. Second, if X is an ancestor of Y then p(y|\mbox{do}(x)) may be obtained by summing over all directed paths from X to Y in G_{\overline X}, and computing for each path a contribution to the sum which is a product of conditional probabilities along the path. (Note that we may need to consider the same path multiple times in the sum, since the random variables along the path may take different values).
  • We used causal models in our definition of causal conditional probabilities. But our informal definiton – imagine a hypothetical world in which it’s possible to force a variable to take a particular value – didn’t obviously require the use of a causal model. Indeed, in a real-world randomized controlled experiment it may be that there is no underlying causal model. This leads me to wonder if there is some other way of formalizing the informal definition we’ve given?
  • Another way of framing the last problem is that I’m concerned about the empirical basis for causal models. How should we go about constructing such models? Are they fundamental, representing true facts about the world, or are they modelling conveniences? (This is by no means a dichotomy.) It would be useful to work through many more examples, considering carefully the origin of the functions f_j(\cdot) and of the auxiliary random variables Y_{j,\cdot}.

d-separation

In this section we’ll develop a criterion that Pearl calls directional separation (d-separation, for short). What d-separation does is let us inspect the graph of a causal model and conclude that a random variable X in the model can’t tell us anything about the value of another random variable Y in the model, or vice versa.

To understand d-separation we’ll start with a simple case, and then work through increasingly complex cases, building up our intuition. I’ll conclude by giving a precise definition of d-separation, and by explaining how d-separation relates to the concept of conditional independence of random variables.

Here’s the first simple causal model:

Clearly, knowing X can in general tell us something about Y in this kind of causal model, and so in this case X and Y are not d-separated. We’ll use the term d-connected as a synonym for “not d-separated”, and so in this causal model X and Y are d-connected.

By contrast, in the following causal model X and Y don’t give us any information about each other, and so they are d-separated:

A useful piece of terminology is to say that a vertex like the middle vertex in this model is a collider for the path from X to Y, meaning a vertex at which both edges along the path are incoming.

What about the causal model:

In this case, it is possible that knowing X will tell us something about Y, because of their common ancestry. It’s like the way knowing the genome for one sibling can give us information about the genome of another sibling, since similarities between the genomes can be inferred from the common ancestry. We’ll call a vertex like the middle vertex in this model a fork for the path from X to Y, meaning a vertex at which both edges are outgoing.

Exercises

  • Construct an explicit causal model demonstrating the assertion of the last paragraph. For example, you may construct a causal model in which X and Y are joined by a fork, and where Y is actually a function of X.

  • Suppose we have a path from X to Y in a causal model. Let c be the number of colliders along the path, and let f be the number of forks along the path. Show that |f-c| can only take the values 0, 1 or -1, i.e., the number of forks and colliders is either the same or differs by at most one.

We’ll say that a path (of any length) from X to Y that contains a collider is a blocked path. By contrast, a path that contains no colliders is called an unblocked path. (Note that by the above exercise, an unblocked path must contain either one or no forks.) In general, we define X and Y to be d-connected if there is an unblocked path between them. We define them to be d-separated if there is no such unblocked path.

It’s worth noting that the concepts of d-separation and d-connectedness depend only on the graph topology and on which vertices X and Y have been chosen. In particular, they don’t depend on the nature of the random variables X and Y, merely on the identity of the corresponding vertices. As a result, you can determine d-separation or d-connectdness simply by inspecting the graph. This fact – that d-separation and d-connectdness are determined by the graph – also holds for the more sophisticated notions of d-separation and d-connectedness we develop below.

With that said, it probably won’t surprise you to learn that the concept of d-separation is closely related to whether or not the random variables X and Y are independent of one another. This is a connection you can (optionally) develop through the following exercises. I’ll state a much more general connection below.

Exercises

  • Suppose that X and Y are d-separated. Show that X and Y are independent random variables, i.e., that p(x,y) = p(x)p(y).

  • Suppose we have two vertices which are d-connected in a graph G. Explain how to construct a causal model on that graph such that the random variables X and Y corresponding to those two vertices are not independent.
  • The last two exercises almost but don’t quite claim that random variables X and Y in a causal model are independent if and only if they are d-separated. Why does this statement fail to be true? How can you modify the statement to make it true?

So far, this is pretty simple stuff. It gets more complicated, however, when we extend the notion of d-separation to cases where we are conditioning on already knowing the value of one or more random variables in the causal model. Consider, for example, the graph:

(Figure A.)

Now, if we know Z, then knowing X doesn’t give us any additional information about Y, since by our original definition of a causal model Y is already a function of Z and some auxiliary random variables which are independent of X. So it makes sense to say that Z blocks this path from X to Y, even though in the unconditioned case this path would not have been considered blocked. We’ll also say that X and Y are d-separated, given Z.

It is helpful to give a name to vertices like the middle vertex in Figure A, i.e., to vertices with one ingoing and one outgoing edge. We’ll call such vertices a traverse along the path from X to Y. Using this language, the lesson of the above discussion is that if Z is in a traverse along a path from X to Y, then the path is blocked.

By contrast, consider this model:

In this case, knowing X will in general give us additional information about Y, even if we know Z. This is because while Z blocks one path from X to Y there is another unblocked path from X to Y. And so we say that X and Y are d-connected, given Z.

Another case similar to Figure A is the model with a fork:

Again, if we know Z, then knowing X as well doesn’t give us any extra information about Y (or vice versa). So we’ll say that in this case Z is blocking the path from X to Y, even though in the unconditioned case this path would not have been considered blocked. Again, in this example X and Y are d-separated, given Z.

The lesson of this model is that if Z is located at a fork along a path from X to Y, then the path is blocked.

A subtlety arises when we consider a collider:

(Figure B.)

In the unconditioned case this would have been considered a blocked path. And, naively, it seems as though this should still be the case: at first sight (at least according to my intuition) it doesn’t seem very likely that X can give us any additional information about Y (or vice versa), even given that Z is known. Yet we should be cautious, because the argument we made for the graph in Figure A breaks down: we can’t say, as we did for Figure A, that Y is a function of Z and some auxiliary independent random variables.

In fact, we’re wise to be cautious because X and Y really can tell us something extra about one another, given a knowledge of Z. This is a phenomenon which Pearl calls Berkson’s paradox. He gives the example of a graduate school in music which will admit a student (a possibility encoded in the value of Z) if either they have high undergraduate grades (encoded in X) or some other evidence that they are exceptionally gifted at music (encoded in Y). It would not be surprising if these two attributes were anticorrelated amongst students in the program, e.g., students who were admitted on the basis of exceptional gifts would be more likely than otherwise to have low grades. And so in this case knowledge of Y (exceptional gifts) would give us knowledge of X (likely to have low grades), conditioned on knowledge of Z (they were accepted into the program).

Another way of seeing Berkson’s paradox is to construct an explicit causal model for the graph in Figure B. Consider, for example, a causal model in which X and Y are independent random bits, 0 or 1, chosen with equal probabilities 1/2. We suppose that Z = X \oplus Y, where \oplus is addition modulo 2. This causal model does, indeed, have the structure of Figure B. But given that we know the value Z, knowing the value of X tells us everything about Y, since Y = Z \oplus X.

As a result of this discussion, in the causal graph of Figure B we’ll say that Z unblocks the path from X to Y, even though in the unconditioned case the path would have been considered blocked. And we’ll also say that in this causal graph X and Y are d-connected, conditional on Z.

The immediate lesson from the graph of Figure B is that X and Y can tell us something about one another, given Z, if there is a path between X and Y where the only collider is at Z. In fact, the same phenomenon can occur even in this graph:

(Figure C.)

To see this, suppose we choose X and Y as in the example just described above, i.e., independent random bits, 0 or 1, chosen with equal probabilities 1/2. We will let the unlabelled vertex be W = X \oplus Y. And, finally, we choose Z = W. Then we see as before that X can tell us something about Y, given that we know Z, because X = Y \oplus Z.

The general intuition about graphs like that in Figure C is that knowing Z allows us to infer something about the ancestors of Z, and so we must act as though those ancestors are known, too. As a result, in this case we say that Z unblocks the path from X to Y, since Z has an ancestor which is a collider on the path from X to Y. And so in this case X is d-connected to Y, given Z.

Given the discussion of Figure C that we’ve just had, you might wonder why forks or traverses which are ancestors of Z can’t block a path, for similar reasons? For instance, why don’t we consider X and Y to be d-separated, given Z, in the following graph:

The reason, of course, is that it’s easy to construct examples where X tells us something about Y in addition to what we already know from Z. And so we can’t consider X and Y to be d-separated, given Z, in this example.

These examples motivate the following definition:

Definition: Let X, Y and Z be disjoint subsets of vertices in a causal model. Consider a path from a vertex in X to a vertex in Y. We say the path is blocked by Z if the path contains either: (a) a collider which is not an ancestor of Z, or (b) a fork which is in Z, or (c) a traverse which is in Z. We say the path is unblocked if it is not blocked. We say that X and Y are d-connected, given Z, if there is an unblocked path between some vertex in X and some vertex in Y. X and Y are d-separated, given Z, if they are not d-connected.

Saying “X and Y are d-separated given Z” is a bit of a mouthful, and so it’s helpful to have an abbreviated notation. We’ll use the abbreviation (X \perp Y|Z)_G. Note that this notation includes the graph G; we’ll sometimes omit the graph when the context is clear. We’ll write (X \perp Y)_G to denote unconditional d-separation.

As an aside, Pearl uses a similar but slightly different notation for d-separation, namely (X \perp \! \! \perp Y|Z)_G. Unfortunately, while the symbol \perp \! \! \perp looks like a LaTeX symbol, it’s not, but is most easily produced using a rather dodgy LaTeX hack. Instead of using that hack over and over again, I’ve adopted a more standard LaTeX notation.

While I’m making asides, let me make a second: when I was first learning this material, I found the “d” for “directional” in d-separation and d-connected rather confusing. It suggested to me that the key thing was having a directed path from one vertex to the other, and that the complexities of colliders, forks, and so on were a sideshow. Of course, they’re not, they’re central to the whole discussion. For this reason, when I was writing these notes I considered changing the terminology to i-separated and i-connected, for informationally-separated and informationally-connected. Ultimately I decided not to do this, but I thought mentioning the issue might be helpful, in part to reassure readers (like me) who thought the “d” seemed a little mysterious.

Okay, that’s enough asides, let’s get back to the main track of discussion.

We saw earlier that (unconditional) d-separation is closely connected to the independence of random variables. It probably won’t surprise you to learn that conditional d-separation is closely connected to conditional independence of random variables. Recall that two sets of random variables X and Y are conditionally independent, given a third set of random variables Z, if p(x,y|z) = p(x|z)p(y|z). The following theorem shows that d-separation gives a criterion for when conditional independence occurs in a causal model:

Theorem (graphical criterion for conditional independence): Let G be a graph, and let X, Y and Z be disjoint subsets of vertices in that graph. Then X and Y are d-separated, given Z, if and only if for all causal models on G the random variables corresponding to X and Y are conditionally independent, given Z.

(Update: Thanks to Rob Spekkens for pointing out an error in my original statement of this theorem.)

I won’t prove the theorem here. However, it’s not especially difficult if you’ve followed the discussion above, and is a good problem to work through:

Problems

  • Prove the above theorem.

Problems for the author

  • The concept of d-separation plays a central role in the causal calculus. My sense is that it should be possible to find a cleaner and more intuitive definition that substantially simplifies many proofs. It’d be good to spend some time trying to find such a definition.

The causal calculus

We’ve now got all the concepts we need to state the rules of the causal calculus. There are three rules. The rules look complicated at first, although they’re easy to use once you get familiar with them. For this reason I’ll start by explaining the intuition behind the first rule, and how you should think about that rule. Having understood how to think about the first rule it’s easy to get the hang of all three rules, and so after that I’ll just outright state all three rules.

In what follows, we have a causal model on a graph G, and W, X, Y, Z are disjoint subsets of the variables in the causal model. Recall also that G_{\overline X} denotes the perturbed graph in which all edges pointing to X from the parents of X have been deleted. This is the graph which results when an experimenter intervenes to set the value of X, overriding other causal influences on X.

Rule 1: When can we ignore observations: I’ll begin by stating the first rule in all its glory, but don’t worry if you don’t immediately grok the whole rule. Instead, just take a look, and try to start getting your head around it. What we’ll do then is look at some simple special cases, which are easily understood, and gradually build up to an understanding of what the full rule is saying.

Okay, so here’s the first rule of the causal calculus. What it tells us is that when (Y \perp Z|W,X)_{G_{\overline X}}, then we can ignore the observation of Z in computing the probability of Y, conditional on both W and an intervention to set X:

 p(y|w,\mbox{do}(x),z) = p(y|w,\mbox{do}(x))

To understand why this rule is true, and what it means, let’s start with a much simpler case. Let’s look at what happens to the rule when there are no X or W variables in the mix. In this case, our starting assumption simply becomes that Y is d-separated from Z in the original (unperturbed) graph G. There’s no need to worry about G_{\overline X} because there’s no X variable whose value is being set by intervention. In this circumstance we have (Y \perp Z)_G, so Y is independent of Z. But the statement of the rule in this case is merely that p(y|z) = p(y), which is, indeed, equivalent to the standard definition of Y and Z being independent.

In other words, the first rule is simply a generalization of what it means for Y and Z to be independent. The full rule generalizes the notion of independence in two ways: (1) by adding in an extra variable W whose value has been determined by passive observation; and (2) by adding in an extra variable X whose value has been set by intervention. We’ll consider these two ways of generalizing separately in the next two paragraphs.

We begin with generalization (1), i.e., there is no X variable in the mix. In this case, our starting assumption becomes that Y is d-separated from Z, given W, in the graph G. By the graphical criterion for conditional independence discussed in the last section this means that Y is conditionally independent of Z, given W, and so p(y|z,w) = p(y|w), which is exactly the statement of the rule. And so the first rule can be viewed as a generalization of what it means for Y and Z to be independent, conditional on W.

Now let’s look at the other generalization, (2), in which we’ve added an extra variable X whose value has been set by intervention, and where there is no W variable in the mix. In this case, our starting assumption becomes that Y is d-separated from Z, given X, in the perturbed graph G_{\overline X}. In this case, the graphical criterion for conditional indepenence tells us that Y is independent from Z, conditional on the value of X being set by experimental intervention, and so p(y|\mbox{do}(x),z) = p(y|\mbox{do}(x)). Again, this is exactly the statement of the rule.

The full rule, of course, merely combines both these generalizations in the obvious way. It is really just an explicit statement of the content of the graphical criterion for conditional independence, in a context where W has been observed, and the value of X set by experimental intervention.

The rules of the causal calculus: All three rules of the causal calculus follow a similar template to the first rule: they provide ways of using facts about the causal structure (notably, d-separation) to make inferences about conditional causal probabilities. I’ll now state all three rules. The intuition behind rules 2 and 3 won’t necessarily be entirely obvious, but after our discussion of rule 1 the remaining rules should at least appear plausible and comprehensible. I’ll have bit more to say about intuition below.

As above, we have a causal model on a graph G, and W, X, Y, Z are disjoint subsets of the variables in the causal model. G_{\overline   X} denotes the perturbed graph in which all edges pointing to X from the parents of X have been deleted. G_{\underline X} denotes the graph in which all edges pointing out from X to the children of X have been deleted. We will also freely use notations like G_{\overline W, \overline X, \underline Z} to denote combinations of these operations.

Rule 1: When can we ignore observations: Suppose (Y \perp Z|W,X)_{G_{\overline X}}. Then:

   p(y|w,\mbox{do}(x),z) = p(y|w,\mbox{do}(x)).

Rule 2: When can we ignore the act of intervention: Suppose (Y \perp Z|W,X)_{G_{\overline X,\underline Z}}. Then:

 p(y|w,\mbox{do}(x),\mbox{do}(z)) = p(y|w,\mbox{do}(x),z).

Rule 3: When can we ignore an intervention variable entirely: Let Z(W) denote the set of nodes in Z which are not ancestors of W. Suppose (Y \perp Z|W,X)_{G_{\overline X, \overline{Z(W)}}}. Then:

 p(y|w,\mbox{do}(x),\mbox{do}(z)) = p(y|w,\mbox{do}(x)).

In a sense, all three rules are statements of conditional independence. The first rule tells us when we can ignore an observation. The second rule tells us when we can ignore the act of intervention (although that doesn’t necessarily mean we can ignore the value of the variable being intervened with). And the third rule tells us when we can ignore an intervention entirely, both the act of intervention, and the value of the variable being intervened with.

I won’t prove rule 2 or rule 3 – this post is already quite long enough. (If I ever significantly revise the post I may include the proofs). The important thing to take away from these rules is that they give us conditions on the structure of causal models so that we know when we can ignore observations, acts of intervention, or even entire variables that have been intervened with. This is obviously a powerful set of tools to be working with in manipulating conditional causal probabilities!

Indeed, according to Pearl there’s even a sense in which this set of rules is complete, meaning that using these rules you can identify all causal effects in a causal model. I haven’t yet understood the proof of this result, or even exactly what it means, but thought I’d mention it. The proof is in papers by Shpitser and Pearl and Huang and Valtorta. If you’d like to see the proofs of the rules of the calculus, you can either have a go at proving them yourself, or you can read the proof.

Problems for the author

  • Suppose the conditions of rules 1 and 2 hold. Can we deduce that the conditions of rule 3 also hold?

Using the causal calculus to analyse the smoking-lung cancer connection

We’ll now use the causal calculus to analyse the connection between smoking and lung cancer. Earlier, I introduced a simple causal model of this connection:

The great benefit of this model was that it included as special cases both the hypothesis that smoking causes cancer and the hypothesis that some hidden causal factor was responsible for both smoking and cancer.

It turns out, unfortunately, that the causal calculus doesn’t help us analyse this model. I’ll explain why that’s the case below. However, rather than worrying about this, at this stage it’s more instructive to work through an example showing how the causal calculus can be helpful in analysing a similar but slightly modified causal model. So although this modification looks a little mysterious at first, for now I hope you’ll be willing to accept it as given.

The way I’m going to modify the causal model is by introducing an extra variable, namely, whether someone has appreciable amounts of tar in their lungs or not:

(By tar, I don’t mean “tar” literally, but rather all the material deposits found as a result of smoking.)

This causal model is a plausible modification of the original causal model. It is at least plausible to suppose that smoking causes tar in the lungs and that those deposits in turn cause cancer. But if the hidden causal factor is genetic, as the tobacco companies argued was the case, then it seems highly unlikely that the genetic factor caused tar in the lungs, except by the indirect route of causing those people to smoke. (I’ll come back to what happens if you refuse to accept this line of reasoning. For now, just go with it.)

Our goal in this modified causal model is to compute probabilities like p(\mbox{cancer}|\mbox{do}(smoking)) = p(y| \mbox{do}(x)). What we’ll show is that the causal calculus lets us compute this probability entirely in terms of probabilities like p(y|z), p(z|y) and other probabilities that don’t involve an intervention, i.e., that don’t involve \mbox{do}.

This means that we can determine p(\mbox{cancer}|\mbox{do}(smoking)) without needing to know anything about the hidden factor. We won’t even need to know the nature of the hidden factor. It also means that we can determine p(\mbox{cancer}|\mbox{do}(smoking)) without needing to intervene to force someone to smoke or not smoke, i.e., to set the value for X.

In other words, the causal calculus lets us do something that seems almost miraculous: we can figure out the probability that someone would get cancer given that they are in the smoking group in a randomized controlled experiment, without needing to do the randomized controlled experiment. And this is true even though there may be a hidden causal factor underlying both smoking and cancer.

Okay, so how do we compute p(\mbox{cancer}|\mbox{do}(smoking)) = p(y| \mbox{do}(x))?

The obvious first question to ask is whether we can apply rule 2 or rule 3 directly to the conditional causal probability p(y|\mbox{do}(x)).

If rule 2 applies, for example, it would say that intervention doesn’t matter, and so p(y|\mbox{do}(x)) = p(y|x). Intuitively, this seems unlikely. We’d expect that intervention really can change the probability of cancer given smoking, because intervention would override the hidden causal factor.

If rule 3 applies, it would say that p(y|\mbox{do}(x)) = p(y), i.e., that an intervention to force someone to smoke has no impact on whether they get cancer. This seems even more unlikely than rule 2 applying.

However, as practice and a warm up, let’s work through the details of seeing whether rule 2 or rule 3 can be applied directly to p(y|\mbox{do}(x)).

For rule 2 to apply we need (Y\perp X)_{G_{\underline X}}. To check whether this is true, recall that G_{\underline X} is the graph with the edges pointing out from X deleted:

Obviously, Y is not d-separated from X in this graph, since X and Y have a common ancestor. This reflects the fact that the hidden causal factor indeed does influence both X and Y. So we can’t apply rule 2.

What about rule 3? For this to apply we’d need (Y \perp X)_{G_{\overline X}}. Recall that G_{\overline X} is the graph with the edges pointing toward X deleted:

Again, Y is not d-separated from X, in this case because we have an unblocked path directly from X to Y. This reflects our intuition that the value of X can influence Y, even when the value of X has been set by intervention. So we can’t apply rule 3.

Okay, so we can’t apply the rules of the causal calculus directly to determine p(y|\mbox{x}). Is there some indirect way we can determine this probability? An experienced probabilist would at this point instinctively wonder whether it would help to condition on the value of z, writing:

 [2] \,\,\,\, p(y| \mbox{do}(x)) = \sum_z p(y|z,\mbox{do}(x)) p(z|\mbox{do}(x)).

Of course, saying an experienced probabilist would instinctively do this isn’t quite the same as explaining why one should do this! However, it is at least a moderately obvious thing to do: the only extra information we potentially have in the problem is z, and so it’s certainly somewhat natural to try to introduce that variable into the problem. As we shall see, this turns out to be a wise thing to do.

Exercises

  • I used without proof the equation p(y| \mbox{do}(x)) = \sum_z   p(y|z,\mbox{do}(x)) p(z|\mbox{do}(x)). This should be intuitively plausible, but really requires proof. Prove that the equation is correct.

To simplify the right-hand side of equation [2], we first note that we can apply rule 2 to the second term on the right-hand side, obtaining p(z|\mbox{do}(x)) = p(z|x). To check this explicitly, note that the condition for rule 2 to apply is that (Z \perp X)_{G_{\underline     X}}. We already saw the graph G_{\underline X} above, and, indeed, Z is d-separated from X in that graph, since the only path from Z to X is blocked at Y. As a result, we have:

 [3] \,\,\,\, p(y| \mbox{do}(x)) = \sum_z p(y|z,\mbox{do}(x)) p(z|x).

At this point in the presentation, I’m going to speed the discussion up, telling you what rule of the calculus to apply at each step, but not going through the process of explicitly checking that the conditions of the rule hold. (If you’re doing a close read, you may wish to check the conditions, however.)

The next thing we do is to apply rule 2 to the first term on the right-hand side of equation [3], obtaining p(y|z,\mbox{do}(x)) = p(y|\mbox{do}(z),\mbox{do}(x)). We then apply rule 3 to remove the \mbox{do}(x), obtaining p(y|z,\mbox{do}(x)) = p(y|\mbox{do}(z)). Substituting back in gives us:

 [4] \,\,\,\, p(y| \mbox{do}(x)) = \sum_z p(y|\mbox{do}(z)) p(z|x).

So this means that we’ve reduced the computation of p(y|\mbox{do}(x)) to the computation of p(y|\mbox{do}(z)). This doesn’t seem terribly encouraging: we’ve merely substituted the computation of one causal conditional probability for another. Still, let us continue plugging away, and see if we can make progress. The obvious first thing to try is to apply rule 2 or rule 3 to simplify p(y|\mbox{do}(z). Unfortunately, though not terribly surprisingly, neither rule applies. So what do we do? Well, in a repeat of our strategy above, we again condition on the other variable we have available to us, in this case x:

 p(y|\mbox{do}(z)) = \sum_x p(y|x,\mbox{do}(z)) p(x|\mbox{do}(z)).

Now we’re cooking! Rule 2 lets us simplify the first term to p(y|x,z), while rule 3 lets us simplify the second term to p(x), and so we have p(y|\mbox{do}(z)) = \sum_x p(y|x,z) p(x). To substitute this expression back into equation [4] it helps to change the summation index from x to x', since otherwise we would have a duplicate summation index. This gives us:

  [5] \,\,\,\, p(y| \mbox{do}(x)) = \sum_{x'z} p(y|x',z) p(z|x) p(x') .

This is the promised expression for p(y|\mbox{do}(x)) (i.e., for probabilities like p(\mbox{cancer}| \mbox{do(smoking)}), assuming the causal model above) in terms of quantities which may be observed directly from experimental data, and which don’t require intervention to do a randomized, controlled experiment. Once p(\mbox{cancer}| \mbox{do(smoking)}) is determined, we can compare it against p(\mbox{cancer}). If p(\mbox{cancer}| \mbox{do(smoking)}) is larger than p(\mbox{cancer}) then we can conclude that smoking does, indeed, play a causal role in cancer.

Something that bugs me about the derivation of equation [5] is that I don’t really know how to “see through” the calculations. Yes, it all works out in the end, and it’s easy enough to follow along. Yet that’s not the same as having a deep understanding. Too many basic questions remain unanswered: Why did we have to condition as we did in the calculation? Was there some other way we could have proceeded? What would have happeed if we’d conditioned on the value of the hidden variable? (This is not obviously the wrong thing to do: maybe the hidden variable would ultimately drop out of the calculation). Why is it possible to compute causal probabilities in this model, but not (as we shall see) in the model without tar? Ideally, a deeper understanding would make the answers to some or all of these questions much more obvious.

Problems for the author

  • Why is it so much easier to compute p(y|\mbox{do}(z)) than p(y|\mbox{do}(x)) in the model above? Is there some way we could have seen that this would be the case, without needing to go through a detailed computation?

  • Suppose we have a causal model G, with S a subset of vertices for which all conditional probabilities are known. Is it possible to give a simple characterization of for which subsets X and Y of vertices it is possible to compute p(y|\mbox{do}(x)) using just the conditional probabilities from S?

Unfortunately, I don’t know what the experimentally observed probabilities are in the smoking-tar-cancer case. If anyone does, I’d be interested to know. In lieu of actual data, I’ll use some toy model data suggested by Pearl; the data is quite unrealistic, but nonetheless interesting as an illustration of the use of equation [5]. The toy model data is as follows:

(1) 47.5 percent of the population are nonsmokers with no tar in their lungs, and 10 percent of these get cancer.

(2) 2.5 percent are smokers with no tar, and 90 percent get cancer.

(3) 2.5 percent are nonsmokers with tar, and 5 percent get cancer.

(4) 47.5 percent are smokers with tar, and 85 percent get cancer.

In this case, we get:

   p(\mbox{cancer} | \mbox{do}(smoking)) = 45.25 \

By contrast, p(\mbox{cancer}) = 47.5 percent, and so if this data was correct (obviously it’s not even close) it would show that smoking actually somewhat reduces a person’s chance of getting lung cancer. This is despite the fact that p(\mbox{cancer} | \mbox{smoking}) = 85.25 percent, and so a naive approach to causality based on correlations alone would suggest that smoking causes cancer. In fact, in this imagined world smoking might actually be useable as a preventative treatment for cancer! Obviously this isn’t truly the case, but it does illustrate the power of this method of analysis.

Summing up the general lesson of the smoking-cancer example, suppose we have two competing hypotheses for the causal origin of some effect in a system, A causes C or B causes C, say. Then we should try to construct a realistic causal model which includes both hypotheses, and then use the causal calculus to attempt to distinguish the relative influence of the two causal factors, on the basis of experimentally accessible data.

Incidentally, the kind of analysis of smoking we did above obviously wasn’t done back in the 1960s. I don’t actually know how causality was established over the protestations that correlation doesn’t impy causation. But it’s not difficult to think of ways you might have come up with truly convincing evidence that smoking was a causal factor. One way would have been to look at the incidence of lung cancer in populations where smoking had only recently been introduced. Suppose, for example, that cigarettes had just been introduced into the (fictional) country of Nicotinia, and that this had been quickly followed by a rapid increase in rates of lung cancer. If this pattern was seen across many new markets then it would be very difficult to argue that lung cancer was being caused solely by some pre-existing factor in the population.

Exercises

  • Construct toy model data where smoking increases a person’s chance of getting lung cancer.

Let’s leave this model of smoking and lung cancer, and come back to our original model of smoking and lung cancer:

What would have happened if we’d tried to use the causal calculus to analyse this model? I won’t go through all the details, but you can easily check that whatever rule you try to apply you quickly run into a dead end. And so the causal calculus doesn’t seem to be any help in analysing this problem.

This example illustrates some of the limitations of the causal calculus. In order to compute p(\mbox{cancer}| \mbox{do}(smoking)) we needed to assume a causal model with a particular structure:

While this model is plausible, it is not beyond reproach. You could, for example, criticise it by saying that it is not the presence of tar deposits in the lungs that causes cancer, but maybe some other factor, perhaps something that is currently unknown. This might lead us to consider a causal model with a revised structure:

So we could try instead to use the causal calculus to analyse this new model. I haven’t gone through this exercise, but I strongly suspect that doing so we wouldn’t be able to use the rules of the causal calculus to compute the relevant probabilities. The intuition behind this suspicion is that we can imagine a world in which the tar may be a spurious side-effect of smoking that is in fact entirely unrelated to lung cancer. What causes lung cancer is really an entirely different mechanism, but we couldn’t distinguish the two from the statistics alone.

The point of this isn’t to say that the causal calculus is useless. It’s remarkable that we can plausibly get information about the outcome of a randomized controlled experiment without actually doing anything like that experiment. But there are limitations. To get that information we needed to make some presumptions about the causal structure in the system. Those presumptions are plausible, but not logically inevitable. If someone questions the presumptions then it may be necessary to revise the model, perhaps adopting a more sophisticated causal model. One can then use the causal calculus to attempt to analyse that more sophisticated model, but we are not guaranteed success. It would be interesting to understand systematically when this will be possible and when it will not be. The following problems start to get at some of the issues involved.

Problems for the author

  • Is it possible to make a more precise statement than “the causal calculus doesn’t seem to be any help” for the original smoking-cancer model?

  • Given a probability distribution over some random variables, it would be useful to have a classification theorem describing all the causal models in which those random variables could appear.
  • Extending the last problem, it’d be good to have an algorithm to answer questions like: in the space of all possible causal models consistent with a given set of observed probabilities, what can we say about the possible causal probabilities? It would also be useful to be able to input to the algorithm some constraints on the causal models, representing knowledge we’re already sure of.
  • In real-world experiments there are many practical issues that must be addressed to design a realiable randomized, controlled experiment. These issues include selection bias, blinding, and many others. There is an entire field of experimental design devoted to addressing such issues. By comparison, my description of causal inference ignores many of these practical issues. Can we integrate the best thinking on experimental design with ideas such as causal conditional probabilities and the causal calculus?
  • From a pedagogical point of view, I wonder if it might have been better to work fully through the smoking-cancer example before getting to the abstract statement of the rules of the causal calculus. Those rules can all be explained and motivated quite nicely in the context of the smoking-cancer example, and that may help in understanding.

Conclusion

I’ve described just a tiny fraction of the work on causality that is now going on. My impression as an admittedly non-expert outsider to the field is that this is an exceptionally fertile field which is developing rapidly and giving rise to many fascinating applications. Over the next few decades I expect the theory of causality will mature, and be integrated into the foundations of disciplines ranging from economics to medicine to social policy.

Causal discovery: One question I’d like to understand better is how to discover causal structures inside existing data sets. After all, human beings do a pretty good (though far from perfect) job at figuring out causal models from their observation of the world. I’d like to better understand how to use computers to automatically discover such causal models. I understand that there is already quite a literature on the automated discovery of causal models, but I haven’t yet looked in much depth at that literature. I may come back to it in a future post.

I’m particularly fascinated by the idea of extracting causal models from very large unstructured data sets. The KnowItAll group at the University of Washington (see Oren Etzioni on Google Plus) have done fascinating work on a related but (probably) easier problem, the problem of open information extraction. This means taking an unstructured information source (like the web), and using it to extract facts about the real world. For instance, using the web one would like computers to be able to learn facts like “Barack Obama is President of the United States”, without needing a human to feed it that information. One of the things that makes this task challenging is all the misleading and difficult-to-understand information out on the web. For instance, there are also webpages saying “George Bush is President of the United States”, which was probably true at the time the pages were written, but which is now misleading. We can find webpages which state things like “[Let's imagine] Steve Jobs is President of the United States“; it’s a difficult task for an unsupervised algorithm to figure out how to interpret that “Let’s imagine”. What the KnowItAll team have done is made progress on figuring out how to learn facts in such a rich but uncontrolled environment.

What I’m wondering is whether such techniques can be adapted to extract causal models from data? It’d be fascinating if so, because of course humans don’t just reason with facts, they also reason with (informal) causal models that relate those facts. Perhaps causal models or a similar concept may be a good way of representing some crucial part of our knowledge of the world.

Problems for the author

  • What systematic causal fallacies do human beings suffer from? We certainly often make mistakes in the causal models we extract from our observations of the world – one example is that we often do assume that correlation implies causation, even when that’s not true – and it’d be nice to understand what systematic biases we have.

  • Humans aren’t just good with facts and causal models. We’re also really good at juggling multiple causal models, testing them against one another, finding problems and inconsistencies, and making adjustments and integrating the results of those models, even when the results conflict. In essence, we have a (working, imperfect) theory of how to deal with causal models. Can we teach machines to do this kind of integration of causal models?
  • We know that in our world the sun rising causes the rooster to crow, but it’s possible to imagine a world in which it is the rooster crowing that causes the sun to rise. This could be achieved in a suitably designed virtual world, for example. The reason we believe the first model is correct in our world is not intrinsic to the data we have on roosters and sunrise, but rather depends on a much more complex network of background knowledge. For instance, given what we know about roosters and the sun we can easily come up with plausible causal mechanisms (solar photons impinging on the rooster’s eye, say) by which the sun could cause the rooster to crow. There do not seem to be any similarly plausible causal models in the other direction. How do we determine what makes a particular causal model plausible or not? How do we determine the class of plausible causal models for a given phenomenon? Can we make this kind of judgement automatically? (This is all closely related to the last problem).

Continuous-time causality: A peculiarity in my post is that even though we’re talking about causality, and time is presumably important, I’ve avoided any explicit mention of time. Of course, it’s implicitly there: if I’d been a little more precise in specifying my models they’d no doubt be conditioned on events like “smoked at least a pack a day for 10 or more years”. Of course, this way of putting time into the picture is rather coarse-grained. In a lot of practical situations we’re interested in understanding causality in a much more temporally fine-grained way. To explain what I mean, consider a simple model of the relationship between what we eat and our insulin levels:

This model represents the fact that what we eat determines our insulin levels, and our insulin levels in turn play a part in determining how hungry we feel, and thus what we eat. But as a model, it’s quite inadequate. In fact, there’s a much more complex feedback relationship going on, a constant back-and-forth between what we eat at any given time, and our insulin levels. Ideally, this wouldn’t be represented by a few discrete events, but rather by a causal model that reflects the continual feedback between these possibilities. What I’d like to see developed is a theory of continuous-time causal models, which can address this sort of issue. It would also be useful to extend the calculus to continuous spaces of events. So far as I know, at present the causal calculus doesn’t work with these kinds of ideas.

Problems for the author

  • Can we formulate theories like electromagnetism, general relativity and quantum mechanics within the framework of the causal calculus (or some generalization)? Do we learn anything by doing so?

Other notions of causality: A point I’ve glossed over in the post is how the notion of causal influence we’ve been studying relates to other notions of causality.

The notion we’ve been exploring is based on the notion of causality that is established by a (hopefully well-designed!) randomized controlled experiment. To understand what that means, think of what it would mean if we used such an experiment to establish that smoking does, indeed, cause cancer. All this means is that in the population being studied, forcing someone to smoke will increase their chance of getting cancer.

Now, for the practical matter of setting public health policy, that’s obviously a pretty important notion of causality. But nothing says that we won’t tomorrow discover some population of people where no such causal influence is found. Or perhaps we’ll find a population where smoking actively helps prevent cancer. Both these are entirely possible.

What’s going on is that while our notion of causality is useful for some purposes, it doesn’t necessarily say anything about the details of an underlying causal mechanism, and it doesn’t tell us how the results will apply to other populations. In other words, while it’s a useful and important notion of causality, it’s not the only way of thinking about causality. Something I’d like to do is to understand better what other notions of causality are useful, and how the intervention-based approach we’ve been exploring relates to those other approaches.

Acknowledgments

Thanks to Jen Dodd, Rob Dodd, and Rob Spekkens for many discussions about causality. Especial thanks to Rob Spekkens for pointing me toward the epilogue of Pearl’s book, which is what got me hooked on causality!

Principal sources and further reading

A readable and stimulating overview of causal inference is the epilogue to Judea Pearl’s book. The epilogue, in turn, is based on a survey lecture by Pearl on causal inference. I highly recommend getting a hold of the book and reading the epilogue; if you cannot do that, I suggest looking over the survey lecture. A draft copy of the first edition of the entire book is available on Pearl’s website. Unfortunately, the draft does not include the full text of the epilogue, only the survey lecture. The lecture is still good, though, so you should look at it if you don’t have access to the full text of the epilogue. I’ve also been told good things about the book on causality by Spirtes, Glymour and Scheines, but haven’t yet had a chance to have a close look at it. An unfortunate aspect of the current post is that it gives the impression that the theory of causal inference is entirely Judea Pearl’s creation. Of course that’s far from the case, a fact which is quite evident from both Pearl’s book, and the Spirtes-Glymour-Scheines book. However, the particular facets I’ve chosen to focus on are due principally to Pearl and his collaborators: most of the current post is based on chapter 3 and chapter 1 of Pearl’s book, as well as a 1994 paper by Pearl, which established many of the key ideas of the causal calculus. Finally, for an enjoyable and informative discussion of some of the challenges involved in understanding causal inference I recommend Jonah Lehrer’s recent article in Wired.

Interested in more? Please follow me on Twitter. You may also enjoy reading my new book about open science, Reinventing Discovery.