A problem with the standard importance function? Trading off query terms against one another
Working notes ahead! This post is different to my last two posts. Those posts were broad reviews of topics of general interest (at least if you’re interested in data-driven intelligence) – the Pregel graph framework, and the vector space model of documents. This post is not a review or distillation of a topic in the style of those posts. Rather, it’s my personal working notes – me, thinking out loud – as I try to understand a particular question in some depth. You can think of it as a variation on open notebook science. As such, the notes go into much more detail than previous posts, as I try to wrap my head around all the details of the problem at hand. In writing this blog I expect to alternate between these two formats (working notes and broader reviews). When it’s my personal working notes, I’ll make that clear in the first paragraph or so, so people who just want carefully distilled highlights can skip the more specialized posts.
With that out of the way, let’s get on with the main problem. In the last post I described a way of measuring how relevant a document is to a given search query. That notion of relevance was based on the cosine similarity (essentially, the angle) between the document vector and the query vector. Those vectors, in turn, were defined based on a notion of the importance of a term in a document . In the post I used what I called the standard importance function (I’ll remind you how that’s defined in a moment). But although the standard importance function is widely used, that doesn’t mean it’s without problems, and in this post I describe a problem caused by the standard importance function.
The strategy in the post is to first describe a specific example that illustrates the problem with the standard importance function. The example is somewhat artificial, but, once we understand it, we’ll see that it’s part of a broader problem that is very real, and not at all artificial.
To describe the example, let’s suppose our search query contains two terms, “hobbit baggins”. We can represent this by a query vector whose components include a term for “hobbit” and a term for “baggins”. Recall from the last post that, if we’re using the standard importance function, then those components are given by the inverse document frequency for each term, . Here, logarithms are taken to base two, is the total number of documents in the search corpus, is the term (“hobbit” and “baggins” being the relevant terms in this example), and is the document frequency for term , i.e., the number of documents in the corpus in which occurs.
Now, both “hobbit” and “baggins” are likely to be pretty unusual terms in the corpus, and so the inverse document frequency will be pretty large. Suppose, for example, that approximately one in a thousand documents contains “hobbit”, and similarly for “baggins”. Then the inverse document frequency will be . So to a good approximation the query vector will be:
where I’ve written the query vector so the first component is the component for “hobbit”, the second component is the component for “baggins”, and the other components are for other terms – all those components are in this case, since those other terms don’t appear in the query. I’ve also written the query vector as a transposed row vector, because row vectors display better on the blog.
Now suppose we have a document that contains ten occurrences of “hobbit”, and ten occurrences of “baggins”. Using the standard importance function, its document vector will have the form
Here, I’ve used the fact that the component of in the “direction” of term is given by the standard importance function:
where is the term frequency of in document . I’ve also used to denote the document vector for the remainder of the document, i.e., for all the terms not involving “hobbit” or “baggins”.
Suppose, now, that we change the document, so that for some reason we replace all occurrences of “baggins” by “hobbit”, so there are now occurrences of “hobbit”, and none of “baggins”. Intuitively, it seems to me that the resulting modified document is much less relevant to the query “baggins hobbit”. The reason is that although it contains twice as many occurrences of “hobbit”, those extra occurrences don’t compensate for the complete absence of the term “baggins”. Put another way, I think a small decrease in the frequency of “baggins” should only be compensated for by a much larger increase in the frequency of “hobbit”. As we’ll see, though, the cosine similarity doesn’t reflect this intuition very well.
To compute the cosine similarity, note that the document vector for the modified document is:
Recall also that the cosine similarity for an arbitrary document is defined to be , where I use the convention that . In principle this notation could cause confusion, since we now use (or ) to refer to the query (or document), as well as the length of the corresponding query (or document) vector. In practice it should be clear from context which is meant.
We get a hint of the problem with the standard importance function when we observe that . I.e., changing all occurrences of “hobbit” to “baggins” has no effect at all on the inner product, simply moving weight in the inner product from one term to the other. And so the only difference in the cosine similarity is due to the change in length between and . One way of expressing this is to observe that:
Note furthermore that
and so we have:
The good news is that this gives us the right ordering – the document is more parallel to than is , since , and so . That's what we hoped for, and so at least the ordering of relevance will be correct.
The bad news is that, for complex documents, the difference in cosine similarity is tiny. The reason is that for a long, complex document containing many unusual terms ("Gandalf", "Galadriel", "Saruman"), the value of may be enormous - let us say, conservatively, billions. In this approximation, we see that . By itself this isn't a problem. But it means that the ordering of and is enormously sensitive to changes in . If, when we modified the first document, we changed not just "hobbit" to "baggins", but also other terms, then that might cause the value of to decrease. If it decreased by more than - easy to imagine, with just a few minor changes - then it would reverse the ordering of the two terms. And that doesn't seem like such a good thing.
The example I have given may be artificial, but the broader phenomenon is not. More broadly, if we use the standard importance function when computing cosine similarity, then it becomes much too easy to trade off query terms in a document. For two-term queries, , an increasing number of occurrences of can be traded off almost exactly by a corresponding decrease in the number of occurrences of . So, for example, doubling the number of occurrences of allows us to (approximately) completely wipe out the occurrences of , while making only a tiny difference in cosine similarity.
In broad terms it's pretty clear how to address this problem - change the scoring function so that we can't trade off query terms in this way. It's also easy to see how we might go about changing the scoring function to do that. In fact, it's a bit too easy - there are so many ways we could go about doing it that it's not at all clear which will give the best results. Here's one trivial way: redefine the importance function to be:
That is, modify the importance function so it has the square root of the term frequency in it rather than the term frequency. If we do this for the “hobbit”-”baggins” example, then increasing the frequency of “hobbit” to occurrences (from ) will require there to be no fewer than occurrences of “baggins” for the score to remain as high (assuming the rest of the document is complex, i.e., ). This matches our intuition much better. Unfortunately, it’s by no means clear that it preserves whatever other properties of cosine similarity made it a good measure of relevance in the first place. That’s a question that most likely would require empirical user testing to resolve.
To wrap up, my takeaways from this are that: (1) the standard importance function makes it too easy to trade off query terms against one another in a document; (2) it’s easy to make ad hoc modifications to the standard importance function that remove this problem; but (3) it’s not clear which of those modifications would preserve the other qualities we want. The obvious thing to do is some empirical testing to see how satisfied users are with different measures. It’d also be good to think more deeply about the foundation of the problem: how should we define the relevance of a document to a given user query?