Web search engines must be the most useful tools the Web has brought us. We can answer difficult questions in seconds, find obscure pieces of information, and stop bothering about organizing data. You would expect that systems with such an impact on our lives would be measured, evaluated and compared, so that we can make an informed decision on which one to choose. Nope, nothing there.
Some years ago, search engines competed on index size. Danny Sullivan wrote angry pieces about it, and eventually they stopped, yet just six months ago Cuil launched and made a fool of itself by boasting about size again (BTW – Cuil is still alive, but my blog is not indexed, so not much to boast about coverage there).
Now, academic research on search (Information Retrieval, or IR in academic jargon) does have a very long and comprehensive tradition of relevance evaluation methodologies, TREC being the best example. IR systems are evaluated, analyzed, and compared across standard benchmarks, and TREC researchers carry out excellent research into the reliability and soundness of these benchmarks. So why isn’t this applied to evaluating web search engines?
One of the major problems is, yes, size. Many of the challenges the TREC organizers face involve scaling the evaluation methods and measurements to web scale. One serious obstacle was the evaluation measure itself. Most IR research uses Mean Average Precision (MAP), which has proved to be a reliable and useful measure, but it requires knowing things you just can't know on the web, such as the total number of relevant documents for the evaluated query. Moreover, since MAP is not grounded in any explicit model of how users actually behave, there was no indication that it measures true search user satisfaction.
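To make that dependence concrete, here is a rough sketch of Average Precision for a single query (MAP is just the mean of this over a set of queries). The function name and the binary-relevance representation are my own illustration, not from any particular toolkit; the point is the denominator, which needs the total number of relevant documents in the whole collection.

```python
def average_precision(ranking, total_relevant):
    """Average Precision for one query, given a ranked list of
    binary relevance judgments (1 = relevant, 0 = not).
    total_relevant is the number of relevant documents in the
    *entire collection* -- the quantity you cannot know at web scale."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / total_relevant if total_relevant else 0.0
```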
Luckily, the latest volume of the TOIS journal (Transactions on Information Systems) includes a paper that could change that picture. Justin Zobel and Alistair Moffat, two key Australian figures in IR and IR evaluation, with Zobel a veteran of TREC methodology analysis, suggest a new measure called “Rank-Biased Precision” (RBP). In their words, the model goes as follows:
The user has no desire to examine every answer. Instead, our suggestion is that they progress from one document in the ranked list to the next with persistence (or probability) p, and, conversely, end their examination of the ranking at that point with probability 1 − p… That is, we assume that the user always looks at the first document, looks at the second with probability p, at the third with probability p², and at the ith with probability p^(i−1). Figure 3 shows this model as a state machine, where the labels on the edges represent the probability of changing state.
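So the probability of the user reaching rank i decays geometrically, and RBP is just the expected rate at which such a user encounters relevant documents: RBP = (1 − p) · Σ rᵢ · p^(i−1). A minimal sketch, as I understand it from the paper; the variable names are mine, and rᵢ can be binary or graded relevance in [0, 1]:

```python
def rbp(ranking, p=0.95):
    """Rank-Biased Precision under the geometric user model:
    the user views rank i with probability p**(i-1), so RBP is
    (1 - p) * sum of r_i * p**(i-1) over the ranked list.
    ranking holds relevance values r_i in [0, 1]."""
    return (1 - p) * sum(r * p ** i for i, r in enumerate(ranking))

# Example: a run where the 1st, 3rd and 4th results are relevant.
print(rbp([1, 0, 1, 1, 0], p=0.5))  # 0.5 * (1 + 0.25 + 0.125) = 0.6875
```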
They then go on to show that the RBP measure, derived from this user model, does not depend on any unknowns, behaves well with real-life uncertainties (e.g. unjudged documents, or queries with no relevant documents at all), and is comparable to previous measures in showing statistically significant differences between systems.
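One way to see why unjudged documents aren't fatal: because the weights decay geometrically, everything below the judged depth d can contribute at most p^d to the score. That residual bound is how I read the paper; the code itself is my own sketch.

```python
def rbp_with_residual(judged_prefix, p=0.95):
    """RBP computed from the top-d judged documents, plus the residual:
    the geometric tail (1 - p) * sum over ranks > d of p**(i-1) = p**d,
    i.e. the most that unjudged documents could still add to the score."""
    d = len(judged_prefix)
    base = (1 - p) * sum(r * p ** i for i, r in enumerate(judged_prefix))
    return base, p ** d  # (lower bound, maximum possible addition)

# With p = 0.5 and only five judged results, the score is already
# pinned down to within 0.5**5 = 0.03125.
print(rbp_with_residual([1, 0, 1, 1, 0], p=0.5))
```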
Ultimately, beyond presenting an interesting web search user model, RBP removes one more obstacle to a true comparison of search engine relevance. The sad reality, though, is that given the current poor state of Yahoo’s and Live’s results relevance, such a comparison may not show us anything new; but an objective, visible measurement could at least provide an incentive for measurable improvement on their part. Of course, then we’ll get to the other major issue, of what constitutes a relevant result…
Update: I gave a talk on RBP in my research group, slides are here.