
Evaluating algorithms’ quality

As part of a “creativity dojo” we’ve had at work, I finally got to implement something I’ve long felt was needed in our QA – a framework for evaluating algorithms’ quality.

Living on the seam between algorithm development and product management in the past few years, I’ve come to appreciate the need to evaluate not just that it works, but that it works well. A search engine may return results that contain the keywords, but are these the most relevant ones? A recommendation algorithm may return products that are related to the user in some way, but can they be considered “good” recommendations?

During my master’s studies I came to know the work done over at TREC, and was fascinated by the strong emphasis on what we developers often skim over – evaluating results’ quality statistically, and moreover analyzing the evaluation method itself to ensure that it is sound. So with that approach in mind, I teamed up with our talented QA team to create a working framework in two days. Here are some lessons and tips learned along the way that could be useful for others trying to achieve a similar feat:

  1. Create a generic tool. TREC is mostly about search; however, with some imagination, most AI algorithms can be reduced to similar building blocks. Search, recommendation, classification – all can eventually be reduced to taking an input and returning a ranked list of results, to which the same quality metrics can be applied. Code-wise, we used a generic scoring class, with a wrapping interface that has different implementations for different algos to provide the varying context (see the first sketch after this list).
  2. Use large data. This may sound trivial in the academic world, but when you’re in a QA state of mind, you sometimes tend to get used to creating small worlds that are easy to control. Not here. It’s very important to simulate real-life user scenarios by using data that’s similar to production, so we used our integration environment, which replicates production data.
  3. Facilitate judging. Obtaining relevance judgments is crucial to getting useful tests. The customer here is a business owner / product manager, who may not appreciate the tedious task of rating results. We created a browser plugin that allows rating from within the actual results page, and accumulates those ratings in a per-test relevance file.
  4. Measure test staleness. The downside of using non-controlled data is that it can pull the rug out from under your feet: data may change over time and your test may become less relevant. We used Buckley’s Binary Preference (bPref) measure, which functions well with incomplete judgments (see the second sketch below), and also introduced a weighted measure of how many unjudged results are found, to trigger a test failure when results become too unreliable (requiring another judging round).
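To make the “generic tool” point concrete, here is a minimal sketch of what such a layer could look like. The names (AlgorithmAdapter, QualityScorer) are made up for illustration – this is not our actual code, just the shape of the idea:

```python
from abc import ABC, abstractmethod


class AlgorithmAdapter(ABC):
    """Hypothetical wrapper: adapts a specific algorithm (search,
    recommendation, classification...) to one common contract --
    an input in, a ranked list of result ids out."""

    @abstractmethod
    def run(self, query):
        """Return a ranked list of result identifiers for the input."""


class QualityScorer:
    """Generic scorer: works against any AlgorithmAdapter plus a set of
    relevance judgments, without knowing which algorithm produced
    the ranking."""

    def __init__(self, adapter, judgments):
        self.adapter = adapter
        self.judgments = judgments  # {query: {result_id: 1 or 0}}

    def score(self, query, metric):
        ranking = self.adapter.run(query)
        return metric(ranking, self.judgments.get(query, {}))
```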
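And for the staleness point, here is a rough sketch of bPref itself, following one common formulation (with min(R, N) as the normalizer); the exact variant we used may differ slightly. Note how documents with no judgment at all are simply skipped – which is exactly what makes the measure tolerant of incomplete judging:

```python
def bpref(ranking, judgments):
    """Binary Preference: judgments maps doc_id -> 1 (relevant) or
    0 (nonrelevant); documents with no judgment are ignored."""
    relevant = {d for d, j in judgments.items() if j == 1}
    nonrelevant = {d for d, j in judgments.items() if j == 0}
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    denom = min(R, N)

    score, nonrel_above = 0.0, 0
    for doc in ranking:
        if doc in nonrelevant:
            nonrel_above += 1
        elif doc in relevant:
            # Each retrieved relevant doc is penalized by the fraction of
            # judged nonrelevant docs ranked above it (capped at denom).
            penalty = min(nonrel_above, denom) / denom if denom else 0.0
            score += 1.0 - penalty
    return score / R
```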

Evaluating Search Engine Relevance

Web search engines must be the most useful tools the Web has brought us. We can answer difficult questions in seconds, find obscure pieces of information and stop bothering about organizing data. You would expect that systems with such an impact on our lives would be measured, evaluated and compared, so that we can make an informed decision on which one to choose. Nope, nothing there.

Some years ago, search engines competed on size. Danny Sullivan wrote angry pieces on that, and eventually they stopped, but still, six months ago Cuil launched and made a fool of itself by boasting about size again (BTW – Cuil is still alive, but my blog is not indexed, so not much to boast about on the coverage front).

Now, academic research on search (Information Retrieval, or IR in academic jargon) does have a very long and comprehensive tradition of relevance evaluation methodologies, TREC being the best example. IR systems are evaluated, analyzed, and compared across standard benchmarks, and TREC researchers carry out excellent research into the reliability and soundness of these benchmarks. So why isn’t this applied to evaluating web search engines?

One of the major problems is, yes, size. Many of the challenges TREC organizers face involve scaling the evaluation methods and measurements to web scale. One serious obstacle was the evaluation measure itself. Most IR research uses Mean Average Precision (MAP), which has proved to be a very reliable and useful measure, but it requires knowing things you simply can’t know on the web, such as the total number of relevant documents for the evaluated query. Moreover, since it is not grounded in any explicit use case, there was no indication that it indeed measures true search user satisfaction.
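To see where the unknown sneaks in, here is a minimal sketch of Average Precision for a single query (MAP is just its mean over a set of queries). Note the division by the total number of relevant documents in the collection – exactly the quantity nobody can know for a web-scale query:

```python
def average_precision(ranking, relevant, total_relevant):
    """ranking: ranked list of doc ids; relevant: set of judged-relevant
    ids; total_relevant: R, the collection-wide number of relevant
    documents -- the unknowable quantity at web scale."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / total_relevant if total_relevant else 0.0
```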

Luckily, the latest volume of the TOIS journal (Transactions on Information Systems) included a paper that could change that picture. Justin Zobel and Alistair Moffat, two key Australian figures in IR and IR evaluation, with Zobel a veteran of TREC methodology analysis, suggest a new measure called “Rank-Biased Precision” (RBP). In their words, the model goes as follows:

The user has no desire to examine every answer. Instead, our suggestion is that they progress from one document in the ranked list to the next with persistence (or probability) p, and, conversely, end their examination of the ranking at that point with probability 1 − p… That is, we assume that the user always looks at the first document, looks at the second with probability p, at the third with probability p^2, and at the ith with probability p^(i−1). Figure 3 shows this model as a state machine, where the labels on the edges represent the probability of changing state.

[Figure: the user model assumed by rank-biased precision]

They then go on to show that the RBP measure, derived from this user model, does not depend on any unknowns, behaves well under real-life uncertainties (e.g. unjudged documents, or queries with no relevant documents at all), and is comparable to previous measures in its ability to show statistically significant differences between systems.
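For reference, the measure that falls out of this model is, as I read the paper, simply RBP = (1 − p) · Σ r_i · p^(i−1), where r_i is the relevance of the document at rank i; the (1 − p) factor normalizes by the expected number of documents the user examines, 1/(1 − p), so the score stays in [0, 1]. A few lines of Python show how little it needs to know about the collection:

```python
def rbp(relevances, p=0.8):
    """Rank-Biased Precision: relevances[i] is the relevance (0/1, or a
    graded value in [0, 1]) of the document at rank i + 1; p is the
    user's persistence. Unjudged documents at the tail can simply be
    left out -- they only widen the residual error bound, they never
    break the computation."""
    return (1.0 - p) * sum(r * p ** i for i, r in enumerate(relevances))
```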

Ultimately, beyond presenting an interesting web search user model, RBP also eliminates one more obstacle to a true comparison of search engine relevance. The sad reality, though, is that with the current poor state of Yahoo’s and Live’s result relevance, such a comparison may not show us anything new; but an objective, visible measurement could at least provide an incentive for measurable improvements on their part. Of course, then we’ll get to the other major issue, of what constitutes a relevant result…

Update: I gave a talk on RBP in my research group, slides are here.