Tag Archives: Research

Searching for Faceted Search

Just finished reading Daniel Tunkelang’s recently published book on Faceted Search. I read Daniel’s blog (“The Noisy Channel”) regularly, and enjoy its mix of IR practice with an emphasis on Human-Computer Interaction (HCI). With faceted search tasks on the roadmap at work, I wanted to better educate myself on the topic, and this one looked like a good read, with the cover promising:

“… a self-contained treatment of the topic, with an extensive bibliography for those who would like to pursue particular aspects in more depth”

At 70 pages, the book reads quickly and smoothly. Daniel provides a fascinating intro to faceted search, from early taxonomies, to facets, to faceted navigation and on to faceted search. He adds an introductory chapter on IR, which offers some interesting insights and is a worthwhile read even for IR professionals. One is how ranked retrieval, which we have all grown so accustomed to, blurred the once-clear border between relevant and non-relevant that set retrieval enforced. Daniel suggests that this issue is significant for faceted search, which is a set-retrieval-oriented task, and a pingback on his blog led me to a fascinating elaboration on this pain point in another fine search blog (recommended read!).

With such elaborate introductory chapters, plus more on the history of faceted search, not much room is left for the actual chapters on research and practice, and as a reader I felt there could have been a lot more there. Then again, it is reasonable to just give a taste of the challenges and leave the rest for the curious reader to explore through the bibliography.

However, that promise of an extensive bibliography somewhat disappointed me… With 119 references, only about a quarter of them academic publications from the past 5 years, I felt a bit back at square one. I was hoping for more of a literature survey, with pointers for the tough issues, such as how to choose the most informative facets for a given query or how to extract facets from unstructured fields. Daniel provides some useful tips on those, but reading more on these topics will require a literature scan of my own.

In any case, for a newcomer with little background in search in general and faceted search in particular, this book is an excellent introduction. Those more versed in classic IR who are moving into faceted search will find the book an interesting read, but probably not sufficient as a full reference.

Friendly advice from your “Social Trust Graph”

While scanning the recent SIGIR 2009 for worthy Information Retrieval papers, I came across one titled “Learning to Recommend with Social Trust Ensemble”, by a team from The Chinese University of Hong Kong. This one is about recommender systems, but putting the social element into text analytics tasks always interests me.

The premise is an interesting one – using your network of trust to improve classic Collaborative Filtering (CF) recommendations. The authors begin by observing that a user’s decisions strike a balance between their own taste and the recommendations of their trusted friends.

Figure 1 from "Learning to Recommend with Social Trust Ensemble" by Ma et al.

They then propose a model that blends analysis of the classic user-item matrix where ratings of items by users are stored (the common tool of CF) with analysis of a “social trust graph” that links the user to other users, and through them to their opinions on the items.

This follows the intuition that when trying to draw a recommendation from the behavior of other users (which is basically what CF does), some users’ opinions may be more important than others’, whereas classic CF ignores that and treats all users as equally important.
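
To make that blending concrete, here is a minimal sketch of the general idea: a rating prediction that interpolates between the user’s own latent-factor estimate and a trust-weighted average of their friends’ estimates. This is my own simplification, not the paper’s exact model, and the factors, trust weights and names below are made up.

```python
import numpy as np

# Toy latent factors, as if learned by some matrix-factorization CF model
# (all values here are made up for illustration).
U = {"alice": np.array([0.9, 0.1]),   # user taste vectors
     "bob":   np.array([0.2, 0.8]),
     "carol": np.array([0.5, 0.5])}
V = {"camera": np.array([1.0, 0.3])}  # item factor vector

# Who trusts whom, with trust weights: the "social trust graph".
trust = {"alice": {"bob": 0.7, "carol": 0.3}}

def predict(user, item, alpha=0.6):
    """Blend the user's own taste with their trusted friends' tastes.

    alpha = 1.0 falls back to plain CF; alpha = 0.0 relies only on friends.
    """
    own = float(U[user] @ V[item])
    friends = trust.get(user, {})
    if not friends:
        return own
    total = sum(friends.values())
    social = sum(w * float(U[f] @ V[item]) for f, w in friends.items()) / total
    return alpha * own + (1 - alpha) * social

print(predict("alice", "camera"))   # 0.6 * own estimate + 0.4 * friends' estimate
```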

The authors show results that outperform classic CF on a dataset extracted from Epinions. That’s encouraging for any researcher interested in the contribution of the social signal to AI tasks.

free advice at renegade craft fair - CC Flickr/arimoore

However, some issues bother me with this research:

  1. Didn’t the Netflix Prize winning team’s approach (see previous post) “prove” that statistical analysis of the user-item matrix beats any external signal other teams tried to use? The answer here may be related to the sparseness of the Epinions data, which makes life very difficult for classic CF; movie ratings have much higher density than retail (Epinions’ domain).
  2. To evaluate, the authors sampled 80% or 90% of the ratings as training data and kept the rest for testing. But if the training data includes ratings made before a user started following someone, and the test data includes ratings made after, don’t you get cause and effect mixed up? If I follow someone and discover a product through his recommendation, there’s a high chance my opinion will also be influenced by his, so there’s no true independence between the training and test data…
  3. Ultimately, the paper shows that combining two good methods (the social trust graph and classic CF) outperforms each method alone. The general idea of fusion, or an ensemble of methods, is pretty much common knowledge for any Machine Learning researcher. The questions should have been (but weren’t): does this specific ensemble outperform any other ensemble, and does it fare better than the state-of-the-art result for the same dataset? (See the toy sketch after this list.)
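
To illustrate that last point: two methods whose errors are reasonably independent will almost always improve when blended, so the interesting comparison is against other blends and against the state of the art. A toy sketch with entirely synthetic predictions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Entirely synthetic held-out ratings and three imperfect predictors
# (stand-ins for classic CF, the trust-graph model, and some third method).
y_true = rng.uniform(1, 5, size=10_000)
pred_cf    = y_true + rng.normal(0, 0.9, size=y_true.size)
pred_trust = y_true + rng.normal(0, 0.9, size=y_true.size)
pred_other = y_true + rng.normal(0, 0.9, size=y_true.size)

def rmse(pred):
    return float(np.sqrt(np.mean((pred - y_true) ** 2)))

def blend(a, b, alpha=0.5):
    return alpha * a + (1 - alpha) * b

print("CF alone         :", rmse(pred_cf))
print("Trust alone      :", rmse(pred_trust))
print("CF + Trust blend :", rmse(blend(pred_cf, pred_trust)))
# The fair question: does CF + Trust beat *other* blends, not just its own parts?
print("CF + Other blend :", rmse(blend(pred_cf, pred_other)))
```

Since the errors here are independent by construction, both blends improve by roughly the same amount, which is exactly why “better than each component alone” is not the interesting comparison.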

The last point is of specific interest to me, having combined keyword-based retrieval with concept-based retrieval in my M.Sc. work. I could easily show that the resulting system outperformed each of its separate components, but to counter the questions above, I also tested combinations of other similarly high-performing methods, showing that the performance gain there was much lower, and showed that my combination could take a state-of-the-art result and improve on it further.

Nevertheless, the idea of using opinions from people you know and trust (rather than authorities) in ranking recommendations is surely one that will gain more popularity, as social players start pushing ways to monetize the graphs they worked so hard to build…

Semantic Search using Wikipedia

Today I gave my master’s thesis talk at the Technion as part of my master’s duties. Actually, the non-buzzword title is “Concept-Based Information Retrieval using Explicit Semantic Analysis”, but that’s not a very click-worthy post title :-)…

The whole thing went far better than I expected – the room was packed, the slides flew smoothly (and quickly too, luckily Shaul convinced me to add some spare slides just in case), and I ended up with over 10 minutes of Q&A (an excellent sign for a talk that went well…)

Click to view on Slideshare

BTW – does anybody have an idea how to embed SlideShare into a hosted blog? It doesn’t seem to work…

In Authority We Trust (Not)

Product reviews are a great thing.

Fake reviews suck.

In the most recent example, an employee solicited paid reviews of his company’s products on Amazon’s Mechanical Turk – you’ve got to appreciate the progress.

How can you tell which reviews to trust? Trust is built out of relationships. You trust a site, a person, a brand, after your interactions have accumulated enough positive history to earn that trust. With review sites, you may learn to trust a specific site, but that still doesn’t mean you trust a specific reviewer. I usually try to look at the reviewer’s history, and to look for their “human” side – spelling mistakes, topic changes, finding flaws and not just praising. But naturally, the adversary here is also informed, and will try to imitate these traits…

"Trust us, we're experts" by flickr/phauly

Review sites attempt to bestow trust of their own on their members, to assist us. Amazon uses badges, and encourages users to provide their real name, using a credit card as proof of identity. Midrag, an Israeli service-provider rating site I recently used, ties identity to a mobile phone, with a login token sent over SMS. But when you want to attract a large number of reviews, you want to allow unvalidated identities too. Epinions, for example, builds a “web of trust” model based on reviewers trusting or blocking other reviewers. But with Epinions (and similarly Amazon) keeping their trust calculation formula secret, how can users be convinced that this metric fits their needs?

In reality, my model of trust may be quite different from yours. Two Italian researchers published a paper at AAAI-05 titled “Controversial Users Demand Local Trust Metrics”, where they experimented with Epinions data on the task of predicting users’ trust scores based on existing trust statements. Their findings show that for some users, trust is not an average quantity but a very individual one, and therefore requires local methods.

Trust metrics can be classified into global and local ones (Massa & Avesani 2004; Ziegler & Lausen 2004). Local trust metrics take into account the subjective opinions of the active user when predicting the trust she places in unknown users. For this reason, the trust score of a certain user can be different when predicted from the point of view of different users. Instead, global trust metrics compute a trust score that approximates how much the community as a whole trusts a specific user.

Have you spotted a familiar pattern?… Just swap “trust” for “relevance”, and the paragraph suddenly describes authority-based search (PageRank) versus socially-connected search (Delver). Local metrics were found to be more effective for ranking controversial users, meaning users whose individual trust scores deviate strongly from their average score. The search equivalent would be queries for subjective information, where opinions vary and an authority score may not be the best choice for each individual searcher.
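
To make the global/local distinction concrete, here is a minimal sketch over a made-up trust graph. The “local” metric here is just decayed trust propagation from the active user’s point of view, loosely in the spirit of local metrics such as MoleTrust, not the paper’s exact algorithm.

```python
from collections import defaultdict

# Made-up directed trust statements: truster -> {trustee: score in [0, 1]}
trust = {
    "ann":  {"bob": 1.0, "dan": 0.2},
    "bob":  {"carl": 0.9},
    "carl": {"dan": 1.0},
    "eve":  {"dan": 0.1},
}

def global_trust(user):
    """Community-wide view: average of all trust statements about `user`."""
    scores = [s[user] for s in trust.values() if user in s]
    return sum(scores) / len(scores) if scores else None

def local_trust(source, target, decay=0.8, max_hops=3):
    """Trust in `target` as seen from `source`: best decayed path, BFS-style."""
    frontier = {source: 1.0}
    best = 0.0
    for _ in range(max_hops):
        nxt = defaultdict(float)
        for node, weight in frontier.items():
            for neigh, score in trust.get(node, {}).items():
                w = weight * decay * score
                nxt[neigh] = max(nxt[neigh], w)
                if neigh == target:
                    best = max(best, w)
        frontier = nxt
    return best

print("global trust in dan:", global_trust("dan"))        # one number for everyone
print("dan as seen by ann :", local_trust("ann", "dan"))  # depends on who is asking
print("dan as seen by eve :", local_trust("eve", "dan"))
```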

To read more about trust metrics, see here: trustlet.org

Evaluating Search Engine Relevance

Web search engines must be the most useful tools the Web has brought us. We can answer difficult questions in seconds, find obscure pieces of information and stop worrying about organizing data. You would expect that systems with such an impact on our lives would be measured, evaluated and compared, so that we can make an informed decision on which one to choose. Nope, nothing there.

Some years ago, search engines competed on size. Danny Sullivan wrote angry pieces about that, and eventually they stopped, yet just six months ago Cuil launched and made a fool of itself by boasting about size again (BTW – Cuil is still alive, but my blog is not indexed there, so not much to boast about on coverage).

Now, academic research on search (Information Retrieval, or IR in academic jargon) does have a very long and comprehensive tradition of relevance evaluation methodologies, TREC being the best example. IR systems are evaluated, analyzed, and compared across standard benchmarks, and TREC researchers carry out excellent research into the reliability and soundness of these benchmarks. So why isn’t this applied to evaluating web search engines?

One of the major problems is, yes, size. Many of the challenges TREC organizers face involve scaling the evaluation methods and measurements to web scale. One serious obstacle was the evaluation measure itself. Most IR research uses Mean Average Precision (MAP), which has proved to be a very reliable and useful measure, but it requires knowing things you just can’t know on the web, such as the total number of relevant documents for the evaluated query. Moreover, with no user model behind it, there was no indication that it actually measures true search user satisfaction.
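
For reference, here is a minimal sketch of Average Precision for a single query (MAP is simply its mean over a set of queries); note the denominator R, the total number of relevant documents, which is exactly the quantity you cannot realistically know at web scale. The judged ranking below is a toy example.

```python
def average_precision(ranking, total_relevant):
    """ranking: list of 0/1 relevance judgments, in rank order.
    total_relevant: R, the number of relevant documents in the whole collection."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank      # precision at this relevant document
    return precision_sum / total_relevant     # <- requires knowing R

print(average_precision([1, 0, 1, 0, 0, 1], total_relevant=10))
```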

Luckily, the latest volume of the TOIS journal (Transactions on Information Systems) includes a paper that could change that picture. Justin Zobel and Alistair Moffat, two key Australian figures in IR and IR evaluation, with Zobel a veteran of TREC methodology analysis, suggest a new measure called “Rank-Biased Precision” (RBP). In their words, the model goes as follows:

The user has no desire to examine every answer. Instead, our suggestion is that they progress from one document in the ranked list to the next with persistence (or probability) p, and, conversely, end their examination of the ranking at that point with probability 1 − p… That is, we assume that the user always looks at the first document, looks at the second with probability p, at the third with probability p², and at the ith with probability pⁱ⁻¹. Figure 3 shows this model as a state machine, where the labels on the edges represent the probability of changing state.

The user model assumed by rank-biased precision

They then go on to show that the RBP measure, derived from this user model, does not depend on any unknowns, behaves well under real-life uncertainties (e.g. unjudged documents, or queries with no relevant documents at all), and is comparable to previous measures in its ability to show statistically significant differences between systems.
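
The measure itself is trivial to compute. A minimal sketch, using the paper’s formulation RBP = (1 − p) · Σ rᵢ · pⁱ⁻¹ over binary relevance judgments (the judgments and p values below are toy examples):

```python
def rbp(ranking, p=0.8):
    """Rank-biased precision for a list of 0/1 relevance judgments.

    p is the persistence: the probability the user moves on to the next result.
    Documents never judged or never retrieved simply contribute nothing, which
    is why RBP does not need the total number of relevant documents.
    """
    return (1 - p) * sum(rel * p ** i for i, rel in enumerate(ranking))

print(rbp([1, 0, 1, 1, 0], p=0.8))   # patient user: deeper ranks still matter
print(rbp([1, 0, 1, 1, 0], p=0.5))   # impatient user: top ranks dominate
```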

Ultimately, beyond presenting an interesting web search user model, RBP also removes one more obstacle to a true comparison of search engine relevance. The sad reality, though, is that with the current poor state of Yahoo’s and Live’s result relevance, such a comparison may not show us anything new; but an objective, visible measurement could at least provide an incentive for measurable improvements on their part. Of course, then we’ll get to the other major issue, of what constitutes a relevant result…

Update: I gave a talk on RBP in my research group, slides are here.

IBM IR Seminar Highlights (part 2)

The seminar’s third highlight for me (in addition to IBM’s social software and Mor’s talk) was the keynote speech by Human-Computer Interaction (HCI) veteran Professor Ben Shneiderman of UMD. Ben’s presentation was quite an experience, not in a sophisticated Lessig way (which Dick Hardt adopted so well for Identity 2.0), but through the sheer amount of positive energy and passion streaming out of this 60-year-old.

[Warning – this post turned out longer and heavier than I thought…]

Ben Shneiderman in front of Usenet Treemap - flickr/Marc_Smith

Ben is one of the founding fathers of HCI, and the main part of his talk focused on how visualization tools can serve as enhancers of human analysis, just as the web as a tool enhances our access to information.

He presented tools such as ManyEyes (IBM’s), Spotfire (his own high-tech exit), Treemap (with many examples of trend and outlier spotting) and others. The main point was what the human eye can do with those tools that no predefined automated analysis can, especially in fields such as Genomics and Finance.

The talk then moved to how to put such an approach to work in search, which, like those tools, is also a power multiplier for humans. Ben described today’s search technology as adequate mainly for “known-item finding”. The more difficult tasks that today’s search can’t answer well are usually those that are not a “one-minute job”, such as:

  • Comprehensive search (e.g. Legal or Patent search)
  • Proving negation (Patent search)
  • Finding exceptions (outliers)
  • Finding bridges (connecting two subsets)

The clusters of current and suggested strategies to address such tasks are:

  • Enriching query formulation – non-textual, structured queries, results preview, limiting of result type…
  • Expanding result management – better snippets, clustering, visualization, summarization…
  • Enabling long-term effort – saving/bookmarking, annotation, notebooking/history-keeping, comparing…
  • Enhancing collaboration – sharing, publishing, commenting, blogging, feedback to search provider…

So far, pretty standard HCI ideas, but then Ben took this into the second part of the talk. The experimentation employed in these efforts by web players has built an entire methodology that is quite different from established research paradigms. Controlled usability tests in the lab are no longer the tool of choice; instead, it’s A/B testing on masses of users, with careful choice of system changes. This is how Google/Yahoo/Live modify their ranking algorithms, how Amazon/Netflix recommend products, and how the Wikipedia collective “decides” on article content.
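
As a hypothetical, minimal illustration of that methodology: compare the click-through rates of two ranking variants on live traffic and ask whether the difference could plausibly be chance (all numbers below are made up).

```python
from math import sqrt, erfc

# Hypothetical A/B test: clicks and impressions for two ranking variants.
clicks_a, views_a = 4_120, 100_000
clicks_b, views_b = 4_310, 100_000

p_a, p_b = clicks_a / views_a, clicks_b / views_b
p_pool = (clicks_a + clicks_b) / (views_a + views_b)

# Two-proportion z-test under the normal approximation.
se = sqrt(p_pool * (1 - p_pool) * (1 / views_a + 1 / views_b))
z = (p_b - p_a) / se
p_value = erfc(abs(z) / sqrt(2))   # two-sided

print(f"CTR A={p_a:.4f}  CTR B={p_b:.4f}  z={z:.2f}  p={p_value:.4f}")
```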

This is where the term “Science 2.0” pops up. Ben’s thesis is that some of society’s great challenges today have more to learn from Computer Science than from traditional Social Science. “On-site” and “interventionist” approaches should take over from controlled laboratory approaches when dealing with large social challenges such as security, emergency response, health and others. You (government? NGOs? web communities?) could make actual, careful changes to how specific social systems work, in real life, then measure the impact, and repeat.

This may indeed sound like a lot of fluff, as some think, but the collaboration and decentralization demonstrated on the web can be put to real-life uses. One example from HCIL is the 911.gov project for emergency response, since emergencies are a classic case where centralized systems collapse. Decentralizing the report and response circles can leverage the power of the masses beyond the Twitter-journalism effect.

IBM IR Seminar Highlights (part 1)

IBM Haifa Research Labs

Yesterday’s seminar was also packed with some very interesting talks covering a wide range of social aspects of IR and NLP.

Mor Naaman, of Rutgers University and formerly of Yahoo! Research, gave an excellent talk on using social inputs to improve the multimedia search experience. The general theme was discovering metadata for a given multimedia concept from Web 2.0 sites, then using that metadata to cluster potential results and choose representative ones.

In one application, this approach was used to identify “representative” photos of a given landmark, say the Golden Gate Bridge; see World Explorer for an illustration. First, you find all Flickr photos geotagged and/or Flickr-tagged with the location and name of the bridge (or any given landmark). Next, image processing (SIFT) is applied to those photos to cluster them into subsets that are likely to show the same section and/or perspective of the bridge. Finally, relations between the images in each cluster are formed based on visual similarity, and link analysis is employed to find a “canonical view”. The result is what we see in the right sidebar of World Explorer, as described in this WWW’08 paper.
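
A rough sketch of that pipeline might look like the following; this is my reconstruction rather than the paper’s code, using OpenCV SIFT matching to relate photos and PageRank over the resulting similarity graph to pick a canonical view. The file names and thresholds are placeholders.

```python
import cv2
import networkx as nx

# Placeholder: photos already filtered by geotag/tag to a single landmark.
photos = ["bridge_001.jpg", "bridge_002.jpg", "bridge_003.jpg"]

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher()
features = {}
for path in photos:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, descriptors = sift.detectAndCompute(img, None)
    features[path] = descriptors

# Link photos whose local features overlap enough (Lowe's ratio test).
graph = nx.Graph()
graph.add_nodes_from(photos)
for i, a in enumerate(photos):
    for b in photos[i + 1:]:
        matches = matcher.knnMatch(features[a], features[b], k=2)
        good = [m for m, n in matches if m.distance < 0.75 * n.distance]
        if len(good) > 20:                      # arbitrary threshold
            graph.add_edge(a, b, weight=len(good))

# The best-connected photo in the visual-similarity graph ~ "canonical view".
scores = nx.pagerank(graph, weight="weight")
print(max(scores, key=scores.get))
```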

[Update: Mor commented that the content-based analysis part is not yet deployed in World Explorer. Thanks Mor!]


Another example applied this approach to concerts on YouTube, where the purpose was to find good clips of the concert itself rather than videos discussing it. Metadata describing the event (say, an Iron Maiden concert) was collected from both YouTube and sites such as Upcoming.org, and audio fingerprinting was employed to detect overlapping video sections, since the concert itself is likely to have the most overlap. Note that in both cases the image/audio processing is a heavy task, and applying it only to a small subset filtered by social tags makes the work far more feasible.

I’ll cover the keynote (by Prof. Ben Shneiderman) in another post, as this one is already way too long… Here are soundbites from some of the other talks:

Emil Ismalon of Collarity referred to personalized search (e.g. Google’s) as a form of overfitting: it never lets me learn anything new, since it trains itself only on my own history. That, of course, serves as motivation for community-based personalization.

Ido Guy of IBM talked about research comparing social networks extracted from public and private sources. The bottom line is that some forms of social relations are stronger, representing collaboration (working on projects together, co-authoring papers or patents), while others are weaker, centering more on socializing activities (friending/following on social networks, commenting on blogs, etc.). Of course, this applies to the enterprise social graph, not necessarily to personal life…

Daphne Raban of Haifa University summarized her empirical research into the motivations of participants in Q&A sites. The main findings were: 1) money was less important to people who participate very often, but it is a catalyst; 2) being rewarded with gratitude and conversation is the main factor driving people to become more frequent participants; and 3) in a quality comparison, paid answers ranked highest, free community answers (Yahoo! Answers) ranked close behind, and unpaid single experts ranked lowest.