Tag Archives: Recommender systems

Death of a News Reader

Dave Winer says I don’t read his posts. He’s right, I admit. I skim.

I’m overloaded. So in the past few months I’ve gradually reduced my subscription list from over 50 feeds to around a dozen, and at the same time increased my reliance on Genieo, which claims to be tracking already 537 feeds for me (though not all are ones I really would fully subscribe to, but that’s the beauty of it…)

When trying to understand what had happened, I came to realize my reader subscriptions list was made of two types of feeds:

  1. Feeds that are generally on topics I’m interested in
  2. Blogs where I thought the author was interesting or smart

Type #1 is, being practical, simply not scalable. There are just too many good sources out there, and not all posts in them are really read-worthy for me, even if just to skim through. So I let Genieo discover those feeds (just clicking through to some posts) and then removed them from my subscription list. It’s amazing how good it feels to safely eliminate a feed from your reader (“…yes, I am sure I want to delete!:) )

Type #2 is more tricky as I would usually be interested in all of the posts even if not in my topics of interest. These include blogs by friends, and blogs by smart people I stumbled upon who seemed worth following. I also wouldn’t want Genieo (or any other learning reader for that matter) to think I’m generally interested in those more random topics and clutter my personalized feed. So I still kept this much shorter list in my reader, but I know I can visit them a lot less frequently and not lose anything.

This combination has been working well for me in recent months. Social diet hurray!

Web(MD) 2.0

Just when I thought that the uses for recommendation systems were already exhausted…

CureTogether is a site that lets you enter your medical conditions (strictly anonymous, only aggregated data are public), and get recommended for… other “co-morbid” conditions you may have. In other words, “people who have your disease usually also have that one too, perhaps you have it too?

Beyond the obvious jokes, this truly has potential. You don’t only get “recommended” for conditions, but rather also for treatments and causes. We all know that sometimes we have our own personal treatment that works only for us. What if it works for people in our profile, and sharing that profile, anonymously, will help similar people as well? so far this direction is not explicit enough in how the site works, possibly for lack of sufficient data, but you can infer it as you go through the questionnaires.

The data mining aspect of having a resource such as CureTogether’s database is naturally extremely valuable. CureTogether’s founders share some of their findings on their blog. The power of applying computer science analytics and experimentation methodologies – sharpened by web-derived needs – to social sciences and others, reminded me of Ben Schneiderman’s talk on “Science 2.0. The idea that computer science can contribute methodologies that stretch beyond the confines of computing machines is a mind-boggling one, at least for me.

But would you trust collaborative filtering with your health? it’s no wonder that the main popular conditions on the site are far from life threatening, and the popular ones are such with unclear causes and treatments, such as migraines, back pains and allergies. Still, the benefit on these alone will probably be sufficient for most users to justify signing up.

The (Filtered) Web is Your Feed

A few months ago I was complaining here about my rss overload. A commenter suggested that I take a look at my6sense, a browser extension (now also iPhone app) that acts as a smart RSS reader, emphasizing the entries you should be reading. I wanted to give my6sense a go then, but the technical experience was lousy, and moreover – I was expected to migrate my rss reading to it. Too much of a hassle, I gave it up.

In the past few weeks I’ve been test-driving a new player – Genieo, which takes the basic my6sense idea a few steps further. Genieo installs an actual application, not just extension, that plugs into your browser. It tracks your rss feeds automatically, simply by looking for rss feeds in the pages you’re browsing, and learns your feeds without any setup work.

Genieo then goes further to discover feeds on pages you visit even if you’re not subscribed to them, turning your entire browsing history into one big rss feed.  It finally filters this massive pool of content using a semantic profile it builds for your interests, based on analyzing the text you’ve read so far.

For IR people this may sound a lot like Watson, Jay Budzik’s academic project turned contextual search turned an advertising technology acquisition. Watson approached this problem as a search problem: how would I formulate search queries that would run in the background, fetching me the most relevant documents that match the user’s current context? problem is, users are not constantly searching, and would get quickly annoyed by showing general search results when not asked for.

The good thing about an rss feed is that it explicitly says “this is a list of content items to be consumed from this source“, and its temporal nature provides a natural preference ranking (prefer recent items), so a heuristic of “users would be interested in recent and relevant items from feeds in pages they visited” works around the general search difficulty pretty well. Genieo circumvents the expected privacy outcry by running the entire logic on the client side, nothing of the analyzed data leaves your PC (privacy warriors would probably run sniffers to validate that).

In my personal experience, the quality of most results is excellent, and they are almost always posts that would interest me. Genieo quickly picked up my feed subscriptions from clicks I made in my reader to the full article in a browser window (from which it extracted the rss feed), and after a while I could see it gradually picking up on my favorite memes (search, social and others). I did not give up my rss reader for Genieo yet, and I also still have many little annoyances with it, but overall for an initial version, it works surprisingly well.

However, the target audience that is even more suited for Genieo is the not rss-savvy users like me, but the masses out there who don’t know and don’t care about reading feeds. They just want interesting news, and they don’t mind missing on the full list (a-la Dave Winer’s “River of News” concept). Such users will find tools like Genieo as useful as a personal news valet can be.

Friendly advice from your “Social Trust Graph”

While scanning for worthy Information Retrieval papers in the recent SIGIR 2009, I came across a paper titled “Learning to Recommend with Social Trust Ensemble“, by a team from the University of Hong Kong. This one is about recommender systems, but putting the social element into text analytics tasks is always interesting (me).

The premise is an interesting one – using your network of trust to improve classic (Collaborative Filtering) recommendations. The authors begin by observing that users’ decisions are the balance between their own tastes, and those of their trusted friends’ recommendations.

Figure 1 from "Learning to Recommend with Social Trust Ensemble" by Ma et al.

Then, they proceed to propose a model that blends analysis of classic user-item matrix where ratings of items by users are stored (the common tool of CF), with analysis of a “social trust graph” that links the user to other users, and through them to their opinions on the items.

This follows the intuition that when trying to draw a recommendation from behavior of other users (which basically is what CF does), some users’ opinions may be more important than others’, and the fact that classic CF ignores that, and treats all users as having identical importance.

The authors show results that out-perform classic CF on a dataset extracted from Epinions. That’s encouraging for any researcher interested in the contribution of the social signal into AI tasks.

free advice at renegade craft fair - CC Flickr/arimoore

However, some issues bother me with this research:

  1. Didn’t the netflix prize winning team approach (see previous post) “prove” that statistical analysis of the user-item matrix beats any external signal other teams tried to use? the answer here may be related to the sparseness of the Epinions data, which makes life very difficult for classic CF. Movie recommendations have much higher density than retail (Epinions’ domain).
  2. To evaluate, the authors sampled 80% or 90% of the ratings as training and the remaining as testing. But if you choose as training the data before the user started following someone, then test it after the user is following that someone, don’t you get a bit mixed up with cause and effect? I mean, if I follow someone and discover a product through his recommendation, there’s a high chance my opinion will also be influenced by his. So there’s no true independence between the training and test data…
  3. Eventually, the paper shows that combining two good methods (social trust graph and classic CF) outperforms each of the methods alone. The general idea of fusion or ensemble of methods is pretty much common knowledge for any Machine Learning researcher. The question should be (but it wasn’t) – does this specific ensemble of methods outperform any other ensemble? and does it fare better than the state of the art result for the same dataset?
own taste and his/her trusted friends’ favors.

The last point is of specific interest to me, having combined keyword-based retrieval with concept-based retrieval in my M.Sc. work. I could easily show that the resulting system outperformed each of the separate elements, but to counter the above questions, I further tested combining other similarly high performing methods to show performance gained there was much lower, and also showed that the combination could take a state of the art result and further improve on it.

Nevertheless, the idea of using opinions from people you know and trust (rather than authorities) in ranking recommendations is surely one that will gain more popularity, as social players start pushing ways to monetize the graphs they worked so hard to build…

If you liked my blog, you’d like this post. Trust me!

One of the sites that most impressed me when I first started browsing the web was called MovieCritic.com. You would rate a few movies you saw, then it would predict whether you’d like a new movie. It would even let you find one that matches both your taste and your girlfriend’s. Pure magic, for that time. For me that was the first demonstration of what we can achieve with the web as a medium.

MovieCritic is dead for a few years now, but recommender systems are now everywhere. NetFlix runs one of the most successful commercial implementations (Amazon another classic example, “People who bought this book…”), and two years ago they challenged researches to come up with a system that would perform 10% better than their own, in predicting users’ ratings. The best achieving team so far almost got there, and today I attended a talk in the Technion by Yehuda Koren, one of the team members and a researcher at Yahoo! Research Haifa lab.

Most methods follow the neighborhood-based model – find an item’s neighbours (in some representation), and predict based on their rating. This may be done in a user-user matching (find users like this user, then check their rating) or item-item (find items like the rated item, then predict based on how the user rated those items). One of the interesting approaches proposed by Koren’s team represented both users and movies in the same space, then looked for similarity in this unified space.

The most striking finding for me, however, was that winning strategies did not use anything from the movie’s “content” features. Genre, director, actors, length, etc. – all these did not produce any additional value beyond the plain statistical analysis and correlation of ratings and users, and are therefore not used at all. In fact, Koren claims that knowing that a certain user is a Tom Hanks fan makes no difference, we will infer this from the recommendations anyway (assuming there are enough of them of course).

I find that almost sad… Not being able to intelligently reason over the underlying logic exposed by an AI software is a tremendous drawback in my eyes, even if the overall prediction score is better. Telling the user “you may want to watch this movie because A and B and C” can benefit in more satisfaction by the user, understanding even the incorrect predictions, and possibly leading to a feedback cycle. Doing away with it is like showing web search results without keyword highlighting, no visible cue for the user why this result was returned (“…trust me, I know what’s the right answer for you!“).