Tag Archives: Artificial Intelligence

Semantic Search using Wikipedia

Today I gave my master’s thesis talk at the Technion as part of my master’s duties. Actually, the non-buzzword title is “Concept-Based Information Retrieval using Explicit Semantic Analysis”, but that’s not a very click-worthy post title :-)…

The whole thing went far better than I expected – the room was packed, the slides flew smoothly (and quickly too, luckily Shaul convinced me to add some spare slides just in case), and I ended up with over 10 minutes of Q&A (an excellent sign for a talk that went well…)

Click to view on Slideshare

BTW – does anybody have an idea how to embed Slideshare into a hosted blog? It doesn’t seem to work…

The Opportunity in RSS Overload

Dare Obasanjo has an interesting post, with a good comments thread, on overflowing feed readers. He’s quoting from a post by Farhad Manjoo on Slate:

You know that sinking feeling you get when you open your e-mail and discover hundreds of messages you need to respond to…

Well, actually Dare’s post is from two weeks ago. The reason I got to read it only now is exactly that…

Yes, I know I don’t really need to ‘respond’ to subscriptions, and the answer should be – unsubscribe, or go on a feeds (or ‘follow’ edges) social diet. But these binary decisions are not always optimal, as I have plenty of feeds I subscribed to after hitting one or two posts I really liked, but that were not on that author’s main subject (if such exists at all). Thus I have to skim through many uninteresting (for me!) posts, many of which somehow always end up discussing Twitter. In fact, that’s what most of my feeds look like (including the Twitter part).

We need shades of grey between subscribed and unsubscribed. It would be great to have a feed reader that learns from how you use it. It should be quite clear which posts interest me – ones I took time to read, scroll through, press a link etc. – and which did not. Now train a classifier on that data, preferably per-feed (in addition to a general one), and get some sense of what I’m really looking for.
Mark All As Read Day - flickr/sidereal

Now, I don’t need this smart reader to delete the uninteresting ones; let’s not assume too much about its classification accuracy. Just find the right UI to mark the predicted-to-be-interesting items (or even assign them to a special virtual folder). Then I can read these first, and only if and when I have time – read the rest.
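
To make the idea a bit more concrete, here is a minimal sketch of such a reader's core, assuming the only signal is a yes/no "did I engage with this post" flag and the only features are the words in the post. The names `InterestModel`, `record` and `predicted_interesting` are made up for illustration, and the tiny naive Bayes scorer is just one simple choice of classifier, not a claim about how a real product would do it.

```python
import math
from collections import Counter, defaultdict


class InterestModel:
    """Tiny naive Bayes over post words. Label True = engaged, False = skipped."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.post_counts = {True: 0, False: 0}

    def train(self, text, engaged):
        self.word_counts[engaged].update(text.lower().split())
        self.post_counts[engaged] += 1

    def score(self, text):
        """Log-odds that the post is interesting; 0.0 when nothing was learned yet."""
        vocab = len(set(self.word_counts[True]) | set(self.word_counts[False])) + 1
        totals = {label: sum(c.values()) for label, c in self.word_counts.items()}
        # Prior from how often posts were engaged with, using add-one smoothing.
        log_odds = math.log((self.post_counts[True] + 1) / (self.post_counts[False] + 1))
        for word in text.lower().split():
            p_yes = (self.word_counts[True][word] + 1) / (totals[True] + vocab)
            p_no = (self.word_counts[False][word] + 1) / (totals[False] + vocab)
            log_odds += math.log(p_yes / p_no)
        return log_odds


# One general model plus one model per feed, as suggested in the post.
general = InterestModel()
per_feed = defaultdict(InterestModel)


def record(feed, text, engaged):
    """Call whenever a post was either read/clicked (True) or skipped (False)."""
    general.train(text, engaged)
    per_feed[feed].train(text, engaged)


def predicted_interesting(feed, text):
    """Only used to flag posts for reading first, never to delete anything."""
    return general.score(text) + per_feed[feed].score(text) > 0
```

Summing the two log-odds is a crude way to combine the general and per-feed views; a feed with no history contributes a score of zero, so the general model decides alone.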

I assign this to be my pet project in case I win the lottery next week and go into early retirement. Alternatively, if someone has seen this implemented anywhere – let me know!

Update: a related follow-up post on a new filtering product I started using.

Why Search Innovation is Dead

We like to think of web search as quite a polished tool. I still find myself amazed at the ease with which difficult questions can be answered just by googling them. Is there really much further to go from here?

"Hasn't Google solved search?" by asmythie/Flickr

Brynn Evans has a great post on why social search won’t topple Google anytime soon. In it, she shares some yet-to-be-published results showing that difficulty in forming the query is a major cause of failed searches. That resonated well with some citations I’m collecting right now for my thesis (on concept-based information retrieval). It also reminded me of a post Marissa Mayer of Google wrote some months ago, titled “The Future of Search”. One of the main items on that future of hers was natural language search, or as she put it:

This notion brings up yet another way that “modes” of search will change – voice and natural language search. You should be able to talk to a search engine in your voice. You should also be able to ask questions verbally or by typing them in as natural language expressions. You shouldn’t have to break everything down into keywords.

Mayer gives some examples of questions that were difficult to query or formulate by keywords. But clearly she has the question in her head, so why not just type it in? After all, Google does attempt to answer questions. Mayer (and Brynn too) mention the lack of context as one reason. Some questions, if phrased naively, refer to the user’s location, activities or other context. It’s a reasonable, though somewhat exaggerated point. Users aren’t really that naive or lazy. If instead of using search they called up a friend, they wouldn’t ask “can you tell me the name of that bird flying out there?”. The same info they would provide verbally, they can also provide to a natural-language search engine, if properly guided.

The more significant reason in my eyes revolves around habits. Search is usually a means, rather than a goal. So we don’t want to think where and how to search, we just want to type something quickly into that good old search box and fire it away. It’s no wonder that the search engine most bent on sending you away asap has the most loyal users coming back for more. That same engine even has a button that hardly anyone uses, and that supposedly costs them over $100M a year in revenue, that sends users away even faster. So changing this habit is a tough task for any newcomer.

But these habits go deeper than that. Academic researchers have long studied natural-language search and concept-based search. A combination of effective keyword-based search, together with a more elaborate approach that kicks in when the query is a tough one, could have gained momentum, and some attempts were made at commercial products (most notably Ask, Vivisimo and Powerset). They all failed. Users are so used to the “exact keyword match” paradigm, the total control it provides them with, and its logic (together with its shortcomings) that a switch is nearly impossible, unless Google drives such a change.

Until that happens, we’ll have to continue limiting innovations to small tweaks over the authorities…

If you liked my blog, you’d like this post. Trust me!

One of the sites that most impressed me when I first started browsing the web was called MovieCritic.com. You would rate a few movies you saw, then it would predict whether you’d like a new movie. It would even let you find one that matches both your taste and your girlfriend’s. Pure magic, for that time. For me that was the first demonstration of what we can achieve with the web as a medium.

MovieCritic has been dead for a few years now, but recommender systems are now everywhere. Netflix runs one of the most successful commercial implementations (Amazon is another classic example, “People who bought this book…”), and two years ago they challenged researchers to come up with a system that would perform 10% better than their own in predicting users’ ratings. The best-performing team so far almost got there, and today I attended a talk at the Technion by Yehuda Koren, one of the team members and a researcher at the Yahoo! Research Haifa lab.

Most methods follow the neighborhood-based model – find an item’s neighbours (in some representation), and predict based on their ratings. This may be done by user-user matching (find users like this user, then check their ratings) or item-item matching (find items like the rated item, then predict based on how the user rated those items). One of the interesting approaches proposed by Koren’s team represented both users and movies in the same space, then looked for similarity in this unified space.
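
For readers who have not seen the neighborhood model before, the item-item variant can be sketched roughly as follows. This is a toy illustration only, not the method of Koren's team: the cosine similarity over raw ratings, the function names and the `k` parameter are all assumptions made for the example.

```python
import math


def item_similarity(ratings, a, b):
    """Cosine similarity between items a and b over users who rated both."""
    common = [u for u in ratings if a in ratings[u] and b in ratings[u]]
    if not common:
        return 0.0
    dot = sum(ratings[u][a] * ratings[u][b] for u in common)
    norm_a = math.sqrt(sum(ratings[u][a] ** 2 for u in common))
    norm_b = math.sqrt(sum(ratings[u][b] ** 2 for u in common))
    return dot / (norm_a * norm_b)


def predict(ratings, user, item, k=2):
    """Similarity-weighted average of the user's ratings on the k nearest items."""
    neighbours = [
        (item_similarity(ratings, item, other), rating)
        for other, rating in ratings[user].items()
        if other != item
    ]
    # Keep only the k items most similar to the one we are predicting.
    top = sorted(neighbours, key=lambda pair: pair[0], reverse=True)[:k]
    weight = sum(sim for sim, _ in top)
    if weight == 0:
        return None  # no usable neighbours for this user and item
    return sum(sim * rating for sim, rating in top) / weight


# ratings[user][movie] = stars; a made-up toy data set.
ratings = {
    "ann": {"Alien": 5, "Aliens": 5, "Amelie": 1},
    "bob": {"Alien": 4, "Aliens": 5, "Amelie": 2},
    "cat": {"Alien": 1, "Aliens": 2, "Amelie": 5},
    "dan": {"Alien": 5, "Amelie": 1},
}
print(predict(ratings, "dan", "Aliens", k=1))
```

The user-user variant is the mirror image: compare users by the items they both rated, then average the neighbours' ratings of the target item. Real systems typically also subtract per-user and per-item averages before comparing, which this sketch skips.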

The most striking finding for me, however, was that winning strategies did not use anything from the movie’s “content” features. Genre, director, actors, length, etc. – all these did not produce any additional value beyond the plain statistical analysis and correlation of ratings and users, and are therefore not used at all. In fact, Koren claims that knowing that a certain user is a Tom Hanks fan makes no difference: we will infer this from the recommendations anyway (assuming there are enough of them of course).

I find that almost sad… Not being able to intelligently reason over the underlying logic exposed by AI software is a tremendous drawback in my eyes, even if the overall prediction score is better. Telling the user “you may want to watch this movie because A and B and C” can lead to greater user satisfaction, a better understanding of even the incorrect predictions, and possibly a feedback cycle. Doing away with it is like showing web search results without keyword highlighting, with no visible cue for the user as to why this result was returned (“…trust me, I know what’s the right answer for you!”).

Solving Checkers

I attended a talk today at the Technion by Jonathan Schaeffer from the University of Alberta in Canada, the person behind Chinook. Chinook is practically the world champion in the game of checkers, if only the world checkers federations would allow computer programs to compete.

Solving checkers, according to Schaeffer, turns out to be much more a laborious process of solving lots of hardware problems than a matter of artificial intelligence algorithms. It took 19 years to build a database of board positions sufficiently large to solve the game from any given position (and there are 500 billion billion such positions, mind you). Since relying on a wrong piece of data could be catastrophic for the complex calculation, there was a lot of dealing with disk failures, network failures and grid computation. Sounds a lot like running a large-scale search engine, only that search engines can afford to say “Oops…”.

In one case, Schaeffer, who started out in 1989, reached the 32-bit limit in his files. He started refactoring his code and databases to accommodate 64-bit pointer arithmetic, but ended up deciding it was better to wait 3 more years for 64-bit to become mainstream. So in fact sitting idle for 3 years can do wonders for your projects. In another case, a bug discovered in 2005 was traced back to data created in 1992, which forced a re-calculation of all the data that relied on it. Bummer.

But what I truly found most interesting was the human story behind it all. Chinook’s major opponent was the late Marion Tinsley, world champion and indisputably the best human checkers player ever, who in his career of thousands of games lost only five. Schaeffer tells of hate letters he received after Chinook beat Tinsley, who died shortly after, and of his decision then to prove that a human simply cannot beat a machine in the game.

The bottom line? Schaeffer has now indeed proven that the best an opponent of Chinook can achieve is a draw. This positions humans as those who could err and lose. A beautiful definition…