
The Broken Web

Dave Winer recently pointed out two trends that pose risk to user-created content on the web:

  • Over-reliance on url-shorteners. Fueled by twitter’s laconic style, more and more links to content are created using an indirection via url shortener services such as bit.ly and tr.im. The collapse of such a service may turn tons of links into broken links in an instant.
  • Centralized conversation platforms. Shifting the conversation away from their blogs, influential content publishers chose to center on platforms such as twitter and FriendFeed. Besides the increased noise inherent to lifestreaming, there is increased risk in making your contributions (and having your readers contribute back) on a site run by a private company with no real commitment to its users.

In the past two weeks both these risks materialized to some extent. The url-shortener service tr.im shut down, and that 404-iceberg was avoided at the last minute by the owners’ decision to open-source it. Then Facebook acquired FriendFeed, and their PR said:

“…FriendFeed.com will continue to operate normally for the time being as the teams determine the longer term plans for the product.”

Hmm, right… So Scoble’s blog still loves him, and is probably a safer publishing venue.

But why is this such a big deal anyway?

Broken web of intrigue, CC by 'Looking for a Lighthouse'/Flickr

We tend to forget how much we have invested in such services until they break down (as was the case with ma.gnolia). The web’s strength is in storing, and being able to search, the content produced by millions of earthlings. When large amounts of content or links turn out to be this fragile, the impact is significant. Especially for social search, that content could be vital (OK, perhaps except for that part about what you had for breakfast).

As always with such issues, the best solution is decentralization. For url shorteners, the ‘shortlink’ protocol was already suggested for site-maintained shorteners, and WordPress has already implemented it. My blog is already enabled, try http://wp.me/plBAi-8Q.  And then content decentralization is in our hands. Think about it the next time you post your thoughts into twitter rather than in your blog…
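To make the site-maintained idea concrete, here is a minimal Python sketch of how a client could look for a short URL that the site itself publishes, rather than minting one through a third-party shortener. It assumes the page advertises the short URL with a `rel="shortlink"` link element; the `ShortlinkFinder` class, the `find_shortlink` helper and the sample HTML are invented for illustration.

```python
from html.parser import HTMLParser


class ShortlinkFinder(HTMLParser):
    """Remembers the href of the first <link rel="shortlink"> it sees."""

    def __init__(self):
        super().__init__()
        self.shortlink = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.shortlink is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated values
        rels = (attr.get("rel") or "").lower().split()
        if "shortlink" in rels:
            self.shortlink = attr.get("href")


def find_shortlink(html):
    """Return the site-declared short URL, or None if the page has none."""
    finder = ShortlinkFinder()
    finder.feed(html)
    return finder.shortlink


# Hypothetical page head; only the wp.me URL comes from this post.
sample_page = """
<html><head>
  <title>The Broken Web</title>
  <link rel="shortlink" href="http://wp.me/plBAi-8Q" />
</head><body>...</body></html>
"""

print(find_shortlink(sample_page))
```

Because the short URL lives on the publisher's own page, it stays valid for as long as the site does, whatever happens to any third-party shortening service.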

Friendly advice from your “Social Trust Graph”

While scanning for worthy Information Retrieval papers in the recent SIGIR 2009, I came across a paper titled “Learning to Recommend with Social Trust Ensemble”, by a team from the Chinese University of Hong Kong. This one is about recommender systems, but putting the social element into text analytics tasks is always interesting (to me).

The premise is an interesting one – using your network of trust to improve classic (Collaborative Filtering) recommendations. The authors begin by observing that users’ decisions are a balance between their own tastes and the recommendations of their trusted friends.

Figure 1 from "Learning to Recommend with Social Trust Ensemble" by Ma et al.

Then, they propose a model that blends analysis of the classic user-item matrix, where ratings of items by users are stored (the common tool of CF), with analysis of a “social trust graph” that links the user to other users, and through them to their opinions on the items.

This follows the intuition that when trying to draw a recommendation from the behavior of other users (which basically is what CF does), some users’ opinions may be more important than others’. Classic CF ignores that, and treats all users as having identical importance.
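The blending intuition can be sketched in a few lines of Python. This is only an illustration and not the paper's actual model, which learns its parameters from the data: here the plain-CF estimates are assumed to be given, and the function name `blended_rating`, the `alpha` default and the toy data are all hypothetical.

```python
def blended_rating(user, item, cf_estimate, trust, alpha=0.4):
    """Blend the user's own CF estimate with a trust-weighted average of friends'.

    cf_estimate: dict mapping (user, item) -> rating predicted by plain CF
    trust:       dict mapping user -> {friend: trust weight}
    alpha:       share given to the user's own taste (hypothetical default)
    """
    own = cf_estimate[(user, item)]
    # Keep only trusted friends for whom an estimate on this item exists.
    friends = {f: w for f, w in trust.get(user, {}).items()
               if (f, item) in cf_estimate}
    total = sum(friends.values())
    if total == 0:
        return own  # nobody trusted has an opinion: fall back to classic CF
    social = sum(w * cf_estimate[(f, item)] for f, w in friends.items()) / total
    return alpha * own + (1 - alpha) * social


# Toy data, invented for illustration only.
cf_estimate = {("ann", "camera"): 3.0, ("bob", "camera"): 5.0, ("eve", "camera"): 4.0}
trust = {"ann": {"bob": 0.75, "eve": 0.25}}

print(blended_rating("ann", "camera", cf_estimate, trust))
```

With `alpha=1.0` this collapses to classic CF, where every other user matters equally; lowering `alpha` lets the people a user trusts pull the prediction toward their own opinions.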

The authors show results that outperform classic CF on a dataset extracted from Epinions. That’s encouraging for any researcher interested in the contribution of the social signal to AI tasks.

free advice at renegade craft fair - CC Flickr/arimoore

However, some issues bother me with this research:

  1. Didn’t the Netflix Prize winning team’s approach (see previous post) “prove” that statistical analysis of the user-item matrix beats any external signal other teams tried to use? The answer here may be related to the sparseness of the Epinions data, which makes life very difficult for classic CF. Movie recommendations have much higher density than retail (Epinions’ domain).
  2. To evaluate, the authors sampled 80% or 90% of the ratings as training and the remaining as testing. But if you choose as training the data before the user started following someone, then test it after the user is following that someone, don’t you get a bit mixed up with cause and effect? I mean, if I follow someone and discover a product through his recommendation, there’s a high chance my opinion will also be influenced by his. So there’s no true independence between the training and test data…
  3. Eventually, the paper shows that combining two good methods (social trust graph and classic CF) outperforms each of the methods alone. The general idea of fusion or ensemble of methods is pretty much common knowledge for any Machine Learning researcher. The question should be (but it wasn’t) – does this specific ensemble of methods outperform any other ensemble? and does it fare better than the state of the art result for the same dataset?

The last point is of specific interest to me, having combined keyword-based retrieval with concept-based retrieval in my M.Sc. work. I could easily show that the resulting system outperformed each of the separate elements, but to counter the above questions, I further tested combining other similarly high-performing methods to show that the performance gained there was much lower, and also showed that the combination could take a state of the art result and further improve on it.

Nevertheless, the idea of using opinions from people you know and trust (rather than authorities) in ranking recommendations is surely one that will gain more popularity, as social players start pushing ways to monetize the graphs they worked so hard to build…

Bart Simpson working at Google??

“Phone call for Al…Al Coholic…is there an Al Coholic here?”
“Wait a minute… Listen, you little yellow-bellied rat jackass, if I ever find out who you are, I’m gonna kill you!”

Sweet little Bart Simpson must have hacked his way into the training data the guys at Google Scholar are using. I was running a simple Google query for user manuals that Googlebot indexed at sears.com, and got these goodies in the results:

Google Scholar Bart Simpson

For the perplexed readers, the image on the right is what the Google Scholar parser saw for the DVD result (click to enlarge). The parser then assumed it was an academic paper and desperately tried to find an author name. As Google freely admits, “…Automated extraction of information from articles in diverse fields can be tricky”. Yep.

sony-dvd-manual

It gets even better: since there are many such “academic papers” with the same author name, Google clusters them together, even when the manuals are for different products. Try one of those “All xxx versions” links, e.g. this one, all by our good friend O. Instructions. Interested students are encouraged to proceed and find out the etymology of other fascinating author names such as R. Parts and NO. Model.

And what about our old friend Al Coholic, you ask? well, Google Scholar tells us he did actually publish something! but wait – 1877? Annals of the New York Academy of Sciences? young Simpson, have you no shame boy!?

Owning the People Namespace

Chris Messina is an interesting guy to follow. Sort of an “NGO celebrity” on the web, he’s known as an advocate for open standards and efforts such as OpenID, DISO and Microformats, and in the past also SpreadFirefox.

One of the many issues Chris writes passionately about is our online identity. That little link I added to his name in the opening words of this post triggers an entire domain of debates, ideals and evil plans to take over the world. Should I have linked to his Facebook page? or Twitter? perhaps MySpace or even Google? all these companies beg us to choose them as our identity providers, so that we will let them be our companions when we visit other websites, thus helping their “social colonization” efforts.

So in a way, those companies are trying to become the global people namespace. On the web I may be http://facebook.com/ofer.egozi, or http://linkedin.com/in/oferegozi etc., and as Dave Morin of facebook tweeted, “/ is the new @” (hence their PR extravaganza on vanity urls). Our identity is associated with the domain on that url, much as with our email domain.

An interesting corollary I can suggest here is that the “commitment” of that company to your identity is reflected in the extra padding next to that ‘/’. Companies such as twitter and facebook say “profiles are not just another application for us, they ARE our application”, whereas others such as linkedin and google still interject a /in/ or /profiles/ in between, just in case something else becomes more important…

So why not use his Facebook then? With social networks being such relatively new entities, we seem to forget the temporariness of a business organization. We also tend to forget that those network accounts are only free as in beer, and the organizations behind them can arbitrarily delete a user or change their policies at any time, leaving the anchor on the web which you built over the years suddenly at stake.

Personal Anchor on the Web for Digital Identity

My own identity is this blog. I own the domain, I maintain an OpenID on it using WordPress.com, and I can always decide to modify that identity, take it elsewhere or remove it altogether. The control over that identity, how it’s portrayed and used remains with me, even if many other aspects (think social graph) are still locked up elsewhere. That’s a start.

Semantic Search using Wikipedia

Today I gave my thesis talk at the Technion as part of my master’s duties. Actually, the non-buzzword title is “Concept-Based Information Retrieval using Explicit Semantic Analysis”, but that’s not a very click-worthy post title :-)…

The whole thing went far better than I expected – the room was packed, the slides flew smoothly (and quickly too, luckily Shaul convinced me to add some spare slides just in case), and I ended up with over 10 minutes of Q&A (an excellent sign for a talk that went well…)

Click to view on Slideshare

BTW – does anybody have an idea how to embed Slideshare into a hosted blog? It doesn’t seem to work…

Apparently, Amazon invented the WWW

Tim Berners-Lee – behind you!

Amazon has been selling stuff online since as far back as 1973, at least if you believe this:

amazon1973

In fact, Google lists over 51,000 pages with this date on Amazon. And mind you – it is the exact date September 4 1973, not a day less, nor a day more.

Of course, some geeks may claim this has to do with some Amazon programmers’ default value, but the POSIX time for this date is just a boring 115945200, not some fun number like 1234567890. I prefer to attribute this to some evil Bezosish conspiracy theory, now I just need to figure out what it was.
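The arithmetic is easy to check with a minimal Python sketch using only the standard library. Midnight of September 4 1973 comes out as 115948800 in UTC, while the 115945200 quoted above matches midnight at a UTC+1 offset; that offset is an assumption about how the number was obtained, and the helper name `posix_midnight` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def posix_midnight(year, month, day, utc_offset_hours=0):
    """POSIX timestamp of local midnight on the given date, at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return int(datetime(year, month, day, tzinfo=tz).timestamp())


print(posix_midnight(1973, 9, 4))                      # 115948800 (UTC)
print(posix_midnight(1973, 9, 4, utc_offset_hours=1))  # 115945200 (UTC+1)

# The "fun number" maps back to a date that has nothing to do with 1973.
print(datetime.fromtimestamp(1234567890, tz=timezone.utc))  # 2009-02-13 23:31:30+00:00
```

Either way the value is not a round or otherwise notable number, which is what makes the default-value explanation less than obvious.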

Suggestions?

The Opportunity in RSS Overload

Dare Obasanjo has an interesting post, with a good comments thread, on overflowing feed readers. He’s quoting from a post by Farhad Manjoo on Slate:

You know that sinking feeling you get when you open your e-mail and discover hundreds of messages you need to respond to…

Well, actually Dare’s post is from two weeks ago. The reason I got to read it only now is exactly that…

Yes, I know I don’t really need to ‘respond’ to subscriptions, and the answer should be – unsubscribe, or go on a feeds (or ‘follow’ edges) social diet. But these binary decisions are not always optimal, as I have plenty of feeds I subscribed to after hitting one or two posts I really liked, but that were not on that author’s main subject (if such exists at all). Thus I have to skim through many un-interesting (for me!) posts, many of which somehow always end up discussing twitter. In fact, that’s what most of my feeds look like (including the twitter part).

We need shades of grey between subscribed and unsubscribed. It would be great to have a feed reader that learns from how you use it. It should be quite clear which posts interest me – ones I took time to read, scroll through, press a link etc. – and which did not. Now train a classifier on that data, preferably per-feed (in addition to a general one), and get some sense of what I’m really looking for.
Mark All As Read Day - flickr/sidereal

Now, I don’t need this smart reader to delete the uninteresting ones; let’s not assume too much about its classification accuracy. Just find the right UI to mark the predicted-to-be-interesting items (or even assign them into a special virtual folder). Then I can read these first, and only if/when I have time – read the rest.
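A rough Python sketch of what such a reader could do under the hood: one tiny Naive Bayes model per feed, trained on whether past posts were engaged with, and used only to reorder unread items rather than delete anything. Everything here is hypothetical (the `InterestModel` class, the `order_unread` helper, the toy history), and a real reader would use richer signals than title words.

```python
import math
import re
from collections import Counter, defaultdict


class InterestModel:
    """Tiny multinomial Naive Bayes over post words, with add-one smoothing."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z']+", text.lower())

    def observe(self, text, engaged):
        """Record one post and whether the user read / scrolled / clicked it."""
        self.doc_counts[engaged] += 1
        self.word_counts[engaged].update(self.tokenize(text))

    def score(self, text):
        """Log-odds that a post is interesting; above 0 leans interesting."""
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        v = len(vocab) + 1  # one extra slot for unseen words
        total_docs = sum(self.doc_counts.values())
        log_odds = 0.0
        for label, sign in ((True, 1), (False, -1)):
            n = sum(self.word_counts[label].values())
            lp = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for word in self.tokenize(text):
                lp += math.log((self.word_counts[label][word] + 1) / (n + v))
            log_odds += sign * lp
        return log_odds


models = defaultdict(InterestModel)  # one model per feed

# Invented reading history: (feed, post title, did I engage with it?)
history = [
    ("feedA", "semantic search with wikipedia concepts", True),
    ("feedA", "information retrieval ranking evaluation", True),
    ("feedA", "twitter announces new twitter feature", False),
    ("feedA", "why everyone is on twitter now", False),
]
for feed, title, engaged in history:
    models[feed].observe(title, engaged)


def order_unread(feed, titles):
    """Predicted-interesting posts first; nothing is ever dropped."""
    return sorted(titles, key=models[feed].score, reverse=True)


ranked = order_unread("feedA", ["twitter growth numbers", "concept based retrieval"])
print(ranked)
```

Sorting instead of filtering keeps the cost of a wrong prediction low: a misjudged post just ends up further down the list.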

I assign this to be my pet project in case I win the lottery next week and go into early retirement. Alternatively, if someone saw this implemented anywhere – let me know!

Update: a related follow-up post on a new filtering product I started using.

Google Labs is now Google

Quick, name this search engine!

public-google-labs

No, not Kumo. That’s Google’s recent launch, trying to compete with Twitter search (“Recent results”), to preempt Microsoft (clustering result types), to show a different, though quite ugly UI metaphor (“wonder wheel”), and generally to roll out a whole bunch of features that should have been Google Labs features before making (or not) their way into a public product. So what’s next? buttons next to search results moving them up or down with no opt-out?? Ah, wait, that waste of real estate is already there.

Flash Gordon Gets the Drop on Arch-Enemy Ming the Merciless - Flickr/pupleslog

Someone is panicking. OPEN FIRE! ALL WEAPONS!!! DISPATCH WAR ROCKET AJAX!!! The same spirit that brought us the failure of knols is bringing us yet further unnecessary novelty, but this time it’s a cacophony of features, each deserving a long Google Labs quarantine by itself.

I noticed that many of my recent blog posts have to do with Google criticism :-). I wrestle with that; there really ought to be more interesting stuff to blog about in the IR world, and there is also great stuff coming from Google (can you imagine the fantastic similar images feature is still in labs? can Google please apply this to the ridiculously useless “similar pages” link in main web search results??), but I truly think we see a trend. Google is dropping the ball, losing the clear and spotless logic we have seen in the past, and the sensible slow graduation of disruptive features from Google Labs. Sadly, though, it’s not clear if anyone is there, ready to pick up that ball…

Google converts the converted

I love Google Chrome. It’s super fast, its default home page (showing most visited websites) and searching from the url box are great, and the javascript experiments really knocked me out.

So Google must know this, as Chrome does talk to the mothership quite often. Then why-oh-why, whenever Google embarks on a “Get Chrome” campaign and I happen to use IE (say for one of those sites that render well only in IE), do they not spare us the converted? Is it really that hard to put a flag on the Google uber-cookie that Chrome is already installed here?…

get-chrome

 

BTW – all you Firefox users are considered too sophisticated to buy it – this promotion is not shown to FF users, only IE! 🙂

Mechanical Hype, revisited

aardvark

As I wrote previously, I really like the idea behind Aardvark (previously known as Mechanical Zoo) and it’s a great social Q&A tool, but it simply is not “social search” (and unlike TechCrunch, RWW realize that). The Aardvark team still pushes that terminology, I guess for a good reason given the financial climate, and disperses more of it in a white paper. Once they actually start searching in their aggregated Q&A repository to provide you with an available answer without bothering your network – that would become more of a search solution, rather than Q&A.

Having played with the product a bit, I also see an inherent flaw in the social premise here. Aardvark provides me with answers from friends, or friends-of-friends. Now, it’s more likely I’ll get answers from friends-of-friends, as there are simply a lot more of them. However, these would be people who don’t know me, and will not provide a personal answer that is tailored to my own individual needs.

Still, it’s a great way to make new friends. Not kidding – Aardvark strongly drives conversations, as Danny Sullivan also pointed out, and since this friend-of-friend was the one who responded to my question, I’d feel more comfortable discussing further. Presumably Aardvark will also track this, and practically add this person to my direct social graph.

 

Update: Max Ventilla of Aardvark commented on my previous post that indexing your graph and finding the right person to answer your query has, in fact, the ingredients of social search. He has a point there, but still that search ends in finding a person, not information, so it’s more of a people search. Still, I agree that in executing this task, the varkers face similar difficulties to those we faced in Delver, albeit on a much smaller scale.