The (Filtered) Web is Your Feed

A few months ago I was complaining here about my rss overload. A commenter suggested that I take a look at my6sense, a browser extension (now also iPhone app) that acts as a smart RSS reader, emphasizing the entries you should be reading. I wanted to give my6sense a go then, but the technical experience was lousy, and moreover – I was expected to migrate my rss reading to it. Too much of a hassle, I gave it up.

In the past few weeks I’ve been test-driving a new player – Genieo, which takes the basic my6sense idea a few steps further. Genieo installs an actual application, not just an extension, that plugs into your browser. It tracks your rss feeds automatically, simply by looking for rss feeds in the pages you’re browsing, and learns your feeds without any setup work.

Genieo then goes further to discover feeds on pages you visit even if you’re not subscribed to them, turning your entire browsing history into one big rss feed.  It finally filters this massive pool of content using a semantic profile it builds for your interests, based on analyzing the text you’ve read so far.
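
For the technically curious, here is a minimal sketch of how feeds can be discovered on a visited page. It relies on the common autodiscovery convention of `<link rel="alternate">` tags in the page head; whether Genieo works this way is my assumption, not something they have documented, and the names `FeedLinkParser` and `discover_feeds` are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used to advertise RSS and Atom feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects the hrefs of <link rel="alternate"> tags that advertise a feed."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = {name: (value or "") for name, value in attrs}
        rels = attributes.get("rel", "").lower().split()
        feed_type = attributes.get("type", "").lower()
        href = attributes.get("href", "")
        if "alternate" in rels and feed_type in FEED_TYPES and href:
            # Resolve relative feed URLs against the page address.
            self.feeds.append(urljoin(self.base_url, href))


def discover_feeds(html, base_url):
    """Return the feed URLs advertised by an HTML page (hypothetical helper)."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds


page = (
    '<html><head>'
    '<link rel="alternate" type="application/rss+xml" href="/feed/">'
    '<link rel="stylesheet" href="/style.css">'
    '</head><body>Hello</body></html>'
)
print(discover_feeds(page, "http://example.com/post/1"))
```

Run over every page in a browsing session, something like this would yield the pool of feeds that the filtering step then has to narrow down.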

For IR people this may sound a lot like Watson, Jay Budzik’s academic project turned contextual search turned an advertising technology acquisition. Watson approached this as a search problem: how would I formulate search queries that would run in the background, fetching the documents most relevant to the user’s current context? The problem is that users are not constantly searching, and would quickly get annoyed at being shown general search results they did not ask for.

The good thing about an rss feed is that it explicitly says “this is a list of content items to be consumed from this source“, and its temporal nature provides a natural preference ranking (prefer recent items), so a heuristic of “users would be interested in recent and relevant items from feeds in pages they visited” works around the general search difficulty pretty well. Genieo circumvents the expected privacy outcry by running the entire logic on the client side, so none of the analyzed data leaves your PC (privacy warriors would probably run sniffers to validate that).
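
To make the “recent and relevant” heuristic concrete, here is a toy sketch. Everything in it is my own assumption rather than Genieo’s actual logic: the interest profile is a plain word-to-weight dictionary, relevance is the average profile weight of an item’s words, and recency is an exponential decay with a hypothetical 24-hour half-life.

```python
from dataclasses import dataclass


@dataclass
class Item:
    title: str
    text: str
    age_hours: float  # time since the item was published


def relevance(text, profile):
    """Average profile weight of the words in the text (0.0 for empty text)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(profile.get(word, 0.0) for word in words) / len(words)


def score(item, profile, half_life_hours=24.0):
    """Relevance discounted by age: the score halves every half_life_hours."""
    decay = 0.5 ** (item.age_hours / half_life_hours)
    return relevance(item.text, profile) * decay


def rank(items, profile, half_life_hours=24.0):
    """Return items sorted from most to least interesting."""
    return sorted(
        items,
        key=lambda item: score(item, profile, half_life_hours),
        reverse=True,
    )


profile = {"search": 1.0, "social": 0.8}  # hypothetical learned interests
items = [
    Item("Old but relevant", "social search news", age_hours=48),
    Item("Fresh but off-topic", "cooking recipes today", age_hours=1),
    Item("Fresh and relevant", "social search news", age_hours=2),
]
for item in rank(items, profile):
    print(item.title)
```

A real system would use a far richer semantic profile, but the shape of the computation (relevance multiplied by a freshness discount) is the point here.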

In my personal experience, the quality of most results is excellent, and they are almost always posts that would interest me. Genieo quickly picked up my feed subscriptions from clicks I made in my reader to the full article in a browser window (from which it extracted the rss feed), and after a while I could see it gradually picking up on my favorite memes (search, social and others). I did not give up my rss reader for Genieo yet, and I also still have many little annoyances with it, but overall for an initial version, it works surprisingly well.

However, the target audience that is even more suited for Genieo is not rss-savvy users like me, but the masses out there who don’t know and don’t care about reading feeds. They just want interesting news, and they don’t mind missing out on the full list (a-la Dave Winer’s “River of News” concept). Such users will find tools like Genieo as useful as a personal news valet can be.

What is Facebook’s Endgame with Open Graph API?

On Thursday, Facebook outlined some of its platform roadmap plans for developers. One of the items on the long list was called the “Open Graph API“, and with such a name it was sure to raise some interest.

Details were scarce, but the general message coming out of Facebook is that the Open Graph API will allow any site to embed a Facebook page in it, allowing the site owner to set status messages, share links etc., without visiting Facebook itself, and more importantly without sending its visitors to Facebook.

That sounded like a feature aimed primarily at brands, or as Ethan Beard of Facebook presented it: “This will be good for brands like Coke.” Makes perfect sense, as these brands are already using Facebook as part of their social media efforts, but would prefer to have it done on their site rather than on Facebook itself.

Thinking deeper into where Facebook is heading, though, I would think there is a more major endgame to all this. We already know that Facebook wants us to consider it as our online identity. So it allows you to reuse that Facebook identity on other websites and sign in using Facebook Connect. That’s one side of the coin. And then the other side of it is, you have your own website or blog where you may publish thoughts, links and photos that you didn’t publish on Facebook. Facebook would clearly want to bridge that gap as well.

Half a year ago, Facebook adopted the emerging Activity Streams standard for publishing and consuming an individual’s lifestream events to lifestreaming frameworks, a standard promoted by open standards evangelist Chris Messina. So that fits nicely into the puzzle now: wouldn’t it be nicer if you could publish all this non-Facebook activity onto your Facebook page, which will now be embedded into your personal website, courtesy of the Open Graph API?

The API then is just the funnel through which your activity stream is published back into Facebook. You get to leverage the social graph you already defined and came to like on Facebook, and Facebook gets tighter integration with your life outside of Facebook, if you still had any. Smart move for Facebook.

Google Nails Down Social Search

Google’s Social Search is doing the walk, all the rest are just doing the talk. As soon as I activated the Social Search experiment, my next search yielded a social result. No setting up, showing how I am connected to that result (including friends of friends), showing as part of the standard web results…

Contrast this with Microsoft’s poor attempt at “social search” by indexing tweets and status messages and showing them regardless of the actual searcher (example search, you’ve got to be on “United States” locale on bing to see it).

Then also contrast it with Facebook’s announcement back in August of its implementation of searching within friends’ posts - a less grandiose announcement that nevertheless delivered a far more social experience than Bing’s. Still, it’s a very limited experience and far from being a true information source for any serious search need.

So how does Google overcome the main obstacle – collecting your connections?

Google relies on its own sources and on open sources it can obtain by crawling the social graph. That is the true reason why Facebook is not part of Google’s graph (no XFN/FOAF marking on Facebook’s public pages). Google may be counting on Facebook’s inevitable opening up, and with Gmail’s rising popularity it becomes a reasonable alternative even for Facebook users like me.

Sadly, all this great news gave zero credit to Delver, where it all happened first.

To Tweet or Not to Tweet (hint: that’s not the question)

I was catching up on my RSS overload the other day, when this side note in a post by Naaman on Social Media Multitaskers caught my attention:

“I find that I now blog thoughts that are too long to fit in a tweet; so feel free to follow my tweets…”

"I am the man. I suffered. I was there." CC by 'Kalense Kid'/Flickr

I’m not too much of a media multitasker myself, so I don’t experience this duality first hand, but I can imagine it: you get an interesting thought or experience, then you think is this major enough to develop into a blog post, for which I’ll go over here, or is it not that heavy / can’t be bothered, in which case I flutter my wings over there. Actually I do experience these, just that in the other case I simply drop it (and excuse me for not considering Facebook status updates an option, that’s stuff for another post…)

This should not have been a dilemma at all, had blogging platforms evolved to accommodate microblogging, which today is somehow seen as the centralized domain of a single commercial company. You really should be able to hop on your publishing platform, write that thought down, regardless of length, and fire it out. No need to figure out which channel to use, and whether the intended readers are indeed following you there. Similarly, your friends/readers should not have to register to your feeds on different platforms but rather consume one only, and rely on a powerful set of rules to filter your stream as they find fit.

Posterous is a great (and fast growing) example of how easy it can be from the blogger’s perspective. Just post it (or rather, email it) and it will get published as needed (e.g. shortened for twitter). But it does not make it any easier on the consumer, who still needs to decide where to best follow this blogger (does he perhaps write additional blog posts directly on his blog that won’t show up on his twitter? or vice versa?), and it reduces the basic filtering capability that may have existed when different post types were distributed into the different services.

No need to reinvent the wheel here, blogging platforms are abundant, decentralized and perfectly fit to remain our publishing hub, with their developed CMS and the loose but well-defined social networks. What blogging platforms should do – heck, what Automattic should do to evolve, is:

  1. Conversation - support the realtime conversational nature of short posts, with the right UI and notifications mechanisms. The “P2“ microblogging-optimized theme released almost two years ago was a good start, sadly it still followed the line of thought of “blog or microblog, not both”. To move forward, Automattic need to realize that Twitter is not a personality, it’s a state of mind, hence also P2 can’t be a permanent theme, it should be a contextual theme.
  2. Publishing – acquire Posterous. As simple as that. These guys got their fame by understanding the pains of publishing anytime anywhere, they know a thing or two on usability and persuasion, and they have great buzz. The latter is not luxury – a buzzed-up acquisition makes it very clear that this is a major strategy for you, a lot more than if you’d develop the same changes yourself.
  3. Consuming – that’s the tricky part… how do you embed Twitter and WordPress into the same stream, when each consumer has their own desired blend of it. We don’t want to invent a new technology, RSS is here to stay. We do want better ways of filtering our floods using better tagging coupled with more clever feed options. How exactly – I do hope there’s an entire team at Automattic working exactly on that…

    The Broken Web

    Dave Winer recently pointed out two trends that pose risk to user-created content on the web:

    • Over-reliance on url-shorteners. Fueled by twitter’s laconic style, more and more links to content are created using an indirection via url shortener services such as bit.ly and tr.im. The collapse of such a service may turn tons of links into broken links in an instant.
    • Centralized conversation platforms. Shifting the conversation away from their blogs, influential content publishers choose to center on platforms such as twitter and FriendFeed. Besides the increased noise inherent to lifestreaming, there is increased risk in making your contributions (and having your readers contribute back) in a site run by a private company with no real commitment to its users.

    In the past two weeks both these risks materialized to some extent. The url-shortener service tr.im shut down, and that 404-iceberg was avoided at the last minute by the owners’ decision to open-source it. Then Facebook acquired FriendFeed, and their PR said

    “…FriendFeed.com will continue to operate normally for the time being as the teams determine the longer term plans for the product.”

    Hmm, right… So Scoble’s blog still loves him, and is probably a safer publishing venue.

    But why is this such a big deal anyway?

    Broken web of intrigue, CC by 'Looking for a Lighthouse'/Flickr

    We tend to forget how much we have invested into such services until they break down (as was the case with ma.gnolia). The web’s strength is in storing and being able to search in the content produced by millions of earthlings. The impact of frailness of large amounts of content or links is significant. Especially for social search, that content could be vital (OK, perhaps except for that part about what you had for breakfast).

    As always with such issues, the best solution is decentralization. For url shorteners, the ‘shortlink’ protocol was already suggested for site-maintained shorteners, and WordPress has already implemented it. My blog is already enabled, try http://wp.me/plBAi-8Q.  And then content decentralization is in our hands. Think about it the next time you post your thoughts into twitter rather than in your blog…

    Friendly advice from your “Social Trust Graph”

    While scanning for worthy Information Retrieval papers in the recent SIGIR 2009, I came across a paper titled “Learning to Recommend with Social Trust Ensemble“, by a team from the Chinese University of Hong Kong. This one is about recommender systems, but putting the social element into text analytics tasks is always interesting (to me).

    The premise is an interesting one – using your network of trust to improve classic (Collaborative Filtering) recommendations. The authors begin by observing that users’ decisions are the balance between their own tastes, and those of their trusted friends’ recommendations.

    Figure 1 from "Learning to Recommend with Social Trust Ensemble" by Ma et al.

    Then, they proceed to propose a model that blends analysis of the classic user-item matrix, where ratings of items by users are stored (the common tool of CF), with analysis of a “social trust graph” that links the user to other users, and through them to their opinions on the items.

    This follows the intuition that when trying to draw a recommendation from the behavior of other users (which basically is what CF does), some users’ opinions may be more important than others’, whereas classic CF ignores that and treats all users as having identical importance.
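
The blending intuition can be sketched in a few lines. This is a deliberate simplification of my own and not the paper’s model, which is considerably more elaborate: here `cf_estimate` stands in for whatever a classic CF method predicts, the trust weights are given by hand, and the mixing parameter `alpha` is an assumed knob between “my own taste” and “my trusted friends’ opinions”.

```python
def trusted_opinion(user, item, ratings, trust):
    """Trust-weighted average of the ratings given to `item` by users that
    `user` trusts, or None if none of them rated it."""
    weighted_sum = 0.0
    total_weight = 0.0
    for friend, weight in trust.get(user, {}).items():
        rating = ratings.get(friend, {}).get(item)
        if rating is not None:
            weighted_sum += weight * rating
            total_weight += weight
    if total_weight == 0.0:
        return None
    return weighted_sum / total_weight


def blended_prediction(cf_estimate, user, item, ratings, trust, alpha=0.5):
    """Mix the CF estimate with the trusted friends' opinion.

    alpha=1.0 ignores the trust graph entirely; alpha=0.0 relies only on it.
    Falls back to the CF estimate when no trusted friend rated the item.
    """
    friends = trusted_opinion(user, item, ratings, trust)
    if friends is None:
        return cf_estimate
    return alpha * cf_estimate + (1 - alpha) * friends


# Toy data: alice trusts bob a lot and carol a little.
ratings = {"bob": {"camera": 5.0}, "carol": {"camera": 2.0}}
trust = {"alice": {"bob": 0.9, "carol": 0.1}}

print(blended_prediction(3.0, "alice", "camera", ratings, trust, alpha=0.5))
```

With `alpha` at 1.0 this degenerates to plain CF, which is exactly the “all users are equally important” behavior the authors argue against.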

    The authors show results that out-perform classic CF on a dataset extracted from Epinions. That’s encouraging for any researcher interested in the contribution of the social signal into AI tasks.

    free advice at renegade craft fair - CC Flickr/arimoore

    However, some issues bother me with this research:

    1. Didn’t the Netflix prize winning team’s approach (see previous post) “prove” that statistical analysis of the user-item matrix beats any external signal other teams tried to use? The answer here may be related to the sparseness of the Epinions data, which makes life very difficult for classic CF. Movie recommendations have much higher density than retail (Epinions’ domain).
    2. To evaluate, the authors sampled 80% or 90% of the ratings as training and the remaining as testing. But if you choose as training the data before the user started following someone, then test it after the user is following that someone, don’t you get a bit mixed up with cause and effect? I mean, if I follow someone and discover a product through his recommendation, there’s a high chance my opinion will also be influenced by his. So there’s no true independence between the training and test data…
    3. Eventually, the paper shows that combining two good methods (social trust graph and classic CF) outperforms each of the methods alone. The general idea of fusion or ensemble of methods is pretty much common knowledge for any Machine Learning researcher. The question should be (but it wasn’t) – does this specific ensemble of methods outperform any other ensemble? And does it fare better than the state of the art result for the same dataset?

    The last point is of specific interest to me, having combined keyword-based retrieval with concept-based retrieval in my M.Sc. work. I could easily show that the resulting system outperformed each of the separate elements, but to counter the above questions, I further tested combining other similarly high performing methods to show that the performance gain there was much lower, and also showed that the combination could take a state of the art result and further improve on it.

    Nevertheless, the idea of using opinions from people you know and trust (rather than authorities) in ranking recommendations is surely one that will gain more popularity, as social players start pushing ways to monetize the graphs they worked so hard to build…

    Bart Simpson working at Google??

    “Phone call for Al…Al Coholic…is there an Al Coholic here?”
    “Wait a minute… Listen, you little yellow-bellied rat jackass, if I ever find out who you are, I’m gonna kill you!”

    Sweet little Bart Simpson must have hacked his way into the training data the guys at Google Scholar are using. I was running a simple Google query for user manuals that Googlebot indexed at sears.com, and got these goodies in the results:

    For the perplexed readers, the image on the right is what the Google Scholar parser saw for the DVD result; it then assumed it’s an academic paper and desperately tried to find an author name. As Google freely admits, “…Automated extraction of information from articles in diverse fields can be tricky”. Yep.

    It gets even better: since there are many such “academic papers” with the same author name, Google clusters them together, even when the manuals are for different products. Try one of those “All xxx versions” links, e.g. this one, all by our good friend O. Instructions. Interested students are encouraged to proceed and find out the etymology of other fascinating author names such as R. Parts and NO. Model.

    And what about our old friend Al Coholic, you ask? Well, Google Scholar tells us he did actually publish something! But wait – 1877? Annals of the New York Academy of Sciences? Young Simpson, have you no shame, boy!?

    Owning the People Namespace

    Chris Messina is an interesting guy to follow. Sort of an “NGO celebrity” on the web, he’s known as an advocate for open standards and efforts such as OpenID, DISO and Microformats, and in the past also SpreadFirefox.

    One of the many issues Chris writes passionately about is our online identity. That little link I added to his name in the opening words of this post triggers an entire domain of debates, ideals and evil plans to take over the world. Should I have linked to his Facebook page? Or Twitter? Perhaps MySpace or even Google? All these companies beg us to choose them as our identity providers, so that we will let them be our companions when we visit other websites, thus helping their “social colonization” efforts.

    So in a way, those companies are trying to become the global people namespace. On the web I may be http://facebook.com/ofer.egozi, or http://linkedin.com/in/oferegozi etc., and as Dave Morin of facebook tweeted, “/ is the new @” (hence their PR extravaganza on vanity urls). Our identity is associated with the domain on that url, much as with our email domain.

    An interesting corollary I can suggest here is that the “commitment” of that company to your identity is reflected in the extra padding next to that ‘/’. Companies such as twitter and facebook say “profiles are not just another application for us, they ARE our application”, whereas others such as linkedin and google still interject a /in/ or /profiles/ in between, just in case something else becomes more important…

    So why not use his Facebook then? With social networks being such relatively new entities, we seem to forget the temporariness of a business organization. We also tend to forget that those network accounts are only as free as beer, and the organizations behind them can arbitrarily delete a user or change their policies any time, and your anchor on the web which you built over the years is suddenly at stake.

    Personal Anchor on the Web for Digital Identity

    My own identity is this blog. I own the domain, I maintain an OpenID on it using WordPress.com, and I can always decide to modify that identity, take it elsewhere or remove it altogether. The control over that identity, how it’s portrayed and used remains with me, even if many other aspects (think social graph) are still locked up elsewhere. That’s a start.

    Semantic Search using Wikipedia

    Today I gave my master’s thesis talk in the Technion as part of my master’s duties. Actually, the non-buzzword title is “Concept-Based Information Retrieval using Explicit Semantic Analysis”, but that’s not a very click-worthy post title :-)

    The whole thing went far better than I expected – the room was packed, the slides flew smoothly (and quickly too, luckily Shaul convinced me to add some spare slides just in case), and I ended up with over 10 minutes of Q&A (an excellent sign for a talk that went well…)

    Click to view on Slideshare

    BTW – does anybody have an idea how to embed slideshare into a hosted blog? Doesn’t seem to work…

    Apparently, Amazon invented the WWW

    Tim Berners Lee – behind you!

    Amazon has been selling stuff online since as far back as 1973, at least if you believe this:

    amazon1973

    In fact, Google lists over 51,000 pages with this date on Amazon. And mind you – it is the exact date September 4 1973, not a day less, nor a day more.

    Of course, some geeks may claim this has to do with some Amazon programmers’ default value, but the POSIX time for this date is just a boring 115945200, not some fun number like 1234567890. I prefer to attribute this to some evil Bezosish conspiracy theory, now I just need to figure out what it was.

    Suggestions?
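
For anyone who wants to play along, here is a quick way to check the POSIX arithmetic with Python’s standard library. One thing it shows is that the figure depends on the time zone: midnight UTC on September 4 1973 comes out as 115948800, while the 115945200 quoted above is exactly one hour earlier, which matches midnight in a UTC+1 zone. Which zone the original conversion used is my guess; the constant name `QUOTED` is just for illustration.

```python
from datetime import datetime, timedelta, timezone

QUOTED = 115945200  # the POSIX time quoted in the post

# Midnight of September 4 1973 in UTC.
utc_midnight = datetime(1973, 9, 4, tzinfo=timezone.utc)
print(int(utc_midnight.timestamp()))  # 115948800

# The quoted value, rendered back as a UTC date and time.
print(datetime.fromtimestamp(QUOTED, tz=timezone.utc))  # 1973-09-03 23:00:00+00:00

# Midnight of the same date in a fixed UTC+1 zone (an assumed zone).
plus_one = timezone(timedelta(hours=1))
local_midnight = datetime(1973, 9, 4, tzinfo=plus_one)
print(int(local_midnight.timestamp()))  # 115945200
```

Either way, neither number looks like a programmer’s default, so the conspiracy theory stands.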