The Alter Egozi

Facebook account is down. Is the Internet down?

February 6, 2010

My Facebook account was “temporarily unavailable due to site maintenance” today.

Seems like I’m far from the first person this happened to. It’s common enough to make it into Facebook’s FAQ.

So – no big deal, right? I just had to wait a while before uploading photos from today’s trip with the kids; a little annoyance, nothing more. Then I wanted to check in the meantime what’s up on another site. Guess what I used as a login there? Yep, Facebook Connect. No login for you!

Facebook may be getting away with it for now, as these “maintenance” downtimes don’t seem to have created negative buzz around Facebook Connect’s position as the identity of choice for many avid FB users. But watch as more of these incidents start raising awareness of the implications of relying on Facebook as an identity provider. We’ll then realize it’s another point of failure on our way to our favorite sites, and one with no simple workaround.

Truth be told, this is not a Facebook issue, it’s an issue for centralized identity providers. If WordPress.com were down, my OpenID identity would be down just the same. With unified identity comes a unified point of failure…


Yahoo Gives Up on Social Search

January 9, 2010 · 1 Comment

In an interview that strangely made headlines only in Indian tech blogs, Yahoo Research Labs’ Chief Prabhakar Raghavan declared that Yahoo will not replace its search with Bing. OK, the Yahoo-Microsoft deal is not really off, but the deal details turn out to imply that Yahoo will only use Microsoft search technology as the backend, and keep building its own smart front-end to it that will make use of Yahoo’s content assets. Raghavan says:

“Yahoo will not use Bing. Bing is a branded search engine that Microsoft is building on top of its search back-end and we will build our own search front-end on that same Microsoft back-end. It (using Bing) is not the case, at least as envisioned at the moment”

This actually makes perfect sense. Stop spending tons of resources on crawling and ranking in a futile war with Google, and focus on building the user experience over it, leveraging Yahoo’s advantage – content. Raghavan mentions scenarios that sound a lot like Yahoo shortcuts (that’s really old news) as one example of how to deliver a more complete experience over commodity search results.

The article then goes on to discuss the second focus for Yahoo, social applications, and mentions Microsoft’s tie-up with Facebook for access to the social graph. Raghavan is quoted as saying:

“Social networks are not just a place to hang out, but to get things done. It predates the web.. I’m not sure where the sweet spot is, we’re still doing research on it”

Also makes perfect sense. With Google as a common enemy, and Microsoft a Facebook partner, Yahoo may be better positioned to deliver social applications that leverage the de facto standard of the Facebook graph, rather than push its own failed networks.

So why is my post title suggesting what it’s suggesting??


There is one catch in sub-contracting your search results: you are now limited in what you can do with search ranking. The best you can do is re-rank the set of results Microsoft’s technology supplied you with before presenting it to the user. As I’ve pointed out in the past when talking about Delver’s technology, social (graph-based) search is a game that cannot be played by reranking, since it’s a classic long tail problem: the results your friends vouch for are rarely among the globally top-ranked ones the backend hands you. So when you can’t interfere with how search results are ranked, you also can’t deliver true social search, as Google recently did. One less social application Yahoo can build…
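To make the reranking limitation concrete, here is a toy sketch. Everything in it is made up for illustration (the documents, the scores, the friend names, the function names); it is not how Yahoo, Microsoft or Delver actually rank anything. It only shows that a front-end which can reorder the backend's top-k never gets to see a long-tail page, no matter how many friends endorsed it.

```python
# Toy illustration only: document names, scores and the friend set are made up.

def backend_top_k(docs, k):
    """What a subcontracted backend returns: the k globally best documents."""
    return sorted(docs, key=lambda d: d["global_score"], reverse=True)[:k]

def social_score(doc, friends):
    """Number of the searcher's friends who endorsed this document."""
    return len(friends & doc["endorsed_by"])

def social_order(docs, friends):
    """Order by social signal first, global score as a tie-breaker."""
    return sorted(
        docs,
        key=lambda d: (social_score(d, friends), d["global_score"]),
        reverse=True,
    )

def rerank_only(docs, friends, k):
    """Front-end that may only reorder what the backend handed over."""
    return social_order(backend_top_k(docs, k), friends)

def social_search(docs, friends, k):
    """Ranking with access to the whole index, long tail included."""
    return social_order(docs, friends)[:k]

DOCS = [
    {"id": "big-portal", "global_score": 0.95, "endorsed_by": set()},
    {"id": "news-site", "global_score": 0.90, "endorsed_by": set()},
    {"id": "wiki-page", "global_score": 0.85, "endorsed_by": {"dana"}},
    {"id": "friend-blog", "global_score": 0.10, "endorsed_by": {"alice", "bob"}},
]
FRIENDS = {"alice", "bob"}

if __name__ == "__main__":
    print([d["id"] for d in rerank_only(DOCS, FRIENDS, 3)])
    print([d["id"] for d in social_search(DOCS, FRIENDS, 3)])
```

With k=3 the reranker never sees "friend-blog" at all, while the full-index version puts it first.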


Searching for Faceted Search

December 26, 2009 · 2 Comments

Just finished reading Daniel Tunkelang’s recently published book on Faceted Search. I read Daniel’s blog (“The Noisy Channel”) regularly, and enjoy his good mix of IR practice with emphasis on Human-Computer Interaction (HCI). With faceted search tasks on the roadmap at work, I wanted to better educate myself on the topic, and this one looked like a good read, with the cover promising:

“… a self-contained treatment of the topic, with an extensive bibliography for those who would like to pursue particular aspects in more depth”

With 70 pages, the book reads quickly and smoothly. Daniel provides a fascinating intro to faceted search, from early taxonomies, to facets, to faceted navigation and on to faceted search. He adds an introductory chapter on IR, which is a worthwhile read even for IR professionals, with some interesting insights. One is how ranked retrieval, which we have all grown so accustomed to, blurred the once-clear border between relevant and non-relevant that set retrieval enforced. Daniel suggests that this issue is significant for faceted search, being a set-retrieval-oriented task, and a pingback on his blog led me to a fascinating elaboration on this pain in another fine search blog (recommended read!).

With such elaborate introductory chapters and more on faceted search history, not much is left though for the actual chapters on research and practice, and as a reader I felt there could be a lot more there. But then, it is reasonable to leave a lot to the reader and just give a taste of the challenges, to be later explored by the curious reader from the bibliography.

However, the promised extensive bibliography somewhat disappointed me… With 119 references, and only about a quarter being academic publications from the past 5 years, I felt a bit back to square one. I was hoping for more of a literature survey and pointers when discussing the techniques for those tough issues, such as how to choose the most informative facets for a given query or how to extract facets from unstructured fields. Daniel provides some useful tips on those, but reading more on these topics will require doing my own literature scan.

In any case, for a newcomer with little background in search in general and faceted search in particular, this book is an excellent introduction. Those more versed in classic IR who are moving into faceted search will find the book an interesting read, but probably not sufficient as a full reference.


The (Filtered) Web is Your Feed

December 12, 2009 · 1 Comment

A few months ago I was complaining here about my rss overload. A commenter suggested that I take a look at my6sense, a browser extension (now also iPhone app) that acts as a smart RSS reader, emphasizing the entries you should be reading. I wanted to give my6sense a go then, but the technical experience was lousy, and moreover – I was expected to migrate my rss reading to it. Too much of a hassle, I gave it up.

In the past few weeks I’ve been test-driving a new player – Genieo, which takes the basic my6sense idea a few steps further. Genieo installs an actual application, not just an extension, that plugs into your browser. It tracks your rss feeds automatically, simply by looking for rss feeds in the pages you’re browsing, and learns your feeds without any setup work.

Genieo then goes further to discover feeds on pages you visit even if you’re not subscribed to them, turning your entire browsing history into one big rss feed. It finally filters this massive pool of content using a semantic profile it builds for your interests, based on analyzing the text you’ve read so far.

For IR people this may sound a lot like Watson, Jay Budzik’s academic project turned contextual search turned an advertising technology acquisition. Watson approached this problem as a search problem: how do you formulate search queries that run in the background and fetch the documents most relevant to the user’s current context? The problem is, users are not constantly searching, and would quickly get annoyed at being shown general search results they did not ask for.

The good thing about an rss feed is that it explicitly says “this is a list of content items to be consumed from this source”, and its temporal nature provides a natural preference ranking (prefer recent items), so a heuristic of “users would be interested in recent and relevant items from feeds in pages they visited” works around the general search difficulty pretty well. Genieo circumvents the expected privacy outcry by running the entire logic on the client side; none of the analyzed data leaves your PC (privacy warriors would probably run sniffers to validate that).
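The “recent and relevant” heuristic can be sketched in a few lines. This is my own guess at the general shape, not Genieo’s actual algorithm: the word-weight profile, the 24-hour half-life and all function names are assumptions made up for the example.

```python
from datetime import datetime, timedelta

def relevance(text, profile):
    """Average profile weight of the words in the text (0.0 if none match)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(profile.get(w, 0.0) for w in words) / len(words)

def recency(published, now, half_life_hours=24.0):
    """Exponential decay: an item loses half its weight every half-life."""
    age_hours = max((now - published).total_seconds() / 3600.0, 0.0)
    return 0.5 ** (age_hours / half_life_hours)

def rank_items(items, profile, now):
    """Sort feed items by relevance multiplied by recency, best first."""
    return sorted(
        items,
        key=lambda it: relevance(it["title"], profile) * recency(it["published"], now),
        reverse=True,
    )

if __name__ == "__main__":
    now = datetime(2009, 12, 12, 12, 0)
    profile = {"search": 1.0, "social": 0.8}  # hypothetical interest profile
    items = [
        {"title": "celebrity gossip", "published": now - timedelta(hours=1)},
        {"title": "social search", "published": now - timedelta(hours=72)},
        {"title": "social search", "published": now - timedelta(hours=1)},
    ]
    for it in rank_items(items, profile, now):
        print(it["title"], it["published"])
```

Multiplying the two signals means an item has to be both fresh and on-topic to float up; a fresh but irrelevant item scores zero.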

In my personal experience, the quality of most results is excellent, and they are almost always posts that would interest me. Genieo quickly picked up my feed subscriptions from clicks I made in my reader to the full article in a browser window (from which it extracted the rss feed), and after a while I could see it gradually picking up on my favorite memes (search, social and others). I did not give up my rss reader for Genieo yet, and I also still have many little annoyances with it, but overall for an initial version, it works surprisingly well.

However, the target audience that is even more suited for Genieo is not the rss-savvy users like me, but the masses out there who don’t know and don’t care about reading feeds. They just want interesting news, and they don’t mind missing out on the full list (à la Dave Winer’s “River of News” concept). Such users will find tools like Genieo as useful as a personal news valet can be.


What is Facebook’s Endgame with Open Graph API?

October 31, 2009

On Thursday, Facebook outlined some of its platform roadmap plans for developers. One of the items on the long list was called the “Open Graph API”, and with such a name it was sure to raise some interest.

Details were scarce, but the general message coming out of Facebook is that the Open Graph API will allow any site to embed a Facebook page in it, allowing the site owner to set status messages, share links etc., without visiting Facebook itself, and more importantly without sending its visitors to Facebook.

That sounded like a feature aimed primarily at brands, or as Ethan Beard of Facebook presented it: “This will be good for brands like Coke.” Makes perfect sense, as these brands are already using Facebook as part of their social media efforts, but would prefer to have it done on their site rather than on Facebook itself.

Thinking deeper into where Facebook is heading, though, I would think there is a more major endgame to all this. We already know that Facebook wants us to consider it as our online identity. So it allows you to reuse that Facebook identity on other websites and sign in using Facebook Connect. That’s one side of the coin. And then the other side of it is, you have your own website or blog where you may publish thoughts, links and photos that you didn’t publish on Facebook. Facebook would clearly want to bridge that gap as well.

Half a year ago, Facebook adopted the emerging Activity Streams standard for publishing and consuming an individual’s lifestream events to lifestreaming frameworks, a standard promoted by open standards evangelist Chris Messina. So that fits nicely into the puzzle now: wouldn’t it be nicer if you could publish all this non-Facebook activity into your Facebook page, which will now be embedded into your personal website, courtesy of the Open Graph API?

The API then is just the funnel through which your activity stream is published back into Facebook. You get to leverage the social graph you already defined and came to like on Facebook, and Facebook gets tighter integration with your life outside of Facebook, if you still had any. Smart move for Facebook.


Google Nails Down Social Search

October 27, 2009 · 2 Comments

Google’s Social Search is walking the walk; all the rest are just talking the talk. As soon as I activated the Social Search experiment, my next search yielded a social result. There was no setup, it showed how I am connected to that result (including friends of friends), and it appeared as part of the standard web results…

Contrast this with Microsoft’s poor attempt at “social search” by indexing tweets and status messages and showing them regardless of the actual searcher (example search, you’ve got to be on “United States” locale on Bing to see it).

Then also contrast it with Facebook’s announcement back in August of its implementation of searching within friends’ posts – a less grandiose announcement that nonetheless delivered a far more social experience than Bing’s. Still, it’s a very limited experience and far from being a true information source for any serious search need.

So how does Google overcome the main obstacle – collecting your connections?

Google relies on its own sources and on open sources it can obtain by crawling the social graph. That is the true reason why Facebook is not part of Google’s graph (no XFN/FOAF marking on Facebook’s public pages). Google may be counting on Facebook’s inevitable opening up, and with Gmail’s rising popularity it becomes a reasonable alternative even for Facebook users like me.
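For readers who have never seen XFN in the wild: it is nothing more than `rel` values on ordinary links, which is what makes it crawlable. Below is a minimal sketch of pulling such edges out of a page with Python's standard `html.parser`. The sample URLs, the class and function names, and the choice to handle only a subset of the XFN values are my assumptions for the example; it says nothing about how Google's crawler actually works.

```python
from html.parser import HTMLParser

# A subset of the XFN relationship values, enough for the illustration.
XFN_RELS = {"friend", "acquaintance", "contact", "met", "co-worker", "colleague", "me"}

class XFNLinkParser(HTMLParser):
    """Collects (href, [xfn values]) pairs from <a rel="..."> links."""

    def __init__(self):
        super().__init__()
        self.edges = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        rels = set((attr.get("rel") or "").lower().split()) & XFN_RELS
        href = attr.get("href")
        if rels and href:
            self.edges.append((href, sorted(rels)))

def extract_xfn_edges(html):
    """Return the social-graph edges declared on a single page."""
    parser = XFNLinkParser()
    parser.feed(html)
    return parser.edges

if __name__ == "__main__":
    page = (
        '<a href="http://example.org/alice" rel="friend met">Alice</a>'
        '<a href="http://example.org/news">News</a>'
    )
    print(extract_xfn_edges(page))
```

A page with no such markup yields no edges, which is exactly the situation with Facebook's public pages described above.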

Sadly, all this great news gave zero credit to Delver, where it all happened first.


To Tweet or Not to Tweet (hint: that’s not the question)

September 20, 2009

I was catching up on my RSS overload the other day, when this side note in a post by Naaman on Social Media Multitaskers caught my attention:

“I find that I now blog thoughts that are too long to fit in a tweet; so feel free to follow my tweets…”

"I am the man. I suffered. I was there." CC by 'Kalense Kid'/Flickr

I’m not too much of a media multitasker myself, so I don’t experience this duality first hand, but I can imagine it: you get an interesting thought or experience, then you think: is this major enough to develop into a blog post, for which I’ll go over here, or is it not that heavy / can’t be bothered, in which case I flutter my wings over there? Actually I do experience these; it’s just that in the latter case I simply drop it (and excuse me for not considering Facebook status updates an option, that’s stuff for another post…)

This should not have been a dilemma at all, had blogging platforms evolved to accommodate microblogging, which today is somehow seen as the centralized domain of a single commercial company. You really should be able to hop on your publishing platform, write that thought down, regardless of length, and fire it out. No need to figure out which channel to use, and whether the intended readers are indeed following you there. Similarly, your friends/readers should not have to subscribe to your feeds on different platforms but rather consume only one, and rely on a powerful set of rules to filter your stream as they see fit.

Posterous is a great (and fast growing) example of how easy it can be from the blogger’s perspective. Just post it (or rather, email it) and it will get published as needed (e.g. shortened for twitter). But it does not make it any easier on the consumer, who still needs to decide where to best follow this blogger (does he perhaps write additional blog posts directly on his blog that won’t show up on his twitter? or vice versa?) and reduces the basic filtering capability that may have existed when different post types were distributed into the different services.

No need to reinvent the wheel here, blogging platforms are abundant, decentralized and perfectly fit to remain our publishing hub, with their developed CMS and the loose but well-defined social networks. What blogging platforms should do – heck, what Automattic should do to evolve, is:

  1. Conversation – support the realtime conversational nature of short posts, with the right UI and notification mechanisms. The “P2” microblogging-optimized theme released almost two years ago was a good start; sadly it still followed the line of thought of “blog or microblog, not both”. To move forward, Automattic needs to realize that Twitter is not a personality, it’s a state of mind, hence P2 also can’t be a permanent theme, it should be a contextual theme.
  2. Publishing – acquire Posterous. As simple as that. These guys got their fame by understanding the pains of publishing anytime anywhere, they know a thing or two about usability and persuasion, and they have great buzz. The latter is not a luxury – a buzzed-up acquisition makes it very clear that this is a major strategy for you, a lot more than if you’d develop the same changes yourself.
  3. Consuming – that’s the tricky part… how do you embed Twitter and WordPress into the same stream, when each consumer has their own desired blend of it? We don’t want to invent a new technology, RSS is here to stay. We do want better ways of filtering our floods using better tagging coupled with more clever feed options. How exactly – I do hope there’s an entire team at Automattic working exactly on that…


    The Broken Web

    August 18, 2009 · 2 Comments

    Dave Winer recently pointed out two trends that pose risk to user-created content on the web:

    • Over-reliance on url-shorteners. Fueled by twitter’s laconic style, more and more links to content are created using an indirection via url shortener services such as bit.ly and tr.im. The collapse of such a service may turn tons of links into broken links in an instant.
    • Centralized conversation platforms. Shifting the conversation away from their blogs, influential content publishers choose to center on platforms such as twitter and FriendFeed. Besides the increased noise inherent to lifestreaming, there is increased risk in making your contributions (and having your readers contribute back) on a site run by a private company with no real commitment to its users.

    In the past two weeks both these risks materialized to some extent. The url-shortener service tr.im shut down, and that 404-iceberg was avoided at the last minute by the owners’ decision to open-source it. Then Facebook acquired FriendFeed, and their PR said:

    “…FriendFeed.com will continue to operate normally for the time being as the teams determine the longer term plans for the product.”

    Hmm, right… So Scoble’s blog still loves him, and is probably a safer publishing venue.

    But why is this such a big deal anyway?

    Broken web of intrigue, CC by 'Looking for a Lighthouse'/Flickr

    We tend to forget how much we have invested into such services until they break down (as was the case with ma.gnolia). The web’s strength is in storing, and being able to search in, the content produced by millions of earthlings. When large amounts of content or links turn out to be this fragile, the impact is significant. Especially for social search, that content could be vital (OK, perhaps except for that part about what you had for breakfast).

    As always with such issues, the best solution is decentralization. For url shorteners, the ‘shortlink’ protocol was already suggested for site-maintained shorteners, and WordPress has already implemented it. My blog is already enabled, try http://wp.me/plBAi-8Q. And then content decentralization is in our hands. Think about it the next time you post your thoughts into twitter rather than in your blog…
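In practice the shortlink idea means a page advertises its own short URL in a `<link rel="shortlink">` element, so a client can pick it up from the page itself instead of asking a third-party shortener. Here is a minimal sketch of that lookup using Python's standard `html.parser`. The sample page and the names `ShortlinkFinder` and `find_shortlink` are hypothetical, made up for this example; a real client would also want to check HTTP `Link` headers, which this sketch ignores.

```python
from html.parser import HTMLParser

class ShortlinkFinder(HTMLParser):
    """Remembers the href of the first <link rel="shortlink"> on a page."""

    def __init__(self):
        super().__init__()
        self.shortlink = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.shortlink is not None:
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "shortlink" in rels and attr.get("href"):
            self.shortlink = attr["href"]

def find_shortlink(html):
    """Return the site-maintained short URL, or None if the page has none."""
    finder = ShortlinkFinder()
    finder.feed(html)
    return finder.shortlink

if __name__ == "__main__":
    # Hypothetical page head; only the short URL itself comes from the post.
    page = (
        "<html><head><title>The Broken Web</title>"
        "<link rel='shortlink' href='http://wp.me/plBAi-8Q' />"
        "</head><body>...</body></html>"
    )
    print(find_shortlink(page))
```

The design point is that the short URL lives and dies with the site it points to, so there is no extra service whose collapse can break the link.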


    Friendly advice from your “Social Trust Graph”

    July 28, 2009

    While scanning for worthy Information Retrieval papers in the recent SIGIR 2009, I came across a paper titled “Learning to Recommend with Social Trust Ensemble”, by a team from the Chinese University of Hong Kong. This one is about recommender systems, but putting the social element into text analytics tasks is always interesting (to me).

    The premise is an interesting one – using your network of trust to improve classic (Collaborative Filtering) recommendations. The authors begin by observing that users’ decisions are a balance between their own tastes and their trusted friends’ recommendations.

    Figure 1 from "Learning to Recommend with Social Trust Ensemble" by Ma et al.

    Then, they proceed to propose a model that blends analysis of the classic user-item matrix, where ratings of items by users are stored (the common tool of CF), with analysis of a “social trust graph” that links the user to other users, and through them to their opinions on the items.

    This follows the intuition that when trying to draw a recommendation from the behavior of other users (which basically is what CF does), some users’ opinions may be more important than others’, whereas classic CF ignores that and treats all users as having identical importance.
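The blend described above can be written down as a small prediction function: one part the user's own taste, one part a trust-weighted average of what their friends would think. This is only a simplified sketch of the idea, not the paper's training procedure; the latent factor vectors, the trust weights and the mixing weight `alpha` are all made-up values, and in the real model the factors would be learned from the ratings.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def predict(user, item, user_factors, item_factors, trust, alpha=0.5):
    """Blend the user's own predicted rating with trusted friends' predictions.

    alpha = 1.0 is classic CF (own taste only); alpha = 0.0 relies on friends only.
    """
    own = dot(user_factors[user], item_factors[item])
    friends = trust.get(user, {})
    total_trust = sum(friends.values())
    if total_trust == 0:
        return own  # nobody to trust: fall back to own taste
    social = sum(
        weight * dot(user_factors[friend], item_factors[item])
        for friend, weight in friends.items()
    ) / total_trust
    return alpha * own + (1 - alpha) * social

if __name__ == "__main__":
    # Hypothetical latent factors and trust graph, for illustration only.
    user_factors = {"me": [1.0, 0.0], "friend": [0.0, 1.0]}
    item_factors = {"camera": [2.0, 4.0]}
    trust = {"me": {"friend": 1.0}}
    print(predict("me", "camera", user_factors, item_factors, trust, alpha=0.5))
```

In this toy setup my own taste says 2.0, my friend's says 4.0, and an even blend lands in between.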

    The authors show results that outperform classic CF on a dataset extracted from Epinions. That’s encouraging for any researcher interested in the contribution of the social signal to AI tasks.

    free advice at renegade craft fair - CC Flickr/arimoore

    However, some issues bother me with this research:

    1. Didn’t the Netflix Prize winning team’s approach (see previous post) “prove” that statistical analysis of the user-item matrix beats any external signal other teams tried to use? The answer here may be related to the sparseness of the Epinions data, which makes life very difficult for classic CF. Movie recommendations have much higher density than retail (Epinions’ domain).
    2. To evaluate, the authors sampled 80% or 90% of the ratings as training and the remaining as testing. But if you choose as training the data before the user started following someone, then test it after the user is following that someone, don’t you get a bit mixed up with cause and effect? I mean, if I follow someone and discover a product through his recommendation, there’s a high chance my opinion will also be influenced by his. So there’s no true independence between the training and test data…
    3. Eventually, the paper shows that combining two good methods (social trust graph and classic CF) outperforms each of the methods alone. The general idea of fusion or ensemble of methods is pretty much common knowledge for any Machine Learning researcher. The question should be (but it wasn’t) – does this specific ensemble of methods outperform any other ensemble? and does it fare better than the state of the art result for the same dataset?

    The last point is of specific interest to me, having combined keyword-based retrieval with concept-based retrieval in my M.Sc. work. I could easily show that the resulting system outperformed each of the separate elements, but to counter the above questions, I further tested combining other similarly high-performing methods to show that the performance gain there was much lower, and also showed that the combination could take a state of the art result and further improve on it.

    Nevertheless, the idea of using opinions from people you know and trust (rather than authorities) in ranking recommendations is surely one that will gain more popularity, as social players start pushing ways to monetize the graphs they worked so hard to build…


    Bart Simpson working at Google??

    July 12, 2009

    “Phone call for Al…Al Coholic…is there an Al Coholic here?”
    “Wait a minute… Listen, you little yellow-bellied rat jackass, if I ever find out who you are, I’m gonna kill you!”

    Sweet little Bart Simpson must have hacked his way into the training data the guys at Google Scholar are using. I was running a simple Google query for user manuals that Googlebot indexed at sears.com, and got these goodies in the results:

    For the perplexed readers, the image on the right is what the Google Scholar parser saw for the DVD result; it then assumed this was an academic paper and desperately tried to find an author name. As Google freely admits, “…Automated extraction of information from articles in diverse fields can be tricky”. Yep.

    It gets even better: since there are many such “academic papers” with the same author name, Google clusters them together, even when the manuals are for different products. Try one of those “All xxx versions” links, e.g. this one, all by our good friend O. Instructions. Interested students are encouraged to proceed and find out the etymology of other fascinating author names such as R. Parts and NO. Model.

    And what about our old friend Al Coholic, you ask? Well, Google Scholar tells us he did actually publish something! But wait – 1877? Annals of the New York Academy of Sciences? Young Simpson, have you no shame, boy!?
