Mining Wikipedia, or: How I Learned to Stop Worrying and Love Statistical Algorithms

I took my first AI course during my first degree, in the early ’90s. Back then it was all about expert systems, genetic algorithms, and search/planning (A* anyone?). It all seemed clear, smart, intuitive, intelligent…

Then, by the time I got to my second degree in the late ’00s, the AI world had changed dramatically. Statistical approaches had taken over by storm, and massive amounts of data seemed to trump intuition and smart heuristics every time.

It took me a while to adjust, I admit, but by the time I completed my thesis I had come to appreciate the power of big data. I can now see this as an evolution, with heuristics and intuitions driving how we choose to analyze and process the data, even if afterwards it’s all “just” number-crunching.

On that note, I gave a talk at work today on extracting semantic knowledge from Wikipedia, relating it to our work on ESA and to how it illustrates the above. Enjoy!


The secret to Facebook’s growth?

Alteregozi.com has recently also been hit by the wondrous Facebook profile spam comments (I kept two specimens here and here, but have deleted many dozens more in the past weeks). At first, I was amused by this new type of spam comment, but after running a few searches I felt somewhat disgraced at being so late to the party, seeing mentions of these from more than a year ago.

So what’s the deal with these comments? They usually don’t include any links, aren’t selling anything, and some are really good comments. If you look at the two above, you’ll have a very hard time figuring out that they are not real. It looks like some spammers harvest comments from legitimate blogs, then classify your post to find the most similar comment to stick on it. So what is the motivation?

I don’t have the answers myself, but two thoughts:


  1. One spam fighting blog claims that the motivation is to establish the credibility of these accounts, so that they can later be used to sell likes on Facebook itself. The plot thickens…
  2. I’ve never seen an account repeat. The number of fake FB accounts being created is probably huge. How much of Facebook’s recent continued growth can be attributed to such fake accounts? Nothing you would hear about in Facebook’s earnings calls.


fakebook

Amazon, Apple, and Application Platforms

Apple is known for keeping a bustling legal department. Steve Jobs reportedly swore to “destroy Android”, the results of which Samsung has felt all too well.

But Apple has more enemies to fight. It holds a complicated relationship with Amazon, which now produces the second best-selling tablet after the iPad, claiming it already owns 22% of the US tablet market. That’s a lot of iPads that Apple isn’t selling, and so it is readying its own iPad mini in response.

A less familiar front in this battle is Apple’s “False Advertising” suit against Amazon with regard to the latter’s use of “App Store” for its Android-based application market. Amazon’s response ridiculed this claim, but this does raise the question – what exactly is Amazon’s app store all about?

Amazon’s Kindle store is one strange beast. Kindle apps are in fact re-purposed Android apps, with some added functionality. However, Amazon took care to clearly differentiate the Kindle’s UX and app store from the general Android market. So what is the justification for developing an extra Kindle app?

Every application development platform has its unique core capabilities, which developers can leverage for their own applications. Developers get to apply their creative ideas to these assets, while the platform owner enjoys increased engagement from its users, with apps taking these capabilities to places the platform did not even imagine. Facebook’s application platform revolved around the social graph, a unique and very valuable data asset, while Apple provided access to the iPhone’s unique (at the time) features such as its accelerometer and gyroscope, GPS, and camera.

Visiting the Amazon Kindle SDK site shows where Amazon feels it has the advantage: 1-click purchasing. This patented Amazon feature (a patent which Apple has actually licensed) can appeal to application developers who feel their application has premium features worth paying for, if only the payment were frictionless. Initial results seemed to validate that, showing excellent revenue per user on Amazon’s platform.

And so, Amazon’s platform says a lot about where Amazon feels its strength with the Kindle lies. Unlike Apple, Amazon builds its success in the tablet market on selling content, much more than on selling devices. Hence, expect the Kindle to continue beating the iPad on price even after the iPad mini launches.

Out of Context

Sponsored Stories are a brilliant advertising model by Facebook. Just like AdWords in 2000, it’s an example of a model that leverages the company’s core value for advertising, without compromising that value’s authenticity. If your friends liked Starbucks, they did so of their own free will and in a public forum, so having Starbucks pay to show this more prominently and to other users only makes sense.

So why is it, then, that a simple, amusing case of a 55-gallon drum of lubricant made so many bad headlines for Facebook?

And Facebook has more fronts to fight on in its battle to transform into a revenue-driven company. Timeline may be great for brands, but it’s a magnet for popular revolt. Besides resenting the no-alternative approach Facebook took, why are users so upset about the actual Timeline view, which is surely more visually appealing than the boring wall?

I find the answer to both relates to context.


For Sponsored Stories, it seems pretty clear. “Yes, I linked to a 55-gallon lubricant product, but I did so as a joke.” Well, sentiment analysis still has a long way to go with sarcasm, despite some recent advances right here at the Hebrew University. Sarcasm is one extreme example, but the missing context could even just be that you’re no longer a fan of that company you liked a month ago, and just haven’t gotten around to unliking it yet.

And what about Timeline? Isn’t it great that all your previous statuses and photos are there, organized along your timeline and telling your story? Well, it is, but only if you take care to ensure that it tells the story you really want to tell. The context of that story may depend on where we were, what we were up to at the time, who our friends were… some of which may not even be possible to reconstruct in the Timeline.

In addition, we are used to our stories dropping off the cliff of the page fold and disappearing into oblivion, so we don’t really bother to update them or remove those we no longer feel so proud of. Suddenly, they come back to haunt us with Timeline, and we have to scramble to adjust.

And a final associative thought: the tiled UX of Timeline does remind me of the Pinterest mania that has taken hold of every new social curation site. So why does this look like so much fun on Pinterest? Context again. Pinterest has none of it; it’s a pure fun/discovery experience, where each tile is independent and you’re not really trying to follow a thread, or catch up on all that you’ve missed since your last visit. For a social network, though, that would be, well, out of context.

Thoughts on Plus

So what’s the deal with Google+? Is Google really taking on Facebook? Is this a classic “me too” play, or something smarter?

It took me a while to form my opinion, but several interesting articles got the stars aligned just right for a split second to make some sense (until some new development soon de-aligns them again :-)).

Take a deep breath. OK, here it comes:

Google+ is Google’s take on Social.

Yes, I know, who would have thought?…
It’s just that Google’s definition of Social is a bit different.

At Facebook (and really, for most of us), Social is about conversations with people you actually know.
At Google, Social is the new alias for Personalization.

It’s pretty simple: Google’s business model has always meant the more I know about you, the better I can monetize through more targeted ads. At first, it was all about the search engine being where you always started your surfing, and Google was well seated. As traffic to social networks grew, culminating with Facebook overtaking Google in March 2010, it became increasingly clear that a larger portion of our information was starting to be served to us from social networks. Google was left out.

Why was that so important? Google still had tons of searches, an ever-growing email market share, and a successful news aggregator and RSS reader, among other assets. That’s quite a lot to know about us, isn’t it?

It turned out that the missing link was often the starting point. You would learn about the new thing, the new trend, the new gadget you wanted to get, while you were out of Google’s reach. By the time you got into the Google web, you may already have made up your mind on what you wanted to get, and even where, making Google’s ads a lot less effective.

The Follow-versus-Friend model is also a huge issue. It means that G+ is about self-publishing and positioning yourself, not about conversations. That suits Google very well, and is not just a differentiation from FB. This model drives you to follow based on interest, building an interest graph rather than a social graph, which is a lot more useful for profiling you than your social connections are.

That interest graph, in turn, makes sure your first encounter with those things that make you tick is inside the Google web. It also links back well to the fine assets that Google holds today, from your docs to your publishing tools. So when Google News announces those funny badges, and you may have thought “Heh, who would want to put these stinking badges on their profiles…” – think again. Their private nature is just fine for Google. It’s a way to ask you to validate your inferred interests: “So tell us, is that interest of yours in US politics that we have inferred from your news reading a real inherent interest, or was it just a transient interest that will melt away after the election?“. Again – big difference for profiling.

Finally, Google+ is positioned to be a professional network. Focusing on interests, and letting anyone follow you, will keep away the teens and lure in the self-proclaimed professionals. In that sense, LinkedIn may have more reason for concern, at least as the content network it is now trying to be. It’s quite likely that G+ does not even aim to unseat Facebook, only to dry up its professional appeal, leaving it with what we started with: party and kids photos, and keeping track of what those old friends are up to.

I guess I already know what network I’ll be posting a link to this post to…

Farewell Academia

My Master’s thesis (presented here) was finally published in the April issue of TOIS. Good time to recap my second academic adventure.

Six years ago, when I was considering graduate studies (10 years after graduating with my B.Sc.), I was CTO at a company at a crossroads, which led to very short-term product and technology thinking. Looking for a change, I felt the academic world offered a space where deep, broad thinking was preferred over nearsighted goals. So I scaled back my position at work, and took up studies back at the Technion.

I finished the required courses in a year and a half, but the thesis took much longer. Friends warned me that it’s difficult to context-switch between work and research, not to mention family, and they were indeed right. Still, I wanted to experience academic life again, and figure out whether I wanted to pursue it full time and continue to a PhD.

The conclusion gradually distilled into a resounding No. I’ll stop at a Master’s. One reason was my allergic reaction to too much math, so prevalent at the Technion, but there was also something deeper. I realized that user experience is where I’m at, and core computer science research is far from it, except perhaps in HCI departments.

There is a significant gap between the cutting edge in academia and in practice. A paper may be worth publishing thanks to a statistically significant improvement of 5% in relevance (see the major interest around the Netflix Prize), whereas actual users will barely feel the difference. On the other hand, stuff that is considered “commodity” in the academic world can make big waves if implemented well in industry, and for good reason. Companies have built a major user following (and a fortune…) just by doing excellent and usable implementations of basic CS algorithms.

So if I have to choose between making the research community happy or making end users happy, I definitely choose the latter. Perhaps I’ll go back to do my PhD in another 10 years, but until then, it’s Farewell Academia!

Evaluating algorithms’ quality

As part of a “creativity dojo” we had at work, I finally got to implement something I’ve long felt was needed in our QA: a framework for evaluating algorithms’ quality.

Living on the seam between algorithm development and product management over the past few years, I’ve come to appreciate the need to evaluate not just that an algorithm works, but that it works well. A search engine may return results that contain the keywords, but are these the most relevant ones? A recommendation algorithm may return products that are related to the user in some way, but can they be considered “good” recommendations?

During my master’s studies I came to know the work done at TREC, and was fascinated by the strong emphasis on what we developers often skim over: evaluating result quality statistically, and moreover analyzing the evaluation method itself to ensure that it is sound. With that approach in mind, I teamed up with our talented QA team to create a working framework in two days. Here are some lessons and tips learned along the way that could be useful for others trying to achieve a similar feat:

  1. Create a generic tool. TREC is mostly about search; however, with some imagination, most AI algorithms can be reduced to similar building blocks. Search, recommendation, classification: all can eventually be reduced to taking an input and returning a ranked list of results, on which the same quality metric can be applied. Code-wise, we used a generic scoring class, with a wrapping interface that has different implementations for different algorithms to provide the varying context.
  2. Use large data. This may sound trivial in the academic world, but when you’re in a QA state of mind, you sometimes tend to get used to creating small worlds that are easy to control. Not here. It’s very important to simulate real-life user scenarios by using data similar to production, so we used our integration environment, which replicates production data.
  3. Facilitate judging. Obtaining relevance judgments is crucial to getting useful tests. The customer here is a business owner / product manager, who may not appreciate the tedious task of rating results. We created a browser plugin that allows rating from within the actual results page, and accumulates those ratings in a per-test relevance file.
  4. Measure test staleness. The downside of using non-controlled data is that it pulls the rug out from under your feet: data may change over time, and your test may become less relevant. We used Buckley’s Binary Preference (bPref) measure, which functions well with incomplete judgments, and also introduced a weighted measure of how many unjudged results are found, to trigger a test failure when results become too unreliable (requiring another judging round).
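To make points 1 and 4 concrete, here is a minimal Python sketch of the two measures involved: bPref as defined by Buckley and Voorhees, which skips unjudged documents, and a simple unjudged-results fraction as a staleness signal. The function names, document IDs, and the unweighted staleness variant are illustrative, not our actual framework code:

```python
def bpref(ranking, relevant, nonrelevant):
    """Buckley & Voorhees' Binary Preference measure.

    Each retrieved relevant document is penalized by the fraction of
    judged-nonrelevant documents ranked above it; unjudged documents
    are simply skipped, which is what makes bPref robust to
    incomplete judgments.
    """
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    denom = min(R, N)
    nonrel_seen = 0
    score = 0.0
    for doc in ranking:
        if doc in nonrelevant:
            nonrel_seen += 1
        elif doc in relevant:
            # Full credit if there are no judged-nonrelevant docs at all.
            score += 1.0 if denom == 0 else 1.0 - min(nonrel_seen, denom) / denom
        # Unjudged docs fall through untouched.
    return score / R


def unjudged_fraction(ranking, relevant, nonrelevant):
    """Share of returned results with no judgment at all.

    When this crosses a threshold, the test is too stale to trust and
    should fail, triggering another judging round.
    """
    if not ranking:
        return 0.0
    judged = relevant | nonrelevant
    return sum(1 for doc in ranking if doc not in judged) / len(ranking)


if __name__ == "__main__":
    ranking = ["d1", "d2", "d3", "d4"]   # results as returned by the algorithm
    relevant = {"d1", "d3"}              # judged relevant
    nonrelevant = {"d2"}                 # judged nonrelevant; d4 is unjudged
    print(bpref(ranking, relevant, nonrelevant))             # 0.5
    print(unjudged_fraction(ranking, relevant, nonrelevant)) # 0.25
```

In the example, d1 contributes a full point (no nonrelevant docs above it), while d3 is fully penalized by the single nonrelevant doc d2 ranked above it; d4, being unjudged, affects only the staleness fraction.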