Tag Archives: Research

Microsoft Israel ReCon 2015 (or: got to start blogging more often…)

Yes, two consecutive posts on the same annual event are not a good sign for my virtual activity level… point taken.

So two weeks ago, Microsoft Israel held its second ReCon conference on recommendations and personalization, turning its fine 2014 start into a tradition worth waiting for. This time it was more condensed than last year (good move!) and just as interesting. Here are three highlights I found worth reporting on:

Uri Barash of the hosting team gave the first keynote, on Cortana integration in Windows 10, talking about the challenges and principles involved. Microsoft places a high emphasis on the user's trust, hence Cortana does not use any interests that are not explicitly written in Cortana's notebook and validated by the user. If indeed correct, that's somewhat surprising, as it limits recommendation quality and, moreover, the discovery experience of picking up potential interests from the user's activity. I'd still presume that all these implicit interests are used behind the scenes, to optimize the content served for the explicit interests.

IBM Haifa Research Labs have been working for some years now on enterprise social networks, and on mining connections and knowledge from such networks. At ReCon this year, Roy Levin presented a paper to be published in SIGIR'15, titled "Islands in the Stream: A Study of Item Recommendation within an Enterprise Social Stream". In the paper, they discuss a personalized newsfeed feature included in IBM's enterprise social network "IBM Connections", and provide some background as well as the personalized ranking logic for the feed items.

They then move on to describe a survey they conducted among users of the product, to analyze their opinions on specific items recommended to them in their newsfeed, similar to Facebook's newsfeed surveys. Through these surveys, the IBM researchers attempted to identify correlations between various feed-item factors, such as post and author popularity, the post's personalization score, how surprising an item may be to a user and how likely a user is to want such serendipity, etc. The actual findings are in the paper, but what may be even more interesting is the paper's deep dissection of the ranking model's internal workings.

Another interesting talk was by Roy Sasson, Chief Data Scientist at Outbrain. Roy delivered a fascinating talk about learning from a lack of signals. He began with an outline of general measurement pitfalls, demonstrating them on Outbrain widgets when analyzing low numbers of clicks on recommended items. Was the widget visible to the user? Where was it positioned on the page (areas of blindness)? What items were next to the analyzed item? Were they clicked? And so on.

Roy then proceeded to talk about what we may actually be able to learn from a lack of sharing to social networks. We all know that content that gets shared a lot on social networks is considered viral, driving a lot of discussion and engagement. But what about content that gets practically no sharing at all? And more precisely, what kind of content gets a lot of views but no sharing? Well, if you hadn't guessed already, that will likely be content users are very interested in seeing but would not admit to, namely provocative and adult material. So in a way, leveraging this reverse correlation helped Outbrain automatically identify porn and other sensitive material. This was then not used to filter all of this content out – after all, users do want to view it… but it was used to make sure that the recommendation strip includes only 1-2 such items, so they don't take over the widget and make it seem like this is all Outbrain has to offer. Smart use of data indeed.
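To make the idea concrete, here is a minimal sketch of how such a heuristic could look. The thresholds, field names and strip-capping logic are my own assumptions for illustration, not Outbrain's actual pipeline.

```python
# Hypothetical sketch (not Outbrain's actual pipeline): flag items with many
# views but almost no social shares, then cap how many such items appear in
# a single recommendation strip.

def flag_sensitive(items, min_views=10_000, max_share_rate=0.0005):
    """Mark items whose share rate is suspiciously low given heavy viewing."""
    flagged = set()
    for item_id, stats in items.items():
        views, shares = stats["views"], stats["shares"]
        if views >= min_views and shares / views <= max_share_rate:
            flagged.add(item_id)
    return flagged

def build_strip(ranked_item_ids, flagged, strip_size=6, max_flagged=2):
    """Fill the widget from the ranked list, allowing at most `max_flagged`
    high-view/low-share items so they never take over the strip."""
    strip, flagged_used = [], 0
    for item_id in ranked_item_ids:
        if item_id in flagged:
            if flagged_used >= max_flagged:
                continue
            flagged_used += 1
        strip.append(item_id)
        if len(strip) == strip_size:
            break
    return strip

if __name__ == "__main__":
    stats = {
        "a": {"views": 50_000, "shares": 3},     # heavily viewed, never shared
        "b": {"views": 40_000, "shares": 900},
        "c": {"views": 20_000, "shares": 2},
        "d": {"views": 8_000,  "shares": 50},
    }
    flagged = flag_sensitive(stats)
    print(build_strip(["a", "c", "b", "d"], flagged, strip_size=3, max_flagged=1))
```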

Microsoft Israel ReCon 2014

Microsoft Israel R&D Center held its first Recommendations Technology conference, ReCon, today. With an interesting agenda and a location that's just across the street from my office, I could not skip this one… here are some impressions from the talks I found worth mentioning.

The first keynote speaker was Joseph Sirosh, who leads the Cloud Machine Learning team at Microsoft, having recently joined from Amazon. Sirosh may have aimed low, not knowing what his audience would be like, but as a keynote this was quite a disappointing talk, full of simplistic statements and buzzwords. I guess he lost me when he stated quite decisively that the big difference about putting your service on the cloud is that it will get better the more people use it. Yeah.

Still, there were also some interesting observations he pointed out, worth mentioning:

  • If you’re running a personalization service, benchmarking against most popular items (i.e. Top sellers for commerce) is the best non-personalized option. Might sound trivial, but when coming from an 8-year Amazon VP, that’s a good validation
  • “You get what you measure”: what you choose to measure is what you’re optimizing, make sure it’s indeed your weakest links and the parts you want to improve
  • Improvement depends on being able to run a large number of experiments, especially when you’re in a good position already (the higher you are, the lower your gains, and the more experiments you’ll need to run to keep gaining)
  • When running these large numbers of experiments, good collaboration and knowledge sharing becomes critical, so different people don’t end up running the same experiments without knowing of each other’s past results
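Here is that minimal popularity-baseline sketch; the data shape, the hit-rate yardstick and all names are mine, made up purely for illustration.

```python
# A minimal "top sellers" popularity baseline of the kind mentioned in the
# first bullet above, plus a simple hit-rate yardstick for comparing a
# personalized model against it.
from collections import Counter

def top_popular(interactions, n=10):
    """interactions: iterable of (user_id, item_id) events."""
    counts = Counter(item for _, item in interactions)
    return [item for item, _ in counts.most_common(n)]

def hit_rate(recommend, test_pairs):
    """Fraction of held-out (user, item) pairs whose item appears in the
    user's recommendation list."""
    hits = sum(1 for user, item in test_pairs if item in recommend(user))
    return hits / len(test_pairs) if test_pairs else 0.0

if __name__ == "__main__":
    train = [("u1", "book"), ("u2", "book"), ("u2", "toy"),
             ("u3", "book"), ("u3", "game")]
    test = [("u1", "toy"), ("u2", "game")]
    popular = top_popular(train, n=2)
    print("baseline hit rate:", hit_rate(lambda u: popular, test))
```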

Elad Yom-Tov from Microsoft Research described work his team did on enhancing collaborative filtering using browsing logs. They experimented with adding users' browsing logs (visited URLs) and search queries to the CF matrix in various ways, to help bootstrap users with little data and to better identify these users' short-term (recent) intent.

An interesting observation they reached was that using the raw search queries as matrix columns worked better than trying to generalize or categorize them, although intuitively one would expect generalization to reduce the sparsity of such otherwise very long-tail attributes. It seems that the potential gain from reducing sparsity is offset by the loss of the specificity and granularity of the original queries.
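As a rough illustration of the augmentation idea (my reading of the talk, not the authors' code), here is a sketch that appends raw queries and URLs as extra binary columns next to the item ratings and then finds user neighbors over the combined matrix; all names and values are invented.

```python
# Sketch: treat each raw query / visited URL as an extra column alongside the
# items, so users with little item history can still get meaningful neighbors.
import numpy as np

def build_augmented_matrix(users, item_ratings, side_signals):
    """item_ratings: {user: {item: rating}}; side_signals: {user: set of raw
    queries / URLs}. Returns (matrix, column index)."""
    items = sorted({i for r in item_ratings.values() for i in r})
    signals = sorted({s for sigs in side_signals.values() for s in sigs})
    columns = {c: j for j, c in enumerate(items + signals)}
    m = np.zeros((len(users), len(columns)))
    for u_idx, user in enumerate(users):
        for item, rating in item_ratings.get(user, {}).items():
            m[u_idx, columns[item]] = rating
        for signal in side_signals.get(user, set()):
            m[u_idx, columns[signal]] = 1.0   # binary presence for queries/URLs
    return m, columns

def cosine_neighbors(m, u_idx):
    """Rank all other users by cosine similarity to user u_idx."""
    norms = np.linalg.norm(m, axis=1) + 1e-9
    sims = (m @ m[u_idx]) / (norms * norms[u_idx])
    sims[u_idx] = -1.0
    return np.argsort(-sims)

if __name__ == "__main__":
    users = ["alice", "bob", "carol"]
    ratings = {"alice": {"m1": 5}, "bob": {"m1": 4, "m2": 5}, "carol": {"m2": 3}}
    queries = {"alice": {"best sci-fi movies"}, "carol": {"best sci-fi movies"}}
    m, _ = build_augmented_matrix(users, ratings, queries)
    print(cosine_neighbors(m, 0))   # alice's nearest neighbors
```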


Another related talk that outlined an interesting way to augment CF was by Haggai Roitman of IBM Research. Haggai suggested "user uniqueness" – the extent to which the user follows the crowd or deliberately looks for esoteric choices – as a valuable signal in recommendations. This uniqueness would then determine whether to serve the user with results that are primarily popularity-based (e.g. CF) or personalized (e.g. content-based), or a mix of the two.
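Here is a hedged toy version of that blending idea as I understood it; the uniqueness estimate, the weighting scheme and all names are my own assumptions, not Haggai's actual model.

```python
# Illustration only: estimate how far a user's choices are from the crowd's,
# and use that to weigh a popularity-based score against a personalized one.

def uniqueness(user_items, item_popularity):
    """1.0 means the user consistently picks long-tail items, 0.0 means they
    follow the crowd. item_popularity values are in [0, 1]."""
    if not user_items:
        return 0.0
    return 1.0 - sum(item_popularity.get(i, 0.0) for i in user_items) / len(user_items)

def blended_score(item, user_items, item_popularity, personalized_score):
    """Mix popularity and personalized scores according to user uniqueness."""
    u = uniqueness(user_items, item_popularity)
    return (1.0 - u) * item_popularity.get(item, 0.0) + u * personalized_score(item)

if __name__ == "__main__":
    popularity = {"hit_song": 0.9, "indie_track": 0.1, "b_side": 0.05}
    history = ["indie_track", "b_side"]          # a long-tail listener
    personalized = lambda item: {"hit_song": 0.2, "indie_track": 0.8}.get(item, 0.0)
    for item in ("hit_song", "indie_track"):
        print(item, round(blended_score(item, history, popularity, personalized), 3))
```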

The second keynote was by Ronny Lempel of Yahoo! Labs in Haifa. Ronny talked about multi-user devices, in particular smart TVs, and how recommendations should take into account the user who is currently in front of the device (even though this information is not readily available). The heuristic his team used was that the audience usually doesn't change between consecutive programs watched, so using the last program as context for recommending the next program helps model that unknown audience.

Their results indeed showed a significant improvement in recommendation effectiveness when using this context. Another interesting observation was that using a random item from the history, rather than the last one, actually made the recommendations perform worse than no context at all. That's an interesting result, as it validates the assumption that approximating the right audience is valuable: if you make recommendations to the parent watching in the evening based on the children's programs from the afternoon, you are likely to do worse than with no context at all.
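A toy version of that last-program heuristic might look like the sketch below, which just conditions the next-program suggestion on co-watch counts; the data and function names are assumed for illustration and are not the Yahoo! Labs implementation.

```python
# Condition the next-program recommendation on the last program watched on the
# device, using simple "watched next" counts gathered from viewing sessions.
from collections import defaultdict

def build_transition_counts(watch_sessions):
    """watch_sessions: list of per-device program sequences in watch order."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in watch_sessions:
        for prev, nxt in zip(session, session[1:]):
            counts[prev][nxt] += 1
    return counts

def recommend_next(last_program, counts, n=3):
    """Return the n programs most often watched right after `last_program`."""
    followers = counts.get(last_program, {})
    return sorted(followers, key=followers.get, reverse=True)[:n]

if __name__ == "__main__":
    sessions = [
        ["cartoon_a", "cartoon_b", "news"],
        ["cartoon_a", "cartoon_b", "cartoon_c"],
        ["news", "drama", "late_show"],
    ]
    counts = build_transition_counts(sessions)
    print(recommend_next("cartoon_a", counts))   # children's afternoon context
    print(recommend_next("news", counts))        # evening/adult context
```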


The final presentation was by Microsoft's Hadas Bitran, who presented and demonstrated Windows Phone's Cortana. Microsoft go out of their way to describe Cortana as friendly and non-creepy, and yet the introductory video from Microsoft that Hadas presented somehow managed to include a scary robot (from Halo, I presume), dramatic music, and Cortana saying "Now learning about you". Yep, not creepy at all.

Hadas did present Cortana's context-keeping conversation, which looks pretty cool: follow-up questions that referred back to previous questions and answers were handled nicely by Cortana (all in a controlled demo, of course). Interestingly, this even seemed to work too well – after getting Cortana's list of suggested restaurants, Hadas asked Cortana to schedule a spec review, and Cortana insisted again and again on booking a table at the restaurant instead… Nevertheless, I can say the demo actually made the option of buying a Windows Phone pass through my mind, so it does do the job.

All in all, it was an interesting and well-organized conference, with a good mix of academia and industry, a good match to IBM’s workshops. Let’s have many more of these!

Mining Wikipedia, or: How I Learned to Stop Worrying and Love Statistical Algorithms

I took my first AI course during my first degree, in the early 90’s. Back then it was all about expert systems, genetic algorithms, search/planning (A* anyone?). It all seemed clear, smart, intuitive, intelligent…

Then, by the time I got to my second degree in the late 00's, the AI world had changed dramatically. Statistical approaches had taken over by storm, and massive amounts of data seemed to trump intuition and smart heuristics every time.

It took me a while to adjust, I admit, but by the time I completed my thesis I had come to appreciate the power of big data. I can now better see this as an evolution, with heuristics and intuitions driving how we choose to analyze and process the data, even if afterwards it's all "just" number-crunching.

So on this note, I gave a talk today at work on extracting semantic knowledge from Wikipedia, relating it also to our work on ESA and to how it illustrates the above. Enjoy!

 

Farewell Academia

My Master’s thesis (presented here) was finally published in the April issue of TOIS. Good time to recap my second academic adventure.

Six years ago, when I was considering graduate studies (10 years after finishing my B.Sc.), I was CTO at a company that was at a crossroads, which led to very short-term product and technology thinking. Looking for a change, I felt the academic world offered a space where deep, broad thinking was preferred over nearsighted goals. So I scaled back my position at work and took up studies back at the Technion.

I finished the needed courses in a year and a half, but the thesis took much longer. Friends warned me it’s difficult to context switch between work and research, not to mention family, and they were indeed right. Still, I wanted to feel the academic life again, and figure out if I wanted to pursue it full time and continue to a PhD.

The conclusion gradually distilled into a resounding No: I'll stop at the Master's. One reason was my allergic reaction to too much maths, so prevalent at the Technion, but there was also something deeper. I realized that user experience is where I'm at, and core computer science research is far from it, except perhaps in HCI departments.

There is a significant gap between the cutting edge in academia and in practice. A paper may be worth publishing due to a statistically significant improvement of 5% in relevance (see the major interest around the Netflix Prize), whereas actual users will barely feel the difference. On the other hand, stuff that is considered "commodity" in the academic world can make big waves if implemented well in industry, and for good reason. Companies have built a major user following (and a fortune…) just by doing excellent and usable implementations of basic CS algorithms.

So if I have to choose between making the research community happy or making end users happy, I definitely choose the latter. Perhaps I'll go back to do my PhD in another 10 years, but until then, it's Farewell Academia!

Evaluating algorithms’ quality

As part of a "creativity dojo" we had at work, I finally got to implement something I've long felt was needed in our QA – a framework for evaluating algorithms' quality.

Living on the seam between algorithm development and product management over the past few years, I've come to appreciate the need to evaluate not just that something works, but that it works well. A search engine may return results that contain the keywords, but are these the most relevant ones? A recommendation algorithm may return products that are related to the user in some way, but can they be considered "good" recommendations?

During my master's studies I came to know the work done over at TREC, and was fascinated by the strong emphasis on what we developers often skim over – evaluating result quality statistically, and moreover analyzing the evaluation method itself to ensure that it is sound. So with that approach in mind, I teamed up with our talented QA team to create a working framework in two days. Here are some lessons and tips learned along the way that could be useful for others trying to achieve a similar feat:

  1. Create a generic tool. TREC is mostly about search; however, with some imagination, most AI algorithms can be reduced to similar building blocks. Search, recommendation, classification – all could eventually be reduced to taking an input and returning a ranked list of results, on which the same quality metric can be applied. Code-wise, we used a generic scoring class, with a wrapping interface that has different implementations for different algos to provide the varying context.
  2. Use large data. This may sound trivial in the academic world, but when you're in a QA state of mind, you sometimes tend to get used to creating small worlds that are easy to control. Not here. It's very important to simulate real-life user scenarios by using data that's similar to production, so we used our integration environment, which replicates production data.
  3. Facilitate judging. Obtaining relevance judgments is crucial to getting useful tests. The customer here is a business owner / product manager, who may not appreciate the tedious task of rating results. We created a browser plugin that allows rating from within the actual results page, and accumulates those ratings in a per-test relevance file.
  4. Measure test staleness. The downside of using non-controlled data is that it pulls the rug out from under your feet: data may change over time and your test may become less relevant. We used Buckley's Binary Preference (bPref) measure, which functions well with incomplete judgments, and also introduced a weighted measure of how many unjudged results are found, to trigger a test failure when results become too unreliable (requiring another judging round). A condensed sketch of these two measures follows this list.
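The sketch below is reconstructed from memory rather than copied from our actual code: a common formulation of bPref over a ranked list with incomplete judgments, plus a crude rank-weighted staleness check whose weighting and threshold are my own assumptions.

```python
# bPref with incomplete judgments, plus a "staleness" check that fails the
# test when too many returned results have no judgment at all.

def bpref(ranked_ids, judgments):
    """judgments: {doc_id: True (relevant) / False (non-relevant)}; docs
    missing from the dict are unjudged and simply ignored, which is what makes
    bPref robust to incomplete judgments."""
    relevant = {d for d, rel in judgments.items() if rel}
    nonrelevant = {d for d, rel in judgments.items() if not rel}
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    score, nonrel_seen = 0.0, 0
    for doc in ranked_ids:
        if doc in nonrelevant:
            nonrel_seen += 1
        elif doc in relevant:
            score += (1.0 - min(nonrel_seen, R) / min(R, N)) if N else 1.0
    return score / R

def staleness(ranked_ids, judgments, rank_weight=lambda i: 1.0 / (i + 1)):
    """Weighted share of unjudged results, weighting the top ranks more; a
    value above some threshold should fail the test and trigger re-judging."""
    weights = [rank_weight(i) for i in range(len(ranked_ids))]
    unjudged = [w for w, d in zip(weights, ranked_ids) if d not in judgments]
    return sum(unjudged) / sum(weights) if weights else 0.0

if __name__ == "__main__":
    judged = {"d1": True, "d2": False, "d3": True, "d4": False}
    results = ["d2", "d1", "d9", "d3", "d8"]      # d9, d8 are unjudged
    print("bpref:", round(bpref(results, judged), 3))
    assert staleness(results, judged) < 0.5, "too many unjudged results - re-judge"
```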

Vanity Press or Monopoly Busters?

A few months ago, I got an email from "iConcept Press" inviting me to write a book chapter in their IR journal based on my AAAI paper. I ignored it, like I ignored another email in a similar vein from another "publishing house", and found at least one blogger who was just as suspicious of this seemingly mass solicitation.

You see, in academia we are conditioned to believe that the lower the chances of acceptance, the better the publishing venue; so if you're willing to accept me into your club right from the start – huh, forget it!

A couple of weeks ago I got another mail from them. This time, the happy bunch invited me to be a reviewer on one of their books. Now, that was really amusing – if not a writer, then I'd be a reviewer? Pathetic, I thought. But is the picture really that simple?

It was interesting, first, to see that they do actually use a peer-review system, even if perhaps not a super-duper double-blind system. And then I started wondering, is that conditioning for favoring low-acceptance publications really still relevant in the self-publishing era?

I remember when I published my first paper at AAAI, I was quite outraged at the idea that you have to pay, then give away all copyrights, and then be used as bait to draw paying readers, since the publication meant I could not give free access to my own readers unless I paid again. In a time when publishing your words on the web is such a common privilege, that seems plain wrong.

Back in the days when publishing was a costly process, a high selection rate guaranteed that subscribers wouldn't waste their money sponsoring the printing of low-quality papers. Furthermore, anything not printed had a very low chance of being read by other researchers, not to mention cited, so readers relied on editors to include only the best. Nowadays, papers are read mostly online, and if your paper is accessible to search engines, that suffices – whoever finds your research useful will read and cite it. This Wikipedia entry has the whole story in a nutshell.

So as for myself – I still have not published or reviewed with iConcept Press, but I am now less dismissive of this somewhat disruptive industry; not because it will win over the established venues, but because it will accelerate the move towards decentralized and online publishing, a better fit for our era.

Web(MD) 2.0

Just when I thought that the uses for recommendation systems were already exhausted…

CureTogether is a site that lets you enter your medical conditions (strictly anonymously; only aggregated data are public) and get recommendations for… other "co-morbid" conditions you may have. In other words, "people who have your condition usually also have this one – perhaps you have it too?"

Beyond the obvious jokes, this truly has potential. You don't only get "recommended" conditions, but also treatments and causes. We all know that sometimes we have our own personal treatment that works only for us. What if it works for people with our profile, and sharing that profile, anonymously, will help similar people as well? So far this direction is not explicit enough in how the site works, possibly for lack of sufficient data, but you can infer it as you go through the questionnaires.
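The "people who have your condition also have this one" idea boils down to simple co-occurrence statistics; the sketch below is purely illustrative (invented data and a plain lift score), not CureTogether's method.

```python
# Suggest conditions that co-occur with the user's reported ones more often
# than chance, using a simple lift score over anonymous profiles.
from collections import Counter
from itertools import combinations

def build_cooccurrence(profiles):
    """profiles: list of sets of conditions reported by anonymous users."""
    singles, pairs = Counter(), Counter()
    for conditions in profiles:
        singles.update(conditions)
        pairs.update(frozenset(p) for p in combinations(sorted(conditions), 2))
    return singles, pairs, len(profiles)

def suggest(user_conditions, singles, pairs, total, n=3):
    scores = Counter()
    for candidate in singles:
        if candidate in user_conditions:
            continue
        for have in user_conditions:
            joint = pairs[frozenset((have, candidate))] / total
            expected = (singles[have] / total) * (singles[candidate] / total)
            if expected:
                scores[candidate] += joint / expected    # lift over chance
    return [c for c, s in scores.most_common(n) if s > 0]

if __name__ == "__main__":
    profiles = [{"migraine", "insomnia"}, {"migraine", "insomnia", "anxiety"},
                {"back pain"}, {"migraine", "anxiety"}]
    singles, pairs, total = build_cooccurrence(profiles)
    print(suggest({"migraine"}, singles, pairs, total))
```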

The data mining aspect of having a resource such as CureTogether's database is naturally extremely valuable. CureTogether's founders share some of their findings on their blog. The power of applying computer science analytics and experimentation methodologies – sharpened by web-derived needs – to social sciences and other fields reminded me of Ben Shneiderman's talk on "Science 2.0". The idea that computer science can contribute methodologies that stretch beyond the confines of computing machines is a mind-boggling one, at least for me.

But would you trust collaborative filtering with your health? It's no wonder that the most popular conditions on the site are far from life-threatening, and tend to be ones with unclear causes and treatments, such as migraines, back pain and allergies. Still, the benefit on these alone will probably be sufficient for most users to justify signing up.

Searching for Faceted Search

Just finished reading Daniel Tunkelang’s recently published book on Faceted Search. I read Daniel’s blog (“The Noisy Channel“) regularly, and enjoy his good mix of IR practice with emphasis on Human-Computer Interaction (HCI). With faceted search tasks on the roadmap at work, I wanted to better educate myself on the topic, and this one looked like a good read, with the cover promising:

“… a self-contained treatment of the topic, with an extensive bibliography for those who would like to pursue particular aspects in more depth”

At 70 pages, the book reads quickly and smoothly. Daniel provides a fascinating intro to faceted search, from early taxonomies, to facets, to faceted navigation and on to faceted search. He adds an introductory chapter on IR, which is a worthwhile read even for IR professionals, with some interesting insights. One is how ranked retrieval, which we have all grown so accustomed to, blurred the once-clear border between relevant and non-relevant that set retrieval enforced. Daniel suggests that this issue is significant for faceted search, being a set-retrieval-oriented task, and a pingback on his blog led me to a fascinating elaboration on this pain point in another fine search blog (recommended read!).

With such elaborate introductory chapters and more on faceted search history, not much room is left for the actual chapters on research and practice, and as a reader I felt there could have been a lot more there. But then, it is reasonable to leave a lot to the reader and just give a taste of the challenges, to be later explored by the curious reader through the bibliography.

However, that promise of an extensive bibliography somewhat disappointed me… With 119 references, and only about a quarter of them academic publications from the past 5 years, I felt a bit back to square one. I was hoping for more of a literature survey, with pointers, when discussing the techniques for the tough issues, such as how to choose the most informative facets for a given query or how to extract facets from unstructured fields. Daniel provides some useful tips on those, but reading more on these topics will require doing my own literature scan.

In any case, for a newcomer with little background in search in general and faceted search in particular, this book is an excellent introduction. Those more versed in classic IR who are moving into faceted search will find the book an interesting read, but probably not sufficient as a full reference.

Friendly advice from your “Social Trust Graph”

While scanning for worthy Information Retrieval papers in the recent SIGIR 2009, I came across a paper titled "Learning to Recommend with Social Trust Ensemble", by a team from the Chinese University of Hong Kong. This one is about recommender systems, but putting the social element into text analytics tasks always interests me.

The premise is an interesting one – using your network of trust to improve classic collaborative filtering (CF) recommendations. The authors begin by observing that users' decisions are a balance between their own tastes and their trusted friends' recommendations.

Figure 1 from "Learning to Recommend with Social Trust Ensemble" by Ma et al.

They then propose a model that blends analysis of the classic user-item matrix, where users' ratings of items are stored (the common tool of CF), with analysis of a "social trust graph" that links the user to other users and, through them, to their opinions on the items.

This follows the intuition that when trying to draw a recommendation from the behavior of other users (which is basically what CF does), some users' opinions may be more important than others' – a fact that classic CF ignores, treating all users as equally important.
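Just to make the blending intuition tangible: the paper's actual model is more involved than this, but a naive version of the same idea could look like the sketch below, where a user's predicted rating mixes their own CF estimate with a trust-weighted average of their friends' ratings (all names, weights and data are my own assumptions).

```python
# Toy ensemble of "own taste" and "trusted friends' favors" -- not the paper's
# actual model, just the blending intuition.

def predict(user, item, cf_estimate, trust, ratings, alpha=0.6):
    """alpha balances the user's own taste against the trusted friends' favors.
    trust: {user: {friend: weight}}; ratings: {user: {item: rating}}."""
    own = cf_estimate(user, item)
    friends = trust.get(user, {})
    weighted = [(w, ratings.get(f, {}).get(item)) for f, w in friends.items()]
    weighted = [(w, r) for w, r in weighted if r is not None]
    if not weighted:
        return own
    social = sum(w * r for w, r in weighted) / sum(w for w, _ in weighted)
    return alpha * own + (1.0 - alpha) * social

if __name__ == "__main__":
    trust = {"dana": {"eli": 0.9, "gil": 0.3}}
    ratings = {"eli": {"camera_x": 5.0}, "gil": {"camera_x": 2.0}}
    cf = lambda user, item: 3.5          # stand-in for the classic CF prediction
    print(round(predict("dana", "camera_x", cf, trust, ratings), 2))
```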

The authors show results that outperform classic CF on a dataset extracted from Epinions. That's encouraging for any researcher interested in the contribution of the social signal to AI tasks.

free advice at renegade craft fair - CC Flickr/arimoore

However, some issues bother me with this research:

  1. Didn't the Netflix Prize-winning team's approach (see previous post) "prove" that statistical analysis of the user-item matrix beats any external signal other teams tried to use? The answer here may be related to the sparsity of the Epinions data, which makes life very difficult for classic CF. Movie recommendations have much higher density than retail (Epinions' domain).
  2. To evaluate, the authors sampled 80% or 90% of the ratings as training data and kept the rest for testing. But if your training data includes ratings from before the user started following someone, and your test data includes ratings from after, don't you get cause and effect mixed up? I mean, if I follow someone and discover a product through their recommendation, there's a high chance my opinion will also be influenced by theirs. So there's no true independence between the training and test data…
  3. Eventually, the paper shows that combining two good methods (the social trust graph and classic CF) outperforms each of the methods alone. The general idea of fusion or ensembles of methods is pretty much common knowledge for any machine learning researcher. The question should have been (but wasn't): does this specific ensemble of methods outperform any other ensemble? And does it fare better than the state-of-the-art result for the same dataset?

The last point is of specific interest to me, having combined keyword-based retrieval with concept-based retrieval in my M.Sc. work. I could easily show that the resulting system outperformed each of its separate elements, but to counter the above questions, I further tested combinations of other similarly high-performing methods to show that the gain there was much lower, and also showed that my combination could take a state-of-the-art result and improve on it further.

Nevertheless, the idea of using opinions from people you know and trust (rather than authorities) in ranking recommendations is surely one that will gain more popularity, as social players start pushing ways to monetize the graphs they worked so hard to build…

Semantic Search using Wikipedia

Today I gave my master’s thesis talk in the Technion as part of my master’s duties. Actually, the non-buzzword title is “Concept-Based Information Retrieval using Explicit Semantic Analysis”, but that’s not a very click-worthy post title :-)…

The whole thing went far better than I expected – the room was packed, the slides flew smoothly (and quickly too, luckily Shaul convinced me to add some spare slides just in case), and I ended up with over 10 minutes of Q&A (an excellent sign for a talk that went well…)

Click to view on Slideshare

BTW – does anybody have an idea how to embed SlideShare into a hosted blog? It doesn't seem to work…