The Data-Product-Scientist-Manager

What’s the difference between Machine Learning, Artificial Intelligence, Deep Learning, and Data Science? The huge buzz around these concepts in recent years makes it seem as if they could be used interchangeably.

Several months ago I gave a meetup talk that walked through the history of machine learning through the prism of game-playing algorithms (video, in Hebrew). Since the talk did not require prior knowledge, I began with a quick intro to artificial intelligence (AI) and machine learning (ML):

  • AI is the science of building machines that mimic what we humans perceive as intelligent behavior
  • ML is a branch within AI, concerned with algorithms that learn their “intelligence” from data (rather than explicit coding)
  • Deep learning is a very particular method within ML, which uses artificial neural networks and huge volumes of data
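To make the "learned from data rather than explicit coding" distinction concrete, here is a minimal sketch in Python. The toy data, the function names, and the single-threshold "model" are all assumptions made up for illustration; they are not taken from the talk.

```python
# Toy data: (hours_played, is_expert) pairs, invented for illustration only.
samples = [(1, False), (2, False), (3, False), (6, True), (8, True), (9, True)]


def explicit_rule(hours):
    """Hand-coded "intelligence": the programmer picks the threshold."""
    return hours >= 5


def learn_threshold(data):
    """Learned "intelligence": pick the threshold that best fits the data."""
    candidates = sorted({x for x, _ in data})

    def accuracy(t):
        # Fraction of samples where the rule "x >= t" matches the label.
        return sum((x >= t) == label for x, label in data) / len(data)

    # max() keeps the first (smallest) candidate among equally accurate ones.
    return max(candidates, key=accuracy)


threshold = learn_threshold(samples)


def learned_rule(hours):
    """Same form as explicit_rule, but the threshold came from the data."""
    return hours >= threshold
```

Both rules have the same shape; the only difference is where the threshold comes from. Change the samples and the learned rule changes with them, while the explicit one stays whatever the programmer typed.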

While preparing the talk, I struggled to fit in a reference to data science (DS) as well, to help my audience make sense of all the current hype terms. But is DS a discipline within AI? Within ML? Is it not simply a fancy name for statistics? I ended up leaving it out.

A few days ago, I stumbled upon an episode of the excellent ML podcast Talking Machines, where Neil Lawrence made an observation that made it all fall into place for me. Lawrence posited that DS arises from the new streams of data we collect in this era, which are generated in huge volumes by sensors and interactions, and that its mission is to extract value from them. In other words: “here, we have all of this data, what can we do with it?”

This may seem like a petty technicality, but it makes all the difference. In classic experimentation, scientists would form a hypothesis, collect data to validate or invalidate it, and then run the needed statistical analysis. With DS, there are no such prior hypotheses. So the core of DS becomes coming up with these hypotheses, and then validating them right there and then, in the data we already have. But whose job is it to come up with hypotheses?

There are several relevant roles out there to consider:

  • Data (and business) analysts have a strong understanding of how to wrangle and query data, and will run analyses (either on demand or on their own initiative) against clear business objectives. But their role and state of mind are not about finding new such objectives, or disruptive new ways to reach them
  • Data and ML engineers build the technology and libraries on which the data is collected and crunched. They love to see their systems used to generate powerful insights and capabilities, but see themselves as the infrastructure for generating and validating these hypotheses, rather than as the users
  • Data scientists apply their strong statistics and ML skills to the above data infrastructure to build models, enabling new capabilities and user value out of validated hypotheses. But a model is not built in a vacuum; they need a clear mission, derived from a validated hypothesis (or even a yet-to-be-validated one)
  • Product managers are the classic hypothesis-creator types. They analyze the market, meet with customers, dive into analytics and business data, and then they create product hypotheses, collected into roadmaps. But they hardly use the above “big” data infrastructure for generating hypotheses, mostly due to tech know-how gaps

What we need for data to be fully leveraged is a new role, a hybrid of the latter two. The data science product manager is a data scientist with the instincts and user-centric thinking of the product manager, or a product manager with the data exploration intuitions of a data scientist. Which skills will this require?

  • Strong data instincts, the ability and desire to explore data both assisted and unassisted, applying intuition to identify ad hoc patterns and trends
  • User-centric thinking, seeing the users and real-life scenarios behind the data almost like Neo in The Matrix
  • Technical acumen, though not necessarily coding. Today’s DS and ML tools are becoming more and more commoditized, and require less and less writing from scratch
  • Very strong prioritization capabilities; creating hypotheses from data may be easy, almost too easy. Hence the need to explore further only the most promising ones, turning them into a potential roadmap
  • Ability to work closely with the data team and “speak their language” to quickly validate, understand the productization cost, and estimate ROI for a large list of such hypotheses

While this role could still be fulfilled by a strong partnership between two individuals working in tandem (PM and data scientist), it is clear that a single individual possessing all of these skills will achieve results far more efficiently. Indeed, as a quick search on LinkedIn shows, the combined role is emerging and exploding in demand.

Mining Wikipedia, or: How I Learned to Stop Worrying and Love Statistical Algorithms

I took my first AI course during my first degree, in the early 90’s. Back then it was all about expert systems, genetic algorithms, and search/planning (A* anyone?). It all seemed clear, smart, intuitive, intelligent…

Then, by the time I got to my second degree in the late 00’s, the AI world had changed a lot. Statistical approaches had taken over by storm, and massive amounts of data seemed to trump intuition and smart heuristics every time.

It took me a while to adjust, I admit, but by the time I completed my thesis I had come to appreciate the power of big data. I can now better see this as an evolution, with heuristics and intuitions driving how we choose to analyze and process the data, even if afterwards it’s all “just” number-crunching.

So on this note, I gave a talk today at work on extracting semantic knowledge from Wikipedia, which also touches on our work on ESA and serves as an illustration of the above. Enjoy!