What Job Postings Can Tell Us about Data Product Managers

Data product managers are a hot commodity in the job market these days. More and more companies realize the importance of data for their business and the need for building a strategy and a clear roadmap around data. Product managers are naturally part of that effort, and so the demand for Data Product Managers is on the rise.

Recently I got into a few discussions that made me think more about what practically differentiates Data PMs from any other PM. Is it just about their product domains? Are there specific new skills? Is it perhaps the tools being used? Maybe even their academic background?

Being a data enthusiast myself, it only made sense to turn to actual data to answer these questions. And so the journey begins…

The Data

First things first – data. At first, I considered using LinkedIn profiles of actual Data PMs as a dataset. By comparing profiles of Data PMs to non-Data PMs, one would expect to see the main differences stand out. However, even when leaving the scraping and privacy aspects aside, the self-written role descriptions on LinkedIn are too personalized and not easily comparable.

And so I turned to job descriptions. These are written with a clear focus on what the role requires, are better structured, and are also easier to obtain. There was still no available dataset to use, but this seemed like an easier problem to solve. Indeed.com provides a pretty good aggregation of links to job postings, which I could then follow to scrape the content. I could also easily select companies that offer both a Data PM and a non-Data PM position, to hopefully get a better signal-to-noise ratio, though admittedly this may cause a slight bias toward larger companies (those having two or more PM openings).

The resulting dataset includes 100 recent job postings from 50 different companies, where for each company one posting was for a Data PM and the other for a non-Data PM. For this purpose, I’ve defined “Data PM” as having the word “Data” in the title; this seemed like a reasonable starting point, but I’d welcome any feedback as to what such a definition may have missed in the bigger picture.


An interesting exploration step to take as we get started is to look at the role titles we actually got in this sample. Taking the 50 Data PM postings and removing all the standard PM title keywords (including “Senior”, “Associate”, “Product Manager”, “Product Management” etc.) yields a distribution of title fragments that indicates what these data products seem to be about. As the graph below shows, around two thirds of the postings focus on a short list of general names: Data Science, Data Platform, Data Products, or simply Data. The “Other” third is made up of a long list of more specific terms all across the data pipeline: Data Integration, Ingestion, Modeling, Foundation, Analysis, Strategy and more. It therefore seems reasonable to assume that in most cases Data PMs own the entire data domain in their organization, end to end.
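The title clean-up described above can be sketched roughly as follows. This is only an illustration: the keyword list, the sample titles and the helper names (`title_fragment`, `fragment_distribution`) are all made up for the example, and the actual list of keywords removed for the post was longer than the four it quotes.

```python
import re
from collections import Counter

# Hypothetical list of generic PM title keywords to strip. The post quotes
# only "Senior", "Associate", "Product Manager" and "Product Management";
# the other entries are assumptions added for illustration.
GENERIC_TITLE_TERMS = [
    "product management", "product manager",
    "senior", "associate", "principal", "lead", "sr",
]

def title_fragment(title):
    """Strip generic PM keywords and punctuation, keeping the data-specific part."""
    text = title.lower()
    for term in GENERIC_TITLE_TERMS:
        # Match whole words only, so "lead" does not hit "leadership".
        text = re.sub(r"(?<!\w)" + re.escape(term) + r"(?!\w)", " ", text)
    text = re.sub(r"[^a-z& ]+", " ", text)  # drop commas, dashes, dots, digits
    return " ".join(text.split())

def fragment_distribution(titles):
    """Count how often each remaining title fragment occurs."""
    return Counter(title_fragment(t) for t in titles)

# Made-up sample titles, not taken from the actual dataset.
sample_titles = [
    "Senior Product Manager, Data Platform",
    "Product Manager - Data Science",
    "Associate Product Manager, Data Science",
    "Sr. Product Manager, Data",
    "Product Manager, Data Ingestion",
]

for fragment, count in fragment_distribution(sample_titles).most_common():
    print(f"{fragment}: {count}")
```

Grouping the long tail into an “Other” bucket would then be a matter of summing all fragments below some count threshold.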

[Figure: Data PM titles]

Next, we’ll process the job posting text itself. The general approach I took was to compare the 50 Data postings to the 50 non-Data postings, and look for features that best separate these two classes. We’ll start by tokenizing, removing stopwords and stemming the terms (all using nltk), then parse the resulting texts to extract n-grams and their frequency in each class. For each n-gram, we’ll record the number of postings it was found in, the ratio of Data to non-Data counts, and the Information Gain measure when using this n-gram to separate the two classes. We’ll also remove low-count n-grams (those found in fewer than 5% of the postings), as well as ones with zero or very low information gain.
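Here is a minimal sketch of the counting part of that pipeline. It is not the code used for the post: to stay dependency-free it swaps nltk for a regex tokenizer and a tiny stand-in stopword list, skips stemming entirely, and expresses the Data to non-Data ratio as the Data share of matching postings (an assumption, chosen to avoid dividing by zero). Information gain scoring is left out of this sketch, and the toy postings are invented.

```python
import re
from collections import Counter

# Tiny stand-in stopword list (assumption); the post uses nltk's list.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "with", "for", "in",
             "our", "is", "are", "you", "will"}

def tokenize(text):
    """Lowercase, split on non-letters, drop stopwords (no stemming here)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def ngrams(tokens, max_n=2):
    """Return the set of distinct 1..max_n-grams found in one posting."""
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }

def posting_counts(postings, max_n=2):
    """For each n-gram, count the number of postings that contain it."""
    counts = Counter()
    for text in postings:
        counts.update(ngrams(tokenize(text), max_n))
    return counts

def compare_classes(data_postings, other_postings, min_share=0.05):
    """Map n-gram -> (Data postings, non-Data postings, Data share of matches)."""
    data_counts = posting_counts(data_postings)
    other_counts = posting_counts(other_postings)
    total = len(data_postings) + len(other_postings)
    rows = {}
    for gram in set(data_counts) | set(other_counts):
        d, o = data_counts[gram], other_counts[gram]
        if (d + o) / total < min_share:  # drop low-count n-grams
            continue
        rows[gram] = (d, o, d / (d + o))
    return rows

# Invented toy postings, two per class.
data_postings = [
    "Build the data platform with data scientists and SQL.",
    "Own the data platform roadmap with data engineers.",
]
other_postings = [
    "Delight users and manage the product backlog.",
    "Launch features and use data to improve the user experience.",
]

rows = compare_classes(data_postings, other_postings)
for gram in ("data", "data platform", "product backlog"):
    print(gram, rows[gram])
```

Counting each n-gram at most once per posting (a set per posting) matches the "number of postings it was found in" measure, rather than raw term frequency.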


Now we’re finally ready to see the actual results… so what are the keywords and key phrases that differentiate a Data PM from other PMs?

[Table: Top 10 terms by their Information Gain]

The above table shows the top terms by their information gain, or how well they separate the two groups. Examining the full list indicates that organizations view Data PMs as professionals who build data platforms, work with teams of data scientists and data engineers, and derive or work with data models. It’s worth noting that the term data in itself is no longer unique to Data PMs, and actually also has a high frequency in non-Data PM job descriptions (hence the low “Data Ratio” value), while other top terms and phrases appear in small numbers of postings, but when they do, they are highly informative.

When we look for names of data tools, we find that very few of them made it high in the list. SQL stands out as a tool that got mentioned in 16 Data postings vs. 3 non-Data, Tableau appears in 9 Data postings vs. 1 non-Data posting, while Python appears in only 6 postings but all are for Data PMs. All of these terms combined cover about 40% of the Data postings, illustrating the expectation from many Data PMs to be able to access and manipulate data, from basic SQL to actual coding.
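To make the information gain measure more concrete, here is a sketch that computes it directly from the posting counts quoted above. It assumes the standard definition (entropy of the class split before, minus the weighted entropy after splitting on term presence) and 50 postings per class; the post does not spell out the exact formula it used, so the values may not match the original table.

```python
from math import log2

def entropy(pos, neg):
    """Shannon entropy, in bits, of a two-class split with the given counts."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * log2(p)
    return h

def information_gain(data_with, other_with, n_data=50, n_other=50):
    """Entropy reduction from splitting postings on presence of a term."""
    total = n_data + n_other
    with_term = data_with + other_with
    without_term = total - with_term
    h_before = entropy(n_data, n_other)
    h_after = (
        with_term / total * entropy(data_with, other_with)
        + without_term / total
        * entropy(n_data - data_with, n_other - other_with)
    )
    return h_before - h_after

# (Data postings, non-Data postings) mentioning each tool, as quoted above.
tool_counts = {"sql": (16, 3), "tableau": (9, 1), "python": (6, 0)}

for term, (d, o) in tool_counts.items():
    print(f"{term}: {information_gain(d, o):.3f} bits")
```

A term that appears equally often in both classes gets a gain of zero under this definition, while a term that appears in every posting of one class and none of the other gets the maximum of one bit.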

Flipping the list around to a high ratio of non-Data to Data postings, we can learn which terms strongly predict non-Data postings. Not surprisingly, we’ll find user-facing keywords such as engage, delight and experience, but more interestingly there are quite a few classic product skills and terms, such as product backlog, product definition, portfolio and launch. That can be interpreted as an indication that Data PMs are assumed to be experienced PMs who have already mastered product management basics, and so the posting focuses on the Data-specific aspects.

What about degree requirements? In general, degree requirements, whether undergraduate or graduate (including an MBA), do not seem to have any particular significance for Data PMs, showing zero information gain. Statistics, on the other hand, whether as a degree or just as a background, is a clear attribute of Data postings, with 30% mentioning it versus only 2% of non-Data postings.

While this dataset may be too small to be a representative sample, it does give an interesting snapshot of how the job market views the role of a Data PM right now. If you have any further insights, I’d love to hear your thoughts in the comments!

The Data-Product-Scientist-Manager

What’s the difference between Machine Learning, Artificial Intelligence, Deep Learning and Data Science? The huge buzz around these concepts in recent years makes it seem as if they could be used interchangeably.

Several months ago, I gave a meetup talk, going through a history of machine-learning algorithms through the prism of learning algorithms for game playing (video, in Hebrew). As the talk did not require prior knowledge, I began with a quick intro to artificial intelligence (AI) and machine learning (ML) –

  • AI is the science of building machines that mimic what we humans perceive as intelligent behavior
  • ML is a branch within AI, concerned with algorithms that learn their “intelligence” from data (rather than explicit coding)
  • Deep Learning is a very particular method within ML, which uses artificial neural networks and huge volumes of data

At the time of building the talk, I was struggling with making a reference to data science (DS) as well, to help my audience make sense of all the current hype terms.

But is DS a discipline within AI? Within ML? Is it not simply a fancy name for Statistics? I ended up leaving it out.

A few days ago, I stumbled upon an episode of the excellent ML podcast Talking Machines, where Neil Lawrence made an observation that put it all into place for me. Lawrence posited that DS arises from the new streams of data we collect in this era, which are generated in huge volumes out of sensors and interactions, and its mission is to extract value from these. In other words – “Here, we have all of this data, what can we do with it?”

This may seem like a petty technicality, but it makes all the difference. In classic experimentation, scientists would make a hypothesis, collect data to validate or invalidate it, and then run the needed statistical analysis.

With DS, there are no such prior hypotheses. So the core of DS becomes coming up with these hypotheses, and then validating them right there and then in the data we already have. But whose job is it to come up with hypotheses?

There are several relevant roles out there to consider:

  • Data (and business) analysts have a strong understanding of how to wrangle data and query it, and will run analysis (either on-demand or on their own) against clear business objectives. But their role and state of mind are not geared toward finding new such objectives or disruptive new ways to reach them
  • Data and ML engineers build the technology and libraries on which the data is collected and crunched. They love to see their systems used to generate powerful insights and capabilities, but see themselves as the infrastructure for generating and validating these hypotheses, rather than as the users
  • Data scientists apply their strong statistics and ML skills to the above data infrastructure to build models, enabling new capabilities and user-value out of validated hypotheses. But a model is not built in a vacuum: they need a clear mission, derived from a validated hypothesis (or even a yet-to-be-validated one)
  • Product managers are the classic hypothesis-creator types. They analyze the market, meet with customers, dive into analytics and business data, and then create product hypotheses, collected into roadmaps. But they hardly use the above “big” data infrastructure for generating hypotheses, mostly due to tech know-how gaps

What we need for data to be fully leveraged is a new role, a hybrid of the latter two. The data science product manager is a data scientist with the instincts and user-centric thinking of the product manager, or a product manager with the data exploration intuitions of a data scientist. Which skills will this require?

  • Strong data instincts, the ability and desire to explore data both assisted and unassisted, applying intuition to identify ad-hoc patterns and trends
  • User-centric thinking, seeing the users and real-life scenarios behind the data, almost like Neo in “The Matrix”
  • Technical acumen, though not necessarily coding. Today’s DS and ML tools are becoming more and more commoditized, and require less and less writing from scratch
  • Very strong prioritization capabilities: creating hypotheses from data may be easy, almost too easy. Hence the need to further explore only the most promising ones, turning them into a potential roadmap
  • Ability to work closely with the data team and “speak their language” to quickly validate, understand the productization cost, and estimate ROI for a large list of such hypotheses

While this role could still be fulfilled by a strong partnership between two individuals working in tandem (PM and data scientist), it is clear that a single individual possessing all of these skills will achieve results far more efficiently. Indeed, as a quick search on LinkedIn shows, the combined role is emerging and exploding in demand.

Mining Wikipedia, or: How I Learned to Stop Worrying and Love Statistical Algorithms

I took my first AI course during my first degree, in the early ’90s. Back then it was all about expert systems, genetic algorithms and search/planning (A* anyone?). It all seemed clear, smart, intuitive, intelligent…

Then, by the time I got to my second degree in the late ’00s, the AI world had changed a lot. Statistical approaches had taken over by storm, and massive amounts of data seemed to trump intuition and smart heuristics every time.

It took me a while to adjust, I admit, but by the time I completed my thesis I had come to appreciate the power of big data. I can now better see this as an evolution, with heuristics and intuitions driving how we choose to analyze and process the data, even if afterwards it’s all “just” number-crunching.

So on this note, I gave a talk today at work on the topic of extracting semantic knowledge from Wikipedia, touching also on our work on ESA as an illustration of the above. Enjoy!