What Does It Mean to Be a Data Scientist?
At Dice, we’ve had a data-science team for two years. As research and development for the firm, we’ve worked on a number of different projects; although many haven’t hit the site yet (stay tuned), a few of our earlier projects have rolled out over the past 12 months.
For example, last January we replaced the “More Jobs Like This” section on the jobs pages with a custom-built recommender solution. More recently, on the employer side of the site, we released a “More Candidates Like This” section. Also on the employer side, in the candidate search functionality, your search terms will result in suggestions for related skills based on our mining of resume and jobs data.
At this point, we’d like to share with you some of the work we’ve been doing, as well as some of the interesting conversations we’ve had about the data-science space. At times we may get technical, show some neat visualizations of our data, or wax lyrical about current and emerging trends in the industry. Interested readers can expect a post from us at least once a month at around the same time.
As this is the first post in the series, I’m going to define what it means to be a data scientist, at least at Dice.
To be honest, I often don’t tell people I am a data scientist. It’s not that I don’t enjoy my job (I do!) nor that I’m not proud of what we’ve achieved (I am); it’s just that most people don’t really understand what you mean when you say you’re a data scientist, or they assume it’s some fancy jargon for something else, just as “refuse collector” stands in for “bin man” (I’m British) or “garbage man.” I have to admit, I have the same reaction to a lot of modern jobs: I don’t feel I really understand what an actuary does, other than assess risk, which is like saying, “I work with data.”
If I do answer, I normally follow up with analogies. In the modern world, data science is pervasive; it impacts a lot of what we do, particularly online. So I talk about how I work on recommender systems, citing Netflix and Amazon as examples, and work on enhancing our search engine. The latter always involves a reference to Google, the pinnacle of search sophistication. However, that belies a lot of the more exciting and important work that we’ve been doing and will be doing.
I’m not the first to try and define “data scientist,” and I won’t be the last. What do others have to say? In 2012, in an effort to grab headlines, Harvard Business Review proclaimed “Data Scientist: The Sexiest Job of the 21st Century.” That article spawned a popular meme, leading to a previous employer referring to me as “that sexy data scientist.” The article portrays the data scientist as an intrepid explorer of data, seeking out hidden insights from within a corporation’s datasets, in order to revitalize or transform their employer’s business. Meanwhile IBM, one of the first companies to truly embrace data science, provides a less romanticized and more pragmatic description.
IBM stresses that a data scientist, above all else, must have strong business acumen and be able to communicate ideas effectively to core decision makers in the organization.
A number of common themes stand out as you read through the different descriptions of the profession online:
High Business Impact
It is often said that data scientists can have a disproportionately large business impact in comparison to their numbers. The goal of the data scientist is to garner deep insights from a corporation’s data to help drive future decision-making processes. If done right, this can help chart the course of a company’s future development.
The Data Scientist as Polymath
While a lot of modern technology professions require a large range of skills (see The Rise and Fall of the Full-Stack Developer), the data scientist may have the most diverse skill set. A typical data scientist has knowledge of statistics, strong math skills (in particular linear algebra and probability theory), and the ability to work with data visualization techniques and tools (such as D3.js or Tableau), SQL, several Big Data technologies (such as MongoDB and Hadoop), and cloud platforms such as AWS; in addition, he or she is an adept programmer, and has a good knowledge and understanding of business.
A Shortage of Qualified Data Scientists
Such diverse and wide-ranging requirements can in part explain the shortage of actual data scientists in the modern labor market. Within the Dice offices, where we pay particular attention to popular skill sets in low supply, we’ve dubbed data scientists with extensive skill sets “pink unicorns.” The shortage was previously blamed on the lack of courses that teach all the skills necessary to become a data scientist, although today there are a lot of Master’s courses and boot camps with the express purpose of teaching data scientists everything from machine learning and Hadoop to statistical analysis.
So now that we’ve defined what a data scientist is, what are some important (and often overlooked) qualities of a good data scientist?
A Scientific Mindset
Probably the most important attribute of a data scientist is the possession of a scientific mindset. It’s been said that any subject with “Science” in its title is not a science, but I don’t think that applies to data science. While it’s important to know the key algorithms and their limitations, it’s nearly impossible to reliably predict which approach will be most effective on your data without running a series of experiments. The experimental method is also important when digging into your data: You need to spot patterns, formulate hypotheses and then test them by formulating queries, running statistical analyses or visualizing the data in some way.
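To make the idea of "running a series of experiments" concrete, here is a minimal sketch of comparing two candidate approaches with k-fold cross-validation. Everything in it is an assumption made for illustration: the synthetic data, the two toy models and the helper names (k_fold_indices, cross_validate) are invented here and are not taken from our own work.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k roughly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(fit, xs, ys, k=5, seed=0):
    """Return the mean absolute error of `fit`, averaged over k held-out folds."""
    errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        predict = fit(train_x, train_y)  # train only on the other folds
        errors.append(mean(abs(predict(xs[i]) - ys[i]) for i in held_out))
    return mean(errors)


def fit_mean_baseline(train_x, train_y):
    """Candidate A: always predict the mean of the training targets."""
    avg = mean(train_y)
    return lambda x: avg


def fit_line(train_x, train_y):
    """Candidate B: ordinary least-squares line through the training points."""
    mx, my = mean(train_x), mean(train_y)
    sxx = sum((x - mx) ** 2 for x in train_x)
    slope = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y)) / sxx
    return lambda x: my + slope * (x - mx)


# Synthetic data, invented for illustration: y is roughly 2x + 1 plus noise.
rng = random.Random(42)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]

results = {
    "mean_baseline": cross_validate(fit_mean_baseline, xs, ys),
    "line": cross_validate(fit_line, xs, ys),
}
```

The point is less the models than the habit: every candidate is scored on data it never saw during fitting, so the comparison is an experiment and not a guess. In practice you would probably reach for a library implementation of cross-validation instead of hand-rolling one.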
A scientific theory must be tested against empirical evidence. Likewise, as a data scientist, I only believe in what our data tells us, and I am suspicious of theories about our business or our customers that our data does not support. All good scientists are skeptics at heart; they require strong empirical evidence to be convinced of a theory. In the same spirit, I’ve learned to be suspicious of models that are too accurate, or individual variables that are too predictive. Most of the time, it means some subtle data leakage has occurred, or there’s a bug in your code.
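One cheap way to act on that suspicion is to flag any single variable that tracks the target almost perfectly before trusting a model built on it. The sketch below is only an illustration: the feature names, the toy values and the 0.95 threshold are all assumptions, and a high correlation is a prompt to investigate, not proof of leakage.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def suspicious_features(features, target, threshold=0.95):
    """Return the names of features whose |correlation| with the target exceeds threshold."""
    return sorted(
        name for name, values in features.items()
        if abs(pearson(values, target)) > threshold
    )


# Toy data, invented for illustration. "refund_issued" is imagined to be
# recorded only after the outcome is known, so it mirrors the target exactly.
target = [0, 1, 0, 1, 1, 0, 1, 0]
features = {
    "account_age_days": [30, 45, 400, 12, 210, 95, 60, 180],
    "refund_issued": [0, 1, 0, 1, 1, 0, 1, 0],
}
flagged = suspicious_features(features, target)
```

Here only the hypothetical refund_issued column is flagged. The useful follow-up question for a flagged variable is whether it would actually be available at the moment a prediction has to be made.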
Solid Programming Skills
Predictive modeling and statistical analysis are important tools in a data scientist’s toolbox. However, first and foremost a data scientist must be a competent programmer. It’s often said that a data scientist spends the majority of his or her time cleaning and preparing data; while I feel this “fact” is a little exaggerated, and very dependent on the data available, being able to program well is a very important skill. Everything that a data scientist does, from predictive modeling to data visualization and automating experiments, requires computer programming. We once interviewed a candidate who was exceptional at creating predictive models; however, we had to turn the person down, as they were unable to write code to process a flat file that was in a simple but non-standard format.
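To give a feel for the kind of task involved, here is a small sketch of parsing a simple but non-standard flat file. The format is entirely hypothetical (it is not the one from that interview): records are blocks of "key = value" lines separated by a line containing only "%%", with "#" starting a comment.

```python
def parse_records(text, separator="%%"):
    """Parse blocks of `key = value` lines, split by a separator line, into dicts."""
    records, current = [], {}
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == separator:
            if current:  # ignore repeated or leading separators
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"line {line_number}: expected 'key = value', got {raw!r}")
        current[key.strip()] = value.strip()
    if current:
        records.append(current)  # the last record may lack a trailing separator
    return records


# Sample input in the hypothetical format described above.
sample = """\
# hypothetical export format
id = 101
title = Data Scientist
%%
id = 102
title = Search Engineer
skills = Python, SQL
"""
records = parse_records(sample)
```

Nothing here is difficult, which is rather the point: a few lines of careful string handling, an explicit error for malformed input, and attention to the edge case at the end of the file.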
Promotes Data Science
As a data scientist, it’s important that you are an evangelist for your profession. Despite the awareness of the value of Big Data, it’s still hard for businesspeople to truly understand the scope and power of data science and what it can do for their organization. There are several different reasons for this: First, data science is still a very new profession and not practiced much in smaller companies, so it’s hard for some people to understand what it can do if they haven’t worked with a data science team before. Also, the sort of solutions data science can produce differ widely by industry; what works for one company doesn’t necessarily translate to another.
In addition, the dynamic nature of a predictive model can be hard to understand in comparison to the implementation of some business logic or a user interface. Models make mistakes, often ones that are obvious to a person, and businesspeople can have a hard time understanding that. Thus, it’s important that a data scientist educate the business about data science and how it can be used to effectively solve business problems, and where its limitations lie. For this, domain knowledge is very important, as mentioned earlier.
Uses the Right Tools
There is a plethora of tools for data science, from machine learning to statistical analysis and crunching large datasets. It can be very tempting to spend a lot of time researching different tools, and using the coolest new toys to solve a particular problem. However, it’s important to actually get some work done, and there’s only so much time you can spend evaluating tools: You need to be selective, and listen to what other people in the industry recommend for similar problems.
The technology industry is as much driven by fads as the fashion world, and there is a tendency to try to use new technologies for problems they aren’t suited to handle. The best and most commonly stated example of this is Hadoop. A lot of companies seem to be under the impression that if you’re not using Hadoop, then you are not doing data science. The reality is that a lot of businesses don’t have the amount of data that warrants a Hadoop cluster. For those that do, it may still not be the best tool out there; certain tasks, for instance certain machine learning algorithms, have to be executed in a serial manner and cannot take full advantage of MapReduce.
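The difference between work that splits cleanly across machines and work that does not can be sketched in a few lines. This is a toy illustration in plain Python, not Hadoop code, and the function names and numbers are made up: a sum of squares can be computed chunk by chunk and combined in any order, whereas each step of a simple gradient descent needs the result of the step before it.

```python
from functools import reduce
from operator import add


def chunked(values, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def sum_of_squares_by_chunks(values, size):
    """Map-reduce style: each chunk is processed independently, then combined."""
    partials = [sum(x * x for x in chunk) for chunk in chunked(values, size)]  # "map"
    return reduce(add, partials, 0)  # "reduce"


def gradient_descent(gradient, start, rate, steps):
    """Inherently serial: every update depends on the previous value of w."""
    w = start
    for _ in range(steps):
        w = w - rate * gradient(w)
    return w


values = list(range(1, 101))
total = sum_of_squares_by_chunks(values, size=10)

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_min = gradient_descent(lambda w: 2 * (w - 3), start=0.0, rate=0.1, steps=100)
```

The chunked sum gives the same answer however the data is partitioned, so the partial sums could be computed on separate machines. The loop in gradient_descent cannot be reordered that way, although in real systems the work inside each step (computing the gradient over a large dataset) is often what gets distributed.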
Similarly, Hadoop is not a good tool for running complex queries, which is one of the reasons that Google has moved away from the pure MapReduce paradigm it invented toward more sophisticated systems such as Spanner. At Dice, we find Amazon’s Redshift more than competent for most of our Big Data-processing needs, and also leverage Apache Spark for some of the most processing-intensive tasks.
In this post, I’ve hopefully given you a taste of what it means to be a data scientist, and drawn attention to some often-overlooked qualities of a good data scientist. In future posts, we’ll start to explore some of the underlying trends in the industry, show some interesting insights into our data and delve deep into some of the technical solutions we’ve developed using our data to solve real problems.