Data science has recently received considerable attention in both technical research and business strategy; however, there is an opportunity for more research on, and improvement of, the data science research process itself. Based on the research methods described in this paper, we believe that design thinking can be applied to the data science process to formalize and improve how research projects are carried out.
Thus, this paper focuses on three core areas. The first is a background of the data science research process and an identification of the common pitfalls data scientists face. The second is an explanation of how design thinking principles can be applied to data science. The third is a proposed new process for data science research projects based on these findings. The paper concludes with an analysis of implications for both individual data scientists and teams, and with suggestions for future research to validate the proposed framework.
Data science is arguably one of the most popular jobs of the century; yet the characteristics of the job remain uncertain (HBR). The lack of formal training available in university programs, unclear role requirements, and the breadth of the position have led to both ambiguity about how to become a good data scientist and an idolatry of those who are able to do it all, colloquially deemed “unicorns”.
Academic advancements in the field of data science have traditionally focused on the development of new statistical analysis techniques, machine learning models, and neural networks. Little research has been performed on the process itself, the most prominent example being the KDD process, a framework for knowledge discovery in databases, proposed in 1997.
Beyond being out of date and lacking answers to common challenges in today’s data science research process, the KDD framework also focuses primarily on the data mining step within the process without expanding its depth to the research process as a whole.
Process research is quite popular in other fields, most notably in design, a profession that has developed substantial literature on how to solve problems using the mindset of a designer. Researchers and prominent designers have shared the core methods and reasoning patterns that are used in design work, resulting in the popularization of the term “design thinking” and its related practices within corporations and institutions. Furthermore, many of the methods and principles of design thinking are widely applicable and have already been applied to fields such as education, healthcare, and writing studies.
The lack of formality in the role of a data scientist, the lack of existing literature, and the interdisciplinary overlap with the field of design research all encourage further research into the data science process as a whole.
Thus, this paper aims to identify the common pitfalls of the data science research process and to propose a new framework for solving data science problems that draws on the principles and process of design thinking, highlighting the strengths of design thinking as they relate to the data science research process. This paper also explains the methods used to reach these conclusions, discusses potential implications, and identifies future research opportunities.
The data for this paper was gathered through both professional first-hand experience and a thorough review and analysis of industry research and academic literature. The following summarizes the methods used, with more detailed sources listed in the References:
- Systematic review of published papers on the subjects of design thinking and data science using Google Scholar and search terms such as Design Thinking Process, Design Thinking Applications, Data Science Process, Data Science Challenges, Research Mistakes, and Data Science Research
- Review of blogs and content published by subject matter experts at highly regarded institutions such as Stanford D School, IDEO, Springboard, and O’Reilly
- Review of concepts in popular books on the topics including Change by Design, The Elements of Data Analytic Style, and Storytelling with Data
- Other process frameworks beyond design thinking were considered in attempts to solve the common shortcomings of the data science research process. Amongst those considered were lean and agile frameworks. In the end, design thinking was chosen due to its direct application as a general research process, over other frameworks that more closely resemble a development process.
Overview of Data Science Research Process
For this paper, it is important first to establish a distinction between a data science question and a data science research project. The former is a clearly specified question, or set of questions, that has been posed with the aim of reaching an answer quickly. In contrast, a data science research project encompasses larger endeavors in which the goal and the answer are developed simultaneously along the way.
The two can be distinguished by the length of time required for completion: a question is answered quickly, while a project evolves and is completed over a longer period of time. Another distinction lies in the ambiguity of the task: a question is simply that, whereas a project typically begins more vaguely and thus presents many challenges. The following sections detail the process for completing a data science project and the associated common pitfalls.
Frame the problem
It is common in any research process to begin by framing the problem, stating one’s hypothesis, and developing a strategic approach to answering the questions that are posed. Within this line of thinking, it is also common to approach research with the limitations of the available data in mind, rather than starting with a strategic question or direction; however, the rise of big data has fundamentally changed the way one can approach such research.
Although the available quantity, quality, and variety of data have increased, effectively eliminating the need to start a research process from the limitations of the data, the approach to framing the data science project has not evolved. This is most troublesome among inexperienced data scientists, who often begin from the mindset of what data is available and then struggle to uncover meaningful insights, rather than working backwards from the questions that would be most strategically valuable to focus their research on.
Get the data
The importance of the subsequent step in the data science research process (gathering, cleaning, and preparing the data) is often underestimated. Data integrity is a major concern in the research sphere, and it begins with a full understanding of the source of the data.
Without developing a full understanding of the data source, it can be difficult to articulate the generalizability and accuracy of results, the impact of any findings, or the theoretical basis for the models developed to an audience.
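As a minimal sketch of what a first look at a data source might involve, the following Python snippet profiles a small set of records for missing values and exact duplicate rows. The records, field names, and the `profile` helper are hypothetical and exist only for illustration; a real project would also document where the data came from and how it was collected.

```python
from collections import Counter

# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": 1, "region": "north", "sales": 120.0},
    {"id": 2, "region": None, "sales": 95.5},
    {"id": 3, "region": "south", "sales": None},
    {"id": 1, "region": "north", "sales": 120.0},  # exact duplicate of the first row
]


def profile(rows):
    """Summarize row count, missing values per field, and exact duplicate rows."""
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None:
                missing[field] += 1
    # Each row becomes a sorted tuple of (field, value) pairs so it can be counted.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())
    return {"rows": len(rows), "missing": dict(missing), "duplicates": duplicates}


report = profile(records)
print(report)
```

A summary like this does not establish data integrity on its own, but it gives the data scientist concrete questions to take back to whoever owns the source.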
Explore the data
Data exploration, also known as data mining, is the process of uncovering valuable insights from large datasets, often with the assistance of advanced statistical analysis and visualization. Beyond surface-level techniques, such as running descriptive statistics of variables and checking for correlations, this point in the process can stump a data scientist as they struggle with what questions to ask of the data.
To be successful in this step, substantial knowledge of the research question and creativity to move beyond elementary exploration to valuable insights are required.
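The surface-level techniques mentioned above, descriptive statistics and correlation checks, can be sketched in a few lines of Python using only the standard library. The dataset, the column names `ad_spend` and `revenue`, and the helper functions are assumptions made for illustration, not part of any process described in this paper.

```python
import math
import statistics

# Hypothetical toy dataset; column names and values are illustrative only.
data = {
    "ad_spend": [10.0, 20.0, 30.0, 40.0, 50.0],
    "revenue": [25.0, 45.0, 65.0, 85.0, 105.0],
}


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


summary = {name: describe(values) for name, values in data.items()}
r = pearson(data["ad_spend"], data["revenue"])
print(summary["ad_spend"])
print(f"correlation: {r:.3f}")
```

Output like this is only a starting point: a strong correlation in the toy data says nothing about why the two variables move together, which is where knowledge of the research question becomes necessary.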
Perform in-depth analysis
Once the data has been explored, the process typically turns to either further in-depth analysis or model building, depending on the scope of the project.
A common pitfall at this point occurs when a data scientist becomes so engrossed in the project that they lose sight of the end goal and either get stuck down a rabbit hole or produce an outcome that is not immediately actionable or valuable to the project’s stakeholders.
Click here to continue reading Rachel Wood’s article.