A plain outline of CRISP-DM, an iterative methodology for the execution of data science projects.
CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It was conceived in 1996 and developed by an industry consortium with European Union funding, as an initiative to define the phases of a data mining project, their respective tasks, and the relationships between these tasks.
Hence, it provides a standardized description of the life cycle of a typical data analysis project, analogous to the software development life cycle models used in software engineering.
As it was developed in the 90’s, it is modeled as an iterative and adaptive approach, whose phases are detailed in the diagram below:
The initial phase focuses on understanding the project objectives and the business domain, as with most software methodologies created in the 90’s, like the Unified Process.
The first goal of a data scientist is to understand, from a business perspective, what the client really wants to achieve. Often the client has many competing goals and constraints that must be properly balanced. Therefore, the data scientist must uncover important factors, understand the extent of the problem, determine measurable goals and even assess whether data science is a feasible solution.
A possible consequence of neglecting this step is spending a great amount of effort trying to produce answers to incorrect questions.
The data understanding phase begins with the initial data collection and continues with activities to become familiar with the data, identify quality problems, discover preliminary knowledge about the data, and/or discover interesting subsets to form hypotheses about the hidden information.
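As a rough illustration of what "becoming familiar with the data" can look like in practice, the sketch below profiles a small set of records: it counts missing values, distinct values and numeric ranges per column. The records, the column names and the `profile` helper are all hypothetical and only serve the example; a real project would typically rely on a dedicated data analysis library.

```python
# Hypothetical raw records; None marks a missing value.
records = [
    {"age": 34, "country": "ES", "income": 28000},
    {"age": None, "country": "ES", "income": 31000},
    {"age": 51, "country": "FR", "income": None},
    {"age": 29, "country": None, "income": 45000},
    {"age": 340, "country": "FR", "income": 39000},  # suspicious age
]


def profile(rows):
    """Return, per column, the missing count, distinct count and numeric range."""
    columns = sorted({key for row in rows for key in row})
    summary = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        numeric = [v for v in present if isinstance(v, (int, float))]
        summary[col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            # Non-numeric columns have no range.
            "range": (min(numeric), max(numeric)) if numeric else None,
        }
    return summary


report = profile(records)
for column, stats in report.items():
    print(column, stats)
```

Even a summary this simple surfaces the kind of quality problem this phase is meant to catch: the maximum of the `age` column is clearly not a plausible value.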
The data preparation phase covers all the activities necessary to build the final data set (the data that will be fed into the modeling tools) from the initial raw data. Tasks include selecting tables, records, and attributes, as well as transforming and cleaning the data for the modeling tools.
Depending on the quality of the data, it might be one of the longer phases of the project.
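The tasks named above (attribute selection, cleaning, transformation) can be sketched as follows. This is a minimal example under stated assumptions: the records, the valid ranges and the `prepare` helper are hypothetical, out-of-range values are treated as missing, and missing values are filled with the column median, which is only one of many possible imputation choices.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
raw = [
    {"age": 34, "country": "ES", "income": 28000},
    {"age": None, "country": "ES", "income": 31000},
    {"age": 51, "country": "FR", "income": None},
    {"age": 29, "country": None, "income": 45000},
    {"age": 340, "country": "FR", "income": 39000},
]


def prepare(rows, valid_ranges):
    """Build a modeling data set from raw rows.

    valid_ranges maps each numeric attribute to keep to its (low, high) bounds.
    Assumes every kept attribute has at least one valid observation.
    """
    # Attribute selection: keep only the attributes listed in valid_ranges.
    selected = [{a: row.get(a) for a in valid_ranges} for row in rows]

    for attribute, (low, high) in valid_ranges.items():
        # Cleaning: treat out-of-range values as missing.
        for row in selected:
            value = row[attribute]
            if value is not None and not (low <= value <= high):
                row[attribute] = None
        # Transformation: fill missing values with the column median.
        observed = [r[attribute] for r in selected if r[attribute] is not None]
        fill = median(observed)
        for row in selected:
            if row[attribute] is None:
                row[attribute] = fill
    return selected


prepared = prepare(raw, {"age": (0, 120), "income": (0, 1_000_000)})
for row in prepared:
    print(row)
```

The function returns new dictionaries instead of modifying the raw records, so the raw data stays available if the preparation rules need to change in a later iteration.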
Modeling is often seen as the most glamorous part of a data science project, since it produces the models that are expected to meet the project objectives.
In this phase, the modeling techniques that are relevant to the problem are selected and applied (the more the better), and their parameters are calibrated to optimal values. There are typically several techniques for the same type of data science problem, although some of them have specific requirements on the form of the data.
Therefore, in most projects you end up going back to the data preparation phase in order to add new features or fix some data issues.
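To make the idea of applying several techniques and calibrating their parameters more concrete, the sketch below compares a majority-class baseline with a k-nearest-neighbours classifier for several values of k, and keeps the candidate with the best accuracy on a held-out validation set. The data points, the candidate list and all function names are hypothetical; this is a toy illustration, not a recommended modeling workflow.

```python
# Hypothetical one-dimensional data: (feature, label) pairs.
train = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 1), (3.0, 0),
         (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]
validation = [(1.2, 0), (2.4, 0), (2.8, 0), (6.2, 1), (7.2, 1)]


def knn_predict(points, x, k):
    """Predict the label of x by majority vote among its k nearest points."""
    neighbours = sorted(points, key=lambda point: abs(point[0] - x))[:k]
    votes = [label for _, label in neighbours]
    return max(set(votes), key=votes.count)


def majority_baseline(points):
    """Return a predictor that always answers the most frequent training label."""
    labels = [label for _, label in points]
    winner = max(set(labels), key=labels.count)
    return lambda x: winner


def accuracy(predict, data):
    """Fraction of examples in data that predict labels correctly."""
    hits = sum(1 for x, label in data if predict(x) == label)
    return hits / len(data)


# Candidate techniques; odd values of k avoid tied votes with two labels.
candidates = {"baseline": majority_baseline(train)}
for k in (1, 3, 5):
    candidates[f"knn(k={k})"] = lambda x, k=k: knn_predict(train, x, k)

scores = {name: accuracy(fn, validation) for name, fn in candidates.items()}
best = max(scores, key=scores.get)  # first candidate with the top score

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
print("selected:", best)
```

The parameter is calibrated on data that was not used for fitting, so that the comparison between candidates is not biased towards the most flexible one.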
At this stage in the project, one or more models have been built that appear to be of sufficient quality from a data analysis perspective.
Before proceeding to the final deployment of the model, it is important to thoroughly evaluate it and review the steps executed to create it, as well as to compare the obtained model with the business objectives. A key objective is to determine whether there are any important business issues that have not been sufficiently considered. At the end of this phase, a decision on how to use the results of the data analysis process should be reached.
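One way to compare a model with a business objective, rather than only with a technical metric, is sketched below. It assumes a hypothetical campaign in which every customer flagged by the model is contacted at a fixed cost and every correctly flagged customer yields a fixed value; the labels, the monetary figures and the target are invented for the example.

```python
# Hypothetical evaluation data: 1 means the customer responds to the offer.
actual = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

# Hypothetical business parameters.
VALUE_PER_RESPONSE = 100   # value of each correctly targeted customer
COST_PER_CONTACT = 20      # cost of contacting any flagged customer
TARGET_NET_BENEFIT = 200   # measurable goal agreed with the client


def confusion_counts(truth, guesses):
    """Count true/false positives and negatives for binary labels."""
    pairs = list(zip(truth, guesses))
    return {
        "tp": sum(1 for a, p in pairs if a == 1 and p == 1),
        "fp": sum(1 for a, p in pairs if a == 0 and p == 1),
        "fn": sum(1 for a, p in pairs if a == 1 and p == 0),
        "tn": sum(1 for a, p in pairs if a == 0 and p == 0),
    }


def net_benefit(counts):
    """Value of correct contacts minus the cost of all contacts made."""
    contacted = counts["tp"] + counts["fp"]
    return counts["tp"] * VALUE_PER_RESPONSE - contacted * COST_PER_CONTACT


counts = confusion_counts(actual, predicted)
model_accuracy = (counts["tp"] + counts["tn"]) / len(actual)
benefit = net_benefit(counts)
decision = "deploy" if benefit >= TARGET_NET_BENEFIT else "iterate"

print("accuracy:", model_accuracy)
print("net benefit:", benefit)
print("decision:", decision)
```

The accuracy alone says nothing about whether the campaign is worthwhile; the decision at the end of the phase is taken against the goal that was fixed during business understanding.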
Generally, the creation of the model is not the end of the project. Even if the objective of the model is to increase the knowledge of the data, the knowledge obtained will have to be organized and presented so that the client can use it.
Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as setting up a periodic, or even automated, data analysis process in an organization.
Even though CRISP-DM includes a deployment phase, it does not include all the activities required to maintain a product. In particular, it lacks specific processes for data governance, such as data collection, storage, versioning, security and privacy, including regulatory and legal compliance, which in the case of personal data has become fundamental.
Another implicit assumption of the methodology is the availability of quality data, which understates the activities required to obtain it. In the 90’s, data projects were aimed at achieving specific objectives that were well defined from the beginning. This is no longer the case. Nowadays, with the availability of large volumes of data, the process of data collection and validation becomes quite a challenge.
There are also no phases specific to the software development of the product, such as aspects related to the software architecture or the interaction with the user. In general, the development of a data-based product requires incorporating methodological aspects related to software development beyond data analysis and modeling.
To summarize, CRISP-DM provides a uniform framework, with guidelines and phases to deal with different business problems. To that effect, we can interpret CRISP-DM as an adaptation of traditional project management methodologies to the context of Data Science.
With this knowledge, we can determine whether a new project to be carried out can be tackled using this traditional approach or if a more agile methodology is preferable.