In this blog post, we introduce two well-known Data Science project management methodologies: CRISP-DM (Cross-Industry Standard Process for Data Mining) and Microsoft TDSP (Team Data Science Process). Here at Skim Technologies, we adopted TDSP as our guiding Data Science methodology to help us build great products for our clients, because it places more emphasis on client satisfaction.
The first version of CRISP-DM was proposed in 1999 as the result of a concerted effort to identify and set out industry guidelines for the data mining process. Since then, several refinements and extensions have been proposed. As of 2014, CRISP-DM was the most widely used methodology for analytics, data mining, and data science projects. In October 2016, Microsoft introduced a new data science methodology called TDSP, which leverages well-known frameworks and tools such as git version control.
Both methodologies aim to give Data Science teams a systematic approach, built on the industry’s best practices, to guide and structure Data Science projects, improve team collaboration, enhance learning and, ultimately, ensure high-quality, efficient results throughout the development and delivery of data-driven solutions.
Cross-Industry Standard Process for Data Mining (CRISP-DM)
CRISP-DM is a 6-step planning methodology, with each step comprising a sequence of events. As represented in the image below, some of the steps are iterative, often requiring returning to previous tasks. This reflects the non-linear data science workflow.
The six steps represented here are:
- Business Understanding: focuses on understanding the project objectives and requirements from a business perspective, and then translating this information into a Data Science problem definition.
- Data Understanding: focuses on collecting and becoming familiar with the data; this helps to identify data quality problems, discover first insights into the data and form hypotheses.
- Data Preparation: aims to transform the raw data into a final dataset that can be used as input to modelling techniques (e.g., Machine Learning algorithms).
- Modelling: involves applying different modelling techniques to the dataset in order to generate a set of candidate models.
- Evaluation: once the models have been built, they need to be tested to ensure they generalise to unseen data and that all key business objectives have been considered (e.g., the final model needs to be fair, human-interpretable, and achieve an accuracy X% higher than the client’s current solution). The outcome of this stage is the champion model.
- Deployment: the champion model is deployed into production so it can be used to make predictions on unseen data. All the data preparation steps are included so that the model will treat the new raw data in the same way as during the model development.
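To make the Deployment point more concrete, here is a minimal sketch of bundling the data preparation steps with the model, so that new raw data is treated exactly as it was during development. It is an illustration only and not part of CRISP-DM itself: the names `FittedPipeline` and `fit_pipeline`, the standardisation step and the toy threshold "model" are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Sequence


@dataclass
class FittedPipeline:
    """Hypothetical bundle of fitted preparation parameters and a toy model."""
    center: float      # mean learned from the training data
    scale: float       # standard deviation learned from the training data
    threshold: float   # decision threshold in the prepared (scaled) space

    def prepare(self, raw: Sequence[float]) -> List[float]:
        # Re-apply the preparation learned at training time; nothing is re-fitted.
        return [(x - self.center) / self.scale for x in raw]

    def predict(self, raw: Sequence[float]) -> List[int]:
        # Raw input always goes through prepare() before reaching the model.
        return [1 if z > self.threshold else 0 for z in self.prepare(raw)]


def fit_pipeline(raw_train: Sequence[float], labels: Sequence[int]) -> FittedPipeline:
    """Fit the preparation step and the toy model together (assumes both classes occur)."""
    center = mean(raw_train)
    scale = pstdev(raw_train) or 1.0  # guard against a zero standard deviation
    prepared = [(x - center) / scale for x in raw_train]
    positives = [z for z, y in zip(prepared, labels) if y == 1]
    negatives = [z for z, y in zip(prepared, labels) if y == 0]
    # Toy "model": a threshold halfway between the two class means.
    threshold = (mean(positives) + mean(negatives)) / 2
    return FittedPipeline(center, scale, threshold)


# Development: fit on raw training data.
pipeline = fit_pipeline([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0, 0, 0, 1, 1, 1])

# Production: new raw data is passed to the same object, untouched.
predictions = pipeline.predict([2.0, 11.0])
```

Because the preparation parameters travel with the model in a single object, there is no separate, hand-maintained preprocessing script that could drift away from what was used in development.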
Team Data Science Process (TDSP)
In October 2016, Microsoft introduced the Team Data Science Process as an Agile, iterative Data Science methodology built on the best practices of Microsoft and other companies, in order to facilitate the successful implementation of Data Science projects.
The process comprises four key components:
- Data Science Lifecycle definition
- Standardized project structure
- Infrastructure and resources for Data Science projects
- Tools and utilities necessary for the project execution
In this blog post, we give an overview of the first component: the data science lifecycle.
Data Science Lifecycle
TDSP provides a lifecycle to structure the development of data science projects, outlining all the steps that are usually taken when executing a project. Due to the R&D nature of Data Science projects,
The TDSP lifecycle is made up of five stages:
- Business Understanding
- Data Acquisition & Understanding
- Modelling
- Deployment
- Customer Acceptance
1. Business Understanding: this stage involves identifying the business problem, defining the business goals and identifying the key business variables the analysis needs to predict. The metrics that will be used to assess the success of the project are also defined in this stage. Another important step is surveying the available data sources and understanding the kind of data that is relevant for answering the questions underlying the project goals. This analysis helps determine whether data collection or additional data sources will be needed.
2. Data Acquisition and Understanding: since data is the key ingredient of any data science project, the second stage revolves around data. It is essential to assess the current state of the data (how messy and unreliable is it?), its size and its quality before moving on to the modelling stage. In this stage, the data is explored, preprocessed and cleaned. This is essential not only to help data scientists build an initial understanding of the data, but also to avoid propagating errors downstream and to increase the chances of obtaining a reliable and accurate model. This stage also aims to find patterns in the data that guide the choice of the most appropriate modelling techniques. At the end of this stage, the data scientists usually have a better idea of whether the existing data is sufficient, whether they need to find new data sources to augment the initial dataset, and whether the data is appropriate to help answer the questions underlying the project goals.
3. Modelling: in this stage, feature engineering is performed on the cleaned dataset in order to generate a new, improved dataset that facilitates model training. Feature engineering usually relies on the insights obtained from the data exploration step and on the domain expertise of the data scientist. After ensuring the dataset consists of (mostly) informative features, several models are trained and evaluated, and the best one is selected to be deployed.
4. Deployment: this stage involves deploying the data pipeline and the winning model to
5. Customer Acceptance: the last stage of TDSP, for which no CRISP-DM equivalent is available, is customer acceptance. This involves two important tasks: (i) system validation and (ii) project hand-off. The goal of system validation is to confirm that the deployed model meets the client’s needs and expectations, whereas project hand-off involves handing the project over to the person responsible for running the system in production, as well as delivering any project reports and documentation.
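As a rough illustration of how the Modelling and Customer Acceptance stages connect, the sketch below scores several candidate models on held-out data, picks the best one, and then checks it against a success metric of the kind agreed in the Business Understanding stage. Everything here is hypothetical and chosen for the example: the function names, the toy threshold models, the use of accuracy as the metric, and the baseline and required gain figures. TDSP does not prescribe any of them.

```python
from typing import Callable, Dict, Sequence, Tuple

# A "model" here is simply a function from one feature value to a 0/1 label.
Model = Callable[[float], int]


def accuracy(model: Model, features: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of held-out examples the model labels correctly."""
    correct = sum(1 for x, y in zip(features, labels) if model(x) == y)
    return correct / len(labels)


def select_champion(
    candidates: Dict[str, Model],
    features: Sequence[float],
    labels: Sequence[int],
) -> Tuple[str, float]:
    """Score every candidate on the same held-out data and return the best."""
    scores = {name: accuracy(m, features, labels) for name, m in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


def meets_success_metric(score: float, baseline: float, required_gain: float) -> bool:
    """System validation check: does the model beat the client's baseline by the agreed margin?"""
    return score >= baseline + required_gain


# Hypothetical held-out data and candidate models.
holdout_features = [1.0, 4.0, 6.0, 9.0]
holdout_labels = [0, 0, 1, 1]
candidates: Dict[str, Model] = {
    "threshold_3": lambda x: 1 if x > 3 else 0,
    "threshold_5": lambda x: 1 if x > 5 else 0,
    "always_one": lambda x: 1,
}

champion_name, champion_score = select_champion(candidates, holdout_features, holdout_labels)

# Assumed success metric: the client's current solution reaches 0.80 accuracy
# and the agreed target is an improvement of at least 0.10.
accepted = meets_success_metric(champion_score, baseline=0.80, required_gain=0.10)
```

Agreeing on the baseline and the required gain up front keeps the acceptance decision from becoming subjective at hand-off time.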
Data Science projects at Skim
At Skim we chose the TDSP
We combine this with other agile methodologies such as Kanban, and we constantly iterate and improve on our approach to ensure we deliver excellence in each of our projects.