Written by Vectice

The Ultimate Guide to Data Science Frameworks

10 min read

Data science is an ever-evolving field that uses various tools and techniques to extract meaningful insights from data. As organizations rely more heavily on their data to make decisions, having a framework in place to aid with the analysis has become increasingly important.

In this ultimate guide, we will take a look at some of the most popular data science frameworks that are widely used in the industry. We will discuss the CRISP-DM, KDD, and TDSP frameworks and explore how they can help to make data science a more effective tool for business decision-making. By the end of this guide, you should have a better understanding of which framework best suits your organization's needs. So let's get started!

1. Cross-Industry Standard Process for Data Mining

The Cross-Industry Standard Process for Data Mining (CRISP-DM) is the most widely used process model for structuring a data science project. Its six stages are:

1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

Let’s say we are a travel agency with thousands of daily customer service queries. We are building a chatbot to resolve simple queries without human input. After each solution provided by the chatbot, we send a quick survey asking the customer to rate their satisfaction from 1 to 10. If the rating is lower than 8, customers can provide written feedback. The six stages would be defined as follows:

Business Understanding:
We want to build and improve a chatbot that automatically resolves customer service queries.

Note: defining the business problem first is essential to having a successful analytics project. As Zipporah Luna writes, “not knowing how to frame the business problem is a problem itself”. She adds that some data science teams will start a project by focusing on the data set and what model to use, without addressing the original business need.

Data Understanding:
We’ll collect data from survey responses. All data will be organized by the category of issue that was resolved, along with a score indicating how satisfied the customer was with the provided solution.

Data Preparation:
Each data record contains three values: the category of issue, a numerical rating of the customer’s satisfaction, and possibly a text comment. The first two are easy to store and access, and require minimal preparation and cleaning. The written feedback will be used to assess outliers and improve the chatbot.
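To make this concrete, here is a minimal sketch of what that preparation could look like in pandas. The file and column names (issue_category, satisfaction, comment) are assumptions for illustration, not part of the scenario above.

```python
import pandas as pd

# Hypothetical survey export; file and column names are illustrative assumptions.
df = pd.read_csv("chatbot_survey_responses.csv")

# Keep only the three fields described above.
df = df[["issue_category", "satisfaction", "comment"]]

# Drop records without a usable rating and keep scores on the 1-10 scale.
df = df.dropna(subset=["satisfaction"])
df["satisfaction"] = df["satisfaction"].astype(int).clip(1, 10)

# Written feedback is optional, so normalize missing comments to empty strings.
df["comment"] = df["comment"].fillna("")
```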

Modeling:
The model we’re building is the chatbot: an automated service that can direct customers to the right support articles or execute simple tasks like issuing refunds or updating a case status. This model will have to be continuously updated based on the feedback we’re collecting from customers. This is what makes our project iterative.
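The guide does not prescribe a particular model, but one simple way to illustrate the “direct customers to the right support article” part is an intent classifier. The sketch below uses scikit-learn with invented queries and intent labels; a production chatbot would be far more involved.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy customer queries and the intent each maps to (all invented).
queries = [
    "I want my money back for the cancelled flight",
    "How do I change the date of my booking?",
    "What is the status of my refund request?",
    "Can I add extra luggage to my reservation?",
]
intents = ["refund", "booking_change", "case_status", "booking_change"]

# TF-IDF features plus a linear classifier: a deliberately simple baseline.
router = make_pipeline(TfidfVectorizer(), LogisticRegression())
router.fit(queries, intents)

print(router.predict(["Where is my refund?"]))  # maps a new query to an intent
```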

Evaluation:
We can plot the satisfaction ratings over time and evaluate whether the quality of our chatbot improves or declines. We can pinpoint which issues are poorly solved by our chatbot by isolating negative ratings, evaluating written feedback, and making improvements.
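A minimal sketch of this evaluation, assuming the records prepared earlier plus a hypothetical date column for each survey response:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumes the records prepared earlier, plus a hypothetical "date" column.
df = pd.read_csv("chatbot_survey_responses.csv", parse_dates=["date"])

# Average satisfaction per week: is chatbot quality improving or declining?
weekly = df.set_index("date")["satisfaction"].resample("W").mean()
weekly.plot(title="Average satisfaction per week")
plt.show()

# Pinpoint poorly solved issues by isolating negative ratings per category.
negative = df[df["satisfaction"] < 8]
print(negative["issue_category"].value_counts())
```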

Deployment:
The chatbot should be deployed once we have a basic system that works without producing errors, and it should be continuously improved to keep serving the original business need. This is what iteration means. Let’s talk about iteration in more detail.

(Diagram: the arrow from Evaluation back to Business Understanding enables iteration. Source: https://www.datascience-pm.com/crisp-dm-2/)

Rajiv Shah writes that “the iterative approach in data science starts with emphasizing the importance of getting to a first model quickly, rather than starting with the variables and features. Once the first model is built, the work then steadily focuses on continual improvement.”

Iteration is a common approach in computer science methodologies and has multiple benefits:
- Iteration allows for continuous improvement of a product or service.
- Iteration also helps to stay relevant in a dynamic environment.
- Iteration lets developers fine-tune models before deploying them on a massive scale.

Machine learning is an inherently iterative process, and understanding this concept will help you grasp many fundamental techniques of ML and AI.

Niwratti Kasture emphasizes that “you are never guaranteed to have a perfect model until it has gone through a significant amount of iterations to learn the various scenarios from the data.” He adds the nuance that a “perfect model” doesn’t exist; instead, models should go through the machine learning cycle until a desired confidence level is achieved.
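As a rough illustration of that cycle, the sketch below retrains a model in a loop and stops once it reaches a target score or a maximum number of iterations. The data, model choice, and threshold are all placeholders; in practice each iteration would involve new features, new data, or tuning rather than simply growing the model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

target_score = 0.90  # the "desired confidence level" (an assumed threshold)
n_trees = 10

# Iterate: adjust the model each round and re-evaluate until it is good enough.
for iteration in range(10):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    model.fit(X_train, y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    print(f"iteration {iteration}: accuracy = {score:.3f}")
    if score >= target_score:
        break
    n_trees += 10  # stand-in for real improvement work (features, data, tuning)
```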

2. Knowledge Discovery in Databases

Our second framework is Knowledge Discovery in Databases (KDD). While CRISP-DM encompasses the entire data science lifecycle, KDD focuses on the data mining stage. It comprises five stages:

1. Data Selection
2. Data Pre-Processing
3. Data Transformation
4. Data Mining
5. Evaluation

Rashmi Karan writes: “The purpose of KDD is the interpretation of patterns, models, and a deep analysis of the information that an organization has gathered to make better decisions.” In other words, we want to perform statistical analysis on large data sets to gain actionable insights.

Let’s say we’re a data science team working for a health insurance company in North America. We have access to a huge dataset of customer records and we’re asked to segment them.

By dividing customers into groups, we can offer tailored insurance packages. We can create targeted advertising to reach the right audiences. Customer segmentation also allows us to correlate different health categories with associated risks.

Since this task is mainly focused on data mining, we will use the KDD methodology. We will use our hypothetical scenario to explain each stage:

Data Selection:
Since we have access to massive customer records, we can select the data points most relevant for the segmentations we want to create. Possible factors include age, sex, weight, profession, income, medical history, etc.
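In pandas, this stage can be as simple as selecting the relevant columns from the full customer table. The file and column names below are invented for illustration:

```python
import pandas as pd

# Hypothetical full customer table; names are illustrative assumptions.
customers = pd.read_csv("customer_records.csv")

# Keep only the factors we consider relevant for segmentation.
features = ["age", "sex", "weight", "profession", "income", "medical_history"]
selected = customers[features].copy()
```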

Data Pre-Processing:
Before we can segment customers, we must pre-process our records by formatting them in a coherent structure and fixing inaccuracies and incorrect values.
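A small sketch of the kind of cleanup this stage involves, using a toy table with invented values:

```python
import pandas as pd

# Toy records showing the kinds of problems we need to fix (values invented).
records = pd.DataFrame({
    "age": [34, -1, 52, 41],
    "sex": [" Female", "male", "MALE", "female "],
    "income": ["52000", "61,500", None, "48000"],
})

# Enforce a coherent structure: consistent casing and numeric types.
records["sex"] = records["sex"].str.strip().str.lower()
records["income"] = pd.to_numeric(records["income"].str.replace(",", ""), errors="coerce")

# Remove inaccurate or unrecoverable values.
records = records[records["age"].between(0, 120)].dropna(subset=["income"])
```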

Data Transformation:
Since we are dealing with many variables, we might need to perform transformations such as dimensionality reduction. According to Nilesh Barla: “dimensionality reduction is reducing the number of features in a dataset.” This is important because a model with too many features becomes too complex to work with. Nilesh continues: “the higher the number of features, the more difficult it is to model them. This is known as the curse of dimensionality”.
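Principal component analysis (PCA) is one common way to perform this reduction. A minimal sketch, with a random matrix standing in for the encoded customer features:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for encoded customer features (rows = customers, columns = features).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 25))

# Scale the features, then project them onto a handful of components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                      # (1000, 5)
print(pca.explained_variance_ratio_.sum())  # share of variance the 5 components retain
```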

Data Mining:
Once we have collected, processed, and transformed our data points, we can start with the most important stage: data mining. This is where we extract insights from our data set. One powerful technique that helps with segmentation is known as “clustering”.

“Clustering is an unsupervised data mining (machine learning) technique used for grouping the data elements without advance knowledge of the group definitions,” writes Srinivasan Sundararajan. He adds that “these groupings are useful for exploring data, identifying anomalies in the data, and creating predictions.” This holds whether you’re working in healthcare, banking, e-commerce, gaming, or another industry.
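KDD does not prescribe a particular algorithm; k-means is one common choice for this kind of segmentation. A minimal sketch, with stand-in features and an arbitrarily chosen number of clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the reduced customer features from the previous stage.
rng = np.random.default_rng(0)
X_reduced = rng.normal(size=(1000, 5))

# Group customers into k segments without any predefined labels.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
segments = kmeans.fit_predict(X_reduced)

print(np.bincount(segments))  # how many customers fall into each segment
```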


Evaluation:
Once we have clustered our customer records, we can evaluate whether we are satisfied with the number of groups and the types of features we used to segment them. Perhaps we use one segmentation to advertise new insurance packages, but very few people respond. In that case, we can create new segmentations and evaluate whether they generate more responses to our advertising campaign.
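One way to make “are we satisfied with the number of groups” measurable is to compare a clustering metric, such as the silhouette score, across different values of k. A sketch, reusing stand-in features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X_reduced = rng.normal(size=(1000, 5))  # stand-in customer features

# Compare segmentations with different numbers of groups; higher scores are better.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_reduced)
    print(k, round(silhouette_score(X_reduced, labels), 3))
```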

KDD is a powerful methodology for data mining projects and covers every stage in detail. While it’s more oriented towards data engineers, a data science team must also include experts in mathematics, computer science, and business. Our next methodology helps to define team roles.

3. Team Data Science Process (TDSP)

We already covered the data science process in detail, so let’s talk about data team roles. The Team Data Science Process (TDSP) is a methodology created by Microsoft. It’s based on CRISP-DM, with an additional focus on team responsibilities. Let’s take a closer look.

Data science team roles fall within three categories:

1. Mathematics/Statistics
2. Computer Science
3. Business Domain Knowledge

Every data science project should contain a good balance between these domains. Projects usually start from a business need; mathematicians and statisticians design a model, and computer engineers build it, deploy it, and feed the results back to the business experts.

(Diagram source: https://www.datascience-pm.com/tdsp/)

Let’s define a few roles and responsibilities:

Data Engineer:
According to Saurav Dhungana, data engineers are often the first people to start the project. They are “generally someone with good programming and hardware skills, and can build your data infrastructure.” Once this infrastructure is built, raw data will flow to the data analysts.

Data Analyst:
Data analysts are mostly responsible for the technical analysis of data sets. They will be working hands-on with data to collect, process and transform it into usable records. Data analysts will then use a combination of programming, statistics and machine learning to derive actionable insights.

Sara Metwalli adds: “they are often in charge of preparing the data for communication with the project's business side by preparing reports”. This demonstrates that different roles must be comfortable across disciplines, regardless of their specialization.

Business Analyst:
Business analysts will study the results from data analysts and recommend the best decisions to maximize business targets. They serve as a bridge between the engineers and stakeholders and are important communicators during a project.

Database Administrator:
These are engineers responsible for maintaining, securing, and providing access to databases. Responsibilities include creating backups and recovery tools, handling security and authentication, and performing troubleshooting and maintenance. The number of database administrators will grow as the company scales, or the role may even be outsourced to a database provider.

Data Scientist:
This is the broadest role and encompasses the entire spectrum of a data project. Data scientists have a wide understanding of every aspect of the job, and they are often tasked with leading a team. Saurav Dhungana adds that they will collaborate with domain experts to deliver results to stakeholders, which is the ultimate goal of a data project.

Machine Learning Engineer:
ML engineers will implement machine learning algorithms such as classification, regression, and clustering. The main difference from the techniques used by data analysts is that ML algorithms allow the computer to learn over time, thus improving its performance. Sara Metwalli continues: “machine learning engineers need to have strong statistics and programming skills in addition to some knowledge of the fundamentals of software engineering”.

These are some of the most common roles working on data projects. As Jyosmitha Munnangi notes in her article, some companies might require you to have expertise in all domains. This is particularly true for startups with fewer resources and smaller teams. As companies expand, teams can grow to include more diverse skillsets for specific problems.

So if you’re wondering which type of role you should focus on, we recommend the following: specialize in something you’re naturally good at and acquire a general knowledge of other roles. This will make you invaluable as an employee at both startups and large organizations!

In this guide, we discussed the three most popular data science frameworks found in the industry: CRISP-DM, KDD, and TDSP. We explored how these frameworks can help organizations make better decisions by providing guidance throughout the data analysis process. We also examined how each framework works and which one is best suited to different organizations' needs.

Want to explore other frameworks?

Check out Deliver AI Faster With The 2x3 Methodology written by Colleen Qiu, former VP of data science at Tesla and Metromile.

Learn More About Vectice

SOURCES:

To learn more about CRISP-DM and the iterative process in data science:
Zipporah Luna: CRISP-DM Phase 1: Business Understanding
Rajiv Shah: Measure Once, Cut Twice: Moving Towards Iteration in Data Science
Niwratti Kasture: Machine Learning — Why it is an iterative process?


To learn more about the KDD methodology:
Srinivasan Sundararajan: Patient Segmentation Using Data Mining Techniques
Rashmi Karan: Knowledge Discovery in Databases (KDD) in Data Mining
Nilesh Barla: Dimensionality Reduction for Machine Learning

To learn more about TDSP and different roles in data teams:
Saurav Dhungana: On Building Effective Data Science Teams
Sara Metwalli: 10 Different Data Science Job Titles and What They Mean
Jyosmitha Munnangi: What are 12 different job roles & responsibilities in Data Science?