Assessing data quality (KPMG Virtual Internship - Part 1)
Sprocket Central Pty Ltd, a medium-sized bikes and cycling accessories organisation, has approached Tony Smith (Partner) in KPMG’s Lighthouse & Innovation Team. Sprocket Central Pty Ltd is keen to learn more about KPMG’s expertise in its Analytics, Information & Modelling team.
Smith discusses KPMG’s expertise in this space. In particular, he explains how the team can effectively analyse the datasets to help Sprocket Central Pty Ltd grow its business.
Primarily, Sprocket Central Pty Ltd needs help with its customer and transactions data. The organisation has a large dataset relating to its customers, but its team is unsure how to analyse it effectively to help optimise its marketing strategy.
However, to support the analysis, you speak to the Associate Director for some ideas, and she advises: “The importance of optimising the quality of customer datasets cannot be overstated. The better the quality of the dataset, the better chance you will be able to use it to drive company growth.”
The client provided KPMG with 3 datasets:
- Customer Demographic
- Customer Addresses
- Transactions data for the past 3 months
You decide to start the preliminary data exploration and identify ways to improve the quality of Sprocket Central Pty Ltd’s data.
I- The task at hand
You arrive at your desk after the initial client meeting. You have a voicemail on your phone which contains the following instructions.
“Hi there – Welcome again to the team! The client has asked our team to assess the quality of their data, as well as make recommendations on ways to clean the underlying data and mitigate these issues. Can you please take a look at the datasets we’ve received and draft an email to them identifying the data quality issues and how these may impact our analysis going forward?
I will send through an example of a typical data quality framework that can be used as a guide. Remember to consider the join keys between the tables too. Thanks again for your help.”
Subsequently, you receive the email below:
Hi there,
As per voicemail, please find the 3 datasets attached from Sprocket Central Pty Ltd:
- Customer Demographic
- Customer Addresses
- Transactions data for the past three months

Can you please review the data quality to ensure that it is ready for our analysis in phase two? Remember to take note of any assumptions or issues we need to go back to the client on, as well as recommendations going forward to mitigate current data quality concerns.
I’ve also attached a data quality framework as a guideline. Let me know if you have any questions.
Thanks for your help.
Kind Regards
Your Manager
Hence the task at hand is to draft an email to the client identifying the data quality issues and strategies to mitigate these issues. Refer to ‘Data Quality Framework Table’ and resources below for criteria and dimensions which you should consider.
You may use programs like Excel, Google Sheets, Tableau, or Power BI to start. Feel free to use Python, R, MATLAB, or any other data analytics tools you know.
II- Import the necessary packages
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

# Load the four sheets from the client workbook. The header offsets reflect
# each sheet's layout; the Transactions offset is revisited in section 3.
demographics = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx', sheet_name = "CustomerDemographic", header=0)
addresses = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx', sheet_name = "CustomerAddress",header = 1)
transactions = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx', sheet_name = "Transactions", header = 1)
cust_list = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx', sheet_name = "NewCustomerList", header = 1)
C:\Users\fawaz\AppData\Local\Temp\ipykernel_17128\2520818905.py:1: FutureWarning: Inferring datetime64[ns] from data containing strings is deprecated and will be removed in a future version. To retain the old behavior explicitly pass Series(data, dtype=datetime64[ns])
demographics = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx', sheet_name = "CustomerDemographic", header=0)
C:\Users\fawaz\AppData\Local\Temp\ipykernel_17128\2520818905.py:4: FutureWarning: Inferring datetime64[ns] from data containing strings is deprecated and will be removed in a future version. To retain the old behavior explicitly pass Series(data, dtype=datetime64[ns])
cust_list = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx', sheet_name = "NewCustomerList", header = 1)
Now that all the data is loaded into pandas DataFrames, we will examine each one through the lens of data quality to uncover any issues. These issues may fall under the following dimensions (a small helper that bundles the recurring checks follows the list):
- Accuracy: Correct values
- Completeness: Data fields with values
- Consistency: Values free from contradictions
- Currency: Values up to date
- Relevancy: Data items with value metadata
- Validity: Data containing allowable Values
- Uniqueness: Records free from duplication
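Since the same completeness and uniqueness checks recur for each table, they can be wrapped in a small helper (a sketch that simply bundles the pandas calls used throughout this notebook):

```python
def quality_summary(df):
    """Print the duplicate-row count and per-column missing counts/percentages."""
    print("Number of duplicate records:", df.duplicated().sum())
    missing = df.isna().sum()
    print(pd.DataFrame({
        'missing': missing,
        'missing_%': (missing / len(df) * 100).round(2),
    }))
```

Each section below spells the checks out explicitly, but `quality_summary(demographics)` would produce the same figures.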
1- Demographics DataFrame
Let’s first generate a data dictionary to clarify the column names and data types.
column_names = list(demographics.columns)
data_types = demographics.dtypes.tolist()
description = [""] * len(column_names) # Add descriptions for each column if available
# Create the data dictionary DataFrame
data_dictionary = pd.DataFrame({
'Column Name': column_names,
'Data Type': data_types,
'Description': description
})
# Display the data dictionary
print(data_dictionary)
Column Name Data Type Description
0 customer_id int64
1 first_name object
2 last_name object
3 gender object
4 past_3_years_bike_related_purchases int64
5 DOB datetime64[ns]
6 job_title object
7 job_industry_category object
8 wealth_segment object
9 deceased_indicator object
10 default object
11 owns_car object
12 tenure float64
column_names = list(demographics.columns)
# Display the column names
print("Column names:", column_names)
Column names: ['customer_id', 'first_name', 'last_name', 'gender', 'past_3_years_bike_related_purchases', 'DOB', 'job_title', 'job_industry_category', 'wealth_segment', 'deceased_indicator', 'default', 'owns_car', 'tenure']
Let’s display the top rows of the data frame so we can have a quick visual assessment of the data.
demographics.head()
|   | customer_id | first_name | last_name | gender | past_3_years_bike_related_purchases | DOB | job_title | job_industry_category | wealth_segment | deceased_indicator | default | owns_car | tenure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Laraine | Medendorp | F | 93 | 1953-10-12 | Executive Secretary | Health | Mass Customer | N | "' | Yes | 11.0 |
| 1 | 2 | Eli | Bockman | Male | 81 | 1980-12-16 | Administrative Officer | Financial Services | Mass Customer | N | <script>alert('hi')</script> | Yes | 16.0 |
| 2 | 3 | Arlin | Dearle | Male | 61 | 1954-01-20 | Recruiting Manager | Property | Mass Customer | N | 2018-02-01 00:00:00 | Yes | 15.0 |
| 3 | 4 | Talbot | NaN | Male | 33 | 1961-10-03 | NaN | IT | Mass Customer | N | () { _; } >_[$($())] { touch /tmp/blns.shellsh... | No | 7.0 |
| 4 | 5 | Sheila-kathryn | Calton | Female | 56 | 1977-05-13 | Senior Editor | NaN | Affluent Customer | N | NIL | Yes | 8.0 |
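The head already surfaces two things worth flagging: the default column holds junk strings (a stray quote, a script tag, a shell payload), and gender mixes single-letter codes with full words (‘F’ alongside ‘Female’). A quick check follows, with a hedged normalization; the mapping is an assumption to confirm against the value counts and with the client:

```python
# Inspect the raw gender labels before any normalization.
print(demographics['gender'].value_counts())

# Hypothetical normalization: collapse single-letter codes into full labels.
# Kept in a separate Series so the audit below still reflects the raw file.
gender_map = {'F': 'Female', 'M': 'Male'}
gender_clean = demographics['gender'].replace(gender_map)
print(gender_clean.value_counts())
```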
Let’s get the DataFrame info.
demographics.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customer_id 4000 non-null int64
1 first_name 4000 non-null object
2 last_name 3875 non-null object
3 gender 4000 non-null object
4 past_3_years_bike_related_purchases 4000 non-null int64
5 DOB 3913 non-null datetime64[ns]
6 job_title 3494 non-null object
7 job_industry_category 3344 non-null object
8 wealth_segment 4000 non-null object
9 deceased_indicator 4000 non-null object
10 default 3698 non-null object
11 owns_car 4000 non-null object
12 tenure 3913 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(2), object(9)
memory usage: 406.4+ KB
We can observe that several columns contain missing values (last_name, DOB, job_title, job_industry_category, default, and tenure); we will quantify these below.
n_rows, n_columns = demographics.shape
print(f"the number of rows in this DF is {n_rows}")
print(f"the number of columns in this DF is {n_columns}")
the number of rows in this DF is 4000
the number of columns in this DF is 13
from pandas_profiling import ProfileReport  # since renamed to ydata-profiling
# Generate the profiling report
profile = ProfileReport(demographics)
# Save the report as an HTML file
profile.to_file('demographics_report.html')
# Open the HTML report in the browser
import webbrowser
webbrowser.open('demographics_report.html')
# Display the report inline in the notebook
profile.to_notebook_iframe()
num_duplicates = demographics.duplicated().sum()
num_missing = demographics.isna().sum()
print("Number of duplicate records:", num_duplicates)
print("Number of missing records:")
print("--------------------------")
print(num_missing)
print("--------------------------")
print("% of missing records:")
print(round(num_missing/demographics.shape[0]*100,2))
Number of duplicate records: 0
Number of missing records:
--------------------------
customer_id 0
first_name 0
last_name 125
gender 0
past_3_years_bike_related_purchases 0
DOB 87
job_title 506
job_industry_category 656
wealth_segment 0
deceased_indicator 0
default 302
owns_car 0
tenure 87
dtype: int64
--------------------------
% of missing records:
customer_id 0.00
first_name 0.00
last_name 3.12
gender 0.00
past_3_years_bike_related_purchases 0.00
DOB 2.17
job_title 12.65
job_industry_category 16.40
wealth_segment 0.00
deceased_indicator 0.00
default 7.55
owns_car 0.00
tenure 2.17
dtype: float64
demographics['DOB'].describe()
C:\Users\fawaz\AppData\Local\Temp\ipykernel_17128\2807883312.py:1: FutureWarning: Treating datetime data as categorical rather than numeric in `.describe` is deprecated and will be removed in a future version of pandas. Specify `datetime_is_numeric=True` to silence this warning and adopt the future behavior now.
demographics['DOB'].describe()
count 3913
unique 3448
top 1978-01-30 00:00:00
freq 7
first 1843-12-21 00:00:00
last 2002-03-11 00:00:00
Name: DOB, dtype: object
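To surface implausible birth dates like this 1843 entry programmatically, a simple filter works; the 1900 cutoff is an assumed business rule, not a client-provided one:

```python
# Flag customers whose recorded birth date predates an assumed plausibility
# cutoff; these rows should go back to the client for correction.
suspect_dob = demographics[demographics['DOB'] < pd.Timestamp('1900-01-01')]
print(suspect_dob[['customer_id', 'first_name', 'last_name', 'DOB']])
```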
From our observation of the demographics section of the dataset, we can note the following:

- Accuracy: Correct values

  The “default” column appears to be mostly junk, with unrecognizable formats. We would like to confirm this column’s status with the client; otherwise, it is better to drop it altogether.

- Completeness: Data fields with values

  There are 1763 missing values in this DataFrame, representing 3.4% of total cells. In particular:

  - last_name has 125 (3.1%) missing values
  - DOB has 87 (2.2%) missing values
  - job_title has 506 (12.7%) missing values
  - job_industry_category has 656 (16.4%) missing values
  - default has 302 (7.5%) missing values
  - tenure has 87 (2.2%) missing values

- Validity: Data containing allowable values

  One client’s date of birth is recorded as 1843-12-21. It would be worth obtaining the correct date, or dropping this row if it does not materially affect the analysis.

- Uniqueness:

  - There are no duplicated records in this DataFrame.
  - The customer_id column has unique values (confirmed below), which is a good indication of data quality.
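The uniqueness of customer_id can be confirmed in one line:

```python
# True only if every customer_id occurs exactly once.
print(demographics['customer_id'].is_unique)
```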
Missing values by column - Demographics Data Frame
2- Customer Addresses DataFrame
Let’s first generate a data dictionary to clarify the column names and data types.
column_names = list(addresses.columns)
data_types = addresses.dtypes.tolist()
description = [""] * len(column_names) # Add descriptions for each column if available
# Create the data dictionary DataFrame
data_dictionary = pd.DataFrame({
'Column Name': column_names,
'Data Type': data_types,
'Description': description
})
# Display the data dictionary
print(data_dictionary)
Column Name Data Type Description
0 customer_id int64
1 address object
2 postcode int64
3 state object
4 country object
5 property_valuation int64
column_names = list(addresses.columns)
# Display the column names
print("Column names:", column_names)
Column names: ['customer_id', 'address', 'postcode', 'state', 'country', 'property_valuation']
Let’s display the top rows of the data frame so we can have a quick visual assessment of the data.
addresses.head()
|   | customer_id | address | postcode | state | country | property_valuation |
|---|---|---|---|---|---|---|
| 0 | 1 | 060 Morning Avenue | 2016 | New South Wales | Australia | 10 |
| 1 | 2 | 6 Meadow Vale Court | 2153 | New South Wales | Australia | 10 |
| 2 | 4 | 0 Holy Cross Court | 4211 | QLD | Australia | 9 |
| 3 | 5 | 17979 Del Mar Point | 2448 | New South Wales | Australia | 4 |
| 4 | 6 | 9 Oakridge Court | 3216 | VIC | Australia | 9 |
Let’s get the DataFrame info.
addresses.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3999 entries, 0 to 3998
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customer_id 3999 non-null int64
1 address 3999 non-null object
2 postcode 3999 non-null int64
3 state 3999 non-null object
4 country 3999 non-null object
5 property_valuation 3999 non-null int64
dtypes: int64(3), object(3)
memory usage: 187.6+ KB
We can observe that there are no missing entries here; note, however, that this DataFrame has 3,999 records against the 4,000 customers in the demographics data, so at least one customer has no recorded address (customer_id 3 is already absent from the rows above).
n_rows, n_columns = addresses.shape
print(f"the number of rows in this DF is {n_rows}")
print(f"the number of columns in this DF is {n_columns}")
the number of rows in this DF is 3999
the number of columns in this DF is 6
from pandas_profiling import ProfileReport
# Generate the profiling report
profile = ProfileReport(addresses)
# Save the report as an HTML file
profile.to_file('addresses_report.html')
# Open the HTML report in the browser
webbrowser.open('addresses_report.html')
num_duplicates = addresses.duplicated().sum()
num_missing = addresses.isna().sum()
print("Number of duplicate records:", num_duplicates)
print("Number of missing records:")
print("--------------------------")
print(num_missing)
print("--------------------------")
print("% of missing records:")
print(round(num_missing/addresses.shape[0]*100,2))
Number of duplicate records: 0
Number of missing records:
--------------------------
customer_id 0
address 0
postcode 0
state 0
country 0
property_valuation 0
dtype: int64
--------------------------
% of missing records:
customer_id 0.0
address 0.0
postcode 0.0
state 0.0
country 0.0
property_valuation 0.0
dtype: float64
From our observation of the addresses section of the dataset, we can note the following:

- Completeness: There are no missing values in this DataFrame.

- Consistency: The state column mixes full names with abbreviations (“New South Wales” alongside “QLD” and “VIC”) and should be standardized to one format (checked below).

- Relevancy: The country is “Australia” for every client, so the column carries no information and can be dropped for the rest of the analysis.

- Uniqueness: There are no duplicated records in this DataFrame.
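Both the state and country observations can be verified directly. The replacement map at the end is a sketch: only “New South Wales” appears spelled out in the rows above, so the other entries are assumptions to confirm with the client.

```python
# Confirm that country is constant and inspect the mixed state labels.
print(addresses['country'].value_counts())
print(addresses['state'].value_counts())

# Hypothetical standardization: map full state names to abbreviations.
# Kept in a separate Series so the raw frame is preserved for the audit.
state_map = {'New South Wales': 'NSW', 'Victoria': 'VIC', 'Queensland': 'QLD'}
state_clean = addresses['state'].replace(state_map)
print(state_clean.value_counts())
```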
3- Transactions DataFrame
Let’s first generate a data dictionary to clarify the column names and data types.
column_names = list(transactions.columns)
data_types = transactions.dtypes.tolist()
description = [""] * len(column_names) # Add descriptions for each column if available
# Create the data dictionary DataFrame
data_dictionary = pd.DataFrame({
'Column Name': column_names,
'Data Type': data_types,
'Description': description
})
# Display the data dictionary
print(data_dictionary)
Column Name Data Type Description
0 1 int64
1 2 int64
2 2950 int64
3 2017-02-25 00:00:00 datetime64[ns]
4 False float64
5 Approved object
6 Solex object
7 Standard object
8 medium object
9 medium.1 object
10 71.49 float64
11 53.62 float64
12 41245 float64
column_names = list(transactions.columns)
# Display the column names
print("Column names:", column_names)
Column names: [1, 2, 2950, datetime.datetime(2017, 2, 25, 0, 0), False, 'Approved', 'Solex', 'Standard', 'medium', 'medium.1', 71.49, 53.62, 41245]
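Something is off here: the “column names” are actually data values (two IDs, a date, a price). The first transaction record appears to have been consumed as the header when the sheet was read with header=1. A hedged fix, assuming the Transactions sheet’s header actually sits on its first row, is to re-read it into a separate frame (transactions_fixed), leaving transactions untouched so the audit below still reflects the file as originally loaded:

```python
# Re-read the Transactions sheet assuming its header is the first row; this
# recovers the real column names plus the record that was mistakenly used
# as a header. Confirm the sheet layout with the client before relying on it.
transactions_fixed = pd.read_excel(
    'KPMG_VI_New_raw_data_update_final.xlsx',
    sheet_name='Transactions',
    header=0,
)
print(transactions_fixed.columns.tolist())
```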
transactions.head()
|   | 1 | 2 | 2950 | 2017-02-25 00:00:00 | False | Approved | Solex | Standard | medium | medium.1 | 71.49 | 53.62 | 41245 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 3 | 3120 | 2017-05-21 | 1.0 | Approved | Trek Bicycles | Standard | medium | large | 2091.47 | 388.92 | 41701.0 |
| 1 | 3 | 37 | 402 | 2017-10-16 | 0.0 | Approved | OHM Cycles | Standard | low | medium | 1793.43 | 248.82 | 36361.0 |
| 2 | 4 | 88 | 3135 | 2017-08-31 | 0.0 | Approved | Norco Bicycles | Standard | medium | medium | 1198.46 | 381.10 | 36145.0 |
| 3 | 5 | 78 | 787 | 2017-10-01 | 1.0 | Approved | Giant Bicycles | Standard | medium | large | 1765.30 | 709.48 | 42226.0 |
| 4 | 6 | 25 | 2339 | 2017-03-08 | 1.0 | Approved | Giant Bicycles | Road | medium | medium | 1538.99 | 829.65 | 39031.0 |
transactions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 1 19999 non-null int64
1 2 19999 non-null int64
2 2950 19999 non-null int64
3 2017-02-25 00:00:00 19999 non-null datetime64[ns]
4 False 19639 non-null float64
5 Approved 19999 non-null object
6 Solex 19802 non-null object
7 Standard 19802 non-null object
8 medium 19802 non-null object
9 medium.1 19802 non-null object
10 71.49 19999 non-null float64
11 53.62 19802 non-null float64
12 41245 19802 non-null float64
dtypes: datetime64[ns](1), float64(4), int64(3), object(5)
memory usage: 2.0+ MB
n_rows, n_columns = transactions.shape
print(f"the number of rows in this DF is {n_rows}")
print(f"the number of columns in this DF is {n_columns}")
the number of rows in this DF is 19999
the number of columns in this DF is 13
from pandas_profiling import ProfileReport
# Generate the profiling report
profile = ProfileReport(transactions)
# Save the report as an HTML file
profile.to_file('transactions_report.html')
# Open the HTML report in the browser
webbrowser.open('transactions_report.html')
num_duplicates = transactions.duplicated().sum()
num_missing = transactions.isna().sum()
print("Number of duplicate records:", num_duplicates)
print("Number of missing records:")
print("--------------------------")
print(num_missing)
print("--------------------------")
print("% of missing records:")
print(round(num_missing/transactions.shape[0]*100, 2))
Number of duplicate records: 0
Number of missing records:
--------------------------
1 0
2 0
2950 0
2017-02-25 00:00:00 0
False 360
Approved 0
Solex 197
Standard 197
medium 197
medium.1 197
71.49 0
53.62 197
41245 197
dtype: int64
--------------------------
% of missing records:
1 0.00
2 0.00
2950 0.00
2017-02-25 00:00:00 0.00
False 1.80
Approved 0.00
Solex 0.99
Standard 0.99
medium 0.99
medium.1 0.99
71.49 0.00
53.62 0.99
41245 0.99
dtype: float64
From our observation of the transactions section of the dataset, we can note the following:

- Completeness: Data fields with values

  There are 1542 missing values in this DataFrame, representing 0.6% of total cells. In particular (using the intended column names rather than the misread headers):

  - online_order has 360 (1.8%) missing values
  - product_id has 1378 (6.9%) zeros
  - brand, product_line, product_class, product_size, standard_cost, and product_first_sold_date all have the same number of missing values (197), about 1% of each column. Upon visual examination, these appear to be the same rows; it would be worth investigating why they occurred and retrieving the data from the client if possible. A quick way to isolate them is sketched after this list.

- Uniqueness:

  There are no duplicated records in this DataFrame.
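To hand the client a concrete list, the jointly missing rows can be isolated with the corrected frame from the earlier sketch (‘brand’ is the assumed column name from the corrected header):

```python
# Select the rows where the product attributes are jointly missing. If the
# pattern holds, the same 197 rows should be missing across all six columns.
jointly_missing = transactions_fixed[transactions_fixed['brand'].isna()]
print(len(jointly_missing))
print(jointly_missing.head())
```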
Missing values by column - Transaction Data Frame
4- Findings summary
Addresses data

- Completeness: There are no missing values in this DataFrame, though it holds 3,999 records against 4,000 customers in the demographics data.

- Consistency: The state column mixes full names with abbreviations (“New South Wales” alongside “QLD” and “VIC”) and should be standardized.

- Relevancy: The country is “Australia” for every client, so the column can be dropped for the rest of the analysis.

- Uniqueness: There are no duplicated records in this DataFrame.
Demographics data

- Accuracy: The “default” column appears to be mostly junk, with unrecognizable formats. We would like to confirm this column’s status with the client; otherwise, it is better to drop it altogether.

- Completeness: There are 1763 missing values in this DataFrame, representing 3.4% of total cells. In particular:

  - last_name has 125 (3.1%) missing values
  - DOB has 87 (2.2%) missing values
  - job_title has 506 (12.7%) missing values
  - job_industry_category has 656 (16.4%) missing values
  - default has 302 (7.5%) missing values
  - tenure has 87 (2.2%) missing values

- Validity: One client’s date of birth is recorded as 1843-12-21. It would be worth obtaining the correct date, or dropping this row if it does not materially affect the analysis.

- Uniqueness:

  - There are no duplicated records in this DataFrame.
  - The customer_id column has unique values, which is a good indication of data quality.
Transactions data

- Completeness: There are 1542 missing values in this DataFrame, representing 0.6% of total cells. In particular:

  - online_order has 360 (1.8%) missing values
  - product_id has 1378 (6.9%) zeros
  - brand, product_line, product_class, product_size, standard_cost, and product_first_sold_date all share the same 197 missing rows (about 1% of each column); these should be investigated and retrieved from the client if possible.

- Uniqueness: There are no duplicated records in this DataFrame.
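The manager’s voicemail also asked us to consider the join keys between the tables. A quick reconciliation of customer_id across the demographics and addresses tables (a sketch) shows how the 4,000 customers line up with the 3,999 address records:

```python
# Reconcile customer_id across the two tables; any asymmetry surfaces as
# dropped or orphaned rows when the tables are joined.
demo_ids = set(demographics['customer_id'])
addr_ids = set(addresses['customer_id'])
print("Customers without an address:", sorted(demo_ids - addr_ids))
print("Addresses without a matching customer:", sorted(addr_ids - demo_ids))
```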
5- The deliverable: Draft email to client
As a conclusion to our data analysis journey with the Sprocket Central Pty Ltd datasets, we identified several data quality issues that need to be addressed before proceeding with in-depth analysis and decision making. Following our findings, we’ve composed a comprehensive email to the client, detailing these issues and providing actionable recommendations to improve the overall quality of their data.
This step not only ensures the reliability of our subsequent analysis but also demonstrates our commitment to providing a thorough, high-quality service that supports the client’s goals. The draft email, presented below, embodies our findings and recommendations:
Subject: Data Quality Assessment Report and Recommendations for Sprocket Central Pty Ltd
Dear Team,
I hope this message finds you well.
We have conducted a preliminary analysis on the datasets you’ve provided. This analysis allows us to assess the quality of the data and make recommendations for data cleaning to enhance its usability for your marketing strategy. Below is a brief summary of our findings based on the Data Quality Framework guide we use:
1. Transactions Data

- Completeness: There were instances of missing values and zeroes, particularly in the ‘online_order’ and ‘product_id’ columns. In addition, six columns (brand, product_line, product_class, product_size, standard_cost, and product_first_sold_date) share an identical number of missing values (197); these appear to be related and warrant further investigation.

- Uniqueness: This dataset does not contain any duplicate records, which is a positive aspect in terms of data quality.

2. Demographics Data

- Accuracy: The ‘default’ column contains unrecognizable formats and could be incorrect. We suggest confirming the status of this column and considering its removal if it is not required.

- Completeness: A number of missing values were identified, especially in the ‘last_name’, ‘DOB’, ‘job_title’, ‘job_industry_category’, ‘default’, and ‘tenure’ columns.

- Validity: One client’s date of birth is listed as 1843-12-21. This is likely an error and should be addressed.

- Uniqueness: No duplicates were found, and the ‘customer_id’ column contains unique values, which is beneficial for the analysis.

3. Addresses Data

- Completeness: No missing values were found in this dataset.

- Consistency: The ‘state’ column mixes full state names with abbreviations (e.g. ‘New South Wales’ vs ‘QLD’) and should be standardized.

- Relevancy: The ‘country’ column holds a single value (‘Australia’) and adds no information to the analysis.
Given these findings, we suggest taking the following steps to mitigate current data quality concerns:
1. Investigate the cause of the jointly missing values in the transactions data and try to retrieve the missing information if possible.
2. Verify the status of the ‘default’ column in the demographics data. If it is not required or not providing accurate data, consider removing it.
3. Review the DOB data for validity. Correct any errors, or remove data points that are incorrect and will not significantly impact the analysis.
4. Standardize the ‘state’ values in the addresses data to a single format.
5. Fill missing values in the demographic data. For instance, consider reaching out to customers to fill in their job title and industry if these data points are important for your marketing strategy.

We hope these suggestions will guide you in making your data more reliable and suitable for analysis. We are ready to assist further if needed.
Best Regards,
Fawaz Alsaif
By following these recommendations, Sprocket Central Pty Ltd can significantly enhance its data quality, paving the way for effective, data-driven marketing strategies.