Assessing data quality (KPMG Virtual Internship - Part 1)
Sprocket Central Pty Ltd, a medium-sized bikes and cycling accessories organisation, has approached Tony Smith (Partner) in KPMG’s Lighthouse & Innovation Team. Sprocket Central Pty Ltd is keen to learn more about KPMG’s expertise in its Analytics, Information & Modelling team.
Smith discusses KPMG’s expertise in this space. In particular, he explains how the team can effectively analyse the datasets to help Sprocket Central Pty Ltd grow its business.
Primarily, Sprocket Central Pty Ltd needs help with its customer and transactions data. The organisation has a large dataset relating to its customers, but its team is unsure how to analyse it effectively to help optimise its marketing strategy.
However, to support the analysis, you speak to the Associate Director for some ideas, and she advises: “The importance of optimising the quality of customer datasets cannot be overstated. The better the quality of the dataset, the better chance you will be able to use it to drive company growth.”
The client provided KPMG with 3 datasets:
- Customer Demographic
- Customer Addresses
- Transactions data for the past 3 months
You decide to start the preliminary data exploration and identify ways to improve the quality of Sprocket Central Pty Ltd’s data.
I- The task at hand
You arrive at your desk after the initial client meeting. You have a voicemail on your phone which contains the following instructions.
“Hi there – Welcome again to the team! The client has asked our team to assess the quality of their data, as well as make recommendations on ways to clean the underlying data and mitigate these issues. Can you please take a look at the datasets we’ve received and draft an email to them identifying the data quality issues and how these may impact our analysis going forward?
I will send through an example of a typical data quality framework that can be used as a guide. Remember to consider the join keys between the tables too. Thanks again for your help.”
Subsequently, you receive the email below:
Hi there,
As per voicemail, please find the 3 datasets attached from Sprocket Central Pty Ltd:
- Customer Demographic
- Customer Addresses
- Transactions data for the past three months

Can you please review the data quality to ensure that it is ready for our analysis in phase two? Remember to take note of any assumptions or issues we need to go back to the client on, as well as recommendations going forward to mitigate current data quality concerns.
I’ve also attached a data quality framework as a guideline. Let me know if you have any questions.
Thanks for your help.
Kind Regards
Your Manager
Hence the task at hand is to draft an email to the client identifying the data quality issues and strategies to mitigate these issues. Refer to ‘Data Quality Framework Table’ and resources below for criteria and dimensions which you should consider.
You may use programs like Excel, Google Sheets, Tableau, or Power BI to start. Feel free to use Python, R, MATLAB, or any other data analytics tools you know.
II- Import the necessary packages
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

# Load the four sheets from the client workbook. The header offsets reflect
# each sheet's layout; the Transactions offset is revisited in section 3.
demographics = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx', sheet_name = "CustomerDemographic", header=0)
addresses = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx', sheet_name = "CustomerAddress",header = 1)
transactions = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx', sheet_name = "Transactions", header = 1)
cust_list = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx', sheet_name = "NewCustomerList", header = 1)
C:\Users\fawaz\AppData\Local\Temp\ipykernel_17128\2520818905.py:1: FutureWarning: Inferring datetime64[ns] from data containing strings is deprecated and will be removed in a future version. To retain the old behavior explicitly pass Series(data, dtype=datetime64[ns])
demographics = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx', sheet_name = "CustomerDemographic", header=0)
C:\Users\fawaz\AppData\Local\Temp\ipykernel_17128\2520818905.py:4: FutureWarning: Inferring datetime64[ns] from data containing strings is deprecated and will be removed in a future version. To retain the old behavior explicitly pass Series(data, dtype=datetime64[ns])
cust_list = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx', sheet_name = "NewCustomerList", header = 1)
Now that all the data is loaded into pandas DataFrames, we will examine each one through the lens of data quality to uncover any issues. These issues may fall under the following dimensions (a small helper that bundles the recurring checks follows the list):
- Accuracy: Correct values
- Completeness: Data fields with values
- Consistency: Values free from contradictions
- Currency: Values up to date
- Relevancy: Data items with value metadata
- Validity: Data containing allowable Values
- Uniqueness: Records free from duplication
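Since the same completeness and uniqueness checks recur for each table, they can be wrapped in a small helper (a sketch that simply bundles the pandas calls used throughout this notebook):

```python
def quality_summary(df):
    """Print the duplicate-row count and per-column missing counts/percentages."""
    print("Number of duplicate records:", df.duplicated().sum())
    missing = df.isna().sum()
    print(pd.DataFrame({
        'missing': missing,
        'missing_%': (missing / len(df) * 100).round(2),
    }))
```

Each section below spells the checks out explicitly, but `quality_summary(demographics)` would produce the same figures.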
1- Demographics DataFrame
Let’s first generate a data dictionary to clarify the column names and data types.
column_names = list(demographics.columns)
data_types = demographics.dtypes.tolist()
description = [""] * len(column_names) # Add descriptions for each column if available
# Create the data dictionary DataFrame
data_dictionary = pd.DataFrame({
'Column Name': column_names,
'Data Type': data_types,
'Description': description
})
# Display the data dictionary
print(data_dictionary)
Column Name Data Type Description
0 customer_id int64
1 first_name object
2 last_name object
3 gender object
4 past_3_years_bike_related_purchases int64
5 DOB datetime64[ns]
6 job_title object
7 job_industry_category object
8 wealth_segment object
9 deceased_indicator object
10 default object
11 owns_car object
12 tenure float64
column_names = list(demographics.columns)
# Display the column names
print("Column names:", column_names)
Column names: ['customer_id', 'first_name', 'last_name', 'gender', 'past_3_years_bike_related_purchases', 'DOB', 'job_title', 'job_industry_category', 'wealth_segment', 'deceased_indicator', 'default', 'owns_car', 'tenure']
Let’s display the top rows of the data frame so we can have a quick visual assessment of the data.
demographics.head()
|   | customer_id | first_name | last_name | gender | past_3_years_bike_related_purchases | DOB | job_title | job_industry_category | wealth_segment | deceased_indicator | default | owns_car | tenure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Laraine | Medendorp | F | 93 | 1953-10-12 | Executive Secretary | Health | Mass Customer | N | "' | Yes | 11.0 |
| 1 | 2 | Eli | Bockman | Male | 81 | 1980-12-16 | Administrative Officer | Financial Services | Mass Customer | N | <script>alert('hi')</script> | Yes | 16.0 |
| 2 | 3 | Arlin | Dearle | Male | 61 | 1954-01-20 | Recruiting Manager | Property | Mass Customer | N | 2018-02-01 00:00:00 | Yes | 15.0 |
| 3 | 4 | Talbot | NaN | Male | 33 | 1961-10-03 | NaN | IT | Mass Customer | N | () { _; } >_[$($())] { touch /tmp/blns.shellsh... | No | 7.0 |
| 4 | 5 | Sheila-kathryn | Calton | Female | 56 | 1977-05-13 | Senior Editor | NaN | Affluent Customer | N | NIL | Yes | 8.0 |
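The head already surfaces two things worth flagging: the default column holds junk strings (a stray quote, a script tag, a shell payload), and gender mixes single-letter codes with full words (‘F’ alongside ‘Female’). A quick check follows, with a hedged normalization; the mapping is an assumption to confirm against the value counts and with the client:

```python
# Inspect the raw gender labels before any normalization.
print(demographics['gender'].value_counts())

# Hypothetical normalization: collapse single-letter codes into full labels.
# Kept in a separate Series so the audit below still reflects the raw file.
gender_map = {'F': 'Female', 'M': 'Male'}
gender_clean = demographics['gender'].replace(gender_map)
print(gender_clean.value_counts())
```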
Let’s get the DataFrame info.
demographics.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customer_id 4000 non-null int64
1 first_name 4000 non-null object
2 last_name 3875 non-null object
3 gender 4000 non-null object
4 past_3_years_bike_related_purchases 4000 non-null int64
5 DOB 3913 non-null datetime64[ns]
6 job_title 3494 non-null object
7 job_industry_category 3344 non-null object
8 wealth_segment 4000 non-null object
9 deceased_indicator 4000 non-null object
10 default 3698 non-null object
11 owns_car 4000 non-null object
12 tenure 3913 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(2), object(9)
memory usage: 406.4+ KB
We can observe that several columns contain missing values (last_name, DOB, job_title, job_industry_category, default, and tenure); we will quantify these below.
n_rows, n_columns = demographics.shape
print(f"the number of rows in this DF is {n_rows}")
print(f"the number of columns in this DF is {n_columns}")
the number of rows in this DF is 4000
the number of columns in this DF is 13
from pandas_profiling import ProfileReport  # since renamed to ydata-profiling
# Generate the profiling report
profile = ProfileReport(demographics)
# Save the report as an HTML file
profile.to_file('demographics_report.html')
# Open the HTML report in the browser
import webbrowser
webbrowser.open('demographics_report.html')
# Display the report inline in the notebook
profile.to_notebook_iframe()
num_duplicates = demographics.duplicated().sum()
num_missing = demographics.isna().sum()
print("Number of duplicate records:", num_duplicates)
print("Number of missing records:")
print("--------------------------")
print(num_missing)
print("--------------------------")
print("% of missing records:")
print(round(num_missing/demographics.shape[0]*100,2))
Number of duplicate records: 0
Number of missing records:
--------------------------
customer_id 0
first_name 0
last_name 125
gender 0
past_3_years_bike_related_purchases 0
DOB 87
job_title 506
job_industry_category 656
wealth_segment 0
deceased_indicator 0
default 302
owns_car 0
tenure 87
dtype: int64
--------------------------
% of missing records:
customer_id 0.00
first_name 0.00
last_name 3.12
gender 0.00
past_3_years_bike_related_purchases 0.00
DOB 2.17
job_title 12.65
job_industry_category 16.40
wealth_segment 0.00
deceased_indicator 0.00
default 7.55
owns_car 0.00
tenure 2.17
dtype: float64
demographics['DOB'].describe()
C:\Users\fawaz\AppData\Local\Temp\ipykernel_17128\2807883312.py:1: FutureWarning: Treating datetime data as categorical rather than numeric in `.describe` is deprecated and will be removed in a future version of pandas. Specify `datetime_is_numeric=True` to silence this warning and adopt the future behavior now.
demographics['DOB'].describe()
count 3913
unique 3448
top 1978-01-30 00:00:00
freq 7
first 1843-12-21 00:00:00
last 2002-03-11 00:00:00
Name: DOB, dtype: object
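To surface implausible birth dates like this 1843 entry programmatically, a simple filter works; the 1900 cutoff is an assumed business rule, not a client-provided one:

```python
# Flag customers whose recorded birth date predates an assumed plausibility
# cutoff; these rows should go back to the client for correction.
suspect_dob = demographics[demographics['DOB'] < pd.Timestamp('1900-01-01')]
print(suspect_dob[['customer_id', 'first_name', 'last_name', 'DOB']])
```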
From our observation of the demographics section of the dataset, we can note the following:

- Accuracy: Correct values

  The “default” column appears to be mostly junk, with unrecognizable formats. We would like to confirm this column’s status with the client; otherwise, it is better to drop it altogether.

- Completeness: Data fields with values

  There are 1763 missing values in this DataFrame, representing 3.4% of total cells. In particular:

  - last_name has 125 (3.1%) missing values
  - DOB has 87 (2.2%) missing values
  - job_title has 506 (12.7%) missing values
  - job_industry_category has 656 (16.4%) missing values
  - default has 302 (7.5%) missing values
  - tenure has 87 (2.2%) missing values

- Validity: Data containing allowable values

  One client’s date of birth is recorded as 1843-12-21. It would be worth obtaining the correct date, or dropping this row if it does not materially affect the analysis.

- Uniqueness:

  - There are no duplicated records in this DataFrame.
  - The customer_id column has unique values (confirmed below), which is a good indication of data quality.
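The uniqueness of customer_id can be confirmed in one line:

```python
# True only if every customer_id occurs exactly once.
print(demographics['customer_id'].is_unique)
```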
Missing values by column - Demographics Data Frame
2- Customer Addresses DataFrame
Let’s first generate a data dictionary to clarify the column names and data types.
column_names = list(addresses.columns)
data_types = addresses.dtypes.tolist()
description = [""] * len(column_names) # Add descriptions for each column if available
# Create the data dictionary DataFrame
data_dictionary = pd.DataFrame({
'Column Name': column_names,
'Data Type': data_types,
'Description': description
})
# Display the data dictionary
print(data_dictionary)
Column Name Data Type Description
0 customer_id int64
1 address object
2 postcode int64
3 state object
4 country object
5 property_valuation int64
column_names = list(addresses.columns)
# Display the column names
print("Column names:", column_names)
Column names: ['customer_id', 'address', 'postcode', 'state', 'country', 'property_valuation']
Let’s display the top rows of the data frame so we can have a quick visual assessment of the data.
addresses.head()
|   | customer_id | address | postcode | state | country | property_valuation |
|---|---|---|---|---|---|---|
| 0 | 1 | 060 Morning Avenue | 2016 | New South Wales | Australia | 10 |
| 1 | 2 | 6 Meadow Vale Court | 2153 | New South Wales | Australia | 10 |
| 2 | 4 | 0 Holy Cross Court | 4211 | QLD | Australia | 9 |
| 3 | 5 | 17979 Del Mar Point | 2448 | New South Wales | Australia | 4 |
| 4 | 6 | 9 Oakridge Court | 3216 | VIC | Australia | 9 |
Let’s get the DataFrame info.
addresses.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3999 entries, 0 to 3998
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customer_id 3999 non-null int64
1 address 3999 non-null object
2 postcode 3999 non-null int64
3 state 3999 non-null object
4 country 3999 non-null object
5 property_valuation 3999 non-null int64
dtypes: int64(3), object(3)
memory usage: 187.6+ KB
We can observe that there are no missing entries here; note, however, that this DataFrame has 3,999 records against the 4,000 customers in the demographics data, so at least one customer has no recorded address (customer_id 3 is already absent from the rows above).
n_rows, n_columns = addresses.shape
print(f"the number of rows in this DF is {n_rows}")
print(f"the number of columns in this DF is {n_columns}")
the number of rows in this DF is 3999
the number of columns in this DF is 6
from pandas_profiling import ProfileReport
# Generate the profiling report
profile = ProfileReport(addresses)
# Save the report as an HTML file
profile.to_file('addresses_report.html')
# Open the HTML report in the browser
webbrowser.open('addresses_report.html')
num_duplicates = addresses.duplicated().sum()
num_missing = addresses.isna().sum()
print("Number of duplicate records:", num_duplicates)
print("Number of missing records:")
print("--------------------------")
print(num_missing)
print("--------------------------")
print("% of missing records:")
print(round(num_missing/addresses.shape[0]*100,2))
Number of duplicate records: 0
Number of missing records:
--------------------------
customer_id 0
address 0
postcode 0
state 0
country 0
property_valuation 0
dtype: int64
--------------------------
% of missing records:
customer_id 0.0
address 0.0
postcode 0.0
state 0.0
country 0.0
property_valuation 0.0
dtype: float64
From our observation of the addresses section of the dataset, we can note the following:

- Completeness: There are no missing values in this DataFrame.

- Consistency: The state column mixes full names with abbreviations (“New South Wales” alongside “QLD” and “VIC”) and should be standardized to one format (checked below).

- Relevancy: The country is “Australia” for every client, so the column carries no information and can be dropped for the rest of the analysis.

- Uniqueness: There are no duplicated records in this DataFrame.
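Both the state and country observations can be verified directly. The replacement map at the end is a sketch: only “New South Wales” appears spelled out in the rows above, so the other entries are assumptions to confirm with the client.

```python
# Confirm that country is constant and inspect the mixed state labels.
print(addresses['country'].value_counts())
print(addresses['state'].value_counts())

# Hypothetical standardization: map full state names to abbreviations.
# Kept in a separate Series so the raw frame is preserved for the audit.
state_map = {'New South Wales': 'NSW', 'Victoria': 'VIC', 'Queensland': 'QLD'}
state_clean = addresses['state'].replace(state_map)
print(state_clean.value_counts())
```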
3- Transactions DataFrame
Let’s first generate a data dictionary to clarify the column names and data types.
column_names = list(transactions.columns)
data_types = transactions.dtypes.tolist()
description = [""] * len(column_names) # Add descriptions for each column if available
# Create the data dictionary DataFrame
data_dictionary = pd.DataFrame({
'Column Name': column_names,
'Data Type': data_types,
'Description': description
})
# Display the data dictionary
print(data_dictionary)
Column Name Data Type Description
0 1 int64
1 2 int64
2 2950 int64
3 2017-02-25 00:00:00 datetime64[ns]
4 False float64
5 Approved object
6 Solex object
7 Standard object
8 medium object
9 medium.1 object
10 71.49 float64
11 53.62 float64
12 41245 float64
column_names = list(transactions.columns)
# Display the column names
print("Column names:", column_names)
Column names: [1, 2, 2950, datetime.datetime(2017, 2, 25, 0, 0), False, 'Approved', 'Solex', 'Standard', 'medium', 'medium.1', 71.49, 53.62, 41245]
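Something is off here: the “column names” are actually data values (two IDs, a date, a price). The first transaction record appears to have been consumed as the header when the sheet was read with header=1. A hedged fix, assuming the Transactions sheet’s header actually sits on its first row, is to re-read it into a separate frame (transactions_fixed), leaving transactions untouched so the audit below still reflects the file as originally loaded:

```python
# Re-read the Transactions sheet assuming its header is the first row; this
# recovers the real column names plus the record that was mistakenly used
# as a header. Confirm the sheet layout with the client before relying on it.
transactions_fixed = pd.read_excel(
    'KPMG_VI_New_raw_data_update_final.xlsx',
    sheet_name='Transactions',
    header=0,
)
print(transactions_fixed.columns.tolist())
```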
transactions.head()
|   | 1 | 2 | 2950 | 2017-02-25 00:00:00 | False | Approved | Solex | Standard | medium | medium.1 | 71.49 | 53.62 | 41245 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 3 | 3120 | 2017-05-21 | 1.0 | Approved | Trek Bicycles | Standard | medium | large | 2091.47 | 388.92 | 41701.0 |
| 1 | 3 | 37 | 402 | 2017-10-16 | 0.0 | Approved | OHM Cycles | Standard | low | medium | 1793.43 | 248.82 | 36361.0 |
| 2 | 4 | 88 | 3135 | 2017-08-31 | 0.0 | Approved | Norco Bicycles | Standard | medium | medium | 1198.46 | 381.10 | 36145.0 |
| 3 | 5 | 78 | 787 | 2017-10-01 | 1.0 | Approved | Giant Bicycles | Standard | medium | large | 1765.30 | 709.48 | 42226.0 |
| 4 | 6 | 25 | 2339 | 2017-03-08 | 1.0 | Approved | Giant Bicycles | Road | medium | medium | 1538.99 | 829.65 | 39031.0 |
transactions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 1 19999 non-null int64
1 2 19999 non-null int64
2 2950 19999 non-null int64
3 2017-02-25 00:00:00 19999 non-null datetime64[ns]
4 False 19639 non-null float64
5 Approved 19999 non-null object
6 Solex 19802 non-null object
7 Standard 19802 non-null object
8 medium 19802 non-null object
9 medium.1 19802 non-null object
10 71.49 19999 non-null float64
11 53.62 19802 non-null float64
12 41245 19802 non-null float64
dtypes: datetime64[ns](1), float64(4), int64(3), object(5)
memory usage: 2.0+ MB
n_rows, n_columns = transactions.shape
print(f"the number of rows in this DF is {n_rows}")
print(f"the number of columns in this DF is {n_columns}")
the number of rows in this DF is 19999
the number of columns in this DF is 13
from pandas_profiling import ProfileReport
# Generate the profiling report
profile = ProfileReport(transactions)
# Save the report as an HTML file
profile.to_file('transactions_report.html')
# Open the HTML report in the browser
webbrowser.open('transactions_report.html')
num_duplicates = transactions.duplicated().sum()
num_missing = transactions.isna().sum()
print("Number of duplicate records:", num_duplicates)
print("Number of missing records:")
print("--------------------------")
print(num_missing)
print("--------------------------")
print("% of missing records:")
print(round(num_missing/transactions.shape[0]*100, 2))
Number of duplicate records: 0
Number of missing records:
--------------------------
1 0
2 0
2950 0
2017-02-25 00:00:00 0
False 360
Approved 0
Solex 197
Standard 197
medium 197
medium.1 197
71.49 0
53.62 197
41245 197
dtype: int64
--------------------------
% of missing records:
1 0.00
2 0.00
2950 0.00
2017-02-25 00:00:00 0.00
False 1.80
Approved 0.00
Solex 0.99
Standard 0.99
medium 0.99
medium.1 0.99
71.49 0.00
53.62 0.99
41245 0.99
dtype: float64
From our observation of the transactions section of the dataset, we can note the following:

- Completeness: Data fields with values

  There are 1542 missing values in this DataFrame, representing 0.6% of total cells. In particular (using the intended column names rather than the misread headers):

  - online_order has 360 (1.8%) missing values
  - product_id has 1378 (6.9%) zeros
  - brand, product_line, product_class, product_size, standard_cost, and product_first_sold_date all have the same number of missing values (197), about 1% of each column. Upon visual examination, these appear to be the same rows; it would be worth investigating why they occurred and retrieving the data from the client if possible. A quick way to isolate them is sketched after this list.

- Uniqueness:

  There are no duplicated records in this DataFrame.
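To hand the client a concrete list, the jointly missing rows can be isolated with the corrected frame from the earlier sketch (‘brand’ is the assumed column name from the corrected header):

```python
# Select the rows where the product attributes are jointly missing. If the
# pattern holds, the same 197 rows should be missing across all six columns.
jointly_missing = transactions_fixed[transactions_fixed['brand'].isna()]
print(len(jointly_missing))
print(jointly_missing.head())
```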
Missing values by column - Transaction Data Frame
4- Findings summary
Addresses data

- Completeness: There are no missing values in this DataFrame, though it holds 3,999 records against 4,000 customers in the demographics data.

- Consistency: The state column mixes full names with abbreviations (“New South Wales” alongside “QLD” and “VIC”) and should be standardized.

- Relevancy: The country is “Australia” for every client, so the column can be dropped for the rest of the analysis.

- Uniqueness: There are no duplicated records in this DataFrame.
Demographics data

- Accuracy: The “default” column appears to be mostly junk, with unrecognizable formats. We would like to confirm this column’s status with the client; otherwise, it is better to drop it altogether.

- Completeness: There are 1763 missing values in this DataFrame, representing 3.4% of total cells. In particular:

  - last_name has 125 (3.1%) missing values
  - DOB has 87 (2.2%) missing values
  - job_title has 506 (12.7%) missing values
  - job_industry_category has 656 (16.4%) missing values
  - default has 302 (7.5%) missing values
  - tenure has 87 (2.2%) missing values

- Validity: One client’s date of birth is recorded as 1843-12-21. It would be worth obtaining the correct date, or dropping this row if it does not materially affect the analysis.

- Uniqueness:

  - There are no duplicated records in this DataFrame.
  - The customer_id column has unique values, which is a good indication of data quality.
Transactions data

- Completeness: There are 1542 missing values in this DataFrame, representing 0.6% of total cells. In particular:

  - online_order has 360 (1.8%) missing values
  - product_id has 1378 (6.9%) zeros
  - brand, product_line, product_class, product_size, standard_cost, and product_first_sold_date all share the same 197 missing rows (about 1% of each column); these should be investigated and retrieved from the client if possible.

- Uniqueness: There are no duplicated records in this DataFrame.
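The manager’s voicemail also asked us to consider the join keys between the tables. A quick reconciliation of customer_id across the demographics and addresses tables (a sketch) shows how the 4,000 customers line up with the 3,999 address records:

```python
# Reconcile customer_id across the two tables; any asymmetry surfaces as
# dropped or orphaned rows when the tables are joined.
demo_ids = set(demographics['customer_id'])
addr_ids = set(addresses['customer_id'])
print("Customers without an address:", sorted(demo_ids - addr_ids))
print("Addresses without a matching customer:", sorted(addr_ids - demo_ids))
```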
5- The deliverable: Draft email to client
As a conclusion to our data analysis journey with the Sprocket Central Pty Ltd datasets, we identified several data quality issues that need to be addressed before proceeding with in-depth analysis and decision making. Following our findings, we’ve composed a comprehensive email to the client, detailing these issues and providing actionable recommendations to improve the overall quality of their data.
This step not only ensures the reliability of our subsequent analysis but also demonstrates our commitment to providing a thorough, high-quality service that supports the client’s goals. The draft email, presented below, embodies our findings and recommendations:
Subject: Data Quality Assessment Report and Recommendations for Sprocket Central Pty Ltd
Dear Team,
I hope this message finds you well.
We have conducted a preliminary analysis on the datasets you’ve provided. This analysis allows us to assess the quality of the data and make recommendations for data cleaning to enhance its usability for your marketing strategy. Below is a brief summary of our findings based on the Data Quality Framework guide we use:
1. Transactions Data

- Completeness: There were instances of missing values and zeroes, particularly in the ‘online_order’ and ‘product_id’ columns. In addition, six columns (brand, product_line, product_class, product_size, standard_cost, and product_first_sold_date) share an identical number of missing values (197); these appear to be related and warrant further investigation.

- Uniqueness: This dataset does not contain any duplicate records, which is a positive aspect in terms of data quality.

2. Demographics Data

- Accuracy: The ‘default’ column contains unrecognizable formats and could be incorrect. We suggest confirming the status of this column and considering its removal if it is not required.

- Completeness: A number of missing values were identified, especially in the ‘last_name’, ‘DOB’, ‘job_title’, ‘job_industry_category’, ‘default’, and ‘tenure’ columns.

- Validity: One client’s date of birth is listed as 1843-12-21. This is likely an error and should be addressed.

- Uniqueness: No duplicates were found, and the ‘customer_id’ column contains unique values, which is beneficial for the analysis.

3. Addresses Data

- Completeness: No missing values were found in this dataset.

- Consistency: The ‘state’ column mixes full state names with abbreviations (e.g. ‘New South Wales’ vs ‘QLD’) and should be standardized.

- Relevancy: The ‘country’ column holds a single value (‘Australia’) and adds no information to the analysis.
Given these findings, we suggest taking the following steps to mitigate current data quality concerns:
1. Investigate the cause of the jointly missing values in the transactions data and try to retrieve the missing information if possible.
2. Verify the status of the ‘default’ column in the demographics data. If it is not required or not providing accurate data, consider removing it.
3. Review the DOB data for validity. Correct any errors, or remove data points that are incorrect and will not significantly impact the analysis.
4. Standardize the ‘state’ values in the addresses data to a single format.
5. Fill missing values in the demographic data. For instance, consider reaching out to customers to fill in their job title and industry if these data points are important for your marketing strategy.

We hope these suggestions will guide you in making your data more reliable and suitable for analysis. We are ready to assist further if needed.
Best Regards,
Fawaz Alsaif
By following these recommendations, Sprocket Central Pty Ltd can significantly enhance its data quality, paving the way for effective, data-driven marketing strategies.