Assessing data quality (KPMG Virtual Internship - Part 1)

Sprocket Central Pty Ltd, a medium-sized bikes and cycling accessories organisation, has approached Tony Smith (Partner) in KPMG’s Lighthouse & Innovation Team. Sprocket Central Pty Ltd is keen to learn more about KPMG’s expertise in its Analytics, Information & Modelling team.

Smith discusses KPMG’s expertise in this space. In particular, he speaks about how the team can effectively analyse the datasets to help Sprocket Central Pty Ltd grow its business.

Primarily, Sprocket Central Pty Ltd needs help with its customer and transactions data. The organisation has a large dataset relating to its customers, but its team is unsure how to analyse it effectively to optimise the marketing strategy.

To support the analysis, you speak to the Associate Director for ideas, and she advises that “the importance of optimising the quality of customer datasets cannot be underestimated. The better the quality of the dataset, the better chance you will be able to use it to drive company growth.”

The client provided KPMG with 3 datasets:

  • Customer Demographic
  • Customer Addresses
  • Transactions data for the past 3 months

You decide to start the preliminary data exploration and identify ways to improve the quality of Sprocket Central Pty Ltd’s data.

I- The task at hand

You arrive at your desk after the initial client meeting. You have a voicemail on your phone which contains the following instructions.

“Hi there – Welcome again to the team! The client has asked our team to assess the quality of their data, as well as make recommendations on ways to clean the underlying data and mitigate any issues. Can you please take a look at the datasets we’ve received and draft an email to them identifying the data quality issues and how these may impact our analysis going forward?

I will send through an example of a typical data quality framework that can be used as a guide. Remember to consider the join keys between the tables too. Thanks again for your help.”

Subsequently, you receive the following email:


Hi there,

As per my voicemail, please find attached the 3 datasets from Sprocket Central Pty Ltd:

  • Customer Demographic
  • Customer Addresses
  • Transactions data for the past three months

Can you please review the data quality to ensure that it is ready for our analysis in phase two? Remember to take note of any assumptions or issues we need to go back to the client on, as well as recommendations going forward to mitigate current data quality concerns.

I’ve also attached a data quality framework as a guideline. Let me know if you have any questions.

Thanks for your help.

Kind Regards

Your Manager



Hence the task at hand is to draft an email to the client identifying the data quality issues and strategies to mitigate these issues. Refer to ‘Data Quality Framework Table’ and resources below for criteria and dimensions which you should consider.

You may use programs like Excel, Google Sheets, Tableau, or Power BI to start. Feel free to use Python, R, MATLAB, or any other data analytics tools you know.

II- Import the necessary packages

import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

# Load each sheet from the client workbook; the header row index differs by
# sheet because some sheets carry a note row above the column headers.
demographics = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx', sheet_name='CustomerDemographic', header=0)
addresses = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx', sheet_name='CustomerAddress', header=1)
transactions = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx', sheet_name='Transactions', header=1)
cust_list = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx', sheet_name='NewCustomerList', header=1)
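Note that reading .xlsx files requires an Excel engine such as openpyxl to be installed (an assumption about the local environment). As a quick sanity check, we can confirm that each sheet loaded with a plausible shape:

# Sanity check: print the shape of each loaded sheet
for name, df in {'demographics': demographics, 'addresses': addresses,
                 'transactions': transactions, 'cust_list': cust_list}.items():
    print(name, df.shape)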

Now that all the data is loaded into pandas DataFrames, we will examine each one through the lens of data quality to uncover any issues we may encounter.

These issues may fall under the following dimensions (a small helper that computes some of these checks is sketched after the list):

  1. Accuracy: correct values
  2. Completeness: data fields populated with values
  3. Consistency: values free from contradictions
  4. Currency: values up to date
  5. Relevancy: data items with value meta-data
  6. Validity: data containing allowable values
  7. Uniqueness: records free from duplication
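Throughout this notebook we compute these checks by hand, but as an illustrative sketch (not part of the original brief), a small helper can bundle the completeness and uniqueness dimensions for any DataFrame:

def dq_summary(df):
    """Quick data-quality snapshot: dtype, completeness and uniqueness per column."""
    return pd.DataFrame({
        'dtype': df.dtypes,
        'missing': df.isna().sum(),
        'missing_%': (df.isna().mean() * 100).round(2),
        'unique_values': df.nunique(),
    })

# Usage: dq_summary(demographics)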

1- Demographics DataFrame

Let’s first generate a data dictionary to clarify the column names and data types.

column_names = list(demographics.columns)
data_types = demographics.dtypes.tolist()
description = [""] * len(column_names)  # Add descriptions for each column if available

# Create the data dictionary DataFrame
data_dictionary = pd.DataFrame({
    'Column Name': column_names,
    'Data Type': data_types,
    'Description': description
})

# Display the data dictionary

print(data_dictionary)
                            Column Name       Data Type Description
0                           customer_id           int64            
1                            first_name          object            
2                             last_name          object            
3                                gender          object            
4   past_3_years_bike_related_purchases           int64            
5                                   DOB  datetime64[ns]            
6                             job_title          object            
7                 job_industry_category          object            
8                        wealth_segment          object            
9                    deceased_indicator          object            
10                              default          object            
11                             owns_car          object            
12                               tenure         float64            
column_names = list(demographics.columns)

# Display the column names
print("Column names:", column_names)
Column names: ['customer_id', 'first_name', 'last_name', 'gender', 'past_3_years_bike_related_purchases', 'DOB', 'job_title', 'job_industry_category', 'wealth_segment', 'deceased_indicator', 'default', 'owns_car', 'tenure']

Let’s display the top rows of the DataFrame so we can make a quick visual assessment of the data.

demographics.head()
customer_id first_name last_name gender past_3_years_bike_related_purchases DOB job_title job_industry_category wealth_segment deceased_indicator default owns_car tenure
0 1 Laraine Medendorp F 93 1953-10-12 Executive Secretary Health Mass Customer N "' Yes 11.0
1 2 Eli Bockman Male 81 1980-12-16 Administrative Officer Financial Services Mass Customer N <script>alert('hi')</script> Yes 16.0
2 3 Arlin Dearle Male 61 1954-01-20 Recruiting Manager Property Mass Customer N 2018-02-01 00:00:00 Yes 15.0
3 4 Talbot NaN Male 33 1961-10-03 NaN IT Mass Customer N () { _; } >_[$($())] { touch /tmp/blns.shellsh... No 7.0
4 5 Sheila-kathryn Calton Female 56 1977-05-13 Senior Editor NaN Affluent Customer N NIL Yes 8.0

Let’s get the DataFrame info.

demographics.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 13 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   customer_id                          4000 non-null   int64         
 1   first_name                           4000 non-null   object        
 2   last_name                            3875 non-null   object        
 3   gender                               4000 non-null   object        
 4   past_3_years_bike_related_purchases  4000 non-null   int64         
 5   DOB                                  3913 non-null   datetime64[ns]
 6   job_title                            3494 non-null   object        
 7   job_industry_category                3344 non-null   object        
 8   wealth_segment                       4000 non-null   object        
 9   deceased_indicator                   4000 non-null   object        
 10  default                              3698 non-null   object        
 11  owns_car                             4000 non-null   object        
 12  tenure                               3913 non-null   float64       
dtypes: datetime64[ns](1), float64(1), int64(2), object(9)
memory usage: 406.4+ KB

Next, let’s check the dimensions of the DataFrame.

n_lines, n_columns = demographics.shape
print(f"the number of lines in this DF is {n_lines}")
print(f"the number of columns in this DF is {n_columns}")
the number of lines in this DF is 4000
the number of columns in this DF is 13
from pandas_profiling import ProfileReport

# Generate the profiling report
profile = ProfileReport(demographics)

# Save the report as an HTML file
profile.to_file('demographics_report.html')

# Open the saved report in the default browser
import webbrowser
webbrowser.open('demographics_report.html')

# Display the report inline in the notebook
profile.to_notebook_iframe()
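One caveat on the import above: the pandas_profiling package has since been renamed, so on newer installs the equivalent import is:

from ydata_profiling import ProfileReport  # renamed successor of pandas_profiling; same API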

num_duplicates = demographics.duplicated().sum()
num_missing = demographics.isna().sum()
print("Number of duplicate records:", num_duplicates)
print("Number of missing values per column:")
print("--------------------------")
print(num_missing)
print("--------------------------")
print("% of missing values per column:")
print(round(num_missing/demographics.shape[0]*100, 2))
Number of duplicate records: 0
Number of missing values per column:
--------------------------
customer_id                              0
first_name                               0
last_name                              125
gender                                   0
past_3_years_bike_related_purchases      0
DOB                                     87
job_title                              506
job_industry_category                  656
wealth_segment                           0
deceased_indicator                       0
default                                302
owns_car                                 0
tenure                                  87
dtype: int64
--------------------------
% of missing values per column:
customer_id                             0.00
first_name                              0.00
last_name                               3.12
gender                                  0.00
past_3_years_bike_related_purchases     0.00
DOB                                     2.17
job_title                              12.65
job_industry_category                  16.40
wealth_segment                          0.00
deceased_indicator                      0.00
default                                 7.55
owns_car                                0.00
tenure                                  2.17
dtype: float64
demographics['DOB'].describe()
count                    3913
unique                   3448
top       1978-01-30 00:00:00
freq                        7
first     1843-12-21 00:00:00
last      2002-03-11 00:00:00
Name: DOB, dtype: object
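The FutureWarning raised here names its own fix: passing datetime_is_numeric=True makes describe() treat DOB as numeric (reporting mean, min, max and quartiles instead of unique/top/freq):

# Adopt the future pandas behavior now, as the warning recommends
demographics['DOB'].describe(datetime_is_numeric=True)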

From our examination of the demographics dataset, we can observe the following (the accuracy and validity checks are sketched in code below):

  • Accuracy: Correct values

    The “default” column appears to be mostly junk, with unrecognizable formats (HTML tags, shell-injection strings, stray dates). We would like to confirm the status of this column with the client; otherwise, it is better to drop it altogether.

  • Completeness: Data fields with values

    There are 1763 missing values in this DataFrame, representing 3.4% of total cells, notably:

    • last_name has 125 (3.1%) missing values
    • DOB has 87 (2.2%) missing values
    • job_title has 506 (12.7%) missing values
    • job_industry_category has 656 (16.4%) missing values
    • default has 302 (7.5%) missing values
    • tenure has 87 (2.2%) missing values

  • Validity: Data containing allowable values

    One client’s date of birth is recorded as 1843-12-21, which is clearly implausible. It could be worth obtaining the correct value, or dropping this record if it does not affect the analysis.

  • Uniqueness:

    • There are no duplicated records in this DataFrame.
    • The customer_id column has unique values, which is a good indication of data quality.
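To ground the accuracy and validity observations, here is a minimal sketch; the 1920 cutoff is an illustrative assumption that should be agreed with the client:

# Accuracy: inspect the 'default' column to confirm it carries no usable signal
print(demographics['default'].value_counts().head(10))

# Validity: flag implausibly old dates of birth
cutoff = pd.Timestamp('1920-01-01')  # assumed plausible lower bound
suspect_dob = demographics.loc[demographics['DOB'] < cutoff,
                               ['customer_id', 'first_name', 'DOB']]
print(suspect_dob)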

Missing values by column - Demographics DataFrame

2- Customer Addresses DataFrame

Let’s first generate a data dictionary to clarify the column names and data types.

column_names = list(addresses.columns)
data_types = addresses.dtypes.tolist()
description = [""] * len(column_names)  # Add descriptions for each column if available

# Create the data dictionary DataFrame
data_dictionary = pd.DataFrame({
    'Column Name': column_names,
    'Data Type': data_types,
    'Description': description
})

# Display the data dictionary

print(data_dictionary)
          Column Name Data Type Description
0         customer_id     int64            
1             address    object            
2            postcode     int64            
3               state    object            
4             country    object            
5  property_valuation     int64            
column_names = list(addresses.columns)

# Display the column names
print("Column names:", column_names)
Column names: ['customer_id', 'address', 'postcode', 'state', 'country', 'property_valuation']

Let’s display the top rows of the DataFrame so we can make a quick visual assessment of the data.

addresses.head()
customer_id address postcode state country property_valuation
0 1 060 Morning Avenue 2016 New South Wales Australia 10
1 2 6 Meadow Vale Court 2153 New South Wales Australia 10
2 4 0 Holy Cross Court 4211 QLD Australia 9
3 5 17979 Del Mar Point 2448 New South Wales Australia 4
4 6 9 Oakridge Court 3216 VIC Australia 9

Let’s get the DataFrame info.

addresses.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3999 entries, 0 to 3998
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   customer_id         3999 non-null   int64 
 1   address             3999 non-null   object
 2   postcode            3999 non-null   int64 
 3   state               3999 non-null   object
 4   country             3999 non-null   object
 5   property_valuation  3999 non-null   int64 
dtypes: int64(3), object(3)
memory usage: 187.6+ KB

Next, let’s check the dimensions of the DataFrame.

n_lines, n_columns = addresses.shape
print(f"the number of lines in this DF is {n_lines}")
print(f"the number of columns in this DF is {n_columns}")
the number of lines in this DF is 3999
the number of columns in this DF is 6
from pandas_profiling import ProfileReport

# Generate the profiling report
profile = ProfileReport(addresses)

# Save the report as an HTML file
profile.to_file('addresses_report.html')

# Open the saved report in the default browser
webbrowser.open('addresses_report.html')

num_duplicates = addresses.duplicated().sum()
num_missing = addresses.isna().sum()

print("Number of duplicate records:", num_duplicates)
print("Number of missing values per column:")
print("--------------------------")
print(num_missing)

print("--------------------------")
print("% of missing values per column:")
print(round(num_missing/addresses.shape[0]*100, 2))
Number of duplicate records: 0
Number of missing values per column:
--------------------------
customer_id           0
address               0
postcode              0
state                 0
country               0
property_valuation    0
dtype: int64
--------------------------
% of missing values per column:
customer_id           0.0
address               0.0
postcode              0.0
state                 0.0
country               0.0
property_valuation    0.0
dtype: float64

From our examination of the addresses dataset, we can observe the following (the consistency and join-key checks are sketched in code below):

  • Completeness:

    There are no missing values in this DataFrame.

  • Consistency:

    The state column mixes full names (“New South Wales”) with abbreviations (“QLD”, “VIC”), as visible in the sample above; these labels should be standardised.

  • Relevancy:

    The country is “Australia” for all clients, so this column adds no information and can be dropped for the rest of the analysis.

  • Uniqueness:

    There are no duplicated records in this DataFrame, though note that the table holds 3999 records against 4000 customers in the demographics data, so at least one customer lacks an address record.
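A sketch of the consistency and join-key checks flagged above; the full-name-to-abbreviation mapping is an assumption based on the sample shown earlier and should be verified against the actual value counts:

# Consistency: inspect the mixed state labels
print(addresses['state'].value_counts())

# Normalise full state names to abbreviations (assumed mapping; extend as needed)
state_map = {'New South Wales': 'NSW', 'Victoria': 'VIC', 'Queensland': 'QLD'}
addresses['state'] = addresses['state'].replace(state_map)

# Join keys: which customers have no address record?
missing_addr = set(demographics['customer_id']) - set(addresses['customer_id'])
print(f"customers without an address record: {sorted(missing_addr)}")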

3- Transactions DataFrame

Let’s first generate a data dictionary to clarify the column names and data types.

column_names = list(transactions.columns)
data_types = transactions.dtypes.tolist()
description = [""] * len(column_names)  # Add descriptions for each column if available

# Create the data dictionary DataFrame
data_dictionary = pd.DataFrame({
    'Column Name': column_names,
    'Data Type': data_types,
    'Description': description
})

# Display the data dictionary

print(data_dictionary)
            Column Name       Data Type Description
0                     1           int64            
1                     2           int64            
2                  2950           int64            
3   2017-02-25 00:00:00  datetime64[ns]            
4                 False         float64            
5              Approved          object            
6                 Solex          object            
7              Standard          object            
8                medium          object            
9              medium.1          object            
10                71.49         float64            
11                53.62         float64            
12                41245         float64            
column_names = list(transactions.columns)

# Display the column names
print("Column names:", column_names)
Column names: [1, 2, 2950, datetime.datetime(2017, 2, 25, 0, 0), False, 'Approved', 'Solex', 'Standard', 'medium', 'medium.1', 71.49, 53.62, 41245]
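These “column names” are plainly data values (1, 2, 2950, a date, prices), which indicates that the header argument used when loading this sheet consumed the first transaction as the header row; notice the table also has 19999 rows rather than the expected 20000. If the real header sits in the first row of this sheet, a reload along the following lines would recover the proper column names (the header index is an assumption to verify against the raw workbook). We keep the frame as loaded for the rest of this exploration, but flag this for the cleaning phase:

# Hypothetical fix: reload with the header row index corrected (verify in the raw file)
transactions_fixed = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx',
                                   sheet_name='Transactions',
                                   header=0)
print(transactions_fixed.columns.tolist())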
transactions.head()
1 2 2950 2017-02-25 00:00:00 False Approved Solex Standard medium medium.1 71.49 53.62 41245
0 2 3 3120 2017-05-21 1.0 Approved Trek Bicycles Standard medium large 2091.47 388.92 41701.0
1 3 37 402 2017-10-16 0.0 Approved OHM Cycles Standard low medium 1793.43 248.82 36361.0
2 4 88 3135 2017-08-31 0.0 Approved Norco Bicycles Standard medium medium 1198.46 381.10 36145.0
3 5 78 787 2017-10-01 1.0 Approved Giant Bicycles Standard medium large 1765.30 709.48 42226.0
4 6 25 2339 2017-03-08 1.0 Approved Giant Bicycles Road medium medium 1538.99 829.65 39031.0
transactions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   1                    19999 non-null  int64         
 1   2                    19999 non-null  int64         
 2   2950                 19999 non-null  int64         
 3   2017-02-25 00:00:00  19999 non-null  datetime64[ns]
 4   False                19639 non-null  float64       
 5   Approved             19999 non-null  object        
 6   Solex                19802 non-null  object        
 7   Standard             19802 non-null  object        
 8   medium               19802 non-null  object        
 9   medium.1             19802 non-null  object        
 10  71.49                19999 non-null  float64       
 11  53.62                19802 non-null  float64       
 12  41245                19802 non-null  float64       
dtypes: datetime64[ns](1), float64(4), int64(3), object(5)
memory usage: 2.0+ MB
n_lines, n_columns = transactions.shape
print(f"the number of lines in this DF is {n_lines}")
print(f"the number of columns in this DF is {n_columns}")
the number of lines in this DF is 19999
the number of columns in this DF is 13
from pandas_profiling import ProfileReport

# Generate the profiling report
profile = ProfileReport(transactions)

# Save the report as an HTML file
profile.to_file('transactions_report.html')

# Open the saved report in the default browser
webbrowser.open('transactions_report.html')

num_duplicates = transactions.duplicated().sum()
num_missing = transactions.isna().sum()

print("Number of duplicate records:", num_duplicates)
print("Number of missing values per column:")
print("--------------------------")
print(num_missing)
print("--------------------------")
print("% of missing values per column:")
print(round(num_missing/transactions.shape[0]*100, 2))
Number of duplicate records: 0
Number of missing values per column:
--------------------------
1                        0
2                        0
2950                     0
2017-02-25 00:00:00      0
False                  360
Approved                 0
Solex                  197
Standard               197
medium                 197
medium.1               197
71.49                    0
53.62                  197
41245                  197
dtype: int64
--------------------------
% of missing values per column:
1                      0.00
2                      0.00
2950                   0.00
2017-02-25 00:00:00    0.00
False                  1.80
Approved               0.00
Solex                  0.99
Standard               0.99
medium                 0.99
medium.1               0.99
71.49                  0.00
53.62                  0.99
41245                  0.99
dtype: float64

From our examination of the transactions dataset, we can observe the following (whether the missing product fields co-occur row-wise is checked in the sketch below the figure):

  • Completeness: Data fields with values

    There are 1542 missing values in this DataFrame, representing 0.6% of total cells, notably:

    • online_order has 360 (1.8%) missing values
    • product_id has 1378 (6.9%) zeros
    • brand, product_line, product_class, product_size, standard_cost and product_first_sold_date all have the same number of missing values (197), accounting for 0.99% of the data in each column. Upon visual examination, these appear to be related; it could be worth investigating why these 197 occurrences happened and retrieving the related data if possible.

  • Uniqueness:

    There are no duplicated records in this DataFrame.

Missing values by column - Transactions DataFrame
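As flagged in the completeness findings, we can verify that the 197 missing values co-occur in the same rows; a minimal sketch using the column labels as loaded:

# Columns with exactly 197 missing values
prod_cols = transactions.columns[transactions.isna().sum() == 197]

# Count the rows where every one of those fields is missing at once
rows_all_missing = transactions[prod_cols].isna().all(axis=1).sum()
print(f"rows missing all {len(prod_cols)} product fields simultaneously: {rows_all_missing}")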

4- Findings summary

Addresses data

  • Completeness:

    There are no missing values in this DataFrame.

  • Consistency:

    The state column mixes full names (“New South Wales”) with abbreviations (“QLD”, “VIC”); these labels should be standardised.

  • Relevancy:

    The country is “Australia” for all clients, so this column can be dropped for the rest of the analysis.

  • Uniqueness:

    There are no duplicated records in this DataFrame, though the table holds 3999 records against 4000 customers in the demographics data.

Demographics data

  • Accuracy:

    The “default” column appears to be mostly junk, with unrecognizable formats. We would like to confirm the status of this column with the client; otherwise, it is better to drop it altogether.

  • Completeness:

    There are 1763 missing values in this DataFrame, representing 3.4% of total cells, notably:

    • last_name has 125 (3.1%) missing values
    • DOB has 87 (2.2%) missing values
    • job_title has 506 (12.7%) missing values
    • job_industry_category has 656 (16.4%) missing values
    • default has 302 (7.5%) missing values
    • tenure has 87 (2.2%) missing values

  • Validity:

    One client’s date of birth is recorded as 1843-12-21. It could be worth obtaining the correct value, or dropping this record if it does not affect the analysis.

  • Uniqueness:

    • There are no duplicated records in this DataFrame.
    • The customer_id column has unique values, which is a good indication of data quality.

Transactions data

  • Completeness:

    There are 1542 missing values in this DataFrame, representing 0.6% of total cells, notably:

    • online_order has 360 (1.8%) missing values
    • product_id has 1378 (6.9%) zeros
    • brand, product_line, product_class, product_size, standard_cost and product_first_sold_date all have the same number of missing values (197), accounting for 0.99% of the data in each column. These appear to be related; it could be worth investigating why they occurred and retrieving the data if possible.

  • Uniqueness:

    There are no duplicated records in this DataFrame.

5- The deliverable: Draft email to client

As a conclusion to our data analysis journey with the Sprocket Central Pty Ltd datasets, we identified several data quality issues that need to be addressed before proceeding with in-depth analysis and decision making. Following our findings, we’ve composed a comprehensive email to the client, detailing these issues and providing actionable recommendations to improve the overall quality of their data.

This step not only ensures the reliability of our subsequent analysis but also demonstrates our commitment to providing a thorough, high-quality service that supports the client’s goals. The draft email, presented below, embodies our findings and recommendations:


Subject: Data Quality Assessment Report and Recommendations for Sprocket Central Pty Ltd

Dear Team,

I hope this message finds you well.

We have conducted a preliminary analysis on the datasets you’ve provided. This analysis allows us to assess the quality of the data and make recommendations for data cleaning to enhance its usability for your marketing strategy. Below is a brief summary of our findings based on the Data Quality Framework guide we use:

1. Transactions Data

  • Completeness: There were some instances of missing values and zeroes, particularly in the ‘online_order’ and ‘product_id’ columns. Also, there are a number of columns (Brand, product_line, product_class, product_size, standard_cost, and product_first_sold_date) with identical numbers of missing values (197). It appears these could be related and warrant further investigation.

  • Uniqueness: This dataset does not contain any duplicate records, which is a positive aspect in terms of data quality.

2. Demographics Data

  • Accuracy: The ‘default’ column contains unrecognizable formats and could be incorrect. We suggest confirming the status of this column and considering its removal if it’s not required.

  • Completeness: A number of missing values were identified, especially in the ‘last_name’, ‘DOB’, ‘job_title’, ‘job_industry_category’, ‘default’, and ‘tenure’ columns.

  • Validity: One client’s date of birth is listed as 1843-12-21. This might be an error and should be addressed.

  • Uniqueness: No duplicates were found, and the ‘customer_id’ column contains unique values which is beneficial for the analysis.

3. Addresses Data

  • Completeness: No missing values were found in this dataset. However, it contains 3,999 records against 4,000 customers in the demographics data, so the customer_id join keys should be reconciled before the tables are combined.

  • Consistency: The ‘state’ column mixes full state names (‘New South Wales’) with abbreviations (‘QLD’, ‘VIC’); we recommend standardising these labels.

Given these findings, we suggest taking the following steps to mitigate current data quality concerns:

  1. Investigate the cause of the co-occurring missing values in the transactions data and try to retrieve the missing information where possible.
  2. Verify the status of the ‘default’ column in the demographics data; if it is not required or not providing accurate data, consider removing it.
  3. Review the DOB data for validity; correct any errors, or remove records that are incorrect and will not significantly impact the analysis.
  4. Fill missing values in the demographics data; for instance, consider reaching out to customers for their job title and industry if these fields are important for your marketing strategy.
  5. Standardise the ‘state’ labels in the addresses data and reconcile the customer_id keys across the three tables before joining.

We hope these suggestions will guide you in making your data more reliable and suitable for analysis. We are ready to assist further if needed.

Best Regards,

Fawaz Alsaif


By following these recommendations, Sprocket Central Pty Ltd can significantly enhance its data quality, paving the way for effective, data-driven marketing strategies.