how to [...]

Setting up data quality reports with pandas (in no time)

One major discipline of data preparation is dealing with data quality issues. Before you can do anything meaningful with your data, you’ve to prepare the dataset. Why not starting with a data quality (short DQ) report.

Using pandas to setup a data quality report

Pandas is a reliable buddy on your journey to insights. In this example we’ll use a flatfile but can easily replace it with a database connection. For a better understanding: I used the Happiness Report 2020 dataset. The column titles of the flatfiles become the rows of the report – if you look at the picture above: how many values of the „data field“ country name are present, missing, unique… guess, you got it.

import pandas as pd

data_file = 'whatever.csv'
data = pd.read_csv(

The code above will read the data from the file. Then we’ll setup a few Data Frames which represent our DQ KPIs. By joining the DataFrames we’ll generate an easy-to-read DQ Report.

Adding the data types of our data:

data_types = pd.DataFrame(
    columns=['Data Type']

Checking for missing data:

missing_data = pd.DataFrame(
    columns=['Missing Values']

Check if the values are unique:

unique_values = pd.DataFrame(
    columns=['Unique Values']
for row in list(data.columns.values):
    unique_values.loc[row] = [data[row].nunique()]

If the min or max values are relevant, we can add:

maximum_values = pd.DataFrame(
    columns=['Maximum Value']
for row in list(data.columns.values):
    maximum_values.loc[row] = [data[row].max()]

Now, generate the overview by df.join(other_df):

dq_report = data_type.join(missing_data).join(unique_values).join(maximum_values)

What’s next?

  1. Add more useful data quality KPIs
  2. Find someone who takes care of the issues 😉

Just one idea: define the most relevant data attributes in your database and check them on a regular basis. Send the DQ Report to the data owner.

2 Kommentare Neues Kommentar hinzufügen

Schreibe einen Kommentar

Deine E-Mail-Adresse wird nicht veröffentlicht. Erforderliche Felder sind mit * markiert