Data Science: Data Profiling

Rudra
2 min read · Nov 6, 2019


Understanding your data

Abstract:

The first step in any data science project is to understand your data, also known as data profiling. I came across two useful resources that help with this. One is an article, Profiling Big Data in distributed environment using Spark: A Pyspark Data Primer for Machine Learning, published by Shaheen Gauher on Medium. The other is a Python module called “pandas_profiling”.

In this blog post, I will walk through an example of using “pandas_profiling” to profile data.

Assumption:

I assume you are already familiar with Python development and with installing Python modules. You will need the “pandas_profiling” module; install it with pip install pandas-profiling if it is not already in your environment.

Example Code:

The code can be found in the repo data-profiling.

Import Python modules: The code is straightforward. We need pandas to read a CSV file and create a pandas data frame, and pandas_profiling to generate the profile.

import pandas as pd
import pandas_profiling as dprof

Load Data: I downloaded the example dataset from the FBI Crime Data Explorer site. The file contains counts of law enforcement officers employed in different regions of the US from 1960 to 2018. It is about 197 MB and has about 1.4 million records.

The code below reads the CSV file and creates a pandas data frame named pe_data.

# load original police employment data
pe_file_path = './pe_1960_2018.csv'
pe_data = pd.read_csv(pe_file_path)
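Before running the full profile, a quick sanity check with plain pandas confirms the frame loaded as expected. The snippet below is a sketch using a small synthetic frame with hypothetical column names, since the FBI CSV itself is not bundled with this post:

```python
import pandas as pd

# stand-in for pe_data; the real columns come from the FBI CSV
pe_data = pd.DataFrame({
    "year": [1960, 1961, 1962],
    "officer_count": [100, None, 120],
})

print(pe_data.shape)         # (rows, columns)
print(pe_data.dtypes)        # inferred type per column
print(pe_data.isna().sum())  # missing values per column
```

These same calls work unchanged on the real pe_data frame and give a first hint of what the full profile report will show.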

Profile Data: The function “ProfileReport()” in “pandas_profiling” takes a pandas data frame as input and creates a profile. The profile can then be saved as an HTML file that you can open in any browser.

#data profiling police employment data
data_prof = dprof.ProfileReport(pe_data)
data_prof.to_file('./pe_data.html')

Performance:

I’ve observed the profiling taking anywhere from a few minutes to a few hours, so please be patient.

If you perform any data cleanup steps, such as removing duplicate records or rows with missing values, please reset the index before running the profiling function.
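The reason is that dropping rows leaves gaps in the index, which can confuse downstream tools; reset_index(drop=True) renumbers it from zero. A sketch with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"region": ["NE", "NE", "SW"], "officers": [10, 10, 25]})

# remove duplicate records, then renumber the index from 0
cleaned = df.drop_duplicates().reset_index(drop=True)
print(cleaned.index.tolist())  # [0, 1]
```

Without drop=True, reset_index would keep the old index as an extra column, which would then show up in the profile report as a variable of its own.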

Conclusion:

With a few lines of code, we are able to get deeper insight into the data. The report provides

  • Dataset summary: number of variables, number of observations, total missing (%), total size in memory, and average record size in memory
  • For every field: type (numeric or categorical), distinct count, unique %, missing %, mean, minimum, and maximum
  • Sample data, correlations, and much other useful information
