Data Science: Data Profiling

Rudra
2 min read · Nov 6, 2019


Understanding your data

Abstract:

The first step in any data science project is to understand your data, also known as data profiling. I came across two useful resources that help with this. One is an article, Profiling Big Data in distributed environment using Spark: A Pyspark Data Primer for Machine Learning, published by Shaheen Gauher on Medium. The other is a Python module called “pandas_profiling”.

In this blog post, I will walk through an example of using “pandas_profiling” to profile data.

Assumption:

I assume you are already familiar with Python development and with installing Python modules. You will need the “pandas_profiling” module; install it with pip install pandas-profiling if it is not already in your environment.

Example Code:

The code can be found in the repo data-profiling.

Import Python modules: The code is straightforward. We need pandas to read a CSV file and create a pandas data frame, and pandas_profiling to generate the profile.

import pandas as pd
import pandas_profiling as dprof

Load Data: I downloaded the example dataset from the FBI Crime Data Explorer site. The file contains counts of law enforcement officers employed in different regions of the US from 1960 to 2018. It is about 197 MB and has about 1.4 million records.

The code below reads the CSV file and creates a pandas data frame named pe_data.

# load original police employment data
pe_file_path = './pe_1960_2018.csv'
pe_data = pd.read_csv(pe_file_path)
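Before running the full profile, a quick sanity check with plain pandas confirms the frame loaded as expected. The snippet below is a sketch using a small synthetic frame with hypothetical column names, since the FBI CSV itself is not bundled with this post:

```python
import pandas as pd

# stand-in for pe_data; the real columns come from the FBI CSV
pe_data = pd.DataFrame({
    "year": [1960, 1961, 1962],
    "officer_count": [100, None, 120],
})

print(pe_data.shape)         # (rows, columns)
print(pe_data.dtypes)        # inferred type per column
print(pe_data.isna().sum())  # missing values per column
```

These same calls work unchanged on the real pe_data frame and give a first hint of what the full profile report will show.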

Profile Data: The function “ProfileReport()” in “pandas_profiling” takes a pandas data frame as input and creates a profile. The profile can then be saved as an HTML file that you can open in any browser.

#data profiling police employment data
data_prof = dprof.ProfileReport(pe_data)
data_prof.to_file('./pe_data.html')

Performance:

I’ve observed the profiling taking anywhere from a few minutes to a few hours, so please be patient.

If you perform any data cleanup steps, such as removing duplicate records or rows with missing values, please reset the index before running the profiling function.
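The reason is that dropping rows leaves gaps in the index, which can confuse downstream tools; reset_index(drop=True) renumbers it from zero. A sketch with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"region": ["NE", "NE", "SW"], "officers": [10, 10, 25]})

# remove duplicate records, then renumber the index from 0
cleaned = df.drop_duplicates().reset_index(drop=True)
print(cleaned.index.tolist())  # [0, 1]
```

Without drop=True, reset_index would keep the old index as an extra column, which would then show up in the profile report as a variable of its own.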

Conclusion:

With a few lines of code, we are able to get deeper insight into the data. The report provides

  • Dataset summary: number of variables, number of observations, total missing (%), total size in memory, and average record size in memory
  • For every field: type (numeric or categorical), distinct count, unique %, missing %, mean, minimum, and maximum
  • Sample data, correlations, and much other useful information
