Data Exploration with Pandas

Pandas is a powerful Python library for data analysis. It provides a variety of tools for exploring and understanding your data. Here are three essential code snippets to help you get started:

1. Get a quick glance at your data

Python

print(df.head())

This snippet prints the first few rows of your dataset so you can see what the data looks like. The head() method in Pandas returns the top rows of your DataFrame. By default it returns the first 5 rows, but you can pass a different number inside the parentheses, as in df.head(10), to see more or fewer rows.
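As a minimal sketch of how the row count can be adjusted, the example below builds a small DataFrame by hand. The column names and values are invented purely for illustration and are not taken from any real dataset.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
df = pd.DataFrame({
    "customer_id": range(1, 11),
    "total_purchases": [1000, 500, 250, 100, 50, 75, 300, 20, 640, 10],
})

first_default = df.head()        # no argument: the first 5 rows
first_three = df.head(3)         # the first 3 rows
all_but_last_two = df.head(-2)   # negative n: every row except the last 2

print(first_three)
```

The companion method df.tail() works the same way from the other end of the DataFrame, which can be handy for checking that a file was read all the way through.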

2. Check your data types and identify missing values

Python

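df.info()

This code uses the info() method in Pandas to print an overview of your dataset. Because info() prints its report directly and returns None, there is no need to wrap it in print(); doing so only adds a stray "None" to the output. The report displays several pieces of information: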

  • The data types of each column (e.g., int, float, object)
  • The number of non-null (non-missing) values in each column
  • The total number of entries (rows)
  • Memory usage

This summary is helpful for understanding the data types of your attributes and identifying columns with missing values. A column with missing values shows a “non-null” count that is lower than the total number of entries.
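To make the missing-value check concrete, here is a small sketch that pairs info() with isna().sum(), which counts missing entries per column directly. The DataFrame and its column names (including city) are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a few deliberately missing entries
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "total_purchases": [1000.0, np.nan, 250.0, np.nan],
    "city": ["Pune", None, "Delhi", "Mumbai"],
})

# info() prints its report directly; the non-null counts for
# total_purchases and city fall below the 4 total entries
df.info()

# Count the missing values in each column explicitly
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Reading the two outputs side by side shows the relationship: the non-null count from info() plus the missing count from isna().sum() equals the total number of rows.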

3. Get summary statistics for numerical columns

Python

print(df.describe())

By default, the describe() method generates summary statistics for the numerical columns in your dataset and skips non-numerical ones. These statistics include:

  • Count: The number of non-null values.
  • Mean: The average value of the data.
  • Std: The standard deviation, which measures the spread of the data.
  • Min: The minimum value in the column.
  • 25%: The 25th percentile value.
  • 50%: The median (50th percentile) value.
  • 75%: The 75th percentile value.
  • Max: The maximum value in the column.

These statistics provide a quick overview of the central tendency and spread of your numerical data. They can help you identify outliers, understand data distribution, and make initial observations about the data’s characteristics.
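One way the quartiles from describe() might be used to flag outliers is the common 1.5 × IQR rule of thumb. That rule is an assumption brought in for this sketch rather than something describe() applies itself, and the sample values below are invented.

```python
import pandas as pd

# Hypothetical sample data; 9000 is deliberately far from the rest
df = pd.DataFrame({"total_purchases": [50, 100, 250, 500, 1000, 9000]})

# describe() on a single column returns a Series indexed by statistic name
stats = df["total_purchases"].describe()

# Interquartile range: the distance between the 25th and 75th percentiles
iqr = stats["75%"] - stats["25%"]
lower_bound = stats["25%"] - 1.5 * iqr
upper_bound = stats["75%"] + 1.5 * iqr

# Keep only the rows that fall outside the bounds
is_outlier = (df["total_purchases"] < lower_bound) | (df["total_purchases"] > upper_bound)
outliers = df[is_outlier]
print(outliers)
```

A flagged value is only a candidate for a closer look. Whether it is a data-entry error or a genuinely large customer is a question the statistics alone cannot answer.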

In summary, these code snippets are essential initial steps in exploratory data analysis (EDA). They allow you to quickly examine your dataset’s structure, data types, and basic statistics, which is crucial for understanding your data before diving into more in-depth analysis and visualization.

Example:

Let’s say we have a dataset of customer purchase data. We can use Pandas to explore this data as follows:

Python

import pandas as pd

# Read the dataset into a Pandas DataFrame
df = pd.read_csv('customer_purchase_data.csv')

# Print the first few rows of the dataset
print(df.head())

# Check the data types and identify missing values
# (info() prints its report itself and returns None, so no print() is needed)
df.info()

# Get summary statistics for numerical columns
print(df.describe())

Output:

   customer_id  total_purchases  average_purchase
0            1             1000             100.0
1            2              500              50.0
2            3              250              25.0
3            4              100              10.0
4            5               50               5.0

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   customer_id       5 non-null      int64
 1   total_purchases   5 non-null      int64
 2   average_purchase  5 non-null      float64
dtypes: float64(1), int64(2)
memory usage: 160.0 bytes

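       customer_id  total_purchases  average_purchase
count     5.000000         5.000000          5.000000
mean      3.000000       380.000000         38.000000
std       1.581139       388.265373         38.826537
min       1.000000        50.000000          5.000000
25%       2.000000       100.000000         10.000000
50%       3.000000       250.000000         25.000000
75%       4.000000       500.000000         50.000000
max       5.000000      1000.000000        100.000000

This initial exploration tells us that the dataset covers 5 customers whose total_purchases values sum to 1,900. The mean of total_purchases is 380, but the spread is wide: values range from 50 to 1,000, and the standard deviation (about 388) is larger than the mean. The info() output shows 5 non-null values in every column, so there are no missing values in the dataset. Note that describe() also summarizes customer_id because it is stored as a number, even though statistics on an identifier carry no real meaning.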

We can use this information to start to ask more specific questions about the data, such as:

  • Which customers are the biggest spenders?
  • What are the most popular items purchased?
  • Is there a relationship between the total number of purchases and the average purchase amount?
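As a rough sketch of how the first and last of these questions could be approached, the example below rebuilds the five rows from the sample output by hand instead of reading the CSV file. The choice of nlargest() and a Pearson correlation is an assumption about what a reasonable first pass looks like, not the only way to do it. The question about popular items cannot be answered from these three columns, since it would need item-level data.

```python
import pandas as pd

# The five sample rows, typed in by hand for illustration
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "total_purchases": [1000, 500, 250, 100, 50],
    "average_purchase": [100.0, 50.0, 25.0, 10.0, 5.0],
})

# Biggest spenders: the 3 rows with the largest total_purchases
top_spenders = df.nlargest(3, "total_purchases")
print(top_spenders)

# Relationship between the two columns: Pearson correlation coefficient
correlation = df["total_purchases"].corr(df["average_purchase"])
print(correlation)
```

In this toy data every average_purchase value is exactly one tenth of total_purchases, so the correlation comes out at essentially 1. Real purchase data would rarely be this tidy.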

By exploring our data with Pandas, we can gain insights that can help us make better business decisions.

By Pankaj
