Data Analysis With Pandas

Pandas is the standard library for working with tabular data in Python. It provides labeled data structures and tools for loading, cleaning, reshaping, and summarizing structured data.

What is Pandas?

Pandas is an open-source data analysis library built on top of NumPy. Its two main data structures are:

Series: a one-dimensional labeled array holding values of a single type
DataFrame: a two-dimensional labeled table whose columns can each hold a different type
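
To make the relationship between the two structures concrete, here is a minimal sketch. The names `ages`, `people`, and `age_column` are illustrative only and do not appear elsewhere in this tutorial.

```python
import pandas as pd

# A Series is a one-dimensional array of values with an index of labels.
ages = pd.Series([25, 30, 35], index=["Alice", "Bob", "Charlie"], name="Age")

# A DataFrame is a table of columns that all share one row index.
people = pd.DataFrame(
    {"Age": [25, 30, 35], "City": ["New York", "London", "Tokyo"]},
    index=["Alice", "Bob", "Charlie"],
)

# Selecting a single column from a DataFrame returns a Series.
age_column = people["Age"]

print(ages["Bob"])       # label-based lookup on a Series
print(type(age_column))  # the column is a pandas Series
print(people.shape)      # (rows, columns)
```

Each column of a DataFrame behaves like a Series, so most operations you learn on one carry over to the other.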

Installation

pip install pandas

Getting Started

Importing Pandas

import pandas as pd
import numpy as np

Creating DataFrames

# From dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Tokyo', 'Paris'],
    'Salary': [70000, 80000, 90000, 75000]
}

df = pd.DataFrame(data)
print(df)

Reading Data from Files

# Read CSV file
df = pd.read_csv('data.csv')

# Read Excel file (.xlsx files need an engine such as openpyxl installed)
df = pd.read_excel('data.xlsx')

# Read JSON file
df = pd.read_json('data.json')
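
The readers accept options that control what gets loaded and how it is typed. The sketch below uses an in-memory string in place of a file so it runs without any data on disk. The CSV contents and the `Joined` column are made up for illustration.

```python
import io

import pandas as pd

# In-memory stand-in for a CSV file on disk (hypothetical contents).
csv_text = """Name,Age,City,Joined
Alice,25,New York,2021-03-01
Bob,30,London,2019-11-15
"""

df_csv = pd.read_csv(
    io.StringIO(csv_text),              # a file path works the same way
    usecols=["Name", "Age", "Joined"],  # load only the columns you need
    parse_dates=["Joined"],             # parse this column as datetimes
)

print(df_csv.dtypes)
```

Parsing dates and trimming columns at load time avoids a separate conversion step later.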

Basic Data Exploration

Viewing Data

# First few rows (a notebook displays the result; in a script, wrap it in print())
df.head()

# Last few rows
df.tail()

# Basic info about the dataset
df.info()

# Statistical summary
df.describe()

# Shape of the data
print(f"Shape: {df.shape}")

Data Selection

# Select a column
ages = df['Age']

# Select multiple columns
subset = df[['Name', 'Age']]

# Select rows by condition
young_people = df[df['Age'] < 30]

# Select with multiple conditions
filtered = df[(df['Age'] > 25) & (df['Salary'] > 70000)]
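
Beyond bracket indexing, pandas offers `.loc`, `.iloc`, `isin()`, and `query()` for selection. The following sketch rebuilds the sample data under the illustrative name `df_people` so that it runs on its own. Each comparison in a combined condition needs its own parentheses because `&` and `|` bind more tightly than `<` and `>`.

```python
import pandas as pd

df_people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 35, 28],
    "City": ["New York", "London", "Tokyo", "Paris"],
    "Salary": [70000, 80000, 90000, 75000],
})

# .loc selects by label or boolean mask and can pick columns at the same time.
over_27 = df_people.loc[df_people["Age"] > 27, ["Name", "Salary"]]

# .iloc selects by integer position: first two rows, first three columns.
top_left = df_people.iloc[:2, :3]

# isin() tests membership; ~ negates a boolean mask.
europe = df_people[df_people["City"].isin(["London", "Paris"])]
not_europe = df_people[~df_people["City"].isin(["London", "Paris"])]

# query() expresses the same kind of filter as a string.
well_paid = df_people.query("Age > 25 and Salary > 70000")

print(over_27)
print(well_paid)
```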

Data Cleaning

Handling Missing Values

# Check for missing values
df.isnull().sum()

# Drop rows with missing values
df_cleaned = df.dropna()

# Fill missing values
df_filled = df.fillna(0)  # Fill with 0
df_filled = df.fillna(df.mean(numeric_only=True))  # Fill numeric columns with their mean
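
The sample data in this tutorial has no gaps, so here is a small sketch with missing values added on purpose. The frame `df_gaps` and the fill choices (median for `Age`, a placeholder string for `City`) are assumptions for illustration, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

df_gaps = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, np.nan, 35, 28],
    "City": ["New York", "London", None, "Paris"],
})

# Count missing values in each column.
missing_per_column = df_gaps.isnull().sum()

# Drop a row only when a specific column is missing.
has_age = df_gaps.dropna(subset=["Age"])

# Fill each column with a value that suits its type.
filled = df_gaps.fillna({"Age": df_gaps["Age"].median(), "City": "Unknown"})

print(missing_per_column)
print(filled)
```

Passing a dictionary to `fillna()` lets numeric and text columns get different replacements in one call.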

Data Types

# Check data types
df.dtypes

# Convert data types
df['Age'] = df['Age'].astype(int)  # raises if the column still contains missing values
df['Date'] = pd.to_datetime(df['Date'])  # assumes the data has a 'Date' column
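
Real data often contains entries that cannot be converted. The sketch below shows one way to handle that, using a made-up frame called `raw`. `errors="coerce"` replaces unparseable entries with missing values, and the nullable `Int64` dtype keeps whole numbers alongside those gaps.

```python
import pandas as pd

raw = pd.DataFrame({
    "Age": ["25", "30", "unknown"],
    "Date": ["2024-01-15", "2024-02-20", "not a date"],
    "City": ["Paris", "Paris", "Tokyo"],
})

# errors="coerce" turns unparseable entries into NaN / NaT instead of raising.
raw["Age"] = pd.to_numeric(raw["Age"], errors="coerce")
raw["Date"] = pd.to_datetime(raw["Date"], errors="coerce")

# The nullable integer dtype keeps whole numbers even with missing values.
raw["Age"] = raw["Age"].astype("Int64")

# The category dtype stores each distinct label once, which can save memory
# on columns with many repeated values.
raw["City"] = raw["City"].astype("category")

print(raw.dtypes)
```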

Data Analysis Operations

Grouping and Aggregation

# Group by city and calculate mean salary
city_stats = df.groupby('City')['Salary'].mean()

# Multiple aggregations
agg_stats = df.groupby('City').agg({
    'Age': 'mean',
    'Salary': ['mean', 'max', 'min']
})
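
Passing a dictionary of lists to `agg()` produces two-level column labels. Named aggregation is an alternative that yields flat, self-describing column names. A minimal sketch, using a made-up `staff` frame with several people per city so the groups are not trivial:

```python
import pandas as pd

staff = pd.DataFrame({
    "City": ["London", "London", "Tokyo", "Tokyo", "Paris"],
    "Age": [30, 40, 35, 25, 28],
    "Salary": [80000, 100000, 90000, 70000, 75000],
})

# Each keyword is an output column: name=(input column, aggregation).
summary = (
    staff.groupby("City")
    .agg(
        mean_age=("Age", "mean"),
        mean_salary=("Salary", "mean"),
        max_salary=("Salary", "max"),
        headcount=("Salary", "size"),
    )
    .reset_index()  # turn the City index back into a regular column
)

print(summary)
```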

Sorting

# Sort by column
df_sorted = df.sort_values('Salary', ascending=False)

# Sort by multiple columns
df_sorted = df.sort_values(['City', 'Age'])
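
When sorting by several columns you can give each its own direction. The sketch below also shows `reset_index()` and `nlargest()`, using a small made-up `staff` frame.

```python
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "City": ["Paris", "London", "Paris", "London"],
    "Salary": [70000, 80000, 90000, 75000],
})

# City ascending, then Salary descending within each city.
ranked = staff.sort_values(["City", "Salary"], ascending=[True, False])

# Sorting keeps the original row labels; reset_index(drop=True) renumbers them.
ranked = ranked.reset_index(drop=True)

# nlargest is a shortcut for "sort descending and take the top n".
top_two = staff.nlargest(2, "Salary")

print(ranked)
print(top_two)
```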

Adding New Columns

# Create new columns based on existing ones
df['Salary_per_Year'] = df['Salary'] * 12  # assumes 'Salary' is a monthly figure
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Senior')
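
The `apply()` call above runs a Python function once per row. For simple rules, a vectorized alternative usually runs faster on large data. This sketch derives the same `Age_Group` labels with `np.where` and adds a three-way split with `pd.cut`. The frame `df_people`, the column `Age_Band`, and the bin edges are illustrative choices.

```python
import numpy as np
import pandas as pd

df_people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 35, 28],
})

# np.where evaluates the condition for the whole column at once.
df_people["Age_Group"] = np.where(df_people["Age"] < 30, "Young", "Senior")

# pd.cut handles more than two bins; right=False makes each bin [low, high).
df_people["Age_Band"] = pd.cut(
    df_people["Age"],
    bins=[0, 30, 40, 120],
    labels=["Under 30", "30-39", "40+"],
    right=False,
)

print(df_people)
```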

Data Visualization with Pandas

import matplotlib.pyplot as plt

# Histogram
df['Age'].hist()
plt.title('Age Distribution')
plt.show()

# Box plot
df.boxplot(column='Salary', by='City')
plt.show()

# Scatter plot
df.plot.scatter(x='Age', y='Salary')
plt.show()

Advanced Operations

Merging DataFrames

# Merge two DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Score': [85, 90, 78]})

merged = pd.merge(df1, df2, on='ID')
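
In the example above both frames contain the same IDs, so the join type makes no difference. When the keys only partly overlap, the `how` parameter decides which rows survive. A minimal sketch with two made-up frames, `students` and `scores`:

```python
import pandas as pd

students = pd.DataFrame({"ID": [1, 2, 3], "Name": ["A", "B", "C"]})
scores = pd.DataFrame({"ID": [2, 3, 4], "Score": [90, 78, 65]})

# Inner join (the default): keep only IDs present in both frames.
inner = pd.merge(students, scores, on="ID")

# Left join: keep every student; a missing score becomes NaN.
left = pd.merge(students, scores, on="ID", how="left")

# Outer join: keep every ID; indicator=True adds a _merge column
# recording which frame each row came from.
outer = pd.merge(students, scores, on="ID", how="outer", indicator=True)

print(inner)
print(left)
print(outer)
```

The `_merge` column is a quick way to check for unmatched keys before relying on a join.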

Pivot Tables

# Create pivot table
pivot = df.pivot_table(
    values='Salary',
    index='City',
    columns='Age_Group',
    aggfunc='mean'
)
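
With one person per city, the pivot above has little to average. The sketch below uses a made-up `staff` frame with repeated cities to show what the result looks like, including the `NaN` that appears for a combination with no rows.

```python
import pandas as pd

staff = pd.DataFrame({
    "City": ["London", "London", "Tokyo", "Tokyo", "Paris"],
    "Age_Group": ["Senior", "Young", "Senior", "Senior", "Young"],
    "Salary": [80000, 60000, 90000, 70000, 75000],
})

# Rows are cities, columns are age groups, cells are mean salaries.
pivot = staff.pivot_table(
    values="Salary",
    index="City",
    columns="Age_Group",
    aggfunc="mean",
)

print(pivot)
```

Paris has no senior staff and Tokyo has no young staff in this data, so those cells are missing rather than zero.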

Best Practices

Always explore your data first with head(), info(), and describe()
Handle missing values appropriately for your use case
Use vectorized operations instead of loops for better performance
Use method chaining to keep multi-step transformations readable
Set appropriate data types to save memory and improve performance
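
As one possible illustration of the chaining and vectorization advice, the sketch below filters, derives a column, sorts, and trims in a single expression. The frame `df_people` and the column `Salary_k` are illustrative names.

```python
import pandas as pd

df_people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 35, 28],
    "Salary": [70000, 80000, 90000, 75000],
})

result = (
    df_people
    .loc[lambda d: d["Age"] > 25]                   # filter rows
    .assign(Salary_k=lambda d: d["Salary"] / 1000)  # vectorized new column
    .sort_values("Salary_k", ascending=False)       # order by the new column
    .reset_index(drop=True)                         # renumber the rows
    .loc[:, ["Name", "Salary_k"]]                   # keep only what is needed
)

print(result)
```

Each step returns a new DataFrame, so the original `df_people` is left unchanged and no intermediate variables are needed.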

Conclusion

Pandas is an essential tool for data analysis in Python. With its intuitive API and powerful features, it makes data manipulation and analysis accessible and efficient. Practice with real datasets to master these concepts!

Happy analyzing! 📊