Data Analysis With Pandas
Pandas is the cornerstone of data analysis in Python. It provides flexible, labeled data structures and tools for loading, cleaning, and summarizing structured data.
What is Pandas?
Pandas is a fast, powerful, and easy-to-use data analysis library built on top of NumPy. It provides two main data structures:
- Series: One-dimensional labeled array
- DataFrame: Two-dimensional labeled data structure
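To make the relationship between the two structures concrete, here is a minimal sketch. The names and values are illustrative only; the point is that selecting one column of a DataFrame returns a Series.

```python
import pandas as pd

# A Series is a one-dimensional array with an index of labels.
ages = pd.Series([25, 30, 35], index=['Alice', 'Bob', 'Charlie'], name='Age')

# A DataFrame is a table of columns that share one index.
people = pd.DataFrame(
    {
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Tokyo'],
    },
    index=['Alice', 'Bob', 'Charlie'],
)

# Selecting a single column from a DataFrame gives back a Series.
age_column = people['Age']
print(type(age_column).__name__)  # Series
print(ages['Bob'])                # 30
```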
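Installation
Install Pandas from PyPI with pip (pip install pandas) or, in an Anaconda environment, with conda (conda install pandas).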
Getting Started
Importing Pandas
|
import pandas as pd
import numpy as np
|
Creating DataFrames
|
# From dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Tokyo', 'Paris'],
    'Salary': [70000, 80000, 90000, 75000]
}
df = pd.DataFrame(data)
print(df)
|
Reading Data from Files
|
# Read CSV file
df = pd.read_csv('data.csv')
# Read Excel file (needs an Excel engine such as openpyxl installed)
df = pd.read_excel('data.xlsx')
# Read JSON file
df = pd.read_json('data.json')
|
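The readers accept options that control how columns are parsed. The sketch below uses an in-memory string in place of a real file so that it runs anywhere; the CSV content and the 'Joined' column are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file such as 'data.csv'.
csv_text = """Name,Age,City,Joined
Alice,25,New York,2021-03-15
Bob,,London,2020-11-02
"""

df_csv = pd.read_csv(
    io.StringIO(csv_text),     # a file path works here too
    parse_dates=['Joined'],    # parse this column as datetimes
    dtype={'City': 'string'},  # use the dedicated string dtype
)
print(df_csv.dtypes)
```

Bob's empty Age field is read as a missing value, which is why that column ends up as a floating-point type rather than an integer one.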
Basic Data Exploration
Viewing Data
|
# First few rows
df.head()
# Last few rows
df.tail()
# Basic info about the dataset
df.info()
# Statistical summary
df.describe()
# Shape of the data
print(f"Shape: {df.shape}")
|
Data Selection
|
# Select a column
ages = df['Age']
# Select multiple columns
subset = df[['Name', 'Age']]
# Select rows by condition
young_people = df[df['Age'] < 30]
# Select with multiple conditions
filtered = df[(df['Age'] > 25) & (df['Salary'] > 70000)]
|
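Beyond bracket indexing, Pandas offers label-based and position-based selectors. The following sketch rebuilds the sample data from earlier so it runs on its own and shows .loc, .iloc, isin(), and query() side by side.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Tokyo', 'Paris'],
    'Salary': [70000, 80000, 90000, 75000],
})

# .loc selects by label: rows matching a condition, then named columns.
under_30 = df.loc[df['Age'] < 30, ['Name', 'City']]

# .iloc selects by integer position: first two rows, first two columns.
top_left = df.iloc[:2, :2]

# isin() tests membership in a list of values.
europe = df[df['City'].isin(['London', 'Paris'])]

# query() expresses the same kind of filter as a string.
well_paid = df.query('Age > 25 and Salary > 70000')

print(under_30)
print(well_paid['Name'].tolist())  # ['Bob', 'Charlie', 'Diana']
```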
Data Cleaning
Handling Missing Values
|
# Check for missing values
df.isnull().sum()
# Drop rows with missing values
df_cleaned = df.dropna()
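# Fill missing values
df_filled = df.fillna(0)  # Fill with 0
# Fill numeric columns with their mean; numeric_only=True skips text columns,
# which would otherwise raise a TypeError in recent pandas versions
df_filled = df.fillna(df.mean(numeric_only=True))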
|
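Different columns usually call for different treatment. As a sketch, the example below builds a small DataFrame with gaps (hypothetical data) and fills each column with a value that suits it; choosing the median for Age and 'Unknown' for City is an assumption for illustration, not a general rule.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, np.nan, 35, 28],
    'City': ['New York', 'London', None, 'Paris'],
})

# Count missing values per column.
print(df_missing.isna().sum())

# Fill each column with its own replacement value.
df_filled = df_missing.fillna({
    'Age': df_missing['Age'].median(),
    'City': 'Unknown',
})

# Drop rows only when a specific column is missing.
df_known_city = df_missing.dropna(subset=['City'])
```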
Data Types
|
# Check data types
df.dtypes
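# Convert data types
df['Age'] = df['Age'].astype(int)  # raises if 'Age' still contains missing values
# Assumes the dataset has a 'Date' column (the sample DataFrame above does not)
df['Date'] = pd.to_datetime(df['Date'])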
|
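Real data often arrives as text with a few unparseable entries. The sketch below, on hypothetical data, shows one way to convert such columns without raising an error: pd.to_numeric with errors='coerce', the nullable 'Int64' dtype, and the 'category' dtype for repeated labels.

```python
import pandas as pd

raw = pd.DataFrame({
    'Age': ['25', '30', 'unknown', '28'],
    'City': ['Paris', 'London', 'Paris', 'London'],
    'Date': ['2024-01-15', '2024-02-01', '2024-02-20', '2024-03-05'],
})

# errors='coerce' turns unparseable entries into NaN instead of raising.
raw['Age'] = pd.to_numeric(raw['Age'], errors='coerce')

# The nullable integer dtype keeps whole numbers alongside missing values.
raw['Age'] = raw['Age'].astype('Int64')

# Columns with few distinct labels can be stored as a categorical.
raw['City'] = raw['City'].astype('category')

raw['Date'] = pd.to_datetime(raw['Date'])
print(raw.dtypes)
```

On large columns with many repeated values, the categorical dtype often uses noticeably less memory than plain strings.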
Data Analysis Operations
Grouping and Aggregation
|
# Group by city and calculate mean salary
city_stats = df.groupby('City')['Salary'].mean()
# Multiple aggregations
agg_stats = df.groupby('City').agg({
    'Age': 'mean',
    'Salary': ['mean', 'max', 'min']
})
|
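The dictionary form of agg() produces a two-level column index. Named aggregation is an alternative that yields flat, readable column names. The sketch below uses hypothetical data with two people per city so that the group statistics are meaningful.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['London', 'London', 'Paris', 'Paris'],
    'Age': [30, 40, 28, 32],
    'Salary': [80000, 90000, 75000, 85000],
})

# Each keyword is an output column: name=(source column, aggregation).
summary = df.groupby('City').agg(
    mean_age=('Age', 'mean'),
    mean_salary=('Salary', 'mean'),
    max_salary=('Salary', 'max'),
    headcount=('Salary', 'size'),
)
print(summary)
```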
Sorting
|
# Sort by column
df_sorted = df.sort_values('Salary', ascending=False)
# Sort by multiple columns
df_sorted = df.sort_values(['City', 'Age'])
|
Adding New Columns
|
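# Create new column based on existing ones
# Multiplying by 12 only makes sense if 'Salary' holds a monthly figure
df['Salary_per_Year'] = df['Salary'] * 12
# Label each row by applying a function to every value in 'Age'
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Senior')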
|
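apply() with a lambda calls a Python function once per row. For simple conditions, a vectorized alternative is usually faster on large data. The sketch below shows two options, np.where for a two-way split and pd.cut for several bands; the bin edges and labels are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35, 28]})

# np.where evaluates the condition on the whole column at once.
df['Age_Group'] = np.where(df['Age'] < 30, 'Young', 'Senior')

# pd.cut bins values into labeled intervals: (0, 29], (29, 39], (39, 120].
df['Age_Band'] = pd.cut(
    df['Age'],
    bins=[0, 29, 39, 120],
    labels=['Under 30', '30-39', '40+'],
)
print(df)
```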
Data Visualization with Pandas
|
import matplotlib.pyplot as plt
# Simple plots
df['Age'].hist()
plt.title('Age Distribution')
plt.show()
# Box plot
df.boxplot(column='Salary', by='City')
plt.show()
# Scatter plot
df.plot.scatter(x='Age', y='Salary')
plt.show()
|
Advanced Operations
Merging DataFrames
|
# Merge two DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Score': [85, 90, 78]})
merged = pd.merge(df1, df2, on='ID')
|
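When the keys in the two DataFrames do not match exactly, the how parameter decides which rows survive. The sketch below changes the IDs in df2 (an assumption for illustration) so that the difference between inner, left, and outer joins is visible.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Score': [90, 78, 88]})

# Inner join (the default): only IDs present in both frames.
inner = pd.merge(df1, df2, on='ID')

# Left join: every row of df1; Score is NaN where df2 has no match.
left = pd.merge(df1, df2, on='ID', how='left')

# Outer join: all IDs; indicator=True adds a '_merge' column showing
# whether each row came from the left frame, the right frame, or both.
outer = pd.merge(df1, df2, on='ID', how='outer', indicator=True)
print(outer)
```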
Pivot Tables
|
# Create pivot table (uses the 'Age_Group' column added earlier)
pivot = df.pivot_table(
    values='Salary',
    index='City',
    columns='Age_Group',
    aggfunc='mean'
)
|
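Combinations of index and column values that never occur in the data show up as NaN in a pivot table. The sketch below, on hypothetical data, uses fill_value to show 0 in those cells instead.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['London', 'London', 'Paris', 'Paris', 'Tokyo'],
    'Age_Group': ['Young', 'Senior', 'Young', 'Young', 'Senior'],
    'Salary': [70000, 90000, 60000, 80000, 95000],
})

pivot = df.pivot_table(
    values='Salary',
    index='City',
    columns='Age_Group',
    aggfunc='mean',
    fill_value=0,  # shown instead of NaN for empty combinations
)
print(pivot)
```

Whether 0 is an honest placeholder depends on the analysis; for averages, leaving NaN in place may be the less misleading choice.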
Best Practices
- Always explore your data first with head(), info(), and describe()
- Handle missing values appropriately for your use case
- Use vectorized operations instead of loops for better performance
- Chain operations using method chaining for cleaner code
- Set appropriate data types to save memory and improve performance
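As an illustration of method chaining, the sketch below filters, derives a column, sorts, and selects in one expression, without modifying the original DataFrame. The Monthly_Salary column is hypothetical and assumes Salary is an annual figure.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Tokyo', 'Paris'],
    'Salary': [70000, 80000, 90000, 75000],
})

result = (
    df
    .query('Age >= 28')                                 # keep rows aged 28+
    .assign(Monthly_Salary=lambda d: d['Salary'] / 12)  # assumes annual Salary
    .sort_values('Salary', ascending=False)             # highest paid first
    .reset_index(drop=True)                             # renumber rows from 0
    .loc[:, ['Name', 'Monthly_Salary']]                 # keep two columns
)
print(result)
```

Each step returns a new DataFrame, so df itself is left unchanged and the pipeline can be read top to bottom.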
Conclusion
Pandas is an essential tool for data analysis in Python. With its intuitive API and powerful features, it makes data manipulation and analysis accessible and efficient. Practice with real datasets to master these concepts!
Happy analyzing! 📊