Lesson 15
Real-World Data Visualization Projects with Matplotlib and Pandas
Objective
By the end of this lesson, students will be able to load data from external files (CSV, Excel) using Pandas, visualize the data using Matplotlib, and create practical data visualizations.
1. Introduction to data visualization in the real world:
Real-world data often comes in formats such as CSV or Excel files. Pandas is a Python library for data manipulation and analysis, and Matplotlib is a widely used plotting library for visualizing that data. In this lesson, we cover how to load real-world data from files and create visualizations to analyze trends, patterns, and distributions.
2. Loading data from CSV files using Pandas:
CSV (Comma-Separated Values) is one of the most commonly used formats for storing data. The pandas.read_csv() function can be used to load CSV files into a DataFrame, which makes it easy to analyze and visualize the data.
Example: Loading a CSV file
import pandas as pd

# Load data from a CSV file
data = pd.read_csv('example_data.csv')

# Display the first few rows of the data
print(data.head())
- pd.read_csv('file.csv') : Loads the data from the specified CSV file into a Pandas DataFrame.
- data.head() : Returns the first 5 rows of the dataset, which is useful for inspecting its structure.
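Beyond head(), it often helps to check how many rows and columns were loaded and which data type Pandas inferred for each column. The following is a minimal sketch of that inspection step. It assumes a small in-memory CSV (built with io.StringIO) in place of 'example_data.csv', and the column names Date, Product, and Sales are hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of 'example_data.csv'
csv_text = """Date,Product,Sales
2024-01-01,Widget,120
2024-01-02,Gadget,95
2024-01-03,Widget,130
"""

# read_csv accepts a file path or any file-like object
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())   # first rows of the DataFrame (up to 5)
print(data.shape)    # (number of rows, number of columns)
print(data.dtypes)   # data type inferred for each column
```

With a real file, replace io.StringIO(csv_text) with the file path, as in the example above.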
3. Visualizing data with Matplotlib:
Once the data is loaded into a DataFrame, we can use Matplotlib to visualize it. Here’s how to create some common plots using real-world data.
Example: Line plot of time series data
Let's assume you have a CSV file with sales data, and you want to visualize the trend over time.
import matplotlib.pyplot as plt
import pandas as pd

# Load the data
data = pd.read_csv('sales_data.csv')

# Plot the sales data over time
plt.plot(data['Date'], data['Sales'], label='Sales')

# Add labels and title
plt.xlabel('Date')
plt.ylabel('Sales')
plt.title('Sales Over Time')
plt.legend()

# Show the plot
plt.show()
- Purpose : To observe trends or fluctuations over time (e.g., sales, stock prices).
- data['Date'] and data['Sales'] : Access the columns from the DataFrame for plotting.
- plt.plot() : Creates a line plot.
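When dates are read as plain text, Matplotlib treats each one as a separate category and plots the rows in file order. One possible refinement, sketched here under assumptions, is to ask read_csv to parse the Date column and to sort by date before plotting. The in-memory CSV and its values are hypothetical stand-ins for 'sales_data.csv', and the non-interactive Agg backend is used only so the sketch runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for 'sales_data.csv', deliberately out of order
csv_text = """Date,Sales
2024-03-01,150
2024-01-01,100
2024-02-01,125
"""

# parse_dates converts the Date column to datetime values while loading
data = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
data = data.sort_values("Date")  # make sure the line runs left to right in time

fig, ax = plt.subplots()
ax.plot(data["Date"], data["Sales"], marker="o", label="Sales")
ax.set_xlabel("Date")
ax.set_ylabel("Sales")
ax.set_title("Sales Over Time")
ax.legend()
fig.autofmt_xdate()  # rotate the date labels so they do not overlap

# Write the chart to a PNG file in the current directory
fig.savefig("sales_over_time.png")
```

In an interactive session, dropping the Agg line and calling plt.show() instead of savefig displays the same figure in a window.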
4. Loading and visualizing Excel data:
Excel is another common format for storing data. Pandas can load Excel files using the pandas.read_excel() function. Reading .xlsx files requires the openpyxl package to be installed alongside Pandas.
Example: Loading data from an Excel file
import pandas as pd

# Load data from an Excel file
data = pd.read_excel('example_data.xlsx')

# Display the first few rows of the data
print(data.head())
Once the data is loaded, it can be visualized similarly to CSV data.
Example: Bar chart for category comparison
Suppose we have sales data by product category in an Excel sheet and want to create a bar chart.
import matplotlib.pyplot as plt
import pandas as pd

# Load the data
data = pd.read_excel('sales_by_category.xlsx')

# Plot a bar chart comparing sales by category
plt.bar(data['Category'], data['Sales'])

# Add labels and title
plt.xlabel('Category')
plt.ylabel('Sales')
plt.title('Sales by Category')

# Show the plot
plt.show()
- Purpose : Bar charts are ideal for comparing quantities across different categories.
- plt.bar() : Creates a vertical bar chart.
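The bar chart above works when the sheet already holds one row per category. If the sheet instead holds one row per transaction, the rows can be totaled with groupby before plotting. This is a minimal sketch under that assumption: the DataFrame is built in memory to stand in for what pd.read_excel() might return, and its values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical transaction-level rows, standing in for pd.read_excel(...)
data = pd.DataFrame({
    "Category": ["Toys", "Books", "Toys", "Games", "Books"],
    "Sales": [200, 150, 100, 350, 50],
})

# Total the sales per category, largest first
totals = data.groupby("Category")["Sales"].sum().sort_values(ascending=False)

fig, ax = plt.subplots()
ax.bar(list(totals.index), totals.values)
ax.set_xlabel("Category")
ax.set_ylabel("Sales")
ax.set_title("Sales by Category")

# Write the chart to a PNG file in the current directory
fig.savefig("sales_by_category.png")
```

Sorting the totals before plotting puts the tallest bar first, which usually makes the comparison easier to read.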
5. Data cleaning and preparation:
Before visualizing data, it often needs to be cleaned or prepared. For instance, there may be missing values or irrelevant columns. Pandas provides tools for handling these issues.
Example: Handling missing data
# Check for missing values
print(data.isnull().sum())

# Fill missing values with the mean (assign the result back to the column)
data['Sales'] = data['Sales'].fillna(data['Sales'].mean())
- data.isnull().sum() : Counts the missing values in each column of the dataset.
- fillna() : Replaces missing values with the mean, median, or another value.
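Filling with the mean is only one option. The sketch below compares two common alternatives on a small hypothetical DataFrame: filling a numeric column with its median, which is less sensitive to outliers, and dropping incomplete rows with dropna(). The column names and values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
data = pd.DataFrame({
    "Region": ["North", "South", None, "East"],
    "Sales": [100.0, np.nan, 300.0, 200.0],
})

# Count missing values per column
missing_counts = data.isnull().sum()
print(missing_counts)

# Option 1: fill gaps in a numeric column with the column median
filled = data.copy()
filled["Sales"] = filled["Sales"].fillna(filled["Sales"].median())

# Option 2: drop every row that has at least one missing value
dropped = data.dropna()

print(filled)
print(dropped)
```

Which option is appropriate depends on the data: filling keeps every row but invents a value, while dropping keeps only observed values but shrinks the dataset.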
6. Advanced visualization: Scatter plot:
Scatter plots are useful for identifying relationships or correlations between two variables. Let's create a scatter plot using real-world data from a CSV file.
Example: Scatter plot of sales vs. advertising
import matplotlib.pyplot as plt
import pandas as pd

# Load the data
data = pd.read_csv('sales_advertising_data.csv')

# Create a scatter plot
plt.scatter(data['Advertising'], data['Sales'])

# Add labels and title
plt.xlabel('Advertising Budget')
plt.ylabel('Sales')
plt.title('Sales vs. Advertising Budget')

# Show the plot
plt.show()
- Purpose : Scatter plots help visualize relationships between two continuous variables (e.g., sales and advertising budget).
- plt.scatter() : Creates a scatter plot.
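A scatter plot shows a relationship visually, and a correlation coefficient can put a number on it. As a rough sketch, the Pandas Series.corr() method computes the Pearson correlation between two columns by default. The values below are hypothetical and stand in for 'sales_advertising_data.csv'.

```python
import pandas as pd

# Hypothetical advertising and sales figures
data = pd.DataFrame({
    "Advertising": [10, 20, 30, 40, 50],
    "Sales": [25, 41, 68, 80, 110],
})

# Pearson correlation: +1 is a perfect increasing linear relationship,
# -1 a perfect decreasing one, and 0 means no linear relationship
r = data["Advertising"].corr(data["Sales"])
print(f"Correlation between advertising and sales: {r:.2f}")
```

A value close to 1 here would match a scatter plot whose points rise along a nearly straight line. A strong correlation on its own does not show that advertising causes the sales.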
7. Real-world project: Visualizing global temperature trends:
Let’s work on a real-world project where we visualize global temperature trends over time. The dataset is in CSV format and includes year-wise global average temperatures.
Step-by-step Example:
# Step 1: Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt

# Step 2: Load the global temperature dataset
temperature_data = pd.read_csv('global_temperature.csv')

# Step 3: Plot global temperature over the years
plt.plot(temperature_data['Year'], temperature_data['Temperature'], color='r')

# Step 4: Add labels and title
plt.xlabel('Year')
plt.ylabel('Global Average Temperature (°C)')
plt.title('Global Temperature Trends Over Time')

# Step 5: Show the plot
plt.show()
- Purpose : Analyze climate change and temperature trends over decades.
- This example shows how real-world datasets can be visualized to understand significant global issues.
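Yearly values tend to fluctuate, which can hide the longer-term trend. One possible extension of the project, not part of the steps above, is to overlay a rolling average on the raw series. The sketch below uses made-up placeholder numbers, not real temperature measurements, and a 3-year window chosen only to fit the tiny sample. With a real dataset a longer window such as 10 years may be more suitable.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Made-up placeholder values, NOT real measurements
temperature_data = pd.DataFrame({
    "Year": [2000, 2001, 2002, 2003, 2004, 2005],
    "Temperature": [14.0, 14.3, 14.1, 14.4, 14.2, 14.5],
})

# Centered 3-year rolling mean; the first and last years have no full window
temperature_data["Smoothed"] = (
    temperature_data["Temperature"].rolling(window=3, center=True).mean()
)

fig, ax = plt.subplots()
ax.plot(temperature_data["Year"], temperature_data["Temperature"],
        color="r", alpha=0.4, label="Yearly average")
ax.plot(temperature_data["Year"], temperature_data["Smoothed"],
        color="r", label="3-year rolling mean")
ax.set_xlabel("Year")
ax.set_ylabel("Global Average Temperature (°C)")
ax.set_title("Global Temperature Trends Over Time")
ax.legend()

# Write the chart to a PNG file in the current directory
fig.savefig("global_temperature_trend.png")
```

The smoothed line starts and ends one year inside the raw series because a centered window needs a neighbor on each side.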
8. Exercises:
Exercise: 1
1. Load a CSV file containing daily stock prices, and create a line plot of stock price vs. date.
2. Load data from an Excel file containing product sales by category, and create a bar chart.
Exercise: 2
Load a dataset that contains advertising budget and sales, and create a scatter plot to identify any correlation.
Exercise: 3
Load a dataset with missing values and fill them using an appropriate method. Then visualize the cleaned data.
Exercise: 4
Load a CSV file containing COVID-19 case data by country and visualize the trend of confirmed cases over time for a selected country.
Conclusion
In this lesson, we explored how to load real-world data from CSV and Excel files using Pandas and visualize it using Matplotlib. We covered various types of plots such as line plots, bar charts, and scatter plots. Additionally, we touched upon data cleaning and preparation to ensure the data is ready for analysis. The skills gained in this lesson are essential for performing real-world data analysis and creating meaningful visualizations.