Table of Contents
Bar Plot
Pie Plot
Scatter Plot
KDE Plot
Box Plot
Rainfall Plot
CDF Plot
ECDF Plot
Histogram
Heatmap
1. Introduction
Choosing the right chart is critical to communicating insights clearly. A poorly chosen chart can confuse the audience, invite misinterpretation, and obscure valuable insights in the data. This article explains why chart selection matters and then walks through common chart types: when to use each one, why it works, and how to read it.
2. Why Selecting the Correct Chart is Important
Clarity and Comprehension
The primary purpose of a chart is to help the viewer easily understand the data. If the chart is not suited to the type of data being presented, it can cause confusion, making it harder for the audience to draw conclusions. A well-chosen chart simplifies complex data and allows for better insight.
Accurate Data Representation
The right chart ensures that the data is accurately represented. Using the wrong chart type can distort the meaning of the data, leading to incorrect conclusions. For example, a pie chart may not be the best choice to show changes over time, as it does not effectively convey trends.
Improved Decision Making
For business or scientific purposes, decision-making relies heavily on data insights. A poorly selected chart can mislead stakeholders, leading to bad decisions. The correct chart type provides a clear representation of the data, aiding in informed and effective decision-making.
Engagement and Interest
A visually appealing chart can engage the audience better than a raw data table. It can make the data more approachable and easier to understand. If the chart is chosen correctly, it will not only be informative but also visually stimulating, increasing audience interest.
3. Chart Guide with Examples
Importing libraries
import pandas as pd
import numpy as np
import openpyxl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
import plotly.express as px
import plotly.graph_objects as go
import datetime
import missingno as msno
from ydata_profiling import ProfileReport
import great_expectations as ge
import ipywidgets as widgets
from IPython.display import display, HTML
sns.set_style("whitegrid")
pd.options.display.max_colwidth = 20
pd.options.display.max_columns = 50
Read the CSV file
df = pd.read_csv("../1_Data/cleand_df.csv")
1. Bar Chart
When to Use:
To compare categorical data.
To show the distribution of data across categories.
Why:
Easy to interpret.
Effective for showing differences between groups.
Interpretation Steps:
Identify the categories on the x-axis.
Compare the heights of the bars (y-axis) to determine differences.
Look for trends or outliers.
Details:
Use horizontal bars for long category names.
Stacked or grouped bars can show sub-categories.
# multiple_group_and_calculate_percentage is a custom helper; its definition is not shown in this guide.
result = multiple_group_and_calculate_percentage(df, "male", "currentSmoker")
# The male column is coded 1 = male, 0 = female (the same coding the BMI CDF example uses).
result['male'] = result['male'].apply(lambda x: 'male' if x == 1 else 'female')
result['currentSmoker'] = result['currentSmoker'].apply(lambda x: 'non-smoker' if x == 0 else 'smoker')
result
index | male | currentSmoker | Count | Percentage |
0 | female | non-smoker | 1431 | 33.75 |
1 | female | smoker | 989 | 23.33 |
2 | male | non-smoker | 714 | 16.84 |
3 | male | smoker | 1106 | 26.08 |
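The grouping helper used above is not defined in this guide. The following is a minimal sketch of what such a helper might do, assuming it counts the rows for each combination of two columns and expresses each count as a percentage of all rows. The function name `group_count_percentage`, the column names, and the toy data are illustrative assumptions, not the original implementation.

```python
import pandas as pd
import matplotlib.pyplot as plt


def group_count_percentage(data, col_a, col_b):
    """Count rows per (col_a, col_b) pair and add each count's share of all rows."""
    out = data.groupby([col_a, col_b]).size().reset_index(name="Count")
    out["Percentage"] = (out["Count"] / out["Count"].sum() * 100).round(2)
    return out


# Toy data standing in for the real DataFrame (hypothetical values).
toy = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male", "male", "female"],
    "smoker": ["no", "yes", "no", "no", "yes", "yes", "no", "no"],
})
summary = group_count_percentage(toy, "sex", "smoker")
print(summary)

# Grouped bar chart: one group of bars per sex, one bar per smoking status.
wide = summary.pivot(index="sex", columns="smoker", values="Percentage")
ax = wide.plot(kind="bar", rot=0)
ax.set_ylabel("Percentage of all rows")
ax.set_title("Smoking status by sex (toy data)")
```

Pivoting the long summary into a wide table lets pandas draw one bar per sub-category within each group, which is the grouped layout mentioned in the details above.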
2. Pie Chart
When to Use:
To show proportions or percentages of a whole.
When there are only a few categories (fewer than 6).
Why:
Visually appealing for part-to-whole relationships.
Easy to understand for non-technical audiences.
Interpretation Steps:
Identify the slices and their corresponding categories.
Compare the sizes of the slices to understand proportions.
Ensure the total adds up to 100%.
Details:
Avoid using too many slices; it becomes cluttered.
Use annotations to label percentages.
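One way to keep the number of slices small is to merge the smallest categories into a single "Other" slice before plotting. The sketch below assumes a plain series of category counts; the function name `lump_small_slices`, the `max_slices` threshold, and the counts are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt


def lump_small_slices(counts, max_slices=5, other_label="Other"):
    """Keep the largest categories and merge the rest into a single slice."""
    counts = counts.sort_values(ascending=False)
    if len(counts) <= max_slices:
        return counts
    head = counts.iloc[: max_slices - 1]
    other = pd.Series({other_label: counts.iloc[max_slices - 1:].sum()})
    return pd.concat([head, other])


# Hypothetical category counts, for illustration only.
counts = pd.Series({"A": 40, "B": 25, "C": 15, "D": 8, "E": 6, "F": 4, "G": 2})
slices = lump_small_slices(counts, max_slices=5)
shares = slices / slices.sum() * 100  # percentages of the whole

fig, ax = plt.subplots()
ax.pie(shares.to_numpy(), labels=list(shares.index), autopct="%.1f%%")
ax.set_title("Category shares (toy data)")
```

Because the shares are computed from the lumped counts, they still add up to 100%, and `autopct` labels each slice with its percentage.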
fig = px.pie(result, values='Percentage', names='male',
             labels={'Percentage': 'Percentage', 'Count': 'Count'},
             custom_data=['Count'],
             hover_data={'currentSmoker': True, 'Count': True, 'Percentage': True},
             color_discrete_sequence=['#008294', '#bdbdbd'], height=400)
fig.update_traces(
    marker_line_color='black',        # Slice outline color
    marker_line_width=1,              # Slice outline width
    opacity=1,
    hoverinfo='label+percent+value',  # Show label, percent, and value on hover
    texttemplate='%{value:.2f}',      # Print each slice's value with two decimals (takes precedence over textinfo)
    textinfo='percent', textfont_size=16)
fig.update_layout(
    title={'text': 'title', 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},  # Centered title
    title_font_size=20,                  # Title font size
    legend_title_text='Legend',          # Legend title
    font=dict(family='Arial', size=14),  # Font family and size for labels
)
fig.show()
3. Scatter Plot
When to Use:
To show the relationship between two numerical variables.
To identify correlations, trends, or outliers.
Why:
Effective for visualizing patterns in data.
Helps in identifying clusters or gaps.
Interpretation Steps:
Look for patterns (positive, negative, or no correlation).
Identify outliers or unusual points.
Check the density of points in different regions.
Details:
Use color or size to add a third dimension (e.g., another variable).
Add a trendline to highlight relationships.
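The interpretation steps and details above can be sketched with synthetic data: a correlation coefficient quantifies the pattern, a fitted line serves as the trendline, and color encodes a third variable. The variables `x`, `y`, and `z` are made up for illustration and do not come from the dataset used in this guide.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)
y = 2.0 * x + rng.normal(0, 15, 200)  # positively related to x by construction
z = rng.uniform(0, 1, 200)            # a third variable, shown as color

r = np.corrcoef(x, y)[0, 1]             # Pearson correlation coefficient
slope, intercept = np.polyfit(x, y, 1)  # least-squares straight-line fit

fig, ax = plt.subplots()
points = ax.scatter(x, y, c=z, cmap="viridis", alpha=0.7)
xs = np.linspace(x.min(), x.max(), 100)
ax.plot(xs, slope * xs + intercept, color="black", label="trendline")
fig.colorbar(points, ax=ax, label="z")
ax.set_title(f"Pearson r = {r:.2f}")
ax.legend()
```

A value of r near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates no linear relationship (a non-linear pattern may still be present, so the plot is worth checking either way).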
Dataset shape: 4240 rows × 16 columns
sns.regplot(data=df, x='age', y='cigsPerDay', scatter_kws={'alpha':0.5}, color="orchid")
plt.title('title')
plt.xlabel('xlabel')
plt.ylabel('ylabel')
plt.show()
Single Scatter Plot
# Scatter plots of totChol, sysBP and diaBP against age and cigsPerDay
c = sns.color_palette("rocket", 7)[1:8]  # six colors, one per subplot
fig, ax = plt.subplots(2, 3, figsize=(25, 10))
k, j = 0, 0  # current subplot row and column
i = 0        # color index
for col in ["age", "cigsPerDay"]:
    for row in ["totChol", "sysBP", "diaBP"]:
        sns.regplot(data=df, x=col, y=row, ax=ax[k, j], color=c[i])
        i += 1
        ax[k, j].set_xlabel(col, fontsize=17, color="k")
        ax[k, j].set_ylabel(row, fontsize=17, color="k")
        ax[k, j].set_title("Correlation between {} and {}".format(col, row), fontsize=18, color="k", pad=25)
        j += 1
        if j >= 3:  # Move to the next row after 3 columns
            k += 1
            j = 0
plt.subplots_adjust(hspace=0.6, wspace=0.2);
Multiple Scatter Plots
4. KDE Plot (Kernel Density Estimate)
When to Use:
To visualize the distribution of a continuous variable.
To smooth out histograms for better interpretation.
Why:
Provides a smooth estimate of the data distribution.
Useful for comparing multiple distributions.
Interpretation Steps:
Identify peaks (modes) in the distribution.
Look for skewness (left or right) or symmetry.
Compare multiple KDEs for differences.
Details:
Bandwidth selection affects smoothness.
Overlapping KDEs can show comparisons between groups.
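To illustrate how bandwidth affects smoothness, here is a small sketch using `scipy.stats.gaussian_kde` on a synthetic bimodal sample. The sample and the three bandwidth factors are arbitrary choices for demonstration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Synthetic sample with two modes, around 40 and 60.
sample = np.concatenate([rng.normal(40, 4, 300), rng.normal(60, 5, 200)])
grid = np.linspace(sample.min() - 10, sample.max() + 10, 500)

fig, ax = plt.subplots()
ax.hist(sample, bins=30, density=True, alpha=0.3, label="histogram")

densities = {}
for bw in (0.1, 0.3, 1.0):
    kde = gaussian_kde(sample, bw_method=bw)  # a scalar sets the bandwidth factor
    densities[bw] = kde(grid)
    ax.plot(grid, densities[bw], label=f"bw_method={bw}")
ax.set_ylabel("Density")
ax.legend()
```

A small factor follows the sample closely and can show spurious bumps, while a large factor can blur the two modes into one. Comparing a few bandwidths against the histogram is a quick sanity check.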
5. Boxplot (Box-and-Whisker Plot)
When to Use:
To show the distribution of numerical data.
To identify outliers and compare distributions across groups.
Why:
Summarizes data using quartiles.
Highlights outliers effectively.
Interpretation Steps:
Identify the median (line inside the box).
Check the interquartile range (IQR, the box).
Look for outliers (points outside the whiskers).
Details:
Useful for comparing multiple groups side by side.
Whiskers typically extend to the most extreme data points within 1.5 × IQR of the box edges; points beyond them are drawn as outliers.
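The quantities a box plot displays can be computed directly. The sketch below uses a small made-up array and the common 1.5 × IQR rule; other whisker conventions exist.

```python
import numpy as np

# Small made-up sample with one unusually large value.
values = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 30], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations still inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual outlier point.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

Note that the whiskers stop at actual observations, not at the fences themselves, which is why the two whiskers of a box plot are often of unequal length.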
6. Histogram
When to Use:
To show the frequency distribution of a continuous variable.
To identify skewness, peaks, and gaps.
Why:
Simple and effective for understanding data distribution.
Helps in identifying patterns and outliers.
Interpretation Steps:
Identify bins (x-axis) and frequencies (y-axis).
Look for peaks (modes) and gaps.
Check for skewness (left or right).
Details:
Bin size affects interpretation; too small or too large bins can mislead.
Overlay with KDE for smoother interpretation.
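The effect of bin size is easy to see by plotting the same sample with different bin counts. This sketch uses a synthetic normal sample; the bin counts 3, 15, and 150 are arbitrary, and `bins="auto"` is NumPy's rule-based choice.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
sample = rng.normal(50, 8, 500)  # synthetic data for illustration

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, (3, 15, 150)):
    ax.hist(sample, bins=bins)
    ax.set_title(f"{bins} bins")

# Let NumPy pick bin edges from the data instead of guessing a count.
auto_edges = np.histogram_bin_edges(sample, bins="auto")
auto_counts, _ = np.histogram(sample, bins=auto_edges)
print(f"'auto' chose {len(auto_edges) - 1} bins")
```

With too few bins the shape of the distribution is hidden; with too many, random noise looks like real peaks and gaps.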
fig,ax = plt.subplots(1,3,figsize=(15,4))
ax[0].hist(x=df.age, bins=15)
sns.boxplot(x=df.age, ax=ax[2])
sns.kdeplot(x=df.age, ax=ax[1]);
# bar_chart is a custom plotting helper; its definition is not shown in this guide.
bar_chart(
    df,
    "male",
    "age",
    txt="Count",
    colors=['green', 'blue', 'orange'],
    height=600,
    width=900,
    ttle="title",
    xtitle="X title",
    ytitle="Y title",
    bg=0.6,
    bgg=0.1,
    yscale=False,
    yscale_percentage=False,
    group="group",
    leg="legend",
    box=True,
    facetcol="currentSmoker"
)
7. Rainfall Chart
When to Use:
To visualize time-series data with irregular spikes.
Commonly used in weather or financial data.
Why:
Highlights extreme values or events.
Effective for showing variability over time.
Interpretation Steps:
Identify spikes or drops in the data.
Look for patterns or cycles over time.
Compare with other variables (e.g., temperature vs. rainfall).
Details:
Use color gradients to show intensity.
Combine with line charts for better context.
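As a rough sketch of the time-series reading described above, the example below draws synthetic daily rainfall as bars shaded by intensity and overlays a 7-day rolling mean as a line. The dates and values are randomly generated and purely illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
days = pd.date_range("2024-01-01", periods=90, freq="D")
# Mostly small values with occasional heavy spikes (synthetic, in mm).
rain = pd.Series(rng.gamma(shape=0.4, scale=10.0, size=len(days)), index=days)
rolling = rain.rolling(window=7, min_periods=1).mean()

# Color gradient: heavier days get a darker shade of blue.
intensity = (rain / rain.max()).to_numpy()
colors = plt.cm.Blues(0.3 + 0.7 * intensity)

fig, ax = plt.subplots(figsize=(10, 3))
ax.bar(rain.index, rain.to_numpy(), color=colors, width=1.0)
ax.plot(rolling.index, rolling.to_numpy(), color="black", label="7-day mean")
ax.set_ylabel("Rainfall (mm)")
ax.legend()
```

The bars make individual spikes stand out, while the rolling mean gives the context needed to see longer wet and dry spells.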
# Example 1: Strip plot of age by education level
sns.stripplot(x='education', y='age', data=df, jitter=True, alpha=0.5)
plt.title('Age Distribution by Education Level (Strip Plot)')
plt.xticks(rotation=45)
plt.show()

# Example 2: Combined strip and box plot
plt.figure(figsize=(12, 6))
sns.boxplot(x='currentSmoker', y='sysBP', data=df, whis=1.5)
sns.stripplot(x='currentSmoker', y='sysBP', data=df,
              color='red', alpha=0.3, jitter=0.2, size=4)
plt.title('Systolic BP Distribution by Smoking Status')
plt.xlabel('Current Smoker (0=No, 1=Yes)')
plt.show()
8. CDF (Cumulative Distribution Function)
When to Use:
To show the cumulative probability distribution of a variable.
To compare distributions of different datasets.
Why:
Provides a complete view of the data distribution.
Helps in understanding percentiles and probabilities.
Interpretation Steps:
Identify the x-axis (data values) and y-axis (cumulative probability).
Look for steep slopes (high density) and flat regions (low density).
Compare multiple CDFs for differences.
Details:
The y-axis ranges from 0 to 1 (or 0% to 100%).
Useful for statistical analysis and hypothesis testing.
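Reading probabilities and percentiles off a cumulative curve can be sketched numerically. The example below builds the cumulative curve from a synthetic sample, so the thresholds (60 and the 90th percentile) are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(4)
sample = rng.normal(50, 10, 1000)  # synthetic data for illustration

x = np.sort(sample)
y = np.arange(1, len(x) + 1) / len(x)  # cumulative probability at each sorted value

# Probability question: what fraction of observations is at or below 60?
p_below_60 = np.searchsorted(x, 60, side="right") / len(x)

# Percentile question: smallest observed value whose cumulative probability reaches 0.9.
p90 = x[np.searchsorted(y, 0.9, side="left")]

print(f"P(X <= 60) is about {p_below_60:.3f}")
print(f"90th percentile is about {p90:.1f}")
```

The first lookup goes from the x-axis to the y-axis (value to probability); the second goes from the y-axis to the x-axis (probability to value).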
9. ECDF (Empirical Cumulative Distribution Function)
When to Use:
Similar to CDF but for empirical (observed) data.
To visualize the distribution of a sample.
Why:
Non-parametric and easy to compute.
Useful for small datasets.
Interpretation Steps:
Identify the step-like pattern of the ECDF.
Compare with theoretical CDFs or other ECDFs.
Look for deviations from expected distributions.
Details:
Each data point contributes a "step" in the ECDF.
Useful for exploratory data analysis (EDA).
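Comparing an ECDF with a theoretical CDF can be sketched as follows. The sample is drawn from a standard normal distribution, so the theoretical curve is known here by construction; with real data the reference distribution is an assumption to be checked. The largest vertical gap between the two curves is the Kolmogorov-Smirnov distance.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(5)
sample = rng.normal(0, 1, 200)  # synthetic data for illustration

x = np.sort(sample)
n = len(x)
ecdf_after = np.arange(1, n + 1) / n  # ECDF height just after each step
ecdf_before = np.arange(0, n) / n     # ECDF height just before each step
theory = norm.cdf(x, loc=0, scale=1)

# Largest vertical gap between the step function and the smooth curve.
ks_distance = max(np.max(ecdf_after - theory), np.max(theory - ecdf_before))

fig, ax = plt.subplots()
ax.step(x, ecdf_after, where="post", label="ECDF")
ax.plot(x, theory, label="Normal CDF")
ax.set_title(f"Max gap = {ks_distance:.3f}")
ax.legend()
```

The gap is checked on both sides of each step because the ECDF jumps at every observation, and the largest deviation can occur just before or just after a jump.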
# Example 1: CDF of age
def plot_cdf(data, variable, label=None):
    """Plot the empirical CDF of data[variable], ignoring missing values."""
    x = np.sort(data[variable].dropna())
    y = np.arange(1, len(x) + 1) / len(x)
    plt.plot(x, y, label=label)

plot_cdf(df, 'age', 'Age')
plt.title('Cumulative Distribution Function of Age')
plt.xlabel('Age')
plt.ylabel('Cumulative Probability')
plt.grid(True)
plt.show()

# Example 2: Comparing CDFs by gender
plot_cdf(df[df['male']==0], 'BMI', 'Female')
plot_cdf(df[df['male']==1], 'BMI', 'Male')
plt.title('CDF of BMI by Gender')
plt.xlabel('BMI')
plt.ylabel('Cumulative Probability')
plt.legend()
plt.grid(True)
plt.show()
10. Heatmap
When to Use:
To visualize matrix-like data (e.g., correlation matrices).
To show relationships between two categorical variables.
Why:
Effective for identifying patterns and clusters.
Color gradients make it easy to interpret.
Interpretation Steps:
Identify the axes (categories or variables).
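Look for regions of high and low values, using the color bar to see which colors mark which; the direction depends on the colormap (in seaborn's default, lighter cells are higher values).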
Check for patterns or clusters.
Details:
Use a color legend to interpret values.
Normalize data for better comparison.
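For two categorical variables, normalization often means converting counts to within-group shares so that groups of different sizes can be compared. The sketch below uses `pd.crosstab` with `normalize="index"` on made-up data; the column names `group` and `outcome` are hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up categorical data: group "a" has 6 rows, group "b" has 4.
toy = pd.DataFrame({
    "group": ["a"] * 6 + ["b"] * 4,
    "outcome": ["yes", "yes", "no", "no", "no", "no", "yes", "yes", "yes", "no"],
})

counts = pd.crosstab(toy["group"], toy["outcome"])                        # raw counts
row_share = pd.crosstab(toy["group"], toy["outcome"], normalize="index")  # each row sums to 1

fig, ax = plt.subplots()
im = ax.imshow(row_share.to_numpy(), cmap="viridis", vmin=0, vmax=1)
ax.set_xticks(range(row_share.shape[1]))
ax.set_xticklabels(row_share.columns)
ax.set_yticks(range(row_share.shape[0]))
ax.set_yticklabels(row_share.index)
fig.colorbar(im, ax=ax, label="Share within group")
```

With raw counts, the larger group would dominate the color scale; with row shares, the colors reflect how outcomes are split inside each group.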
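# Check correlations with a heatmap
fig, ax = plt.subplots(figsize=(25, 8))
corr_matrix = df.corr(numeric_only=True)  # numeric_only requires pandas >= 1.5
# Hide the upper triangle (including the diagonal), which mirrors the lower triangle
mask = np.zeros_like(corr_matrix, dtype=bool)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr_matrix, annot=True, mask=mask, ax=ax)
ax.set_title('Correlations Heatmap', fontsize=20, fontweight='bold', fontfamily='serif', pad=15);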
Summary Table of Charts
Chart Type | When to Use | Key Features | Interpretation Tips |
Bar Chart | Compare categories | Heights represent values | Compare bar heights |
Pie Chart | Show proportions | Slices represent percentages | Ensure total is 100% |
Scatter Plot | Show relationships between variables | Points represent data pairs | Look for patterns or outliers |
KDE Plot | Visualize distributions | Smooth curve over histogram | Identify peaks and skewness |
Boxplot | Show data distribution | Box represents IQR, whiskers show range | Check for outliers and median |
Rainfall Chart | Visualize time-series spikes | Spikes represent extreme values | Look for patterns over time |
CDF | Show cumulative distribution | Curve from 0 to 1 | Compare percentiles |
ECDF | Empirical distribution of sample | Step-like curve | Compare with theoretical CDF |
Histogram | Frequency distribution | Bars represent frequency | Check for peaks and gaps |
Heatmap | Matrix-like data | Color gradients represent values | Look for patterns and clusters |
This guide provides a comprehensive overview of various charts, their uses, and interpretation steps. Use the summary table for quick reference!