Table of Contents
- 1. Introduction
- 2. Why Selecting the Correct Chart Is Important
- 3. Charts Guide with Examples
    - Bar Chart
    - Pie Chart
    - Scatter Plot
    - KDE Plot
    - Boxplot
    - Histogram
    - Rainfall Chart
    - CDF
    - ECDF
    - Heatmap

1. Introduction

Choosing the right chart is critical to communicating insights clearly. A poorly chosen chart can confuse the audience, invite misinterpretation, and obscure valuable insights in the data. This article explains why chart selection matters and then walks through common chart types, covering when to use each one, why it works, and how to interpret it.

2. Why Selecting the Correct Chart is Important

Clarity and Comprehension

The primary purpose of a chart is to help the viewer easily understand the data. If the chart is not suited to the type of data being presented, it can cause confusion, making it harder for the audience to draw conclusions. A well-chosen chart simplifies complex data and allows for better insight.

Accurate Data Representation

The right chart ensures that the data is accurately represented. Using the wrong chart type can distort the meaning of the data, leading to incorrect conclusions. For example, a pie chart may not be the best choice to show changes over time, as it does not effectively convey trends.

Improved Decision Making

For business or scientific purposes, decision-making relies heavily on data insights. A poorly selected chart can mislead stakeholders, leading to bad decisions. The correct chart type provides a clear representation of the data, aiding in informed and effective decision-making.

Engagement and Interest

A visually appealing chart can engage the audience better than a raw data table. It can make the data more approachable and easier to understand. If the chart is chosen correctly, it will not only be informative but also visually stimulating, increasing audience interest.

3. Charts Guide with Examples

Importing libraries

import pandas as pd
import numpy as np
import openpyxl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
import plotly.express as px
import plotly.graph_objects as go
import datetime
import missingno as msno
from ydata_profiling import ProfileReport
import great_expectations as ge
import ipywidgets as widgets
from IPython.display import display, HTML

sns.set_style("whitegrid")
pd.options.display.max_colwidth = 20
pd.options.display.max_columns = 50

Read the CSV file

df = pd.read_csv("../1_Data/cleand_df.csv")

1. Bar Chart

When to Use:

To compare categorical data.
To show the distribution of data across categories.

Why:

Easy to interpret.
Effective for showing differences between groups.

Interpretation Steps:

Identify the categories on the x-axis.
Compare the heights of the bars (y-axis) to determine differences.
Look for trends or outliers.

Details:

Use horizontal bars for long category names.
Stacked or grouped bars can show sub-categories.

# multiple_group_and_calculate_percentage is a custom helper whose definition is not shown in this guide
result = multiple_group_and_calculate_percentage(df, "male", "currentSmoker")
# In this dataset male == 0 denotes female and male == 1 denotes male
result['male'] = result['male'].apply(lambda x: 'female' if x == 0 else 'male')
result['currentSmoker'] = result['currentSmoker'].apply(lambda x: 'non-smoker' if x == 0 else 'smoker')
result

	male	currentSmoker	Count	Percentage
0	female	non-smoker	1431	33.75
1	female	smoker	989	23.33
2	male	non-smoker	714	16.84
3	male	smoker	1106	26.08
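The helper used above is not defined in this guide. As a rough illustration of what such a function might do, here is a minimal sketch under stated assumptions: the name `group_and_calculate_percentage`, its signature, and the tiny sample data are all hypothetical, and the real helper may behave differently.

```python
import pandas as pd


def group_and_calculate_percentage(df, *columns):
    """Count rows per combination of `columns` and add each group's share of all rows.

    Hypothetical stand-in for the guide's undefined helper.
    """
    result = df.groupby(list(columns)).size().reset_index(name="Count")
    result["Percentage"] = (result["Count"] / result["Count"].sum() * 100).round(2)
    return result


# Tiny made-up sample, only to show the shape of the output
sample = pd.DataFrame({
    "male":          [0, 0, 0, 1, 1, 1, 1, 0],
    "currentSmoker": [0, 1, 0, 1, 1, 0, 1, 0],
})
summary = group_and_calculate_percentage(sample, "male", "currentSmoker")
print(summary)
```

A table in this shape can be drawn as a grouped bar chart, for example with sns.barplot(data=summary, x="male", y="Percentage", hue="currentSmoker").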

2. Pie Chart

When to Use:

To show proportions or percentages of a whole.
When there are only a few categories (fewer than 6).

Why:

Visually appealing for part-to-whole relationships.
Easy to understand for non-technical audiences.

Interpretation Steps:

Identify the slices and their corresponding categories.
Compare the sizes of the slices to understand proportions.
Ensure the total adds up to 100%.

Details:

Avoid using too many slices; it becomes cluttered.
Use annotations to label percentages.


fig = px.pie(result, values='Percentage', names="male",
             labels={'Percentage': 'Percentage', 'Count': 'Count'},
             custom_data=['Count'],
             hover_data={'currentSmoker': True, 'Count': True, 'Percentage': True},
             color_discrete_sequence=['#008294', '#bdbdbd'], height=400)
fig.update_traces(
                  marker_line_color='black',  # Slice outline color
                  marker_line_width=1,  # Slice outline width
                  opacity=1,
                  hoverinfo='label+percent+value',  # Label, percent, and value on hover
                  texttemplate='%{value:.2f}',  # Text in each slice: the summed Percentage value
                  textinfo='percent', textfont_size=16)  # texttemplate takes precedence over textinfo

fig.update_layout(
    title={'text': 'Share of Participants by Sex', 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},  # Center and adjust title
    title_font_size=20,  # Title font size
    legend_title_text='Sex',  # Legend title
    font=dict(family='Arial', size=14),  # Font family and size for labels
)
fig.show()
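The advice to keep a pie chart to fewer than six slices can be applied before plotting by merging the smallest categories into one slice. The following is a minimal sketch with made-up counts; `lump_small_slices` and the "Other" label are hypothetical names, not part of any library.

```python
import pandas as pd


def lump_small_slices(counts, max_slices=5, other_label="Other"):
    """Keep the largest categories and merge the rest into a single slice."""
    counts = counts.sort_values(ascending=False)
    if len(counts) <= max_slices:
        return counts
    kept = counts.iloc[: max_slices - 1]
    other = pd.Series({other_label: counts.iloc[max_slices - 1:].sum()})
    return pd.concat([kept, other])


# Made-up category counts
counts = pd.Series({"A": 40, "B": 25, "C": 15, "D": 8, "E": 6, "F": 4, "G": 2})
slices = lump_small_slices(counts, max_slices=5)
shares = slices / slices.sum() * 100  # percentages of the whole

print(shares.round(1))
print("total:", shares.sum())
```

Checking that the shares still sum to 100 after lumping is a quick way to confirm that no category was dropped.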

3. Scatter Plot

When to Use:

To show the relationship between two numerical variables.
To identify correlations, trends, or outliers.

Why:

Effective for visualizing patterns in data.
Helps in identifying clusters or gaps.

Interpretation Steps:

Look for patterns (positive, negative, or no correlation).
Identify outliers or unusual points.
Check the density of points in different regions.

Details:

Use color or size to add a third dimension (e.g., another variable).
Add a trendline to highlight relationships.

df: 4240 rows × 16 columns

sns.regplot(data=df, x='age', y='cigsPerDay', scatter_kws={'alpha':0.5}, color="orchid")
plt.title('Cigarettes per Day vs. Age')
plt.xlabel('age')
plt.ylabel('cigsPerDay')
plt.show()

Single Scatter Plot
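A scatter plot's visual impression of "positive, negative, or no correlation" can be backed up with two numbers: the Pearson correlation coefficient and the slope of a least-squares trendline. The sketch below uses synthetic data (not the guide's dataset) and NumPy only.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(30, 70, size=200)            # made-up predictor values
y = 0.5 * x + rng.normal(0, 1, size=200)     # linear relationship plus noise

r = np.corrcoef(x, y)[0, 1]                  # Pearson correlation coefficient
slope, intercept = np.polyfit(x, y, deg=1)   # least-squares straight line

print(f"r = {r:.3f}, trendline: y = {slope:.3f} * x + {intercept:.3f}")
```

A value of r near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or no linear relationship. The trendline that sns.regplot draws is a fit of the same kind.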

# Scatter plots with regression lines: age and cigsPerDay against totChol, sysBP and diaBP

colors = sns.color_palette("rocket", 7)[1:]  # skip the first shade, leaving 6 colors

fig, ax = plt.subplots(2, 3, figsize=(25, 10))
pairs = [(x_col, y_col)
         for x_col in ["age", "cigsPerDay"]
         for y_col in ["totChol", "sysBP", "diaBP"]]

# One panel per (x, y) pair; ax.flat walks the grid row by row
for axis, color, (x_col, y_col) in zip(ax.flat, colors, pairs):
    sns.regplot(data=df, x=x_col, y=y_col, ax=axis, color=color)
    axis.set_xlabel(x_col, fontsize=17, color="k")
    axis.set_ylabel(y_col, fontsize=17, color="k")
    axis.set_title("Correlation between {} and {}".format(x_col, y_col), fontsize=18, color="k", pad=25)

plt.subplots_adjust(hspace=0.6, wspace=0.2);

Multiple Scatter Plots

4. KDE Plot (Kernel Density Estimate)

When to Use:

To visualize the distribution of a continuous variable.
To smooth out histograms for better interpretation.

Why:

Provides a smooth estimate of the data distribution.
Useful for comparing multiple distributions.

Interpretation Steps:

Identify peaks (modes) in the distribution.
Look for skewness (left or right) or symmetry.
Compare multiple KDEs for differences.

Details:

Bandwidth selection affects smoothness.
Overlapping KDEs can show comparisons between groups.
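To make the effect of bandwidth concrete, here is a minimal hand-written Gaussian KDE in NumPy, evaluated on synthetic two-peaked data. The function names and the data are illustrative assumptions. In practice sns.kdeplot does this work and exposes a bw_adjust parameter for scaling the bandwidth.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample; `bandwidth` is each bump's standard deviation."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)


def count_peaks(density):
    """Number of interior grid points that are higher than both neighbours."""
    inner = density[1:-1]
    return int(np.sum((inner > density[:-2]) & (inner > density[2:])))


rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(40, 4, 300), rng.normal(60, 4, 300)])  # two made-up groups
grid = np.linspace(0, 100, 1001)

for bandwidth in (0.5, 3.0, 15.0):
    density = gaussian_kde_1d(values, grid, bandwidth)
    area = density.sum() * (grid[1] - grid[0])  # should stay close to 1
    print(f"bandwidth {bandwidth:>4}: {count_peaks(density)} peak(s), area {area:.3f}")
```

A very small bandwidth follows individual observations and produces a wiggly curve with many spurious peaks, while a very large one can smooth two genuine groups into a single peak.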

5. Boxplot (Box-and-Whisker Plot)

When to Use:

To show the distribution of numerical data.
To identify outliers and compare distributions across groups.

Why:

Summarizes data using quartiles.
Highlights outliers effectively.

Interpretation Steps:

Identify the median (line inside the box).
Check the interquartile range (IQR, the box).
Look for outliers (points outside the whiskers).

Details:

Useful for comparing multiple groups side by side.
Whiskers typically extend to the most extreme data points within 1.5 × IQR of the box edges; points beyond them are drawn as outliers.
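The quantities a boxplot draws can be computed directly, which helps when checking why a point was flagged as an outlier. This is a minimal sketch with made-up values; `box_stats` is a hypothetical helper, and it uses NumPy's default (linear) percentile interpolation, which may differ slightly from other tools.

```python
import numpy as np


def box_stats(values, whis=1.5):
    """Quartiles, IQR, whisker ends and outliers under the 1.5 x IQR convention."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside.min(),    # most extreme point still inside the fences
        "whisker_high": inside.max(),
        "outliers": values[(values < lower_fence) | (values > upper_fence)],
    }


summary = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
for name, value in summary.items():
    print(name, value)
```

Note that the whiskers end at actual data points (here 1 and 9), not at the fences themselves.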

6. Histogram

When to Use:

To show the frequency distribution of a continuous variable.
To identify skewness, peaks, and gaps.

Why:

Simple and effective for understanding data distribution.
Helps in identifying patterns and outliers.

Interpretation Steps:

Identify bins (x-axis) and frequencies (y-axis).
Look for peaks (modes) and gaps.
Check for skewness (left or right).

Details:

Bin size affects interpretation; too small or too large bins can mislead.
Overlay with KDE for smoother interpretation.
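The influence of bin size is easy to see by binning the same values several ways. The sketch below uses synthetic data and np.histogram; the Freedman-Diaconis rule (bins="fd") is shown as one common automatic choice, not as the only reasonable one.

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.normal(50, 9, size=1000)  # made-up values

for bins in (5, 15, 100):
    counts, edges = np.histogram(ages, bins=bins)
    width = edges[1] - edges[0]
    print(f"{bins:>3} bins: width {width:.2f}, tallest bin holds {counts.max()} values")

# Let NumPy pick the bin edges with the Freedman-Diaconis rule
fd_edges = np.histogram_bin_edges(ages, bins="fd")
print("Freedman-Diaconis suggests", len(fd_edges) - 1, "bins")
```

With very few bins the shape is flattened into a handful of blocks, and with very many bins random noise starts to look like structure.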

# Histogram, KDE and boxplot of the same variable, side by side
fig, ax = plt.subplots(1, 3, figsize=(15, 4))
ax[0].hist(x=df.age, bins=15)
sns.kdeplot(x=df.age, ax=ax[1])
sns.boxplot(x=df.age, ax=ax[2]);

# bar_chart is a custom helper whose definition is not shown in this guide
bar_chart(
    df,
    "male",
    "age",
    txt = "Count",
    colors =['green','blue', "orange"] ,
    height=600,
    width=900,
    ttle="title",
    xtitle="X title" ,
    ytitle="Y title" ,
    bg=0.6, 
    bgg=0.1,
    yscale = False,
    yscale_percentage=False,
    group="group",
    leg="legend",
    box=True,
    facetcol="currentSmoker"
)

7. Rainfall Chart

When to Use:

To visualize time-series data with irregular spikes.
Commonly used in weather or financial data.

Why:

Highlights extreme values or events.
Effective for showing variability over time.

Interpretation Steps:

Identify spikes or drops in the data.
Look for patterns or cycles over time.
Compare with other variables (e.g., temperature vs. rainfall).

Details:

Use color gradients to show intensity.
Combine with line charts for better context.
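For the time-series use described above, one simple way to "identify spikes" is to flag values far above the typical level. The sketch below is an assumption-laden illustration: the daily rainfall series is synthetic, the three storm days are injected by hand, and the mean plus three standard deviations threshold is just one possible rule.

```python
import numpy as np

rng = np.random.default_rng(1)
days = np.arange(365)
rainfall = rng.gamma(shape=0.5, scale=4.0, size=365)  # made-up daily totals (mm)
rainfall[[50, 200, 310]] += 60                         # three injected storm days

# Flag days whose rainfall is far above the typical level
threshold = rainfall.mean() + 3 * rainfall.std()
spike_days = days[rainfall > threshold]

print("threshold (mm):", round(float(threshold), 1))
print("spike days:", spike_days.tolist())
```

The series could then be drawn with plt.bar(days, rainfall), with the flagged days highlighted in a different color.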

# Example 1: Strip plot of age by education level
sns.stripplot(x='education', y='age', data=df, jitter=True, alpha=0.5)
plt.title('Age Distribution by Education Level (Strip Plot)')
plt.xticks(rotation=45)
plt.show()

# Example 2: Combined strip and box plot
plt.figure(figsize=(12, 6))
sns.boxplot(x='currentSmoker', y='sysBP', data=df, whis=1.5)
sns.stripplot(x='currentSmoker', y='sysBP', data=df, 
              color='red', alpha=0.3, jitter=0.2, size=4)
plt.title('Systolic BP Distribution by Smoking Status')
plt.xlabel('Current Smoker (0=No, 1=Yes)')
plt.show()

8. CDF (Cumulative Distribution Function)

When to Use:

To show the cumulative probability distribution of a variable.
To compare distributions of different datasets.

Why:

Provides a complete view of the data distribution.
Helps in understanding percentiles and probabilities.

Interpretation Steps:

Identify the x-axis (data values) and y-axis (cumulative probability).
Look for steep slopes (high density) and flat regions (low density).
Compare multiple CDFs for differences.

Details:

The y-axis ranges from 0 to 1 (or 0% to 100%).
Useful for statistical analysis and hypothesis testing.
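Reading probabilities and percentiles off a cumulative curve amounts to two lookups: from a value to its cumulative share, and from a cumulative share back to a value. Here is a minimal sketch with a handful of made-up observations, using the empirical curve as a stand-in for the CDF.

```python
import numpy as np

values = np.array([63, 35, 50, 42, 70, 47, 58, 51])  # made-up observations

x = np.sort(values)
y = np.arange(1, len(x) + 1) / len(x)  # cumulative share at each sorted value

# Value -> probability: share of observations at or below 50
p_le_50 = np.searchsorted(x, 50, side="right") / len(x)

# Probability -> value: smallest observation whose cumulative share reaches 0.5
median_from_cdf = x[np.searchsorted(y, 0.5)]

print("P(X <= 50) =", p_le_50)
print("value at the 50% level =", median_from_cdf)
```

On a plotted CDF, the first lookup means going up from the x-axis to the curve, and the second means going across from the y-axis to the curve.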

9. ECDF (Empirical Cumulative Distribution Function)

When to Use:

Similar to CDF but for empirical (observed) data.
To visualize the distribution of a sample.

Why:

Non-parametric and easy to compute.
Useful for small datasets.

Interpretation Steps:

Identify the step-like pattern of the ECDF.
Compare with theoretical CDFs or other ECDFs.
Look for deviations from expected distributions.

Details:

Each data point contributes a "step" in the ECDF.
Useful for exploratory data analysis (EDA).

# Example 1: CDF of age
def plot_cdf(data, variable, label=None):
    """Plot the empirical CDF of one column, ignoring missing values."""
    x = np.sort(data[variable].dropna())
    y = np.arange(1, len(x) + 1) / len(x)  # share of observations <= x
    plt.plot(x, y, label=label)

# plt.figure(figsize=(8, 6))
plot_cdf(df, 'age', 'Age')
plt.title('Cumulative Distribution Function of Age')
plt.xlabel('Age')
plt.ylabel('Cumulative Probability')
plt.grid(True)
plt.show()

# Example 2: Comparing CDFs by gender
# plt.figure(figsize=(8, 6))
plot_cdf(df[df['male']==0], 'BMI', 'Female')
plot_cdf(df[df['male']==1], 'BMI', 'Male')
plt.title('CDF of BMI by Gender')
plt.xlabel('BMI')
plt.ylabel('Cumulative Probability')
plt.legend()
plt.grid(True)
plt.show()
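The ECDF interpretation step "compare with theoretical CDFs" can be sketched numerically. The example below builds an ECDF from a synthetic sample and measures its largest vertical gap from a normal CDF fitted to the same sample. The data, the choice of a normal distribution, and the helper names are all assumptions for illustration.

```python
import math

import numpy as np


def ecdf(values):
    """Sorted values and the share of observations at or below each one."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


def normal_cdf(x, mean, std):
    """CDF of a normal distribution, evaluated point by point with math.erf."""
    return np.array([0.5 * (1 + math.erf((v - mean) / (std * math.sqrt(2)))) for v in x])


rng = np.random.default_rng(7)
sample = rng.normal(50, 9, size=500)  # made-up sample

x, y = ecdf(sample)
theoretical = normal_cdf(x, sample.mean(), sample.std(ddof=1))
max_gap = np.max(np.abs(y - theoretical))  # largest gap at the sample points

print(f"largest gap between ECDF and fitted normal CDF: {max_gap:.3f}")
```

A small gap suggests the sample is consistent with the assumed distribution, while a large or systematic gap points to skewness, heavy tails, or several groups in the data.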

10. Heatmap

When to Use:

To visualize matrix-like data (e.g., correlation matrices).
To show relationships between two categorical variables.

Why:

Effective for identifying patterns and clusters.
Color gradients make it easy to interpret.

Interpretation Steps:

Identify the axes (categories or variables).
Look for regions at either end of the color scale (high or low values); which end is dark depends on the colormap, so check the color bar.
Check for patterns or clusters.

Details:

Use a color legend to interpret values.
Normalize data for better comparison.

# Check correlations with a heatmap

fig, ax = plt.subplots(figsize=(25, 8))
corr_matrix = df.corr(numeric_only=True)
# Hide the upper triangle (including the diagonal) because the matrix is symmetric
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, annot=True, mask=mask, ax=ax)

ax.set_title('Correlations Heatmap', fontsize=20, fontweight='bold', fontfamily='serif', pad=15);
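Heatmaps are also listed above as a way to relate two categorical variables, and normalization matters there because groups differ in size. The sketch below uses pd.crosstab on a tiny made-up sample. The column names mirror the guide's dataset, but the values are invented.

```python
import pandas as pd

# Made-up sample with two categorical columns
sample = pd.DataFrame({
    "education":     [1, 1, 1, 2, 2, 3, 3, 3, 3, 4],
    "currentSmoker": [0, 1, 1, 0, 0, 1, 0, 1, 1, 0],
})

# Raw counts per combination of categories
counts = pd.crosstab(sample["education"], sample["currentSmoker"])

# Row-normalized shares: each education level sums to 1
row_share = pd.crosstab(sample["education"], sample["currentSmoker"], normalize="index")

print(counts)
print(row_share.round(2))
```

Either table can be passed to sns.heatmap, for example sns.heatmap(row_share, annot=True, fmt=".2f"). The normalized version makes rows comparable even when one group is much larger than another.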

Summary Table of Charts

Chart Type	When to Use	Key Features	Interpretation Tips
Bar Chart	Compare categories	Heights represent values	Compare bar heights
Pie Chart	Show proportions	Slices represent percentages	Ensure total is 100%
Scatter Plot	Show relationships between variables	Points represent data pairs	Look for patterns or outliers
KDE Plot	Visualize distributions	Smooth curve over histogram	Identify peaks and skewness
Boxplot	Show data distribution	Box represents IQR, whiskers show range	Check for outliers and median
Rainfall Chart	Visualize time-series spikes	Spikes represent extreme values	Look for patterns over time
CDF	Show cumulative distribution	Curve from 0 to 1	Compare percentiles
ECDF	Empirical distribution of sample	Step-like curve	Compare with theoretical CDF
Histogram	Frequency distribution	Bars represent frequency	Check for peaks and gaps
Heatmap	Matrix-like data	Color gradients represent values	Look for patterns and clusters

This guide provides a comprehensive overview of various charts, their uses, and interpretation steps. Use the summary table for quick reference!
