Charts Selection Guide

1. Introduction

  • Choosing the right chart is critical to communicating insights clearly. A poorly chosen chart can confuse the audience, invite misinterpretation, and obscure valuable insights in the data. This article explains why selecting the right chart matters and outlines the steps for choosing the correct chart for your data visualization needs.

2. Why Selecting the Correct Chart is Important

  • Clarity and Comprehension

The primary purpose of a chart is to help the viewer easily understand the data. If the chart is not suited to the type of data being presented, it can cause confusion, making it harder for the audience to draw conclusions. A well-chosen chart simplifies complex data and allows for better insight.

  • Accurate Data Representation

The right chart ensures that the data is accurately represented. Using the wrong chart type can distort the meaning of the data, leading to incorrect conclusions. For example, a pie chart may not be the best choice to show changes over time, as it does not effectively convey trends.

  • Improved Decision Making

For business or scientific purposes, decision-making relies heavily on data insights. A poorly selected chart can mislead stakeholders, leading to bad decisions. The correct chart type provides a clear representation of the data, aiding in informed and effective decision-making.

  • Engagement and Interest

A visually appealing chart can engage the audience better than a raw data table. It can make the data more approachable and easier to understand. If the chart is chosen correctly, it will not only be informative but also visually stimulating, increasing audience interest.

3. Chart Guide with Examples

Import the libraries

import pandas as pd 
import numpy as np
import openpyxl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
import plotly.express as px
import plotly.graph_objects as go
import datetime
import missingno as msno
from ydata_profiling import ProfileReport 
import great_expectations as ge
import ipywidgets as widgets
from IPython.display import display, HTML

sns.set_style("whitegrid")
pd.options.display.max_colwidth = 20
pd.options.display.max_columns = 50

Read the CSV file

df = pd.read_csv("../1_Data/cleand_df.csv")

1. Bar Chart

When to Use:

  • To compare categorical data.

  • To show the distribution of data across categories.

Why:

  • Easy to interpret.

  • Effective for showing differences between groups.

Interpretation Steps:

  1. Identify the categories on the x-axis.

  2. Compare the heights of the bars (y-axis) to determine differences.

  3. Look for trends or outliers.

Details:

  • Use horizontal bars for long category names.

  • Stacked or grouped bars can show sub-categories.

# multiple_group_and_calculate_percentage is a project helper that is not defined in this guide.
result = multiple_group_and_calculate_percentage(df, "male", "currentSmoker")
# In the "male" column, 1 means male and 0 means female.
result['male'] = result['male'].apply(lambda x: 'female' if x == 0 else 'male')
result['currentSmoker'] = result['currentSmoker'].apply(lambda x: 'non-smoker' if x == 0 else 'smoker')
result
     male currentSmoker  Count  Percentage
0  female    non-smoker   1431       33.75
1  female        smoker    989       23.33
2    male    non-smoker    714       16.84
3    male        smoker   1106       26.08
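
The helper `multiple_group_and_calculate_percentage` is not defined in this guide. As a rough stand-in, the sketch below assumes it counts rows for each pair of categories and reports every count as a share of all rows. The function name `group_counts_with_percentage` and the tiny sample are made up for illustration; the grouped bar chart at the end is one possible way to draw the result.

```python
import pandas as pd
import matplotlib.pyplot as plt


def group_counts_with_percentage(frame, first, second):
    """Count rows per (first, second) pair and add each count's share of all rows."""
    out = frame.groupby([first, second]).size().reset_index(name="Count")
    out["Percentage"] = (out["Count"] / out["Count"].sum() * 100).round(2)
    return out


# Tiny made-up sample that reuses the column names of the guide's dataset.
sample = pd.DataFrame({
    "male":          [0, 0, 0, 1, 1, 1, 1, 0],
    "currentSmoker": [0, 1, 0, 1, 1, 0, 1, 0],
})
summary = group_counts_with_percentage(sample, "male", "currentSmoker")

# Grouped bars: one group per value of "male", one bar per smoking status.
pivot = summary.pivot(index="male", columns="currentSmoker", values="Count")
ax = pivot.plot(kind="bar", rot=0)
ax.set_xlabel("male (0 = female, 1 = male)")
ax.set_ylabel("Count")
ax.set_title("Smoking status by sex (made-up sample)")
```

Grouped bars suit this case because the comparison of interest is between sub-categories within each group; stacked bars would be the alternative when the group totals matter more.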

2. Pie Chart

When to Use:

  • To show proportions or percentages of a whole.

  • When there are only a few categories (fewer than six).

Why:

  • Visually appealing for part-to-whole relationships.

  • Easy to understand for non-technical audiences.

Interpretation Steps:

  1. Identify the slices and their corresponding categories.

  2. Compare the sizes of the slices to understand proportions.

  3. Ensure the total adds up to 100%.

Details:

  • Avoid using too many slices; it becomes cluttered.

  • Use annotations to label percentages.

fig = px.pie(result, values='Percentage', names="male",
             labels = {'Percentage': 'Percentage' ,'Count':'Count'},
             custom_data=['Count'], 
             hover_data={'currentSmoker': True,'Count':True, 'Percentage':True},
             color_discrete_sequence=['#008294','#bdbdbd'],height=400)
fig.update_traces(
                  marker_line_color='black',  # Slice outline color
                  marker_line_width=1,  # Slice outline width
                  opacity=1,
                  hoverinfo='label+percent+value',  # Display label, percent, and value on hover
                  texttemplate='%{value:.2f}%',  # Label each slice with its value (already a percentage)
                  textfont_size=16)  # texttemplate overrides textinfo, so textinfo is not set

fig.update_layout(
    title={'text': 'title', 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},  # Center and adjust title
    title_font_size=20,  # Title font size
    legend_title_text='Legend',  # Legend title
    font=dict(family='Arial', size=14),  # Font family and size for labels
)
fig.show()
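
The advice to keep a pie chart to a handful of slices can be applied before plotting. The sketch below is one possible approach, not part of the original notebook: it keeps the largest categories, merges the rest into a single slice, and checks that the shares still add up to 100%. The function name `lump_small_slices`, the `Other` label, and the counts are made up for illustration.

```python
import pandas as pd


def lump_small_slices(counts, max_slices=5, other_label="Other"):
    """Keep the largest categories and merge the remainder into one slice."""
    counts = counts.sort_values(ascending=False)
    if len(counts) <= max_slices:
        return counts
    head = counts.iloc[: max_slices - 1]            # largest categories, kept as-is
    tail_total = counts.iloc[max_slices - 1:].sum()  # everything else, merged
    return pd.concat([head, pd.Series({other_label: tail_total})])


# Made-up category counts with seven categories, too many for a readable pie.
counts = pd.Series({"A": 40, "B": 25, "C": 15, "D": 8, "E": 6, "F": 4, "G": 2})
slices = lump_small_slices(counts)
shares = slices / slices.sum() * 100  # percentages to label the slices with
print(shares.round(1))
```

The resulting `slices` series can be passed to `px.pie` through `values=` and `names=` in the same way as the `result` table.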

3. Scatter Plot

When to Use:

  • To show the relationship between two numerical variables.

  • To identify correlations, trends, or outliers.

Why:

  • Effective for visualizing patterns in data.

  • Helps in identifying clusters or gaps.

Interpretation Steps:

  1. Look for patterns (positive, negative, or no correlation).

  2. Identify outliers or unusual points.

  3. Check the density of points in different regions.

Details:

  • Use color or size to add a third dimension (e.g., another variable).

  • Add a trendline to highlight relationships.
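
To make the interpretation steps concrete, the following sketch quantifies what a scatter plot shows: the Pearson correlation coefficient gives the direction and strength of a linear relationship, and a least-squares line serves as the trendline. The data are randomly generated for illustration and do not come from the guide's dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: y rises with x, plus random noise.
rng = np.random.default_rng(0)
x = rng.uniform(30, 70, size=200)
y = 2.0 * x + rng.normal(0, 10, size=200)

# Pearson correlation: near +1 or -1 is a strong linear relationship, near 0 is none.
r = np.corrcoef(x, y)[0, 1]

# Least-squares straight line used as the trendline.
slope, intercept = np.polyfit(x, y, deg=1)

fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.5)
xs = np.linspace(x.min(), x.max(), 100)
ax.plot(xs, slope * xs + intercept, color="black")
ax.set_title(f"r = {r:.2f}, slope = {slope:.2f}")
```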

The dataset (df) contains 4240 rows × 16 columns.

sns.regplot(data=df, x='age', y='cigsPerDay', scatter_kws={'alpha':0.5}, color="orchid")
plt.title('title')
plt.xlabel('xlabel')
plt.ylabel('ylabel')
plt.show()

Single Scatter Plot

# Scatter plots with regression lines: age and cigsPerDay against totChol, sysBP, and diaBP

c = sns.color_palette("rocket", 7)[1:]  # six colors, skipping the first (darkest) one

fig, ax = plt.subplots(2, 3, figsize=(25, 10))
k, j = 0, 0  # current subplot row and column
i = 0  # index into the color list
for col in ["age", "cigsPerDay"]:
    for row in ["totChol", "sysBP", "diaBP"]:
        sns.regplot(data=df, x=col, y=row, ax=ax[k, j], color=c[i])
        i += 1
        ax[k, j].set_xlabel(col, fontsize=17, color="k")
        ax[k, j].set_ylabel(row, fontsize=17, color="k")
        ax[k, j].set_title("Correlation between {} and {}".format(col, row), fontsize=18, color="k", pad=25)
        j += 1
        if j >= 3:  # Move to the next row after 3 columns
            k += 1
            j = 0

plt.subplots_adjust(hspace=0.6, wspace=0.2)

Multiple Scatter Plots

4. KDE Plot (Kernel Density Estimate)

When to Use:

  • To visualize the distribution of a continuous variable.

  • To smooth out histograms for better interpretation.

Why:

  • Provides a smooth estimate of the data distribution.

  • Useful for comparing multiple distributions.

Interpretation Steps:

  1. Identify peaks (modes) in the distribution.

  2. Look for skewness (left or right) or symmetry.

  3. Compare multiple KDEs for differences.

Details:

  • Bandwidth selection affects smoothness.

  • Overlapping KDEs can show comparisons between groups.
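
The effect of bandwidth can be checked numerically. The sketch below uses `scipy.stats.gaussian_kde` on a synthetic two-peaked sample (not the guide's data) and counts the peaks that survive at a small and at a large bandwidth factor. The factors 0.2 and 1.5 and the helper `count_peaks` are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic sample with two clearly separated groups.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)])
grid = np.linspace(-5, 5, 401)


def count_peaks(density):
    """Count grid points that are higher than both of their neighbours."""
    inner = density[1:-1]
    return int(np.sum((inner > density[:-2]) & (inner > density[2:])))


# bw_method scales the kernel width: small values follow the data closely,
# large values smooth more aggressively.
narrow = gaussian_kde(data, bw_method=0.2)(grid)
wide = gaussian_kde(data, bw_method=1.5)(grid)

print("peaks with small bandwidth:", count_peaks(narrow))
print("peaks with large bandwidth:", count_peaks(wide))
```

With the large factor the two groups blur into a single hump, which is how an oversmoothed KDE can hide a real feature of the data. In seaborn, `sns.kdeplot` exposes a similar control through its `bw_adjust` parameter.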

5. Boxplot (Box-and-Whisker Plot)

When to Use:

  • To show the distribution of numerical data.

  • To identify outliers and compare distributions across groups.

Why:

  • Summarizes data using quartiles.

  • Highlights outliers effectively.

Interpretation Steps:

  1. Identify the median (line inside the box).

  2. Check the interquartile range (IQR, the box).

  3. Look for outliers (points outside the whiskers).

Details:

  • Useful for comparing multiple groups side by side.

  • Whiskers typically extend to the most extreme data points within 1.5 × IQR of the box edges; points beyond them are drawn as outliers.
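
The quantities a boxplot draws can be computed directly. The sketch below works through the median, quartiles, IQR, and the common 1.5 × IQR rule on a small made-up list of values. It uses NumPy's default (linear interpolation) percentile method, so other tools may report slightly different quartiles.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])  # made-up data with one extreme value

median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"median={median}, Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers.tolist()}")
```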

6. Histogram

When to Use:

  • To show the frequency distribution of a continuous variable.

  • To identify skewness, peaks, and gaps.

Why:

  • Simple and effective for understanding data distribution.

  • Helps in identifying patterns and outliers.

Interpretation Steps:

  1. Identify bins (x-axis) and frequencies (y-axis).

  2. Look for peaks (modes) and gaps.

  3. Check for skewness (left or right).

Details:

  • Bin size affects interpretation; too small or too large bins can mislead.

  • Overlay with KDE for smoother interpretation.
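
To see how much the bin choice matters, the sketch below bins the same synthetic sample three ways: very few bins, a moderate number, and NumPy's built-in Freedman-Diaconis rule (`bins="fd"`). The sample is randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(50, 10, size=1000)  # synthetic sample

for bins in (5, 30, "fd"):
    counts, edges = np.histogram(data, bins=bins)
    width = edges[1] - edges[0]
    print(f"bins={bins!s:>3}: {len(counts)} bins of width {width:.2f}, "
          f"tallest bin holds {counts.max()} values")
```

Every version accounts for all 1000 values; only the grouping changes, which is why the apparent shape of a histogram should be checked against more than one bin setting.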

# Histogram, KDE, and boxplot of age, side by side
fig, ax = plt.subplots(1, 3, figsize=(15, 4))
ax[0].hist(x=df.age, bins=15)
sns.kdeplot(x=df.age, ax=ax[1])
sns.boxplot(x=df.age, ax=ax[2])

# bar_chart is a project helper that is not defined in this guide.
bar_chart(
    df,
    "male",
    "age",
    txt = "Count",
    colors =['green','blue', "orange"] ,
    height=600,
    width=900,
    ttle="title",
    xtitle="X title" ,
    ytitle="Y title" ,
    bg=0.6, 
    bgg=0.1,
    yscale = False,
    yscale_percentage=False,
    group="group",
    leg="legend",
    box=True,
    facetcol="currentSmoker"
)

7. Rainfall Chart

When to Use:

  • To visualize time-series data with irregular spikes.

  • Commonly used in weather or financial data.

Why:

  • Highlights extreme values or events.

  • Effective for showing variability over time.

Interpretation Steps:

  1. Identify spikes or drops in the data.

  2. Look for patterns or cycles over time.

  3. Compare with other variables (e.g., temperature vs. rainfall).

Details:

  • Use color gradients to show intensity.

  • Combine with line charts for better context.
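
As a minimal sketch of the description above, the following example plots a daily series as bars, overlays a rolling mean as a line for context, and marks the spikes. The series is made up, and flagging values more than three standard deviations above the mean is an assumed rule of thumb rather than something this guide prescribes.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up daily series: a steady baseline with two extreme days.
days = pd.date_range("2024-01-01", periods=60, freq="D")
rain = pd.Series(2.0, index=days)
rain.iloc[[10, 35]] = 30.0

# Assumed spike rule: more than three standard deviations above the mean.
threshold = rain.mean() + 3 * rain.std()
spikes = rain[rain > threshold]

# A 7-day rolling mean gives the line-chart context.
rolling = rain.rolling(window=7, min_periods=1).mean()

fig, ax = plt.subplots(figsize=(10, 4))
ax.bar(rain.index, rain.values, color="lightsteelblue", label="daily value")
ax.plot(rolling.index, rolling.values, color="navy", label="7-day mean")
ax.scatter(spikes.index, spikes.values, color="red", zorder=3, label="spike")
ax.set_ylabel("value")
ax.legend()
```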

# Example 1: Strip plot of age by education level
sns.stripplot(x='education', y='age', data=df, jitter=True, alpha=0.5)
plt.title('Age Distribution by Education Level (Strip Plot)')
plt.xticks(rotation=45)
plt.show()

# Example 2: Combined strip and box plot
plt.figure(figsize=(12, 6))
sns.boxplot(x='currentSmoker', y='sysBP', data=df, whis=1.5)
sns.stripplot(x='currentSmoker', y='sysBP', data=df, 
              color='red', alpha=0.3, jitter=0.2, size=4)
plt.title('Systolic BP Distribution by Smoking Status')
plt.xlabel('Current Smoker (0=No, 1=Yes)')
plt.show()

8. CDF (Cumulative Distribution Function)

When to Use:

  • To show the cumulative probability distribution of a variable.

  • To compare distributions of different datasets.

Why:

  • Provides a complete view of the data distribution.

  • Helps in understanding percentiles and probabilities.

Interpretation Steps:

  1. Identify the x-axis (data values) and y-axis (cumulative probability).

  2. Look for steep slopes (high density) and flat regions (low density).

  3. Compare multiple CDFs for differences.

Details:

  • The y-axis ranges from 0 to 1 (or 0% to 100%).

  • Useful for statistical analysis and hypothesis testing.
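
Reading probabilities and percentiles off a CDF can be illustrated with a theoretical distribution. The sketch below assumes a normal distribution with mean 50 and standard deviation 10, chosen only for illustration, and uses `scipy.stats.norm`: `cdf` maps a value to a cumulative probability, and `ppf` goes the other way, from a probability to a value.

```python
from scipy.stats import norm

dist = norm(loc=50, scale=10)  # assumed distribution, for illustration only

p_below_60 = dist.cdf(60)   # probability of a value at or below 60
median = dist.ppf(0.5)      # value with half of the probability below it
p90 = dist.ppf(0.9)         # 90th percentile

print(f"P(X <= 60) = {p_below_60:.3f}")
print(f"median = {median:.1f}, 90th percentile = {p90:.1f}")
```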


9. ECDF (Empirical Cumulative Distribution Function)

When to Use:

  • Similar to CDF but for empirical (observed) data.

  • To visualize the distribution of a sample.

Why:

  • Non-parametric and easy to compute.

  • Useful for small datasets.

Interpretation Steps:

  1. Identify the step-like pattern of the ECDF.

  2. Compare with theoretical CDFs or other ECDFs.

  3. Look for deviations from expected distributions.

Details:

  • Each data point contributes a "step" in the ECDF.

  • Useful for exploratory data analysis (EDA).
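
Comparing an ECDF with a theoretical CDF can be reduced to a single number: the largest vertical gap between the two curves. The sketch below computes an ECDF by hand and measures that gap at the sample points against a standard normal CDF. The sample is randomly generated, and the helper name `ecdf` is chosen for illustration.

```python
import numpy as np
from scipy.stats import norm


def ecdf(values):
    """Return sorted values and the fraction of observations at or below each."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# Synthetic sample drawn from the same distribution it is compared against.
rng = np.random.default_rng(1)
sample = rng.normal(loc=0.0, scale=1.0, size=500)

x, y = ecdf(sample)
max_gap = np.max(np.abs(y - norm.cdf(x)))  # largest deviation at the sample points
print(f"largest gap between ECDF and normal CDF: {max_gap:.3f}")
```

A small gap suggests the sample is consistent with the reference distribution; a large one points to the kind of deviation the interpretation steps ask you to look for.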

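# Example 1: Empirical CDF of age
def plot_cdf(data, variable, label=None):
    """Plot the empirical CDF of data[variable], ignoring missing values."""
    x = np.sort(data[variable].dropna())
    y = np.arange(1, len(x) + 1) / len(x)  # fraction of observations at or below each x
    plt.plot(x, y, label=label)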

# plt.figure(figsize=(8, 6))
plot_cdf(df, 'age', 'Age')
plt.title('Cumulative Distribution Function of Age')
plt.xlabel('Age')
plt.ylabel('Cumulative Probability')
plt.grid(True)
plt.show()

# Example 2: Comparing CDFs by gender
# plt.figure(figsize=(8, 6))
plot_cdf(df[df['male']==0], 'BMI', 'Female')
plot_cdf(df[df['male']==1], 'BMI', 'Male')
plt.title('CDF of BMI by Gender')
plt.xlabel('BMI')
plt.ylabel('Cumulative Probability')
plt.legend()
plt.grid(True)
plt.show()

10. Heatmap

When to Use:

  • To visualize matrix-like data (e.g., correlation matrices).

  • To show relationships between two categorical variables.

Why:

  • Effective for identifying patterns and clusters.

  • Color gradients make it easy to interpret.

Interpretation Steps:

  1. Identify the axes (categories or variables).

  2. Look for regions of high and low values, using the color bar to see which colors mark each (the mapping depends on the colormap).

  3. Check for patterns or clusters.

Details:

  • Use a color legend to interpret values.

  • Normalize data for better comparison.

# Check correlations with a heatmap

fig, ax = plt.subplots(figsize=(25, 8))
corr_matrix = df.corr(numeric_only=True)  # numeric_only requires pandas >= 1.5
# Hide the upper triangle (including the diagonal), which mirrors the lower one
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, annot=True, mask=mask, ax=ax)

plt.title('Correlations Heatmap', fontsize=20, fontweight='bold', fontfamily='serif', pad=15)
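
A heatmap can also relate two categorical variables, and normalizing first makes rows of different sizes comparable. The sketch below builds a row-normalized cross-tabulation with `pd.crosstab` and draws it with `sns.heatmap`. The sample is made up; it only borrows the column names `education` and `currentSmoker` from the guide's dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up sample with two categorical columns.
sample = pd.DataFrame({
    "education":     [1, 1, 1, 2, 2, 3, 3, 3, 3, 4],
    "currentSmoker": [0, 1, 1, 0, 0, 1, 0, 1, 1, 0],
})

# normalize="index" turns counts into row shares, so every row sums to 1.
table = pd.crosstab(sample["education"], sample["currentSmoker"], normalize="index")

fig, ax = plt.subplots(figsize=(4, 4))
sns.heatmap(table, annot=True, fmt=".2f", vmin=0, vmax=1, cmap="Blues", ax=ax)
ax.set_title("Share of smokers within each education level")
```

Fixing `vmin` and `vmax` keeps the color scale comparable if the same chart is redrawn for another subset of the data.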

Summary Table of Charts

Chart Type      When to Use                           Key Features                             Interpretation Tips
Bar Chart       Compare categories                    Heights represent values                 Compare bar heights
Pie Chart       Show proportions                      Slices represent percentages             Ensure total is 100%
Scatter Plot    Show relationships between variables  Points represent data pairs              Look for patterns or outliers
KDE Plot        Visualize distributions               Smooth curve over histogram              Identify peaks and skewness
Boxplot         Show data distribution                Box represents IQR, whiskers show range  Check for outliers and median
Rainfall Chart  Visualize time-series spikes          Spikes represent extreme values          Look for patterns over time
CDF             Show cumulative distribution          Curve from 0 to 1                        Compare percentiles
ECDF            Empirical distribution of sample      Step-like curve                          Compare with theoretical CDF
Histogram       Frequency distribution                Bars represent frequency                 Check for peaks and gaps
Heatmap         Matrix-like data                      Color gradients represent values         Look for patterns and clusters
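
For quick lookups in code, the "When to Use" column of the summary table can be encoded as a small mapping. This is only an illustrative sketch: the dictionary `CHART_FOR_GOAL` and the function `suggest_chart` are hypothetical names, and the goals are copied from the table rather than derived from any data.

```python
# Goal phrases taken from the "When to Use" column, mapped to the chart type.
CHART_FOR_GOAL = {
    "compare categories": "Bar Chart",
    "show proportions": "Pie Chart",
    "show relationships between variables": "Scatter Plot",
    "visualize distributions": "KDE Plot",
    "show data distribution": "Boxplot",
    "visualize time-series spikes": "Rainfall Chart",
    "show cumulative distribution": "CDF",
    "empirical distribution of sample": "ECDF",
    "frequency distribution": "Histogram",
    "matrix-like data": "Heatmap",
}


def suggest_chart(goal):
    """Return the chart type listed for a goal, ignoring case and outer spaces."""
    key = goal.strip().lower()
    if key not in CHART_FOR_GOAL:
        raise KeyError(f"no chart listed for goal: {goal!r}")
    return CHART_FOR_GOAL[key]


print(suggest_chart("Compare categories"))
print(suggest_chart("  matrix-like data "))
```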

This guide provides a comprehensive overview of various charts, their uses, and interpretation steps. Use the summary table for quick reference!