How to create a box plot with quartile ranges and outliers using Matplotlib?

by kanika | Updated: Jul 20, 2023


A box plot, also known as a box-and-whisker plot, is a graphical representation of the distribution of a dataset. It gives a visual summary of the data's key characteristics: center, spread, and skewness. The plot displays the five-number summary of the dataset: the minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
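
The five-number summary can be computed directly with NumPy. The following is a minimal sketch, assuming a small made-up sample (the values and variable names are illustrative and not part of the original solution):

```python
import numpy as np

# Small illustrative sample (made-up values)
sample = np.array([2, 4, 4, 5, 7, 8, 9, 11, 12, 30])

# Five-number summary: minimum, Q1, median, Q3, maximum.
# np.percentile uses linear interpolation between data points by default.
minimum, q1, median, q3, maximum = np.percentile(sample, [0, 25, 50, 75, 100])

five_number_summary = {
    "min": minimum,
    "q1": q1,
    "median": median,
    "q3": q3,
    "max": maximum,
}
print(five_number_summary)
```

These five values are the ones a box plot encodes visually, apart from any points drawn separately as outliers.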

 

Boxplots are useful for comparing distributions across categories or groups. They make skewness and outliers easy to spot and help reveal differences in central tendency and variability between datasets. They are used in exploratory data analysis, statistical analysis, and data visualization because they give a concise, intuitive summary that makes patterns and potential anomalies across groups easy to find.

 

A boxplot summarizes the distribution of numerical data by displaying statistical measures such as the median, quartiles, and range. There are several types of boxplots, depending on what they show:

Types of Boxplots:  

  • Simple Boxplot: The most common type. It shows the minimum, maximum, median, quartiles, and any outliers.
  • Notched Boxplot: Displays a notch around the median that indicates the uncertainty of the median estimate.
  • Violin Plot: Combines a box plot with a kernel density plot on each side, giving a more detailed view of the distribution.
  • Grouped Boxplot: Several boxplots placed side by side to compare distributions across categories or variables.
  • Interpretation: Each type represents the data and its distribution differently, but all of them are read from the same basic elements: the box, the median line, the whiskers, and the outliers.
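
The plot types listed above can be drawn side by side for comparison. This is a minimal sketch, assuming three made-up normally distributed groups; the group values, figure size, and titles are illustrative choices only:

```python
import matplotlib.pyplot as plt
import numpy as np

# Three made-up groups with slightly different means (illustrative data)
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=1.0, size=200) for mu in (0.0, 0.5, 1.0)]

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)

# Simple boxplots, grouped side by side
simple = axes[0].boxplot(groups)
axes[0].set_title("Simple / grouped")

# Notched boxplots: the notch marks a confidence interval around the median
notched = axes[1].boxplot(groups, notch=True)
axes[1].set_title("Notched")

# Violin plots: kernel density estimate mirrored on each side
violins = axes[2].violinplot(groups, showmedians=True)
axes[2].set_title("Violin")

fig.tight_layout()
```

Call plt.show() afterwards (or fig.savefig with a file name) to view the figure.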


The box represents the middle 50% of the data (the interquartile range, IQR), and the line inside it marks the median. The whiskers extend to the smallest and largest values that lie within a set range (e.g., 1.5 times the IQR beyond the quartiles, which is Matplotlib's default). Outliers are drawn as individual points outside the whiskers. Together, these elements give a concise summary of a dataset's central tendency, spread, and skewness, and make it easy to compare datasets and identify outliers.
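
The 1.5 x IQR rule can be worked through by hand. The sketch below assumes a small made-up sample and illustrative variable names; it computes the fences, the whisker ends, and the outliers:

```python
import numpy as np

# Small illustrative sample (made-up values)
sample = np.array([2, 4, 4, 5, 7, 8, 9, 11, 12, 30])

q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1

# Fences: the furthest a whisker is allowed to reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that are still inside the fences
inside = sample[(sample >= lower_fence) & (sample <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything beyond the fences is drawn as an individual outlier point
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print("IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

Note that the whiskers stop at actual data values, not at the fences themselves.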

 

Here are the different ways boxplots can be used:  

  • Visualizing Data - Gives a quick understanding of the spread, central tendency, and skewness of the data without examining individual data points.
  • Comparing Data Sets - Shows variations between different groups, categories, or variables.
  • Identifying Outliers - Highlights data points that differ markedly from most of the data.
  • Understanding Distribution - Provides a visual summary of the distribution.
  • Assessing Central Tendency - Comparing the medians of several boxplots reveals differences in central tendency between datasets.

 

Boxplots are a powerful tool for visualizing and summarizing the distribution of a dataset. Three different methods are used to create boxplots:

  • The basic Box and Whisker plot.  
  • The Box and Whisker plot with outliers.  
  • The boxplot smoothing technique.  

 

A good boxplot presents the data clearly and effectively. Here are some tips for creating one:

  • Understand the data. 
  • Choose an appropriate scale.  
  • Avoid overplotting.  
  • Include necessary elements.  
  • Customize visuals for clarity.  
  • Consider grouping or categorizing.  
  • Provide context and explanations.  

Use cases and advantages:  

  • Identify distribution: Boxplots give a visual summary of the distribution's shape, skewness, and outliers.
  • Compare groups: Grouped box plots allow easy comparison of distributions across different groups or categories.
  • Detect outliers: Outliers are drawn as separate points, which makes potential anomalies or extreme values easy to identify.
  • Assess symmetry: The position of the median within the box and the relative lengths of the whiskers indicate the symmetry or skewness of the data.
  • Limitations and considerations: Boxplots do not show the actual data points, can oversimplify a distribution, and are sensitive to sample size.
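
One common way to soften the limitations above is to overlay the raw data points on the boxes. This is a minimal sketch, assuming two made-up groups of very different sizes; the jitter width, marker size, and labels are arbitrary illustrative choices:

```python
import matplotlib.pyplot as plt
import numpy as np

# Two made-up groups with very different sample sizes (illustrative data)
rng = np.random.default_rng(1)
groups = {
    "small (n=8)": rng.normal(size=8),
    "large (n=300)": rng.normal(size=300),
}

fig, ax = plt.subplots()
positions = list(range(1, len(groups) + 1))

# Draw the boxes without fliers, since every point is plotted below anyway
ax.boxplot(list(groups.values()), positions=positions, showfliers=False)

# Overlay the individual points with a little horizontal jitter
for pos, values in zip(positions, groups.values()):
    jitter = rng.uniform(-0.08, 0.08, size=len(values))
    ax.scatter(pos + jitter, values, s=10, alpha=0.5)

ax.set_xticks(positions)
ax.set_xticklabels(list(groups.keys()))
ax.set_ylabel("value")
```

With the points visible, it is easier to see that the box for the small group rests on only a handful of observations.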


In conclusion, boxplots are powerful and informative visual tools for analyzing and interpreting data. They give a concise summary of a dataset's distribution and highlight key statistics such as the median, quartiles, and outliers, so we can identify the data's central tendency, variation, and potential anomalies.


Here is an example of creating a box plot with quartile ranges and outliers using Matplotlib.




Fig 1: Preview of the output that you will get on running this code from your IDE.

Code


In this solution, we use the matplotlib library.

import matplotlib.cbook as cbook
import matplotlib.pyplot as plt
import numpy as np

# Generate some random data to visualise
np.random.seed(2019)
data = np.random.normal(size=100)

stats = {}
# Compute the boxplot stats (as in the default matplotlib implementation)
stats['A'] = cbook.boxplot_stats(data, labels='A')[0]
stats['B'] = cbook.boxplot_stats(data, labels='B')[0]
stats['C'] = cbook.boxplot_stats(data, labels='C')[0]

# For box A compute the 1st and 99th percentiles
stats['A']['q1'], stats['A']['q3'] = np.percentile(data, [1, 99])
# For box B compute the 10th and 90th percentiles
stats['B']['q1'], stats['B']['q3'] = np.percentile(data, [10, 90])
# For box C compute the 25th and 75th percentiles (matplotlib default)
stats['C']['q1'], stats['C']['q3'] = np.percentile(data, [25, 75])

fig, ax = plt.subplots(1, 1)
# Plot boxplots from our computed statistics
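ax.bxp([stats['A'], stats['B'], stats['C']], positions=range(3))
plt.show()

# Note: overriding 'q1' and 'q3' above only moves the box edges. The whiskers
# and fliers were already computed by boxplot_stats from the default 25th/75th
# percentiles. The second version below recomputes them from the chosen percentiles.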

import itertools
# _reshape_2D is a private matplotlib helper and may change between versions
from matplotlib.cbook import _reshape_2D
import matplotlib.pyplot as plt
import numpy as np

# Function adapted from matplotlib.cbook
def my_boxplot_stats(X, whis=1.5, bootstrap=None, labels=None,
                     autorange=False, percents=(25, 75)):

    def _bootstrap_median(data, N=5000):
        # determine 95% confidence intervals of the median
        M = len(data)
        percentiles = [2.5, 97.5]

        bs_index = np.random.randint(M, size=(N, M))
        bsData = data[bs_index]
        estimate = np.median(bsData, axis=1, overwrite_input=True)

        CI = np.percentile(estimate, percentiles)
        return CI

    def _compute_conf_interval(data, med, iqr, bootstrap):
        if bootstrap is not None:
            # Do a bootstrap estimate of notch locations.
            # get conf. intervals around median
            CI = _bootstrap_median(data, N=bootstrap)
            notch_min = CI[0]
            notch_max = CI[1]
        else:

            N = len(data)
            notch_min = med - 1.57 * iqr / np.sqrt(N)
            notch_max = med + 1.57 * iqr / np.sqrt(N)

        return notch_min, notch_max

    # output is a list of dicts
    bxpstats = []

    # convert X to a list of lists
    X = _reshape_2D(X, "X")

    ncols = len(X)
    if labels is None:
        labels = itertools.repeat(None)
    elif len(labels) != ncols:
        raise ValueError("Dimensions of labels and X must be compatible")

    input_whis = whis
    for ii, (x, label) in enumerate(zip(X, labels)):

        # empty dict
        stats = {}
        if label is not None:
            stats['label'] = label

        # restore whis to the input values in case it got changed in the loop
        whis = input_whis

        # note tricksyness, append up here and then mutate below
        bxpstats.append(stats)

        # if empty, bail
        if len(x) == 0:
            stats['fliers'] = np.array([])
            stats['mean'] = np.nan
            stats['med'] = np.nan
            stats['q1'] = np.nan
            stats['q3'] = np.nan
            stats['cilo'] = np.nan
            stats['cihi'] = np.nan
            stats['whislo'] = np.nan
            stats['whishi'] = np.nan
            stats['iqr'] = np.nan
            continue

        # up-convert to an array, just to be safe
        x = np.asarray(x)

        # arithmetic mean
        stats['mean'] = np.mean(x)

        # median
        med = np.percentile(x, 50)
        ## Altered line
        q1, q3 = np.percentile(x, (percents[0], percents[1]))

        # interquartile range
        stats['iqr'] = q3 - q1
        if stats['iqr'] == 0 and autorange:
            whis = 'range'

        # conf. interval around median
        stats['cilo'], stats['cihi'] = _compute_conf_interval(
            x, med, stats['iqr'], bootstrap
        )

        # lowest/highest non-outliers
        if np.isscalar(whis):
            if np.isreal(whis):
                loval = q1 - whis * stats['iqr']
                hival = q3 + whis * stats['iqr']
            elif whis in ['range', 'limit', 'limits', 'min/max']:
                loval = np.min(x)
                hival = np.max(x)
            else:
                raise ValueError('whis must be a float, valid string, or list '
                                 'of percentiles')
        else:
            loval = np.percentile(x, whis[0])
            hival = np.percentile(x, whis[1])

        # get high extreme
        wiskhi = np.compress(x <= hival, x)
        if len(wiskhi) == 0 or np.max(wiskhi) < q3:
            stats['whishi'] = q3
        else:
            stats['whishi'] = np.max(wiskhi)

        # get low extreme
        wisklo = np.compress(x >= loval, x)
        if len(wisklo) == 0 or np.min(wisklo) > q1:
            stats['whislo'] = q1
        else:
            stats['whislo'] = np.min(wisklo)

        # compute a single array of outliers
        stats['fliers'] = np.hstack([
            np.compress(x < stats['whislo'], x),
            np.compress(x > stats['whishi'], x)
        ])

        # add in the remaining stats
        stats['q1'], stats['med'], stats['q3'] = q1, med, q3

    return bxpstats

# Generate some random data to visualise
np.random.seed(2019)
data = np.random.normal(size=100)

stats = {}

# Compute the boxplot stats with our desired percentiles
stats['A'] = my_boxplot_stats(data, labels='A', percents=[1, 99])[0]
stats['B'] = my_boxplot_stats(data, labels='B', percents=[10, 90])[0]
stats['C'] = my_boxplot_stats(data, labels='C', percents=[25, 75])[0]

fig, ax = plt.subplots(1, 1)
# Plot boxplots from our computed statistics
ax.bxp([stats['A'], stats['B'], stats['C']], positions=range(3))
plt.show()
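
If only the whisker range needs to change, Matplotlib's own whis argument may be enough. The sketch below assumes random data similar to the example above and illustrative tick labels. Passing a pair such as whis=(5, 95) places the whiskers at those percentiles, while the box itself still spans the 25th to 75th percentiles:

```python
import matplotlib.pyplot as plt
import numpy as np

# Random data similar to the example above (illustrative)
rng = np.random.default_rng(2019)
data = rng.normal(size=100)

fig, ax = plt.subplots()

# Default rule: whiskers reach up to 1.5 * IQR beyond the quartiles
default_box = ax.boxplot(data, positions=[0], whis=1.5)

# Percentile rule: whiskers bounded by the 5th and 95th percentiles
pct_box = ax.boxplot(data, positions=[1], whis=(5, 95))

ax.set_xticks([0, 1])
ax.set_xticklabels(["whis=1.5", "whis=(5, 95)"])

# The upper whisker ends at the largest data point not above the 95th percentile
upper_whisker_end = pct_box["whiskers"][1].get_ydata()[1]
print(upper_whisker_end, np.percentile(data, 95))
```

This does not replace the custom function above, because whis leaves the box edges at the quartiles. It is only a shorter route when the box itself should stay as it is.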

Instructions

Follow the steps carefully to get the output easily.

  1. Install Jupyter Notebook on your computer.
  2. Open a terminal and install the required libraries with the following commands.
  3. Install numpy - pip install numpy.
  4. Install matplotlib - pip install matplotlib.
  5. Copy the snippet using the 'copy' button and paste it into a new notebook file.
  6. Run the file using the run button.


I hope you found this useful. I have added the link to dependent libraries, and version information in the following sections.


I found this code snippet by searching for "Create box plot with quartile ranges and outliers using Matplotlib" in kandi. You can try any such use case!

Dependent Libraries

matplotlib by matplotlib
Python | 17559 stars | Version: v3.7.1 | License: No License
matplotlib: plotting with Python

numpy by numpy
Python | 23755 stars | Version: v1.25.0rc1 | License: Permissive (BSD-3-Clause)
The fundamental package for scientific computing with Python.

If you do not have matplotlib or numpy, which are required to run this code, you can install them by clicking on the links above and copying the pip install command from the respective page in kandi.


You can search for any dependent library on kandi, such as matplotlib.

Environment Tested


I tested this solution in the following versions. Be mindful of changes when working with other versions.

1. The solution is created in Python 3.9.6
2. The solution is tested on matplotlib version 3.5.0


Using this solution, we are able to create a box plot with quartile ranges and outliers using Matplotlib.
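
To compare your own environment with the versions listed above, a short check like the following can be run first. It is only a convenience sketch and not part of the original solution:

```python
import sys

import matplotlib
import numpy

# Print the interpreter and library versions in use
print("Python    :", sys.version.split()[0])
print("matplotlib:", matplotlib.__version__)
print("numpy     :", numpy.__version__)
```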

Support

1. For any support on kandi solution kits, please use the chat
2. For further learning resources, visit the Open Weaver Community learning page.


FAQ:

1. What is the matplotlib boxplot function, and how does it work?

Matplotlib's boxplot function draws a box-and-whisker plot of the data passed to it. The plot helps visualize the data's distribution, central tendency, and spread, and is useful for identifying outliers and comparing several datasets. Here's how the boxplot function works in Matplotlib:

• Import the necessary modules:

   import matplotlib.pyplot as plt

• Prepare your data: The input data can be a list, an array, or a pandas DataFrame.
• Create a boxplot: Use the plt.boxplot() function to generate the boxplot.
• Display the plot: Call plt.show() to display the generated boxplot.
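
The steps above can be sketched as follows, assuming made-up normally distributed values (the data and variable names are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np

# Prepare some made-up data (illustrative)
rng = np.random.default_rng(42)
values = rng.normal(loc=50, scale=10, size=200)

# Create the boxplot; the call returns a dict of the drawn artists
artists = plt.boxplot(values)

# The dict groups the artists by role: boxes, whiskers, caps, medians, fliers, means
print(sorted(artists.keys()))
```

Call plt.show() afterwards to display the plot. The returned dictionary is useful later for styling individual parts of the plot.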

                                           

2. How can I use the Matplotlib library to create and customize a box plot?

To create and customize a box plot using the Matplotlib library, you can follow these steps:

• Import the necessary libraries.
• Generate some data to plot.
• Create a figure and axes object.
• Plot the box plot using the boxplot() function.
• Customize the fill color of the boxes.
• Customize the color of the whiskers and caps.
• Customize the color and style of the medians.
• Customize the color and style of the fliers/outliers.
• Customize the x-axis tick labels.
• Add a title and axis labels.
• Finally, display the plot.
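
A minimal sketch of these customization steps is shown below. The data, colors, and label names are arbitrary illustrative choices, not settings from the original solution:

```python
import matplotlib.pyplot as plt
import numpy as np

# Generate some made-up data: three groups with shifted means
rng = np.random.default_rng(7)
data = [rng.normal(loc=m, size=100) for m in (0, 1, 2)]
names = ["A", "B", "C"]

fig, ax = plt.subplots()

parts = ax.boxplot(
    data,
    patch_artist=True,  # draw boxes as patches so they can be filled
    boxprops=dict(facecolor="lightblue", edgecolor="navy"),
    whiskerprops=dict(color="navy"),
    capprops=dict(color="navy"),
    medianprops=dict(color="darkred", linewidth=2),
    flierprops=dict(marker="o", markerfacecolor="orange", markersize=5),
)

# Tick labels, title, and axis labels
ax.set_xticklabels(names)
ax.set_title("Customized box plot")
ax.set_xlabel("Group")
ax.set_ylabel("Value")
```

Call plt.show() afterwards to display the plot. The dictionary returned by boxplot() can also be used to restyle individual artists after drawing, for example by looping over parts["boxes"].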


3. Is there an easy way to generate a box plot from a pandas DataFrame?

Yes, pandas provides a simple way to generate a box plot from a DataFrame using the boxplot() method.
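
A minimal sketch of the pandas route, assuming pandas is installed and using a made-up DataFrame with two hypothetical columns "A" and "B":

```python
import numpy as np
import pandas as pd

# Made-up DataFrame with two numeric columns (illustrative data)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": rng.normal(size=50),
    "B": rng.normal(loc=1.0, scale=2.0, size=50),
})

# DataFrame.boxplot draws one box per selected column and returns the Axes
ax = df.boxplot(column=["A", "B"], grid=False)
ax.set_ylabel("value")
```

pandas draws the plot with Matplotlib, so the returned Axes can be customized in the usual way and displayed with matplotlib.pyplot.show().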

                                           

4. What is PyPlot, and how do its plotting functions help me create my box plots?

PyPlot (matplotlib.pyplot) is Matplotlib's plotting module. It provides a high-level interface for creating various types of plots, and its plotting functions let you generate and customize box plots according to your needs. Here's a step-by-step guide on how to create box plots using PyPlot:

• Import the necessary libraries.
• Prepare your data.
• Create the box plot with the boxplot() function from PyPlot.
• Customize the box plot.

                                           

5. How are quartiles determined when using Matplotlib's boxplot function?

Quartiles are computed from the data passed to the boxplot function. Matplotlib calculates them as percentiles using NumPy:

• The median (second quartile, Q2) is the 50th percentile of the data.
• The lower quartile (Q1) is the 25th percentile.
• The upper quartile (Q3) is the 75th percentile.
• Percentiles that fall between two data points are linearly interpolated, so for small samples Q1 and Q3 can differ from the "median of the lower half / upper half" rule.
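
Quartile conventions differ between tools, so it can help to check the numbers. The sketch below assumes a small made-up sample and compares the quartiles reported by matplotlib.cbook.boxplot_stats with NumPy's default percentiles and with a hand-written median-of-halves rule (the latter is an illustrative helper calculation, not a Matplotlib function):

```python
import numpy as np
from matplotlib import cbook

# Small illustrative sample (made-up values)
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# Quartiles as percentiles with NumPy's default linear interpolation
q1_np, med_np, q3_np = np.percentile(sample, [25, 50, 75])

# Quartiles as reported by Matplotlib's boxplot statistics helper
stats = cbook.boxplot_stats(sample)[0]

# Median-of-halves rule: medians of the lower and upper halves of the sorted data
ordered = np.sort(sample)
half = len(ordered) // 2
q1_halves = np.median(ordered[:half])
q3_halves = np.median(ordered[-half:])

print("numpy percentiles :", q1_np, med_np, q3_np)
print("matplotlib stats  :", stats["q1"], stats["med"], stats["q3"])
print("median of halves  :", q1_halves, q3_halves)
```

For this sample the two conventions give different quartiles, and the difference shrinks as the sample grows.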
