Displaying and Describing Quantitative Data: Study Notes

Study Guide - Smart Notes

Tailored notes based on your materials, expanded with key definitions, examples, and context.

Chapter 3: Displaying and Describing Quantitative Data

3.1 Displaying Quantitative Variables

Quantitative data consists of numerical values that can be measured and ordered. To understand patterns and trends, it is essential to visualize these data using appropriate graphical methods.

Histograms: A histogram is a graphical representation of the distribution of a quantitative variable. The data range is divided into intervals (bins), and the frequency of data within each bin is depicted by the height of the bar.

Table of monthly average stock prices for AIG Histogram of monthly average price

Relative Frequency Histograms: Similar to histograms, but the vertical axis shows the percentage of cases in each bin rather than the count.

Relative frequency histogram of monthly average price

Stem-and-Leaf Displays: These provide a way to visualize the distribution while retaining the original data values. Each value is split into a 'stem' (leading digit(s)) and a 'leaf' (trailing digit).

Histogram and stem-and-leaf display of monthly average price

Quantitative Data Condition: Only quantitative data with known units should be displayed using histograms or stem-and-leaf plots. Categorical data require bar or pie charts.

3.2 Shape of Distributions

Describing the shape of a distribution is crucial for understanding the underlying data. Key aspects include modes, symmetry, and outliers.

Modes: The peaks in a histogram are called modes. Distributions can be unimodal (one peak), bimodal (two peaks), or multimodal (more than two peaks).

Histogram showing bimodal distribution

Uniform Distribution: A distribution where all bars are approximately the same height, indicating no apparent mode.

Uniform distribution histogram

Symmetry: A distribution is symmetric if the left and right sides are mirror images.

Symmetric histogram

Skewness: If one tail is longer, the distribution is skewed in that direction. Right-skewed means the right tail is longer; left-skewed means the left tail is longer.

Right-skewed distribution Left-skewed distribution

Outliers: Values that are far from the rest of the data. Outliers can affect statistical methods, may indicate errors, or provide important information.

Judgment in Describing Shape: Understanding the context and data collection method is important when characterizing the shape.

Histogram of AIG average stock prices

3.3 Center of a Distribution

The center of a distribution summarizes where most values are located. The two main measures are the mean and the median.

Mean (\( \bar{x} \)): The arithmetic average, calculated as:

The mean is best for unimodal, symmetric distributions.
Median: The middle value when data are ordered. The median is resistant to outliers and skewness.
If the distribution is symmetric, the mean and median are close.

Histogram showing mean and median

3.4 Spread of the Distribution

Spread describes how much the data values vary. Common measures include range, interquartile range (IQR), and standard deviation.

Range: The difference between the maximum and minimum values.

Quartiles and IQR: Quartiles divide the data into four equal parts. The IQR is the range of the middle 50% of the data.

Standard Deviation (s): Measures the average distance of data values from the mean. The variance (\( s^2 \)) is the average squared deviation, and the standard deviation is its square root.

Standard deviation is best for symmetric distributions; IQR is better for skewed data or data with outliers.

3.5 Choosing Measures of Center and Spread

The choice of summary statistics depends on the shape of the distribution.

For skewed distributions, use the median and IQR.
For unimodal, symmetric distributions, use the mean and standard deviation (and possibly median and IQR).
Always pair the median with the IQR and the mean with the standard deviation.
Report outliers and consider their impact on summary statistics.

3.6 Standardizing Variables (z-scores)

Standardizing allows comparison of values from different distributions by expressing them in terms of standard deviations from the mean.

z-score: The standardized value is calculated as:

A z-score indicates how many standard deviations a value is from the mean. Positive z-scores are above the mean; negative are below.
z-scores are unitless and allow comparison across different variables.

3.7 Five-Number Summary and Boxplots

The five-number summary provides a concise description of a distribution: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.

Boxplot: A graphical display of the five-number summary. The box shows the IQR, the line inside the box is the median, and whiskers extend to the most extreme non-outlier values. Outliers are plotted individually.

Five-number summary table Boxplot of monthly trading volume

To construct a boxplot:
1. Draw an axis and mark the quartiles and median.
2. Draw a box from Q1 to Q3 and a line at the median.
3. Calculate fences (1.5 IQR above Q3 and below Q1) to identify outliers.
4. Draw whiskers to the most extreme values within the fences.
5. Plot outliers individually.

Stem-and-leaf display for Gretzky games played Boxplot for Gretzky games played

Boxplots are useful for identifying skewness, spread, and outliers.

3.8 Comparing Groups

Comparing distributions across groups helps identify patterns and differences. Histograms and boxplots are commonly used for this purpose.

Side-by-side histograms are useful for comparing two groups.
Boxplots are more effective for comparing several groups simultaneously.

Histogram for AIG 2002 Histogram for AIG 2003 Yearly boxplots for AIG data Monthly boxplots for AIG 2008

Example: Comparing a.m. and p.m. music downloads using boxplots.

Table of hourly downloads Boxplot comparing a.m. and p.m. downloads

3.9 Identifying Outliers

Outliers should be carefully examined to determine their cause and whether they should be included in the analysis.

Check for data entry errors or special circumstances that explain the outlier.
Outliers may be valid and provide important information about the data.

Table of real estate prices Boxplot showing extreme outlier in real estate prices Histogram of real estate prices

3.10 Time Series Plots

Time series plots display data points in chronological order, revealing trends, cycles, and other temporal patterns.

Time series plots can show both short-term variation and long-term trends.
Connecting points or adding a smooth trace can help visualize trends.
For non-stationary data (with trends or changing variability), time series plots are more informative than histograms.

Time series plot of AIG daily closing prices 2007 Time series plot with connected points Time series plot with smooth trace Time series plot for AIG 2008

3.11 Transforming Skewed Data

When data are highly skewed, transformations can make the distribution more symmetric, facilitating analysis and interpretation.

Common transformations include logarithms and square roots for right-skewed data, and squaring for left-skewed data.
Transformation can clarify the center and spread and help identify outliers.

Histogram and boxplot of CEO compensation (skewed) Histogram and boxplot of log-transformed CEO compensation

3.12 Common Pitfalls and Best Practices

Do not use histograms for categorical variables.
Choose appropriate and consistent scales for axes.
Label variables and axes clearly.
Check that numerical summaries make sense for the data type.
Be cautious with multiple modes and outliers; consider separating data if necessary.

Histogram of policy numbers (categorical variable)

Summary Table: Measures of Center and Spread

Distribution Shape	Center	Spread
Unimodal, Symmetric	Mean	Standard Deviation
Skewed or with Outliers	Median	IQR

What Have We Learned?

How to make and interpret histograms and boxplots for quantitative data.
How to describe distributions in terms of shape, center, and spread.
How to compute and interpret mean, median, standard deviation, and IQR.
How to standardize values using z-scores for comparison.
How to use the five-number summary and boxplots to identify outliers.
How to compare distributions across groups and over time.
How to transform skewed data for better analysis.