Numerically Summarizing Data: Measures of Central Tendency and Dispersion

Study Guide - Smart Notes

Chapter 3: Measures of Central Tendency

Mean, Median, and Mode

Measures of central tendency are statistical values that describe the center or typical value of a dataset. The three most common measures are the mean, median, and mode.

Mean (Arithmetic Average): The sum of all data values divided by the number of values. For a sample, the mean is denoted as $ \bar{x} $; for a population, it is $ \mu $.
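- Sample mean formula: $ \bar{x} = \frac{\sum x_i}{n} $, where $ n $ is the sample size.
- Population mean formula: $ \mu = \frac{\sum x_i}{N} $, where $ N $ is the population size.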
- Not resistant to outliers (extreme values can significantly affect the mean).
Median: The middle value when data are arranged in ascending order. If the number of observations is even, the median is the average of the two middle values.
- Resistant to outliers (extreme values have little or no effect on the median).
Mode: The value that appears most frequently in the dataset. A dataset may have one mode, more than one mode, or no mode at all.

Example: For the dataset {1, 2, 2, 3, 4}:

Mean: $ \frac{1 + 2 + 2 + 3 + 4}{5} = \frac{12}{5} = 2.4 $
Median: 2
Mode: 2

Additional info: The mean is sensitive to outliers, while the median provides a better measure of center for skewed distributions.
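
The example above can be checked with Python's standard `statistics` module. This is a minimal sketch: the extra value 100 in `with_outlier` is a made-up number, added only to show how differently the mean and median react to an extreme value.

```python
from statistics import mean, median, multimode

data = [1, 2, 2, 3, 4]

# Mean: sum of the values divided by how many there are (12 / 5).
data_mean = mean(data)
# Median: middle value of the sorted data.
data_median = median(data)
# multimode returns every most-frequent value, so ties are not hidden.
data_modes = multimode(data)

print(data_mean, data_median, data_modes)  # 2.4 2 [2]

# Hypothetical outlier: one extreme value moves the mean a lot, the median barely.
with_outlier = data + [100]
print(mean(with_outlier), median(with_outlier))
```

With the outlier included, the mean jumps to about 18.67 while the median only moves to 2.5, which is what "resistant" means in practice.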

Measures of Dispersion

Range, Standard Deviation, and Variance

Measures of dispersion describe the spread or variability of a dataset.

Range: The difference between the maximum and minimum values.
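- Formula: $ \text{Range} = x_{\max} - x_{\min} $
Standard Deviation (SD): Measures the typical distance of data values from the mean. It is the square root of the variance.
- Sample standard deviation: $ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} $
- Population standard deviation: $ \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} $
Variance: The square of the standard deviation.
- Sample variance: $ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} $
- Population variance: $ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} $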

Example: For the dataset {2, 4, 4, 4, 5, 5, 7, 9}:

Mean: $5$
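Range: $ 9 - 2 = 7 $
Sample standard deviation: $ s = \sqrt{\frac{32}{7}} \approx 2.14 $ (the squared deviations from the mean sum to 32, and $ n - 1 = 7 $)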

Additional info: The standard deviation is more informative than the range because it uses all data values.
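
A short sketch of the same calculation using Python's standard `statistics` module. The variable names are illustrative. Note the two conventions: `stdev` and `variance` divide by $ n - 1 $ (sample), while `pstdev` and `pvariance` divide by $ N $ (population).

```python
from statistics import pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Range uses only the two extreme values.
data_range = max(data) - min(data)

# Sample versions divide the sum of squared deviations (32) by n - 1 = 7.
s2 = variance(data)
s = stdev(data)

# Population versions divide the same sum by N = 8.
sigma2 = pvariance(data)
sigma = pstdev(data)

print(f"range={data_range}, s={s:.3f}, s^2={s2:.3f}")
print(f"sigma={sigma:.3f}, sigma^2={sigma2}")
```

For this dataset the sample standard deviation is about 2.138 and the population standard deviation is exactly 2, so it matters which formula a problem asks for.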

Empirical Rule and Chebyshev's Inequality

Describing Data Spread in Bell-Shaped Distributions

The Empirical Rule applies to bell-shaped (normal) distributions and provides approximate percentages of data within certain standard deviations from the mean:

About 68% of data fall within 1 standard deviation ($ \mu \pm \sigma $)
About 95% within 2 standard deviations ($ \mu \pm 2\sigma $)
About 99.7% within 3 standard deviations ($ \mu \pm 3\sigma $)

Chebyshev's Inequality applies to any data set, regardless of shape, and states that at least $ 1 - \frac{1}{k^2} $ of the data lie within $ k $ standard deviations of the mean (for $ k > 1 $).

Example: For $ k = 2 $, at least $ 1 - \frac{1}{2^2} = \frac{3}{4} $ (75%) of data are within 2 standard deviations of the mean.
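
Chebyshev's bound, $ 1 - 1/k^2 $, is a guaranteed minimum, so real data usually do better than it. The sketch below compares the bound with the observed fraction for a small dataset. The helper names `chebyshev_lower_bound` and `fraction_within` are illustrative, and the population standard deviation is used as an assumption.

```python
from statistics import mean, pstdev


def chebyshev_lower_bound(k: float) -> float:
    """Minimum fraction of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2


def fraction_within(data, k):
    """Observed fraction of values within k population SDs of the mean."""
    mu, sigma = mean(data), pstdev(data)
    inside = sum(1 for x in data if abs(x - mu) <= k * sigma)
    return inside / len(data)


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population SD 2
for k in (2, 3):
    print(k, chebyshev_lower_bound(k), fraction_within(sample, k))
```

For this sample every value lies within 2 standard deviations of the mean, comfortably above the guaranteed 75%.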

Grouped Data: Approximating the Mean and Standard Deviation

Using Frequency Tables

When data are grouped into classes, the mean and standard deviation can be approximated using class midpoints and frequencies.

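Mean from grouped data: $ \bar{x} \approx \frac{\sum f_i m_i}{\sum f_i} $
- Where $ f_i $ is the frequency and $ m_i $ is the class midpoint.
Standard deviation from grouped data: $ s \approx \sqrt{\frac{\sum f_i (m_i - \bar{x})^2}{\left(\sum f_i\right) - 1}} $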

Example Table:

Class Interval	Midpoint (m)	Frequency (f)
10-19	14.5	2
20-29	24.5	5
30-39	34.5	3

Mean: $ \bar{x} \approx \frac{(2)(14.5) + (5)(24.5) + (3)(34.5)}{2 + 5 + 3} = \frac{255}{10} = 25.5 $
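
The grouped-data approximation treats every observation in a class as if it sat at the class midpoint. A minimal sketch for the table above follows, using the sample ($ n - 1 $) form of the standard deviation as an assumption. The variable names are illustrative.

```python
from math import sqrt

# Midpoints and frequencies from the example table.
midpoints = [14.5, 24.5, 34.5]
frequencies = [2, 5, 3]

n = sum(frequencies)  # total number of observations

# Approximate mean: sum of f * m divided by the total frequency.
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Approximate sample variance: frequency-weighted squared deviations over n - 1.
grouped_var = sum(
    f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints)
) / (n - 1)
grouped_sd = sqrt(grouped_var)

print(f"mean ~ {grouped_mean:.1f}, sd ~ {grouped_sd:.2f}")
```

This gives a mean of 25.5 and a standard deviation of about 7.38. Both are approximations, because the original values inside each class are unknown.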

Weighted Mean

Calculating Averages with Different Weights

The weighted mean is used when different data values contribute unequally to the mean.

Formula: $ \bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} $
- Where $ w_i $ is the weight and $ x_i $ is the value.

Example: Calculating GPA with grades and credit hours:

Suppose: 2 units of A (4.0), 5 units of B (3.0), 2 units of C (2.0)
GPA: $ \frac{(2)(4.0) + (5)(3.0) + (2)(2.0)}{2 + 5 + 2} = \frac{27}{9} = 3.0 $
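
The GPA example is a weighted mean with credit units as the weights: multiply each grade value by its units, add the products, and divide by the total units. A small sketch follows; the function name `weighted_mean` is illustrative.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


grade_points = [4.0, 3.0, 2.0]  # A, B, C
credit_units = [2, 5, 2]        # units earned at each grade

gpa = weighted_mean(grade_points, credit_units)
print(gpa)  # 3.0
```

Here the weighted mean equals the unweighted average of 4.0, 3.0, and 2.0 only because the A and C units balance each other. With different unit counts the two would differ.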

z-Scores and Standardization

Comparing Data Values

A z-score indicates how many standard deviations a value is from the mean. It allows comparison across different datasets or distributions.

Formula: $ z = \frac{x - \mu}{\sigma} $
For a sample: $ z = \frac{x - \bar{x}}{s} $
A positive z-score means the value is above the mean; a negative z-score means below the mean.

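Example: If a value lies exactly one standard deviation above the mean ($ x = \mu + \sigma $), then $ z = \frac{(\mu + \sigma) - \mu}{\sigma} = 1 $.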
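
z-scores put values from different scales on a common footing. The sketch below uses entirely hypothetical exam numbers (they are not from these notes) to compare two scores. The function name `z_score` is also illustrative.

```python
def z_score(x, center, spread):
    """Number of standard deviations that x lies from the center."""
    if spread <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - center) / spread


# Hypothetical exams: (score, class mean, class standard deviation).
z_math = z_score(85, 70, 10)    # 15 points above the mean, SD 10
z_reading = z_score(78, 66, 6)  # 12 points above the mean, SD 6

print(z_math, z_reading)  # 1.5 2.0
```

Although the raw reading score is lower, its z-score is higher, so it is the stronger result relative to its own class.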

Percentiles, Quartiles, and the Five-Number Summary

Describing Data Position and Spread

Percentiles: Indicate the value below which a given percentage of observations fall. For example, the 90th percentile is the value below which 90% of the data lie.
Quartiles: Divide the data into four equal parts:
- Q1: 25th percentile
- Q2: 50th percentile (median)
- Q3: 75th percentile
Interquartile Range (IQR): Measures the spread of the middle 50% of data.
- Formula: $ \text{IQR} = Q_3 - Q_1 $
Five-number summary: Minimum, Q1, Median (Q2), Q3, Maximum

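Example: For the dataset {1, 2, 3, 4, 5, 6, 7, 8, 9}, with Q1 and Q3 taken as the medians of the lower and upper halves, each including the overall median (excluding it would give Q1 = 2.5 and Q3 = 7.5):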

Min: 1
Q1: 3
Median: 5
Q3: 7
Max: 9
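
Quartile values depend on the convention used, so software and textbooks can disagree slightly. The sketch below uses `statistics.quantiles` (Python 3.8+). Its `"inclusive"` method reproduces the values above, while the default `"exclusive"` method gives different quartiles. The helper name `five_number_summary` is illustrative.

```python
from statistics import quantiles


def five_number_summary(data, method="inclusive"):
    """Return (min, Q1, median, Q3, max) using the chosen quartile method."""
    ordered = sorted(data)
    q1, q2, q3 = quantiles(ordered, n=4, method=method)
    return ordered[0], q1, q2, q3, ordered[-1]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

print(five_number_summary(data))               # (1, 3.0, 5.0, 7.0, 9)
print(five_number_summary(data, "exclusive"))  # (1, 2.5, 5.0, 7.5, 9)

# IQR: spread of the middle 50% of the data.
_, q1, _, q3, _ = five_number_summary(data)
iqr = q3 - q1
print(iqr)  # 4.0
```

When checking homework against software output, confirm which quartile rule the course uses before treating a small mismatch as an error.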

Detecting Outliers

Using the IQR Method

Outliers are values that are unusually high or low compared to the rest of the data. The IQR method is commonly used to detect outliers:

Lower Fence: $ Q_1 - 1.5 \times \text{IQR} $
Upper Fence: $ Q_3 + 1.5 \times \text{IQR} $
Any value below the lower fence or above the upper fence is considered an outlier.
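
A minimal sketch of the fence calculation, with the lower fence at Q1 minus 1.5 times the IQR and the upper fence at Q3 plus 1.5 times the IQR. The dataset is hypothetical, the function names are illustrative, and the `"inclusive"` quartile method of `statistics.quantiles` is an assumption; other quartile conventions can shift the fences slightly.

```python
from statistics import quantiles


def iqr_fences(data, k=1.5):
    """Return (lower_fence, upper_fence) based on Q1, Q3 and the IQR."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(data, k=1.5):
    """Values strictly below the lower fence or above the upper fence."""
    lower, upper = iqr_fences(data, k)
    return [x for x in data if x < lower or x > upper]


# Hypothetical data with one suspiciously large value.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

print(iqr_fences(values))     # (-3.5, 14.5)
print(find_outliers(values))  # [30]
```

Here Q1 = 3.25 and Q3 = 7.75, so the IQR is 4.5 and only the value 30 falls outside the fences.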

Boxplots and Data Distribution Shapes

Visualizing Data Spread and Skewness

A boxplot (box-and-whisker plot) visually displays the five-number summary and highlights the distribution's shape and potential outliers.

Skewed Right: Tail extends to the right (higher values).
Symmetric: Data are evenly distributed around the center.
Skewed Left: Tail extends to the left (lower values).

Boxplots are useful for comparing distributions and identifying skewness and outliers.

Additional info: Histograms and boxplots together provide a comprehensive view of data distribution, center, and spread.