Numerically Summarizing Data: Measures of Central Tendency and Dispersion
Chapter 3: Measures of Central Tendency
Mean, Median, and Mode
Measures of central tendency are statistical values that describe the center or typical value of a dataset. The three most common measures are the mean, median, and mode.
Mean (Arithmetic Average): The sum of all data values divided by the number of values. For a sample, the mean is denoted as \( \bar{x} \); for a population, it is \( \mu \).
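Sample mean formula: \( \bar{x} = \frac{\sum x_i}{n} \), where \( n \) is the sample size.
Population mean formula: \( \mu = \frac{\sum x_i}{N} \), where \( N \) is the population size.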
Not resistant to outliers (extreme values can significantly affect the mean).
Median: The middle value when data are arranged in ascending order. If the number of observations is even, the median is the average of the two middle values.
Resistant to outliers (extreme values have little or no effect on the median).
Mode: The value that appears most frequently in the dataset. A dataset may have one mode, more than one mode, or no mode at all.
Example: For the dataset {1, 2, 2, 3, 4}:
Mean: \( \bar{x} = \frac{1 + 2 + 2 + 3 + 4}{5} = \frac{12}{5} = 2.4 \)
Median: 2
Mode: 2
Additional info: The mean is sensitive to outliers, while the median provides a better measure of center for skewed distributions.
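The example above can be checked with a short sketch using Python's standard `statistics` module. The extra value 100 appended at the end is a hypothetical outlier, not part of the original example; it is there only to illustrate that the mean shifts sharply while the median hardly changes.

```python
import statistics

data = [1, 2, 2, 3, 4]

mean = statistics.mean(data)         # sum of the values divided by their count
median = statistics.median(data)     # middle value of the sorted data
modes = statistics.multimode(data)   # list of every most-frequent value

print("mean:", mean, "median:", median, "mode(s):", modes)

# Hypothetical outlier (an assumption for illustration only).
with_outlier = data + [100]
mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)

print("with outlier -> mean:", round(mean_out, 2), "median:", median_out)
```

`multimode` returns a list, so it also covers datasets with more than one mode.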
Measures of Dispersion
Range, Standard Deviation, and Variance
Measures of dispersion describe the spread or variability of a dataset.
Range: The difference between the maximum and minimum values.
Formula: \( \text{Range} = x_{\max} - x_{\min} \)
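Standard Deviation (SD): Measures the typical distance of data values from the mean, computed as the square root of the average squared deviation (with \( n - 1 \) in the denominator for a sample).
Sample standard deviation: \( s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \)
Population standard deviation: \( \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} \)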
Variance: The square of the standard deviation.
Sample variance: \( s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \)
Population variance: \( \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \)
Example: For the dataset {2, 4, 4, 4, 5, 5, 7, 9}:
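Mean: \( \bar{x} = \frac{40}{8} = 5 \)
Range: \( 9 - 2 = 7 \)
Sample standard deviation: \( s = \sqrt{\frac{32}{7}} \approx 2.14 \), since the squared deviations from the mean sum to 32 and \( n - 1 = 7 \).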
Additional info: The standard deviation is more informative than the range because it uses all data values.
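As a rough check of the example above, the following sketch computes the range and both the sample and population versions of the variance and standard deviation. It relies only on the standard `statistics` module; the variable names are illustrative.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)        # maximum minus minimum

sample_var = statistics.variance(data)    # divides by n - 1
sample_sd = statistics.stdev(data)        # square root of the sample variance

pop_var = statistics.pvariance(data)      # divides by N
pop_sd = statistics.pstdev(data)          # square root of the population variance

print("range:", data_range)
print("sample variance:", round(sample_var, 3), "sample SD:", round(sample_sd, 3))
print("population variance:", pop_var, "population SD:", pop_sd)
```

The sample values are slightly larger than the population values because the divisor is \( n - 1 \) rather than \( N \).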
Empirical Rule and Chebyshev's Inequality
Describing Data Spread in Bell-Shaped Distributions
The Empirical Rule applies to bell-shaped (normal) distributions and provides approximate percentages of data within certain standard deviations from the mean:
About 68% of data fall within 1 standard deviation (\( \mu \pm 1\sigma \))
About 95% within 2 standard deviations (\( \mu \pm 2\sigma \))
About 99.7% within 3 standard deviations (\( \mu \pm 3\sigma \))
Chebyshev's Inequality applies to any data set, regardless of shape, and states that at least \( \left(1 - \frac{1}{k^2}\right) \cdot 100\% \) of the data lie within \( k \) standard deviations of the mean (for \( k > 1 \)).
Example: For \( k = 2 \), at least \( 1 - \frac{1}{2^2} = \frac{3}{4} \) (75%) of data are within 2 standard deviations of the mean.
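To make the bound concrete, here is a minimal sketch that evaluates Chebyshev's lower bound for a few values of \( k \) and compares it with the fraction actually observed in a small dataset. The function names `chebyshev_bound` and `fraction_within` are hypothetical helpers, and the data are the eight values from the dispersion example, treated here as a population.

```python
import statistics

def chebyshev_bound(k):
    """Minimum fraction of data within k standard deviations (requires k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k**2

def fraction_within(data, k):
    """Observed fraction of values within k population SDs of the mean."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    inside = [x for x in data if abs(x - mu) <= k * sigma]
    return len(inside) / len(data)

data = [2, 4, 4, 4, 5, 5, 7, 9]
for k in (1.5, 2, 3):
    print(f"k={k}: bound={chebyshev_bound(k):.4f}, observed={fraction_within(data, k):.4f}")
```

The observed fraction is never below the bound; for bell-shaped data it is usually far above it, which is why the Empirical Rule gives larger percentages.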
Grouped Data: Approximating the Mean and Standard Deviation
Using Frequency Tables
When data are grouped into classes, the mean and standard deviation can be approximated using class midpoints and frequencies.
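Mean from grouped data: \( \bar{x} \approx \frac{\sum f_i m_i}{\sum f_i} \)
Where \( f_i \) is the frequency and \( m_i \) is the class midpoint of the \( i \)-th class.
Standard deviation from grouped data (sample form): \( s \approx \sqrt{\frac{\sum f_i (m_i - \bar{x})^2}{\left(\sum f_i\right) - 1}} \)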
Example Table:
| Class Interval | Midpoint (m) | Frequency (f) |
|---|---|---|
| 10-19 | 14.5 | 2 |
| 20-29 | 24.5 | 5 |
| 30-39 | 34.5 | 3 |
Mean: \( \bar{x} \approx \frac{(2)(14.5) + (5)(24.5) + (3)(34.5)}{2 + 5 + 3} = \frac{255}{10} = 25.5 \)
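The grouped-data approximation for the table above can be sketched as follows. The standard deviation uses the sample form (dividing by the total frequency minus 1), which is an assumption; divide by the total frequency instead if the table describes a whole population.

```python
import math

# Class midpoints and frequencies taken from the example table.
midpoints = [14.5, 24.5, 34.5]
frequencies = [2, 5, 3]

n = sum(frequencies)  # total number of observations

# Each midpoint stands in for every observation in its class.
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Sum of squared deviations of the midpoints, weighted by frequency.
squared_dev = sum(f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints))
grouped_sd = math.sqrt(squared_dev / (n - 1))  # sample form (assumption)

print("approximate mean:", grouped_mean)
print("approximate sample SD:", round(grouped_sd, 3))
```

Both results are approximations because the individual values inside each class are unknown.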
Weighted Mean
Calculating Averages with Different Weights
The weighted mean is used when different data values contribute unequally to the mean.
Formula: \( \bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} \)
Where \( w_i \) is the weight and \( x_i \) is the value.
Example: Calculating GPA with grades and credit hours:
Suppose: 2 units of A (4.0), 5 units of B (3.0), 2 units of C (2.0)
GPA: \( \frac{(2)(4.0) + (5)(3.0) + (2)(2.0)}{2 + 5 + 2} = \frac{27}{9} = 3.0 \)
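A small sketch of the weighted mean applied to the GPA example follows. The helper name `weighted_mean` is hypothetical; the grade points and credit units are the ones stated above.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight

grade_points = [4.0, 3.0, 2.0]   # A, B, C
credit_units = [2, 5, 2]         # units earned at each grade

gpa = weighted_mean(grade_points, credit_units)
print("GPA:", gpa)
```

With equal weights the function reduces to the ordinary arithmetic mean.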
z-Scores and Standardization
Comparing Data Values
A z-score indicates how many standard deviations a value is from the mean. It allows comparison across different datasets or distributions.
Formula: \( z = \frac{x - \mu}{\sigma} \)
For a sample: \( z = \frac{x - \bar{x}}{s} \)
A positive z-score means the value is above the mean; a negative z-score means below the mean.
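Example: If a value lies exactly one standard deviation above the mean, so that \( x = \mu + \sigma \), then \( z = \frac{(\mu + \sigma) - \mu}{\sigma} = 1 \).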
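The z-score formula can be sketched in a few lines. The function name `z_score` is a hypothetical helper, and the numbers reuse the dataset {2, 4, 4, 4, 5, 5, 7, 9}, treated here as a population, purely for illustration.

```python
import statistics

def z_score(x, center, spread):
    """Number of standard deviations that x lies from the center."""
    if spread <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - center) / spread

data = [2, 4, 4, 4, 5, 5, 7, 9]
mu = statistics.mean(data)       # population mean
sigma = statistics.pstdev(data)  # population standard deviation

z_high = z_score(9, mu, sigma)   # value above the mean -> positive z
z_low = z_score(2, mu, sigma)    # value below the mean -> negative z

print("z for 9:", z_high)
print("z for 2:", z_low)
```

For a sample, pass the sample mean and `statistics.stdev(data)` instead.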
Percentiles, Quartiles, and the Five-Number Summary
Describing Data Position and Spread
Percentiles: Indicate the value below which a given percentage of observations fall. For example, the 90th percentile is the value below which 90% of the data lie.
Quartiles: Divide the data into four equal parts:
Q1: 25th percentile
Q2: 50th percentile (median)
Q3: 75th percentile
Interquartile Range (IQR): Measures the spread of the middle 50% of data.
Formula: \( \text{IQR} = Q_3 - Q_1 \)
Five-number summary: Minimum, Q1, Median (Q2), Q3, Maximum
Example: For the dataset {1, 2, 3, 4, 5, 6, 7, 8, 9}, with the median included in each half when finding the quartiles:
Min: 1
Q1: 3
Median: 5
Q3: 7
Max: 9
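The five-number summary above can be reproduced with the standard library, as sketched below. Quartile conventions differ between textbooks and software: `statistics.quantiles` with `method="inclusive"` gives Q1 = 3 and Q3 = 7 for this dataset, while its default `method="exclusive"` gives 2.5 and 7.5. The helper name `five_number_summary` is hypothetical.

```python
import statistics

def five_number_summary(data, method="inclusive"):
    """Return (min, Q1, median, Q3, max) using the chosen quartile method."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method=method)
    return min(data), q1, q2, q3, max(data)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

summary = five_number_summary(data)
iqr = summary[3] - summary[1]    # Q3 - Q1

print("five-number summary:", summary)
print("IQR:", iqr)
print("exclusive method:", five_number_summary(data, method="exclusive"))
```

Whichever convention is used, it should be applied consistently when comparing datasets.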
Detecting Outliers
Using the IQR Method
Outliers are values that are unusually high or low compared to the rest of the data. The IQR method is commonly used to detect outliers:
Lower Fence: \( Q_1 - 1.5 \times \text{IQR} \)
Upper Fence: \( Q_3 + 1.5 \times \text{IQR} \)
Any value below the lower fence or above the upper fence is considered an outlier.
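The fence rule can be sketched as follows. The helper names `iqr_fences` and `find_outliers` are hypothetical, the quartiles use the inclusive convention of `statistics.quantiles` (an assumption, since other conventions shift the fences slightly), and the value 30 is a made-up point added to show a flagged outlier.

```python
import statistics

def iqr_fences(data):
    """Return (lower fence, upper fence) based on 1.5 * IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(data):
    """Return the values that fall outside the fences."""
    lower, upper = iqr_fences(data)
    return [x for x in data if x < lower or x > upper]

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_extreme = clean + [30]      # 30 is a hypothetical extreme value

print("fences:", iqr_fences(clean), "outliers:", find_outliers(clean))
print("fences:", iqr_fences(with_extreme), "outliers:", find_outliers(with_extreme))
```

Note that adding the extreme value also moves the quartiles, so the fences are recomputed for each dataset.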
Boxplots and Data Distribution Shapes
Visualizing Data Spread and Skewness
A boxplot (box-and-whisker plot) visually displays the five-number summary and highlights the distribution's shape and potential outliers.
Skewed Right: Tail extends to the right (higher values).
Symmetric: Data are evenly distributed around the center.
Skewed Left: Tail extends to the left (lower values).
Boxplots are useful for comparing distributions and identifying skewness and outliers.
Additional info: Histograms and boxplots together provide a comprehensive view of data distribution, center, and spread.
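One rough way to read skewness off a boxplot is to compare the two halves of the box: a longer upper half (median to Q3) suggests a right skew, and a longer lower half (Q1 to median) suggests a left skew. The sketch below encodes only that heuristic; it ignores whisker lengths, the function name `box_shape` is hypothetical, and both skewed datasets are made-up illustrations.

```python
import statistics

def box_shape(data):
    """Classify shape by comparing the two halves of the box (a heuristic)."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    lower_half = q2 - q1     # distance from Q1 to the median
    upper_half = q3 - q2     # distance from the median to Q3
    if upper_half > lower_half:
        return "skewed right"
    if upper_half < lower_half:
        return "skewed left"
    return "roughly symmetric"

right_tailed = [1, 2, 2, 3, 3, 4, 5, 9, 15]      # hypothetical data
balanced = [1, 2, 3, 4, 5, 6, 7, 8, 9]
left_tailed = [1, 7, 11, 12, 13, 13, 14, 14, 15]  # hypothetical data

for name, values in [("right_tailed", right_tailed),
                     ("balanced", balanced),
                     ("left_tailed", left_tailed)]:
    print(name, "->", box_shape(values))
```

A histogram of the same data is a useful cross-check, since small samples can make the box halves unequal by chance.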