Chapter 4 STATS

Study Guide - Smart Notes

Tailored notes based on your materials, expanded with key definitions, examples, and context.

Chapter 4: Describing the Relation Between Two Variables

Scatter Diagrams and Correlation

When analyzing data involving two quantitative variables measured on the same individual, it is essential to understand the relationship between them. This section introduces graphical and numerical methods for describing such bivariate data.

Scatter Diagrams

A scatter diagram is a graph that displays the relationship between two quantitative variables. Each point represents an individual in the data set, with the explanatory variable (independent variable) on the horizontal axis and the response variable (dependent variable) on the vertical axis.

Purpose: To visually assess the type of relationship (linear, nonlinear, or no relation) between two variables.
Interpretation: Patterns in the scatter diagram help distinguish between linear, nonlinear, and no association.
Example: Predicting home sale prices using Zestimate values, or analyzing drilling time versus depth.

Types of relationships in scatter diagrams: linear, nonlinear, and no relation

Types of Association

Positive Association: Above-average values of one variable are associated with above-average values of the other. As one increases, so does the other.
Negative Association: Above-average values of one variable are associated with below-average values of the other. As one increases, the other decreases.

Properties of the Linear Correlation Coefficient

The linear correlation coefficient (Pearson product moment correlation coefficient) measures the strength and direction of the linear relationship between two quantitative variables. The population correlation coefficient is denoted by , and the sample correlation coefficient by .

Formula for Sample Correlation Coefficient:

: ith observation of explanatory variable
: sample mean of explanatory variable
: sample standard deviation of explanatory variable
: ith observation of response variable
: sample mean of response variable
: sample standard deviation of response variable
: number of individuals in the sample

Interpretation of Correlation Coefficient

Range:
r = +1: Perfect positive linear relation
r = -1: Perfect negative linear relation
r close to +1: Strong positive association
r close to -1: Strong negative association
r close to 0: Little or no evidence of linear relation (but possibly nonlinear relation)
Unitless: The correlation coefficient is unitless.
Not Resistant: Outliers can significantly affect .

Examples of scatter diagrams with different correlation coefficients

Correlation vs. Causation

Correlation does not imply causation. High correlation between two variables does not mean one causes the other. Causality can only be established through designed experiments, not observational data. Lurking variables may explain observed associations.

Example: Ice cream sales and crime rates are correlated due to temperature, a lurking variable.

Least-Squares Regression

The least-squares regression line is the line that best fits the data by minimizing the sum of squared residuals (errors). It is used to make predictions and interpret relationships between variables.

Equation of the Least-Squares Regression Line

: Slope
: y-intercept

Interpretation of Slope and y-Intercept

Slope: Represents the average change in the response variable for each unit increase in the explanatory variable.
y-Intercept: Represents the predicted value of the response variable when the explanatory variable is zero (if zero is a reasonable value).
Caution: Predictions outside the observed range (extrapolation) may not be valid.

Sum of Squared Residuals

The least-squares regression line minimizes the sum of squared residuals:

Any other line will have a greater sum of squared residuals.

The Coefficient of Determination ()

The coefficient of determination () measures the proportion of total variation in the response variable explained by the regression line.

Range:
Interpretation: means the line explains none of the variation; means it explains all the variation.
Calculation: For linear regression,

Regression plot showing explained and unexplained variation

Decomposition of Variation

Total deviation:
Explained deviation:
Unexplained deviation:
Relationship:

Contingency Tables and Association

Contingency tables (two-way tables) relate two categorical variables and are used to analyze association among categorical data.

Marginal Distribution

Definition: Frequency or relative frequency distribution of either the row or column variable.
Example: Distribution of grades or delivery methods in a course effectiveness study.

Conditional Distribution

Definition: Relative frequency of each category of the response variable, given a specific value of the explanatory variable.
Purpose: To identify association between categorical variables.

Simpson’s Paradox

Simpson’s Paradox occurs when an association between two variables reverses or disappears when a third variable is introduced. This highlights the importance of considering lurking variables in data analysis.

Example: Survival rates for diabetes types appear different until age is considered, revealing a different association.

Type	Survived	Died	Total
Type 1	253	105	358
Type 2	326	218	544
Total	579	323	902

Additional info: When stratified by age, the association between diabetes type and mortality changes, illustrating Simpson's Paradox.