Describing the Relation Between Two Variables: Scatter Diagrams, Correlation, and Regression

Describing the Relation Between Two Variables

Scatter Diagrams and Correlation

Understanding the relationship between two quantitative variables is fundamental in statistics. This section introduces scatter diagrams, correlation, and the interpretation of linear relationships.

Response Variable: The variable whose value is explained by the explanatory (predictor) variable.
Explanatory Variable: The variable that explains or influences changes in the response variable.

Scatter Diagrams

A scatter diagram is a graph that displays the relationship between two quantitative variables measured on the same individuals.
The explanatory variable is plotted on the horizontal axis, and the response variable is plotted on the vertical axis.
Scatter diagrams help identify the type of relationship: linear, nonlinear, or no relation.
Example: Investigating the relationship between club-head speed (mph) and golf ball distance (yards) using a controlled experiment.

Types of Association

Positive Association: As one variable increases, the other also increases.
Negative Association: As one variable increases, the other decreases.

Linear Correlation Coefficient

The linear correlation coefficient (Pearson's r) measures the strength and direction of the linear relationship between two quantitative variables.

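Population correlation coefficient: $\rho$
Sample correlation coefficient: $r$

The formula for the sample correlation coefficient is:

$$r = \frac{\sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1}$$

$x_i$, $y_i$ are the individual sample values
$\bar{x}$, $\bar{y}$ are the sample means
$s_x$, $s_y$ are the sample standard deviations
$n$ is the sample size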
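
As an illustration of the sample correlation coefficient, the sketch below standardizes each value (subtract the sample mean, divide by the sample standard deviation), multiplies the paired z-scores, sums the products, and divides by n - 1. The function name `pearson_r` and the data are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Sample correlation coefficient: sum of z-score products over n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    z_products = [
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)
    ]
    return sum(z_products) / (len(xs) - 1)


# Hypothetical data (not the golf data from these notes).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(f"r = {r:.4f}")
```

For this made-up sample the result is about 0.77, a moderately strong positive linear relation.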

Properties of the Linear Correlation Coefficient

$r = 1$ indicates a perfect positive linear relation
$r = -1$ indicates a perfect negative linear relation
$r$ close to 0 suggests little or no linear relation
$r$ is unitless and not resistant to outliers
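
Two of these properties can be checked numerically. The sketch below uses hypothetical data and a hypothetical helper `pearson_r` (written in an algebraically equivalent sums-of-squares form) to show that rescaling one variable leaves the coefficient unchanged, while a single outlier can pull it far from 1.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation via sums of squares; equivalent to the z-score formula."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical, perfectly linear data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

r_clean = pearson_r(x, y)

# Unitless: converting x to other units (here, multiplying by 1.609) does not change r.
r_rescaled = pearson_r([1.609 * v for v in x], y)

# Not resistant: one extra point far from the pattern changes r sharply.
r_outlier = pearson_r(x + [6], y + [0])

print(f"clean: {r_clean:.3f}, rescaled: {r_rescaled:.3f}, with outlier: {r_outlier:.3f}")
```

With these numbers the coefficient drops from 1 to roughly 0.14 after adding the single point (6, 0).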

Determining the Existence of a Linear Relation

Compare the absolute value of $r$ to a critical value (from statistical tables) based on sample size.
If $|r|$ is greater than the critical value, a linear relation exists.
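
The comparison can be written as a one-line check. In this sketch the function name `linear_relation_exists` is hypothetical, and the critical value 0.878 for a sample of size 5 is an assumption taken from commonly tabulated values; confirm it against the table that accompanies your course materials.

```python
def linear_relation_exists(r, critical_value):
    """Return True when |r| exceeds the tabulated critical value."""
    return abs(r) > critical_value


# Assumed critical value for n = 5 (check your own table before relying on it).
critical_value_n5 = 0.878

# Hypothetical correlation coefficients from two samples of size 5.
weak_result = linear_relation_exists(0.775, critical_value_n5)
strong_result = linear_relation_exists(-0.92, critical_value_n5)

print(weak_result, strong_result)
```

The absolute value matters: a strongly negative coefficient such as -0.92 passes the check just as a strongly positive one would.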

Correlation vs. Causation

Correlation does not imply causation.
A lurking variable is an unmeasured variable that influences both the explanatory and response variables, potentially confounding results.
Example: Higher air-conditioning bills and crime rates are both influenced by temperature, a lurking variable.

Least-Squares Regression

Finding the Least-Squares Regression Line

The least-squares regression line is the line that minimizes the sum of the squared residuals (vertical distances between observed and predicted values).

The equation is:

$$\hat{y} = b_1 x + b_0$$

$b_1 = r \cdot \frac{s_y}{s_x}$ (slope)
$b_0 = \bar{y} - b_1 \bar{x}$ (y-intercept)
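
A minimal sketch of fitting the line by hand, on hypothetical data. The slope is computed as the sum of cross-deviations divided by the sum of squared x-deviations, which is algebraically the same as the correlation coefficient times the ratio of the y and x standard deviations; the intercept is the mean of y minus the slope times the mean of x. The function name `least_squares_line` is an illustrative choice.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx                  # same value as r * s_y / s_x
    intercept = y_bar - slope * x_bar    # forces the line through (x_bar, y_bar)
    return slope, intercept


# Hypothetical data (not the golf data from these notes).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept = least_squares_line(x, y)
print(f"y-hat = {slope:.2f}x + {intercept:.2f}")
```

For this sample the fitted line has slope 0.6 and intercept 2.2.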

Interpretation of Slope and Intercept

Slope ($b_1$): The average change in the response variable for each one-unit increase in the explanatory variable.
Y-intercept ($b_0$): The predicted value of the response variable when the explanatory variable is zero (interpret only if zero is a reasonable value).

Making Predictions

Substitute a value of $x$ into the regression equation to predict $y$.
Residual: The difference between the observed value and the predicted value ($y - \hat{y}$).
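
Prediction and residuals can be sketched as follows. The slope 0.6 and intercept 2.2 are hypothetical values standing in for a fitted line, and the observed point is made up for illustration.

```python
def predict(x, slope, intercept):
    """Predicted response for a given value of the explanatory variable."""
    return slope * x + intercept


def residual(observed_y, predicted_y):
    """Observed minus predicted; positive when the line underestimates."""
    return observed_y - predicted_y


# Hypothetical fitted line and one hypothetical observation.
slope, intercept = 0.6, 2.2
x_new, y_observed = 3, 5

y_hat = predict(x_new, slope, intercept)
res = residual(y_observed, y_hat)
print(f"predicted = {y_hat:.2f}, residual = {res:.2f}")
```

Here the prediction is 4 and the residual is 1, so the observed point lies one unit above the line.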

Example: Regression Output from Statistical Software

The following outputs show the results of fitting a simple linear regression model to the golf data (Distance vs. Club-Head Speed) using Excel, StatCrunch, and R:

Regression output from Excel, StatCrunch, and R

Regression Equation: $\hat{y} = 3.166x + b_0$, where the intercept $b_0$ is the value reported in the software output.
Interpretation: For each additional mile per hour in club-head speed, the distance increases by approximately 3.166 yards, on average.

Sum of Squared Residuals

The least-squares regression line minimizes the sum of squared residuals:

$$\sum \text{residual}^2 = \sum (y_i - \hat{y}_i)^2$$

Any other line will have a larger sum of squared residuals.
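
To see the minimizing property on a small case, this sketch compares the sum of squared residuals for a least-squares line with that of another candidate line. The data, both lines, and the function name `sum_squared_residuals` are hypothetical; the slope 0.6 and intercept 2.2 are the least-squares values for this particular data set.

```python
def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum over all points of (observed y - predicted y) squared."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Least-squares line for this data versus an arbitrary alternative line.
sse_least_squares = sum_squared_residuals(x, y, slope=0.6, intercept=2.2)
sse_other = sum_squared_residuals(x, y, slope=0.75, intercept=1.75)

print(f"least squares: {sse_least_squares:.3f}, other line: {sse_other:.3f}")
```

The least-squares line gives 2.4, while the alternative gives 2.625; trying other slopes and intercepts will not produce a smaller total for this data.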

Diagnostics on the Least-Squares Regression Line

Coefficient of Determination ($R^2$)

Coefficient of determination ($R^2$): Measures the proportion of total variation in the response variable explained by the regression line.

For example, $R^2 = 0.882$ means 88.2% of the variation in distance is explained by club-head speed.
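
One common way to compute the coefficient of determination is one minus the ratio of unexplained variation (sum of squared residuals) to total variation (sum of squared deviations of y from its mean). The sketch below applies that to hypothetical data; the function name `r_squared` is an illustrative choice.

```python
def r_squared(xs, ys):
    """Proportion of variation in y explained by the least-squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    unexplained = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    total = sum((y - y_bar) ** 2 for y in ys)
    return 1 - unexplained / total


# Hypothetical data (not the golf data from these notes).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r2 = r_squared(x, y)
print(f"R^2 = {r2:.3f}")
```

For this sample the value is 0.6, so 60% of the variation in y is explained by the line. With a single explanatory variable this equals the square of the linear correlation coefficient.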

Residual Analysis

Residual plots help assess the appropriateness of the linear model.
If residuals display no pattern and have constant variance, the linear model is appropriate.
Outliers and influential observations can be detected using residual plots and boxplots.

Influential Observations

An influential observation significantly affects the regression line's slope, intercept, or correlation coefficient.
Influence depends on the observation's leverage (distance from the mean of the explanatory variable) and residual size.
Influential points should only be removed with justification; otherwise, consider collecting more data or using robust methods.
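
A rough way to check for influence is to fit the line with and without the suspect observation and compare the results. The sketch below does this for hypothetical data with one added point that lies far from the mean of the explanatory variable (high leverage) and far from the pattern of the other points. The function name `slope_of` is illustrative.

```python
def slope_of(xs, ys):
    """Slope of the least-squares regression line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx


# Hypothetical data plus one hypothetical high-leverage observation.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
suspect_x, suspect_y = 15, 2

slope_without = slope_of(x, y)
slope_with = slope_of(x + [suspect_x], y + [suspect_y])

print(f"without: {slope_without:.3f}, with: {slope_with:.3f}")
```

Including the single point changes the slope from 0.6 to about -0.11, reversing the direction of the fitted relation, which marks the observation as influential.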

Contingency Tables and Association

Contingency Tables

A contingency table (two-way table) displays the frequency distribution of two categorical variables.
Marginal distribution: The totals for each row or column, representing the distribution of one variable regardless of the other.
Conditional distribution: The distribution of one variable for a fixed value of the other variable.

Association Among Categorical Data

Conditional distributions can reveal associations between categorical variables.
For example, employment status is associated with level of education; higher education levels correspond to higher employment rates.
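
Marginal and conditional distributions can be computed directly from the cell counts. The counts below are hypothetical (they are not taken from any survey) and are arranged so that each education level has its own row of employment counts.

```python
# Hypothetical counts: education level -> employment status -> frequency.
table = {
    "High school": {"Employed": 60, "Not employed": 40},
    "Bachelor's": {"Employed": 90, "Not employed": 10},
}
statuses = ["Employed", "Not employed"]

grand_total = sum(sum(counts.values()) for counts in table.values())

# Marginal distribution of education: row totals over the grand total.
marginal_education = {
    edu: sum(counts.values()) / grand_total for edu, counts in table.items()
}

# Marginal distribution of employment status: column totals over the grand total.
marginal_status = {
    s: sum(table[edu][s] for edu in table) / grand_total for s in statuses
}

# Conditional distribution of employment status within each education level.
conditional_status = {
    edu: {s: c / sum(counts.values()) for s, c in counts.items()}
    for edu, counts in table.items()
}

for edu, dist in conditional_status.items():
    print(edu, {s: round(p, 2) for s, p in dist.items()})
```

In this made-up table 60% of the high-school group and 90% of the bachelor's group are employed. Because the conditional distributions differ across education levels, the two variables are associated.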

Simpson’s Paradox

Simpson’s Paradox: An association between two variables reverses or disappears when a third variable is considered.
Example: In university admissions, apparent gender bias disappears when program of study is included as a lurking variable.
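
The reversal can be reproduced with a small set of hypothetical admissions counts (invented for illustration, not real university data). Within each program women are admitted at a higher rate, yet the combined rate favors men because most women applied to the more selective program.

```python
# Hypothetical counts: program -> group -> (admitted, applicants).
admissions = {
    "Program A": {"men": (80, 100), "women": (18, 20)},
    "Program B": {"men": (4, 20), "women": (30, 100)},
}


def program_rate(program, group):
    """Admission rate for one group within one program."""
    admitted, applicants = admissions[program][group]
    return admitted / applicants


def overall_rate(group):
    """Admission rate for one group with the programs combined."""
    admitted = sum(counts[group][0] for counts in admissions.values())
    applicants = sum(counts[group][1] for counts in admissions.values())
    return admitted / applicants


for program in admissions:
    print(program, program_rate(program, "men"), program_rate(program, "women"))
print("Overall", overall_rate("men"), overall_rate("women"))
```

Here the per-program rates are 80% versus 90% and 20% versus 30% (men versus women), while the overall rates are 70% versus 40%. Program of study acts as the lurking variable.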

Additional info: This summary covers the main concepts, definitions, and examples from Chapter 4, including scatter diagrams, correlation, regression, diagnostics, and contingency tables, with academic context added for clarity and completeness.