Describing the Relationship Between Two Variables: Scatter Diagrams, Correlation, and Regression

Describing the Relationship Between Two Variables

Scatter Diagrams

Scatter diagrams are graphical tools used to visualize the relationship between two quantitative variables. Each point on the diagram represents a pair of values for the two variables.

Explanatory (Predictor) Variable: Plotted on the horizontal (x) axis. This variable is used to explain variability in the response variable.
Response Variable: Plotted on the vertical (y) axis. This variable is the outcome of interest.
Interpretation: The pattern of points can suggest the type (linear or nonlinear) and strength of the relationship.
Example: A scatter diagram of students' study hours (x) versus exam scores (y) may show a positive association.

Correlation and the Linear Correlation Coefficient

The linear correlation coefficient, denoted by r, measures the strength and direction of the linear relationship between two quantitative variables.

Properties of r:
- r ranges from -1 to 1.
- r = 1: Perfect positive linear relationship.
- r = -1: Perfect negative linear relationship.
- r = 0: No linear relationship (a nonlinear relationship may still exist).
- The absolute value |r| indicates the strength of the linear association; the closer r is to 1 or -1, the stronger the relationship.
- r always has the same sign as the slope of the least-squares regression line, but unlike the slope it is unitless: changing the units of x or y does not change r.
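Formula for r:
$$r = \frac{\sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1}$$
- Where $x_i$ and $y_i$ are the data values, $\bar{x}$ and $\bar{y}$ are the sample means, $s_x$ and $s_y$ are the sample standard deviations of x and y, respectively, and $n$ is the number of data pairs.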
Important Note: Correlation does not imply causation. A strong correlation does not mean that changes in one variable cause changes in the other.
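
The formula for r can be checked by hand on a small data set. The following is a minimal sketch in Python using only the standard library; the function name `linear_correlation` and the study-hours data are made up for illustration and do not come from the course materials.

```python
from statistics import mean, stdev


def linear_correlation(xs, ys):
    """Return r as the sum of products of z-scores divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 pairs")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    z_products = (
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)
    )
    return sum(z_products) / (n - 1)


# Hypothetical data: hours studied (x) and exam score (y)
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]
print(round(linear_correlation(hours, scores), 3))
```

If every x value (or every y value) is identical, the standard deviation is 0 and the division fails, which matches the fact that r is undefined in that case.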

Response vs. Explanatory Variables

Understanding the roles of variables is crucial in statistical analysis:

Explanatory Variable (Predictor): The variable that is manipulated or categorized to observe its effect on the response variable.
Response Variable: The outcome or variable being measured.
Example: In a study of the effect of temperature (x) on ice cream sales (y), temperature is the explanatory variable and sales is the response variable.

Least-Squares Regression Line

The least-squares regression line is the line that best fits the data in a scatter diagram, minimizing the sum of the squared vertical distances (residuals) between the observed values and the line.

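Equation of the Line: $\hat{y} = b_1 x + b_0$
- $\hat{y}$: Predicted value of the response variable
- $b_1$: Slope of the line
- $b_0$: y-intercept
Formulas:
- Slope: $b_1 = r \cdot \frac{s_y}{s_x}$, where r is the linear correlation coefficient and $s_x$, $s_y$ are the sample standard deviations of x and y
- Intercept: $b_0 = \bar{y} - b_1 \bar{x}$, where $\bar{x}$ and $\bar{y}$ are the sample means
Residual: The difference between the observed value and the predicted value: $\text{residual} = y - \hat{y}$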
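Interpretation: A residual closer to 0 (smaller in absolute value) means the line fits that data point better. A positive residual means the point lies above the line; a negative residual means it lies below.
Example: Suppose the observed value at a given $x$ is $y = 295$. Substituting that $x$ into the regression equation gives the predicted value $\hat{y}$, and the residual is $295 - \hat{y}$.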
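
The slope, intercept, and residual calculations can be sketched in a few lines of Python. This is an illustrative sketch, not a prescribed method: it assumes the slope is computed as r times the ratio of sample standard deviations and the intercept from the sample means, and the function names and data are hypothetical.

```python
from statistics import mean, stdev


def least_squares_line(xs, ys):
    """Return (b1, b0) for the line y-hat = b1 * x + b0."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 pairs")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Linear correlation coefficient r
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x        # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b1, b0


def residuals(xs, ys, b1, b0):
    """Return observed y minus predicted y for each data pair."""
    return [y - (b1 * x + b0) for x, y in zip(xs, ys)]


# Hypothetical data: temperature (x) and ice cream sales (y)
temps = [20, 22, 25, 27, 30]
sales = [210, 240, 265, 290, 330]
b1, b0 = least_squares_line(temps, sales)
print(b1, b0)
print(residuals(temps, sales, b1, b0))
```

For a least-squares line the residuals always sum to 0 (up to rounding), which is a quick sanity check on the computed slope and intercept.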

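Coefficient of Determination ($R^2$)

The coefficient of determination, denoted $R^2$, measures the proportion of the variance in the response variable that is explained by the explanatory variable using the regression line.

Formula: $R^2 = \frac{\text{explained variation}}{\text{total variation}} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$. For a least-squares line with one explanatory variable, $R^2 = r^2$.
Interpretation: An $R^2$ value of 0.85 means that 85% of the variability in the response variable is explained by the regression model.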
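
As a rough illustration, the proportion of explained variation can be computed directly from the observed and predicted values. The sketch below assumes the form "1 minus unexplained variation over total variation"; the function name `r_squared` is hypothetical.

```python
from statistics import mean


def r_squared(ys, predicted):
    """Return 1 - (sum of squared residuals) / (total sum of squares)."""
    if len(ys) != len(predicted) or not ys:
        raise ValueError("need two non-empty lists of equal length")
    y_bar = mean(ys)
    unexplained = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    total = sum((y - y_bar) ** 2 for y in ys)
    return 1 - unexplained / total
```

Two boundary cases help with interpretation: predictions that match every observed value give 1, and predicting the mean of y for every point gives 0.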

Correlation vs. Causation

It is important to distinguish between correlation and causation:

Correlation: Indicates a statistical association between two variables.
Causation: Implies that changes in one variable directly cause changes in another.
Note: A strong correlation does not imply a causal relationship. Other factors (confounding variables) may be involved.

Summary Table: Key Concepts in Bivariate Analysis

Concept	Definition	Formula
Scatter Diagram	Graphical display of paired data	—
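Correlation Coefficient (r)	Measures strength and direction of linear relationship	$r = \frac{\sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1}$
Least-Squares Regression Line	Best-fitting line for predicting y from x	$\hat{y} = b_1 x + b_0$
Slope ($b_1$)	Change in predicted y per unit change in x	$b_1 = r \cdot \frac{s_y}{s_x}$
Intercept ($b_0$)	Predicted y when x = 0	$b_0 = \bar{y} - b_1 \bar{x}$
Residual	Observed y minus predicted y	$y - \hat{y}$
Coefficient of Determination ($R^2$)	Proportion of variance explained by the model	$R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$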

Additional info:

Contingency tables are referenced as being covered in later chapters (Chapters 11 and 12), which deal with categorical data analysis.
Some formulas and steps were inferred and expanded for clarity and completeness.