Foundations of Descriptive Statistics: Data Types, Experimental Design, and Data Distributions

Chapter 1: Data Organization and Types

Unstacked vs. Stacked Data Formats

Data can be organized in different formats depending on the variables and the purpose of analysis. Two common formats are stacked and unstacked data.

Unstacked Data: Each group or category is presented in a separate column. For example, ages of men and women are listed in two columns.
Stacked Data: All data are in a single column, with another column indicating the group or category for each entry.

Example: The table below shows unstacked data for ages of men and women:

Men's Ages	Women's Ages
35	33
39	41
47	37
40	35
	39

Reorganized (stacked) format:

Age	Gender
35	M
39	M
47	M
40	M
33	F
41	F
37	F
35	F
39	F
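
The reorganization above can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed method: the function names `stack` and `unstack` and the group labels are assumptions made for the example, and the ages are the ones in the tables above.

```python
# Unstacked data: one list per group (values taken from the tables above).
unstacked = {
    "M": [35, 39, 47, 40],
    "F": [33, 41, 37, 35, 39],
}

def stack(groups):
    """Return (value, group_label) rows, one row per observation."""
    return [(value, label) for label, values in groups.items() for value in values]

def unstack(rows):
    """Rebuild the per-group lists from stacked (value, group_label) rows."""
    groups = {}
    for value, label in rows:
        groups.setdefault(label, []).append(value)
    return groups

stacked = stack(unstacked)
for age, gender in stacked:
    print(age, gender)
```

Note that the groups do not need the same number of entries: the stacked format handles 4 men and 5 women without leaving an empty cell.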

Types of Variables

Variables in statistics are classified as either numerical or categorical:

Numerical (Quantitative): Variables that represent measurable quantities (e.g., age, weight).
Categorical (Qualitative): Variables that represent categories or groups (e.g., gender, ethnicity, favorite math class).

Examples:

Age: Numerical
Gender: Categorical
Weight: Numerical
Ethnicity: Categorical
Favorite math class: Categorical

Chapter 2: Experimental Design and Observational Studies

Controlled Experiments

A controlled experiment is a study where participants are randomly assigned to different groups, typically a treatment group and a control group. Because random assignment makes the groups similar on average in every other respect, a difference in outcomes can be attributed to the treatment, which is what allows researchers to establish causation.

Random Assignment: Participants are placed into groups by chance, which reduces bias from differences between the groups.
Treatment Group: Receives the intervention or treatment.
Control Group: Does not receive the treatment (may receive a placebo).

Example: A doctor recruits 100 people, randomly assigns 50 to receive a medication and 50 to receive a placebo, and measures blood pressure weekly.
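
One way to carry out the random assignment in this example is to shuffle the participant list and split it in half. The sketch below assumes participants are identified by hypothetical ID numbers 1 to 100; the function name `assign_groups` and the `seed` parameter are illustrative choices, and the seed is there only so the split can be reproduced.

```python
import random

def assign_groups(participants, seed=None):
    """Randomly split participants into two groups of (nearly) equal size."""
    rng = random.Random(seed)      # seed is only for reproducibility
    shuffled = list(participants)  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

participant_ids = list(range(1, 101))  # hypothetical IDs 1..100
treatment, control = assign_groups(participant_ids, seed=42)
print(len(treatment), len(control))
```

Every participant lands in exactly one group, and chance alone decides which one.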

Observational Studies

In an observational study, researchers observe subjects without assigning treatments. These studies can identify associations but cannot establish causation.

Treatment Variable: The variable that is believed to influence the outcome (e.g., participation in sports).
Response/Outcome Variable: The variable that is measured as the result (e.g., happiness).

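Key Point: Observational studies can show that two variables are associated, but they cannot show that one causes the other, because the groups being compared may differ in other ways.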

Chapter 3: Describing and Summarizing Data

Components of Distribution

When examining distributions of numerical data, consider the following components:

Shape: The form of the distribution (e.g., symmetric, skewed, bimodal).
Center: The typical value (e.g., mean, median).
Spread: The variability (e.g., range, interquartile range, standard deviation).

Bimodal Distributions

A bimodal distribution has two distinct peaks or modes. This often occurs when data are combined from two different groups.

Example: The heights of all students in a high school may show two peaks, one for female students and one for male students.

Interpreting Histograms

Histograms visually display the distribution of numerical data. Key features include:

Symmetry: Whether the left and right sides are mirror images.
Skewness: Whether the data are stretched more to one side.
Modality: The number of peaks (unimodal, bimodal, etc.).

Example: A histogram with two peaks suggests a bimodal distribution.
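
To see how modality is read from a histogram, the sketch below counts values in equal-width bins and then counts bins that are taller than both neighbors. The heights are made-up numbers chosen to form two clusters, and `bin_counts` and `count_peaks` are hypothetical helper names. The peak count is a crude check: real data are noisy, and the result depends on the bin width.

```python
def bin_counts(values, start, width, n_bins):
    """Count values in each bin [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

def count_peaks(counts):
    """Count bins strictly taller than both neighbors (edges compare to 0)."""
    padded = [0] + list(counts) + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )

# Made-up heights in cm, for illustration only (two clusters on purpose).
heights = [158, 160, 161, 162, 163, 164, 166, 170,
           175, 177, 178, 179, 180, 181, 183]
counts = bin_counts(heights, start=155, width=5, n_bins=6)

# Text histogram: one row per bin, one star per observation.
for k, c in enumerate(counts):
    low = 155 + 5 * k
    print(f"{low}-{low + 4}: {'*' * c}")
print("peaks:", count_peaks(counts))
```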

Measures of Center and Spread

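Mean ($\bar{x}$): The arithmetic average, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
Median: The middle value when the data are ordered (the average of the two middle values when the number of values is even).
Range: The difference between the largest and smallest values.
Interquartile Range (IQR): The range of the middle 50% of the data: $\text{IQR} = Q_3 - Q_1$.
Standard Deviation ($s$): A measure of how spread out the data are from the mean. For a sample, $s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$.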
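
These measures can be computed with Python's standard `statistics` module. The sketch below uses the women's ages from Chapter 1 and assumes the sample standard deviation (dividing by n - 1). Quartile conventions differ between textbooks and software, so the quartiles from `statistics.quantiles` (default "exclusive" method) may not match a hand calculation exactly.

```python
import statistics

ages = [33, 41, 37, 35, 39]  # women's ages from the Chapter 1 table

mean = statistics.mean(ages)
median = statistics.median(ages)
data_range = max(ages) - min(ages)

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1

sd = statistics.stdev(ages)  # sample standard deviation (n - 1 denominator)

print("mean:", mean)
print("median:", median)
print("range:", data_range)
print("IQR:", iqr)
print("standard deviation:", round(sd, 2))
```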

Z-Score

The z-score measures how many standard deviations a value is from the mean:

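Formula: $z = \frac{x - \bar{x}}{s}$, where $x$ is the value, $\bar{x}$ is the mean, and $s$ is the standard deviation.
Interpretation: A z-score near 0 means the value is close to the mean; a z-score above 2 or below -2 is often considered unusual.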
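
A short sketch of the calculation, assuming the mean and the sample standard deviation are computed from the same data set. The function name `z_score` is an illustrative choice, and the ages are the women's ages from Chapter 1.

```python
import statistics

def z_score(x, data):
    """Number of standard deviations that x lies from the mean of data."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation
    return (x - mean) / sd

ages = [33, 41, 37, 35, 39]
for age in ages:
    print(age, round(z_score(age, ages), 2))
```

The value equal to the mean (37) gets a z-score of 0, and values above the mean get positive z-scores.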

Boxplots

Boxplots (box-and-whisker plots) visually summarize the distribution of a dataset, showing the median, quartiles, and possible outliers.

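Median: The line inside the box.
Interquartile Range (IQR): The length of the box, from the first quartile (Q1) to the third quartile (Q3).
Whiskers: Extend to the smallest and largest values that lie within 1.5 × IQR of the quartiles. Values beyond the whiskers are plotted individually as possible outliers.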

Example: In a boxplot of marathon times, the median and the range of the middle 50% can be estimated from the plot.
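
The numbers behind a boxplot can be computed directly. The sketch below uses made-up marathon times in minutes (not real data), a hypothetical helper named `boxplot_summary`, and the 1.5 × IQR rule for the whiskers. As with any quartile calculation, the exact quartile values depend on the convention used; this one relies on the default method of `statistics.quantiles`.

```python
import statistics

def boxplot_summary(data):
    """Return the quartiles, whisker ends, and possible outliers."""
    values = sorted(data)
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme values still inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Made-up marathon times in minutes, for illustration only.
times = [185, 200, 210, 215, 220, 230, 240, 250, 265, 330]
summary = boxplot_summary(times)
for name, value in summary.items():
    print(name, value)
```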

Skewness and the Mean/Median Relationship

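Right-Skewed Distribution: The mean is typically greater than the median, because the long right tail pulls the mean upward.
Left-Skewed Distribution: The mean is typically less than the median, because the long left tail pulls the mean downward.
Symmetric Distribution: The mean and median are approximately equal.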
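
A quick check of this relationship on three small made-up data sets. The numbers are chosen only to illustrate the pattern, and the left-skewed set is simply the mirror image of the right-skewed one.

```python
import statistics

def compare_center(data):
    """Return (mean, median) so the two can be compared side by side."""
    return statistics.mean(data), statistics.median(data)

# Made-up data, for illustration only.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 20]   # one large value in the right tail
left_skewed = [21 - v for v in right_skewed]  # mirror image: tail on the left
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right-skewed", right_skewed),
                   ("left-skewed", left_skewed),
                   ("symmetric", symmetric)]:
    mean, median = compare_center(data)
    print(f"{name}: mean = {mean:.2f}, median = {median}")
```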

Summary Table: Types of Variables

Variable	Type
Age	Numerical
Gender	Categorical
Weight	Numerical
Ethnicity	Categorical
Favorite math class	Categorical

Summary Table: Measures of Center and Spread

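Measure	Definition	Formula
Mean	Arithmetic average	$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Median	Middle value	--
Range	Largest value minus smallest value	$\text{Max} - \text{Min}$
IQR	Interquartile range	$Q_3 - Q_1$
Standard Deviation	Spread from mean	$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$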

Additional info:

When data are strongly skewed, the median is usually a better measure of center than the mean, because it is not pulled toward extreme values.
Bimodal distributions often indicate the presence of two distinct groups within the data.
Boxplots are useful for comparing distributions between groups (e.g., male vs. female marathon times).