Foundations of Descriptive Statistics: Data Types, Experimental Design, and Data Distributions
Study Guide - Smart Notes
Chapter 1: Data Organization and Types
Unstacked vs. Stacked Data Formats
Data can be organized in different formats depending on the variables and the purpose of analysis. Two common formats are stacked and unstacked data.
Unstacked Data: Each group or category is presented in a separate column. For example, ages of men and women are listed in two columns.
Stacked Data: All data are in a single column, with another column indicating the group or category for each entry.
Example: The table below shows unstacked data for ages of men and women:
| Men's Ages | Women's Ages |
|---|---|
| 35 | 33 |
| 39 | 41 |
| 47 | 37 |
| 40 | 35 |
| | 39 |
Reorganized (stacked) format:
| Age | Gender |
|---|---|
| 35 | M |
| 39 | M |
| 47 | M |
| 40 | M |
| 33 | F |
| 41 | F |
| 37 | F |
| 35 | F |
| 39 | F |
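The reorganization above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names `stack` and `unstack` are made up for this example, and the values are the ages from the tables above.

```python
# Unstacked layout: one list per group (ages taken from the tables above).
unstacked = {
    "M": [35, 39, 47, 40],
    "F": [33, 41, 37, 35, 39],
}

def stack(groups):
    """Return (value, group_label) rows, one row per observation."""
    return [(value, label) for label, values in groups.items() for value in values]

def unstack(rows):
    """Rebuild the one-list-per-group layout from (value, group_label) rows."""
    groups = {}
    for value, label in rows:
        groups.setdefault(label, []).append(value)
    return groups

stacked = stack(unstacked)
for age, gender in stacked:
    print(age, gender)
```

Note that the stacked layout handles groups of unequal size naturally (four men, five women), while the unstacked table needs an empty cell.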
Types of Variables
Variables in statistics are classified as either numerical or categorical:
Numerical (Quantitative): Variables that represent measurable quantities (e.g., age, weight).
Categorical (Qualitative): Variables that represent categories or groups (e.g., gender, ethnicity, favorite math class).
Examples:
Age: Numerical
Gender: Categorical
Weight: Numerical
Ethnicity: Categorical
Favorite math class: Categorical
Chapter 2: Experimental Design and Observational Studies
Controlled Experiments
A controlled experiment is a study where participants are randomly assigned to different groups, typically a treatment group and a control group. This design allows researchers to determine causation.
Random Assignment: Participants are assigned to groups by chance, so the groups are similar on average and confounding is reduced.
Treatment Group: Receives the intervention or treatment.
Control Group: Does not receive the treatment (may receive a placebo).
Example: A doctor recruits 100 people, randomly assigns 50 to receive a medication and 50 to receive a placebo, and measures blood pressure weekly.
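Random assignment for an example like this one can be sketched as follows. This is an illustration only: the participant IDs 1 to 100 and the function name `randomly_assign` are assumptions, and the fixed seed is there just to make the split reproducible.

```python
import random

def randomly_assign(participants, seed=None):
    """Shuffle the participants and split them into two equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(participants)  # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

participant_ids = list(range(1, 101))  # hypothetical IDs for 100 participants
treatment, control = randomly_assign(participant_ids, seed=0)
print(len(treatment), len(control))
```

Because chance alone decides who lands in each group, neither the researcher nor the participants can influence the split.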
Observational Studies
In an observational study, researchers observe subjects without assigning treatments. These studies can identify associations but cannot establish causation.
Treatment Variable: The variable that is believed to influence the outcome (e.g., participation in sports).
Response/Outcome Variable: The variable that is measured as the result (e.g., happiness).
Key Point: Because subjects are not randomly assigned, a confounding variable may explain an observed relationship. Observational studies can show association, not causation.
Chapter 3: Describing and Summarizing Data
Components of Distribution
When examining distributions of numerical data, consider the following components:
Shape: The form of the distribution (e.g., symmetric, skewed, bimodal).
Center: The typical value (e.g., mean, median).
Spread: The variability (e.g., range, interquartile range, standard deviation).
Bimodal Distributions
A bimodal distribution has two distinct peaks or modes. This often occurs when data are combined from two different groups.
Example: Heights of all students in a high school. Male and female students tend to have different typical heights, so the combined data can show two peaks.
Interpreting Histograms
Histograms visually display the distribution of numerical data. Key features include:
Symmetry: Whether the left and right sides are mirror images.
Skewness: Whether the data are stretched more to one side.
Modality: The number of peaks (unimodal, bimodal, etc.).
Example: A histogram with two peaks suggests a bimodal distribution.
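To make modality concrete, here is a rough sketch that bins values the way a histogram does and counts the peaks. The heights are made-up illustrative values (in inches), not real data, and `bin_counts` and `count_peaks` are hypothetical helper names. The peak rule (a bin strictly taller than both neighbours) is a crude simplification.

```python
from collections import Counter

def bin_counts(values, bin_width, start):
    """Count the values in each bin [start + k*width, start + (k+1)*width)."""
    counts = Counter(int((v - start) // bin_width) for v in values)
    return [counts.get(k, 0) for k in range(max(counts) + 1)]

def count_peaks(counts):
    """Count bins that are strictly taller than both neighbouring bins."""
    padded = [0] + counts + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )

# Made-up heights for two groups, combined into one dataset.
group_a = [62, 63, 63, 64, 64, 64, 65, 65, 66]
group_b = [68, 69, 69, 70, 70, 70, 71, 71, 72]
counts = bin_counts(group_a + group_b, bin_width=2, start=62)

for k, count in enumerate(counts):
    low = 62 + 2 * k
    print(f"{low}-{low + 2}: " + "*" * count)
print("peaks:", count_peaks(counts))
```

The apparent number of peaks depends on the bin width, so it is worth trying more than one width before calling a distribution bimodal.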
Measures of Center and Spread
Mean ($\bar{x}$): The arithmetic average, $\bar{x} = \frac{\sum x_i}{n}$.
Median: The middle value when data are ordered.
Range: The difference between the largest and smallest values.
Interquartile Range (IQR): The range of the middle 50% of the data: $\text{IQR} = Q_3 - Q_1$, where $Q_1$ and $Q_3$ are the first and third quartiles.
Standard Deviation ($s$): A measure of how spread out the data are from the mean. For a sample with mean $\bar{x}$, $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$.
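These measures can be computed with Python's standard `statistics` module. The sketch below uses the women's ages from Chapter 1. One assumption to flag: quartiles have several conventions, and `method="inclusive"` is only one of them, so a textbook or calculator may report slightly different quartiles and IQR for the same data.

```python
import statistics

ages = [33, 41, 37, 35, 39]  # women's ages from the Chapter 1 table

mean = statistics.mean(ages)
median = statistics.median(ages)
data_range = max(ages) - min(ages)

# Quartile conventions vary; "inclusive" is an assumption made for this sketch.
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

# statistics.stdev is the sample standard deviation (divides by n - 1).
sd = statistics.stdev(ages)

print("mean:", mean)
print("median:", median)
print("range:", data_range)
print("IQR:", iqr)
print("standard deviation:", round(sd, 2))
```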
Z-Score
The z-score measures how many standard deviations a value is from the mean:
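Formula: $z = \frac{x - \bar{x}}{s}$, where $x$ is the value, $\bar{x}$ is the mean, and $s$ is the standard deviation.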
Interpretation: A z-score near 0 means the value is close to the mean; a z-score above 2 or below -2 is considered unusual.
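As a small worked example, the sketch below computes the z-score of one value relative to its dataset, again using the women's ages from Chapter 1. The helper names `z_score` and `is_unusual` are made up for illustration, and the cutoff of 2 is the rule of thumb stated above, not a strict rule.

```python
import statistics

def z_score(x, data):
    """Number of sample standard deviations that x lies from the mean of data."""
    return (x - statistics.mean(data)) / statistics.stdev(data)

def is_unusual(z, cutoff=2):
    """Rule of thumb: flag z-scores above the cutoff or below its negative."""
    return abs(z) > cutoff

ages = [33, 41, 37, 35, 39]  # women's ages from the Chapter 1 table
z = z_score(41, ages)
print(round(z, 2), is_unusual(z))
```

Here the mean is 37 and the sample standard deviation is the square root of 10, so an age of 41 sits a little over one standard deviation above the mean and is not flagged as unusual.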
Boxplots
Boxplots (box-and-whisker plots) visually summarize the distribution of a dataset, showing the median, quartiles, and possible outliers.
Median: The line inside the box.
Interquartile Range (IQR): The width of the box.
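Whiskers: Extend to the smallest and largest values that lie within 1.5 × IQR of the quartiles, that is, no lower than $Q_1 - 1.5 \times \text{IQR}$ and no higher than $Q_3 + 1.5 \times \text{IQR}$, where $Q_1$ and $Q_3$ are the first and third quartiles. Values beyond these limits are plotted individually as possible outliers.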
Example: In a boxplot of marathon times, the median and the range of the middle 50% can be estimated from the plot.
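The numbers behind a boxplot can be sketched as below. The marathon times are hypothetical values in minutes, invented for illustration, and `boxplot_summary` is a made-up helper name. As before, the "inclusive" quartile method is an assumption, and plotting software may place the quartiles slightly differently.

```python
import statistics

def boxplot_summary(values):
    """Return the quartiles, whisker ends, and possible outliers for a boxplot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),   # smallest value inside the fences
        "whisker_high": max(inside),  # largest value inside the fences
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }

# Hypothetical marathon finishing times in minutes (not real data).
times = [180, 195, 200, 205, 210, 215, 220, 230, 320]
summary = boxplot_summary(times)
for name, value in summary.items():
    print(name, value)
```

In this made-up dataset the slowest time falls above the upper fence, so it would be drawn as a separate point rather than as the end of the whisker.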
Skewness and the Mean/Median Relationship
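Right-Skewed Distribution: The mean is typically greater than the median, because the long right tail pulls the mean upward.
Left-Skewed Distribution: The mean is typically less than the median, because the long left tail pulls the mean downward.
Symmetric Distribution: The mean and median are approximately equal.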
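The pattern above can be checked on small datasets. The values below are made up to illustrate each shape and are not real data, and `compare_center` is a hypothetical helper name. With real data the mean and median of a symmetric distribution are rarely exactly equal, so the exact-equality branch is a simplification.

```python
import statistics

def compare_center(values):
    """Return the mean, the median, and how the two compare."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        relation = "mean > median"
    elif mean < median:
        relation = "mean < median"
    else:
        relation = "mean = median"
    return mean, median, relation

# Made-up datasets, one per shape.
right_skewed = [30, 32, 35, 36, 40, 42, 45, 200]  # one very large value
left_skewed = [20, 85, 88, 90, 91, 93, 95, 98]    # one very small value
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right-skewed", right_skewed),
                   ("left-skewed", left_skewed),
                   ("symmetric", symmetric)]:
    print(name, compare_center(data))
```

A single extreme value moves the mean substantially while the median barely changes, which is why the median is often preferred for strongly skewed data.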
Summary Table: Types of Variables
| Variable | Type |
|---|---|
| Age | Numerical |
| Gender | Categorical |
| Weight | Numerical |
| Ethnicity | Categorical |
| Favorite math class | Categorical |
Summary Table: Measures of Center and Spread
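| Measure | Definition | Formula |
|---|---|---|
| Mean | Arithmetic average | $\bar{x} = \frac{\sum x_i}{n}$ |
| Median | Middle value | -- |
| Range | Max - Min | $x_{\max} - x_{\min}$ |
| IQR | Interquartile range | $Q_3 - Q_1$ |
| Standard Deviation | Spread from mean | $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$ |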
Additional info:
When data are strongly skewed, the median is a better measure of center than the mean.
Bimodal distributions often indicate the presence of two distinct groups within the data.
Boxplots are useful for comparing distributions between groups (e.g., male vs. female marathon times).