research-methods-lab sampling · experiments · correlation · surveys · measurement

1. Population, sample & sampling error

A population of dots with a true proportion $p$ of "successes". Draw a random sample and the estimate $\hat p$ jiggles around the truth. As $n$ grows, the typical size of the sampling error $\hat p - p$ (the standard error) shrinks like $1/\sqrt{n}$. Biased sampling instead drifts to a systematically wrong value, no matter how large $n$ gets.

estimate $\hat p$
error $\hat p - p$
std. error
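
The $1/\sqrt{n}$ behaviour can be checked with a small simulation. This is a minimal sketch, not the demo's own code: the true proportion `p_true = 0.3`, the sample sizes, and the function names are illustrative assumptions.

```python
import math
import random

def sample_proportion(p, n, rng):
    """Draw n Bernoulli(p) observations and return the sample proportion."""
    return sum(rng.random() < p for _ in range(n)) / n

def standard_error(p, n):
    """Theoretical standard error of a sample proportion."""
    return math.sqrt(p * (1 - p) / n)

rng = random.Random(0)   # fixed seed so the run is repeatable
p_true = 0.3             # hypothetical true proportion (an assumption)
for n in (25, 100, 400, 1600):
    p_hat = sample_proportion(p_true, n, rng)
    se = standard_error(p_true, n)
    print(f"n={n:5d}  p_hat={p_hat:.3f}  error={p_hat - p_true:+.3f}  SE={se:.3f}")
```

Each fourfold increase in $n$ halves the standard error, while any single run's error can still land above or below it by chance.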

2. Sample size vs margin of error

For a proportion at confidence level $C$, the margin of error is $$E = z^\ast\sqrt{\dfrac{\hat p(1-\hat p)}{n}},$$ where $z^\ast$ is the standard normal critical value for $C$ (about $1.96$ at $95\%$). The curve shows how $E$ falls as $n$ grows: quartering the error needs roughly $16\times$ the sample. A $95\%$ confidence interval $\hat p \pm E$ is drawn for the current $n$.

$z^\ast$
margin $E$
interval
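
The margin-of-error formula can be evaluated directly. The sketch below uses Python's standard-library `NormalDist` for the critical value; the helper names and the example inputs ($\hat p = 0.5$, a target margin of $0.03$) are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def z_star(confidence):
    """Two-sided standard normal critical value for a confidence level."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

def margin_of_error(p_hat, n, confidence=0.95):
    """E = z* sqrt(p_hat (1 - p_hat) / n)."""
    return z_star(confidence) * math.sqrt(p_hat * (1 - p_hat) / n)

def required_n(p_hat, target_e, confidence=0.95):
    """Smallest n whose margin of error is at most target_e."""
    return math.ceil(z_star(confidence) ** 2 * p_hat * (1 - p_hat) / target_e ** 2)

p_hat = 0.5  # hypothetical estimate; 0.5 gives the widest interval
for n in (100, 1600):
    e = margin_of_error(p_hat, n)
    print(f"n={n:5d}  E={e:.4f}  interval=({p_hat - e:.4f}, {p_hat + e:.4f})")
print("n needed for E <= 0.03:", required_n(p_hat, 0.03))
```

Going from $n=100$ to $n=1600$ multiplies the sample by $16$ and divides $E$ by $4$.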

3. Correlation & scatterplots

Drag the slider to set the target correlation $r$ and resample a cloud of points. The least-squares line and Pearson $r$ update live. A strong line is not proof of cause — see the next two demos for why.

measured $r$
$r^2$ (variance shared)
slope
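
One way to produce a cloud with a chosen target correlation is to mix a standard normal $x$ with independent noise. This is a sketch of that construction, not necessarily how the demo generates its points; `correlated_cloud` and `pearson_and_slope` are hypothetical names.

```python
import math
import random

def correlated_cloud(r_target, n, rng):
    """Points with population correlation r_target: y = r x + sqrt(1 - r^2) e."""
    points = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        e = rng.gauss(0, 1)
        points.append((x, r_target * x + math.sqrt(1 - r_target ** 2) * e))
    return points

def pearson_and_slope(points):
    """Return (Pearson r, least-squares slope of y on x)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return sxy / math.sqrt(sxx * syy), sxy / sxx

rng = random.Random(0)
r, slope = pearson_and_slope(correlated_cloud(0.6, 200, rng))
print(f"measured r={r:.3f}  r^2={r * r:.3f}  slope={slope:.3f}")
```

The measured $r$ differs from the target in any finite sample, which is why it changes on every resample.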

4. Confounding & randomization

A confounder $Z$ affects both the treatment assignment and the outcome, faking a treatment effect. Turn on randomization to assign groups by coin flip — this breaks the $Z\rightarrow$ treatment arrow, so the naive difference collapses toward the true effect.

naive difference
vs true effect
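
The effect of randomization can be sketched with a toy data-generating model. Everything here is an assumption made for illustration: $Z$ is standard normal, it adds $2Z$ to the outcome, and without randomization units with high $Z$ are more likely to be treated.

```python
import random
from statistics import mean

def naive_difference(n, true_effect, randomize, rng):
    """Mean outcome of treated minus mean outcome of control."""
    treated, control = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)                      # confounder
        if randomize:
            t = rng.random() < 0.5               # coin flip, independent of z
        else:
            t = z + rng.gauss(0, 1) > 0          # z pushes units into treatment
        y = true_effect * t + 2.0 * z + rng.gauss(0, 1)
        (treated if t else control).append(y)
    return mean(treated) - mean(control)

rng = random.Random(0)
print("confounded:", round(naive_difference(20000, 1.0, False, rng), 2))
print("randomized:", round(naive_difference(20000, 1.0, True, rng), 2))
```

With the coin flip, $Z$ is balanced across groups on average, so the group difference estimates the true effect of $1.0$ rather than the effect plus the imbalance in $Z$.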

5. Simpson's paradox

Two subgroups each show a positive trend, yet pooling them can reverse the direction. Slide the group separation to see the aggregate line flip against the within-group lines — a vivid third-variable warning.

group A slope
group B slope
pooled slope
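
The reversal can be reproduced with two synthetic groups. In this sketch, group B is shifted right in $x$ and down in $y$ by the same `separation`; the within-group slope of $0.5$, the noise level, and the function names are assumptions.

```python
import random

def slope(points):
    """Least-squares slope of y on x."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    return sxy / sxx

def make_group(cx, cy, n, rng, within_slope=0.5, noise=0.3):
    """Cloud centred at (cx, cy) with a positive within-group trend."""
    points = []
    for _ in range(n):
        x = rng.gauss(cx, 1.0)
        points.append((x, cy + within_slope * (x - cx) + rng.gauss(0, noise)))
    return points

def simpson_slopes(separation, n=500, seed=0):
    """Return (group A slope, group B slope, pooled slope)."""
    rng = random.Random(seed)
    a = make_group(0.0, 0.0, n, rng)
    b = make_group(separation, -separation, n, rng)
    return slope(a), slope(b), slope(a + b)

for sep in (0.0, 1.0, 4.0):
    a, b, pooled = simpson_slopes(sep)
    print(f"separation={sep:.1f}  A={a:+.2f}  B={b:+.2f}  pooled={pooled:+.2f}")
```

Under these assumptions the pooled slope turns negative once the between-group trend outweighs the within-group one, while both group slopes stay positive.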

6. Two-group experiment & effect size

A control and a treatment group with adjustable means and spread. The standardized effect size is Cohen's $d=\dfrac{\bar x_T-\bar x_C}{s}$, where $s$ is the pooled standard deviation of the two groups, and a two-sample test gives the $t$ statistic and an approximate $p$-value.

Cohen's $d$
$t$
$p$-value
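
The three readouts can be computed from two lists of scores. This sketch assumes a pooled-variance (Student) two-sample test and approximates the $p$-value with the normal distribution rather than the $t$ distribution, which is reasonable only for moderately large groups; the demo may use a different test.

```python
import math
from statistics import NormalDist, mean, variance

def two_group_summary(control, treatment):
    """Return (Cohen's d, t statistic, approximate two-sided p-value)."""
    n_c, n_t = len(control), len(treatment)
    diff = mean(treatment) - mean(control)
    # Pooled standard deviation from the two sample variances.
    pooled_var = ((n_c - 1) * variance(control) +
                  (n_t - 1) * variance(treatment)) / (n_c + n_t - 2)
    s = math.sqrt(pooled_var)
    d = diff / s
    t = diff / (s * math.sqrt(1 / n_c + 1 / n_t))
    # Normal approximation to the t distribution (an assumption of this sketch).
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return d, t, p

control = [1, 2, 3, 4, 5]        # made-up scores for illustration
treatment = [3, 4, 5, 6, 7]
d, t, p = two_group_summary(control, treatment)
print(f"d={d:.3f}  t={t:.3f}  p~{p:.4f}")
```

Note that $d$ does not depend on sample size, whereas $t$ grows with it: the same $d$ gives a smaller $p$-value in a larger study.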

7. Type I & Type II error

Two sampling distributions: $H_0$ (no effect) and $H_1$ (true effect). The decision threshold at significance $\alpha$ splits the picture into $\alpha$ (false positive), $\beta$ (false negative) and power $1-\beta$.

$\alpha$ (type I)
$\beta$ (type II)
power $1-\beta$
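
For two normal sampling distributions the three areas follow from the threshold alone. A minimal sketch, assuming a one-sided test with $H_0$ centred at $0$, $H_1$ centred at the true effect, and a common standard error; `error_rates` is a hypothetical helper name.

```python
from statistics import NormalDist

def error_rates(effect, se, alpha=0.05):
    """Return (decision threshold, beta, power) for a one-sided z test."""
    h0 = NormalDist(0.0, se)        # sampling distribution under no effect
    h1 = NormalDist(effect, se)     # sampling distribution under the true effect
    threshold = h0.inv_cdf(1 - alpha)   # reject H0 above this value
    beta = h1.cdf(threshold)            # H1 is true but we fail to reject
    return threshold, beta, 1 - beta

for effect in (0.0, 1.0, 2.8):
    thr, beta, power = error_rates(effect, se=1.0)
    print(f"effect={effect:.1f}  threshold={thr:.3f}  beta={beta:.3f}  power={power:.3f}")
```

When the effect is $0$ the two curves coincide and power equals $\alpha$; lowering $\alpha$ moves the threshold right, which raises $\beta$.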

8. p-hacking & multiple comparisons

When nothing is real, each test still produces a false positive ($p<\alpha$) with probability $\alpha$. Run $k$ independent tests and the chance of at least one "significant" result is $1-(1-\alpha)^k$, so testing many things almost guarantees a fluke.

$P(\ge 1 \text{ false pos})$
this run: "significant"
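
The formula $1-(1-\alpha)^k$ and a single simulated run can be compared side by side. This sketch relies on the fact that a $p$-value is uniform on $[0, 1]$ under a true null; the function names are illustrative.

```python
import random

def familywise_rate(alpha, k):
    """Chance of at least one false positive in k independent null tests."""
    return 1 - (1 - alpha) ** k

def simulate_run(alpha, k, rng):
    """Count 'significant' results when every null is true.

    Under a true null the p-value is uniform on [0, 1], so each test
    falls below alpha with probability alpha.
    """
    return sum(rng.random() < alpha for _ in range(k))

rng = random.Random(0)
for k in (1, 5, 20, 100):
    rate = familywise_rate(0.05, k)
    hits = simulate_run(0.05, k, rng)
    print(f"k={k:3d}  P(>=1 false pos)={rate:.3f}  this run: {hits} significant")
```

At $\alpha = 0.05$, twenty tests already give roughly a $64\%$ chance of at least one fluke.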

9. Survey question bias (Likert)

The same question, asked neutrally or with a leading / loaded framing, shifts the response distribution. Watch the 5-point Likert bars and the mean response slide as you change the wording bias and acquiescence (yea-saying) tendency.

mean response
% agree (4–5)
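
The page does not state the response model behind the bars, so the following is only a toy sketch under loudly stated assumptions: a latent attitude drawn from a normal distribution centred on the neutral point, shifted by a hypothetical `wording_bias` and `acquiescence`, then rounded and clipped to the 1 to 5 scale.

```python
import random

def likert_responses(n, wording_bias, acquiescence, rng):
    """Simulated 5-point responses under an assumed latent-attitude model."""
    responses = []
    for _ in range(n):
        latent = rng.gauss(3.0, 1.0) + wording_bias + acquiescence
        responses.append(min(5, max(1, round(latent))))  # clip to 1..5
    return responses

def summarize(responses):
    """Return (mean response, percent choosing 4 or 5)."""
    n = len(responses)
    return sum(responses) / n, 100 * sum(r >= 4 for r in responses) / n

rng = random.Random(0)
for bias in (0.0, 0.5, 1.0):
    m, agree = summarize(likert_responses(2000, bias, 0.0, rng))
    print(f"wording bias={bias:.1f}  mean={m:.2f}  % agree={agree:.1f}")
```

In this model both knobs simply shift the latent attitude upward, so they are indistinguishable in the output; a real survey would need balanced item wording to separate them.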

10. Reliability vs validity

The classic dartboard analogy. Reliability is consistency (tight cluster); validity is hitting the true target (bullseye). A measure can be reliable but biased, valid but noisy, or — the goal — both.

spread (scatter)
bias (off-center)
verdict
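
The two readouts can be defined from the dart positions themselves. A minimal sketch, assuming spread is the root-mean-square distance of the darts from their own centre and bias is the distance from that centre to the target; the demo may define them differently.

```python
import math
import random

def throw_darts(n, bias_x, bias_y, spread, rng):
    """Darts scattered around (bias_x, bias_y) with the given spread."""
    return [(rng.gauss(bias_x, spread), rng.gauss(bias_y, spread)) for _ in range(n)]

def spread_and_bias(darts, target=(0.0, 0.0)):
    """Return (RMS scatter around the centroid, centroid distance from target)."""
    n = len(darts)
    cx = sum(x for x, _ in darts) / n
    cy = sum(y for _, y in darts) / n
    spread = math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in darts) / n)
    bias = math.hypot(cx - target[0], cy - target[1])
    return spread, bias

rng = random.Random(0)
cases = {
    "reliable but biased": throw_darts(200, 2.0, 1.0, 0.2, rng),
    "valid but noisy": throw_darts(200, 0.0, 0.0, 1.5, rng),
    "reliable and valid": throw_darts(200, 0.0, 0.0, 0.2, rng),
}
for label, darts in cases.items():
    s, b = spread_and_bias(darts)
    print(f"{label:20s} spread={s:.2f}  bias={b:.2f}")
```

A verdict would come from comparing each number to a cutoff, and where to put those cutoffs is a judgment call rather than something the data decide.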