
Worked example: screen-time & focus — a mixed-methods study, start to finish

This page is a single, fully worked research project that threads together every method in the course: framing a testable question, operationalizing fuzzy constructs, drawing a representative sample, writing a clean questionnaire, designing a quasi-experiment, and analyzing a mock dataset with real formulas — then auditing it for bias, confounding and ethics. Treat it as a template for the group field study (Sessions 26–29) and a model of what "doing it properly" looks like.

The scenario. A study team at a university wants to know whether heavy smartphone use is associated with worse sustained attention in first-year students, and whether a simple "notifications-off" intervention measurably improves focus. They run a correlational survey phase and a small quasi-experiment, then combine the qualitative and quantitative findings.

Goal: Link daily screen-time to sustained attention and test an intervention
Type: Mixed-methods (survey + quasi-experiment)
Target sample: n ≈ 180 first-year students
Difficulty: intermediate
Est. effort: ~12–15 h (design + fieldwork + write-up)
Deliverable: APA-style report + conference-style talk

Sessions exercised

Each phase below maps to specific course sessions. The tags throughout the page link back to the relevant session and to the matching interactive demo.

S5 operational definitions
S6 validity · type I/II
S7 sampling
S8 surveys · descriptives
S9 correlation
S10–11 experiments
S20 quasi-experiment
S22 measures & ethics
S26 field study
S27 APA write-up

1 · Research question, constructs & hypotheses

A good study starts from one sharp question and a handful of falsifiable predictions, not a vague topic. The steps below follow the textbook progression from a broad topic to a testable hypothesis.

From topic to question

Topic: "phones and attention" — too broad to test.
Question Q1 (correlational): Among first-year students, is higher daily smartphone screen-time associated with lower sustained-attention performance?
Question Q2 (causal): Does silencing non-essential notifications for one week improve sustained attention relative to a no-change control?

Constructs → operational definitions

Construct	Operationalization
Screen-time	7-day mean of OS-reported daily minutes (Screen Time / Digital Wellbeing screenshot)
Sustained attention	d2-style cancellation task score: correct − errors over 5 min
Self-reported focus	5-item scale, 5-pt Likert, mean of items
Notification load	# of push-enabled apps (count)

Hypotheses

Correlational (Q1). Let $X$ = mean daily screen-time (min) and $Y$ = attention-task score. $$H_0:\ \rho_{XY}=0 \qquad H_1:\ \rho_{XY}<0$$ A one-sided alternative: more screen-time is predicted to go with lower attention.

Experimental (Q2). Let $\mu_T,\mu_C$ be mean post-intervention attention for treatment (notifications off) and control. $$H_0:\ \mu_T-\mu_C = 0 \qquad H_1:\ \mu_T-\mu_C > 0$$

Directionality trap. Even a strong negative $\rho_{XY}$ in Q1 cannot establish that phones cause inattention — reverse causation (inattentive people reach for phones) and third variables (sleep, stress) are live. That is precisely why Q2 adds an experiment.

Session 5 · operational definitions demo: correlation & scatterplots → demo: type I & II error →

2 · Design: sampling, instrument & experiment

2.1 Sampling plan

The target population is first-year undergraduates at one university (≈ 1,200 students). A census is infeasible, so we sample. We use stratified random sampling to guarantee the sample mirrors the population on a variable likely tied to the outcome (degree area).

Frame: the registrar's first-year enrolment list (closest available list to the true population).
Strata: degree area (STEM / business / humanities), since baseline study habits may differ.
Within strata: simple random sampling, allocation proportional to stratum size.
Target size: $n=180$ (set by the power requirement, see the sizing note below); over-recruit to $n=210$ to allow for ~15% attrition.
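The sampling plan above can be sketched in a few lines. This is a minimal illustration, not the study's actual procedure: the stratum sizes in `frame_sizes` are hypothetical (the page gives only the total of about 1,200), and the integer indices stand in for rows of the registrar's list.

```python
import random

# Hypothetical stratum sizes; only the total (about 1,200) comes from the page.
frame_sizes = {"STEM": 600, "business": 360, "humanities": 240}
n_target = 210  # over-recruited sample size from the plan above

N = sum(frame_sizes.values())
# Proportional allocation: stratum h gets n_target * N_h / N, rounded.
allocation = {s: round(n_target * size / N) for s, size in frame_sizes.items()}

rng = random.Random(42)  # fixed seed so the draw can be audited later
# Simple random sampling without replacement inside each stratum.
sample = {
    s: rng.sample(range(size), allocation[s])  # indices into that stratum's list
    for s, size in frame_sizes.items()
}
print(allocation)
```

With other stratum sizes the rounded allocations may not add up to exactly `n_target`; a largest-remainder adjustment is one common way to handle that.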

How big must the sample be? For a proportion estimate at 95% confidence ($z=1.96$) with margin of error $E=0.05$, the worst-case ($\hat p=0.5$) size is $$n=\frac{z^2\,\hat p(1-\hat p)}{E^2}=\frac{1.96^2\cdot 0.25}{0.05^2}\approx 384.$$ With a finite population of $N=1200$ we apply the finite-population correction: $$n_{\text{adj}}=\frac{n}{1+(n-1)/N}=\frac{384}{1+383/1200}\approx 291.$$ That calculation applies to estimating a proportion. For our continuous attention outcome we instead size for the correlation and the experiment (see the power note in section 4), landing on $n\approx 180$, comfortably above the minimum needed to detect $r=-0.25$ with 80% power.
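The two formulas above are easy to check numerically. The sketch below uses only the values already given (95% confidence, $E=0.05$, $N=1200$); it carries the unrounded $n$ into the correction, so the results match the worked figures after rounding.

```python
z, p_hat, E, N = 1.96, 0.5, 0.05, 1200

n0 = z**2 * p_hat * (1 - p_hat) / E**2   # size for an effectively infinite population
n_adj = n0 / (1 + (n0 - 1) / N)          # finite-population correction

print(f"n0 = {n0:.1f}, n_adj = {n_adj:.1f}")
```

In practice the required size is usually rounded up to the next whole participant.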

Why not just post a link? A volunteer/convenience web sample over-represents the highly-engaged and self-selects on the very trait we study (phone habits) — classic selection bias. Stratified random sampling from the enrolment frame is the defensible choice.

Session 7 · sampling from populations demo: population, sample & sampling error → demo: sample size vs margin of error →

2.2 The questionnaire

A short instrument with mixed item types. Note the deliberate use of neutral wording, a reverse-scored item to catch straight-lining, and a behavioral anchor (the OS screenshot) so we are not relying on self-report alone.

#	Item	Type / scale
Q1	Degree area	Categorical (3 options) — stratum check
Q2	Paste your 7-day average daily screen-time (minutes)	Numeric, open — behavioral anchor
Q3	Number of apps with notifications enabled	Numeric, open
Q4	"I can stay focused on one task for 30+ minutes."	5-pt Likert (1 strongly disagree – 5 strongly agree)
Q5	"I check my phone without consciously deciding to." (reverse-scored)	5-pt Likert (R)
Q6	"My phone rarely interrupts my studying."	5-pt Likert
Q7	Average nightly sleep (hours) — covariate	Numeric, open
Q8	"What does 'being focused' feel like for you?"	Open text — qualitative

Scale reliability. Items Q4–Q6 form the self-reported-focus scale. After reverse-scoring Q5, we report internal consistency with Cronbach's $\alpha$; we treat $\alpha\ge 0.70$ as acceptable. Reliability (consistency) is necessary but not sufficient for validity (measuring the right thing).
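As a rough illustration of the reliability check, Cronbach's $\alpha$ for $k$ items can be computed as $\alpha=\frac{k}{k-1}\left(1-\frac{\sum s_i^2}{s_{\text{total}}^2}\right)$. The six response rows below are invented for the example, and `cronbach_alpha` is a helper written here, not a library function.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix, all items scored in the same direction."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed scale
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical answers from six students; columns are Q4, Q5, Q6 (1-5 Likert).
raw = np.array([
    [4, 2, 4],
    [2, 4, 2],
    [5, 1, 4],
    [3, 3, 3],
    [1, 5, 2],
    [4, 2, 5],
], dtype=float)

scored = raw.copy()
scored[:, 1] = 6 - scored[:, 1]   # reverse-score Q5 on a 1-5 scale

alpha = cronbach_alpha(scored)
focus = scored.mean(axis=1)       # per-person scale score (mean of items)
print(f"alpha = {alpha:.2f}")
```

Running `cronbach_alpha(raw)` without the reverse-scoring step gives a much lower (here negative) value, which is a quick way to spot a forgotten reversal.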

Session 8 · what makes a good survey Session 22 · measures & reliability demo: survey question bias → demo: reliability vs validity →

2.3 The (quasi-)experimental design

Q2 needs a manipulation. Ideally we randomly assign participants to condition, making it a true experiment; if assignment must respect existing tutorial groups, it becomes a quasi-experiment with intact groups (weaker causal claim). We use a pretest–posttest two-group design.

Group	Pretest	Manipulation (1 week)	Posttest
Treatment ($n=45$)	O₁ attention task	Silence non-essential notifications	O₂ attention task
Control ($n=45$)	O₁ attention task	No change to phone	O₂ attention task

Independent variable: notification condition (off vs. unchanged) — manipulated.
Dependent variable: change in attention score $\Delta = O_2-O_1$.
Control of confounds: randomization (true experiment) breaks the link between condition and lurking variables; single-blind scoring; identical task instructions; same time-of-day testing.
Why pretest–posttest: using $\Delta$ removes stable individual differences in baseline attention.
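For the true-experiment version, random assignment might look like the sketch below. The participant IDs and the seed are placeholders; what matters is that assignment is decided by a shuffle and not by the researcher or the participant.

```python
import random

participant_ids = [f"P{i:03d}" for i in range(1, 91)]  # hypothetical anonymous IDs

rng = random.Random(2024)        # fixed seed: the assignment is reproducible
shuffled = participant_ids[:]
rng.shuffle(shuffled)

treatment = set(shuffled[:45])   # notifications off
control = set(shuffled[45:])     # no change to phone

def change_score(o1: float, o2: float) -> float:
    """Delta = O2 - O1, the dependent variable defined above."""
    return o2 - o1
```

If intact tutorial groups have to be used, this step is skipped and the design is the quasi-experiment described above.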

Sessions 10–11 · experiments Session 20 · quasi-experimental designs demo: confounding & randomization → demo: two-group experiment & effect size →

3 · Data collection plan & mock analysis

1. Pilot the questionnaire on ~10 students; check item clarity, timing, and that the OS screenshot instruction works on both iOS and Android.

2. Recruit & consent. Email the stratified random sample; obtain informed consent; assign anonymous IDs (no names stored with data).

3. Survey wave. Collect Q1–Q8 + the pretest attention task (O₁) for all participants; this powers the correlational analysis.

4. Assign & intervene. Randomize the experimental subset to treatment/control; run the 1-week manipulation.

5. Posttest. Re-administer the attention task (O₂).

6. Clean & analyze. Reverse-score Q5, screen for straight-lining and impossible values, then run the analyses below.
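The cleaning step can be sketched as a per-row check. The rows and the flagging rules here are illustrative assumptions: in particular, treating identical answers on Q4–Q6 as possible straight-lining is a heuristic chosen for this example, not a rule stated by the course.

```python
# Hypothetical raw rows; keys mirror the questionnaire item numbers.
rows = [
    {"id": "P001", "Q2": 312, "Q4": 4, "Q5": 2, "Q6": 4, "Q7": 7.0},
    {"id": "P002", "Q2": 1900, "Q4": 3, "Q5": 3, "Q6": 2, "Q7": 6.5},  # > 1440 min/day
    {"id": "P003", "Q2": 280, "Q4": 5, "Q5": 5, "Q6": 5, "Q7": 8.0},   # same answer throughout
]

MINUTES_PER_DAY = 24 * 60

def clean(row):
    """Return (cleaned_row, flags) for one respondent."""
    flags = []
    if not 0 <= row["Q2"] <= MINUTES_PER_DAY:
        flags.append("impossible screen-time")
    if not 0 <= row["Q7"] <= 24:
        flags.append("impossible sleep")
    likert = [row["Q4"], row["Q5"], row["Q6"]]
    if any(v not in range(1, 6) for v in likert):
        flags.append("Likert out of range")
    # Identical raw answers across items that include a reverse-worded one
    # suggest straight-lining, except for the neutral midpoint (3, 3, 3).
    if len(set(likert)) == 1 and likert[0] != 3:
        flags.append("possible straight-lining")
    cleaned = dict(row, Q5=6 - row["Q5"])  # reverse-score Q5 on a 1-5 scale
    return cleaned, flags

results = [clean(r) for r in rows]
for cleaned, flags in results:
    print(cleaned["id"], flags)
```

Flagged rows are best reviewed by hand before exclusion, with the exclusion rule written down before the data are seen.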

3.1 Descriptive statistics (correlational wave, n = 180)

Always describe before you infer. Means, standard deviations and the shape of each variable.

Variable	Mean	SD	Min	Max
Screen-time $X$ (min/day)	312	98	95	588
Attention score $Y$	142	27	71	203
Self-report focus (1–5)	3.1	0.8	1.2	4.8
Sleep (h)	6.8	1.1	4.0	9.0

Sample mean and sample standard deviation (with the $n-1$ correction, which makes $s^2$ an unbiased estimate of the population variance): $$\bar x=\frac{1}{n}\sum_{i=1}^{n}x_i,\qquad s=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar x)^2}.$$
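A small check of the two formulas, using five made-up screen-time values. The point to notice is the `ddof=1` argument: NumPy divides by $n$ by default, so it has to be set explicitly to match the $n-1$ denominator.

```python
import numpy as np

x = np.array([240.0, 305.0, 410.0, 198.0, 362.0])  # hypothetical screen-time, min/day

n = x.size
mean = x.sum() / n
sd = np.sqrt(((x - mean) ** 2).sum() / (n - 1))    # n - 1 in the denominator

# NumPy's default is ddof=0 (divide by n); ddof=1 matches the formula.
print(f"mean = {mean:.1f}, sd = {sd:.1f}, numpy sd = {x.std(ddof=1):.1f}")
print(f"min = {x.min():.0f}, max = {x.max():.0f}")
```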

3.2 Correlation: screen-time vs. attention

Pearson's correlation coefficient: $$r=\frac{\sum (x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum (x_i-\bar x)^2}\,\sqrt{\sum (y_i-\bar y)^2}}.$$ Mock result: $r=-0.34$, so $r^2\approx 0.12$: screen-time accounts for ~12% of the variance in attention scores. We test it against $H_0:\rho=0$ with $$t=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}=\frac{-0.34\sqrt{178}}{\sqrt{1-0.116}}\approx -4.82,$$ on $df=178$, giving $p<0.001$ (one-sided). We reject $H_0$: a moderate negative association.

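# 3.2: correlation + significance (Python / SciPy >= 1.9)
import numpy as np
from scipy import stats

# screen_time, attention: equal-length 1-D arrays, one value per participant (n = 180)
# alternative="less" tests H1: rho < 0 directly; halving a two-sided p is
# only valid when r falls in the predicted direction
r, p_one = stats.pearsonr(screen_time, attention, alternative="less")
print(f"r = {r:.2f},  r^2 = {r**2:.2f},  one-sided p = {p_one:.4f}")
# mock data output: r = -0.34,  r^2 = 0.12,  one-sided p = 0.0000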
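The hand calculation of $t$ can also be reproduced from the two summary numbers alone, which is useful when only $r$ and $n$ are reported. This sketch takes $r=-0.34$ and $n=180$ from the text and uses the lower tail of the $t$ distribution to match the one-sided $H_1:\rho<0$.

```python
import math
from scipy import stats

r, n = -0.34, 180                 # summary values reported above
df = n - 2
t = r * math.sqrt(df) / math.sqrt(1 - r**2)
p_one = stats.t.cdf(t, df)        # lower tail, matching H1: rho < 0

print(f"t = {t:.2f}, df = {df}, one-sided p = {p_one:.2g}")
```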

Correlation ≠ causation. The $r=-0.34$ is consistent with phones harming attention, attention problems driving phone use, or a confounder (poor sleep) driving both. We added sleep as a covariate (Q7); a partial correlation controlling for sleep drops to $r_{XY\cdot Z}=-0.27$ — attenuated but still present.
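For one covariate, the partial correlation has a closed form: $r_{XY\cdot Z}=\dfrac{r_{XY}-r_{XZ}\,r_{YZ}}{\sqrt{(1-r_{XZ}^2)(1-r_{YZ}^2)}}$. The page does not report $r_{XZ}$ or $r_{YZ}$, so the sketch below runs on synthetic data (all coefficients in the data-generating lines are invented) and simply shows that the formula agrees with correlating the residuals after regressing each variable on sleep.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 180
# Synthetic data only: sleep (z) influences both screen-time (x) and attention (y).
z = rng.normal(6.8, 1.1, n)
x = 312 - 30 * (z - 6.8) + rng.normal(0, 90, n)
y = 142 + 8 * (z - 6.8) - 0.08 * (x - 312) + rng.normal(0, 24, n)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

def residuals(a, b):
    """Residuals of a after fitting a least-squares line on b."""
    slope, intercept = np.polyfit(b, a, 1)
    return a - (slope * b + intercept)

r_partial = partial_corr(x, y, z)
r_resid = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]
print(f"formula: {r_partial:.3f}, residual method: {r_resid:.3f}")
```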

3.3 Two-group test: did the intervention work?

We compare the attention change $\Delta=O_2-O_1$ between treatment and control with an independent-samples (Welch) $t$-test, and report a standardized effect size.

Group	n	Mean Δ	SD
Treatment (notifications off)	45	+11.2	14.0
Control (no change)	45	+2.4	13.1
Difference	—	8.8	—

Welch's two-sample $t$ and pooled-SD Cohen's $d$: $$t=\frac{\bar\Delta_T-\bar\Delta_C}{\sqrt{\dfrac{s_T^2}{n_T}+\dfrac{s_C^2}{n_C}}} =\frac{11.2-2.4}{\sqrt{\frac{14.0^2}{45}+\frac{13.1^2}{45}}}\approx 3.08,$$ $$s_{\text{pooled}}=\sqrt{\frac{s_T^2+s_C^2}{2}}=\sqrt{\frac{14.0^2+13.1^2}{2}}\approx 13.6,\qquad d=\frac{\bar\Delta_T-\bar\Delta_C}{s_{\text{pooled}}}=\frac{8.8}{13.6}\approx 0.65.$$ The simple average of the two variances is valid here because the groups are the same size. With Welch–Satterthwaite $df\approx 87.6$, $p\approx 0.0014$ (one-sided). We reject $H_0$: a moderate, statistically significant improvement (Cohen's $d\approx 0.65$).

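# 3.3: Welch t-test + Cohen's d (SciPy >= 1.6 for `alternative`)
import numpy as np
from scipy import stats

# delta_T, delta_C: NumPy arrays of change scores (O2 - O1), one per participant
# alternative="greater" tests H1: mu_T - mu_C > 0 directly
t, p_one = stats.ttest_ind(delta_T, delta_C, equal_var=False, alternative="greater")
# pooled SD as the root mean of the two variances (valid because n_T = n_C)
sp = np.sqrt((delta_T.var(ddof=1) + delta_C.var(ddof=1)) / 2)
d  = (delta_T.mean() - delta_C.mean()) / sp
print(f"t = {t:.2f},  one-sided p = {p_one:.4f},  d = {d:.2f}")
# mock data output: t = 3.08,  one-sided p = 0.0014,  d = 0.65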
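Because the table reports means, SDs and group sizes, the test can be recomputed from those summary numbers without the raw change scores. The sketch below does so by hand, including the Welch–Satterthwaite degrees of freedom; the pooled SD is written in its general weighted form, which reduces to the simple average when the groups are equal in size.

```python
import math
from scipy import stats

m_T, s_T, n_T = 11.2, 14.0, 45     # treatment: mean change, SD, n
m_C, s_C, n_C = 2.4, 13.1, 45      # control

v_T, v_C = s_T**2 / n_T, s_C**2 / n_C
t = (m_T - m_C) / math.sqrt(v_T + v_C)
# Welch-Satterthwaite degrees of freedom
df = (v_T + v_C) ** 2 / (v_T**2 / (n_T - 1) + v_C**2 / (n_C - 1))
p_one = stats.t.sf(t, df)          # upper tail, matching H1: mu_T - mu_C > 0

# Pooled SD weighted by degrees of freedom
s_pooled = math.sqrt(((n_T - 1) * s_T**2 + (n_C - 1) * s_C**2) / (n_T + n_C - 2))
d = (m_T - m_C) / s_pooled

print(f"t = {t:.2f}, df = {df:.1f}, one-sided p = {p_one:.4f}, d = {d:.2f}")
```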

3.4 Results summary

Analysis	Test	Statistic	p	Effect size	Decision
Q1 screen-time ↔ attention	Pearson $r$	r = −0.34	< .001	r² = .12	reject H₀
Q1 controlling for sleep	partial $r$	−0.27	< .001	r² = .07	reject H₀
Q2 intervention effect	Welch $t$	t = 3.08	.0014	d = 0.65	reject H₀

Plain-language conclusion. Heavier screen-time is moderately associated with lower sustained attention (and remains so after adjusting for sleep), and a one-week notifications-off intervention produced a moderate, significant gain in attention. The combined design lets us speak more confidently about cause than the survey alone could.

Session 8 · descriptive statistics Session 9 · correlation demo: correlation & scatterplots → demo: two-group experiment & effect size →

4 · Threats to validity, bias & ethics

A finding is only as trustworthy as the design that produced it. Here we audit our own study against the standard threats — the kind of critique the final exam asks you to perform.

Internal validity (causal claim)

Confounding: sleep, stress, course load. Mitigated by randomization (Q2) and a sleep covariate (Q1).
Maturation / testing: the attention task itself may improve with practice — the control group accounts for this (their +2.4 baseline drift).
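Demand characteristics: treatment participants may try harder knowing they're "the phone group." Blind scoring keeps the scorer's expectations out of the data, and giving the control group a plausible placebo task would reduce the participant-side effect; the design in 2.3 as written (control = no change) does not yet include one.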

External validity (generalization)

One university, first-years only — results may not transfer to other ages or cultures.
One-week manipulation says little about durable, months-long effects.

Measurement & bias

Self-report bias: screen-time recall is unreliable — hence the OS screenshot anchor.
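Acquiescence / social desirability: the reverse-scored Q5 exposes straight-lining and yea-saying. It does not remove social-desirability bias, which is why self-report is triangulated with the behavioral attention task.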
Selection bias: avoided by random sampling from the enrolment frame, not volunteers.
Multiple comparisons / p-hacking: we pre-registered exactly three tests; running dozens of subgroup analyses would inflate the family-wise false-positive rate, which for $k$ independent tests at level $\alpha$ is $1-(1-\alpha)^k$.
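To see how quickly the family-wise rate grows, the formula can be evaluated for a few values of $k$. The Bonferroni threshold on the last line is an addition for illustration (the page does not say which correction, if any, was pre-registered), and the formula itself assumes the tests are independent.

```python
alpha = 0.05

def familywise_error(k: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

for k in (1, 3, 20):
    print(k, round(familywise_error(k, alpha), 3))

# Bonferroni: per-test threshold that keeps the family-wise rate at or below alpha
bonferroni_alpha = alpha / 3
```

With three tests the family-wise rate is about 14%; with twenty it is about 64%. All three p-values in the results table fall below the Bonferroni threshold of roughly 0.0167.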

Statistical conclusion validity

Power: $n=180$ gives >80% power to detect $r=-0.25$; the experiment ($n=90$, 45 per group) has at least 80% power for effects of $d\ge 0.6$.
We report effect sizes ($r^2$, $d$), not just $p$ — significance ≠ importance.
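The power figures can be approximated with the normal distribution. The page does not state which method or settings were used, so this sketch assumes one-sided tests at $\alpha=0.05$ (matching the directional hypotheses), Fisher's $z$ transformation for the correlation, and a normal approximation for the two-group comparison; exact $t$-based power would come out slightly lower.

```python
import math
from statistics import NormalDist

norm = NormalDist()

def power_correlation(r: float, n: int, alpha: float = 0.05) -> float:
    """Approximate one-sided power for testing rho = 0 via Fisher's z."""
    z_crit = norm.inv_cdf(1 - alpha)
    return norm.cdf(abs(math.atanh(r)) * math.sqrt(n - 3) - z_crit)

def power_two_group(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate one-sided power for a two-group mean difference."""
    z_crit = norm.inv_cdf(1 - alpha)
    return norm.cdf(abs(d) * math.sqrt(n_per_group / 2) - z_crit)

p_corr = power_correlation(-0.25, 180)
p_exp = power_two_group(0.6, 45)
print(f"correlation: {p_corr:.2f}, experiment: {p_exp:.2f}")
```

Under these assumptions both designs clear the conventional 80% threshold.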

Ethics (IE / APA principles, Session 22). Informed consent and the right to withdraw; anonymous IDs with screen-time data stored separately from identifiers; minimal risk (no deception); a debrief explaining the hypotheses; and a data-retention plan. The notifications-off manipulation is low-risk and reversible. Approval from the research ethics board precedes any data collection.

Session 6 · validity · type I & II error Session 22 · research ethics demo: confounding & randomization → demo: Simpson's paradox → demo: p-hacking & multiple comparisons → demo: type I & II error →

5 · Mapping to learning outcomes

How each part of this project demonstrates a course learning objective.

LO1 · Think critically about research

Framed a testable question, separated correlation (Q1) from causation (Q2), and audited our own confounds.

LO2 · Evaluate quality (reliability, validity, triangulation)

Reported Cronbach's $\alpha$, distinguished reliability from validity, triangulated self-report with a behavioral task and a qualitative item.

LO3 · Communicate research clearly

Structured the work as an APA-style report with a results table, effect sizes, and a plain-language conclusion ready for a conference-style talk.

Methods breadth

Exercised sampling, survey design, correlation, true/quasi-experiment, and descriptive + inferential statistics in one coherent study.

6 · Extensions & variations

Add a factor: turn Q2 into a $2\times2$ factorial — notifications (off/on) × study environment (silent/social) — and look for an interaction (Sessions 12–14).
Go within-subjects: a crossover design where each participant does both conditions in random order increases power and controls for individual differences.
Deepen the qualitative arm: run focus groups (Sessions 24–25) on Q8 responses and code themes, then triangulate against the quantitative results.
Interrupted time series: track attention daily across a phone-free week for a single cohort (quasi-experimental, Session 20) to see the trajectory, not just pre/post.
Robustness: replace Pearson $r$ with Spearman $\rho$ if attention scores are skewed; bootstrap the confidence interval for $d$.
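The robustness extension might be sketched as follows. Everything here runs on synthetic data generated to resemble the mock tables (no real scores exist), and the percentile bootstrap with 2,000 resamples is one simple choice among several bootstrap interval methods.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic change scores drawn to resemble the mock table (not real data).
delta_T = rng.normal(11.2, 14.0, 45)
delta_C = rng.normal(2.4, 13.1, 45)

def cohens_d(a, b):
    """Cohen's d with a pooled SD weighted by degrees of freedom."""
    na, nb = len(a), len(b)
    sp = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2))
    return (a.mean() - b.mean()) / sp

# Percentile bootstrap: resample each group with replacement, recompute d.
boot = np.array([
    cohens_d(rng.choice(delta_T, delta_T.size), rng.choice(delta_C, delta_C.size))
    for _ in range(2000)
])
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

# Spearman's rho on synthetic, right-skewed attention scores.
screen = rng.normal(312, 98, 180)
attention = np.exp(5 - 0.001 * screen + rng.normal(0, 0.2, 180))
rho, p_two = stats.spearmanr(screen, attention)

print(f"d = {cohens_d(delta_T, delta_C):.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
print(f"Spearman rho = {rho:.2f}")
```

Spearman's $\rho$ depends only on ranks, so it is unaffected by the skew that the exponential transform introduces.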

Sessions 12–14 · factorial designs Sessions 24–25 · focus groups

7 · References & further reading

Privitera, G. J. Research Methods for the Behavioral Sciences. SAGE. — Ch. 5 (Sampling), Ch. 8 (Correlational Designs), Ch. 10 (Between-Subjects), Ch. 11 (Quasi-Experimental), Ch. 13 (Descriptive Statistics).
American Psychological Association. Publication Manual of the APA (7th ed.). — report structure & statistics reporting.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). — effect-size conventions ($d$, $r$).
Field, A. Discovering Statistics. SAGE. — Pearson $r$, $t$-tests, partial correlation, Cronbach's $\alpha$.
Course syllabus — Learning to Observe, Experiment and Survey (PDF).

All numbers on this page are illustrative mock data created to demonstrate the analysis workflow — not results from a real study.