From-Scratch Build · Machine Learning

Statistical Learning

statlearn is a small Python library that implements the core statistical-learning methods from scratch with numpy: linear and logistic regression, ridge and lasso, a CART decision tree, k-fold cross-validation and the bootstrap. No scikit-learn touches the algorithms — it's used only in the tests to check each implementation against a reference, and in the demo to load datasets.

Python · NumPy only · 20 tests passing · cross-checked vs scikit-learn · MIT

What it is

The foundations, implemented not imported

A compact statlearn/ package where each estimator is written out by hand in numpy and shares one fit(X, y) / predict(X) interface. Linear models expose coef_ and intercept_ after fitting; the logistic model adds predict_proba. The point is that nothing is a black box — the normal equation, the gradient-descent loop, the soft-thresholding step in lasso and the k-fold split are all right there in the source.

Correctness isn't taken on faith. The tests/ suite re-derives the same fit with scikit-learn and asserts the from-scratch coefficients, accuracies and split sizes agree within tolerance — 20 tests, all passing.

pip install -r requirements.txt
python -m pytest -q
python demo.py

The toolkit

The methods, from first principles

Each one built to be understood, then cross-checked against scikit-learn.

linear.py

Linear regression

OLS two ways — the normal equation in closed form, and batch gradient descent on the MSE loss. Both converge to the same fit.
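The two solvers can be sketched side by side as below. This is a minimal illustration, not the code in linear.py: the function names ols_normal and ols_gd, the learning rate and the iteration count are all assumptions chosen for the example.

```python
import numpy as np

def ols_normal(X, y):
    """Closed-form OLS via the normal equation (X'X) beta = X'y."""
    Xb = np.column_stack([np.ones(len(X)), X])        # prepend an intercept column
    beta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)       # solve rather than invert
    return beta[1:], beta[0]                          # (coef, intercept)

def ols_gd(X, y, lr=0.1, n_iter=2000):
    """Batch gradient descent on the mean squared error."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        resid = X @ w + b - y                         # prediction error on all rows
        w -= lr * (2.0 / n) * (X.T @ resid)           # dMSE/dw
        b -= lr * (2.0 / n) * resid.sum()             # dMSE/db
    return w, b

# Synthetic data with a known linear signal (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + 0.1 * rng.standard_normal(200)

coef_ne, b_ne = ols_normal(X, y)
coef_gd, b_gd = ols_gd(X, y)
print(np.abs(coef_ne - coef_gd).max())                # gap between the two solvers
```

On well-scaled features like these the gradient-descent weights should settle onto the closed-form ones; on unscaled features a fixed step size this large may diverge, so standardising first is the usual precaution.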

linear.py

Ridge & lasso

Ridge is closed-form L2 on centred data; lasso runs coordinate descent with soft-thresholding, zeroing weak coefficients outright.
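A rough sketch of both penalised fits follows. It assumes the lasso objective (1/2n)·||y − Xw − b||² + α·||w||₁, which is the convention scikit-learn's Lasso uses; the library's own scaling of α, and the names ridge_closed_form, lasso_cd and soft_threshold, are assumptions for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, clamping anything inside [-t, t] to exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ridge_closed_form(X, y, alpha):
    """Ridge on centred data, so the intercept is not penalised."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def lasso_cd(X, y, alpha, n_iter=200):
    """Cyclic coordinate descent for (1/2n)||y - Xw - b||^2 + alpha*||w||_1."""
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.zeros(p)
    col_sq = (Xc ** 2).sum(axis=0)
    resid = yc.copy()                                 # residual for w = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue                              # constant column: leave at 0
            resid += Xc[:, j] * w[j]                  # take feature j out of the fit
            rho = Xc[:, j] @ resid                    # its correlation with what is left
            w[j] = soft_threshold(rho, n * alpha) / col_sq[j]
            resid -= Xc[:, j] * w[j]                  # put the updated feature back
    return w, y_mean - x_mean @ w

# Only features 0 and 2 carry signal in this synthetic example.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0]) + 1.0 + 0.1 * rng.standard_normal(200)

w_ols, _ = ridge_closed_form(X, y, alpha=0.0)         # alpha = 0 reduces to OLS
w_ridge, b_ridge = ridge_closed_form(X, y, alpha=10.0)
w_lasso, b_lasso = lasso_cd(X, y, alpha=0.1)
print(w_ridge)
print(w_lasso)
```

Ridge shrinks every coefficient but leaves them non-zero, whereas the soft-threshold step sets a coefficient to exactly zero whenever its correlation with the residual falls below the penalty.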

logistic.py

Logistic regression

Binary classifier via gradient descent on cross-entropy, with a numerically stable sigmoid and predict_proba.
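One common way to make the sigmoid numerically stable is to branch on the sign of the input so that exp never sees a large positive argument. The sketch below pairs that with a plain gradient-descent loop; it is an illustration under assumed names (sigmoid, fit_logistic) and hyperparameters, and logistic.py may handle stability differently.

```python
import numpy as np

def sigmoid(z):
    """Logistic function for array input that never exponentiates a large positive value."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))          # safe: exponent is <= 0
    ez = np.exp(z[~pos])                              # safe: exponent is < 0
    out[~pos] = ez / (1.0 + ez)
    return out

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Batch gradient descent on the mean cross-entropy loss."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        err = sigmoid(X @ w + b) - y                  # gradient of the loss w.r.t. the logits
        w -= lr * (X.T @ err) / n
        b -= lr * err.mean()
    return w, b

# Synthetic, nearly separable two-feature problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 2))
y = ((X @ np.array([2.0, -1.0]) + 0.3 * rng.standard_normal(300)) > 0).astype(float)

w, b = fit_logistic(X, y)
proba = sigmoid(X @ w + b)                            # what predict_proba would return for class 1
pred = (proba >= 0.5).astype(float)
acc = float((pred == y).mean())
print(acc)
```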

tree.py

Decision tree (CART)

Greedy recursive binary splits — Gini impurity for classification, variance reduction for regression.
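The heart of the greedy procedure is a single exhaustive search for the best split at a node, which the tree then applies recursively to each child. A minimal sketch for the classification case, with hypothetical helper names gini and best_split and midpoint thresholds as an assumed choice:

```python
import numpy as np

def gini(y):
    """Gini impurity 1 - sum_k p_k^2 of a label vector."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def best_split(X, y):
    """Return (feature, threshold, impurity decrease) of the best binary split."""
    n, p = X.shape
    parent = gini(y)
    best = (None, None, 0.0)
    for j in range(p):
        values = np.unique(X[:, j])                   # sorted distinct values
        thresholds = (values[:-1] + values[1:]) / 2.0 # midpoints between neighbours
        for t in thresholds:
            left = X[:, j] <= t
            n_left = int(left.sum())
            child = (n_left * gini(y[left]) + (n - n_left) * gini(y[~left])) / n
            gain = parent - child                     # weighted impurity decrease
            if gain > best[2]:
                best = (j, float(t), gain)
    return best

X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]])
y = np.array([0, 0, 1, 1])
feature, threshold, gain = best_split(X, y)
print(feature, threshold, gain)                       # feature 0 at 2.5 separates the classes
```

For regression the same loop applies with gini replaced by the variance of the targets in each child, so the gain becomes a variance reduction.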

resampling.py

k-fold cross-validation

Index-based folds whose sizes match scikit-learn's KFold, plus a cross_val_score driver.
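scikit-learn's KFold gives the first n_samples % n_splits folds one extra row each, so matching its fold sizes comes down to reproducing that rule. A sketch of an index generator that does so (kfold_indices and its parameters are illustrative names, not necessarily those in resampling.py):

```python
import numpy as np

def kfold_indices(n_samples, n_splits=5, shuffle=False, seed=None):
    """Yield (train_idx, test_idx) pairs with KFold-style fold sizes."""
    idx = np.arange(n_samples)
    if shuffle:
        idx = np.random.default_rng(seed).permutation(n_samples)
    fold_sizes = np.full(n_splits, n_samples // n_splits)
    fold_sizes[: n_samples % n_splits] += 1           # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        test = idx[start:start + size]
        train = np.concatenate([idx[:start], idx[start + size:]])
        yield train, test
        start += size

# 442 rows (the diabetes dataset size) split five ways.
sizes = [len(test) for _, test in kfold_indices(442, n_splits=5)]
print(sizes)                                          # [89, 89, 88, 88, 88]
```

A cross_val_score driver then only needs to loop over these pairs, fit a fresh estimator on each training slice and score it on the matching test slice.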

resampling.py

Bootstrap

Resample with replacement; the out-of-bag rows left out of each resample (about 37% on average) give an honest generalisation estimate.
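The 37% figure comes from the chance that a given row is never drawn in n draws with replacement, (1 − 1/n)ⁿ, which tends to 1/e ≈ 0.368. A quick simulation illustrates it; bootstrap_split is a hypothetical helper for this example, not the library's function.

```python
import numpy as np

def bootstrap_split(n_samples, rng):
    """Draw one bootstrap sample and return (in-bag indices, out-of-bag indices)."""
    boot = rng.integers(0, n_samples, size=n_samples)     # sample rows with replacement
    oob = np.setdiff1d(np.arange(n_samples), boot)        # rows that were never drawn
    return boot, oob

rng = np.random.default_rng(0)
n = 569                                                   # breast cancer dataset size
fractions = [len(bootstrap_split(n, rng)[1]) / n for _ in range(200)]
mean_oob = float(np.mean(fractions))
print(mean_oob, np.exp(-1.0))                             # simulated share vs 1/e
```

Fitting on the in-bag rows and scoring on the out-of-bag rows, repeated over many resamples, gives a mean and spread for the score.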

Usage

One interface, all the way down

Every estimator is a plain object you fit then predict with — and the resampling helpers take any of them:

import numpy as np
from statlearn import LinearRegression, LogisticRegression, cross_val_score

X = np.random.randn(200, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * np.random.randn(200)

model = LinearRegression(method="normal").fit(X, y)
print(model.coef_, model.intercept_)
print(model.score(X, y))                 # R^2 on the data

scores = cross_val_score(LinearRegression(), X, y, n_splits=5, scoring="r2")
print(scores.mean())

Results

Real numbers from python demo.py

The demo trains every method on two standard datasets, using scikit-learn only to load the data. These are the actual printed metrics from a run:

Regression: diabetes (442 samples, 10 features), 75/25 train/test split.

Model                                  Metric
OLS (normal equation)                  R² = 0.3728
OLS (gradient descent)                 R² = 0.3728
Ridge (α = 10)                         R² = 0.3696
Lasso (α = 0.5, 9/10 features kept)    R² = 0.3733
Decision tree (depth 3)                R² = 0.2043
OLS (5-fold CV)                        R² = 0.4875 ± 0.0751

Classification: breast cancer (569 samples, 30 features), standardised.

Model                                  Metric
Logistic regression                    acc = 0.9580
Decision tree (depth 4)                acc = 0.9231
Logistic (5-fold CV)                   acc = 0.9736 ± 0.0124
Tree (OOB bootstrap)                   acc = 0.9207 ± 0.0177

Two details that double as sanity checks: the normal-equation and gradient-descent solvers reach an identical test R² (0.3728), and lasso drops one of the ten diabetes features while its test R² stays essentially unchanged (0.3733 against 0.3728 for OLS). That is regularisation doing feature selection.