From-Scratch Build · Machine Learning
statlearn is a small Python library that implements the core statistical-learning methods from scratch with numpy: linear and logistic regression, ridge and lasso, a CART decision tree, k-fold cross-validation and the bootstrap. No scikit-learn touches the algorithms — it's used only in the tests to check each implementation against a reference, and in the demo to load datasets.
What it is
A compact statlearn/ package where each estimator is written out by hand in numpy and shares one fit(X, y) / predict(X) interface. Linear models expose coef_ and intercept_ after fitting; the logistic model adds predict_proba. The point is that nothing is a black box — the normal equation, the gradient-descent loop, the soft-thresholding step in lasso and the k-fold split are all right there in the source.
Correctness isn't taken on faith. The tests/ suite re-derives the same fit with scikit-learn and asserts the from-scratch coefficients, accuracies and split sizes agree within tolerance — 20 tests, all passing.
The toolkit
Each one built to be understood, then cross-checked against scikit-learn.
Linear regression. OLS two ways: the normal equation in closed form, and batch gradient descent on the MSE loss. Both converge to the same fit.
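The agreement between the two solvers can be sketched in a few lines of numpy. This is an illustrative sketch rather than the library's source: the helper names `fit_normal` and `fit_gd`, the learning rate and the iteration count are all assumptions chosen for this synthetic example.

```python
import numpy as np

def fit_normal(X, y):
    """Closed-form OLS: solve (A^T A) w = A^T y, where A = [1, X]."""
    A = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    return np.linalg.solve(A.T @ A, A.T @ y)

def fit_gd(X, y, lr=0.1, n_iter=2000):
    """Batch gradient descent on the MSE loss (hypothetical defaults)."""
    A = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = (2.0 / len(y)) * (A.T @ (A @ w - y))   # gradient of the mean squared error
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.5 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)

w_normal = fit_normal(X, y)   # [intercept, coef_1, coef_2, coef_3]
w_gd = fit_gd(X, y)
print(np.allclose(w_normal, w_gd, atol=1e-6))
```

Gradient descent only matches the closed form when the step size suits the scale of the features; on unstandardised data a smaller learning rate or more iterations would likely be needed.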
Ridge and lasso. Ridge is closed-form L2 on centred data; lasso runs coordinate descent with soft-thresholding, zeroing weak coefficients outright.
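The soft-thresholding operator is what lets lasso set a coefficient to exactly zero rather than merely shrinking it. A minimal scalar sketch follows; the function name `soft_threshold` is an assumption, not necessarily what the package calls it.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; anything inside [-alpha, alpha] becomes 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

# Strong signals are shrunk by alpha, weak ones are zeroed outright.
print([soft_threshold(r, 0.5) for r in (2.0, 0.3, -0.4, -1.5)])
# [1.5, 0.0, 0.0, -1.0]
```

In coordinate descent each coefficient is updated by applying this operator to its partial correlation with the residual, up to a normalising factor that depends on how the objective is scaled.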
Logistic regression. A binary classifier trained by gradient descent on cross-entropy, with a numerically stable sigmoid and predict_proba.
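One common way to make the sigmoid numerically stable is to branch on the sign of the input so that `exp` is only ever called on a non-positive number. The scalar sketch below shows that idea; the name `stable_sigmoid` is an assumption, and a vectorised numpy version would apply the same branch per element.

```python
import math

def stable_sigmoid(z):
    """Evaluate 1 / (1 + exp(-z)) without overflowing for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))   # exp(-z) is in (0, 1]
    ez = math.exp(z)                        # z < 0, so exp(z) is in [0, 1)
    return ez / (1.0 + ez)

print(stable_sigmoid(0.0), stable_sigmoid(1000.0), stable_sigmoid(-1000.0))
# 0.5 1.0 0.0
```

The naive form `1 / (1 + math.exp(-z))` raises `OverflowError` at `z = -1000`, because it has to evaluate `math.exp(1000)`.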
Decision tree (CART). Greedy recursive binary splits: Gini impurity for classification, variance reduction for regression.
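The classification criterion can be illustrated with a small worked example. The helpers `gini` and `split_gain` below are hypothetical names used for this sketch; a greedy tree would evaluate the gain for every candidate split and keep the largest.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gain(parent, left, right):
    """Impurity decrease of a split, weighting each child by its share of rows."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

parent = [0, 0, 0, 1, 1, 1]
print(gini(parent))                              # 0.5 (two balanced classes)
print(split_gain(parent, [0, 0, 0], [1, 1, 1]))  # 0.5 (a perfect split removes all impurity)
```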
K-fold cross-validation. Index-based folds whose sizes match scikit-learn's KFold, plus a cross_val_score driver.
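Matching scikit-learn's fold sizes comes down to one rule from the KFold documentation: the first `n_samples % n_splits` folds get `n_samples // n_splits + 1` rows and the rest get `n_samples // n_splits`. A minimal unshuffled sketch, with `kfold_indices` as an assumed name:

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, test_idx) lists over contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)   # the first `extra` folds get one more row
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

print([len(test) for _, test in kfold_indices(13, n_splits=5)])
# [3, 3, 3, 2, 2]
```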
Bootstrap. Resample with replacement; the out-of-bag rows (about 37% of the data) give an honest generalisation estimate.
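The 37% figure follows from the chance that a given row is never picked in n draws with replacement, (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. The sketch below checks this both analytically and by simulation; `oob_fraction` is a hypothetical helper, not part of the package.

```python
import math
import random

def oob_fraction(n, seed=0):
    """Draw n row indices with replacement; return the share never drawn."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}   # distinct rows in the bootstrap sample
    return 1.0 - len(drawn) / n

# Analytic probability for n = 1000 next to its limit 1/e.
print((1 - 1 / 1000) ** 1000, math.exp(-1))

# Simulated out-of-bag share for one bootstrap sample of 10,000 rows.
print(oob_fraction(10_000))
```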
Usage
Every estimator is a plain object you fit then predict with — and the resampling helpers take any of them:
```python
import numpy as np
from statlearn import LinearRegression, LogisticRegression, cross_val_score

X = np.random.randn(200, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * np.random.randn(200)

model = LinearRegression(method="normal").fit(X, y)
print(model.coef_, model.intercept_)
print(model.score(X, y))  # R^2 on the data

scores = cross_val_score(LinearRegression(), X, y, n_splits=5, scoring="r2")
print(scores.mean())
```
Results
python demo.py

The demo trains every method on two standard datasets, using scikit-learn only to load the data. These are the actual printed metrics from a run:
Regression (diabetes dataset):

| Model | Metric |
|---|---|
| OLS, normal equation | R² = 0.3728 |
| OLS, gradient descent | R² = 0.3728 |
| Ridge (α = 10) | R² = 0.3696 |
| Lasso (α = 0.5, 9/10 features kept) | R² = 0.3733 |
| Decision tree (depth 3) | R² = 0.2043 |
| OLS, 5-fold CV | R² = 0.4875 ± 0.0751 |

Classification:

| Model | Metric |
|---|---|
| Logistic regression | acc = 0.9580 |
| Decision tree (depth 4) | acc = 0.9231 |
| Logistic, 5-fold CV | acc = 0.9736 ± 0.0124 |
| Tree, OOB bootstrap | acc = 0.9207 ± 0.0177 |
Two details double as sanity checks. The normal-equation and gradient-descent solvers reach an identical test R² (0.3728), and lasso drops one of the ten diabetes features while keeping R² essentially unchanged (0.3733 against 0.3728 for OLS), which is regularisation doing feature selection.