calc-lab worked project · a neural network from calculus alone

Worked project — training a neural network with nothing but calculus

The syllabus promises to "uncover the applications of calculus to … computer science" and lists, by name, the backpropagation algorithm (S19), the gradient descent algorithm and implementation (S20), Newton's method and numerical integration (S17), and Taylor series with Big-O / little-o notation (S7). This project cashes all of that in at once: we build and train a small neural network from first principles — every gradient derived by hand with the chain rule, no autograd, no torch.backward(). The only machinery is calculus. The code is real, runnable NumPy; the mathematics is written out in KaTeX so the derivation and the implementation line up symbol-for-symbol.

Goal

Given 2-D points from two interleaving half-moons, learn a non-linear decision boundary $\hat p = P(\text{class}=1 \mid \mathbf{x})$. A linear model cannot separate the moons, which forces a hidden layer — and a hidden layer is exactly what makes the chain rule indispensable. The headline deliverable is not the accuracy; it is a working numerical gradient check proving that the hand-derived derivatives equal the limit definition of the derivative to ten decimal places.

Stack

Language
Python 3.11 · NumPy
Model
2 → H → 1 MLP (tanh, sigmoid)
Loss
Binary cross-entropy
Optimiser
Hand-coded gradient descent
Gradients
Derived by chain rule
Verified by
finite-difference limit

Sessions exercised

The project deliberately threads the second half of the course — the part where calculus stops being abstract and becomes the engine of machine learning. It maps onto the program as follows:

S2–4 · limits S6 · derivative as a limit S7 · Taylor & Big-O S17 · Newton + numerical methods S18 · gradient & directional derivatives S19 · chain rule + backpropagation S20 · gradient descent S21 · constrained optimisation

It is the concrete companion to session 19's "Chain Rule … Applications and Backpropagation algorithm" and session 20's "Unconstrained optimization — Gradient descent algorithm and implementation."

1 · Problem framing & data

We generate the canonical two-moons set: two interleaving crescents, one per class, with a little Gaussian noise. It is the smallest dataset that is obviously not linearly separable — a straight line gets ~80 % at best — so it justifies a hidden layer and therefore a genuine multi-step chain rule.

import numpy as np
rng = np.random.default_rng(0)

def make_moons(n=400, noise=0.20):
    """Two interleaving half circles, labels in {0,1}."""
    n2 = n // 2
    t = np.linspace(0, np.pi, n2)
    outer = np.c_[np.cos(t),        np.sin(t)]          # upper moon
    inner = np.c_[1 - np.cos(t),   -np.sin(t) + 0.5]    # lower moon, shifted
    X = np.vstack([outer, inner]) + noise * rng.standard_normal((n, 2))
    y = np.r_[np.zeros(n2), np.ones(n2)].astype(int)
    return X, y

X, y = make_moons()
# standardise features — gradient descent converges far faster on comparable scales
X = (X - X.mean(0)) / X.std(0)
print(X.shape, y.mean())   # (400, 2) 0.5

# 400 points, perfectly balanced, two features

Why a hidden layer is non-negotiable. A single sigmoid neuron is a linear classifier — its boundary $\mathbf{w}^\top\mathbf{x}+b=0$ is a straight line. The moons are not separable by any line, so we insert a hidden layer of $H$ tanh units. Composing functions is what creates the non-linearity — and differentiating a composition is precisely what the S19 chain rule is for.

2 · The derivative, as a limit (S6)

Everything below rests on the formal definition from session 6 — the derivative is a limit of difference quotients:

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}.$$

That single limit is the whole project in miniature. The slope of the tangent line is the limiting slope of secant lines as $h \to 0$ — exactly the tangent-line demo, where shrinking $h$ collapses the secant onto the tangent. Two consequences we lean on:

3 · The chain rule becomes backpropagation (S19)

The network as a composition of functions

input
x₁
x₂
hidden · tanh
a₁
a₂
a₃
output · σ

A forward pass is just function composition — the thing session 19 differentiates:

$$\mathbf{z}^{(1)} = \mathbf{x}W_1 + \mathbf{b}_1,\quad \mathbf{a}^{(1)} = \tanh\!\big(\mathbf{z}^{(1)}\big),\quad z^{(2)} = \mathbf{a}^{(1)}W_2 + b_2,\quad \hat p = \sigma\!\big(z^{(2)}\big).$$

The loss for one example is binary cross-entropy; averaged over $m$ points:

$$L = -\frac{1}{m}\sum_{i=1}^{m}\Big[\,y_i \log \hat p_i + (1-y_i)\log(1-\hat p_i)\,\Big].$$

Differentiating the composition, outermost first

Backpropagation is nothing more exotic than the chain rule $\dfrac{dL}{d\theta} = \dfrac{dL}{d\hat p}\dfrac{d\hat p}{dz^{(2)}}\cdots$ applied layer by layer, reusing each factor on the way down. The famous simplification: for sigmoid output with cross-entropy loss the first two factors collapse to a clean residual,

$$\frac{\partial L}{\partial z^{(2)}} = \frac{1}{m}\,(\hat p - y).$$

Propagating that backwards through the weights and the tanh gives every gradient the network needs:

$$\frac{\partial L}{\partial W_2} = \mathbf{a}^{(1)\top}\frac{\partial L}{\partial z^{(2)}}, \qquad \frac{\partial L}{\partial \mathbf{a}^{(1)}} = \frac{\partial L}{\partial z^{(2)}}\,W_2^\top,$$

$$\frac{\partial L}{\partial \mathbf{z}^{(1)}} = \frac{\partial L}{\partial \mathbf{a}^{(1)}} \odot \big(1 - (\mathbf{a}^{(1)})^2\big), \qquad \frac{\partial L}{\partial W_1} = \mathbf{x}^\top \frac{\partial L}{\partial \mathbf{z}^{(1)}}.$$

The $\odot\,(1-(\mathbf a^{(1)})^2)$ factor is the tanh derivative from §2 — the chain rule passing the error back through the non-linearity. Each partial derivative is the multivariable gradient of session 18; the partial-derivatives demo shows the same $f_x,\,f_y$ slicing on a 2-D surface, and the gradient demo shows the assembled $\nabla f$ pointing uphill.

4 · Gradient descent (S20) — and why it goes downhill (S7)

With every $\nabla_\theta L$ in hand, session 20's update rule is one line per parameter — step against the gradient, because $\nabla L$ points in the direction of steepest ascent (the defining property of the gradient from session 18, visualised in the gradient + contours demo, where the arrows sit perpendicular to the level sets):

$$\theta \leftarrow \theta - \eta\,\nabla_\theta L.$$

Why this is guaranteed to descend — a first-order Taylor argument (S7). Expand the loss around the current parameters: $$L(\theta - \eta\nabla L) = L(\theta) - \eta\,\lVert\nabla L\rVert^2 + O(\eta^2).$$ For a small enough step $\eta$, the negative middle term dominates the $O(\eta^2)$ remainder, so the loss must drop. That $O(\eta^2)$ — Big-O notation, straight from session 7 — is also the warning: take $\eta$ too large and the discarded curvature term takes over and the loss diverges. Learning-rate tuning is reading this Taylor remainder.

Minimising $L$ is an unconstrained optimisation problem (S20): we are hunting a point where $\nabla L = \mathbf 0$. The optimization demo shows the 1-D version — critical points where $f'=0$, classified by the sign of $f''$.

5 · The whole network, in NumPy

No deep-learning framework — just the formulas above transcribed. Forward pass, hand-derived backward pass, gradient-descent loop.

def sigmoid(z):  return 1.0 / (1.0 + np.exp(-z))

def init_params(H=8, seed=1):
    g = np.random.default_rng(seed)
    # small random weights break symmetry; scale keeps tanh in its linear-ish region
    return {
        "W1": g.standard_normal((2, H)) * 0.5, "b1": np.zeros(H),
        "W2": g.standard_normal((H, 1)) * 0.5, "b2": np.zeros(1),
    }

def forward(P, X):
    z1 = X @ P["W1"] + P["b1"]
    a1 = np.tanh(z1)
    z2 = a1 @ P["W2"] + P["b2"]
    p  = sigmoid(z2).ravel()
    cache = (X, z1, a1, z2, p)
    return p, cache

def bce_loss(p, y, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)              # avoid log(0)
    return -np.mean(y*np.log(p) + (1-y)*np.log(1-p))

def backward(P, cache, y):
    X, z1, a1, z2, p = cache
    m = len(y)
    dz2 = ((p - y) / m)[:, None]              # ∂L/∂z2  = (p − y)/m
    dW2 = a1.T @ dz2                          # ∂L/∂W2
    db2 = dz2.sum(0)
    da1 = dz2 @ P["W2"].T                     # push error back through W2
    dz1 = da1 * (1 - a1**2)                   # × tanh'(z1) — the chain rule step
    dW1 = X.T @ dz1                           # ∂L/∂W1
    db1 = dz1.sum(0)
    return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

def train(X, y, H=8, lr=1.0, epochs=2000):
    P = init_params(H)
    history = []
    for t in range(epochs):
        p, cache = forward(P, X)
        history.append(bce_loss(p, y))
        grads = backward(P, cache, y)
        for k in P:                           # θ ← θ − η ∇θ L   (gradient descent)
            P[k] -= lr * grads[k]
    return P, history

# every gradient above was derived by hand in §3 — nothing is auto-differentiated

6 · Gradient check — the limit, made computational (S6 · S17)

How do we prove the hand-derived gradients are correct? Go back to the definition. The analytic gradient should equal the numerical derivative from the symmetric limit of §2 — this is the finite-difference method of session 17, applied to the derivative definition of session 6. For each scalar parameter $\theta_j$:

$$\frac{\partial L}{\partial \theta_j} \;\approx\; \frac{L(\theta_j + \varepsilon) - L(\theta_j - \varepsilon)}{2\varepsilon}, \qquad \varepsilon = 10^{-5}.$$

def numerical_grad(P, X, y, eps=1e-5):
    """Estimate every gradient with the two-sided limit definition."""
    num = {k: np.zeros_like(v) for k, v in P.items()}
    for k in P:
        it = np.nditer(P[k], flags=["multi_index"])
        while not it.finished:
            idx = it.multi_index
            orig = P[k][idx]
            P[k][idx] = orig + eps; Lp, _ = forward(P, X); Lp = bce_loss(Lp, y)
            P[k][idx] = orig - eps; Lm, _ = forward(P, X); Lm = bce_loss(Lm, y)
            num[k][idx] = (Lp - Lm) / (2*eps)        # ← the limit, numerically
            P[k][idx] = orig
            it.iternext()
    return num

P = init_params()
p, cache = forward(P, X)
analytic  = backward(P, cache, y)
numeric   = numerical_grad(P, X, y)

for k in P:
    rel = np.linalg.norm(analytic[k] - numeric[k]) / (
          np.linalg.norm(analytic[k]) + np.linalg.norm(numeric[k]) + 1e-12)
    print(f"{k}: relative error = {rel:.2e}")

# W1: 2.7e-10 b1: 1.9e-10 W2: 3.1e-10 b2: 8.0e-11

Parameter‖analytic − numeric‖ (relative)verdict
W12.7e-10✓ chain rule confirmed
b11.9e-10
W23.1e-10
b28.0e-11

A relative error near $10^{-10}$ is the floating-point floor for a two-sided difference at $\varepsilon=10^{-5}$ — as close to "the hand-derived calculus is exactly right" as numbers allow. This single table is the project's proof of correctness.

7 · Results

Training full-batch for 2,000 epochs at $\eta = 1.0$ with $H = 8$ hidden units. Representative numbers (they vary slightly with the seed):

ModelTrain acc.Final lossBoundary
Linear (no hidden layer)0.840.39straight line — under-fits
MLP, H = 40.970.12curved, captures the moons
MLP, H = 80.990.06clean non-linear separation

The linear model is capped by geometry, not training — no amount of gradient descent straightens the moons. Adding hidden tanh units (more function composition, more chain rule) is what unlocks the curved boundary.

The loss curve is a Riemann story too

The monotone decrease of $L$ over epochs is the discrete trace of gradient descent walking downhill. Summing the per-epoch drops telescopes back to the total loss reduction $\sum_t \big(L_t - L_{t+1}\big) = L_0 - L_{\text{final}}$ — the same accumulate-small-pieces logic as a Riemann sum, here over training time instead of an interval.

P, history = train(X, y, H=8, lr=1.0, epochs=2000)
print("start loss:", round(history[0], 3), " final loss:", round(history[-1], 3))

p, _ = forward(P, X)
acc = ((p >= 0.5).astype(int) == y).mean()
print("train accuracy:", round(acc, 3))      # ~0.99

8 · The loss landscape & second-order view (S7 · S17 · S20)

Gradient descent only uses first derivatives, but the syllabus also gives us the tools to see why it sometimes crawls and how Newton's method would accelerate it.

Curvature and the Hessian

The second-order Taylor expansion (S7) of the loss around a point $\theta_0$ involves the Hessian $H$ of second partials (S20, "high-order derivatives for several variables"):

$$L(\theta) \approx L(\theta_0) + \nabla L^\top(\theta-\theta_0) + \tfrac12 (\theta-\theta_0)^\top H\,(\theta-\theta_0).$$

The eigenvalues of $H$ classify each critical point — all positive ⇒ a minimum, mixed signs ⇒ a saddle. That is the multivariable second-derivative test, exactly the $\det(\text{Hessian})$ readout in the partial-derivatives demo and the saddle/bowl shapes in the 3-D surface demo. Deep-learning loss surfaces are riddled with saddles — which is why honest pictures of "the loss landscape" look like the monkey saddle on that page.

Newton's method (S17) as the limiting case

Minimising the quadratic Taylor model exactly gives Newton's update — the same root-finder for $\nabla L = \mathbf 0$ that session 17 applies to $f(x)=0$:

$$\theta \leftarrow \theta - H^{-1}\nabla L.$$

It converges in far fewer steps but costs an $H^{-1}$ each iteration — impractical for millions of weights, which is precisely why real networks keep the cheap first-order step of §4 and merely approximate the curvature. The whole zoo of modern optimisers (momentum, Adam) lives in the gap between plain gradient descent and full Newton.

Where S21 fits — constrained optimisation

Add a constraint (e.g. weight-decay budget $\lVert\theta\rVert^2 \le c$) and minimisation becomes a Lagrange-multiplier problem (S21): stationarity of $L(\theta) - \lambda\big(\lVert\theta\rVert^2 - c\big)$ recovers L2 regularisation. The graphical picture — the loss contour tangent to the constraint circle — is the gradient/contour demo with one extra level set drawn in.

9 · Mapping to learning outcomes

Each project step lands on a concrete syllabus session and a calc-lab demo:

Project stepSyllabus sessionDemo
Slope as a limit of secantsS2–4 limits · S6 derivativetangent
tanh′, σ′ from the limitS6 computation rulestangent
BackpropagationS19 chain rule + backprop∂f
Gradient = steepest ascentS18 gradient · directional deriv.∇ & contours
Gradient-descent updateS20 unconstrained optimisationoptimization
Why the step descendsS7 Taylor · Big-O / little-oTaylor
Gradient check (finite diff.)S6 limit · S17 numerical methodstangent
Hessian & critical-point testS20 high-order derivatives3D surface
Newton's methodS17 Newton's methodoptimization
Weight decay = LagrangeS21 constrained optimisation∇ & contours
Loss telescopes like a sumS13 Riemann sumsRiemann

10 · Extensions

  1. Deeper net. Add a second hidden layer — the chain rule simply gains one more factor per backward step; the gradient check should still pass at $10^{-10}$.
  2. Momentum & Adam. Replace the plain update with a velocity term and watch the Taylor $O(\eta^2)$ instability tame; compare epochs-to-converge.
  3. Mini-batches. Estimate $\nabla L$ from random subsets — stochastic gradient descent — and discuss the bias/variance of that estimate.
  4. Newton on a toy. Implement the full $H^{-1}\nabla L$ step on a 2-parameter slice and time it against gradient descent to feel the speed/cost trade.
  5. Decision-boundary integral. Use Simpson's rule (S17) to integrate the model's predicted-probability surface over a region and report the expected class mass.

11 · References