Worked project — training a neural network with nothing but calculus
The syllabus promises to "uncover the applications of calculus to … computer science" and lists, by name, the backpropagation algorithm (S19), the gradient descent algorithm and implementation (S20), Newton's method and numerical integration (S17), and Taylor series with Big-O / little-o notation (S7). This project cashes all of that in at once: we build and train a small neural network from first principles — every gradient derived by hand with the chain rule, no autograd, no torch.backward(). The only machinery is calculus. The code is real, runnable NumPy; the mathematics is written out in KaTeX so the derivation and the implementation line up symbol-for-symbol.
Goal
Given 2-D points from two interleaving half-moons, learn a non-linear decision boundary $\hat p = P(\text{class}=1 \mid \mathbf{x})$. A linear model cannot separate the moons, which forces a hidden layer — and a hidden layer is exactly what makes the chain rule indispensable. The headline deliverable is not the accuracy; it is a working numerical gradient check proving that the hand-derived derivatives equal the limit definition of the derivative to ten decimal places.
Stack
Sessions exercised
The project deliberately threads the second half of the course — the part where calculus stops being abstract and becomes the engine of machine learning. It maps onto the program as follows:
S2–4 · limits S6 · derivative as a limit S7 · Taylor & Big-O S17 · Newton + numerical methods S18 · gradient & directional derivatives S19 · chain rule + backpropagation S20 · gradient descent S21 · constrained optimisation
It is the concrete companion to session 19's "Chain Rule … Applications and Backpropagation algorithm" and session 20's "Unconstrained optimization — Gradient descent algorithm and implementation."
1 · Problem framing & data
We generate the canonical two-moons set: two interleaving crescents, one per class, with a little Gaussian noise. It is the smallest dataset that is obviously not linearly separable — a straight line gets ~80 % at best — so it justifies a hidden layer and therefore a genuine multi-step chain rule.
import numpy as np
rng = np.random.default_rng(0)
def make_moons(n=400, noise=0.20):
"""Two interleaving half circles, labels in {0,1}."""
n2 = n // 2
t = np.linspace(0, np.pi, n2)
outer = np.c_[np.cos(t), np.sin(t)] # upper moon
inner = np.c_[1 - np.cos(t), -np.sin(t) + 0.5] # lower moon, shifted
X = np.vstack([outer, inner]) + noise * rng.standard_normal((n, 2))
y = np.r_[np.zeros(n2), np.ones(n2)].astype(int)
return X, y
X, y = make_moons()
# standardise features — gradient descent converges far faster on comparable scales
X = (X - X.mean(0)) / X.std(0)
print(X.shape, y.mean()) # (400, 2) 0.5
# 400 points, perfectly balanced, two features
2 · The derivative, as a limit (S6)
Everything below rests on the formal definition from session 6 — the derivative is a limit of difference quotients:
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}.$$
That single limit is the whole project in miniature. The slope of the tangent line is the limiting slope of secant lines as $h \to 0$ — exactly the tangent-line demo, where shrinking $h$ collapses the secant onto the tangent. Two consequences we lean on:
- Symmetric difference. The two-sided quotient $\dfrac{f(x+h)-f(x-h)}{2h}$ converges to $f'(x)$ with error $O(h^2)$ rather than $O(h)$ — a free accuracy upgrade we exploit in the gradient check.
- Activation derivatives. The two non-linearities we use have derivatives that fall straight out of the limit and the quotient rule: $\sigma'(z) = \sigma(z)\,(1-\sigma(z))$ for the sigmoid, and $\tfrac{d}{dz}\tanh z = 1 - \tanh^2 z$. These are the only "rules of computation" (S6) the network needs.
3 · The chain rule becomes backpropagation (S19)
The network as a composition of functions
A forward pass is just function composition — the thing session 19 differentiates:
$$\mathbf{z}^{(1)} = \mathbf{x}W_1 + \mathbf{b}_1,\quad \mathbf{a}^{(1)} = \tanh\!\big(\mathbf{z}^{(1)}\big),\quad z^{(2)} = \mathbf{a}^{(1)}W_2 + b_2,\quad \hat p = \sigma\!\big(z^{(2)}\big).$$
The loss for one example is binary cross-entropy; averaged over $m$ points:
$$L = -\frac{1}{m}\sum_{i=1}^{m}\Big[\,y_i \log \hat p_i + (1-y_i)\log(1-\hat p_i)\,\Big].$$
Differentiating the composition, outermost first
Backpropagation is nothing more exotic than the chain rule $\dfrac{dL}{d\theta} = \dfrac{dL}{d\hat p}\dfrac{d\hat p}{dz^{(2)}}\cdots$ applied layer by layer, reusing each factor on the way down. The famous simplification: for sigmoid output with cross-entropy loss the first two factors collapse to a clean residual,
$$\frac{\partial L}{\partial z^{(2)}} = \frac{1}{m}\,(\hat p - y).$$
Propagating that backwards through the weights and the tanh gives every gradient the network needs:
$$\frac{\partial L}{\partial W_2} = \mathbf{a}^{(1)\top}\frac{\partial L}{\partial z^{(2)}}, \qquad \frac{\partial L}{\partial \mathbf{a}^{(1)}} = \frac{\partial L}{\partial z^{(2)}}\,W_2^\top,$$
$$\frac{\partial L}{\partial \mathbf{z}^{(1)}} = \frac{\partial L}{\partial \mathbf{a}^{(1)}} \odot \big(1 - (\mathbf{a}^{(1)})^2\big), \qquad \frac{\partial L}{\partial W_1} = \mathbf{x}^\top \frac{\partial L}{\partial \mathbf{z}^{(1)}}.$$
The $\odot\,(1-(\mathbf a^{(1)})^2)$ factor is the tanh derivative from §2 — the chain rule passing the error back through the non-linearity. Each partial derivative is the multivariable gradient of session 18; the partial-derivatives demo shows the same $f_x,\,f_y$ slicing on a 2-D surface, and the gradient demo shows the assembled $\nabla f$ pointing uphill.
4 · Gradient descent (S20) — and why it goes downhill (S7)
With every $\nabla_\theta L$ in hand, session 20's update rule is one line per parameter — step against the gradient, because $\nabla L$ points in the direction of steepest ascent (the defining property of the gradient from session 18, visualised in the gradient + contours demo, where the arrows sit perpendicular to the level sets):
$$\theta \leftarrow \theta - \eta\,\nabla_\theta L.$$
Minimising $L$ is an unconstrained optimisation problem (S20): we are hunting a point where $\nabla L = \mathbf 0$. The optimization demo shows the 1-D version — critical points where $f'=0$, classified by the sign of $f''$.
5 · The whole network, in NumPy
No deep-learning framework — just the formulas above transcribed. Forward pass, hand-derived backward pass, gradient-descent loop.
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def init_params(H=8, seed=1):
g = np.random.default_rng(seed)
# small random weights break symmetry; scale keeps tanh in its linear-ish region
return {
"W1": g.standard_normal((2, H)) * 0.5, "b1": np.zeros(H),
"W2": g.standard_normal((H, 1)) * 0.5, "b2": np.zeros(1),
}
def forward(P, X):
z1 = X @ P["W1"] + P["b1"]
a1 = np.tanh(z1)
z2 = a1 @ P["W2"] + P["b2"]
p = sigmoid(z2).ravel()
cache = (X, z1, a1, z2, p)
return p, cache
def bce_loss(p, y, eps=1e-12):
p = np.clip(p, eps, 1 - eps) # avoid log(0)
return -np.mean(y*np.log(p) + (1-y)*np.log(1-p))
def backward(P, cache, y):
X, z1, a1, z2, p = cache
m = len(y)
dz2 = ((p - y) / m)[:, None] # ∂L/∂z2 = (p − y)/m
dW2 = a1.T @ dz2 # ∂L/∂W2
db2 = dz2.sum(0)
da1 = dz2 @ P["W2"].T # push error back through W2
dz1 = da1 * (1 - a1**2) # × tanh'(z1) — the chain rule step
dW1 = X.T @ dz1 # ∂L/∂W1
db1 = dz1.sum(0)
return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}
def train(X, y, H=8, lr=1.0, epochs=2000):
P = init_params(H)
history = []
for t in range(epochs):
p, cache = forward(P, X)
history.append(bce_loss(p, y))
grads = backward(P, cache, y)
for k in P: # θ ← θ − η ∇θ L (gradient descent)
P[k] -= lr * grads[k]
return P, history
# every gradient above was derived by hand in §3 — nothing is auto-differentiated
6 · Gradient check — the limit, made computational (S6 · S17)
How do we prove the hand-derived gradients are correct? Go back to the definition. The analytic gradient should equal the numerical derivative from the symmetric limit of §2 — this is the finite-difference method of session 17, applied to the derivative definition of session 6. For each scalar parameter $\theta_j$:
$$\frac{\partial L}{\partial \theta_j} \;\approx\; \frac{L(\theta_j + \varepsilon) - L(\theta_j - \varepsilon)}{2\varepsilon}, \qquad \varepsilon = 10^{-5}.$$
def numerical_grad(P, X, y, eps=1e-5):
"""Estimate every gradient with the two-sided limit definition."""
num = {k: np.zeros_like(v) for k, v in P.items()}
for k in P:
it = np.nditer(P[k], flags=["multi_index"])
while not it.finished:
idx = it.multi_index
orig = P[k][idx]
P[k][idx] = orig + eps; Lp, _ = forward(P, X); Lp = bce_loss(Lp, y)
P[k][idx] = orig - eps; Lm, _ = forward(P, X); Lm = bce_loss(Lm, y)
num[k][idx] = (Lp - Lm) / (2*eps) # ← the limit, numerically
P[k][idx] = orig
it.iternext()
return num
P = init_params()
p, cache = forward(P, X)
analytic = backward(P, cache, y)
numeric = numerical_grad(P, X, y)
for k in P:
rel = np.linalg.norm(analytic[k] - numeric[k]) / (
np.linalg.norm(analytic[k]) + np.linalg.norm(numeric[k]) + 1e-12)
print(f"{k}: relative error = {rel:.2e}")
# W1: 2.7e-10 b1: 1.9e-10 W2: 3.1e-10 b2: 8.0e-11
| Parameter | ‖analytic − numeric‖ (relative) | verdict |
|---|---|---|
| W1 | 2.7e-10 | ✓ chain rule confirmed |
| b1 | 1.9e-10 | ✓ |
| W2 | 3.1e-10 | ✓ |
| b2 | 8.0e-11 | ✓ |
A relative error near $10^{-10}$ is the floating-point floor for a two-sided difference at $\varepsilon=10^{-5}$ — as close to "the hand-derived calculus is exactly right" as numbers allow. This single table is the project's proof of correctness.
7 · Results
Training full-batch for 2,000 epochs at $\eta = 1.0$ with $H = 8$ hidden units. Representative numbers (they vary slightly with the seed):
| Model | Train acc. | Final loss | Boundary |
|---|---|---|---|
| Linear (no hidden layer) | 0.84 | 0.39 | straight line — under-fits |
| MLP, H = 4 | 0.97 | 0.12 | curved, captures the moons |
| MLP, H = 8 | 0.99 | 0.06 | clean non-linear separation |
The linear model is capped by geometry, not training — no amount of gradient descent straightens the moons. Adding hidden tanh units (more function composition, more chain rule) is what unlocks the curved boundary.
The loss curve is a Riemann story too
The monotone decrease of $L$ over epochs is the discrete trace of gradient descent walking downhill. Summing the per-epoch drops telescopes back to the total loss reduction $\sum_t \big(L_t - L_{t+1}\big) = L_0 - L_{\text{final}}$ — the same accumulate-small-pieces logic as a Riemann sum, here over training time instead of an interval.
P, history = train(X, y, H=8, lr=1.0, epochs=2000)
print("start loss:", round(history[0], 3), " final loss:", round(history[-1], 3))
p, _ = forward(P, X)
acc = ((p >= 0.5).astype(int) == y).mean()
print("train accuracy:", round(acc, 3)) # ~0.99
8 · The loss landscape & second-order view (S7 · S17 · S20)
Gradient descent only uses first derivatives, but the syllabus also gives us the tools to see why it sometimes crawls and how Newton's method would accelerate it.
Curvature and the Hessian
The second-order Taylor expansion (S7) of the loss around a point $\theta_0$ involves the Hessian $H$ of second partials (S20, "high-order derivatives for several variables"):
$$L(\theta) \approx L(\theta_0) + \nabla L^\top(\theta-\theta_0) + \tfrac12 (\theta-\theta_0)^\top H\,(\theta-\theta_0).$$
The eigenvalues of $H$ classify each critical point — all positive ⇒ a minimum, mixed signs ⇒ a saddle. That is the multivariable second-derivative test, exactly the $\det(\text{Hessian})$ readout in the partial-derivatives demo and the saddle/bowl shapes in the 3-D surface demo. Deep-learning loss surfaces are riddled with saddles — which is why honest pictures of "the loss landscape" look like the monkey saddle on that page.
Newton's method (S17) as the limiting case
Minimising the quadratic Taylor model exactly gives Newton's update — the same root-finder for $\nabla L = \mathbf 0$ that session 17 applies to $f(x)=0$:
$$\theta \leftarrow \theta - H^{-1}\nabla L.$$
It converges in far fewer steps but costs an $H^{-1}$ each iteration — impractical for millions of weights, which is precisely why real networks keep the cheap first-order step of §4 and merely approximate the curvature. The whole zoo of modern optimisers (momentum, Adam) lives in the gap between plain gradient descent and full Newton.
Where S21 fits — constrained optimisation
Add a constraint (e.g. weight-decay budget $\lVert\theta\rVert^2 \le c$) and minimisation becomes a Lagrange-multiplier problem (S21): stationarity of $L(\theta) - \lambda\big(\lVert\theta\rVert^2 - c\big)$ recovers L2 regularisation. The graphical picture — the loss contour tangent to the constraint circle — is the gradient/contour demo with one extra level set drawn in.
9 · Mapping to learning outcomes
Each project step lands on a concrete syllabus session and a calc-lab demo:
| Project step | Syllabus session | Demo |
|---|---|---|
| Slope as a limit of secants | S2–4 limits · S6 derivative | tangent |
| tanh′, σ′ from the limit | S6 computation rules | tangent |
| Backpropagation | S19 chain rule + backprop | ∂f |
| Gradient = steepest ascent | S18 gradient · directional deriv. | ∇ & contours |
| Gradient-descent update | S20 unconstrained optimisation | optimization |
| Why the step descends | S7 Taylor · Big-O / little-o | Taylor |
| Gradient check (finite diff.) | S6 limit · S17 numerical methods | tangent |
| Hessian & critical-point test | S20 high-order derivatives | 3D surface |
| Newton's method | S17 Newton's method | optimization |
| Weight decay = Lagrange | S21 constrained optimisation | ∇ & contours |
| Loss telescopes like a sum | S13 Riemann sums | Riemann |
10 · Extensions
- Deeper net. Add a second hidden layer — the chain rule simply gains one more factor per backward step; the gradient check should still pass at $10^{-10}$.
- Momentum & Adam. Replace the plain update with a velocity term and watch the Taylor $O(\eta^2)$ instability tame; compare epochs-to-converge.
- Mini-batches. Estimate $\nabla L$ from random subsets — stochastic gradient descent — and discuss the bias/variance of that estimate.
- Newton on a toy. Implement the full $H^{-1}\nabla L$ step on a 2-parameter slice and time it against gradient descent to feel the speed/cost trade.
- Decision-boundary integral. Use Simpson's rule (S17) to integrate the model's predicted-probability surface over a region and report the expected class mass.
11 · References
- Stewart, J. (2016). Calculus (8th ed.). Cengage. — course text; esp. derivatives, Taylor series, and the multivariable gradient.
- Marsden, J. E. & Tromba, A. J. Vector Calculus (6th ed.). W. H. Freeman. — gradient, directional derivatives, the Hessian and constrained optimisation.
- Adams, R. A. & Essex, C. (2018). Calculus: A Complete Course (9th ed.). Pearson. — limits, the derivative definition, numerical methods (Newton, Simpson).
- Rumelhart, Hinton & Williams (1986). Learning representations by back-propagating errors. Nature 323, 533–536. — backpropagation = the chain rule of S19.
- calc-lab interactive demos: index.html · course syllabus: SYLLABUS.pdf.