Automatic Differentiation


Why Differentiation Matters

Training neural networks requires computing gradients of a loss function with respect to model parameters. Automatic differentiation (autodiff) does this efficiently and exactly.

Forward Mode vs Reverse Mode

  • Forward mode: propagates derivatives alongside the function evaluation, one pass per input direction. Efficient when there are few inputs.
  • Reverse mode: propagates derivatives backwards from the output, one pass per output. Efficient when there are few outputs (like a scalar loss), even with millions of inputs.

Deep learning uses reverse mode (backpropagation).
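Forward mode can be sketched with dual numbers: each value carries its derivative along with it, and every operation updates both via the usual calculus rules. This is a minimal illustration (the `Dual` class name is hypothetical, not a PyTorch API), differentiating the same polynomial used in the example below:

```python
# Forward-mode autodiff via dual numbers: each value carries its
# derivative alongside it. Minimal sketch, not a framework implementation.
class Dual:
    def __init__(self, val, dot=0.0):
        self.val = val   # function value
        self.dot = dot   # derivative w.r.t. the chosen input

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

    def __pow__(self, n):
        # power rule for a constant integer exponent
        return Dual(self.val ** n, n * self.val ** (n - 1) * self.dot)

# Seed dot=1.0 on x to get dy/dx in a single forward pass.
x = Dual(2.0, 1.0)
y = x ** 2 + 3 * x + 1
print(y.val, y.dot)  # 11.0 7.0
```

One forward pass yields the derivative with respect to one input; with a million parameters, forward mode would need a million passes, which is why training uses reverse mode instead.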

Computational Graphs

Autodiff works by building a computational graph of operations:

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x + 1
y.backward()
print(x.grad)  # dy/dx = 2x + 3 = 7.0
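A quick way to sanity-check an autodiff result is a central finite difference. Unlike autodiff, this is only an approximation and costs extra function evaluations, but it should agree closely:

```python
# Central-difference approximation of dy/dx at x = 2 for y = x**2 + 3x + 1.
# Autodiff gives the exact value 7.0; this numerical estimate should match
# it to high precision.
def f(x):
    return x ** 2 + 3 * x + 1

h = 1e-6
fd = (f(2.0 + h) - f(2.0 - h)) / (2 * h)
print(fd)  # approximately 7.0
```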

Chain Rule

The chain rule is the mathematical foundation:

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial w}

Reverse mode autodiff applies the chain rule systematically through the computational graph.
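The two-factor chain rule above can be traced by hand on a tiny graph. A sketch, assuming the illustrative functions y = w·x and L = y²:

```python
# Tiny graph: y = w * x, then L = y ** 2.
# Reverse mode walks from the output back toward the input, multiplying
# local derivatives: dL/dy = 2y, then dy/dw = x, so dL/dw = 2y * x.
w, x = 3.0, 2.0

# forward pass
y = w * x          # 6.0
L = y ** 2         # 36.0

# backward pass (chain rule applied output-to-input)
dL_dy = 2 * y      # local derivative of y ** 2
dy_dw = x          # local derivative of w * x with respect to w
dL_dw = dL_dy * dy_dw
print(dL_dw)  # 24.0
```

This output-to-input sweep reuses dL/dy for every parameter feeding into y, which is exactly what makes reverse mode cheap for a scalar loss.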
