Automatic differentiation is the foundation upon which deep learning frameworks lie. Deep learning models are typically trained using gradient based techniques, and autodiff makes it easy to get gradients, even from enormous, complex models. ‘Reverse-mode autodiff’ is the autodiff method used by most deep learning frameworks, due to its efficiency and accuracy.

Let’s:

  • Look at how reverse-mode autodiff works.
  • Create a minimal autodiff framework in Python.

The small autodiff framework will deal with scalars. We can (and will later) use NumPy to vectorise it.

Note on terminology: from now on ‘autodiff’ will refer to ‘reverse-mode autodiff’. ‘Gradient’ is used loosely, but in this context generally means ‘first order partial derivative’.

How does autodiff work?

Let’s start with an example.

a = 4
b = 3
c = a + b  # = 4 + 3 = 7
d = a * c  # = 4 * 7 = 28

Q1: What is the gradient of $d$ with respect to $a$, i.e. $\frac{\partial{d}}{\partial{a}}$? (Go ahead and try this!)

Solving the ‘traditional’ way:

There’s many ways to solve Q1, but let’s use the product rule, i.e. if $ y = x_1x_2 $ then $y’ = x_1’x_2 + x_1x_2’$.

Phew… and if you wanted to know $\frac{\partial{d}}{\partial{b}}$ you’d have to carry out the process again.

Solving the autodiff way

We’ll now look at the autodiff way to solve Q1. Here is a figure:

The system as a graph.

On the left we see the system represented as a graph. Each variable is a node; e.g. $d$ is the topmost node, and $a$ and $b$ are leaf nodes at the bottom.

On the right we see the system from autodiff’s point of view. Let’s call the values on the graph edges local derivatives. By using local derivatives and simple rules, we will be able to compute the derivatives that we want.

Here is the answer to Q1, calculated the autodiff way. Can you see how it relates to the figure?

We get this answer from the graph by finding the routes from $d$ to $a$ (not going against the dotted arrows), and then applying the following rules:

  • Multiply the edges of a route.
  • Add together the different routes.

The first route is straight from $d$ to $a$, which gives us the $\frac{\partial{\bar{d}}}{\partial{a}}$ term. The second route is from $d$ to $c$ to $a$, which gives us the term $\frac{\partial{\bar{d}}}{\partial{c}} * \frac{\partial{\bar{c}}}{\partial{a}}$.

Our autodiff implementation will go down the graph and compute the derivative of $d$ with respect to every sub-node, rather than just computing it for a particular node, as we have just done with $d$. Notice that we could compute the gradient of $d$ with respect to $c$, and $b$, without much more work.

Local derivatives

We saw ‘local derivatives’ on the graph edges above, written in the form: $ \frac{\partial \bar{y} }{\partial x}$.

The bar is to convey that we are doing a simpler kind of differentiation.

In general: to get a local derivative, treat the variables going into a node as not being functions of other variables.

For example, recall that $d = a * c$. Then compare $ \frac{\partial{d}}{\partial{a}} = 2a + b $, to $ \frac{\partial \bar{d} }{\partial a} = c $. The local derivative $\frac{\partial \bar{d} }{\partial a} = c$, is obtained by treating $c$ as a constant before differentiating the expression for $d$.

Local derivatives make our lives easier.

It is often easy to define the local derivatives of simple functions, and adding functions to the autodiff framework is easy if you know the local derivatives. E.g.

Addition: $n = a + b$. The local derivatives are: $\frac{\partial \bar{n} }{\partial a} = 1$ and $\frac{\partial \bar{n} }{\partial b} = 1$.

Multiplication: $n = a * b$. The local derivatives are: $\frac{\partial \bar{n} }{\partial a} = b$ and $\frac{\partial \bar{n} }{\partial b} = a$.

Let’s create the framework

Description of the implementation.

A node contains two pieces of data:

  • value - the value of the node.
  • grad - the node’s children & corresponding ‘local derivatives’.

The function get_gradients uses the nodes’ grad data to go through the graph recursively, computing the gradients. It uses the rules we saw above:

  • Multiply the edges of a route (into a node).
  • Add together the different routes (that lead to a node).

The tuples in stack (in get_gradients) are similar to the tuples in grad, but they contain the current route value, instead of the local derivative value.

Implementing just enough to solve the example above:

from collections import defaultdict

class Var:
    """A leaf node (a node with no children)."""
    def __init__(self, value):
        self.value = value  # the scalar value of the node.

class Add:
    """The node that results from adding two nodes."""
    def __init__(self, a, b):
        self.value = a.value + b.value
        self.grad = [(a, 1), (b, 1)]  # child nodes & corresponding 'local derivatives'

class Mul:
    """The node that results from multiplying two nodes."""
    def __init__(self, a, b):
        self.value = a.value * b.value
        self.grad = [(a, b.value), (b, a.value)]

def get_gradients(parent_node):
    """Go down the graph, and compute derivative of `parent_node` with respect to each node."""
    gradients = defaultdict(lambda : 0)
    stack = parent_node.grad.copy()  # list of (node, route_value) tuples.
    while stack:
        node, route_value = stack.pop()
        gradients[node] += route_value  # "Add together the different routes."
        if not isinstance(node, Var): 
            # if the node has children, put them onto the stack.
            for child_node, child_route_value in node.grad:
                stack.append((child_node, child_route_value * route_value))  # "Multiply the edges of a route."
    return dict(gradients)

Solving the example above:

a = Var(4)
b = Var(3)
c = Add(a, b) # = 4 + 3 = 7
d = Mul(a, c) # = 4 * 7 = 28

gradients = get_gradients(d)

print('d.value =', d.value)
print("The partial derivative of d with respect to a =", gradients[a])
d.value = 28
The partial derivative of d with respect to a = 11

Success!

We also got gradients for the other nodes:

print('gradients[b] =', gradients[b])
print('gradients[c] =', gradients[c])
gradients[b] = 4
gradients[c] = 4

Let’s take a look at the grad values (the local derivatives):

print('dict(d.grad)[a] =', dict(d.grad)[a])
print('dict(d.grad)[c] =', dict(d.grad)[c])
print('dict(c.grad)[a] =', dict(c.grad)[a])
print('dict(c.grad)[b] =', dict(c.grad)[b])
dict(d.grad)[a] = 7
dict(d.grad)[c] = 4
dict(c.grad)[a] = 1
dict(c.grad)[b] = 1

We saw these in our example above as:

$ \frac{\partial \bar{d} }{\partial a} = c = 7$

$ \frac{\partial \bar{d} }{\partial c} = a = 4$

$ \frac{\partial \bar{c} }{\partial a} = 1$

$ \frac{\partial \bar{c} }{\partial b} = 1$


A few small improvements

Let’s:

  • Enable the use of operators, such as +, *, - …
  • Add a few more functions
class Ops:
    """Enables use of +, *, -, etc."""
    def __add__(self, other):
        return Add(self, other)
    
    def __mul__(self, other):
        return Mul(self, other)
    
    def __sub__(self, other):
        return Add(self, Neg(other))
    
    def __truediv__(self, other):
        return Mul(self, Inv(other))

class Var(Ops):
    def __init__(self, value):
        self.value = value
        
class Add(Ops):
    def __init__(self, a, b):
        self.value = a.value + b.value
        self.grad = [(a, 1), (b, 1)]

class Mul(Ops):
    def __init__(self, a, b):
        self.value = a.value * b.value
        self.grad = [(a, b.value), (b, a.value)]
    
class Neg(Ops):
    def __init__(self, var):
        self.value = -1 * var.value
        self.grad = [(var, -1)]

class Inv(Ops):
    def __init__(self, var):
        self.value = 1 / var.value
        self.grad = [(var, -var.value ** -2 )]        

Some more examples

We can get the gradients of arbitrary functions made from the functions we’ve added to the framework. E.g.

a = Var(230.3)
b = Var(33.2)

def f(a, b):
    return (a / b - a) * (b / a + a + b) * (a - b)

y = f(a, b)

gradients = get_gradients(y)

print("The partial derivative of y with respect to a =", gradients[a])
print("The partial derivative of y with respect to b =", gradients[b])
The partial derivative of y with respect to a = -153284.83150602411
The partial derivative of y with respect to b = 3815.0389441500993

We can use numerical estimates to check that we’re getting correct results:

delta = Var(1e-8)
numerical_grad_a = (f(a + delta, b) - f(a, b)) / delta
numerical_grad_b = (f(a, b + delta) - f(a, b)) / delta
print("The numerical estimate for a =", numerical_grad_a.value)
print("The numerical estimate for b =", numerical_grad_b.value)
The numerical estimate for a = -153284.89243984222
The numerical estimate for b = 3815.069794654846

It’s easy to add more functions

You just need to be able to define the local derivatives.

import numpy as np

class Sin(Ops):
    def __init__(self, var):
        self.value = np.sin(var.value)
        self.grad = [(var, np.cos(var.value))]

class Exp(Ops):
    def __init__(self, var):
        self.value = np.exp(var.value)
        self.grad = [(var, self.value)]
        
class Log(Ops):
    def __init__(self, var):
        self.value = np.log(var.value)
        self.grad = [(var, 1. / var.value)]

Let’s check that these work:

a = Var(43.)
b = Var(3.)
c = Var(2.)

def f(a, b, c):
    f = Sin(a * b) + Exp(c - (a / b))
    return Log(f * f) * c

y = f(a, b, c)

gradients = get_gradients(y)

print("The partial derivative of y with respect to a =", gradients[a])
print("The partial derivative of y with respect to b =", gradients[b])
print("The partial derivative of y with respect to c =", gradients[c])
The partial derivative of y with respect to a = 60.85353612046653
The partial derivative of y with respect to b = 872.2331479536114
The partial derivative of y with respect to c = -3.2853671032530305
delta = Var(1e-8)
numerical_grad_a = (f(a + delta, b, c) - f(a, b, c)) / delta
numerical_grad_b = (f(a, b + delta, c) - f(a, b, c)) / delta
numerical_grad_c = (f(a, b, c + delta) - f(a, b, c)) / delta

print("The numerical estimate for a =", numerical_grad_a.value)
print("The numerical estimate for b =", numerical_grad_b.value)
print("The numerical estimate for c =", numerical_grad_c.value)
The numerical estimate for a = 60.85352186602222
The numerical estimate for b = 872.232160009645
The numerical estimate for c = -3.285367089489455

That’s the end of our minimal autodiff implementation!

Of course there’s various features missing, such as:

  • Vectorisation.
  • Placeholder variables.
  • Nth derivatives.
  • Optimisations.
  • All the other amazing stuff deep learning / autodiff frameworks can do.

The most fruitful addition to our minimal framework would be vectorisation (but not carried out in the manner that follows).


A naive vectorisation

We will look at an incredibly computationally inefficient way to vectorise our autodiff framework. (It is not recommended to use this for anything, because of how slow it is.)

The approach is:

  • Put our Var objects from above into NumPy arrays
  • We can then use NumPy operations
import numpy as np

to_var = np.vectorize(lambda x : Var(x))  # convert NumPy array into array of Var objects
to_vals = np.vectorize(lambda var : var.value)  # get values from array of Var objects

A single linear layer artificial neural network (fitting noise to noise):

import matplotlib.pyplot as plt

def update_weights(weights, gradients, lrate):
    for i in range(weights.shape[0]):
        for j in range(weights.shape[1]):
            weights[i, j].value -= lrate * gradients[weights[i, j]]

np.random.seed(0)

input_size = 50
output_size = 10
lrate = 0.001

x = to_var(np.random.random(input_size))
y_true = to_var(np.random.random(output_size))
weights = to_var(np.random.random((input_size, output_size)))

loss_vals = []
for i in range(100):

    y_pred = np.dot(x, weights)

    loss = np.sum((y_true - y_pred) * (y_true - y_pred))
    loss_vals.append(loss.value)
    
    gradients = get_gradients(loss)

    update_weights(weights, gradients, lrate)

    
plt.plot(loss_vals)
plt.xlabel("Time step"); plt.ylabel("Loss"); plt.title("Single linear layer learning")
plt.show()

Plot of the loss of the linear layer.

Coming soon: a part two, where we look at how to vectorize our minimal framework more efficiency.

References: