Introduction
In high school mathematics, we learnt Differential Calculus, which introduced the terms ‘differentiation’ and ‘derivative’.
The concept of differentiating a function plays a significant role in today’s deep learning algorithms, among many of its other useful real-life applications.
In this post, I will revisit the core idea of computing derivatives through its geometrical interpretation, and at the same time shed some light on the idea behind the efficient learning process of a neural network, known as backpropagation.
This post is highly motivated by a series of lecture videos by Andrej Karpathy; a reference is given at the end.
Differentiation
The term differentiation refers to the process of applying a mathematical operator to a function f with respect to its input x. The output of differentiation is referred to as the derivative (sometimes the gradient) and is denoted by \frac{df}{dx}.
Now, let’s assume that there is a function f(.) that can be differentiated w.r.t its input x, i.e. the derivative of f exists at some point x (in f’s domain of definition).
There are functions whose derivative does not exist at some point; for example, y = |x| cannot be differentiated at the point x=0.
Read about conditions for differentiability and other related concepts here.
Depending on the structure of the function f, differentiation may return another function, say g(.), or a constant.
The definition of derivative of a function f(.) w.r.t the input x is given below:
\frac{df}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{(x+h) - x}
It’s a ratio, where the denominator measures the change in the input x and the numerator measures the change in output f(x), hence indicating the rate of change in f w.r.t x.
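To build some intuition before the geometry, here is a minimal numerical sketch of this definition: as h shrinks, the ratio approaches the true derivative. The function below is an arbitrary example, not one from the original post.

```python
def f(x):
    return 3 * x ** 2 - 4 * x          # exact derivative: 6x - 4, so f'(2) = 8

x = 2.0
for h in (1.0, 0.1, 0.001, 1e-6):
    # the ratio (f(x+h) - f(x)) / ((x+h) - x) from the definition
    print(h, (f(x + h) - f(x)) / h)    # tends to 8 as h -> 0
```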
The animation below, made with GeoGebra, makes the definition easier to understand by giving it a geometrical sense, especially the limiting nature of the ratio.
Once you open the animation in full-screen mode, you will see a slider controlling our h. On the left panel, there is a play button associated with h that can be used to run the animation.
As you move h very close to zero, you can watch the secant line passing through the points P and Q gradually become the tangent line to the curve at point A.
As you can see, the ratio \frac{f(x+h) - f(x)}{(x+h) - x} is nothing but the slope, i.e. tan(.) of the angle \angle{QPP'} in the triangle \triangle{QPP'}.
In the limit, as h approaches zero, the ratio gives the slope of the tangent line to the curve at the point (x, f(x)).
Therefore, geometrically, the derivative of a function is the slope (m) of the tangent line to the curve y = f(x) at the point x.
A tangent line is a straight line and can be represented by the general equation y = mx+c. Here m is tan(.) of the angle that the tangent line makes with the x-axis; m is called the slope or gradient, and c is the intercept that the line makes with the y-axis.
The tangent line of a curve at some point x is important because it describes the behaviour of the curve locally. Geometrically, if the tangent line makes an acute angle with the x-axis (which is the case here), the output is increasing; if it makes an obtuse angle, the output is decreasing; and if the tangent line is parallel to the x-axis, the output is not changing.
The same information can be obtained by evaluating the derivative at the point x: a positive limiting value of the ratio indicates a positive change in the output, a negative value indicates a negative change, and zero indicates no change.
Note that, so far, we have discussed only scalar-valued functions of a scalar input. We will now look at how to compute derivatives when the function, the input, or both are vector-valued.
Derivatives for different types of input mappings
Functions of type f: \mathbb{R}^n \to \mathbb{R} - vector in, scalar out
Such functions take a vector as input and return a scalar. The mean-squared-error and cross-entropy loss functions, heavily used in the neural network setup, are a couple of examples.
Suppose you have a continuous target variable (a.k.a dependent variable) Y corresponding to a feature (a.k.a independent variable) X for a regression problem.
Let us take N as the total number of such pairs (i.e. examples) (x_i, y_i)_{i=1}^N in our data. At the end of the learning process of a neural network, we obtain N predictions, which is another set of values (\hat{y}_i)_{i=1}^N.
To measure the goodness of fit of the predictions w.r.t the actual target values (the y_i’s), typically the mean-squared-error (MSE) loss function is used. It is defined as below:
mse(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2
We expect the mse value to be close to zero (or some predefined threshold depending on the use case and problem setup) to ensure the correctness of the prediction.
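As a minimal sketch, the MSE formula translates directly into code; the numbers below are toy values chosen for illustration.

```python
import numpy as np

def mse(y, y_hat):
    # mean of squared differences between targets and predictions
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean((y - y_hat) ** 2)

print(mse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))  # a small value indicates a good fit
```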
Consider a binary classification problem with X as the single feature variable and Y as the class labels which can take a value as either zero or one. This is also known as a binary logistic regression problem.
As before, we have a total of N examples (x_i, y_i)_{i=1}^N in our data, where \forall i, y_i \in \left\{0,1\right\}.
Binary classification is the simplest form of classification problem, which will help us to understand the concept easily.
Since Y can take only two possible values, it is safe to assume that Y follows a Bernoulli distribution.
Also assume,
\begin{align} Y &= 1 \Rightarrow \text{success} \nonumber \\ &= 0 \Rightarrow \text{failure} \nonumber \end{align}
Here we need one more quantity, p, the probability of success under the distribution of Y; it is an important attribute of the data that we have.
Now, following the Bernoulli distribution, the probability of Y taking an outcome y can be expressed in a generic way as given below:
P(Y = y) = p^y \times (1-p)^{1-y}; \text{where } y \in \left\{{0,1}\right\}
Note that putting y=1 in the above expression gives you p. This expression always returns a value between zero and one. Viewed as a function of the unknown quantity p with y observed, it is also known as the likelihood function.
Having defined the above expression, we are ready to look at the entire data. It is to be noted that the examples are independent of each other.
Now, the next step is to find the probability of observing exactly the target values y_1, y_2, \ldots, y_N that we have in our data for the classification task.
To be specific: whatever process generated the data, what would be the probability of it generating this sample again? This combined probability is called the likelihood.
Since these Y_i’s are independent of each other, this likelihood (equivalent to joint probability) of Y_1=y_1, \dots, Y_N=y_N is just the product of the individual likelihoods i.e.
\begin{align} P(Y_1=y_1, \dots, Y_N=y_N) &= \prod_{i=1}^N P(Y_i = y_i) \nonumber \\ &= \prod_{i=1}^N p_i^{y_i} \times (1-p_i)^{1-y_i} \nonumber \\ \end{align}
At this stage, a log function is typically applied to make the above quantity easier to work with in later calculations (turning the product into a sum). Since the log is monotonically increasing, it preserves the ordering of the function it is applied to, i.e. mathematically speaking,
\text{if } f(a) > f(b) > 0 \text{ for some } a, b \in \mathbb{R}, \text{ then } log(f(a)) > log(f(b))
Therefore we have, \begin{align} log\left[P(Y_1=y_1, \dots, Y_N=y_N)\right] &= \sum_{i=1}^N y_i log(p_i) + (1-y_i) log(1-p_i) \nonumber \end{align}
The above quantity is negative, since the log function returns negative values when its input lies between zero and one, and working with a negative quantity can be conceptually misleading. That is why it is usually multiplied by negative one to make it positive.
This positive quantity is often referred to as negative log likelihood (nll) and is considered as the loss function for this type of binary classification problems.
Hence, the negative log-likelihood loss function is finally expressed as,
nll = -\sum_{i=1}^N \left\{y_i log(p_i) + (1-y_i) log(1-p_i)\right\}
It should be noted that the quantity p_i is usually estimated by the model (such as a feed-forward network for binary classification, or logistic regression), whereas y_i is the true value in our data for the i^{th} sample.
The NLL loss function measures the mismatch between the distribution of the predicted values p and the distribution of the actual values y (i.e. the labels).
The cross-entropy loss function is just a generalisation of this simple negative log likelihood, giving the same kind of information for a prediction when the data has, say, K labels instead of just two.
It is expressed as,
\text{cross entropy loss} = - \sum_{k=1}^K y_k log(\hat{y}_k)
Where y_k’s are true probabilities (or give the true distribution of labels), whereas \hat{y}_k’s are predicted probabilities (or give the predicted/estimated distribution of labels).
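Both loss functions translate directly into code. Below is a minimal sketch; the labels and probabilities are illustrative toy values.

```python
import numpy as np

def nll(y, p):
    # y: true binary labels, p: predicted probabilities of success
    y, p = np.asarray(y), np.asarray(p)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def cross_entropy(y, y_hat):
    # y: true distribution over K labels, y_hat: predicted distribution
    return -np.sum(np.asarray(y) * np.log(np.asarray(y_hat)))

print(nll([1, 0, 1], [0.9, 0.2, 0.8]))            # binary case
print(cross_entropy([0, 1, 0], [0.1, 0.7, 0.2]))  # K = 3 labels
```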
The vector of gradients of such a function f is defined as below:
\frac{df}{d\vec{x}} = \left(\frac{df}{dx_1}, \ldots, \frac{df}{dx_n}\right)'
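To make the definition concrete, here is a minimal sketch that estimates this vector of gradients numerically, one coordinate at a time; the function f below is an arbitrary example.

```python
import numpy as np

def f(x):
    return x[0] ** 2 + 3 * x[1]            # df/dx1 = 2*x1, df/dx2 = 3

def numerical_gradient(f, x, h=1e-6):
    # perturb one coordinate at a time, holding the others fixed
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x)) / h
    return grad

print(numerical_gradient(f, np.array([1.0, 2.0])))   # approximately [2.0, 3.0]
```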
Functions of type f: \mathbb{R}^n \to \mathbb{R}^m - vector in, vector out
Such functions take a vector as input and return another vector, generally of a different dimension, as output.
A simple example can be a function f: \mathbb{R}^2 \to \mathbb{R}^3:
- input: \vec{x} = \left(x_1, x_2\right) and a matrix of parameters \mathbf{W}_{3 \times 2}
- output: \vec{y} = f(\vec{x}) = \left(w_{11}x_1+w_{12}x_2, w_{21}x_1+w_{22}x_2, w_{31}x_1+w_{32}x_2\right)
In this case, derivative of f(\vec{x}) will be a matrix of dimension m \times n, as given below:
\begin{align} \frac{d\vec{y}}{d\vec{x}} &= \left(\frac{d\vec{y}}{dx_1}, \ldots, \frac{d\vec{y}}{dx_n}\right) \nonumber \\ &= \begin{pmatrix} \frac{dy_1}{dx_1} & \frac{dy_1}{dx_2} & \ldots & \frac{dy_1}{dx_n} \\ \frac{dy_2}{dx_1} & \frac{dy_2}{dx_2} & \ldots & \frac{dy_2}{dx_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{dy_m}{dx_1} & \frac{dy_m}{dx_2} & \ldots & \frac{dy_m}{dx_n} \\ \end{pmatrix} \nonumber \end{align}
This matrix is often referred to as the Jacobian matrix of \vec{y} with respect to \vec{x}.
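For the linear example above, the Jacobian is simply \mathbf{W} itself. We can verify this with PyTorch's torch.autograd.functional.jacobian; the values of \mathbf{W} and \vec{x} below are arbitrary.

```python
import torch

W = torch.tensor([[1., 2.],
                  [3., 4.],
                  [5., 6.]])              # parameters, shape (3, 2)

def f(x):
    return W @ x                          # f maps R^2 to R^3

x = torch.tensor([0.5, -1.0])
J = torch.autograd.functional.jacobian(f, x)
print(J)                                  # shape (3, 2); equal to W
```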
The chain rule of differentiation
Consider the situation, where you have two different functions f(.) and g(.) defined as follows y = f(x) and z = g(y) = g\left(f(x)\right).
While forming z, f(.) is the inner function and g(.) is the outer function. This is known as function composition and is often denoted as (g \circ f)(x).
We want to measure the rate of change in z w.r.t x i.e. \frac{dz}{dx}.
This derivative is computed as follows:
\frac{dz}{dx} = \frac{dz}{dy} \times \frac{dy}{dx}
and it is known as the chain rule of differentiation.
Here, the actual derivative is computed in two steps,
- the local derivative of z w.r.t y as z explicitly depends on y
- the local derivative of y w.r.t x as y explicitly depends on x
and these are multiplied together to form the actual derivative.
Wikipedia has an intuitive explanation to this as given below:
Intuitively, the chain rule states that knowing the instantaneous rate of change of z relative to y and that of y relative to x allows one to calculate the instantaneous rate of change of z relative to x as the product of the two rates of change.
As put by George F. Simmons: “if a car travels twice as fast as a bicycle and the bicycle is four times as fast as a walking man, then the car travels 2 × 4 = 8 times as fast as the man.”
This chain rule is very important as it does all the heavy lifting for backpropagation.
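A quick numerical check of the chain rule, using two arbitrary example functions:

```python
import math

def f(x): return x ** 2           # inner function, y = f(x)
def g(y): return math.sin(y)      # outer function, z = g(y)

x, h = 1.3, 1e-6
y = f(x)
dy_dx = (f(x + h) - f(x)) / h            # local derivative dy/dx, roughly 2x
dz_dy = (g(y + h) - g(y)) / h            # local derivative dz/dy, roughly cos(y)
dz_dx = (g(f(x + h)) - g(f(x))) / h      # direct derivative dz/dx

print(dz_dy * dy_dx, dz_dx)              # the two values closely agree
```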
Partial differentiation
Partial differentiation is mainly applicable to a function that depends on several input variables simultaneously.
For example, we might have a function like f(x,y,z) = ax+by+cz+d representing a plane in a three dimensional coordinate system.
If we would like to measure the rate of change only along the x-axis, we perform partial differentiation w.r.t x, treating the other inputs as constants.
Unlike the previous one (i.e. the total derivative), a partial derivative is denoted by \frac{\partial f}{\partial x} and its definition is given below:
\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x+h,y,z) - f(x,y,z)}{(x+h) - x}
In the neural network setup, we usually deal with complex functions that depend on several variables. Hence, it will be meaningful to use partial derivatives while demystifying the underlying operations.
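As a minimal sketch of this definition, a finite-difference estimate of the partial derivative of the plane above recovers the coefficient a; the coefficient values here are arbitrary.

```python
a, b, c, d = 2.0, -1.0, 0.5, 3.0

def f(x, y, z):
    return a * x + b * y + c * z + d     # the plane f(x, y, z) = ax + by + cz + d

x, y, z, h = 1.0, 2.0, 3.0, 1e-6
df_dx = (f(x + h, y, z) - f(x, y, z)) / h   # y and z are held fixed
print(df_dx)                                 # approximately 2.0, i.e. the coefficient a
```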
The computational graph
A computational graph is a directed acyclic graph that keeps track of the sequence of operations performed during the run time of an algorithm.
Let us take an example to understand it.
The above diagram is typically referred to as a computational graph, where both a and b are inputs to the graph and l is the output of the graph.
Each box in the graph is called a node where a mathematical operation is performed. So, if you imagine, a deep neural network is actually such a massive computational graph where, depending on the depth of the network, millions (or maybe billions) of mathematical operations are performed.
A partial view of a graph is shown below so that we can imagine how large and complex a computational graph can be for a real neural network:
Differentiation through a graph
Neural networks learn (i.e. update their parameters) through an iterative process that, at its core, uses the backpropagation algorithm to compute derivatives of a loss function w.r.t each node in the graph.
Let us revisit the previously shown simple computational graph to understand it in a better way,
Here, we can treat l as the loss, and we want to measure the rate of change of the loss w.r.t each of the nodes (including the loss node itself) in the backward direction (why backward? … I will explain later).
This whole computation is performed in multiple steps, as follows (a small code check of these gradients is given after the list):
- Step 1: rate of change of l w.r.t l itself i.e. \frac{\partial l}{\partial l} = 1
- Step 2: rate of change of l w.r.t e i.e. \frac{\partial l}{\partial e} = \frac{\partial (e+f)}{\partial e} \times \frac{\partial l}{\partial l} = 1
- Step 3: rate of change of l w.r.t f i.e. \frac{\partial l}{\partial f} = \frac{\partial (e+f)}{\partial f} \times \frac{\partial l}{\partial l} = 1
- Step 4: rate of change of l w.r.t c i.e. \frac{\partial l}{\partial c} = \frac{\partial e}{\partial c} \times \frac{\partial l}{\partial e} = \frac{\partial (c*d)}{\partial c} \times \frac{\partial l}{\partial e} = d
- Step 5: rate of change of l w.r.t d i.e. \frac{\partial l}{\partial d} = \frac{\partial e}{\partial d} \times \frac{\partial l}{\partial e} = \frac{\partial (c*d)}{\partial d} \times \frac{\partial l}{\partial e} = c
- Step 6: rate of change of l w.r.t a i.e. \frac{\partial l}{\partial a}; this step can be broken into two parts, since changing a can change l either through c or through d. Therefore,
- \frac{\partial l}{\partial a} = \underbrace{\left( \frac{\partial c}{\partial a} \times \frac{\partial l}{\partial c} \right)}_{\text{changes through c}} + \underbrace{\left( \frac{\partial d}{\partial a} \times \frac{\partial l}{\partial d} \right)}_{\text{changes through d}} = d + c
- Step 7: rate of change of l w.r.t b i.e. \frac{\partial l}{\partial b} = \frac{\partial d}{\partial b} \times \frac{\partial l}{\partial d} = 1 \times c = c
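Since the diagram itself is not reproduced in text, let us assume, for concreteness, a structure consistent with the steps above: c depends only on a (with \frac{\partial c}{\partial a} = 1), d = a + b, e = c \times d and l = e + f. Under that assumption, a few lines of PyTorch reproduce the gradients we just derived.

```python
import torch

# leaf nodes; the values are arbitrary
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
f = torch.tensor(4.0, requires_grad=True)

c = a * 1.0        # c depends only on a, with dc/da = 1 (assumed structure)
d = a + b          # d depends on both a and b (assumed structure)
e = c * d
l = e + f          # the "loss"

l.backward()       # reverse-mode differentiation through the graph

print(a.grad)      # d + c = 5 + 2 = 7   (Step 6)
print(b.grad)      # c = 2               (Step 7)
print(f.grad)      # 1                   (Step 3)
```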
The real benefit of computing derivatives (a.k.a gradients) in the backward direction is that it allows the learning algorithm to compute gradients for all the nodes in a single pass.
If we started in the forward direction instead, the algorithm would have to traverse the entire graph once for each of the input nodes. For anything beyond a simple graph like the one above, this forward-mode differentiation is computationally very costly for a large network.
PyTorch, being a popular deep learning toolkit, implements a .backward() method to compute the derivatives for all the nodes (technically not all, only those with requires_grad=True) in a computational graph in a single pass.
Important takeaways
When we have a '+' operation in the graph, like the one below,
Computing gradients with respect to the input nodes is fairly simple as the backward gradient computation just copies the gradient of the output node to all the input nodes.
However, when we have a '\times' operation in the graph, like the next one,
The gradient of one input node is just the product of the gradient of the output node and the value of the other input node.
+ and \times are the two basic building blocks from which more complex mathematical computations are composed.
Sorting nodes before performing backpropagation
If we observe the above graph, it is clear that while performing backpropagation, computing the gradient of any of the leaf nodes, i.e. \left\{a, b\right\}, is possible only when the gradient value for the node c is already available. This is a direct consequence of the chain rule we use.
So we cannot pick a node at random and compute its gradient: we must first compute the gradient of node c (which is essentially 1), and only then take either node a or node b.
Nodes must therefore be ordered before performing backpropagation, and Topological Sort is a way to achieve this ordering.
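Here is a minimal sketch of such a topological sort, in the spirit of Karpathy's micrograd. Each node is assumed to expose the nodes feeding into it through a hypothetical parents attribute (micrograd itself uses _prev), and a hypothetical _backward method that applies the local chain rule.

```python
def topological_order(root):
    # depth-first traversal: a node is appended only after every node
    # feeding into it (its parents) has been appended
    order, visited = [], set()

    def visit(node):
        if node not in visited:
            visited.add(node)
            for parent in node.parents:   # hypothetical attribute name
                visit(parent)
            order.append(node)

    visit(root)
    return order

# Backpropagation then walks the list from the output back to the leaves,
# so every node's gradient is ready before it is needed:
#
#     for node in reversed(topological_order(loss_node)):
#         node._backward()
```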
The efficient vector-Jacobian product
Consider our tiny network:
It has only two input nodes, a single output node and in between them three additional nodes forming the hidden layer.
The mapping from input layer to hidden layer is supported by a matrix of parameters \mathbf{W}_{3 \times 2} and of course by a vector valued linear function, say, \vec{f}.
Hidden layer values are formed using the equation below:
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
where w_{ij}; i=1(1)3, j=1(1)2 is the weight for the connection coming from the j^{th} node in the input layer to the i^{th} neuron in the hidden layer. Also consider a differentiable scalar-valued function g(.) that combines the hidden layer output values to form the final output l.
Computing l from the given fixed set of input values and with the help of some values of the parameters, is called the forward pass.
At this stage, we can compute the Jacobian matrix of \vec{f} w.r.t \vec{x}, which is given below:
\mathbf{J} = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} \\ \frac{\partial y_3}{\partial x_1} & \frac{\partial y_3}{\partial x_2} \end{pmatrix}
Now, suppose \vec{v} is the gradient vector of l i.e. \vec{v} = \left(\frac{dl}{dy_1}, \frac{dl}{dy_2}, \frac{dl}{dy_3}\right)' and we will assume that we have already computed it. Strange? No problem, I will explain this assumption at the end of this section.
Now, if we take the product of \mathbf{J}^T and \vec{v}, this is what we are going to get:
\begin{align} \mathbf{J}^T \cdot \vec{v} &= \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \frac{\partial y_3}{\partial x_1} \\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \frac{\partial y_3}{\partial x_2} \\ \end{pmatrix} \begin{pmatrix} \frac{\partial l}{\partial y_1} \\ \frac{\partial l}{\partial y_2} \\ \frac{\partial l}{\partial y_3} \\ \end{pmatrix} \nonumber \\ &= \begin{pmatrix} \frac{\partial l}{\partial x_1} \\ \frac{\partial l}{\partial x_2} \end{pmatrix} \nonumber \\ &= \frac{\partial l}{\partial \vec{x}} \nonumber \end{align}
If we observe the dot product carefully, we will notice a series of chain rules being performed.
Therefore, this vector-Jacobian product helps us compute the gradient of the loss function w.r.t the nodes in another layer, one which, in reality, might sit far behind the output layer in the computational graph, and it does so with the help of a known gradient vector for the layer right after it.
The intermediate gradient vector is assumed to have been computed beforehand because, in reverse-mode differentiation, we must know the gradient of the output node before computing the gradients of the input nodes that define it; we saw this above, where we started taking derivatives from the output node.
Here \vec{y} is the output of f(\vec{x}), so we can safely take this assumption to understand the theory.
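Here is a small check of this vector-Jacobian product, assuming an arbitrary upstream gradient \vec{v}: in PyTorch, torch.autograd.grad with grad_outputs=v computes exactly \mathbf{J}^T \cdot \vec{v}.

```python
import torch

W = torch.tensor([[1., 2.],
                  [3., 4.],
                  [5., 6.]])
x = torch.tensor([0.5, -1.0], requires_grad=True)
v = torch.tensor([0.1, 0.2, 0.3])     # assumed, already-computed dl/dy

y = W @ x                             # forward pass through the layer
dl_dx, = torch.autograd.grad(y, x, grad_outputs=v)

print(dl_dx)                          # equal to W.T @ v, i.e. dl/dx
```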
Now, the next part is to compute \frac{\partial l}{\partial \mathbf{W}} which can again be done using,
\frac{\partial l}{\partial \mathbf{W}} = \frac{\partial \vec{y}}{\partial \mathbf{W}} \times \frac{\partial l}{\partial \vec{y}}
Hopefully, this gives us an idea of how to implement automatic differentiation (autograd) for a large computational graph.
The training loop
The last, and one of the most important, parts of training a neural network is the training loop. Technically it is a for loop in which a few steps are performed at each iteration. Each of these iterations is also called an epoch.
It is to be noted that, before entering into the loop, all the parameters are randomly initialised.
Below are the steps that are performed within a training loop.
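As a minimal sketch of such a loop in PyTorch (the model, data and hyperparameters below are illustrative assumptions, not from the original post):

```python
import torch

model = torch.nn.Linear(2, 1)          # parameters are randomly initialised
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(100, 2)                # toy inputs
Y = torch.randn(100, 1)                # toy targets

for epoch in range(50):
    y_hat = model(X)                   # 1. forward pass
    loss = loss_fn(y_hat, Y)           # 2. compute the loss
    optimizer.zero_grad()              # 3. clear gradients from the previous step
    loss.backward()                    # 4. backpropagation through the graph
    optimizer.step()                   # 5. update the parameters
```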
Conclusion
In conclusion, I hope that this post has provided you with valuable insights and information about the learning process of a neural network.
Remember that learning is a never-ending process, and there is always more to discover and explore. I have created a colab notebook based on Andrej’s lecture on YouTube, which will definitely be helpful to grasp the whole idea.
If you have any questions or feedback, please feel free to leave a comment or reach out to me directly.
Thank you for taking the time to read this blog, and I hope to see you again soon.