Back to basics — coding neural network using NumPy alone
Recently I was explaining the concept of a neural network to my mother; I couldn’t go beyond the usual trope of inputs, hidden layers, outputs, etc. I don’t think I did a very good job of it, and it led me to think: do I understand them myself, or am I merely pretending that I understand them?
I took a piece of paper, made a few diagrams of connected networks, and in a Jupyter notebook coded the entire network using Keras layers. It took me less than 10 lines to code a multi-layer network, and it performed well.
I wasn’t satisfied, though; it felt as if I were merely posing that I understood the concept, and I would falter for sure if I went any deeper. I have read the workings of a neural network umpteen times and can superficially explain how they work, and I can whip out some maths too, but do I really understand them well enough to explain them lucidly to anyone? That was the question that became the earworm that evening. I decided to go back to first principles and code the entire neural network with nothing but high school maths. No faffing around, no use of any fancy frameworks like PyTorch, TensorFlow, or Keras.
It should be doable!
The underlying idea was clear to me, but I wasn’t sure how I would code it from the basics.
So, first things first, let’s just scribble on a piece of paper, first the idea, then translate the idea into mathematical equations and later code the mathematics into a program as concisely as possible.
Ok, I have conjured a crude plan to attack this problem.
I am sure all of you have seen something like this one time or another.
In the simplest form of a neural net, there is an input layer, a hidden layer, and an output one.
Had there been no hidden layer, the arrangement would have been called a perceptron, similar to the Mark I, which Frank Rosenblatt built in 1958.
To explain in one sentence, the idea behind a neural network is, given an input X, can we do some mathematical calculations to correctly identify X or classify X into one of the predefined categories? Just as our brains would do.
Let’s take an example of a one-layer network with inputs X1, X2; hidden layer neurons H1, H2; and the output Y1, Y2.
You estimate weights w1, w2, w3, etc., calculate outputs at each of the nodes, and eventually measure the final output Y* against the real value of Y. The difference between Y and Y* is the error, which you use to readjust the weights and calculate a new Y*; keep doing this till you get a value of Y* that is close to Y or satisfies you.
This was a watered-down version of what happens; in reality, there are a few more steps. A more sophisticated algorithm:
- Take inputs X
- Initialise the weights — this is an entirely uninformed process; one can initialise them randomly.
- Estimate a value for the bias term.
- Forward pass (feedforward) — with the help of the initialised weights, estimate the output at each of the nodes.
- Calculate the final outputs (the value of y).
- Find the error — the value estimated by your network minus the actual value you want to observe.
- Backward pass (backpropagation) — with the help of the calculated error, readjust the weights.
- Keep repeating steps 4 to 7 till the desired value is achieved.
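The loop above can be sketched in a few lines of NumPy. This is a toy, single-layer version (no hidden layer, no activation, squared error) just to make the steps concrete; the function name and the learning-rate scheme are my own simplifications, not the final network:

```python
import numpy as np

def train(X, Y, epochs, lr):
    # Steps 2-3: random weights, zero bias (an uninformed initialisation)
    W = np.random.randn(Y.shape[0], X.shape[0]) * 0.01
    b = np.zeros((Y.shape[0], 1))
    for _ in range(epochs):
        # Steps 4-5: forward pass - a linear combination of inputs and weights
        Y_hat = W @ X + b
        # Step 6: error = estimated value minus actual value
        error = Y_hat - Y
        # Step 7: backward pass - readjust weights using the error
        W -= lr * (error @ X.T) / X.shape[1]
        b -= lr * error.mean(axis=1, keepdims=True)
    return W, b
```

Each column of `X` is one example; repeating the forward pass, error, and update steps drives `Y_hat` towards `Y`.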
This is a regression equation: a linear combination of the inputs and the different weights (betas).
For figure 1, one can see that the output at H1 is: x1*w1 + x2*w2 + b1
The output at H2 is: x1*w3 + x2*w4 + b1, and so on.
At any point, it is a linear combination of the weights and the inputs. If we only take outputs at different nodes of a neural network, it will work as nothing but a fancier, more complicated linear regression. To avoid that, we need to introduce non-linearity, and that’s where activation functions come into the picture. There are many popular ones, such as sigmoid, tanh, softmax, and ReLU.
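These activation functions are one-liners in NumPy (a sketch; the network built later uses ReLU and softmax, but sigmoid and tanh are shown for comparison):

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # squashes any real number into the range (-1, 1)
    return np.tanh(z)

def relu(z):
    # passes positive values through, zeroes out negatives
    return np.maximum(0, z)
```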
The introduction of activation function in a neural network helps:
- Make the model non-linear.
- Imitate the neurons in the brain; the neurons in the brain activate when certain threshold conditions are met; an activation function brings a similar threshold to the neural network’s functioning.
- Map the output to a bounded range, e.g. between 0 and 1 in the case of sigmoid.
Once we apply the activation function, e.g. sigmoid, the output at a node will be: sigmoid(x1*w1 + x2*w2 + b1) = 1/(1 + e^-(x1*w1 + x2*w2 + b1)).
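For instance, with made-up numbers at node H1 (the values are purely illustrative):

```python
import numpy as np

x1, x2 = 0.5, 0.25           # inputs (made-up values)
w1, w2, b1 = 0.4, 0.2, 0.1   # weights and bias (made-up values)

z = x1 * w1 + x2 * w2 + b1   # linear output at H1: 0.35
a = 1.0 / (1.0 + np.exp(-z)) # sigmoid activation: roughly 0.587
```

The raw linear output 0.35 is squashed into (0, 1) by the sigmoid.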
Once all these values at all the nodes are calculated, the network will estimate the final output and calculate the error between the desired and the observed value.
Dataset: MNIST handwritten digits.
The license of the MNIST handwritten dataset — Creative Commons License http://www.pymvpa.org/datadb/mnist.html
There are 42,000 images and each of them has the dimension 28 x 28 = 784 pixels and they need to be classified into the 10 classes — 0, 1, 2, 3,…9.
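Since the network’s final output will be a 10-vector of class probabilities, the digit labels are typically one-hot encoded before training. A sketch (the function name is mine):

```python
import numpy as np

def one_hot(labels, n_classes=10):
    # turn a vector of digit labels into an (n_classes x examples) matrix
    # with a single 1 in each column, at the row given by the label
    Y = np.zeros((n_classes, labels.size))
    Y[labels, np.arange(labels.size)] = 1
    return Y
```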
The sizes of the various layers are important and need to be calculated beforehand.
Converting to Mathematics
The first layer of the network is the input layer, X, to which a linear combination of weights and a bias is applied to get the linear output Z.
The output Z at the first layer is made non-linear with the help of an activation function, here ReLU is used.
The non-linear output of ReLU is fed to the second hidden layer of the network, to which a softmax non-linearity is applied, and finally the output, A2, is produced.
Softmax will convert the output vector into a vector of probabilities.
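A numerically stable softmax can be sketched like this (subtracting the column-wise maximum is a standard trick; without it, `np.exp` overflows for large inputs):

```python
import numpy as np

def softmax(z):
    # z has one column per example; subtract the max for numerical stability
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    # each column now sums to 1: a vector of class probabilities
    return e / e.sum(axis=0, keepdims=True)
```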
A sidenote: write LaTeX easily on https://latexeditor.lagrida.com/
Till now, we were moving in the forward direction.
Once we have gotten the output, we’ll measure the error and conduct the backward propagation.
The weights that are chosen randomly at first are adjusted during the backpropagation. The weights decide the feature’s importance: the higher the numerical value of the weight, the more important the feature is.
Initially, we don’t really know what weights to assign, so we start with random weights but backpropagation works as a calibration mechanism that helps in updating the weights and bringing the network and the entire system to a balance.
The following partial derivatives are calculated; they help in ascertaining how a small change in quantities such as Z, b, or W impacts the error function.
The above is done to calculate errors in weights and biases and thus update them.
This all seems like a lot of maths. In simple words, we would be updating the weight and biases using the following equation:
New weight = old weight - learning rate * partial derivative of the total error with respect to that weight
e.g. in Figure 1, if we want to update weight 5, then
We know the total error; we just need to partially differentiate the error function with respect to w5. But because there aren’t any w5 terms in the error function, we use the chain rule of derivatives to find the individual partial derivatives and multiply them to get the relevant value (as we have done in equation set 4 above).
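The chain rule can be checked numerically with made-up values. Here a single output neuron has a sigmoid activation and a squared error E = 0.5*(a - y)^2; every number below is illustrative, not from the article’s network:

```python
import numpy as np

h1, w5, y = 0.6, 0.4, 1.0       # hidden activation, weight, target (made up)

z = h1 * w5                      # linear input to the output node
a = 1.0 / (1.0 + np.exp(-z))     # sigmoid activation

dE_da = a - y                    # dE/da for E = 0.5 * (a - y)**2
da_dz = a * (1 - a)              # derivative of the sigmoid
dz_dw5 = h1                      # dz/dw5

dE_dw5 = dE_da * da_dz * dz_dw5  # chain rule: multiply the pieces

lr = 0.1
w5_new = w5 - lr * dE_dw5        # the update rule from above
```

Because a < y here, the gradient is negative and the update nudges w5 upward, exactly what is needed to push the output towards the target.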
Watch the 3B1B video for more understanding of how backpropagation calculus works.
After reading the data from the repo, let’s divide it into training and test sets.
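That split can be sketched as a helper (assuming the data comes as one image per row with the label in the first column, as in the Kaggle MNIST CSV; the function name and the held-out size are my assumptions):

```python
import numpy as np

def split(data, n_test=1000):
    # data: one image per row, label in the first column
    data = data.copy()
    np.random.shuffle(data)              # shuffle before splitting
    test, train = data[:n_test], data[n_test:]
    # transpose so each column is one example (features x examples)
    return train[:, 0], train[:, 1:].T, test[:, 0], test[:, 1:].T
```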
One very important step is to normalise the train and test sets, without which training will be either slow or just not effective at all.
X_train = X_train/255.
X_test = X_test/255.
According to the equation set 1 above, let’s initialise the params and define a function for the forward pass
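A sketch of that initialisation and forward pass (the 784 inputs and 10 output classes come from the dataset; the hidden size of 10 and the 0.01 weight scale are my assumptions):

```python
import numpy as np

def init_params(n_in=784, n_hidden=10, n_out=10):
    # small random weights, zero biases
    W1 = np.random.randn(n_hidden, n_in) * 0.01
    b1 = np.zeros((n_hidden, 1))
    W2 = np.random.randn(n_out, n_hidden) * 0.01
    b2 = np.zeros((n_out, 1))
    return W1, b1, W2, b2

def forward_pass(X, W1, b1, W2, b2):
    Z1 = W1 @ X + b1                                # linear output, layer 1
    A1 = np.maximum(0, Z1)                          # ReLU non-linearity
    Z2 = W2 @ A1 + b2                               # linear output, layer 2
    E = np.exp(Z2 - Z2.max(axis=0, keepdims=True))  # stable softmax
    A2 = E / E.sum(axis=0, keepdims=True)           # class probabilities
    return Z1, A1, Z2, A2
```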
By this time one iteration of the forward pass has been defined, let’s define the methods for backward propagation and update the parameters according to the equation sets 2 and 3.
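These methods can be sketched as follows. With a softmax output and cross-entropy loss, the gradient at the output layer reduces to A2 − Y (Y being the one-hot labels); the hidden-layer gradient uses the ReLU derivative, which is 1 where Z1 > 0 and 0 elsewhere. Function names are mine:

```python
import numpy as np

def backward_pass(X, Y, Z1, A1, A2, W2):
    m = X.shape[1]                       # number of examples
    dZ2 = A2 - Y                         # softmax + cross-entropy gradient
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)        # ReLU derivative: 1 where Z1 > 0
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

def update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, lr):
    # new weight = old weight - learning rate * gradient
    return W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2, b2 - lr * db2
```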
Let’s define a gradient descent method that will call the forward pass and the backward pass methods in it and eventually update the params.
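A self-contained sketch of such a driver, with the forward and backward passes inlined (hidden size, learning rate, and iteration count are illustrative assumptions; X is features x examples, Y is one-hot):

```python
import numpy as np

def gradient_descent(X, Y, iterations=500, lr=0.1):
    n_in, m = X.shape
    n_out = Y.shape[0]
    n_hidden = 10                                    # assumed hidden size
    W1 = np.random.randn(n_hidden, n_in) * 0.01
    b1 = np.zeros((n_hidden, 1))
    W2 = np.random.randn(n_out, n_hidden) * 0.01
    b2 = np.zeros((n_out, 1))
    for _ in range(iterations):
        # forward pass: linear -> ReLU -> linear -> softmax
        Z1 = W1 @ X + b1
        A1 = np.maximum(0, Z1)
        Z2 = W2 @ A1 + b2
        E = np.exp(Z2 - Z2.max(axis=0, keepdims=True))
        A2 = E / E.sum(axis=0, keepdims=True)
        # backward pass: gradients of a cross-entropy loss
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / m
        db2 = dZ2.sum(axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * (Z1 > 0)
        dW1 = dZ1 @ X.T / m
        db1 = dZ1.sum(axis=1, keepdims=True) / m
        # update params
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1, W2, b2
```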
The first run reaches an accuracy of 88%.
And using the trained learner on the test set shows a few misclassifications but mostly correct ones.
With the help of PyTorch, Keras, or TensorFlow, an accuracy of 95% can be reached with some tuning while here with basic mathematical know-how, we touched 88%. It’s not that bad!
More importantly, we were able to work out the forward-propagation and backpropagation mathematics, codify them correctly, and apply them to a real-world example dataset.
The codebase is present on my GitHub repo.
If you know of some easier way for backpropagation(or a more straightforward interpretation), then please drop a line or connect.