## Preserving information is another way of remembering

*Inside a packed conference room, the editor paused when he realized that one of the journalists was not taking any notes. Surprised, he asked why, and even more surprised, he listened as the journalist recited every word as if it had been written down somewhere. Shortly after, **Solomon Shereshevsky**, the journalist, visited a memory specialist and asked him why he should have taken notes instead of memorizing everything he heard. In fact, he wanted to know why anyone should take notes instead of relying solely on their memories. **Alexander Luria**, the psychologist, talked to Solomon for almost 30 years. During this time, Alexander witnessed how Solomon could easily use his memory to recite entire poems, sequences of numbers and words, complex mathematical formulas, and entire book passages written in foreign languages. Alexander documented all his experiences and conversations with Solomon, which inspired many other research projects, **essays**, **films**, and even a **short story** by the great **Jorge Luis Borges**. It also inspired this article about mnemonists, recurrent neural networks, and the inevitable tendency to explain the intuition behind machine learning algorithms.*

Mnemonists are individuals with a strange ability to remember extensive pieces of information like lists of numbers or passages in a book. Some, such as Solomon Shereshevsky, are born that way, while others rely on rules or techniques, called mnemonics, to help them remember long lists of data. Whether natural or trained, mnemonists can remember sets of data so big that they would look impossible to memorize to any normal person. In the last USA Memory Championship, the final competition consisted of memorizing the order of 104 poker cards in just 5 minutes! How do they do it? Although explaining what is happening inside their brains is definitely challenging and not the purpose of this article, creating an analogy between mnemonists and recurrent neural networks is possible. Let’s start with artificial neural networks; what are they, and how do they work?

An Artificial Neural Network (ANN) is a group of connected units (neurons) that receive and transmit signals. The way in which an ANN works resembles the process in which the neurons in our brain communicate. It all starts when a set of input data goes through a network of units distributed in different layers. The input data is then transformed according to the activation function of each unit and the weights and biases that connect all the units. The output of the final layer is compared against the real observed data. The differences between the output and the real data are used to adjust the weights and biases that connect all the units. This adjustment is repeated multiple times until the ANN is “trained”. So, the learning process in an ANN consists of multiple iterations in which the weights and biases of the network are changed until the ANN is able to reproduce the observed data with a fair degree of accuracy. The learning process that occurs in the brain and the one that is coded into an ANN are definitely different. This difference has been the main topic of discussion in many other research projects and articles. In spite of their differences, ANNs are the basis of Deep Learning and the key behind multiple artificial intelligence apps and processes used every day.

The way in which an ANN works has been covered extensively in other sources, and it would be a good idea to have a solid understanding of ANNs before continuing. Here is a very brief explanation of how an ANN works: Figure 1 shows a diagram of an ANN with an input layer, one hidden (or intermediate) layer, and an output layer. Note how each unit is connected through a network of weights and biases. The two input values go into the single unit in the hidden layer after being multiplied by the corresponding weights and added to the bias. Once inside the unit, the activation function transforms this value according to a pre-established function. The output of this unit, **a1**, is now the input for the unit in the output layer. The weights and biases of the output layer are applied in the same way, followed by the activation function. So, after **a1** is multiplied by the weight and added to the bias, it goes into the last unit, where it is converted into **a2** after applying the activation function.
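The forward pass through the small network in Figure 1 can be sketched in a few lines of Python. This is only an illustration: the article does not say which activation function Figure 1 uses, so tanh is assumed for both units, and the function name, weight names, and numeric values are made up for the example.

```python
import math

def ann_forward(x1, x2, w1, w2, b1, w3, b2):
    """Forward pass for a network with 2 inputs, 1 hidden unit and 1 output unit."""
    z1 = w1 * x1 + w2 * x2 + b1  # weighted sum entering the hidden unit
    a1 = math.tanh(z1)           # hidden activation (tanh is an assumption)
    z2 = w3 * a1 + b2            # weighted sum entering the output unit
    a2 = math.tanh(z2)           # output activation (tanh is an assumption)
    return a1, a2

# Illustrative values only
a1, a2 = ann_forward(x1=0.5, x2=-1.0, w1=0.8, w2=0.2, b1=0.1, w3=1.5, b2=-0.3)
print(a1, a2)
```

Both inputs enter the hidden unit at the same time, so swapping the order in which they are presented has no meaning for this network.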

The process explained previously is known as forward propagation. Backward propagation, on the other hand, is the process in which the weights and biases are modified according to the differences between the output values and the observed values. These differences are propagated across all the layers and connections in the ANN. The cycle of forward and backward propagation is repeated multiple times until the difference between the output and the observed values is minimized. The outcome is a set of weights and biases that transforms the input data into the observed values or a good approximation of them.

Up to this point, the described neural network can recognize patterns and predict an output according to a set of input values. Nothing has been said so far about the order in which the input data is presented. What about a sequence? Could an ANN like the one described predict the next number in a long list of numbers? What if the order of the input data does matter? In a Recurrent Neural Network (RNN), the order in which the data is presented affects the network parameters. Like mnemonists, RNNs are trained to reproduce long sequences of data. RNNs are then used to predict what the next value will be. There are more advanced types of RNNs than the one explained here, such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRU). Some applications of RNNs are time series prediction, language modeling, speech recognition, and machine translation. These applications are similar in the way in which the input data is presented: the RNN takes a sequence of numbers or words and predicts the most probable next one. Unlike conventional ANNs, RNNs have a kind of memory that uses the previous values to predict the next one. How does this “memory” work?

The usual diagrams of RNNs are similar to Figure 2. The rolled version of the RNN shows a network that contains an input layer with 4 units, a recurrent hidden layer, and an output layer. This recurrent hidden layer interacts with each of the units in the input layer and carries information from each of these interactions into the next one. This is shown in the unrolled version of the RNN.

The previous diagram is used in many sources and articles related to RNNs. An alternative visualization is presented in Figure 3. Contrary to what is done in an ANN, the units in the input layer are not read simultaneously. Instead, each of them goes into the unit in the hidden layer sequentially. This means that X1 is connected to the hidden layer using the weight (w1) and bias (b1) and then transformed according to the activation function of the unit. After this, X2 follows the same process, and then X3 and X4. The key to an RNN is how the information is transferred from one step to the next, which represents the “memory” component of this network. The output of each of the interactions with the hidden layer (a1, a2, a3) is also included in the calculations. Note how each of the input values is multiplied by the weight and added to the bias, and then the activation from the previous step, multiplied by a “recurrent weight” (wr), is added to the result. So, a2 depends on X2, w1, b1, wr, and a1; a3 depends on X3, w1, b1, wr, and a2; and a4 depends on X4, w1, b1, wr, and a3. The memory in an RNN consists of keeping track of the activation value from the previous step.
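The recurrence described above can be written as a short loop. The sketch below assumes a tanh activation and a starting activation of zero before the first input is read; the function name and the numeric values are illustrative only.

```python
import math

def hidden_activations(xs, w1, b1, wr):
    """Return the hidden activation after each input value is read, in order."""
    activations = []
    a_prev = 0.0  # assumption: there is no previous activation before the first input
    for x in xs:
        # current input plus the previous activation scaled by the recurrent weight
        a_prev = math.tanh(w1 * x + b1 + wr * a_prev)
        activations.append(a_prev)
    return activations

in_order = hidden_activations([0.1, 0.2, 0.3, 0.4], w1=0.5, b1=0.0, wr=0.9)
reversed_order = hidden_activations([0.4, 0.3, 0.2, 0.1], w1=0.5, b1=0.0, wr=0.9)
print(in_order[-1], reversed_order[-1])
```

The two calls use the same four numbers, yet the final activations differ because each step carries the previous activation forward. That dependence on order is what the article refers to as memory.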

Another important characteristic of an RNN is that its weights and biases are shared across all the steps of the sequence. This means that every value in the input sequence is multiplied by the same weight and added to the same bias, whereas in an ANN, each connection between units across different layers has a different weight. This detail is important when training an RNN: fewer weights and fewer biases mean that there are fewer parameters to tune, and generally, this leads to a faster training process.

As is the case in other types of networks, an RNN can include not one but multiple units in the hidden layer. Having more units will increase the number of weights and biases associated with the hidden layer. However, an increase in the number of units can also improve the performance of the RNN. Figure 4 shows a diagram of an RNN that contains two units in the hidden layer and two units in the input layer. It is important to note how the hidden layer now contains 6 weights and 2 biases. There are 2 weights related to the connections between the input layer and the hidden layer and 4 “recurrent” weights that are associated with the connections among the units in the hidden layer. As explained previously, these 4 weights carry the information that the hidden layer units output in their interactions with the input layer.
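The parameter count for the hidden layer follows a simple pattern, sketched below. The helper name is hypothetical, and it assumes that every hidden unit is connected to every other hidden unit (including itself) through a recurrent weight and that each step of the sequence carries a single value.

```python
def hidden_layer_parameter_count(n_hidden, n_features=1):
    """Count the weights and biases of a fully connected recurrent hidden layer."""
    input_weights = n_features * n_hidden   # one weight per feature per hidden unit
    recurrent_weights = n_hidden * n_hidden  # every hidden unit feeds every hidden unit
    biases = n_hidden                        # one bias per hidden unit
    return input_weights, recurrent_weights, biases

print(hidden_layer_parameter_count(2))
```

For two hidden units this gives 2 input weights, 4 recurrent weights, and 2 biases, matching the numbers quoted for Figure 4. The recurrent weights grow with the square of the number of hidden units.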

One of the multiple applications of neural networks is to build a regression model with the purpose of using it later to predict new values. This can also be done through linear and non-linear regressions. In a dataset such as the one shown in Figure 5, an ANN can model the behavior between **x** and **y** such that, after the model is trained, new values of **y** (target) can be calculated from new values of **x** (input). The result of training the ANN is a set of weights and biases that are able to reproduce the behavior of the input and target variables. The more non-linear this relation is, the more units and layers are needed in the ANN. So, in this case, an ANN will use both the input and target variables. An RNN, on the other hand, models the behavior of the target variable as a sequential list of numbers.

In an RNN, the target variables are grouped according to the sequence length. The example shown in Figure 5 uses a sequence length of 3. This means that three values are used as input, and the following fourth value is the target. The area highlighted with the dotted square contains 7 data points. For a sequence length of 3, there are 4 sets of inputs and targets that can be defined. In each of these sets, the input consists of 3 values, and the target contains a single value. So, the sequence length defines how many inputs and targets can be used to train and test the RNN. For instance, in a dataset of 1000 points and a sequence length of 5, the number of inputs/target pairs will be 995.
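The grouping by sequence length amounts to sliding a window over the data. The following sketch uses a hypothetical helper name and made-up values; it is not the code from the notebook.

```python
def make_sequences(values, seq_len):
    """Split a list into (input window, next value) pairs."""
    inputs, targets = [], []
    for start in range(len(values) - seq_len):
        inputs.append(values[start:start + seq_len])  # seq_len consecutive values
        targets.append(values[start + seq_len])       # the value that follows them
    return inputs, targets

inputs, targets = make_sequences([10, 20, 30, 40, 50, 60, 70], seq_len=3)
for window, target in zip(inputs, targets):
    print(window, "->", target)
```

With 7 points and a sequence length of 3, the loop produces 4 pairs, and with 1000 points and a sequence length of 5 it produces 995. In general the number of pairs is the number of points minus the sequence length.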

Figure 5 shows that the data preparation process in an RNN is different from the one applied to an ANN. In an RNN, data needs to be separated according to the sequence length before entering the network. Once the data is ready, the RNN will process each one of the sequences separately. For each sequence, the RNN will compare the output to the target value. Once all the sequences have been processed and their outputs calculated and compared to the targets, the RNN will have a measure of how far its outputs are from the expected values. This difference is then used during the backward propagation process to determine new weights and biases.

The first step in any network is to propagate the input across the layers using weights and biases. Figure 6 contains a graphical example of how this process works in an RNN. This example shows an RNN with a single-unit hidden layer and an input that consists of 3 values. The activation function for this layer is the hyperbolic tangent (tanh). After the hidden layer, there is an output layer with a single unit and a linear activation function. All the equations and numbers written in red represent the calculations that are performed at each of the units. Note how the first step of the hidden layer does not use wr, because there is no previous activation yet. However, wr is always present in the remaining steps. wr represents the memory component of this network since it carries the previous activation into the current calculation.

As with other types of networks, the next step after the forward propagation is comparing the output with the target value. This is done using an error function such as the mean square error (MSE). This function is calculated for each of the output-target pairs, and then a loss function calculates a single loss value for the iteration. The loss indicates how close the output is to the expected value. The loss value is also used at the start of the backward propagation process.
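As a small illustration, the mean square error over a set of output-target pairs can be computed as follows. The function name and the numbers are only an example.

```python
def mse_loss(outputs, targets):
    """Mean of the squared differences between outputs and targets."""
    squared_errors = [(o - t) ** 2 for o, t in zip(outputs, targets)]
    return sum(squared_errors) / len(squared_errors)

# Three illustrative output-target pairs
loss = mse_loss(outputs=[0.5, 0.0, -0.5], targets=[1.0, 0.0, -1.0])
print(loss)
```

A loss of zero would mean that every output matches its target exactly, and larger values mean the outputs are further away.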

The backward propagation or backprop is perhaps the most convoluted part of an artificial neural network. The main idea behind this process is to distribute the difference between the outputs and the targets among the network of weights and biases. One of the ways of distributing or backward propagating the loss is using gradient descent. This might not be the best solution for all cases. However, it is one of the most common backprop algorithms. It is important to note at this point that backprop is not the only training algorithm for neural nets. The problem of finding the best combination of weights and biases is essentially an optimization problem that can be solved using evolutionary algorithms or any other type of derivative-free optimization algorithms. Explaining the backpropagation process with gradient descent is a good way of understanding what is going on but it should not be taken as the only solution.

The backpropagation process in an RNN is similar to the one performed in an ANN. However, in an RNN, the backprop process takes into account that there are multiple forward propagation steps at each iteration. So the backprop is done for each sequence going backward in time, and that is why it is called backpropagation through time (BPTT). The details of the process are better explained in this Python notebook and in other sources. The main result of the BPTT is a gradient for each of the weights and biases. This gradient, scaled by the learning rate, is then subtracted from the original value of the weight/bias to generate a new weight/bias that will be used in the next iteration. At the end of the process, the weights and biases should be adjusted in such a way that the output of the RNN resembles the target values.
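To make the idea more concrete, here is a minimal sketch of BPTT for a network with one tanh hidden unit and one linear output unit, trained on a single sequence with a squared-error loss. It is not the code from the notebook: the function names, the starting parameter values, and the learning rate are all assumptions chosen for illustration.

```python
import math

def forward(xs, p):
    """Return the hidden activations (with a leading 0.0) and the linear output."""
    acts = [0.0]  # assumption: activation before the first input is zero
    for x in xs:
        acts.append(math.tanh(p["wx"] * x + p["bx"] + p["wr"] * acts[-1]))
    return acts, p["wy"] * acts[-1] + p["by"]

def bptt(xs, target, p):
    """Gradients of the squared error for one sequence, walking backward in time."""
    acts, y = forward(xs, p)
    d_y = 2.0 * (y - target)  # derivative of (y - target)^2 with respect to y
    grads = {"wx": 0.0, "bx": 0.0, "wr": 0.0, "wy": d_y * acts[-1], "by": d_y}
    d_a = d_y * p["wy"]       # error reaching the last hidden activation
    for t in range(len(xs), 0, -1):
        d_z = d_a * (1.0 - acts[t] ** 2)  # derivative of tanh
        grads["wx"] += d_z * xs[t - 1]
        grads["bx"] += d_z
        grads["wr"] += d_z * acts[t - 1]
        d_a = d_z * p["wr"]               # pass the error to the previous step
    return grads, (y - target) ** 2

def gradient_step(p, grads, learning_rate=0.1):
    """Subtract the scaled gradient from every parameter."""
    return {name: p[name] - learning_rate * grads[name] for name in p}

params = {"wx": 0.5, "bx": 0.0, "wr": 0.3, "wy": 0.8, "by": 0.0}
xs, target = [0.1, 0.2, 0.3], 0.4
for _ in range(200):
    grads, loss = bptt(xs, target, params)
    params = gradient_step(params, grads)
print(loss)
```

Because wx, bx, and wr are shared by every step, their gradients are accumulated over the whole sequence before the update is applied.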

Training an RNN is not an easy task. Although the gradients are easy to calculate using the BPTT algorithm, the derivative of the loss function at one time step can be very large with respect to the hidden activations at much earlier times. The loss function is therefore very sensitive to small changes in those activations, and it becomes effectively discontinuous (Sutskever, 2013). In addition to this, RNNs also suffer from vanishing and exploding gradients. Exploding gradients can be controlled by clipping the gradients to a maximum value.
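One simple form of gradient clipping limits each gradient to a fixed range before the update is applied. The sketch below is only one of several possible clipping schemes; the function name, the threshold, and the gradient values are made up for the example.

```python
def clip_gradients(grads, max_value):
    """Clamp every gradient into the range [-max_value, max_value]."""
    return {name: max(-max_value, min(max_value, g)) for name, g in grads.items()}

# Illustrative gradients, two of them unusually large
clipped = clip_gradients({"wx": 12.5, "wr": -300.0, "bx": 0.02}, max_value=5.0)
print(clipped)
```

Large gradients are cut back to the threshold while small ones pass through unchanged, so a single extreme step cannot throw the parameters far away from their current values.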

This section contains some examples of the application of RNNs in regression problems. As mentioned before, this is not the only application of RNNs, but it is a good starting point to understand the process before embarking on other, more challenging problems. **All the examples presented here are fully explained in this Python notebook.** In addition to the complete explanations, this Python notebook contains an RNN built step-by-step. This RNN contains a single hidden layer (tanh) and an output layer (identity) with one unit. The number of units in the hidden layer can be changed in the code. Figure 7 shows a diagram of the RNN. An RNN such as this one is usually called a vanilla RNN, and it will be used in the following examples.

This is a very simple first example. The idea is to find an RNN capable of reproducing the behavior of the sin(x) function. The main input is a table with values of x and y according to sin(x). Figure 8 shows the plot of this function. The following sections present different approaches to model this function: vanilla RNN, Keras RNN, and Multi-layer Perceptron regressor (sklearn).

Before implementing this RNN, there are a few things that should be done as preparation. The first one is to convert the data to a suitable format so that the RNN algorithm can read it. The data presented in Figure 8 can be extracted as a table with two columns: one for the independent variable and one for the dependent variable. The input for the RNN consists of the dependent variable column only. The sequence length defines how many partitions are used to divide the column. Figure 9 shows that in a dataset with 7 values, a sequence length of 3 generates 4 samples. These are the samples that will be entered into the RNN.

The first subfunction included in the Vanilla RNN Python code is called “PrepareData”. This subfunction takes a .csv file with a table just like the one on the left side of Figure 9 and converts it into multiple arrays for training and testing. Holding part of the data back for testing is a common practice in machine learning that is used to determine how effective an algorithm is without generating new data. A final important point about this step is the need to scale the data before feeding it into the RNN. In this example, all data is scaled to a minimum of -1 and a maximum of 1. However, other types of scaling processes can be used as well.
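Scaling to the range from -1 to 1 is a linear transformation that can be reversed once the network has produced its outputs. The following sketch is not the “PrepareData” code; the helper names and the sample values are hypothetical, and it assumes the data contains at least two different values.

```python
def scale_to_unit_range(values):
    """Linearly map values so the minimum becomes -1 and the maximum becomes 1."""
    lo, hi = min(values), max(values)  # assumption: hi > lo
    scaled = [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]
    return scaled, lo, hi

def unscale(scaled, lo, hi):
    """Map scaled values back to their original range."""
    return [(s + 1.0) * (hi - lo) / 2.0 + lo for s in scaled]

scaled, lo, hi = scale_to_unit_range([3.0, 5.0, 9.0])
restored = unscale(scaled, lo, hi)
print(scaled, restored)
```

Keeping the original minimum and maximum is what makes it possible to convert the outputs of the network back into the units of the input data.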

Before the scaled data enters the RNN, the parameters need to be initialized. The initialization process assigns random values to the weights and zeros to the biases. To do this, it is important to know the RNN architecture since that will define how many values are needed in the initialization. This example involves 5 parameters: wx, bx, wr, wy and by. The first three are associated with the hidden layer, and the last two are associated with the output layer. After initializing the parameters, the input data goes into the network and triggers the loop of forward propagation, loss calculation, and backward propagation.
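A possible initialization for these 5 parameters is sketched below. The article only states that weights are random and biases are zero, so the uniform distribution, its range, and the function name are assumptions made for the example.

```python
import random

def initialize_parameters(seed=None, scale=0.1):
    """Random weights and zero biases for a single-unit vanilla RNN."""
    rng = random.Random(seed)  # a fixed seed makes the run repeatable
    return {
        "wx": rng.uniform(-scale, scale),  # input-to-hidden weight
        "wr": rng.uniform(-scale, scale),  # recurrent weight
        "wy": rng.uniform(-scale, scale),  # hidden-to-output weight
        "bx": 0.0,                         # hidden bias
        "by": 0.0,                         # output bias
    }

params = initialize_parameters(seed=0)
print(params)
```

Fixing the seed is useful when comparing runs, because different random starting weights can lead the training loop to different results.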

Figure 10 shows the result of training the Vanilla RNN using the sin(x) function. The Vanilla RNN does a fairly good job of approximating the target points. It is important to note that the results shown in Figure 10 were re-scaled to match the input data. This means that after calculating the output, the values were converted from the [-1,1] scaling into their original form. This first example of the Vanilla RNN uses a sequence length of 2 and a single unit in the hidden layer. With this configuration, the RNN contains 2 weights and 1 bias for the hidden layer and one weight and one bias for the output layer. That is 5 parameters in total.

Figure 11 shows the behavior of the loss function during the training process of the RNN. The loss function starts at a high value and is progressively reduced until it reaches a minimum. At this point, it is important to mention that the way in which the Vanilla RNN was defined makes the results highly sensitive to the initial values of the parameters. Other implementations of RNNs (such as the one from Keras) deal with this problem. In this Vanilla RNN, different initial combinations of parameters can lead to different results.

Figure 12 shows the results of training an RNN using the Keras library. These results are comparable to the ones presented in Figure 10 since the RNN contains one unit in the hidden layer with a tanh activation function and a sequence length of 2. The way in which the RNNs defined in Keras work makes them less dependent on the initial set of parameters which is a problem that is not addressed in the Vanilla RNN.

The sin(x) function can also be modeled with a conventional ANN. However, it will need more parameters to be tuned. The ANN shown in Figure 13 uses three hidden layers with 25, 15, and 5 units each. It is important to remember that this ANN takes as input data the x values that are presented on the horizontal axis. This ANN does not consider the order of the target values since it works directly with the relation between input (x) and output (y).

The examples presented up to this point all refer to sin(x), which is a very simple function. However, RNNs can also be used to represent more complex behaviors. The next section contains data related to the monthly production rate of an oil well. This example is a good way of testing an RNN since the data does not follow any obvious pattern or cycle, and it cannot easily be modeled by a simple function.

For this application of the Vanilla RNN, the sequence length is 15, and the number of units in the hidden layer is 1 (Figure 14). Note how, although there is a clear declining trend in the data, the individual points look largely uncorrelated. However, the RNN is able to match some of the points. The worst match is towards the final period, where the results from the RNN are considerably far from the target.

Figure 15 shows the results of training the Vanilla RNN to reproduce the oil rate data, but this time using a sequence length of 5 and 2 units in the hidden layer. The results seem slightly better than before. As is the case with other types of neural networks, tuning the hyperparameters makes a big difference. In the Jupyter notebook that is associated with this article, it is possible to try different combinations of units in the hidden layer, sequence length, learning rate, and the number of epochs.

A similar scenario to the one just presented was run using Keras (Figure 16). It has 2 units in the hidden layer and a sequence length of 5. Results are similar to the Vanilla RNN, even in the final stage.

Figure 17 shows the results of training a conventional ANN to reproduce the oil rate dataset. This example is using three hidden layers with sizes of 100, 50, and 25 units each. The general behavior of the oil rate is well represented by the ANN. However, as happened in the previous example, the small details in the data are not recognized by the ANN. This performance is closer to the results that could be obtained with a non-linear regression than with an RNN.

The main purpose of this article was to give a clear explanation of how RNNs work and how their forward and backward propagation processes are slightly different from those of a conventional ANN. The main characteristic of an RNN is related to its recurrent units. These units keep track of the previous activations, which gives RNNs a special type of “memory”. This feature allows RNNs to reproduce a more detailed behavior than conventional ANNs. The examples presented show how an ANN does a good job of reproducing a general trend. However, if the task is to reproduce point-by-point values, then it is better to switch to an RNN. This explains why RNNs are used in language modeling, speech recognition, and other similar applications. So, although RNNs’ behavior is not as impressive as Solomon Shereshevsky’s skills, they can still be used to solve many problems faced on a daily basis. After all, few people need someone who is able to recall conversations that happened 12 years ago. It is more useful to have an algorithm that can “guess” what the next word will be in a text message written in a hurry.