In data science and machine learning, we often represent data as vectors and matrices. In mathematics and physics, vectors are defined as quantities that capture a magnitude and a direction (e.g., a displacement vector). However, the data we work with often don’t follow this definition of a vector, yet we still represent them with vectors. For example, we may represent a person’s demographic information (e.g., race, age, gender) as a vector, even though there is no purely geometric interpretation of its magnitude or direction.
Similarly, in mathematics, matrices represent linear mappings: mappings between two vector spaces that preserve vector addition and scalar multiplication. Yet the way matrices are used in data science / ML often differs from this formal mathematical definition.
Given this difference, why are vectors and matrices so widely used to represent data? In this article, we’ll explore several reasons.
When working with data, we often want to manipulate them and / or feed them into machine learning models. This process involves a lot of computation and often requires adding and multiplying many numbers. For example, in building a recommendation system for movies, you might collect data on how long users viewed each movie in the library. Then you can recommend movies that have a higher average watch time (as that might translate into better engagement). This average is calculated for each movie by adding up the watch time across all users and dividing by the number of users. Doing this element by element can be slow, especially when the number of users and movies gets large (as is the case for Netflix, which has well over 100 million subscribers and thousands of titles).
However, computer scientists have developed extremely efficient algorithms for linear algebra. Addition and multiplication implemented as vector and matrix operations are typically much faster than looping over the elements one at a time. In Python, the NumPy library for scientific computing and linear algebra provides these optimized operations. Revisiting our recommendation system problem, we can think of each user as being associated with a vector of watch times of dimension n, where n is the number of movies. Stacking these vectors as columns gives a data matrix with n rows and m columns, where n is the number of movies and m is the number of users. To find movies to recommend, we can average along each row to get the average watch time for each movie across all users, then sort the movies from highest to lowest average watch time. Implementing this problem with vectors and matrices allows for faster computation because of these highly optimized algorithms.
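As a minimal sketch of the row-averaging and sorting just described, assuming a small made-up watch-time matrix (the function name `rank_movies` and the sample values are illustrative, not taken from any real system):

```python
import numpy as np

def rank_movies(watch_times):
    """Return movie indices ordered from highest to lowest mean watch time.

    watch_times: 2D array with one row per movie and one column per user.
    """
    # Average along each row: one mean watch time per movie.
    mean_per_movie = watch_times.mean(axis=1)
    # argsort is ascending, so reverse it to put the highest mean first.
    return np.argsort(mean_per_movie)[::-1]

# Hypothetical data: 3 movies (rows) watched by 2 users (columns), in minutes.
watch_times = np.array([
    [1.0, 3.0],
    [10.0, 20.0],
    [5.0, 5.0],
])
ranking = rank_movies(watch_times)
print(ranking)
```

Here the whole computation is two array operations, with no explicit loop over users or movies.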
To demonstrate, here’s a small script comparing the time it takes to calculate the mean along the rows using regular Python and the NumPy library (which has optimization for matrices and vectors). To assess computational efficiency, we’ll measure the time it takes for the program to run for a dataset of 500 movies and 200 users.
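A comparison along these lines might be sketched as follows. This is an illustrative version that assumes randomly generated watch times; the function names are hypothetical, and the timings it prints will vary by machine and will not match any specific reported figures.

```python
import time
import numpy as np

def mean_per_movie_python(data):
    """Row means using plain Python loops over a list of lists."""
    means = []
    for row in data:
        total = 0.0
        for value in row:
            total += value
        means.append(total / len(row))
    return means

def mean_per_movie_numpy(data):
    """Row means using a single vectorized NumPy call."""
    return data.mean(axis=1)

# Assumed setup: 500 movies (rows) and 200 users (columns), random minutes.
n_movies, n_users = 500, 200
rng = np.random.default_rng(seed=0)
watch_times = rng.uniform(0, 120, size=(n_movies, n_users))
watch_times_list = watch_times.tolist()  # plain nested lists for the loop version

start = time.perf_counter()
python_means = mean_per_movie_python(watch_times_list)
python_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
numpy_means = mean_per_movie_numpy(watch_times)
numpy_ms = (time.perf_counter() - start) * 1000

print(f"Python: {python_ms:.3f} ms, NumPy: {numpy_ms:.3f} ms")
```

Both functions compute the same means; only the way the arithmetic is carried out differs.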
Averaged over ten runs of the code above (on Google Colab), regular Python took 9.088 milliseconds whereas NumPy took 0.427 milliseconds. In this instance, the NumPy implementation is roughly 21 times faster than regular Python.
Taking it one step further, below we plot the time the Python and NumPy implementations take to calculate the mean while varying the number of users from 1 to 1000 and keeping the number of movies at 500.
As we expand the amount of data, the gap between regular Python and NumPy gets bigger. We can also visualize this by plotting the ratio between the two implementations.
This ratio keeps increasing as our data grows, demonstrating the efficiency gain of using NumPy. For extremely large data sources or complex models, this efficiency gain becomes even bigger and therefore more valuable. Consider the realm of big data, which is becoming more prevalent, where there are billions to trillions of data points. Also consider large / deep neural network models, which can be composed of millions of nodes / parameters or more, each involving multiplications and additions with weights and biases (for example, the GPT-3 language model has 175 billion parameters).
Another advantage of using vectors / matrices to represent data is that we can leverage the tools of linear algebra and mathematics. A great example is in computer vision, where matrices are used to describe image transformations (e.g., translations, rotations, reflections, affine transformations, and projective transformations).
Focusing on image rotation, the goal is to find a function that maps each pixel’s position to its new position after the image is rotated by some angle. Linear algebra provides rotation matrices, which rotate vectors by a given angle. By representing images as matrices and pixel positions as coordinate vectors, we can leverage rotation matrices. Similarly, there are matrices for translations, reflections, and affine transformations.
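To make the rotation case concrete, here is a minimal sketch that applies the standard 2D rotation matrix to a few coordinate vectors. The helper name `rotate_points` and the sample points are assumptions for illustration. The rotation is counterclockwise about the origin in the usual mathematical convention; in image coordinates, where the y-axis points down, the same matrix appears to rotate clockwise.

```python
import numpy as np

def rotate_points(points, angle_degrees):
    """Rotate an array of (x, y) points counterclockwise about the origin."""
    theta = np.radians(angle_degrees)
    rotation = np.array([
        [np.cos(theta), -np.sin(theta)],
        [np.sin(theta),  np.cos(theta)],
    ])
    # Each row of `points` is one (x, y) pair, so multiply by the transpose.
    return points @ rotation.T

# Hypothetical pixel positions stored as rows of (x, y) coordinates.
corners = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
rotated = rotate_points(corners, 90)
print(np.round(rotated, 6))
```

A full image rotation would also need to resample pixel values onto the output grid, which this sketch leaves out.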
Furthermore, representing images as matrices also makes it easier to apply projective transformations, which map lines in one plane to lines in another plane. This is useful for image stitching and making panoramic photos. Additionally, there are further applications when working with 3D images / graphics.
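A projective transformation of the plane can be written as a 3x3 matrix acting on homogeneous coordinates. The sketch below shows the mechanics under that assumption; the function name `apply_homography` and both example matrices are made up for illustration rather than estimated from real images.

```python
import numpy as np

def apply_homography(H, points):
    """Map (x, y) points through a 3x3 projective transformation H."""
    ones = np.ones((points.shape[0], 1))
    homogeneous = np.hstack([points, ones])   # (x, y) -> (x, y, 1)
    mapped = homogeneous @ H.T                # apply H to every point
    return mapped[:, :2] / mapped[:, 2:3]     # divide by the third coordinate

# Illustrative matrix 1: a pure translation by (2, 3).
H_translate = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 3.0],
    [0.0, 0.0, 1.0],
])

# Illustrative matrix 2: a nonzero bottom-row entry gives a perspective effect.
H_perspective = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 1.0],
])

points = np.array([[0.0, 0.0], [1.0, 1.0]])
print(apply_homography(H_translate, points))
print(apply_homography(H_perspective, np.array([[2.0, 4.0]])))
```

In image stitching, a matrix like this would typically be estimated from matching points between two photos; that estimation step is outside the scope of this sketch.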
The last benefit that we’ll cover is regarding communication. When working with complex data situations, expressing concepts using vectors and matrices can be more convenient, clear, and concise. Instead of giving each data point a name, we can group data into named vectors or matrices. Moreover, we can also express operations on the data using vector / matrix conventions.
For example, consider multiple linear regression with 5 feature variables. That could be expressed as:
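With y as the predicted value, x_1 through x_5 as the features, and β_0 through β_5 as the intercept and coefficients (symbol names assumed here for illustration), the written-out model would look like:

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5
```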
With vectors / matrices, we can convey the same idea as (where the features and coefficients on the features are now vectors):
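One common way to write this, assuming x is the vector of the five features, β is the vector of their coefficients, and β_0 is the intercept (notation chosen here for illustration), is:

```latex
y = \beta_0 + \boldsymbol{\beta}^{\top} \mathbf{x},
\qquad
\mathbf{x} = (x_1, \dots, x_5)^{\top}, \quad
\boldsymbol{\beta} = (\beta_1, \dots, \beta_5)^{\top}
```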
Notice that this representation is much shorter and still captures our linear regression model. Additionally, this representation would still work if we had more variables (it would be written the same way for 10 or 1000 feature variables). Furthermore, vectors and matrices can be used to express many data operations and models (e.g., logistic regression, random forests, neural networks).
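The same point carries over to code. In the sketch below, one prediction function handles 5 or 1000 features without modification. The function name `predict` and the randomly generated inputs are assumptions for illustration, not a fitted model.

```python
import numpy as np

def predict(X, beta, intercept):
    """Linear regression predictions: one dot product per row of X.

    X: 2D array with one row per observation and one column per feature.
    beta: 1D array with one coefficient per feature.
    """
    return X @ beta + intercept

rng = np.random.default_rng(seed=0)
for n_features in (5, 1000):
    X = rng.normal(size=(3, n_features))     # 3 made-up observations
    beta = rng.normal(size=n_features)       # made-up coefficients
    y_hat = predict(X, beta, intercept=1.0)
    print(n_features, y_hat.shape)
```

Only the shapes of `X` and `beta` change with the number of features; the expression `X @ beta + intercept` stays the same.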
Moreover, the language and conventions of vectors / matrices are ubiquitous across many fields (e.g., physics, engineering, computation). That means practitioners are generally familiar with them, which reduces cognitive burden (as they don’t have to learn a new data / model convention).
Many operations with data and models are expressed in terms of vectors / matrices. The reason is that representing data as vectors and matrices enables faster, more efficient computation, access to linear algebra techniques, and better communication.