Jupyter Notebook is now the de facto lingua franca for data science. What happened to MATLAB?
The last time I used MATLAB was in the late 90s. I was finishing my Engineering degree and simulated a sigma-delta analog-to-digital converter, using both MATLAB and Simulink. That was only the last of the many times I used MATLAB at university.
I shelved it until a few years ago, when I rediscovered it while following Andrew Ng’s Machine Learning course on Coursera. Since then I have mostly used Numpy, and I have recently begun re-exploring MATLAB.
While the industry and media seem to have promoted Python, Pandas, Numpy and the companion ecosystem as “the right choice” and MATLAB has somehow faded from view, I wonder whether that is because the Python ecosystem is more useful and productive than MATLAB, or because there is some bias towards open source.
Some areas of Data Science, especially when you are learning the field, are directly related to handling matrices.
Linear Algebra is a key component of ML. Beyond that, there are many situations where, due to the large amounts of data, a matrix approach is recommended. Hence being able to manipulate matrices easily is a key need in Data Science: not every time, and not for everyone, but often enough to matter.
Although I have accumulated a fair number of flight hours with Jupyter Notebooks, Pandas and Numpy, I still find their APIs and syntax non-intuitive. This happens especially when I try to translate and visualise matrices while using Numpy.
The Numpy library is extremely powerful, and I doubt there is anything you can do with MATLAB that you cannot do with Numpy. Numpy can also be extended with C and Fortran under the hood for those tasks that cannot be achieved with Numpy alone. But still, Numpy is a library added to a general-purpose high-level language, and no matter how much time I spend with it I still find it non-intuitive sometimes.
MATLAB is a matrix manipulation language (its name is a contraction of Matrix Laboratory) and its syntax is far more intuitive and closer to what it represents: a language intended for linear algebra by design.
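As a rough illustration of the difference in syntax, here is a small Numpy sketch with the MATLAB equivalent of each line in a comment. The matrix values are arbitrary and chosen only for this example.

```python
import numpy as np

# A small linear system A x = b (values are illustrative only).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])        # MATLAB: A = [2 1; 1 3];
b = np.array([3.0, 5.0])          # MATLAB: b = [3; 5];

x = np.linalg.solve(A, b)         # MATLAB: x = A \ b;
product = A @ A.T                 # MATLAB: A * A'   (matrix product)
elementwise = A * A               # MATLAB: A .* A   (element-wise product)
second_column = A[:, 1]           # MATLAB: A(:, 2)  (MATLAB indexes from 1)

print(x)
print(product)
```

Note that the meaning of `*` is reversed between the two: in MATLAB it is the matrix product, while in Numpy it is element-wise and the matrix product is written `@`.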
To me this is an important point, but it seems to matter less to the market: I have never read anyone making this observation, although I am sure I am not the only one who has noticed it.
Research around Data Science is an iterative process. The workflow is different from the one used for software development. There is always input data, algorithms or mathematical methods and interactive plots and visualisations. Any other requirements are either largely diminished or can be completely ignored in Data Science.
Another significant difference is that the programming approach is procedural. Many data scientists employ object-oriented techniques when writing their Notebooks, but that is, in my opinion, a reflection of their age (younger people have not known the times when everything was procedural) rather than an actual need (why would you want objects when you have a single thread of execution that is fairly linear?).
The above scenario makes Notebooks the perfect tool. Execution is linear and procedural, and you can iterate and make changes on the fly to a specific block of code. The development is not complicated and the whole structure is simple. Notebooks have limitations, but overall they are a simple and useful tool.
This interactive environment dates from the early days of computing. In fact, MATLAB has always been interactive, and even programs such as Macsyma, which was developed in 1968, were already interactive. Mathematica introduced the concept of the Notebook in 1988, so if anyone thinks that the interactive development process is a modern Python thing, it is not.
However, this approach of visual blocks of code is arguably better achieved in Jupyter Notebooks, and it is useful and productive. It is clearly one of the reasons for their success. MATLAB now has a similar environment called Live Editor, but it did not have that feature until 2016. Before that, it offered only programs and an interactive command line.
I am currently exploring this MATLAB version of Notebooks (which as I mentioned is called Live Editor) and I find it a very useful tool.
I think the lack of Notebooks was one reason for the decline of MATLAB and the adoption of Jupyter Notebooks with the Python ecosystem. I can no longer conceive of any research or exploration in the area without them, and I think many people working in the field share this view.
The Pandas dataframe is the other “big” thing of Jupyter Notebooks. I tend to avoid dataframes when possible, because I handle very large amounts of data and Pandas is not as efficient as Numpy.
Some people use them as augmented matrices, but you can only do that with limited amounts of data.
Pandas dataframes are a greatly simplified way to handle heterogeneous data types in tabular form. I specifically use them as an orchestration table to record the results of analyses done in quantitative finance, while the actual market data always lives in a separate data layer of Numpy ndarrays. That gives you speed when dealing with granular data (in Numpy) and the flexibility required to record and track aggregated results (in Pandas). I find this two-level approach convenient and so far I have not found a better arrangement. It might work only for some scenarios, though.
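A minimal sketch of this two-level arrangement could look as follows. The prices are synthetic random walks, and the tickers, column names and computed metrics are hypothetical placeholders, not the actual setup described above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Granular layer (Numpy): one row per instrument, one column per time step.
# Synthetic random-walk prices stand in for real market data.
n_instruments, n_steps = 3, 1_000
returns = rng.normal(loc=0.0, scale=0.01, size=(n_instruments, n_steps))
prices = 100.0 * np.exp(np.cumsum(returns, axis=1))

# The heavy numerical work stays in Numpy.
log_returns = np.diff(np.log(prices), axis=1)
volatility = log_returns.std(axis=1)
max_price = prices.max(axis=1)

# Orchestration layer (Pandas): a small table of aggregated results.
results = pd.DataFrame({
    "instrument": ["AAA", "BBB", "CCC"],   # hypothetical tickers
    "volatility": volatility,
    "max_price": max_price,
    "n_obs": n_steps,
})

print(results)
```

The dataframe only ever holds one row per instrument, so its overhead is negligible, while the large arrays never leave Numpy.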
MATLAB can be considered equivalent to Numpy, and it used to lack an equivalent of the Pandas dataframe. Now it has Tables and Timetables, which can be seen as its counterparts to Pandas dataframes.
While there are many caveats discussed on MATLAB forums about the performance of Tables and Timetables, I think the same applies to Pandas dataframes. Tables and Timetables seem great, and in my experience the functions for timestamp and time zone handling are now much better in MATLAB than anywhere else (I have that specific need, so it is something I pay attention to).
While Pandas can be a must sometimes, more careful usage of plain arrays can also lead to structures equivalent to a Pandas dataframe. It is not uncommon to name columns by defining separate variables that hold the indexes of specific matrix columns. You can do that with both MATLAB matrices and Numpy arrays, and by doing so you keep full performance.
That seems odd and pretty primitive, but in the context of data science, where everything needs to be well structured anyway, it is less of a problem. Numpy arrays also support custom dtype definitions, which so far I have not seen in MATLAB (maybe the feature is there and I do not know it).
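Both ideas can be sketched in a few lines of Numpy. The column names, field names and values below are made up for illustration.

```python
import numpy as np

# Option 1: a plain 2-D array plus variables that name the column indexes.
OPEN, HIGH, LOW, CLOSE = range(4)
bars = np.array([
    [10.0, 11.0,  9.5, 10.5],
    [10.5, 12.0, 10.0, 11.5],
    [11.5, 11.8, 10.8, 11.0],
])
bar_range = bars[:, HIGH] - bars[:, LOW]   # reads almost like a named column

# Option 2: a structured array with a custom dtype, which allows
# heterogeneous fields (text, float, integer) in a single array.
quote_dtype = np.dtype([("symbol", "U8"), ("close", "f8"), ("volume", "i8")])
quotes = np.array(
    [("AAA", 10.5, 1200), ("BBB", 11.5, 800), ("AAA", 11.0, 950)],
    dtype=quote_dtype,
)
mean_close_aaa = quotes["close"][quotes["symbol"] == "AAA"].mean()

print(bar_range)
print(mean_close_aaa)
```

The first option keeps a homogeneous float array, so every vectorised operation works on it directly; the second trades some of that convenience for named, mixed-type fields.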
ML and AI capture a large amount of attention now, so how well and how easily each environment copes with them is very relevant.
The number of state-of-the-art libraries and frameworks that can be incorporated into Jupyter Notebooks is massive. It is likely the environment with the most flexibility and options, and all these underlying libraries and frameworks are written in low-level code and highly optimised, even for GPU and parallel deployment. Still, their integration with Python is a bit artificial, as Python is a general-purpose programming language and not a matrix language. I suspect that as one gains more screen hours this will become less of a problem.
MATLAB has incorporated several Toolboxes to deal with ML and AI. I am still exploring those (maybe material for another article), but MATLAB has lagged behind the rich ecosystem available in Jupyter Notebooks.
Still, implementing bare algorithms and numerical methods in MATLAB is far easier than doing the same thing in Python, where you normally rely on ready-to-use libraries. It is likely this “production ready” ecosystem that has attracted a larger user base. Users will normally be interested in just deploying a neural network, either because they have already passed the stage of understanding the underlying mechanisms or because they are comfortable with a more superficial knowledge of the internals.
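To make “bare algorithm” concrete, here is a sketch of batch gradient descent for linear regression written directly in Numpy, without any ML library. The data is synthetic, and the learning rate and iteration count are arbitrary choices that happen to work for this example. A least-squares solve is included only as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data: y = 2 + 3x plus a little noise (illustrative only).
m = 200
x = rng.uniform(-1.0, 1.0, size=m)
y = 2.0 + 3.0 * x + rng.normal(scale=0.05, size=m)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(m), x])

# Batch gradient descent on the mean squared error cost.
theta = np.zeros(2)
alpha = 0.5                      # learning rate (assumed, not tuned)
for _ in range(500):
    gradient = X.T @ (X @ theta - y) / m
    theta -= alpha * gradient

# Sanity check against the closed-form least-squares solution.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta)
print(theta_lstsq)
```

The update step is a single line of matrix algebra in either environment; the difference the text refers to is mostly in how directly that line maps to the notation on paper.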
Jupyter Notebooks and the entire ecosystem are open source and free, while MATLAB is a commercial product. So here you are comparing a potentially expensive product (MATLAB itself is not expensive, but depending on how many licenses and toolboxes are required its cost can grow significantly) with a free open source ecosystem.
This point largely depends on the goals and scenario and it is worth discussing.
As a learning tool or a personal tool, MATLAB is pretty inexpensive.
At the time of writing, a bundled student version of MATLAB, Simulink and the 10 most common toolboxes costs around $75, with each additional toolbox costing under $10. That is an almost symbolic fee given the complexity and intended audience of the software. If you are enrolled in any degree program, having MATLAB to study and explore is almost free. Universities will likely also provide access to it through their academic licenses.
For personal non-commercial usage the costs are larger. MATLAB itself is not expensive, but the toolboxes push up the final bill; still, you will find affordable prices of around $300–500 depending on the number of toolboxes you add, and more if you need more toolboxes. It is not like the student version, but the costs are still reasonable for personal usage.
The real thing, using MATLAB in industry, may or may not represent a significant cost depending on the specific needs (how many licenses and how many toolboxes). The costs are obviously much larger, but at the same time, if you are using MATLAB commercially it is because you are in a high-value-added industry (otherwise why would you use it?), and the costs will likely still be low when compared with labour (your salary, contract rate or service fees). Here MATLAB itself is pretty cheap (around $2000), but MATLAB without the Toolboxes is likely not that useful.
It is difficult to find people talking about MATLAB: it is no longer a trend, it is not cool, and it costs money (versus free open source). So it might seem that it is dead. Even on the MATLAB website there is an article about Python and MATLAB that presents MATLAB as working best alongside Python (which does not sound bad; in the end Python is just another tool to be used). But it surprises me to find such a statement on the vendor’s website.
But despite what one might think about MATLAB becoming a legacy tool, the MathWorks financial statements tell another story. As of today, MathWorks is a healthy company with around 5000 employees worldwide, a large customer base and revenue that has consistently exceeded $1B in recent years. That can change anytime, but as of today it is a good hint that people are still spending money on MATLAB.
I suspect MATLAB is another mature commercial product, widely used but no longer trendy, that is useful and has its place, but that does not capture much media attention and focuses on institutional and large customers. I personally find it useful, at least at my stage of knowledge, as it makes it easier to handle the type of data I work with and to visualise the relationship between the underlying maths and ML and AI.
That might change once I move to a more mature stage of knowledge where I might value other aspects such as being able to deploy models or have more ready-to-use libraries.
I am sure that, as with any other technology, it will have supporters and detractors, but to me, as of today, it does not seem dead. Otherwise the MathWorks financials would be telling a different story.
To me it is a tool worth exploring. How far and to what extent I will use it, I do not know yet.
Aside from the already mentioned low-cost options to evaluate MATLAB (student license and home license), there is a 30-day trial and, if I understood correctly the information MathWorks provides when you create an account, you are granted 20 hours per month of access to the online cloud version of MATLAB.
I have explored several tutorials on YouTube and also read a couple of books. As introductory or refresher material I strongly recommend the official MATLAB Onramp course. It is interactive and extremely well crafted, both in content and in interactivity. Going through it takes 2 hours, and there are additional Onramp courses on ML and AI which I intend to complete next.
Opinions are my own and do not represent the views of any of my customers or employers (past or present), nor are they related to any of my professional activities (past or present).
I am not in any way involved, directly or indirectly, with any of the companies behind the commercial products mentioned here.
MATLAB and Simulink are registered trademarks of The MathWorks, Inc.
Mathematica is a registered trademark of Wolfram Research, Inc.
The information provided here does not represent any buying advice, product benchmark or detailed comparison. Costs and other information on any commercial product are approximate and are not meant to be accurate or complete. For detailed information on functionality, prices or other specifications, the reader is referred to the vendors of the different products.