As my team and I build Cerebrium, the question I constantly receive from customers, investors, and machine learning enthusiasts alike isn’t simply about design decisions or product differentiation. Rather, it’s a strategic crossroads: will the future of machine learning, currently synonymous with LLMs and generative models, be ruled by a few dominant models, or by a constellation of finely-tuned, diverse ones?
From this debate, an even more complex question emerges: where should we stand on the open-source versus closed-source battle lines? These aren’t just theoretical questions. At Cerebrium we aim to be thought leaders in this space, and the direction we choose will help shape the next revolution of this technology.
In a thought-provoking article titled “The Golden Age of Open Source in AI is coming to an end”, Clemens Mewald, Head of Product at Instabase, makes some very interesting points, namely:
- Companies are releasing open-source models under more restrictive licenses because they worry that others might gain more commercial traction from those models than they do. For example, OpenAI open-sourced GPT-2 but has not done the same for GPT-3 and GPT-4, citing competition and the need to gain commercial traction.
- Researchers prefer to work in places where their work can be published, so that industry recognition lets them command higher salaries (and clout). This was evident with the team from Mistral AI: one of the founders was the team lead for Meta’s LLaMA model, which was one of the main reasons they were able to raise a $113 million seed round.
While I agree with most of the above, I still see a world in which open-source machine learning is a core part of growth and innovation. I have five thoughts on this:
1. Open-source and closed-source won’t be mutually exclusive!
A simple example I like to refer to is the early 1990s, the start of the internet. Netscape, AOL, and Yahoo made the internet popular by making it accessible and useful, and in turn used their commercial traction to fund the research and development behind some of the most impressive technologies and products of today. However, the TCP/IP protocol and the Apache HTTP server were open-source developments without which companies like Netscape and Yahoo might never have existed or had the impact they did.
The majority of large language models today are built on the transformer architecture, which Google first detailed in 2017. Had they not publicly released their paper, “Attention is all you need”, we would most likely not be experiencing the growth in LLMs we see today. I suspect Google will continue to release new research consistently, while leveraging their expertise and team to create commercially beneficial applications and refinements in their own products.
Without the potential for large impact and commercial traction, how can we expect to fund this research and innovation industry-wide? After all, doesn’t monetary gain prove value?
2. Can Open-source and Closed-source models coexist?
Yes, they can. As shown above, they are two sides of the same coin. This has been proven with software today, and there are numerous benefits and shortcomings to both.
Closed-source solutions, like OpenAI’s GPT-4, offer an easily accessible LLM to a broad user base, demonstrating impressive versatility across a range of applications. However, GPT-4 can’t handle 100+ languages with the same performance; it struggles with mathematical and physics use cases; and lastly, it is highly non-deterministic. These shortcomings make it unsuitable for many use cases across a large proportion of industries.
Open-source models allow for flexibility, accessibility, innovation, security, and explainability, all of which are extremely important for companies operating in the enterprise space. It is also important to note that even though these models are open-source, the community of contributors can itself be a moat, and offering a managed service or extra paid features on top of the product can be commercially viable.
Platforms such as GitHub and Docker generate roughly $1 billion and $135 million in annual revenue respectively. They are built around popular open-source projects but are able to offer managed services on top in order to generate revenue.
3. Why many fine-tuned models will outweigh a few dominant models
The generalised, multi-purpose monoliths like GPT-4, while impressive, have their limitations: they might struggle to manage hundreds of unique use cases, especially across the world’s extensive array of languages. These all-encompassing models will strive for mass consumer use cases, yet they will underperform when it comes to offering specialised solutions tailored to specific business needs. Being a multi-purpose model adds complexity in size, cost, data requirements, and latency, although this could become moot as new research is developed and deployed.
At Cerebrium, we have seen many customers use multiple models in their workflows, both off the shelf and fine-tuned, to improve the functionality of their use case. Fine-tuned models are cheaper, faster, and perform better when built for specific use cases. Additionally, business requirements differ based on industry, size, consumer profiles, the data at their disposal, process flows, geographic location, and so on. Foundational models will be like web frameworks (React, Next.js, Vue); however, the way that companies implement them, fine-tune them, and stitch them all together will be very different. This will be the technology differentiator for most businesses.
4. Open-source is the quickest way to meet ML engineer demand
Machine learning is a discipline that requires understanding across mathematics and statistics, chip infrastructure, and software, which makes it difficult and complex to master. It is estimated that there are ~500,000 machine learning engineers and researchers and 25 million software developers worldwide. To maintain the pace of innovation, we simply cannot afford to wait for another half-million ML engineers to surface over the next three years. Instead, it’s crucial to empower our vast community of software engineers to harness the wealth of available research and turn it into accessible, practical solutions.
Additionally, if we continued in a closed-source manner, every industry would be affected piecewise instead of in parallel. Industries with the largest number of experts would evolve first, as would industries in which the largest monetary gain could be realised. If our goal is to create impact industry-wide in the shortest time possible, then enabling as many people as possible to build with machine learning advancements is the only way to do that. This is one of our North Stars at Cerebrium.
5. A new business model for a new platform shift
With each significant technological breakthrough come new, innovative business models. A relevant example is the pivotal shift from on-premise infrastructure to the cloud. Businesses transitioned from large upfront expenses to usage-based pricing, which made technologies more accessible and gave businesses a larger Total Addressable Market (TAM) from day one.
One of the areas where I think we will see business model innovation is contribution-based pricing: users and companies will be priced based on the cost of the contributions of the multiple stakeholders who build the underlying product. Contributions in the form of raw data, curated and bespoke models, licenses, and even the underlying infrastructure itself will help push the industry toward being not only more performant and useful, but also more democratised and accessible.
With a pricing model of this nature, open-source contributors could be compensated for their work, and therefore researchers and research labs alike would have an incentive to release it.
In conclusion, my perspective as an engineer naturally gravitates towards open-source. Yet, as an entrepreneur, I recognise the commercial relevance and the need for some proprietary software. The open-source vs. closed-source debate is complex and ever-evolving, and many questions remain to be answered. Nevertheless, one thing I do believe: the future of machine learning will require embracing both open-source and closed-source development. Through that delicate balance, we can keep up with demand, foster innovation, and build tools that drive tangible benefits across diverse use cases. We have a lot of rewarding work ahead of us at Cerebrium, so if you are interested, please reach out; we are hiring.