Since late last year, I’ve been reading, writing, and speaking about the intersection of open source software and machine learning, trying to understand what the future might bring.
When I started, I expected that I would be talking mostly about how open source software is used by the machine learning community. But the more I’ve explored, the more I’ve realized that there are a lot of similarities between the two areas of practice. In this article I’ll discuss some of those parallels — and what machine learning can and can’t learn from open source software.
The easy and obvious parallel is that both modern machine learning and modern software are built almost entirely with open source software. For software, that is compilers and code editors; for machine learning, it is training and inference frameworks like PyTorch and TensorFlow. These spaces are dominated by open source software, and nothing appears ready to change that.
There is one notable, apparent exception to this: all of these frameworks depend on the very proprietary Nvidia hardware and software stack. This is actually a closer parallel than it might appear at first. For a long time, open source software ran mostly on proprietary Unix operating systems, sold by proprietary hardware vendors. It was only after Linux came along that we began to take for granted that an open “bottom” of the stack was even possible, and much open development is done these days on macOS and Windows. It is unclear how this will play out in machine learning. Amazon (for AWS), Google (for both cloud and Android), and Apple are all investing in competing chips and stacks, and it’s possible that one or more of those could follow the path laid by Linus (and Intel) of freeing the entire stack.
A more critical parallel between how open source software is built and how machine learning is built is the complexity and public availability of the data that each is built on.
As detailed in “The Data Provenance Project,” a preprint paper I co-authored, modern machine learning is built on literally thousands of data sources, just as modern open source software is built on hundreds of thousands of libraries. And just as each open library brings with it legal, security, and maintenance challenges, each public data set brings with it the exact same set of difficulties.
At my organization, we’ve talked about open source software’s version of this challenge as being an “accidental supply chain.” The software industry started building things because the incredible building blocks of open source libraries meant that we could. This meant the industry started treating open source software as a supply chain—which came as a surprise to many of those “suppliers.”
To mitigate these challenges, open source software has developed lots of sophisticated (though imperfect) techniques, like scanners for identifying what is being used, and metadata for tracking things after deployment. We’re also starting to invest in humans, to try to address the mismatch between industrial needs and volunteer motivations.
Unfortunately, the machine learning community seems ready to plunge into the exact same “accidental” supply chain mistake—doing lots of things because it can, without stopping to think much about the long-term implications once the entire economy is based on these data sets.
A last important parallel is that I strongly suspect machine learning will expand to fill many, many niches, just as open source software has. At the moment, the (deserved) hype is about large, generative models, but there are also many small models out there, as well as tweaks on larger models. Indeed, HuggingFace, machine learning’s primary hosting platform, reports that the number of models on its site is growing exponentially.
These models will likely be plentiful and available for improvement, much like small pieces of open source software. That will make them incredibly flexible and powerful. I’m using a small machine learning-based tool to do cheap, privacy-sensitive traffic measurement on my street, for example, a use case that wouldn’t have been possible except on expensive devices a few years ago.
But this proliferation means that they’ll need to be tracked—models may become less like mainframes and more like open source software or SaaS, which pop up all over the place because of low cost and ease of deployment.
So if there are these important parallels (particularly of complex supply chains and proliferating distribution), what can machine learning learn from open source software?
The first parallel lesson we can draw is simply that to understand its many challenges, machine learning will need metadata and tooling. Open source software stumbled into metadata work through copyright and licensing compliance, but as the accidental supply chain for software has matured, metadata has proven immensely useful on a variety of fronts.
In machine learning, metadata tracking is a work in progress. A few examples:
- A key 2019 paper, widely cited in the industry, urged developers of models to document their work with “model cards.” Unfortunately, recent research suggests their implementation in the wild is still weak.
- Both the SPDX and CycloneDX software bills of materials (SBOM) specifications are working on AI bills of materials (AI BOMs) to help track machine learning data and models, in a more structured manner than model cards (befitting the complexity one would expect if this truly does parallel open source software).
- HuggingFace has created a variety of specs and tools to allow model and dataset authors to document their sources.
- The MIT Data Provenance paper cited above tries to understand the “ground truth” of data licensing, to help flesh out the specifications with real-world data.
- Anecdotally, many companies doing machine learning training work appear to have somewhat casual relationships with data tracking, using “more is better” as an excuse to shovel data into the hopper without necessarily tracking it well.
If we’ve learned anything from open source software, it’s that getting the metadata right (first the specs, then the actual data) is going to be a project of years and may require government intervention. Machine learning should take that metadata plunge sooner rather than later.
Security has been another major driver of open source software’s metadata demand—if you don’t know what you’re running, you can’t know if you’re susceptible to the seemingly endless stream of attacks.
Machine learning isn’t subject to most types of traditional software attacks, but that doesn’t mean models are invulnerable. (My favorite example is that it was possible to poison image training sets because they often drew from dead domains.) Research in this area is hot enough that we’ve already gone past “proof of concept” and into “there are enough attacks to list and taxonomize.”
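The dead-domains example suggests one simple mitigation that borrows directly from software supply chain practice: pin a content hash when a dataset URL is first recorded, and verify it before training. A minimal sketch (the helper names are mine, not from any particular framework):

```python
# Sketch: detect silently-changed training data by pinning content hashes.
# If a source domain lapses and is re-registered by an attacker, the bytes
# served at the old URL change, and the hash check fails.
import hashlib

def content_hash(data: bytes) -> str:
    """SHA-256 hex digest of a downloaded training file."""
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, pinned: str) -> bool:
    """True only if the bytes still match the hash recorded at collection time."""
    return content_hash(data) == pinned

original = b"cat.jpg bytes as first collected"
pinned = content_hash(original)

tampered = b"attacker-served bytes from a re-registered domain"
assert verify(original, pinned)
assert not verify(tampered, pinned)
```

This is the same idea behind lockfiles and checksummed package registries in open source software: it doesn’t prove the original data was good, but it does guarantee you are training on the same bytes you originally audited.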
Unfortunately, open source software can’t offer machine learning any magic bullets for security—if we had them, we’d be using them. But the history of how open source software spread to so many niches suggests that machine learning must take this challenge seriously, starting with tracking usage and deployment metadata, exactly because it is likely to be applied in so many ways beyond those in which it is currently deployed.
The motivations that drove open source metadata (licensing, then security) point to the next important parallel: as the importance of a sector grows, the scope of things that must be measured and tracked will expand, because regulation and liability will expand.
In open source software, the primary government “regulation” for many years was copyright law, and so metadata developed to support that. But open source software now faces a variety of security and product liability rules—and we must mature our supply chains to meet those new requirements.
AI will similarly be regulated in an ever-growing multitude of ways as it becomes ever-more important. The sources of regulation will be extremely diverse, including on content (both inputs and outputs), discrimination, and product liability. This will require what is sometimes called “traceability”—understanding how the models are built, and how those choices (including data sources) impact the outcomes of the models.
This core requirement (what do we have? how did it get here?) is now intimately familiar to enterprise open source software developers. However, it may be a radical change for machine learning developers, and one that needs to be embraced.
Another parallel lesson machine learning can draw from open source software (and indeed from many waves of software before it, dating back at least to the mainframe) is that its useful life will be very, very long. Once a technology is “good enough,” it will be deployed and therefore must be maintained for a very, very long time. This implies that we must think about maintenance of this software as early as possible, and think about what it will mean that this software might survive for decades. “Decades” is not an exaggeration; many customers I encounter are using software that is old enough to vote. Many open source software companies, and some projects, now have so-called “Long Term Support” versions that are intended for these sorts of use cases.
In contrast, OpenAI kept their Codex tool available for less than two years, leading to a lot of anger, especially in the academic community. Given the rapid pace of change in machine learning, and that most adopters are probably interested in using the very cutting edge, this probably wasn’t unreasonable. But the day will come, sooner than the industry thinks, when it will need to plan for this sort of “long term,” including how it interacts with liability and security.
Finally, it’s clear that—like open source software—there is going to be a lot of money flowing into machine learning, but most of that money will pool around what one author has called the “processor rich” companies. If the parallels to open source software play out, those companies will have very different concerns and spending priorities than the median creator (or user) of models.
Our company, Tidelift, has been thinking about this problem of incentives in open source software for some time, and entities like the world’s largest purchaser of software—the US government—are looking into the problem as well.
Machine learning companies, especially those seeking to create communities of creators, should think hard about this challenge. If they’re dependent on thousands of data sets, how will they ensure those are funded for maintenance, legal compliance, and security, for decades? If large companies end up with dozens or hundreds of models deployed around the company, how will they ensure those with the best specialist knowledge—those who created the models—are still around to work on new problems as they are discovered?
As with security, there are no easy answers to this challenge. But the sooner machine learning takes the problem seriously, not as an act of charity but as a key component of long-term growth, the better off the entire industry, and the entire world, will be.
Machine learning’s deep roots in academia’s culture of experimentalism, and Silicon Valley’s culture of fast iteration, have served it well, leading to an amazing explosion of innovation that would have seemed magical less than a decade ago. Open source software’s course in the past decade has perhaps been less glamorous, but during that time it has become the underpinning of all enterprise software, and it has learned a lot of lessons along the way. Hopefully machine learning will not reinvent those wheels.
Luis Villa is co-founder and general counsel at Tidelift. Previously he was a top open source lawyer advising clients, from Fortune 50 companies to leading startups, on product development and open source licensing.