We present a prototype of ChemXOR, a tool to build encrypted AI models for drug activity prediction
Our full registered name is ‘Ersilia Open Source Initiative’. We have participated in public community efforts such as the Open Source Malaria and the Open Source Antibiotics contests; we have enrolled in incubator programs at the Software Sustainability Institute, Open Life Science and Code for Science & Society; and we release software under the most permissive licenses, allowing for unrestricted usage, reproduction and modification of our digital assets. So it is clear that we take the principles of open access and open science very seriously (perhaps too seriously, since we publish progress by the day, even when things are unfinished, undocumented, buggy and dysfunctional). Sticking to these principles has resulted in the most gratifying and unexpected experiences for our tech non-profit, but it is also fair to say that, funding-wise, it may have been a self-imposed limitation. Sometimes I fear that Ersilia is a charity that looks like a start-up, which is probably not very appealing to the average funder. We offer high risk, little track record and no return on investment. Therefore, we are not on the radar of venture capitalists (we knew that), but we also struggle to convince philanthropies and grant donors in general, big or small, who (understandably) expect to see some sort of previous evidence, something that helps them imagine how their contribution will translate into tangible benefits for society.
It is an irony that the first grant we were ever awarded was titled ‘Privacy-preserving AI for drug discovery’. It’s a Biopharma Speed Grant given by Merck. ‘Privacy’ is not a word that a charity like ours should be using too often, certainly not as the first word of its first-ever funded project, and certainly not in the context of drug discovery. I am not referring to privacy of patient data here, anonymity of clinical samples, sensitive personal details, et cetera. I am talking about privacy of drug molecules, keeping them secret, embargoed, locked behind the walls of intellectual property and at the mercy of the rules of business. I get why these things need to exist in the world, and we have no plans to confront them, but they are not something we should be contributing to. Not given our discourse and bare-minimum resources. It sounds like a complete contradiction, a sudden turn in our roadmap, a detour, at best.
I guess I am writing this blog post to explain why, looking at the bigger picture, we believe that ‘privacy-preserving AI’ can effectively contribute to data accessibility and, ultimately, benefit researchers working in low-resource settings. By ‘data’ I mean results obtained from costly experiments carried out in laboratories and hospitals (of the Global North), to test the efficacy and safety of potential drug candidates. Considering only experiments available in the scientific literature, we can already gather millions of compounds, and tens of millions of data points, contributed collectively by the scientific community over decades of research. Access to these datasets is fundamental to build AI-based solutions and, hopefully, accelerate the discovery of the drugs of the future. Small, underfunded laboratories need these public data to keep going, especially if they do not have the capacity to produce much experimental data in-house. Tens of millions of publicly available data points may sound like a lot, but it’s not, really. It is only a small portion, significant in value but insignificant in size, of the information that exists out there.
Most datasets are siloed in the computers of pharmaceutical companies, obviously. The world of science would be transformed (for the better) if all of this knowledge were unlocked. But I don’t see this happening anytime soon. The process to discover a drug is essentially a sequence of filters, starting with millions of candidate compounds and ending, if there is luck, with one molecule on the market. At each filtering step, there is a specific assay asking a specific question. Is the molecule soluble in water? Does it kill the pathogen in vitro? Is it toxic in human cells cultured in the laboratory? And in mice, is it toxic in mice? At what dose? And below this dose, does it still kill the pathogen in the blood of infected mice? And so on, and so on, all the way to clinical trials. If you think about it, discovering a drug is the quest for identifying bad candidates as soon as possible. Only one out of millions of molecules will make it through the filters, so why are these pharmaceutical companies not publishing the data associated with discarded compounds? It’s trash for them, anyway, and it would be a precious gift to the scientific community, especially to AI practitioners (and research parasites like us). If I make an effort, I can understand why these data need to be secured (as I will explain in the following paragraph) but, to be completely honest, I think that archiving experimental results is a disgraceful practice overall. I’ve dealt with enough private pharmaceutical data in my career to realize that, on many occasions, all the secrecy doesn’t make sense and is just a default stance. But, as we know, secrets make the unremarkable look remarkable, and if you happen to share the secret eventually, you may ask for another one in return. Anyway.
I was saying: if I make an effort, I can understand why pharmaceutical companies are reluctant to share their historical, archived datasets. Arguably, the most important asset of these companies is the collection of candidate compounds that is piped through the drug discovery process. This collection is often shared between projects, and represents an ever-increasing corpus of medicinal chemistry know-how. Revealing the identity of these compounds would cause an immediate, catastrophic loss of competitive advantage, and it would sabotage patents and market exclusivity. The major pharmaceutical companies have expressed publicly their willingness to adopt the principles of open science, but undermining their central pillar is probably too much of an ask. So the question becomes: is there a way to release these archived datasets without unveiling the identity of the compounds? An effective way, something that the scientific community can actually use and exploit.
Sure, there is a way. AI models are nothing but algorithms that have been instructed to ‘learn’ from a given dataset. An AI model built on, for example, antimalarial activity data must apprehend the molecular traits that make a good antimalarial compound, and it must do it in the light of the training dataset. From a user perspective, querying the resulting AI model — in this case, inputting a molecule of interest and getting as output a prediction of its antimalarial potential — is a perfectly valid thing to do. The beauty of it is that an AI model is simply a set of numbers, matrices and algebraic operations, so the structure of the compounds used to train the model is not explicitly displayed in it. The user does not know what molecules went into building the AI model, but is nonetheless using it, because there is legitimate value in it. There is value in a tool that has learned, by whatever means and in the light of whatever protected training data, the molecular traits that make a good antimalarial.
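To make the point concrete, here is a minimal sketch in Python of what ‘a set of numbers’ means in practice. The weights, the bias and the four-element descriptor vector are all made up for illustration; they do not come from any real antimalarial model, and a real one would be far larger.

```python
import math

# Hypothetical trained parameters: one weight per molecular descriptor, plus a bias.
# Note that nothing in here names or stores any training molecule.
WEIGHTS = [0.8, -1.2, 0.5, 2.0]
BIAS = -0.3


def predict_activity(descriptors):
    """Return a score in (0, 1) for a descriptor vector (logistic model)."""
    if len(descriptors) != len(WEIGHTS):
        raise ValueError("expected %d descriptors" % len(WEIGHTS))
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, descriptors))
    return 1.0 / (1.0 + math.exp(-z))


# The user supplies descriptors of a molecule of interest and gets a prediction back.
query = [1.0, 0.0, 1.0, 0.0]
score = predict_activity(query)
print(round(score, 3))
```

The user only ever touches `predict_activity`; the training compounds appear nowhere in the parameters, at least not explicitly.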
So a fabulous event would be that every pharmaceutical company trains one AI model based on each of their archived datasets, and then releases the resulting repertoire of AI tools for anyone to use. This would effectively unlock their private data without compromising intellectual property. It is not clear to me how we should be incentivizing this from the public sector, but still, I think it would be a spectacular event to witness. One year ago, at Ersilia we decided to push in this direction. However, also from our end, and from a technical perspective, the whole thing is more nuanced than it seems. As it turns out, it is not impossible that a malicious user will be able to reverse engineer the original training data based solely on the AI model architecture and parameters. That is, a malicious user, upon querying the AI model insistently, and observing it, and following a specific strategy, may be capable of inferring the identity of the molecules seen by the AI model at training time. The shadow of this possibility, no matter how unlikely this possibility is, poses an unaffordable risk to any pharmaceutical company.
While reflecting on malicious users, we also understood benign users a bit more. After all, they will be the overwhelming majority: scientists doing honest work and hoping to gain insights from the newly released AI models based on pharmaceutical company data. Perhaps, we thought, we should also be respecting the privacy of these scientists (the users), in addition to the privacy of the companies (the data providers). Scientists and academic research institutes have the right to have their own intellectual property protection agenda. They would probably hesitate before an AI model served on the cloud, or hosted by a private company, unless secrecy on their input molecules is fully guaranteed.
So an ideal setup would be an ecosystem of encrypted AI models (with no risk of reverse engineering), deployed such that users can use them in a private mode, if they wish to. This is what we tried to achieve with the development of ChemXOR, a Python library for privacy-preserving AI focused on drug discovery applications. Our colleague Ankur Kumar deserves great credit for it. ChemXOR does the following:
- It offers a framework to train AI models for compound activity prediction, based on a basic set of small molecule descriptors and neural network architectures (simplicity for the data scientist).
- It automatically encrypts the resulting AI model parameters (privacy for the data provider).
- It encrypts user input and returns an encrypted output (the prediction) that can only be decrypted by the same user. All of this happens automatically (privacy for the user).
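
To give a flavour of the third point, here is a toy sketch in Python of a prediction computed on encrypted input. It is not ChemXOR's code or API. I am assuming a textbook Paillier scheme (additively homomorphic), a linear score with integer weights, and two small hard-coded primes that are unsafe for any real use. All function names and values are hypothetical. Unlike ChemXOR, this sketch keeps the model weights in plaintext on the server; only the user's input and output are encrypted.

```python
import math
import secrets

# Toy key material (assumption): two Mersenne primes, far too small for real use.
P = 2**31 - 1
Q = 2**61 - 1


def keygen(p=P, q=Q):
    """User side: return (public modulus n, private key)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt a (possibly negative) integer m under public modulus n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    # (n + 1)^m mod n^2 simplifies to 1 + m*n mod n^2
    return ((1 + (m % n) * n) * pow(r, n, n2)) % n2


def decrypt(n, priv, c):
    """User side: recover the signed integer hidden in ciphertext c."""
    lam, mu = priv
    n2 = n * n
    m = ((pow(c, lam, n2) - 1) // n) * mu % n
    return m - n if m > n // 2 else m  # map back to signed integers


def encrypted_score(n, enc_x, weights, bias):
    """Server side: compute Enc(bias + sum(w_i * x_i)) without decrypting."""
    n2 = n * n
    acc = encrypt(n, bias)
    for c, w in zip(enc_x, weights):
        # Raising a ciphertext to w multiplies the hidden value by w;
        # multiplying two ciphertexts adds their hidden values.
        acc = (acc * pow(c, w % n, n2)) % n2
    return acc


n, priv = keygen()
x = [1, 0, 1, 1]            # user's hypothetical binary descriptors
weights = [8, -12, 5, 20]   # server's weights, scaled to integers
bias = -3

enc_x = [encrypt(n, v) for v in x]                 # user encrypts the query
enc_y = encrypted_score(n, enc_x, weights, bias)   # server never sees x
score = decrypt(n, priv, enc_y)                    # only the user can read this
print(score)
```

The server only ever handles ciphertexts, and only the holder of the private key can read the score. Encrypting the weights as well would require a scheme that can multiply two ciphertexts, which Paillier cannot do.
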
Code and documentation are available here and here, and you can also find a more detailed report here. ChemXOR is just a prototype, something we offer as a tech nonprofit because we genuinely/naively believe in this concept, at the junction of data privacy and open access, applied to drug discovery. We are willing to develop the tool further but, to be completely honest, we first need to know whether there is interest in it. We may as well be daydreaming here (it has happened to us multiple times), and we are not in a position to call pharmaceutical companies to action. So unless there is an authentic desire to share data (i.e. to contribute AI models) among these private stakeholders, the whole effort will be futile. Please reach out to us if you have thoughts on the matter. We’ll be happy to discuss them and, hopefully together, strategize the next small step towards making pharmaceutical company data available to all.