Comparative analysis with other techniques for small molecule drug discovery
All AI techniques rely on thorough validation, and AI-based drug discovery is no exception. Receptor.AI pays special attention to experimental validation and testing of all pieces of technology which are used in our SaaS platform and in-house services.
For virtual screening, a core technology of our platform, there are two different measures of performance. The first one is the ability to distinguish “binders” from “non-binders”, in other words, to filter out the molecules unlikely to bind to the target protein while keeping those with a significant binding propensity. The fewer non-binders appear among the top-ranked molecules, and the fewer good binders are missed, the better the virtual screening method.
The second measure is the correct ranking of molecules according to their affinity and/or activity. The molecules with the highest affinity should, on average, be ranked higher than those with lower affinity. The higher the correlation between real and predicted binding affinities and/or biological activities, the better the method.
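As an illustration of the second measure, ranking quality is often summarized with a rank correlation between experimental and predicted affinities. The sketch below computes the Spearman coefficient in plain Python. The choice of Spearman and the pKd values are assumptions made for the example; the text does not state which correlation measure is used.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return covariance / (spread_x * spread_y)


def spearman(experimental, predicted):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(experimental), average_ranks(predicted))


# Illustrative (made-up) pKd values for four ligands of one target.
experimental_pkd = [9.1, 7.4, 6.2, 5.0]
predicted_pkd = [8.0, 7.9, 5.5, 6.0]
rho = spearman(experimental_pkd, predicted_pkd)
```

In this toy case the two weakest ligands are swapped in the predicted order, so the coefficient is about 0.8 rather than 1.0.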
These two performance metrics are usually suitable for two different stages of virtual screening. The first one is more relevant for an initial screening, which is designed to scan huge chemical space and select potential binders quickly and with reasonable precision. The second one is usually applied to the secondary screening in the selected pool of potential binders, which has to prioritize the compounds with the best characteristics for further development.
Our stack of technologies is designed to follow the idea of the virtual screening funnel based on a holistic approach. This gives us the significant advantage of applying the whole spectrum of algorithmic and AI techniques to achieve the final goal: finding highly efficient, safe and selective hit compounds in minimal time.
The number of compounds under consideration decreases by approximately one order of magnitude at each subsequent stage of the pipeline. The funnel starts with the chemical space, which can be pre-processed and clustered to achieve unprecedented screening performance (multi-billion-compound databases can be screened in just a few hours). After that, the initial AI-based virtual screening module is applied. The initial screening results are filtered with an advanced AI-based ADME-Tox module consisting of 38 predictive endpoints and fed into the selectivity prediction module. After that, the secondary screening, which is based on fully automated docking with AI rescoring, is performed, and the final set of ranked hit candidates is formed.
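The funnel arithmetic can be sketched as follows. This is a rough illustration only: the starting library size, the exact tenfold factor and the treatment of every module as a separate reduction step are assumptions, since the text only says "approximately one order of magnitude" per stage.

```python
# Stage names follow the pipeline description; treating each one as its own
# tenfold reduction step is an assumption made for this example.
STAGES = [
    "initial AI-based virtual screening",
    "ADME-Tox filtering",
    "selectivity prediction",
    "docking with AI rescoring",
]


def funnel_sizes(start, stages=STAGES, reduction=10):
    """Return (stage, compounds remaining) pairs for an idealized funnel."""
    sizes = [("chemical space", start)]
    current = start
    for name in stages:
        # Never drop below a single compound.
        current = max(1, current // reduction)
        sizes.append((name, current))
    return sizes


# Hypothetical library of one million compounds.
for stage, remaining in funnel_sizes(1_000_000):
    print(f"{stage}: {remaining}")
```

Under these assumptions, a one-million-compound library shrinks to about one hundred ranked hit candidates after the final stage.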
The stage of initial screening is represented by two drug-target interaction models: 3DProtDTA and FB-DTI, which are applied in parallel in a consensus mode.
The 3DProtDTA model is trained on a large set of protein-ligand pairs with known binding affinities, but the model does not encode physical protein-ligand interactions directly. The protein structure, dynamics and metadata on one side and the ligand chemical structure and fingerprints on the other side are encoded separately by graph neural networks with different architectures, whose outputs are then merged at the level of dense neural network layers.
Such an architecture allows working with protein-ligand pairs lacking a co-crystallized structure, which significantly increases the number of available training pairs. This model is agnostic to the ligand binding site and can operate even if there is no data on where the ligand binds to the target protein.
The fragment-based drug-target interaction model (FB-DTI) is based on another idea. It depends on the exact binding pocket location and evaluates the propensity of small molecular fragments to bind with different sub-pockets within a pre-selected binding site. After that, the best fragments are stitched together according to the set of compatibility rules learned from the combinatorial chemistry to get the whole ligand and its corresponding binding affinity prediction.
This technique allows working with proteins with no known ligands, protein complexes with the binding pockets located between their subunits and other non-trivial cases.
Moreover, if the binding pocket is unknown, it is possible to perform an unbiased pocket prediction by screening a small “trial” database against all surface areas of the protein, which have at least some “pocket-like” properties. After that, the surface spots with the highest affinities are marked as “pocket candidates” and are screened with larger chemical space. After a few iterations of such a procedure, it is possible to predict the binding pocket or several alternative pockets, even for the most challenging proteins and protein complexes.
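The iterative procedure described above can be outlined roughly as follows. This is only a schematic sketch: the function `mean_top_affinity`, the `keep_fraction` parameter and the toy data are hypothetical stand-ins, and the actual selection rules are not given in the text.

```python
def predict_pockets(surface_spots, libraries, mean_top_affinity, keep_fraction=0.5):
    """Iteratively narrow down pocket candidates.

    surface_spots      -- identifiers of "pocket-like" surface areas
    libraries          -- compound sets ordered from the small trial set to larger ones
    mean_top_affinity  -- hypothetical callable (spot, library) -> affinity summary
    keep_fraction      -- assumed share of spots kept after each round
    """
    candidates = list(surface_spots)
    for library in libraries:
        ranked = sorted(
            candidates,
            key=lambda spot: mean_top_affinity(spot, library),
            reverse=True,
        )
        # Keep the spots with the highest affinities as pocket candidates.
        keep = max(1, int(len(ranked) * keep_fraction))
        candidates = ranked[:keep]
    return candidates


# Toy stand-in for a real screening run: a fixed score per surface spot.
toy_scores = {"spot_A": 5.1, "spot_B": 7.8, "spot_C": 6.0, "spot_D": 7.2}


def toy_affinity(spot, library):
    return toy_scores[spot]


pockets = predict_pockets(toy_scores, ["trial set", "larger set"], toy_affinity)
```

With the toy scores, four spots are reduced to two after the trial round and to a single candidate, `spot_B`, after the second round.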
The parallel usage of both DTI techniques allows combining their strong points while compensating for their weaknesses. In addition, the very high screening speed of both models allows using them for proteome-wide assessment of selectivity for a large set of hit candidates. Each candidate molecule is screened against ~10k proteins in our platform to determine its selectivity against the target of interest and propensity for off-target interactions.
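A selectivity assessment of this kind can be illustrated with a small sketch. The scoring scale, the cutoff and the protein names are hypothetical, and the margin-based summary is an assumption made for the example rather than the platform's actual selectivity metric.

```python
def selectivity_report(scores, target, off_target_cutoff):
    """Summarize how selective one molecule is for its intended target.

    scores             -- mapping protein name -> predicted affinity score
    target             -- name of the intended target protein
    off_target_cutoff  -- assumed score above which an off-target is flagged
    """
    target_score = scores[target]
    others = {protein: s for protein, s in scores.items() if protein != target}
    # Off-targets at or above the cutoff, strongest first.
    flagged = sorted(
        (protein for protein, s in others.items() if s >= off_target_cutoff),
        key=others.get,
        reverse=True,
    )
    # Margin between the target and the strongest off-target.
    margin = target_score - max(others.values()) if others else float("inf")
    return {"target_score": target_score, "margin": margin, "off_targets": flagged}


# Made-up scores for one candidate against four proteins (a real run would
# cover thousands of proteins).
scores = {"target": 8.4, "protein_1": 7.1, "protein_2": 5.2, "protein_3": 4.8}
report = selectivity_report(scores, "target", off_target_cutoff=7.0)
```

Here the candidate scores 1.3 units higher on its target than on the strongest off-target, and `protein_1` is flagged as a potential off-target interaction.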
The stage of secondary screening is represented by fully automated docking with AI rescoring. We utilize well-established docking techniques based on genetic algorithms. On top of this, we apply the AI-based rescoring function, which evaluates the docking poses and updates their scores to provide better correspondence to experimental affinities. The rescoring AI model is trained on a high-quality subset of the data used to train the DTI models, but it is based on a different architecture (a computer vision CNN model), which is tuned for getting higher precision in qualitative discrimination of the binding energies.
In order to test the performance of model architectures for initial screening, we performed two experiments using different test datasets.
The first experiment was done with two widespread benchmark datasets for AI-based drug-target affinity predictions referred to as “Davis” and “KIBA”. The Davis dataset contains pairs of kinase proteins and their respective inhibitors with experimentally determined dissociation constant (Kd) values, which were used as labels for benchmarking. The KIBA dataset comprises scores originating from an approach called KIBA, in which kinase inhibitor bioactivities from different sources, such as Ki, Kd and IC50, are combined into a single score; these scores were used as labels for benchmarking.
We compared our 3DProtDTA model with 8 state-of-the-art open-source AI algorithms for drug-target affinity prediction using the same training set, test set, and performance metrics.
In this benchmark, our approach outperforms all eight competitors by a significant margin, which indicates that our model architecture and training protocol are at the state-of-the-art level.
In the second experiment, we tested the ability of 3DProtDTA to discriminate binders from non-binders on a large in-house test dataset containing 6,618 unique proteins and 80,079 unique hit compounds with known affinities. This translates to 157,809 experimentally validated protein-ligand pairs (the binders), augmented by 1,408,400 non-binder pairs used as negative controls. The latter were composed of experimentally validated pairs with non-active compounds and randomly generated pairs.
We computed the precision-recall curve, which is routinely used to evaluate the performance of predictive AI models. The area under this curve (AUC) represents the general ability of the model to make a correct prediction. In this test, “precision” is the fraction of molecules predicted as affine that are truly affine (it drops when non-affine molecules are predicted as affine, i.e. false positives), while “recall” is the fraction of truly affine molecules that the model recovers (it drops when affine molecules are predicted as not affine, i.e. false negatives).
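Our model has an AUC of 0.917. This value is not a per-case accuracy: it means that precision stays high over almost the whole range of recall, with 1.0 corresponding to a perfect separation of binders from non-binders. For comparison, a random classifier on this dataset would score about 0.10, which is the fraction of binder pairs (157,809 out of 1,566,209).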
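For readers who want to see how such a number is obtained, the sketch below computes precision-recall points and the step-wise area under the curve (often called average precision) in plain Python. The labels and scores are made up, ties between scores are not treated specially, and the exact integration scheme behind the reported value is not stated in the text, so this is only one common way to do it.

```python
def precision_recall_points(labels, scores):
    """Return (recall, precision) pairs obtained by lowering the score cutoff.

    labels -- 1 for a binder pair, 0 for a non-binder pair
    scores -- predicted scores, higher means more likely to bind
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    points, true_positives = [], 0
    for predicted_count, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        recall = true_positives / total_positives
        precision = true_positives / predicted_count
        points.append((recall, precision))
    return points


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve."""
    area, previous_recall = 0.0, 0.0
    for recall, precision in precision_recall_points(labels, scores):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


# Made-up example: three binders and two non-binders.
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
ap = average_precision(labels, scores)

# Expected area for a random classifier: the fraction of binder pairs,
# computed from the dataset sizes quoted in the text.
binders, non_binders = 157_809, 1_408_400
random_baseline = binders / (binders + non_binders)
```

The toy example gives an area of about 0.81, and the random baseline for the dataset sizes quoted above is about 0.10.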
In order to test the secondary screening performance, we took four common proteins with a significant number of known ligands having reliable binding affinities. The goal of the secondary screening is to rank the selected hit candidates, which are found by the initial screening in the huge chemical space, so the fine-grained placement of the ligands according to their affinities is important for their correct prioritization.
We selected the 16 most widespread docking techniques dedicated to predicting ligand poses and affinities. Some of them are based on AI scoring functions, which makes them especially interesting for us.
From our side, we tested not only Receptor.AI docking with AI rescoring (which is our dedicated method for secondary screening) but also our DTI and FB-DTI models, as well as the consensus model of DTI and docking with AI rescoring.
There is an elaborate framework of consensus functions used in our technology stack. For example, DTI and FB-DTI models are balanced by giving them different weights depending on the number of ligands for a particular protein, reliability of its binding pocket annotation, size of the binding pocket and user preferences. Such smart weighting allows automatic prioritization of the most relevant and reliable DTI model for a given protein target. Another proprietary consensus function is used to combine the results of DTI models with docking scores. This function is designed in a semi-automatic way by tuning a large number of parameters and sampling thousands of possible functional forms.
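The idea of weighting two models by the reliability of their inputs can be illustrated with a deliberately simple sketch. The actual consensus functions are proprietary, so the weighting formula, the `ligand_saturation` parameter and the numbers below are purely hypothetical; only two of the factors mentioned above (number of known ligands and pocket annotation reliability) are modeled.

```python
def consensus_score(dti_score, fb_dti_score, n_known_ligands,
                    pocket_confidence, ligand_saturation=100):
    """Weighted average of two model scores (hypothetical weighting scheme).

    n_known_ligands    -- known ligands of the target; more ligands -> trust DTI more
    pocket_confidence  -- 0..1 reliability of the pocket annotation; higher -> trust FB-DTI more
    ligand_saturation  -- assumed ligand count at which the DTI weight stops growing
    """
    dti_weight = min(n_known_ligands, ligand_saturation) / ligand_saturation
    fb_dti_weight = pocket_confidence
    total = dti_weight + fb_dti_weight
    if total == 0:
        # No information favours either model: fall back to a plain average.
        return (dti_score + fb_dti_score) / 2
    return (dti_weight * dti_score + fb_dti_weight * fb_dti_score) / total


# A well-studied target with a reliable pocket: both models count equally.
balanced = consensus_score(8.0, 6.0, n_known_ligands=100, pocket_confidence=1.0)
# A target with no known ligands: only the pocket-based model contributes.
pocket_only = consensus_score(8.0, 6.0, n_known_ligands=0, pocket_confidence=1.0)
```

In this sketch, a target with no known ligands is scored by the pocket-based model alone, while a well-characterized target receives the plain average of both models.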
It is necessary to emphasize that the DTI models are designed for initial screening, so they are not required to be highly performant in the correct ranking of the molecules with significant binding affinities. For such techniques, it is crucial to discriminate binders from non-binders, but they may not rank binders as precisely as dedicated docking techniques.
First, we augmented the sets of known ligands for the selected proteins with a large number of decoys (which are guaranteed to be non-binders) and checked whether our DTI model recovers the real ligands from among the decoys. The results are, as expected, excellent: the top 20 compounds contain 10 out of 10 known ligands for three proteins and 13 out of 16 for the fourth one.
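A decoy recovery test of this kind boils down to counting how many known ligands land among the top-ranked compounds. The sketch below shows that count on made-up scores; the compound names and values are illustrative only.

```python
def recovered_in_top_k(scores, known_ligands, k):
    """Count the known ligands among the k best-scored compounds.

    scores        -- mapping compound name -> predicted score (higher is better)
    known_ligands -- names of the experimentally confirmed ligands
    k             -- size of the top-ranked list to inspect
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return len(set(ranked[:k]) & set(known_ligands))


# Made-up scores for three known ligands mixed with three decoys.
scores = {
    "ligand_1": 0.95,
    "decoy_1": 0.40,
    "ligand_2": 0.88,
    "decoy_2": 0.91,
    "decoy_3": 0.10,
    "ligand_3": 0.35,
}
known = ["ligand_1", "ligand_2", "ligand_3"]
hits_in_top_3 = recovered_in_top_k(scores, known, k=3)
```

In this toy case two of the three known ligands appear in the top three, because one decoy outranks `ligand_2` and `ligand_3` scores low.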
Then, we evaluated the binding scores for known ligands using our techniques and all 16 competing docking techniques and compared the correlations between predicted and experimental values for all of them.
Quite surprisingly, our DTI and FB-DTI techniques, which are not designed for the correct fine-grained ranking of compounds with high binding affinities, perform on par with the best dedicated docking techniques.
Our in-house docking with AI rescoring performs slightly better than this, while the combination of DTI with docking and AI rescoring gives the best result of all tested methods.
This is a remarkable result: it shows that Receptor.AI virtual screening techniques can compete with dedicated docking algorithms in their ability to correctly rank ligands with high binding affinity, while their combination with docking and the AI rescoring function outperforms the competitors.