Rapid traversal of vast chemical space using machine learning-guided docking screens

Rapid traversal of vast chemical space using machine learning-guided docking screens


Play all audios:

Loading...

Download PDF Article Open access Published: 13 March 2025 Rapid traversal of vast chemical space using machine learning-guided docking screens Andreas Luttens  ORCID:


orcid.org/0000-0003-2915-79011,2,3, Israel Cabeza de Vaca1, Leonard Sparring1, José Brea4,5, Antón Leandro Martínez  ORCID: orcid.org/0000-0002-1595-34594,5, Nour Aldin Kahlous  ORCID:


orcid.org/0000-0002-7744-14911, Dmytro S. Radchenko  ORCID: orcid.org/0000-0001-5444-77546, Yurii S. Moroz  ORCID: orcid.org/0000-0001-6073-002X6,7,8, María Isabel Loza  ORCID:


orcid.org/0000-0003-4730-08634,5, Ulf Norinder  ORCID: orcid.org/0000-0003-3107-331X9 & …Jens Carlsson  ORCID: orcid.org/0000-0003-4623-29771 Show authors Nature Computational Science volume


 5, pages 301–312 (2025)Cite this article


19k Accesses


2 Citations


27 Altmetric


Metrics details

Subjects CheminformaticsComputational chemistryMachine learningStructure-based drug designVirtual drug screening Abstract


The accelerating growth of make-on-demand chemical libraries provides unprecedented opportunities to identify starting points for drug discovery with virtual screening. However, these


multi-billion-scale libraries are challenging to screen, even for the fastest structure-based docking methods. Here we explore a strategy that combines machine learning and molecular docking


to enable rapid virtual screening of databases containing billions of compounds. In our workflow, a classification algorithm is trained to identify top-scoring compounds based on molecular


docking of 1 million compounds to the target protein. The conformal prediction framework is then used to make selections from the multi-billion-scale library, reducing the number of


compounds to be scored by docking. The CatBoost classifier showed an optimal balance between speed and accuracy and was used to adapt the workflow for screens of ultralarge libraries.


Application to a library of 3.5 billion compounds demonstrated that our protocol can reduce the computational cost of structure-based virtual screening by more than 1,000-fold. Experimental


testing of predictions identified ligands of G protein-coupled receptors and demonstrated that our approach enables discovery of compounds with multi-target activity tailored for therapeutic


effect.

Similar content being viewed by others Artificial intelligence–enabled virtual screening of ultra-large chemical libraries with deep docking Article 04 February 2022 An


artificial intelligence accelerated virtual screening platform for drug discovery Article Open access 05 September 2024 The impact of library size and scale of testing on virtual screening


Article 03 January 2025 Main


The number of possible drug-like molecules has been estimated to be more than 1060, which exceeds the size of chemical libraries evaluated in early drug discovery by many orders of


magnitude1. In fact, only ~13 million compounds are currently available in-stock from chemical suppliers, which clearly illustrates the limited coverage of chemical space2. Advances in


synthetic organic chemistry have provided access to increasingly larger compound collections and make-on-demand libraries currently contain >70 billion readily available molecules3,4. The


diverse scaffolds available in these libraries represent a major opportunity for drug discovery, but identifying the compounds relevant for a specific target in this enormous chemical space


remains a major challenge.


Recently, structure-based virtual screens of ultralarge libraries have identified ligands of important therapeutic targets, demonstrating that expanding the coverage of chemical space can


accelerate early drug discovery5,6,7. The most recently published docking screens have reached billions of compounds, but evaluation of these massive libraries is demanding due to the


substantial computational resources required8,9. The make-on-demand databases will also continue to grow and probably reach trillions of compounds in the near future, which will be


unfeasible to screen even with the fastest structure-based docking algorithms. Therefore, there is an urgent need for more efficient virtual screening approaches able to evaluate these vast


chemical libraries10.


Recent breakthroughs in artificial intelligence have revived interest in using quantitative structure–activity relationship (QSAR) models in drug discovery. QSAR has been widely used by the


pharmaceutical industry to predict both on- and off-target activities, as well as physicochemical and pharmacokinetic properties11. By representing compounds using molecular descriptors,


machine learning methods can rapidly evaluate large compound databases. Traditionally, QSAR models have been trained on experimental data12, but there is an increasing interest in predicting


which compounds in make-on-demand libraries are likely to receive favorable scores from computationally expensive virtual screening methods13,14,15. This combination of machine learning and


molecular docking screening has the potential to enable virtual screens of multi-billion-scale compound libraries at a modest computational cost.


In this work, we developed an ultrafast workflow based on conformal prediction (CP) for screening of vast chemical libraries. The CP framework can be applied to any machine learning


classifier and allows the user to control the error rate of the predictions16,17. Mondrian conformal predictors provide class-specific confidence levels that ensure validity for both the


majority and minority class. This approach is therefore well suited for handling inherently imbalanced datasets such as virtual screening applications, which focus on identifying a small


number of top-scoring compounds in a chemical library18. The framework has been utilized in QSAR applications to predict pharmacokinetic properties and bioactivity19,20. Strategies to


improve the virtual screening efficiency using the CP framework have been explored, but these workflows did not achieve the efficiency required to evaluate multi-billion-scale libraries21.


Applications of more recently developed techniques such as gradient boosting, deep neural networks and transformers to early-phase drug discovery have been successful22,23,24. Here we


combined the CP framework with several state-of-the-art classification algorithms to develop a workflow for accelerated structure-based virtual screening. Our most efficient protocol


identifies the top-scoring compounds in ultralarge compound libraries and reduces the number of molecules to be explicitly docked by three orders of magnitude. We show that application of


machine learning to guide docking screens of multi-billion-scale compound databases enables efficient discovery of ligands targeting G protein-coupled receptors, which is one of the most


important families of drug targets25. In particular, our workflow can screen billions of compounds against several targets to identify ligands with activity at multiple targets relevant for


the same disease.

Results


The protocol combining CP and molecular docking to navigate ultralarge compound libraries is described in ‘Machine learning-accelerated virtual screening pipeline’ in Methods (Fig. 1,


Supplementary Section 1 and Supplementary Fig. 1). In the development of this approach, we first conducted benchmarking docking screens against eight protein targets. The resulting datasets


were used to select suitable algorithms and molecular descriptors. In the second step, the method was further optimized to enable virtual screens of multi-billion-scale libraries and applied


to predict ligands of the A2A adenosine (A2AR) and D2 dopamine (D2R) receptors.

Fig. 1: Machine learning-accelerated virtual screening workflow.


a, Selection of a target protein for virtual screening. b, Model representation of protein binding site for molecular docking calculations. c, A subset from an ultralarge (Ro4, rule-of-four)


chemical library is extracted and prepared for docking screens. d, Docking scores for compounds in the training set are generated. e, A docking score threshold splits the training set into


virtual actives (1 class) and inactives (0 class). f, Molecules are represented by descriptors (for example, fingerprints). g, Machine learning models are trained to distinguish virtual


actives from inactives. h, The trained models are used to identify a subset of predicted virtual actives in the ultralarge library. i, A set of compounds is selected for docking to the


target. Rule-of-four molecules suggested by machine learning (Ro4 ML) have an improved docking score distribution compared to random molecules in the training set. j, Post-processing of


docking results and selection of compounds. k, Compounds are selected based on visual inspection. l, Experimental evaluation of synthesized compounds.

Full size imageBenchmarking of


conformal predictors


Molecular docking screens against eight therapeutically relevant proteins were carried out to initiate performance evaluation of the CP workflow. A detailed description of the protein


targets and preparation of the molecular docking calculations is provided in Supplementary Table 1, Supplementary Fig. 2 and ‘Preparation of proteins for docking’ in Methods. Eleven million


randomly sampled rule-of-four (Ro4, molecular weight <400 Da and cLogP < 4) molecules from the Enamine REAL space were prepared for molecular docking and screened against each target. In


total, more than 493 trillion protein–ligand complexes were predicted, resulting in a final benchmarking set of 88 million unique protein–ligand complexes and their corresponding scores. For


each target, chemical structures of the compounds and their corresponding docking scores were used to create training (106 compounds) and test (107 compounds) sets for evaluating the CP


framework. The energy threshold for the active (minority) class was determined based on the top-scoring 1% of each screen.


For each protein target, we assessed the performance of three different machine learning algorithms: CatBoost26, deep neural networks27 and Robustly Optimized Bidirectional Encoder


Representations from Transformers Approach (RoBERTa)28. To explore diverse representations of small molecules, we trained our algorithms on three different types of features: (1) Morgan2


fingerprints, the RDKit implementation of the substructure-based ECFP4 descriptor29, which have consistently performed among the best features in previous virtual screening benchmarks30; (2)


recently developed continuous data-driven descriptors (CDDD)31, which provide dense latent representations of molecules; and (3) transformer-based descriptors derived from a pretrained


RoBERTa encoder, which serve as the input for fine-tuning the RoBERTa models28. Detailed descriptions of the hyperparameters used in the training of each classifier are provided in


Supplementary Section 2 (Supplementary Table 2 and Supplementary Figs. 3–5).


Five independent classifiers were trained on 1 million labeled features, of which 80% was used for proper training and the remaining 20% for calibration. The features of the compounds in the


test set (10 million compounds) were then assigned 10 normalized P values (five P1 and five P0 values) by using each individual classification model and its corresponding calibration sets.


The two resulting sets of P1 and P0 values were aggregated into a single pair of P1 and P0 values by taking the respective medians (Supplementary Fig. 1). On the basis of the aggregated P


values and a selected significance level (ε), the Mondrian CP framework was used to divide the compounds into virtual active, virtual inactive, both (meaning, either virtual active or


inactive) and null (no assignment) sets (Fig. 2a). The performance on the benchmarking set was assessed using the significance level at which the predictor resulted in the maximal number of


useful (single label) predictions, εopt (Fig. 2b). The metrics to assess the performance of the models (sensitivity, precision, efficiency and prediction error rate) are defined in ‘Training


and evaluation of machine learning classifiers’ in Methods. Following the CP framework, exchangeability between the training and test sets led to strong agreement between the prediction


error rate and the selected significance level16 (Fig. 2c). To minimize the number of compounds requiring explicit docking while maximizing predictive power, we aimed to determine the


optimal size of the training set, exploring a range between 25,000 and 1 million compounds. Improved sensitivity, precision, and significance values were obtained for all targets when


increasing the training set size (Fig. 2d–f and Supplementary Tables 3–8).

Fig. 2: Benchmarking of conformal predictors.


a–c, Summary of application of the Mondrian CP framework to one of the targets in the benchmarking set (A2AR). a, Molecules were classified into four distinct sets based on their P values


and a selected significance threshold (ε): virtual actives (blue, 1 class), virtual inactives (red, 0 class), both (purple, 1 or 0 class) and null (gray, no class assignment). Relative


fractions of each set are represented by a pie chart. b, The A2AR test set molecules were divided into four prediction sets depending on the significance level. The optimal significance


(εopt) corresponds to the value at which the maximal number of compounds has been assigned to a single-label set (meaning, either virtual actives or inactives), referred to as maximal


efficiency. c, The error rate (overall, virtual actives and virtual inactives) obtained for predictions of the A2AR benchmarking set compounds with respect to the significance threshold


(calibration plot). There was a close agreement between the significance value and the prediction error rate. d, The optimal significance level improved if the classification models were


trained on larger datasets. e, At optimal efficiency, the sensitivity values improved with increasing size of the training set. f, At optimal efficiency, the precision values improved with


increasing training set size. In d, e and f, three independent calculations (training and prediction) were performed for the eight targets (A2AR, AmpC, 5’-NT, D2R, KEAP1, Mpro, OGG1, SORT1;


described in Supplementary Section 1) and error bars correspond to the standard error of the mean.


Source data

Full size image


As the performance of the models stabilized at a training size of 1 million molecules, this size was established as the standard for the training of new models. Conformal predictors composed


of CatBoost classifiers trained on Morgan2 fingerprints achieved the best average precision and had comparable or slightly better significance and sensitivity values compared with other


combinations. In addition, this configuration required the least computational resources, both in the training of the classifier, predictions for the test set and storage of molecular


descriptors (Supplementary Table 9). Hyperparameters (class imbalance and number of aggregated models), robustness (influence of noise and scrambling of the training data), target dependency


and the exchangeability criterion were also investigated. These results are provided in detail in Supplementary Figs. 6–12.

Optimized workflow for ultralarge chemical libraries


To optimize the performance of the workflow for ultralarge databases, we conducted further analyses of datasets containing docking scores for 235 million compounds from the ZINC15 library32,


focusing on two benchmarking proteins (A2AR and D2R). For each target, a conformal predictor composed of five independent CatBoost classifiers was trained on 1 million compounds (Morgan2


representation), followed by predictions for the remainder of the library. As the docking scores were available for all the compounds in the library, efficient strategies to identify the


top-scoring molecules could be established.


In the CP framework, the selected significance level determines the size of the predicted virtual active set, which is the library that will be docked to the target. The significance level


was first set to achieve the maximal efficiency (A2AR εopt = 0.12 and D2R εopt = 0.08), and close to all compounds received a single label (>98% for both targets). CP reduced the ultralarge


library from 234 million to 25 million and 19 million compounds for A2AR and D2R, respectively, with high sensitivity values (0.87 and 0.88, respectively; Fig. 3a). The workflow would hence


be able to identify close to 90% of the virtual actives by docking only ~10% of the ultralarge library and the CP framework guaranteed that the percentage of incorrectly classified compounds


did not exceed 12% and 8% respectively. For libraries of this size, molecular docking screens of the entire predicted virtual active set would be viable. However, further reduction of the


database would be required to apply our workflow to multi-billion-scale libraries. In these cases, docking calculations for even a small percentage of the library would be computationally


demanding. In theory, decreasing the significance level should lead to a reduction of the virtual active set and enrich predictions in which the conformal predictor has the highest


confidence. This approach was evaluated by gradually reducing the significance level and assessing how the distribution of docking scores in the virtual active set was influenced. As


anticipated, lowering the significance level did reduce the virtual active set size (Fig. 3a) and led to substantial shifts of the docking score distribution toward better energies for both


protein targets (Fig. 3b). At the lowest evaluated significance level (0.01), the database was reduced to 3.0 million and 2.6 million molecules for A2AR and D2R, respectively, and the


largest shifts in docking score distributions were obtained. For example, the most populated bin in the docking score distribution for the training set was −23.8 kcal mol−1 for D2R, which


was improved to −47.7 kcal mol−1 and −50.9 kcal mol−1 for significance levels of 0.08 (D2R εopt) and 0.01, respectively. At the strictest significance level (0.01), 80% and 64% of the 10,000


top-scoring A2AR and D2R molecules (corresponding to 0.004% of the chemical library) could still be identified. These results showed that the significance level can be tuned to achieve


substantial database reduction and retain most of the very top-scoring candidates for the subsequent docking step.

Fig. 3: Machine learning performance for ultralarge docking screening


data.


Five independent CatBoost classifiers were trained on 1 million Morgan2 fingerprints from docking screens of 235 million molecules against A2AR and D2R. a, The size of the predicted active


class decreases with more stringent significance values. b, Normalized frequency distributions of DOCK scores from the ultralarge docking screen. The score distribution of the training set


is shown in gray. In color (red to blue), score distributions of molecules machine learning (ML) predicted to be active at a given (increasingly stringent) significance threshold (ε). The


pie chart represents different significance values. c, Molecules in the test set were sorted based on the quality of information (P1 − P0). The percentage recall of the 10,000 best-scoring


molecules is shown as a function of the percentage evaluated compounds in the test set. d, Two-dimensional unsupervised UMAP projection illustrating the chemical relationship in


high-dimensional feature space. Molecules prioritized by the machine learning models are more similar to training set actives than random molecules from a large-scale docking (LSD) library.


e, Fraction of machine learning-prioritized molecules that have a Tanimoto coefficient higher than a specific threshold (0.2 to 0.9, color bar) with a molecule from the training set actives.


Molecules in which the predictor is most confident are more similar to actives from the training set. f, A2AR ligands from the ChEMBL database were classified into four distinct sets based


on their P values and a selected significance threshold (ε): virtual actives (blue, 1 class), virtual inactives (red, 0 class), both (purple, 1 or 0 class) and null (gray, no class


assignment). The percentage recall is represented by a pie chart. In a, c and e, three independent calculations (training and prediction) were performed, and error bars correspond to the


standard error of the mean.


Source data

Full size image


An alternative approach to reduce the size of the set to evaluate by molecular docking is to sort the compounds based on the difference between the P1 and P0 values (the quality of


information, P1 − P0). This metric can be used to prioritize subsets in which the predictor has the highest confidence (Fig. 3c) and correlated with docking ranks (Supplementary Fig. 13).


The enrichment of the top-scoring 10,000 molecules from the A2AR and D2R screens was assessed based on prioritizing the compounds using the quality of information. Remarkably, the workflow


identified more than 90% of the very top-scoring molecules after only 3% (A2AR) and 5% (D2R) of the 234 million remaining compounds had been evaluated (Fig. 3c). Notably, reproducible recall


values were obtained by independently generated conformal predictors, demonstrating that a random selection of training set will lead to similar selection of molecules. Data-dimensionality


reduction (Uniform Manifold Approximation and Projection, UMAP) of the Morgan2 fingerprints indicated that these prioritized molecules bear structural similarity to the actives present in


the training set (Fig. 3d). This observation was also supported by analysis of Tanimoto similarity. The molecules in which the predictor had higher confidence generally showed greater


structural similarity to actives from the training set (Fig. 3e). Using the quality of information to reduce the docked set of compounds hence had a similar effect to decreasing the


significance level, and these two techniques can be combined in prospective screens of multi-billion-scale libraries. To assess whether the use of conformal predictors leads to a reduction


in structural diversity among prioritized molecules compared with large-scale docking screens, we analyzed the 1% top-ranked molecules from both approaches for D2R. Although a smaller


fraction of the 1% top-ranked compounds prioritized by the conformal predictor had unique Bemis–Murcko scaffolds (13% compared with 23% from docking), a pairwise Tanimoto coefficient


analysis demonstrated that the decorated versions of these scaffolds were not significantly less diverse than those identified by molecular docking alone (Supplementary Fig. 14).


Collectively, these results demonstrated that top-scoring compounds could be identified using the conformal predictor. To assess its ability to find experimentally confirmed actives, known


A2AR and D2R ligands from the ChEMBL database33 were evaluated. Models trained only on docking data correctly classified 92% and 86% of these ligands as virtual actives. This highlights the


importance of benchmarking against known actives to validate workflows before conducting prospective virtual screens (Fig. 3f).

Prospective virtual screen of a multi-billion-scale library


A primary goal in the development of the workflow was that the machine learning step must be able to reduce a multi-billion-scale database to a few million promising compounds, which was


evaluated for A2AR and D2R. Docking of the training set, training of the conformal predictor and predictions for 3.5 billion compounds for one target could be performed in approximately


2,500 core-hours. The significance level was set to 0.005, resulting in 25 million and 24 million predicted virtual actives for A2AR and D2R, respectively. Of these, 5 million compounds per


target were prioritized for docking calculations based on the quality of information (corresponding to a 700-fold reduction of the library), which required 10,344 core-hours per target.


Compared with explicit docking of the 3.5 billion compounds, the workflow hence achieved a 568-fold reduction of compute cost. For both targets, the docking score distribution of the 5 


million prioritized compounds was substantially shifted toward more favorable energies. For example, the most populated bin in the docking score distribution of the training set was −25.1 


kcal mol−1 for D2R, which was shifted to −51.6 kcal mol−1 for the predicted virtual actives (Fig. 4a). A large fraction of the predicted compounds (49%) had a docking score better than the


energy threshold (−49.7 kcal mol−1) used for labeling of the training set, corresponding to a 49-fold enrichment of virtual actives. Similar docking energy distributions were obtained when


only 1 million predicted virtual actives were selected for molecular docking, demonstrating that users can control the extent of database reduction and even achieve up to a 3,500-fold


decrease in library size.

Fig. 4: Prospective virtual screen of a multi-billion-scale library against D2R.


The machine learning-accelerated workflow was used to predict ligands of the D2R in a database of 3.5 billion compounds. a, Normalized frequency distributions of D2R docking scores. The


docking score distribution of the training set is shown in gray. The docking score distribution of 5 million molecules selected based on the P value difference (P1 − P0) is shown in blue.


Vertical blue lines indicate the docking scores of the experimentally verified ligands, compounds 1 and 2. b, Chemical structures and predicted binding modes of experimentally verified


ligands (compounds 1 and 2, represented as gold sticks) of D2R (represented as a blue cartoon with side chains of key residues as sticks). Key hydrogen bonds (ionic interaction) are


indicated by orange dashed lines.


Source data

Full size image


To assess whether ligands could be discovered using the workflow, we selected 31 top-ranked compounds from the D2R screen of 3.5 billion compounds and evaluated these in a radioligand


binding assay at a concentration of 10 μM (Supplementary Table 10 and Supplementary Data). Of these, compounds 1 and 2 showed significant radioligand displacement and affinity values (Ki)


values were determined for these D2R ligands (Ki = 3.0 μM and Ki = 3.8 μM, respectively; Fig. 4b, Supplementary Table 11 and Supplementary Fig. 15). To further characterize compounds 1 and


2, we performed a functional assay quantifying D2R-mediated changes in intracellular cAMP. Compounds 1 and 2 were full agonists of the D2R with potency values (EC50) values of 10 μM and 14 


μM, respectively (maximal effect, Emax = 99% and Emax = 100%, relative to the maximal effect of dopamine; Supplementary Fig. 16). These results demonstrate that our protocol enables


identification of starting points for drug discovery by docking only a small subset of compounds from a multi-billion-scale library.

Machine learning-guided design of polypharmacology


One of the potential advantages of screening multi-billion-scale compound databases is that improved coverage of chemical space could enable discovery of ligands with complex properties,


which may be difficult to find in smaller libraries containing only a few million molecules. One potential application is to design compounds with activity at multiple targets that are


relevant for the same disease, which could lead to synergistic therapeutic effects. For example, the treatment of many central nervous system disorders requires the modulation of multiple


targets (polypharmacology)34. Parkinson’s disease is a neurological disorder that can be treated by modulating the activity of D2R and A2AR. However, the identification of dual-target


ligands of A2AR and D2R is challenging because of the lack of similarity between the binding sites of the receptors35. To assess whether a machine learning approach could facilitate the


search for dual-target ligands, A2AR and D2R were prepared for docking calculations. As the treatment of Parkinson’s disease necessitates activation of D2R through agonist binding, we


generated a model of the active receptor state using homology modeling based on an agonist-bound cryogenic electron microscopy (cryo-EM) structure of the D3 subtype (Protein Data Bank (PDB)


accession code: 7CMV), which is described in detail in ‘Homology modeling of active-state D2 dopamine receptor’ in Methods (Fig. 5a). To identify compounds blocking A2AR signaling, we


prepared an antagonist-bound crystal structure (PDB accession code: 8GNE) for molecular docking. In this structure, the salt bridge formed by the residues His264 and Glu169 has been


disrupted and could potentially accommodate the ammonium cation characteristic of dopaminergic ligands. After optimizing the receptor models and docking parameters, strong enrichment of


known ligands was obtained for both targets (Supplementary Fig. 17). A new training set of 1 million random molecules from a lead-like library (see ‘Docking library preparation’ in Methods)


containing more than 3 billion compounds was then docked to both receptors. As anticipated, the resulting docking score distributions illustrate that compounds that bind to both targets


would be difficult to identify in small libraries. The overlap between the top-ranked compounds (top 1%) was less than 0.02% (Fig. 5b). For both targets, conformal predictors were trained


and used to predict the remaining billions of compounds in the lead-like library. To enhance the likelihood of obtaining favorable docking scores across both targets, the predicted molecules


were ranked according to the sum of their quality of information (PA2AR,1 – PA2AR,0 + PD2R,1 – PD2R,0: in which Pφ,1 and Pφ,0 correspond to the confidences that a specific molecule belongs


to the active and inactive class, respectively, for target φ) and the 5 million top-ranked compounds were then prioritized for explicit docking. The docking score distributions of the


machine learnin


g-prioritized compounds were substantially shifted toward better energies for both targets (17- and 34-fold for A2AR and D2R, respectively; Fig. 5c), leading to an enrichment of dual virtual


actives. More than 3.8% of the 5 million docked molecules had energies better than both score thresholds (top 1%), corresponding to a 191-fold enrichment of dual virtual actives compared


with docking of a random library. The molecules were subsequently sorted according to the sum of their individual docking ranks (rankA2AR + rankD2R), and the top-ranked molecules were then


visually inspected for their complementarity with the respective binding sites. Encouragingly, the molecules formed hydrogen bonds with residues known to be important for ligand binding to


the A2AR (Asn2536.55) and D2R (Asp1143.32) (Ballesteros-Weinstein residue numbering scheme36 denoted as superscripts). A set of 45 compounds was prioritized for make-on-demand synthesis and


successfully obtained within 4–5 weeks. The compounds were first tested in an A2AR radioligand binding assay at a concentration of 20 μM (Supplementary Table 12 and Supplementary Data). Of


these, the binding affinities of four compounds (4–6) that showed significant radioligand displacement were determined, leading to Ki values ranging from 1.3 μM to 20 μM (Supplementary Table


13). Compounds 4–6 were subsequently tested at D2R at a concentration of 20 μM, which showed that compound 5 also binds to this target (Ki,D2 = 14 μM, Ki,A2A ≈ 20 μM; Fig. 5d, Supplementary


Table 14 and Supplementary Fig. 18). The dual-target compound 5 was predicted to form hydrogen bonds with orthosteric binding site residues Asn2536.55 and Asp1143.32 for A2AR and D2R,


respectively (Fig. 5e,f). These observations indicate that our virtual screening workflow can be applied to identify starting points for the development of drugs with multi-target profiles


tailored for treatment of complex diseases.

Fig. 5: Identification of a dual-target ligand by screening of multi-billion-scale libraries.


a, Homology models (gray lines) of the active D2R were constructed using a cryo-EM structure of the D3 dopamine receptor complexed with the Gi protein (PDB accession code: 7CMV) as a


template. The final model used for prospective screening is depicted in blue. b, Two-dimensional density-normalized scatterplot of docking scores of 1 million random molecules present in the


training set. Normalized frequency distribution of molecules docked against D2R (blue) and A2AR (red). Dashed lines represent the score threshold (top 1%) that divides the datasets into


virtual actives and inactives. c, Two-dimensional density-normalized scatterplot of docking scores of 5 million molecules prioritized by machine learning models. For both A2AR and D2R, 5


independent CatBoost classifiers were trained on 1 million random molecules from a docking screen. Normalized frequency distribution of molecules docked against D2R (blue) and A2AR (red).


The score distributions of the training set compounds are shown in gray. The docking scores of compound 5 are shown as a vertical blue line for D2R and a horizontal red line for A2AR. d, The


chemical structure of compound 5, a A2AR–D2R dual-target ligand, is shown. For both targets, the docking scores (in kcal mol−1), P values (1, confidence in being a virtual active; 0,


confidence in being a virtual inactive) and Ki values (µM, determined from 3 independent experiments) are shown. e, Predicted binding mode of compound 5 in the A2AR orthosteric site. f,


Predicted binding mode of compound 5 in the D2R orthosteric site.


Source data

Full size imageDiscussion


The rapid expansion of commercial chemical libraries has sparked the development of diverse structure-based virtual screening methods aiming to reduce the computational cost of exploring


chemical space. Several of these approaches incorporated machine learning techniques to efficiently evaluate libraries ranging from millions to billions of compounds13,14,15,37. Compared


with previously developed methods tackling ultralarge libraries, our approach is based on CP, a robust framework that enables control over the error rate of the predictions16. We extend the


application of conformal predictors to multi-billion-scale libraries by leveraging class-specific confidence levels to identify top-scoring molecules. Our approach achieves equivalent or


better recall values and database reductions as other workflows, without the need for resource-intensive active learning. Our comparison of substructure-based fingerprints with more recently


developed data-driven descriptors demonstrates that traditional fingerprints suffice for this type of application, in agreement with a recently published study37. Integrating these key


advantages led to a workflow capable of traversing vast chemical libraries but requiring only modest computational cost associated with the training and prediction steps. Notably, our


results demonstrate that the effectiveness of machine learning as a proxy for molecular docking is target dependent, underscoring the need for large and diverse benchmarking sets to achieve


generalizable methods. To catalyze further development of methods in this field, we share both our virtual screening workflow, which is compatible with any docking software, and extensive


benchmarking sets for eight diverse protein targets.


Other methods to explore large chemical libraries via molecular docking use a hierarchical approach38,39, which is fundamentally different from the machine learning techniques. For example,


the V-SYNTHES method is based on the same principles as fragment-based ligand discovery. First, a small set of fragment-sized compounds called synthons, which represent substructures of


larger compounds in the multi-billion-scale library, is docked to the binding site. In a second step, larger molecules that embed top-scoring synthons are docked to identify compounds with


improved docking scores. Currently, hierarchical approaches do not require the use of machine learning because the number of synthons and their corresponding elaborations are small enough to


allow for explicit molecular docking. Furthermore, as the synthon library is not exchangeable with the database of fully enumerated molecules, a conformal predictor trained on synthon


docking results is unlikely to accurately predict top-scoring compounds. As the libraries continue to grow, a comparison of the effectiveness of the hierarchical and machine learning


approaches could reveal how these complementary techniques could be combined to further accelerate virtual screens.


A unique aspect of our work is the combination of a powerful machine-learning method with experimental evaluation of predictions, which reveals the potential and limitations of this


approach. Our first screen showed that agonists of D2R, an important drug target for neuropsychiatric and neurodegenerative diseases, could be identified by the workflow. Interestingly, the


hit rate from the docking screen was comparable to previous screens using smaller libraries40. Although screens of large libraries has yielded exceptionally potent compounds, these results


suggest that further progress may be partially limited by the accuracy of molecular docking. As recent studies have shown, flaws in docking scoring functions may lead to an accumulation of


false positives among the top-scoring compounds in large libraries10,41. By reducing the number of compounds requiring explicit docking, our approach enables resources to be reallocated


toward accurate re-scoring of top-ranked compounds using more advanced methods. Despite these challenges, our second prospective screen for multi-target ligands also illustrates the


tremendous opportunities provided by access to larger libraries, which will soon reach the trillion scale. Encouragingly, prospective screens against A2AR and D2R identified a dual-target


ligand, which represents a promising starting point for the development of drugs for the treatment of Parkinson’s disease34. Hence, expanding the sampling across broader regions of chemical


space can enable the discovery of ligands with complex properties that may not be found in smaller libraries. A further extension of our virtual screening approach could involve the


multi-objective design of ligands with specific selectivity, physiochemical and pharmacokinetic properties by integrating conformal predictors trained for different tasks. Collectively, our


results demonstrate how docking screens guided by conformal predictors can accelerate the development of small-molecule therapeutics.

MethodsDocking library preparation


The Enamine REAL Database (November 2019 version, 12.3 billion compounds) was reduced to a rule-of-four chemical subspace by excluding compounds with a molecular weight over 400 Da and cLogP


over 4 using RDKit42. The rule-of-four subspace contained 3,541,746,925 compounds. A random sample containing 15 million compounds (0.4%) from this library was obtained after shuffling the


Simplified Molecular Input Line Entry System (SMILES) with Terashuf43. Molecules were prepared for docking using DOCK3.7 standard protocols44. CXCalc (ChemAxon’s Marvin package Marvin


18.10.0) was used to calculate predominant protomers at relevant pH levels (6.9, 7.4 and 7.9). Conformational ensembles were generated with OMEGA (OpenEye, version 2020.2) and were capped at


400 conformations and an inter-conformer root mean squared deviation (r.m.s.d.) diversity threshold of 0.25 Å. In-house preparation of these rule-of-four molecules approximately took 18 


CPU-seconds per compound. Preparation of 1 million molecules present in our training set for docking was completed in approximately 5,000 CPU-hours. In the prospective screens for


dual-target compounds, the Enamine REAL Database (November 2022 version, 33.5 billion compounds) was reduced to a subspace of 3,137,276,984 lead-like molecules (ZINC Database definition: 20 


≤ heavy atom count ≤ 25 and −5 ≤ cLogP ≤ 3.5) using a similar procedure to that described above. The conformational ensembles of lead-like molecules were capped at 200 conformations and an


inter-conformer r.m.s.d. diversity threshold of 0.5 Å. A random sample of 1 million rule-of-four WuXi GalaXi molecules (March 2024 version) was prepared using the same protocol as described


above. The WuXi GalaXi rule-of-four space contained 1,371,598,090 compounds.

Molecular descriptors


Canonical SMILES were used to generate three different molecular descriptors as input data for the machine learning classifiers. Morgan2 descriptors were generated using RDKit29,42.


Continuous data-driven descriptors were generated using the CDDD Python library31. The RoBERTa model generated descriptors directly from the SMILES. We used a pretrained RoBERTa model45 to


generate the internal encoded representation of each molecule using the simpletransformers Python library46.

Preparation of proteins for docking


Experimental structures of the selected protein targets were extracted from the PDB. Details regarding preparation of crystal structures for molecular docking are provided in Supplementary


Table 1. Unless stated otherwise, water molecules and other solutes were removed from the experimental structures. The N- and C-termini were capped with acetyl and methyl groups,


respectively, using PyMOL47. The atoms of the bound ligands were used to generate matching spheres in the binding site. DOCK3.7 uses a flexible ligand algorithm that superimposes rigid


segments of a molecule’s pre-calculated conformational ensemble on top of the matching spheres44. Histidine protonation states were assigned manually after visual inspection of the


hydrogen-bonding network. The remainder of the protein structure was protonated by REDUCE48 and assigned AMBER49 united atom charges. The dipole moments of key residues involved in


recognition of the bound ligands were increased to favor interactions with these (Supplementary Fig. 2). This technique is common practice for users of DOCK3.7 to improve docking performance


and has been used in previous virtual screens50. The atoms of the co-crystallized ligands were used to create two sets of sphere layers on the protein surface (referred to as thin spheres).


One set of thin spheres described the low dielectric region of the binding site. A second set of thin spheres was used to calibrate ligand desolvation penalties. Scoring grids were


pre-calculated using QNIFFT51 for Poisson–Boltzmann electrostatic potentials, SOLVMAP52 for ligand desolvation, and CHEMGRID53 for AMBER van der Waals potentials. For each protein target,


known ligands were retrieved from the ChEMBL database or previous studies, followed by generation of property-matched decoys according to the procedure described in ref. 54. Actives and


decoys control sets were docked to the protein structures to evaluate the influence of different docking grid parameters (the radii of electrostatic and desolvation thin spheres). Finally,


ligand enrichment and predicted binding poses were used to select the optimal grid parameters50.

Homology modeling of active-state D2 dopamine receptor


Alignment of the human D2 and D3 dopamine receptor sequences was performed using the GPCRdb (https://gpcrdb.org/)55. One hundred homology models of the activated D2R bound to agonist ligand


(PD128907) were constructed using MODELLER56 version 10.2 based on a cryo-EM structure of the D3 dopamine receptor complexed with the Gi protein57 (PDB accession code: 7CMV) as a template.


Residues 5 to 11 were restrained to form an alpha helix and the residue pairs Cys77–Cys152 and Cys235–Cys237 were set to form disulfide bridges. Molecular docking grids of homology models


were constructed using the same protocols as for the antagonist-bound D2 dopamine receptor crystal structure (PDB accession code 6CM4; Supplementary Table 1). The resulting docking grids


were then evaluated for their ability to reproduce the modeled binding mode of PD128907 in the corresponding homology models and enrich 25 known D2R agonists among a set of


in-house-generated decoys54. A final receptor model was selected based on agonist enrichment calculations and the presence of the conserved salt bridge interaction with Asp1143.32 in docking


screens.

Molecular docking calculations


The orientational sampling parameter was set to 5,000 matches for both the rule-of-four benchmarking set and molecules selected based on machine learning predictions. Molecules in the


ultralarge docking screens (235 million lead-like molecules from the ZINC15 database32) were docked using 1,000 matches. During the generation of the benchmarking dataset, for each docked


compound, 18,652 orientations were calculated on average, and for each orientation, an average of 1,654 conformations were sampled. The best-scoring pose of each ligand was optimized using a


simplex rigid-body minimizer.

Compound selection from docking screens


To bias the multi-billion virtual screen targeting the D2R toward identification of novel chemotypes, we selected compounds that were dissimilar to known dopamine receptor ligands (>11,000


ChEMBL compounds). The molecular diversity was increased by clustering the 100,000 top-scoring docked compounds (0.003% of the entire library) by topological similarity. The best-scoring


molecule from each cluster was visually inspected for its complementarity with the binding site and a set of 31 compounds were selected from the 1,000 top-ranking clusters (Supplementary


Table 10 and Supplementary Data). The make-on-demand compounds were available in in the Enamine REAL Database and were successfully synthesized in 4–5 weeks.

Machine learning-accelerated


virtual screening pipeline


Our workflow for combining machine learning and molecular docking (Fig. 1) consists of the following consecutive steps, which are described in detail in this section and Supplementary Fig.


1.

Step 1 Preparation and docking of the training set


A set of randomly selected molecules from an ultralarge chemical library is docked to the target protein structure (Fig. 1a–d). We recommend a training set of 1 million molecules in virtual


screens of multi-billion-scale libraries.

Step 2 Generation and labeling of the training set


A docking score threshold (Fig. 1e) is selected to label each compound in the training set as either virtual active (better score than the selected threshold) or inactive (equal or worse


score than the selected threshold). As our CP approach is based on aggregating predictions made by several classifiers, multiple independent training sets are generated. Our recommendation


is to label the top-scoring 1% of the training set as virtual active and generate 5 independent training sets.

Step 3 Molecule featurization and training of the classifier


Molecular descriptors of each molecule in the training set are generated as input for the classifier. Each of the training sets is used to train an independent classification model to


distinguish virtual actives from inactives (Fig. 1f,g).

Step 4 Conformal prediction for the ultralarge library


The trained classification models are used to evaluate compounds from the ultralarge chemical library (Fig. 1h). Mondrian conformal predictors provide class-specific confidence levels, which


allow compounds to be categorized into one of the following four sets based on a selected significance level (ε): virtual active, virtual inactive, both (virtual active or inactive) and


null (no class assignment). The significance level can be tuned to control the size of the virtual active set, which is predicted to contain compounds with a docking score better than the


selected threshold.

Step 5 Post-processing and compound selection


The database reduction level achieved by the workflow is target dependent, and additional post-processing steps can be applied to identify the most promising compounds. The compounds


assigned to the virtual active set are ranked by sorting them based on the quality of information (meaning, prioritizing the predictions in which the classification model has the highest


confidence) and a subset of these are docked to the target. Top-scoring molecules are clustered by chemical similarity and representative compounds are visually inspected (Fig. 1i,j),


followed by synthesis and experimental evaluation of selected compounds (Fig. 1k–l). We recommend docking a set of 1 million to 5 million molecules selected based on the quality of


information.

Training and evaluation of machine learning classifiers


CP is a QSAR method in which an ensemble of models is used to classify molecules present in an objective set (Supplementary Fig. 1). The docking scores of the datasets were used to label


molecules as virtual actives (top 1%) and virtual inactives (bottom 99%), unless stated otherwise. The scikit-learn 0.24.2 package58 was used to perform a stratified split of the datasets


into proper training sets (80% of training set), calibration sets (20% of training set) and test sets. The ratio between virtual actives and inactives was maintained in all sets. This


procedure was repeated using different random seeds to obtain independent sets. The CatBoost 0.26 Python package was used for building and training the corresponding classifiers. The PyTorch


1.7.1 package59 combined with the RangerLars optimizer60 was used for training the deep neural networks. The RoBERTa classifier was implemented from the simpletransformers 0.61.6 package46.


The Skorch 0.10.0 package61 was used to connect the scikit-learn and PyTorch frameworks. A detailed description and analyses of the hyperparameters used in each classifier are provided in


Supplementary Table 2 and Supplementary Figs. 3–5. The compounds in the test set were assigned normalized P values (confidence that the sample belongs to 1 class, P1; and 0 class, P0) by


each individual classification model and its corresponding calibration set. The resulting sets of P1 and P0 values were aggregated into a single pair of P values by taking the respective


medians of predictions made by individual models (Supplementary Fig. 1). On the basis of the aggregated P values and the selected significance level (ε), the Mondrian CP framework was used


to divide the compounds into virtual active, virtual inactive, both (meaning either virtual active or inactive) and null (no class assignment) sets. Several metrics were used to assess the


performance of the conformal predictors. The sensitivity was defined as:

$${{\mathrm{Sensitivity}}}=\frac{{{\mathrm{TP}}}}{{{\mathrm{AP}}}}$$ (1)


where TP (true positives) were true active molecules correctly classified by the CP framework (that is, in the predicted virtual active and both sets). AP (all positives) were all molecules


with a score better than the threshold used to define the virtual actives. The precision was defined as:

$${{\mathrm{Precision}}}=\frac{{{\mathrm{TP}}}}{{{\mathrm{TP}}}+{{\mathrm{FP}}}}$$


(2)


where FP (false positives) were true inactive molecules incorrectly classified by the CP framework (that is, in the predicted virtual active and null sets). The efficiency was defined


as:

$${{\mathrm{Efficiency}}}=\frac{\left\{1\right\}+\left\{0\right\}}{{{\mathrm{AP}}}+{{\mathrm{AN}}}}$$ (3)


where {1} are the predicted virtual actives and {0} the predicted virtual inactives. AN (all negatives) were all molecules with a score worse than or equal to the threshold used to define


the virtual actives. The overall error rate was defined as:

$${{\mathrm{Overall}}\,


{\mathrm{error}}\,{\mathrm{rate}}}=\frac{{{\mathrm{FP}}}+{{\mathrm{FN}}}}{{{\mathrm{AP}}}+{{\mathrm{AN}}}}$$ (4)


where FN (false negatives) are true virtual active molecules incorrectly classified by the CP framework (that is, in the predicted virtual inactives and null sets). The error rate for the


virtual actives was defined as:

$${{\mathrm{Actives}}\, {\mathrm{error}}\, {\mathrm{rate}}}=\frac{{{\mathrm{FN}}}}{{{\mathrm{AP}}}}$$ (5)


The error rate for the virtual inactives was defined as:

$${{\mathrm{Inactives}}\, {\mathrm{error}}\, {\mathrm{rate}}}=\frac{{{\mathrm{FP}}}}{{{\mathrm{AN}}}}$$ (6) Computational costs


and hardware specifications


To train a conformal predictor on 1 million Morgan2 fingerprints, approximately 4 GB of random access memory was required. Training and predicting were performed using 2x Intel Xeon Gold


6130 CPUs @ 2.10 GHz. The RoBERTa models were trained on 12 cores of a single Nvidia Tesla T4 graphics processing unit with 1,844 GiB memory. The times (in seconds) required to train a


conformal predictor on 1 million molecules with different architectures and descriptors or predict 1 million molecules are reported in Supplementary Table 9.

Binding assays


Screening compounds (Supplementary Tables 10 and 12) were purchased from Enamine (compound purity >90%, which was confirmed by liquid chromatography−mass spectrometry and 1H NMR spectroscopy


for identified ligands; Supplementary Figs. 19–30). Human D2R competition binding experiments were carried out in polypropylene 96-well plates. Each well contained 20 µg of membranes from a


CHO-D2R #S20 cell line (protein concentration of 4,322 µg ml−1), 1.5 nM [3H]-spiperone (54.3 Ci mmol−1, 1 mCi ml−1, PerkinElmer NET1187250UC) and the compound studied. Non-specific binding


was determined in the presence of 10 µM sulpiride (Sigma Aldrich S8010). The reaction mixture (250 µl per well) was incubated at 25 °C for 120 min, after which 200 µl was transferred to a


GF/C 96-well plate (Millipore) pretreated with 0.5% PEI and treated with binding buffer (50 mM Tris-HCl, 1 mM EDTA, 5 mM MgCl2, 5 mM KCl, 120 mM NaCl, pH 7.4). The reaction mixture was


filtered and washed 4 times with 250 µl wash buffer (50 mM Tris-HCl, 0.9% NaCl, pH 7.4) before the addition of 30 μl Universol. The final measurement was performed in a microplate beta


scintillation counter (Microbeta Trilux, PerkinElmer). Human A2AR competition binding experiments were carried out in a multiscreen GF/C 96-well plate (Millipore) pretreated with binding


buffer (Tris-HCl 50 mM, EDTA 1 mM, MgCl2 10 mM, 2 U ml−1 adenosine deaminase, pH 7.4). Each well was incubated with 5 μg of membranes from the HeLa-A2A cell line and prepared in our


laboratory (lot A003/14-04-2019, protein concentration 2,058 μg ml−1), 1 nM [3H]-ZM241385 (50 Ci mmol−1, 1 mCi ml−1, ARC-ITISA 0884), and the compounds studied and standards. Non-specific


binding was determined in the presence of NECA 50 μM (Sigma E2387). The reaction mixture (total volume of 200 μl per well) was incubated at 25 °C for 30 min, then filtered and washed 4 times


with 250 μl wash buffer (Tris-HCl 50 mM, EDTA 1 mM, MgCl2 10 mM, pH 7.4), and measured in a microplate beta scintillation counter (Microbeta Trilux, PerkinElmer). Unless stated otherwise,


three independent experiments were carried out to calculate Ki values.

Functional assays


Human D2R activity was measured by determining the amount of cAMP produced. Human D2R functional experiments were carried out in a CHO-D2 #S20 cell line. Five-thousand cells were seeded in


30 μl of Optimem (Invitrogen 11058) + 500 μM IBMX (Sigma 17018) on a 96-well black-and-white isoplate (PerkinElmer 6005030). Test compounds and dopamine were added in their corresponding


wells and incubated for 10 min at 37 °C with gentle stirring (150 r.p.m.). Then, 10 µM forskolin (Sigma 17018) was added and incubated for 5 min at 37 °C with gentle stirring (150 r.p.m.).


Reagents from the kit (#CISBIO 62AM4PEC) were added and incubated for 1 h at room temperature with gentle stirring (90 r.p.m.) and protected from light. HTRF (excitation wavelength: 320 nm;


emission wavelengths: 620–665 nm) from each well was measured using a Tecan Infinite M1000 Pro. Two independent experiments were carried out to calculate EC50 and Emax values.

Statistics


and reproducibility


Compounds for which no molecular descriptors could be generated were excluded from the analyses. No statistical method was used to predetermine the sample size. Training and test sets were


generated by taking random samples with the determined size of the virtual libraries. Proper training and calibration sets were constructed by performing a stratified split on the parent


training sets, maintaining the ratio of samples belong to the minority and majority classes. Unless stated otherwise, three independent calculations (training and predictions) were performed


to derive statistics on model performance. Unless stated otherwise, three independent experiments were performed to derive statistics on the activities of the designed


compounds.

Reporting summary


Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability


The ZINC15 and Enamine REAL databases are available at https://zinc15.docking.org and https://enamine.net/compound-collections/real-compounds/real-database, respectively. The PDB accession


codes for the molecular docking calculations are 4EIY, 6DPT, 6XUE, 6CM4, 5FNU, 6W63, 6G3Y, 6X48, 8GNE and 7CMV. Associated source data are provided for Figs. 2–5. All compounds tested are


listed in the Supplementary Tables 10 and 12, and Supplementary Data. Chemical identities, purities (liquid chromatography−mass spectrometry), yields and spectroscopic analysis (1H NMR) for


active compounds are provided in the Supplementary figures. Large-scale docking datasets are deposited on Zenodo at https://doi.org/10.5281/zenodo.7953917 (ref. 62).

Code availability


The conformal predictor source code is freely available and can be found on the GitHub at https://github.com/carlssonlab/conformalpredictor. The original code is deposited on Zenodo at


https://doi.org/10.5281/zenodo.14709041 (ref. 63).


References Bohacek, R. S., McMartin, C. & Guida, W. C. The art and practice of structure-based drug design: a molecular modeling perspective. Med. Res. Rev. 16, 3–50 (1996).


Article  Google Scholar 


Irwin, J. J. et al. ZINC20—a free ultralarge-scale chemical database for ligand discovery. J. Chem. Inf. Model. 60, 6065–6073 (2020).


Article  Google Scholar 


Grygorenko, O. O. et al. Generating multibillion chemical space of readily accessible screening compounds. iScience 23, 101681 (2020).


Article  Google Scholar 


Bellmann, L., Klein, R. & Rarey, M. Calculating and optimizing physicochemical property distributions of large combinatorial fragment spaces. J. Chem. Inf. Model. 62, 2800–2810 (2022).


Article  Google Scholar 


Lyu, J. et al. Ultra-large library docking for discovering new chemotypes. Nature 566, 224–229 (2019).


Article  Google Scholar 


Luttens, A. et al. Ultralarge virtual screening identifies SARS-CoV-2 main protease inhibitors with broad-spectrum activity against coronaviruses. J. Am. Chem. Soc. 144, 2905–2920 (2022).


Article  Google Scholar 


Carlsson, J. & Luttens, A. Structure-based virtual screening of vast chemical space as a starting point for drug discovery. Curr. Opin. Struct. Biol. 87, 102829 (2024).


Article  Google Scholar 


Gorgulla, C. et al. An open-source drug discovery platform enables ultra-large virtual screens. Nature 580, 663–668 (2020).


Article  Google Scholar 


Fink, E. A. et al. Large library docking for novel SARS-CoV-2 main protease non-covalent and covalent inhibitors. Protein Sci. 32, e4712 (2023).


Article  Google Scholar 


Lyu, J., Irwin, J. J. & Shoichet, B. K. Modeling the expansion of virtual screening libraries. Nat. Chem. Biol. 19, 712–718 (2023).


Article  Google Scholar 


Muratov, E. N. et al. QSAR without borders. Chem. Soc. Rev. 49, 3525–3564 (2020).


Article  Google Scholar 


Zhang, J., Norinder, U. & Svensson, F. Deep learning-based conformal prediction of toxicity. J. Chem. Inf. Model. 61, 2648–2657 (2021). 2021.


Article  Google Scholar 


Yang, Y. et al. Efficient exploration of chemical space with docking and deep learning. J. Chem. Theory Comput. 17, 7106–7119 (2021).


Article  Google Scholar 


Gentile, F. et al. Deep docking: a deep learning platform for augmentation of structure based drug discovery. ACS Cent. Sci. 6, 939–949 (2020).


Article  Google Scholar 


Graff, D. E. et al. Self-focusing virtual screening with active design space pruning. J. Chem. Inf. Model. 62, 3854–3962 (2022).


Article  Google Scholar 


Vovk, V., Gammerman, A. & Shafer, G. Algorithmic Learning in a Random World (Springer Nature, 2005); https://doi.org/10.1007/b106715


Norinder, U., Carlsson, L., Boyer, S. & Eklund, M. Introducing conformal prediction in predictive modeling. A transparent and flexible alternative to applicability domain determination. J.


Chem. Inf. Model. 54, 1596–1603 (2014).


Article  Google Scholar 


Norinder, U. & Boyer, S. Binary classification of imbalanced datasets using conformal prediction. J. Mol. Graph. Model. 72, 256–265 (2017).


Article  Google Scholar 


Norinder, U., Spjuth, O. & Svensson, F. Synergy conformal prediction applied to large-scale bioactivity datasets and in federated learning. J. Cheminform. 13, 77 (2021).


Article  Google Scholar 


Svensson, F., Norinder, U. & Bender, A. Improving screening efficiency through iterative screening using docking and conformal prediction. J. Chem. Inf. Model. 57, 439–444 (2017).


Article  Google Scholar 


Ahmed, L. et al. Efficient iterative virtual screening with Apache Spark and conformal prediction. J. Cheminform. 10, 8 (2018).


Article  Google Scholar 


Lavecchia, A. Machine-learning approaches in drug discovery: methods and applications. Drug Discov. Today 20, 318–331 (2015).


Article  Google Scholar 


Sadybekov, A. V. & Katritch, V. Computational approaches streamlining drug discovery. Nature 616, 673–685 (2023).


Article  Google Scholar 


Vamathevan, J. et al. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. 18, 463–477 (2019).


Article  Google Scholar 


Hauser, A. S., Attwood, M. M., Rask-Andersen, M., Schiöth, H. B. & Gloriam, D. E. Trends in GPCR drug discovery: new agents, targets and indications. Nat. Rev. Drug Discov. 16, 829–842


(2017).


Article  Google Scholar 


Prokhorenkova, L. et al. CatBoost: unbiased boosting with categorical features. Preprint at https://arxiv.org/abs/1706.09516 (2013).


Chen, H. et al. The rise of deep learning in drug discovery. Drug Discov. Today 23, 1241–1250 (2018).


Article  Google Scholar 


Liu, Y. et al. RoBERTa: a robustly optimized BERT pretraining approach. Preprint at https://arxiv.org/abs/1907.11692 (2019).


Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).


Article  Google Scholar 


Riniker, S. & Landrum, G. A. Open-source platform to benchmark fingerprints for ligand-based virtual screening. J. Cheminform. 5, 26 (2013).


Article  Google Scholar 


Winter, R. et al. Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem. Sci. 10, 1692–1701 (2019).


Article  Google Scholar 


Sterling, T. & Irwin, J. J. ZINC 15—ligand discovery for everyone. J. Chem. Inf. Model. 55, 2324–2337 (2015).


Article  Google Scholar 


Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 40, 1100–1107 (2012).


Article  Google Scholar 


Roth, B. L., Sheffler D. J. & Kroeze W. K. Magic shotguns versus magic bullets: selectively non-selective drugs for mood disorders and schizophrenia. Nat. Rev. Drug Discov. 3, 353–359


(2004).


Article  Google Scholar 


Kampen, S. et al. Structure-guided design of G-protein-coupled receptor polypharmacology. Angew. Chem. Int. Ed. 60, 18022–18030 (2021).


Article  Google Scholar 


Ballesteros, J. A. & Weinstein, H. Integrated methods for the construction of three-dimensional models and computational probing of structure-function relations in G protein-coupled


receptors. Neurosci. Methods 25, 366–428 (1995).


Article  Google Scholar 


Marin, E. et al. Regression-based active learning for accessible acceleration of ultra-large library docking. J. Chem. Inf. Model. 64, 2612–2623 (2024).


Article  Google Scholar 


Sadybekov, A. A. et al. Synthon-based ligand discovery in virtual libraries of over 11 billion compounds. Nature 601, 452–459 (2022).


Article  Google Scholar 


Beroza, P. et al. Chemical space docking enables large-scale structure-based virtual screening to discover ROCK1 kinase inhibitors. Nat. Commun. 13, 6447 (2022).


Article  Google Scholar 


Ballante, F., Kooistra, A. J., Kampen, S., de Graaf, C. & Carlsson, J. Structure-based virtual screening for ligands of G protein-coupled receptors: what can molecular docking do for you?


Pharmacol. Rev. 73, 527–565 (2021).


Article  Google Scholar 


Wu, Y. et al. Identifying artifacts from large library docking. J. Med. Chem. 67, 16796–16806 (2024).


Article  Google Scholar 


RDKit Open-Source Cheminformatics Software (2025).


Salle, A. Terashuf: A fast external memory shuffling tool. GitHub https://github.com/alexandres/terashuf (2025).


Coleman, R. G. et al. Ligand pose and orientational sampling in molecular docking. PLoS ONE 8, e7599 (2019).


Google Scholar 


Chithrananda, S., Grand, G. & Ramsundar, B. ChemBERTa: large-scale self-supervised pretraining for molecular property prediction. Preprint at https://arxiv.org/abs/2010.09885 (2020).


Rajapakse, T. SimpleTransformers. GitHub https://github.com/ThilinaRajapakse/simpletransformers (2025).


Schrödinger, L. & DeLano, W. PyMOL (2020); https://www.pymol.org/pymol


Word, J. M., Lovell, S. C., Richardson, J. S. & Richardson, D. C. Asparagine and glutamine: using hydrogen atom contacts in the choice of side-chain amide orientation. J. Mol. Biol. 285,


1735–1747 (1999).


Article  Google Scholar 


Weiner, S. J. et al. A new force field for molecular mechanical simulation of nucleic acids and proteins. J. Am. Chem. Soc. 106, 765–784 (1984).


Article  Google Scholar 


Bender, B. J. et al. A practical guide to large-scale docking. Nat. Protoc. 16, 4799–4832 (2021).


Article  Google Scholar 


Gallagher, K. & Sharp, K. Electrostatic contributions to heat capacity changes of DNA–ligand binding. Biophys. J. 75, 769–776 (1998).


Article  Google Scholar 


Mysinger, M. M. & Shoichet, B. K. Rapid context-dependent ligand desolvation in molecular docking. J. Chem. Inf. Model. 50, 1561–1573 (2010).


Article  Google Scholar 


Meng, E. C., Shoichet, B. K. & Kuntz, I. D. Automated docking with grid-based energy evaluation. J. Comput. Chem. 13, 505–524 (1992).


Article  Google Scholar 


Mysinger, M. M., Carchia, M., Irwin, J. J. & Shoichet, B. K. Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking. J. Med. Chem. 55, 6582–6594


(2012).


Article  Google Scholar 


Pándy-Szekeres, G. et al. GPCRdb in 2023: state-specific structure models using AlphaFold2 and new ligand resources. Nucleic Acids Res. 51, 395–402 (2023).


Article  Google Scholar 


Webb, B. & Sali, A. Comparative protein structure modeling using MODELLER. Curr. Protoc. Protein Sci. 86, 2.9.1–2.9.37 (2016).


Article  Google Scholar 


Xu, P. et al. Structures of the human dopamine D3 receptor–Gi complexes. Mol. Cell 81, 1147–1159.e4 (2021).


Article  Google Scholar 


Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).


MathSciNet  Google Scholar 


Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32, (2019).


Grankin, M. RangerLars Optimizer. GitHub https://www.github.com/mgrankin/over9000 (2025).


Tietz, M., Fan, T. J., Nouri, D., & Bossan, B. skorch: A scikit-learn compatible neural network library that wraps PyTorch. Skorch Developers https://skorch.readthedocs.io (2017).


Luttens, A., Cabeza de Vaca, I., Sparring, L., Norinder, U. & Carlsson, J. Large-scale docking datasets for machine learning. Zenodo https://doi.org/10.5281/zenodo.7953917 (2023).


Luttens, A. Conformal predictor. Zenodo https://doi.org/10.5281/zenodo.14709041 (2025).


Download references

Acknowledgements


A.L. was supported by a postdoctoral scholarship from the Knut and Alice Wallenberg Foundation (KAW2022.0347). J.C. received funding from the European Research Council (ERC) under the


European Union’s Horizon 2020 research and innovation program (grant agreement 715052), the Swedish Cancer Society, the Swedish Research Council and the Olle Engkvist Foundation. This


research was partially supported by the project AI4Research at Uppsala University. I.C.d.V. was funded by a postdoctoral fellowship provided by the Sven och Lilly Lawski foundation. The


computations were enabled using resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) (partially funded by the Swedish Research Council through


grant agreement number 2022-06725) and the supercomputing resource Berzelius provided by the National Supercomputer Centre at Linköping University and the Knut and Alice Wallenberg


Foundation. J.B., A.L.M. and M.I.L. were funded by Agencia Estatal de Investigación (PID2020-119428RB-I00), Xunta de Galicia (ED431C 2022/20) and European Regional Development Fund (ERDF).


A.L., I.C.d.V. and J.C. thank OpenEye Scientific Software for the use of OEToolkits at no cost. We thank J. Zhang for providing the initial deep neural network code.

Funding


Open access funding provided by Uppsala University.


Author informationAuthors and Affiliations Science for Life Laboratory, Department of Cell and Molecular Biology, Uppsala University, BMC, Uppsala, Sweden


Andreas Luttens, Israel Cabeza de Vaca, Leonard Sparring, Nour Aldin Kahlous & Jens Carlsson


Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA


Andreas Luttens


Institute for Medical Engineering and Science and Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA


Andreas Luttens


Innopharma Drug Screening and Pharmacogenomics Platform, BioFarma research group, Center for Research in Molecular Medicine and Chronic Diseases (CiMUS), Department of Pharmacology, Pharmacy


and Pharmaceutical Technology, University of Santiago de Compostela, Santiago de Compostela, Spain


José Brea, Antón Leandro Martínez & María Isabel Loza


Health Research Institute of Santiago de Compostela, Santiago de Compostela, Spain


José Brea, Antón Leandro Martínez & María Isabel Loza


Enamine Ltd, Kyiv, Ukraine


Dmytro S. Radchenko & Yurii S. Moroz


Taras Shevchenko National University of Kyiv, Kyiv, Ukraine


Yurii S. Moroz


Chemspace LLC, Kyiv, Ukraine


Yurii S. Moroz


Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden


Ulf Norinder


AuthorsAndreas LuttensView author publications You can also search for this author inPubMed Google Scholar


Israel Cabeza de VacaView author publications You can also search for this author inPubMed Google Scholar


Leonard SparringView author publications You can also search for this author inPubMed Google Scholar


José BreaView author publications You can also search for this author inPubMed Google Scholar


Antón Leandro MartínezView author publications You can also search for this author inPubMed Google Scholar


Nour Aldin KahlousView author publications You can also search for this author inPubMed Google Scholar


Dmytro S. RadchenkoView author publications You can also search for this author inPubMed Google Scholar


Yurii S. MorozView author publications You can also search for this author inPubMed Google Scholar


María Isabel LozaView author publications You can also search for this author inPubMed Google Scholar


Ulf NorinderView author publications You can also search for this author inPubMed Google Scholar


Jens CarlssonView author publications You can also search for this author inPubMed Google Scholar

Contributions


A.L., U.N. and J.C. designed the study. A.L. performed the molecular docking, homology modeling and machine learning calculations and selected compounds under the supervision of J.C. and


U.N. A.L., I.C.d.V., L.S. and U.N. developed the protocol. A.L. and I.C.d.V. wrote the final version of the code. N.A.K. performed sequence alignments for homology modeling, constructed the


D2R homology models and provided the D2R agonist set. The binding and functional assays were performed by the USEF screening platform under the supervision of J.B., A.L.M. and M.I.L. D.S.R.


and Y.S.M. provided the Enamine REAL Database and analytical data for the synthesized compounds. U.N. provided support with critical evaluation of the machine learning protocol. A.L.,


I.C.d.V. and J.C. wrote the paper with contributions from the other authors.


Corresponding authors Correspondence to Andreas Luttens, María Isabel Loza, Ulf Norinder or Jens Carlsson.

Ethics declarations Competing interests


J.C. is a founder of DareMe Drug Discovery Consulting. D.S.R. and Y.S.M. are employed by Enamine Ltd. Y.S.M. serves as a scientific advisor to Chemspace LLC. The other authors declare no


competing interests.

Peer review Peer review information


Nature Computational Science thanks Mayukh Chakrabarti, Matthew O’Meara and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports


are available. Primary Handling Editor: Kaitlin McCardle, in collaboration with the Nature Computational Science team.

Additional information


Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary informationSupplementary Information


Supplementary Sections 1 and 2, Tables 1–14, Figs. 1–30, methods and References.

Reporting SummaryPeer Review FileSupplementary Data


Experimentally validated molecules. SMILES, identifiers and experimental activities.

Source dataSource Data Fig. 2


Source data for Fig. 2a–f.

Source Data Fig. 3


Source data for Fig. 3a–c, e and f.

Source Data Fig. 4


Source data for Fig. 4a.

Source Data Fig. 5


Source data for Fig. 5b and c.

Rights and permissions


Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or


format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or


other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in


the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the


copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


Reprints and permissions


About this articleCite this article Luttens, A., Cabeza de Vaca, I., Sparring, L. et al. Rapid traversal of vast chemical space using machine learning-guided docking screens. Nat Comput Sci


5, 301–312 (2025). https://doi.org/10.1038/s43588-025-00777-x


Download citation


Received: 05 June 2024


Accepted: 04 February 2025


Published: 13 March 2025


Issue Date: April 2025


DOI: https://doi.org/10.1038/s43588-025-00777-x


Share this article Anyone you share the following link with will be able to read this content:


Get shareable link Sorry, a shareable link is not currently available for this article.


Copy to clipboard Provided by the Springer Nature SharedIt content-sharing initiative