Lemedisco is a computational method for large-scale prediction & molecular interpretation of disease comorbidity

ABSTRACT To understand the origin of disease comorbidity and to identify the essential proteins and pathways underlying comorbid diseases, we developed LEMEDISCO (Large-ScalE Molecular

IntErpretation of DISease COmorbidity), an algorithm that predicts disease comorbidities from shared mode of action proteins predicted by the artificial intelligence-based MEDICASCY

algorithm. LEMEDISCO was applied to predict the occurrence of comorbid diseases for 3608 distinct diseases. Benchmarking shows that LEMEDISCO has much better comorbidity recall than the two

molecular methods XD-score (44.5% vs. 6.4%) and the SAB score (68.6% vs. 8.0%). Its performance is somewhat comparable to the phenotype method-based Symptom Similarity Score, 63.7% vs. 100%,

but LEMEDISCO works for far more cases and its large comorbidity recall is attributed to shared proteins that can help provide an understanding of the molecular mechanism(s) underlying

disease comorbidity. The LEMEDISCO web server is available for academic users at: http://sites.gatech.edu/cssb/LeMeDISCO. SIMILAR CONTENT BEING VIEWED BY OTHERS NETWORK MEDICINE FOR DISEASE

MODULE IDENTIFICATION AND DRUG REPURPOSING WITH THE NEDREX PLATFORM Article Open access 25 November 2021 A MODE OF ACTION PROTEIN BASED APPROACH THAT CHARACTERIZES THE RELATIONSHIPS AMONG

MOST MAJOR DISEASES Article Open access 20 March 2025 PHEVIR: AN ARTIFICIAL INTELLIGENCE ALGORITHM THAT PREDICTS THE MOLECULAR ROLE OF PATHOGENS IN COMPLEX HUMAN DISEASES Article Open access

03 December 2022 INTRODUCTION Of the total of 3,634,743 disease pairs involving 13,034 distinct diseases in clinical data from 13,039,018 individuals1, 78.8% involving all 13,034 diseases

have a larger than random probability that they co-occur in one individual. Disease comorbidity, the co-occurrence of distinct diseases in one individual, is an interesting medical

phenomenon, and it is important to understand their molecular origins. For example, rheumatoid arthritis, autoimmune thyroiditis, and insulin-dependent diabetes mellitus co-occur, but

rheumatoid arthritis and multiple sclerosis do not2. Previously, there have been several efforts to investigate the molecular features responsible for human disease

comorbidities3,4,5,6,7,8,9. Some studies focused on particular subsets of diseases4 or ethnic groups, while others investigated the entire human disease network5,6,7,8. For example, ref. 6

applied text mining to search the literature for disease-symptom associations. They then predicted the entire human disease–disease network based on a calculated symptom similarity score.

While this approach covers many human diseases, it relies on prior knowledge of symptomatic information; this limits its disease coverage and only explains one phenotype (disease) by another

phenotype (symptom). ref. 7 utilized known disease–gene associations from GWAS10 and OMIM combined with a protein–protein interaction network to identify connected disease–gene clusters or

modules. Another study also utilized known disease–gene associations and protein–protein interaction networks to characterize disease–disease relationships without requiring gene clusters8;

thus, its disease coverage is better than in ref. 7. These studies that used known disease–gene associations are limited by data availability. Indeed, only a small fraction of diseases have

known associated genes. For example, ref. 8 only covers 1022 of the 8043 diseases in the Disease Ontology database11, with just 6594 pairs of diseases having a non-zero number of shared

genes. Similarly, ref. 7 found that about 59% of 44,551 disease pairs do not share genes and their relationship cannot be resolved based on the shared gene hypothesis. The effect of possibly

missed proteins arising from both direct and indirect protein–protein interactions with known interacting proteins are accounted for by the network propagation method in the XD-score8 or by

the disease module and network distance of the SAB score7. However, those scores only marginally improve the recall rate of disease pairs that are clinically comorbid compared to that of

shared genes in their methods (<10% recall rate by both the XD-score and SAB score). To address these limitations of existing studies, we developed LEMEDISCO, which extends our recently

developed MEDICASCY machine learning approach12 for predicting disease indications and mode of action (MOA) proteins (as well as small molecule drug side effects and efficacy) to predict

disease comorbidities and the proteins and pathways responsible for their comorbidity. LEMEDISCO covers 6.5 million pairs of diseases compared to 97,666 pairs by the XD-score8, 44,551 pairs

by the SAB score7, and 133,107 pairs by the Symptom Similarity Score6. Assuming that the most enriched comorbid proteins are responsible for disease comorbidity, we determine the most

frequent comorbidity enriched MOA proteins. These proteins are then employed in pathway analysis13. As examples, we predict the comorbid diseases, comorbidity enriched MOA proteins, and

pathways associated with coronary artery disease (CAD) and ovarian cancer (OC). We note that recently machine learning (ML) methods have been successfully employed in numerous areas of

biology12,14,15,16. However, due to MLs “black box” nature, it is not easy to trace back the biological meaning of the predictions and the molecular origin(s) of disease comorbidity. Thus,

as in previous works6,7,8, we adopt an explicit score that provides a set of common proteins responsible for comorbidity predictions. RESULTS BENCHMARKING RESULTS OF LEMEDISCO To assess its

relative performance, we compared the results of LEMEDISCO to three other methods, the XD-score8, the SAB score7, and the Symptom Similarity Score6. The XD-score was calculated as described

in ref. 8: Using known disease–gene associations to create a vector representation of the disease by setting 1 for all associated genes and 0 for all others; then the vector was iteratively

updated based on the Random Walk with Restart (RWR) algorithm, with a restart probability of _p_ = 0.9 by using the STRING network database. Finally, the XD-score quantifying the relation of

two diseases is defined using the updated vectors of two diseases. NG is the number of shared genes between disease pairs8. The _S__AB_ score, a protein–protein network-based separation of

a disease pair calculated from known disease–gene associations is defined as SAB = <dAB>− (<dAA> +<dBB>)/2, where SAB compares the shortest distances between proteins

within each disease A & B7, <dAA> and <dBB> , to the shortest distances <dAB > between A-B protein pairs7. The Symptom Similarity Score was obtained by large scale text

mining of the literature for disease-symptom relations represented as a vector, with the similarity score defined as the cosine similarity of the respective vectors6. In this work, a

J-score for disease similarity is defined as the Jaccard index17 of two diseases (see Method for details). The disease–disease relations of benchmarking data from Medicare insurance

databases are quantified by their relative risk (RR) and φ-score (see Methods section for details)1. The relative risk RR is defined in Eq. 4a and is the probability that two diseases occur

in a single individual relative to random. The φ-score is the Pearson’s correlation for binary variables and is defined in Eq. 4b. Diseases in this work are represented by DOID numbers from

the Human Disease Ontology database11, and they are in clinical data usually denoted by ICD-9 or ICD-10 classifications18 or their Medical subject headings (MeSH) names19. Table 1 summarizes

the results. We define a true positive comorbidity pair when their clinic log(RR) >0, a predicted positive when XD-score >0, _S__AB_ score <0, or the Symptom Similarity Score

>0.1 and _q_ value <0.05 for our J-score. Recall is defined as (the number of true positives having score > cutoff or < cutoff for _S__AB_ score)/(total number of true

positives). We emphasize that in calculating recall, the cutoffs are suggested by the respective work as being either biologically meaningful7,8 or statistically significant6. In addition to

Pearson’s correlation coefficient (c.c.), recall and precision, the cutoff independent measures area under the receiver operating characteristic (AUROC) and the area under the

precision-recall curve (AUPRC) are also compared. Mapping the DOIDs to the ICD-9 ID classifications of ref. 1, excluding easy pairs when in the MEDICASCY library two diseases have

$\frac{{shared\; \#\; of\; efficious\; drugs}}{\sqrt{{\#\; disease}1{efficious\; drugs}\times {\#\; disease}2{efficious\; drugs}}} \; > \; 0.9$, we obtain 191,966 disease pairs for use

in LEMEDISCO benchmarking. All Pearson’s correlations of the J-score with the log(RR) score (c.c. = 0.116, _p_ value = 0) and φ-score (c.c. = 0.090, _p_ value = 0) are statistically

significant (_p_ value <0.05). The recall rate of J-score for this large set is 37.1%, and the AUROC of 0.528 is well better than random of 0.5. The Permute drug–protein test has an

average ± standard deviation from 100 runs 1958.6 ± 144.9 (54.3%) for diseases with non-zero MOA proteins. We note there are still significant correlations, though the absolute c.c. drops

from 0.116 to 0.050 (_p_ value = 0.0) for log(RR) and from 0.090 to 0.060 (_p_ value = 0.0) for the φ-score, and the recall drops from 37.1% of true relationships to 8.8% due to that the

number of diseases having correctly assigned MOA proteins drops to around half. All other measures are also worse. The _p_ values of the difference between LEMEDISCO and this permutation

test are significant for all measures (<0.05). On average, the Permute drug–disease test only has 55.09 (1.5%) diseases with non-zero MOAs. Its average recall of 0.0137% is much worse

than that of the Permute drug–protein test because it loses the correct connections between diseases and proteins. Correlations with both log(RR) (c.c. = 0.0026, _p_ value = 0.24) and the

φ-score (c.c. = 0.0029, _p_ value = 0.19) are insignificant. Except for precision, the _p_ values of the difference between LEMEDISCO and this permutation test are all very significant (well

below 0.05). The 54.3% average precision is due to its very few predictions (average only ~26). With these few predictions, a random selection of 26 pairs from the 191,966 (75.6% are true

positives defined as log(RR) >0) will have a probability of $\mathop{\sum }\nolimits_{k=14}^{26}{C}_{26}^{k}$ × 0.756k × 0.24426-k = 0.996 of having greater than 54.3% precision. This

means the precision is not better than random selection. To understand the significant difference between the Permute drug–protein and the Permute drug–disease tests, we note that MEDICASCY

predicts drug–disease pairs based on two components: One uses the drug’s chemical structure to learn the indications of a drug from those drugs with similar structure. This component is

insensitive to whether the drugs’ protein targets change. The other depends on the drug’s protein targets. In the Permute drug–protein test, a permuted drug–protein relation will randomly

change the drug’s protein targets to another drug’s. MEDICASCY was applied after the permutation to ensure correct drug–disease relations. Thus, MEDICASCY’s prediction of drug–disease

relations still has information from the permuted drug–protein relation and the disease-(through permuted drug)–protein relations are not completely lost. This actually reflects the fact

that there are a subset of proteins that occur in many diseases and permuting the drug–protein relationship for this subset does not change the identification of proteins in a given disease.

On the other hand, the permuted drug–disease test completely destroys the mapping of the protein (through the drug) to disease. To compare LEMEDISCO’s J-score to the XD-score, we mapped

their ICD-9 disease code to the DOIDs and obtained a subset of 29,658 pairs from their dataset of 97,665 pairs8. As shown in Supplementary Fig. 1 and Table 1, the XD-score has a c.c. of

0.042 (_p_ value = 2.8 × 10−13) with log(RR) and c.c. = 0.071 (_p_ value = 9.7 × 10−35) with the φ-score. Their NG score (the number of shared genes) essentially has no significant

correlation with log(RR) with a c.c. of 0.0047 (_p_ value = 0.42) and only shows a correlation of 0.053 (_p_ value = 6.6 × 10−20) with the φ-score. The J-score has much better correlations:

c.c. = 0.146 (_p_ value = 0.0) with log(RR), 0.106 (_p_ value = 0.0) with the φ-score. The recall rate of the J-score is 44.5% compared to 6.4% for the XD-score. J-score’s precision (80.6

vs. 77.8%), AUROC (0.531 vs. 0.510), AUPRC (0.812 vs. 0.801) are all better. Supplementary Fig. 1 shows distinct patterns of J-score and XD-score. The data points of the XD-score are mostly

concentrated at an XD-score = 0, whereas those of the J-score spread across the full range of 0–1. For comparison with the _S__AB_ score7, the MeSH19 disease names were mapped to their

DOIDs. A consensus set of 943 disease pairs from their dataset and ours was obtained. As shown in Supplementary Fig. 2 and Table 1, compared to _S__AB_7, for the 947 disease pairs,

LEMEDISCO’s J-score has a c.c. = 0.0986 (_p_ value = 0.0024) with log(RR) and a c.c. = 0.0886 (_p_ value = 0.0065) with the φ-score that are both better than those of the _S__AB_ score with

a c.c. = −0.0620 (_p_ value = 0.057) with log(RR) and c.c. = −0.0413(_p_ value = 0.205) with the φ-score; both are insignificant. The recall rate of the J-score is 68.6% and is an order of

magnitude better than the 8.0% by the _S__AB_ score when defining comorbid pairs when the _S__AB_ score <0; that is for a biologically meaningful disease–disease relationship7. J-score

has AUROC = 0.531 compared to _S__AB_ score’s 0.434 that is even worse than random value 0.5 because its dominant _S__AB_ score >0 region is worse than random. The J-score also has a

better AUPRC (0.798 vs. 0.761). However, J-score’s precision (77.7% vs. 85.3%) is slightly worse. Supplementary Fig. 2 shows that the data points of the _S__AB_ score are concentrated in the

region _S__AB_ >0, whereas those of the J-score spread over the 0–1 region. Next, a common dataset of 2621 disease pairs was obtained for comparison with the Symptom Similarity Score6.

As shown in Supplementary Fig. 3 and Table 1, the Symptom Similarity Score has better correlations of 0.322 (_p_ value = 0.0) than 0.140 (_p_ value = 5.2 × 10−13) by the J-score for the

log(RR) and 0.194 (_p_ value = 1.4 × 10−23) than 0.135 (_p_ value = 3.8 × 10−12) by the J-score for the φ-score. It also has better recall (100 vs. 63.7%), AUROC (0.587 vs. 0.512) and AUPRC

(0.856 vs. 0.814). However, the Symptom Similarity Score only explains the relationship of one phenotype (symptom) to another phenotype (disease). Nevertheless, all correlations of the

J-score are statistically significant. The J-score’s precision is almost identical to that of the Symptom Similarity Score (79.3 vs. 79.6%). We note that Supplementary Figs. 3, 4 show very

similar patterns of the J-score and Symptom Similarity Score. MEDICASCY BASED MOA PROTEIN PREDICTION The ICD-10 main classification coverage of the 3608 diseases is shown in Fig. 1a. We

first examine the number of predicted MOA proteins per indication from MEDICASCY12. Using a _q_ value cutoff of 0.05 and including protein isoforms, the average (median) number of MOA

proteins per indication is 1,142.2 (339); the maximal and minimal values are 15,281 (almost half of the total 32,584 screened proteins) for mast cell sarcoma and 0 for 82 diseases. The

histogram of the number of MOAs is shown in Fig. 1b. 71.0% (40.6%) of indications have >100 (500) MOA proteins. These associations allowed us to expand the protein repertoire that might

be associated with each disease and resemble the statistics from GWAS studies. Below, we describe the use of LEMEDISCO to predict disease comorbidities as well as prioritize these proteins.

SHARED MOA PROTEINS EXPLAIN DISEASE COMORBIDITY BY WAY OF DISEASE–DISEASE RELATIONSHIPS We next examine the overall characteristics of the predicted comorbidity network of 3608 diseases.

Eighty-two diseases do not have MOA protein predictions and thus do not have predicted comorbidities. There are a total of 6,507,028 possible pairwise disease associations. Of these, there

are 2,137,022 significant pairwise disease associations (_q_ value <0.05) excluding the diagonals given by LEMEDISCO. Out of 3608, 3523 diseases have significant comorbidities. The

density and frequency of the J-score for the significant non-redundant pairs is in Fig. 1c, and the density and frequency of the degree (number of edges) for each node (disease) is

represented in Fig. 1d. Using a _q_ value cutoff of 0.05, the average (median) number of comorbidities per disease is 608.3 (491). The largest (smallest) number of comorbidities is 2229 for

Pneumonia aspiration and the smallest is 1 for these four diseases: glossopharyngeal neuralgia, median arcuate ligament syndrome, toxoplasmosis, hallucinogen dependence. The average

closeness $\pm$ one standard deviation (defined as the reciprocal of the shortest distance to all other nodes: \(\frac{{{{{{{\mathrm{Number}}}}}}\; {{{{{\mathrm{of}}}}}}\;

{{{{{\mathrm{other}}}}}}\; {{{{{\mathrm{nodes}}}}}}}}{\sum {{{{{{\mathrm{shortest}}}}}}\; {{{{{\mathrm{distance}}}}}}\; {{{{{\mathrm{to}}}}}}\; {{{{{\mathrm{other}}}}}}\;

{{{{{\mathrm{nodes}}}}}}}}\)) of all nodes is 0.535 ± 0.084, indicating that the majority of disease pairs have the shortest distance around 1/0.535. The average betweenness is 1135 ± 1727,

i.e., on average, 1135 pairs of diseases have their shortest distance passing through the given disease node. Thus, the disease network is very dense. The cumulative distribution for the

J-score and _q_ values for all of the comorbidities and the top 100 are shown in Supplementary Figs. 5, 6, respectively. The summary statistics of the scores for these thresholds are shown

in Supplementary Table 1. What is clear from these figures and Supplementary Table 1, particularly for the top 100 ranked comorbidities, is that the 99.6% top-ranked 100 comorbidities have a

_q_ value <0.005. In other words, while a _q_ value threshold of 0.05 is used, in reality, the actual _q_ values employed for subsequent analysis are far more significant. Around 32.8%

of the disease pairs have a _q_ value <0.05. This result is consistent with the 37.1% recall of large-scale benchmarking (see Table 1). As shown in Fig. 1e, the giant component (GP) of

the disease–disease network covers the entire network when the J-score is <0.1 and the _q_ value <0.05, i.e., starting from any disease, one can walk to any other disease on the

network. As the J-score cutoff increases, the number of diseases in the giant component decreases; however, the decrease is very slow. The rapid decrease only happens around a 0.45 J-score

corresponding to an average _q_ value of 1.63 × 10−6 ± 1.81 × 10−4. Thus, the disease network is not only dense, but it is also strongly (i.e., one has to apply a high J-score cutoff to

break the network into small GP) and highly significantly (compared to default _q_ value 0.05) connected. These issues will be explored in future work. LEMEDISCO IDENTIFIED MOA PROTEINS In

addition to the comorbidity predictions, LEMEDISCO also identifies comorbidity enriched MOA proteins. The comorbidity enriched MOA proteins are hierarchically ranked by their CoMOAenrich

score (defined in the Methods section). Comparing the top 100 comorbidity enriched MOA proteins (hierarchically ranked by the CoMOAenrich score) with the MEDICASCY top 100 MOA proteins

(ranked by _q_ value), 92.5% of the diseases have proteins with a significant overlap _p_ value (<0.05). The cumulative distribution for the CoMOAenrich scores and _q_ values for all the

comorbidity enriched MOA proteins and the top 100 are shown in Supplementary Figs. 7, 8, respectively. The summary statistics of the scores for these thresholds are shown in Supplementary

Table 1. For the comorbidity enriched MOA proteins ranked by their CoMOAenrich score, 65.9% have a _q_ value <0.005. If one only assesses the top 100 comorbidity enriched MOA proteins,

67.1% have a _q_ value <0.005, which are the proteins used for the global pathway analysis. MAPPING OF THE LEMEDISCO MOA PROTEINS TO SIGNIFICANT PATHWAYS The cumulative distribution of

the _q_ values for the pathways and the top 100 pathways are shown in Supplementary Fig. 9 and the summary statistics are provided in Supplementary Table 1. As shown in Supplementary Fig. 9,

73.1% of the significant pathways (_q_ value <0.05) have a _q_ value <0.015. About 3453 or 95.7% of the 3608 diseases have significant pathways. We further note that there are some

MOA proteins (e.g., AR, NR4A3, and PGR) and pathways (e.g., HSP90 chaperone cycle for steroid hormone receptors, SUMO E3 ligases SUMOylate target proteins, SUMOylation) that are present in

approximately a third of the diseases in our library. APPLICATIONS OF LEMEDISCO By way of illustration, we applied LEMEDISCO to two disparate diseases, coronary artery disease (CAD) and

ovarian cancer (OC). CORONARY ARTERY DISEASE (CAD) CAD, a leading cause of death worldwide, is caused by narrowed or blocked arteries due to plaques composed of cholesterol or other fatty

deposits lining the inner wall of the artery. These plaques result in decreased blood supply to the heart20. We find 2576 significant comorbid diseases (_q_ value <0.05) and 785 (558)

comorbidity enriched MOA proteins (genes) (score >0.01), meaning that at least one of the top 100 comorbid disease shares the protein as an MOA protein. This is the _p_ value weighted

comorbidity frequency normalized by the number of comorbid diseases used for calculating the frequency. See Methods for more details. Forty-nine significant pathways (_q_ value <0.05) are

associated with the top-ranked 100 comorbidity enriched proteins. The top 20 disease comorbidities, top 20 comorbidity enriched MOA proteins, and top 20 significant pathways are shown in

Table 2. There are several significant cardiovascular-related comorbidities such as heart disease, cardiovascular system disease, myocardial infarction, and congestive heart failure.

Asthma21, diabetes22, and obstructive lung disease23 are also in the top ten with known comorbidities to CAD. CAD is also known to be comorbid with liver disease24, kidney disease25, and

hyperthyroidism26. Interestingly, allergic rhinitis is associated with decreased coronary heart disease27. In summary, 14 (70%) of the top 20 predicted comorbid diseases have literature

evidence to support these predictions. To further show that these comorbidities with literature evidence cannot be generated randomly, we randomly selected 20 diseases from the 3608 diseases

and did a literature search for their associations with CAD. We found nine diseases, far fewer than our list of 14 diseases (see Supplementary Table 2). A further random test selecting 20

from those after excluding LeMeDISCO predicted comorbid diseases to CAD, we find six diseases having literature evidence (see Supplementary Table 3). Among the top, COX-related comorbidity

enriched proteins were found. COX proteins are involved in the synthesis of prostanoids. Prostanoids are structurally like lipids and are involved in thrombosis and other undesirable

cardiovascular events28. Several GPCR-related pathways (Class A/1 rhodopsin-like receptors, olfactory signaling pathway, GPCR ligand binding) are among the top five pathways predicted for

CAD, consistent with the literature that GPCRs play a crucial role in heart function29. The above results were obtained without any extrinsic knowledge of CAD. Next, we show how LEMEDISCO

can be used to prioritize targets from other studies. A GWAS study identified 155 CAD-associated genes30. While they are associated with CAD, to find out which ones to target is a

non-trivial task. Here, we applied LEMEDISCO to prioritize them by examining their frequencies of presence in other diseases. There were 26 comorbidity enriched MOA proteins (score >0.01)

and 40 pathways (_p_ value <0.05, but _q_ value <0.20) found from global pathway analysis of the 26 comorbidity enriched MOA proteins. The top disease comorbidities, top 20

comorbidity enriched MOA proteins, and top 20 pathways are shown in Table 3. There were only three significant predicted comorbidities (_q_ value <0.05) by LEMEDISCO. The top two

comorbidities are renal artery disease and anuria, both are associated with dysfunction of the kidneys and are related to CAD25,31. Anuria is attributed to failure of the kidneys to produce

urine, and renal artery disease occurs when the arteries that supply blood and oxygen to the kidneys narrows. A study found an increase in renal artery stenosis in patients with CAD31. The

last comorbid disease is anterior uveitis. Studies showed that anterior uveitis is associated with Kawasaki disease that can lead to heart complication32. Thus, all three have literature

evidence. While the top genes are associated with CAD according to the GWAS study of ref. 30, we predicted that they are also associated with the corresponding comorbid diseases—renal artery

disease, anuria, and anterior uveitis. For example, VEGFA is predicted to be associated with all three comorbid diseases. It was found that in progressive kidney disease, the VEGFA

expression level is attenuated33; in contrast, in uveitis disease, it is increased34. Other top genes are predicted to be associated only with anterior uveitis. Among them, SERPINA1 is a

potential causal gene of uveitis35, RAB23 is associated with uveitis in sarcoidosis36, and HHAT has evidence of association with uveitis37. None of the 40 pathways obtained using the top 26

genes overlaps with the six pathways obtained using the original 155 genes with the same cutoff. The predicted top pathway RAB geranylgeranylation through RAB23/RAB5C genes is part of the

signaling network of statin-induced effects of improving cardiac health in Drosophila38. OVARIAN CANCER (OC) LEMEDISCO predicts 1,092 significant comorbidities to OC (_q_ value <0.05),

with 282 (171) comorbidity enriched MOA proteins (genes) (score >0.01). There were 159 significant pathways (_q_ value <0.05) from the top 100 comorbidity enriched MOA proteins. The

top 20 disease comorbidities, top 20 comorbidity enriched MOA proteins, and all significant pathways are shown in Table 4. It is not surprising that all of the top comorbidities are cancers.

The top first comorbid disease is testicular cancer. Although OC and testicular cancer cannot occur in one individual, they are hereditarily associated39. Fallopian tube cancer is

considered similar to OC. It was reported that squamous cell carcinoma occurred in the ovary40. Nodular prostate, (the male version of OC), cervical cancer41, and inflammatory breast

carcinoma42 are all reproduction-related cancers like OC. OC from lung cancer metastasis occurs in 2–4% of OC patients43. Bile duct cancer is a very rare site of OC metastases44. Peritoneal

cancer behaves similarly to OC. Gland cancer is linked to BRCA-positive families, and BRCA is a risk gene for ovarian cancer45. Neurofibroma is reported to mimic ovarian tumors46. Renal cell

carcinoma is metastatic to ovarian and fallopian tube cancers47. In total, 14 of the top 20 (70%) comorbidities have literature evidence. Similar to CAD, we did a literature search of 20

randomly selected diseases for their associations with OC. We found only 6 cases; far fewer than our 14 (see Supplementary Table 2). In a further random test selecting 20 from those after

excluding LEMEDISCO predicted comorbid diseases to OC, we find four diseases have literature evidence (see Supplementary Table 3). Eleven of the top 20 enriched MOA proteins are kinases that

are cancer-related. The topmost comorbidity enriched MOA protein is TEK, angiopoietin-1 receptor; angiopoietins are found to promote ovarian cancer progression48. Interestingly, TYRO3 is

related to drug resistance in OC49. The top predicted pathway by enriched MOAs is MAPK1/MAPK3 signaling that mediates the expression of ERBB2 silencing, OC cell migration, and invasion50.

There are also enriched pathways associated with ephrin ligands. Aggressive forms of ovarian cancer have been previously shown to involve upregulated forms of ephrin, such as ephrinA551.

There are 14 ephrin-related comorbidity enriched MOA proteins found (all score >0.37). We next examined a set of 11 genes associated with OC risk from a study that assessed the

multiple-gene germline sequences in 95,561 women with OC using LEMEDISCO52. The results for the top 20 comorbidities, seven MOA proteins (score >0.01), and their associated pathways are

shown in Table 5. There were 125 significant comorbidities (_q_ value <0.05) predicted and 33 significant pathways (_q_ value <0.05) associated with these seven proteins. The top

comorbidity associated with OC was angiosarcoma, a rare cancer of the inner blood and lymph vessels and in very rare cases, it occurs in the ovaries53. Patients with epithelial ovarian

cancers show an increased risk of skin cancer54. OC is also considered to have genetic risk factors55. Myxoid leiomyosarcoma is a very rare tumor with similarity to ovarian cancer56, and

leiomyosarcoma was reported in the ovaries57. A study found a relationship between hemoglobin levels and interleukin-6 levels in individuals with untreated epithelial ovarian cancer,

indicating an inflammatory role in cancer-associated anemia58. Medulloblastoma can arise from ovarian tumors in pregnancy59. Uveal cancer is associated with breast cancer and OC60. OC is

part of urinary system neoplasm. In total, 15 (75%) of top 20 comorbidities have literature evidence. The top two comorbidity enriched MOA proteins are RAD51C, RAD51D and belong to 16 of the

top 20 pathways. These involve such processes as DNA repair, transcriptional regulation by TP53, DNA double-strand break repair, and reproduction (see Table 5). The top two and the

third-ranked MSH6 proteins are shared by all top 100 comorbidities. For example, RAD51C is associated with Fanconi anemia (ranked 63th)61; RAD51D is associated with leiomyosarcoma62, and

MSH6 is a risk gene for pancreatic adenocarcinoma (26th)63. LEMEDISCO WEB SERVER The LEMEDISCO web service allows researchers to query our library of 3608 diseases or input a set of

pathogenic human genes/proteins and compute their predicted comorbidities, prioritized MOA proteins, and pathways associated. The web service is freely available for academic users at

http://sites.gatech.edu/cssb/LeMeDISCO. The programs and input data for reproducing disease–protein, disease–disease relationships, and all LEMEDISCO results as well as benchmark results are

available at https://github.com/hzhou3ga/lemedisco. DISCUSSION LEMEDISCO is a systematic approach for studying and analyzing possible features underlying the common proteins driving

comorbid diseases. The resulting predicted driver proteins and pathways for each disease or input gene set can allow researchers to generate new diagnostic and treatment options and

hypotheses. Interestingly, there were some MOA proteins and pathways present across approximately a third of the diseases, implying common disease drivers. The implications of this

observation and its relationship to disease origins will be pursued in future work. We do note that the current comorbid disease analysis strongly suggests that the “one target-one

disease-one molecule” approach often used in developing disease therapeutics31 is likely too simplistic. To fully understand the complexities of a disease, one must trace the origin of its

pathogenesis, which may be due to a genetic or somatic variant that is somehow related to the disease. However, such variants may also be associated with a disease not previously known to be

associated with that disease. Such interrelations can be further investigated by identifying high confidence comorbidity predictions from LEMEDISCO, regardless of whether or not their

comorbidity was previously known in the literature. For example, analysis of the comorbid diseases associated with CAD and OC have not only recapitulated known disease comorbidities but have

also provided novel insights. The results for CAD yielded high confidence associations between liver diseases and forms of asthma, which can be further investigated through the comorbidity

enriched MOA proteins and pathways. Furthermore, the results for OC revealed more high confidence associations to other forms of cancer such as squamous cell carcinoma and lung cancer.

LEMEDISCO not only has applications to the study of the underlying etiology behind a disease but may also be used during the early stages of drug discovery to identify efficacious drugs.

Rather than starting with a small molecule or protein target of choice, LEMEDISCO allows one to begin at the level of disease biology, often termed phenotypic drug discovery. In future work,

we shall demonstrate the utility of LEMEDISCO in identifying efficacious drugs to treat a given disease. Overall, the results of the current analysis and preliminary applications to drug

discovery suggest that LEMEDISCO provides a set of tools for elucidating disease etiology and interrelationships and that a more systems-wide, comprehensive approach to both personalized

medicine and drug discovery is required. We note that some of the predicted MOA proteins are present in around 1/3 of diseases. The top five (AR, NR4A3, PGR, NR3C2, and NR3C1) proteins all

belong to the nuclear receptor family and regulate other genes. All have DNA binding sites, especially two zinc finger domains64. The regulatory functions and ubiquity of well-studied zinc

fingers in these proteins may explain their frequent presentation as disease MOA proteins65. Even though in our predicted drug targets of the probe drugs, these proteins are not the most

frequent ones (e.g., AR is ranked 1135th of 16,762), their disease associations were enriched by MEDICASCY predicted disease–drug relationships. With the above possible applications, there

is also the limitation of the current approach of using FDA-approved DrugBank drugs as probe drugs to tease out the MOA proteins of diseases, i.e., some possible MOA proteins of a given

disease might not be the targets of the probe drugs and others might be incorrectly assigned. This will be addressed in future work that includes more diverse small molecule drug libraries

and improved virtual ligand screening algorithms to map the drugs to their respective protein targets66. As the probe drug target space is expanded, additional MOA proteins will be

discovered. Concomitantly, as the virtual ligand screening algorithms that assign small molecules to their predicted protein targets improve, false positives will be eliminated, and

additional true positive proteins might be added. These will result in more accurate MOA protein predictions. Similar to previous work1,6,7,8, our predicted disease–disease relationships

were benchmarked using large-scale clinical data and have only small-scale validation by literature searches. One single relationship requires at least one published work to validate.

Large-scale automatic text mining is a feasible way to scale up the validation and build a more confident subset of our predictions67. This is the subject of ongoing studies. METHODS

OVERVIEW OF LEMEDISCO A flowchart of LEMEDISCO is shown in Fig. 2. LEMEDISCO employs MEDICASCY12 to predict possible disease MOA proteins. Here, MEDICASCY is applied in prediction mode

(i.e., any training drugs having a Tanimoto-Coefficient = 1 to a given input drug is excluded from training) to avoid a strong bias toward drugs in the training set on a set of 2095

FDA-approved drugs68. For each of the 3608 indications, we rank the 2095 probe drugs according to their _Z_-scores, _Z__d_, defined using the raw score computed by MEDICASCY from:

$${Z}_{d}=\left(\frac{{{{{{{\mathrm{raw}}}}}}}\,{{{{{{\mathrm{score}}}}}}}{{\mbox{--}}}{{{{{{\mathrm{average}}}}}}}\,{{{{{{\mathrm{raw}}}}}}\,{{{{{\mathrm{score}}}}}}}\,{{{{{{\mathrm{of}}}}}}}\,2095\,{{{{{{\mathrm{drugs}}}}}}}}{{{{{{{\mathrm{standard}}}}}}}\,{{{{{{\mathrm{deviation}}}}}}}\,{{{{{{\mathrm{of}}}}}}}\,2095\,{{{{{{\mathrm{raw}}}}}}}\,{{{{{{\mathrm{scores}}}}}}}}\right)$$

(1) To predict a drug as having the given indication, we applied a _Z__d_ cutoff of 1.65, that approximately corresponds to a _p_ value of 0.05 for the upper-tailed null hypotheses of

random variable _Z__d_. Thus, for each indication D, the 2095 probe drugs are separated into two groups: _N_1 are predicted to have indication D (_Z__d_ ≥ 1.65) and _N_2 (=2095 − _N_1) are

not predicted to have indication _D_ (_Z__d_ < 1.65). This is a very loose prediction of a drug’s indication with the advantage that it always predicts some drugs having the indication

with its expected statistical confidence. Then, for a given indication D and each protein target, T, in the human proteome of our modeled 32,584 proteins, there are a subset of the drugs (or

perhaps none) predicted by FINDSITECOMB2.0 69 to bind T. The relative risk RR(D, T) of the given target T with respect to indication D as:

$${{{{{\rm{RR}}}}}}({{{{{\rm{D}}}}}},{{{{{\rm{T}}}}}})=\frac{{N}_{1}^{T}/{N}_{1}}{{N}_{2}^{T}/{N}_{2}}$$ (2a) where ${N}_{1}^{T}$ and ${N}_{2}^{T}$ are the numbers of drugs binding to T

with and without indication D, respectively. The numerator is the estimation of the probability of drugs having the predicted indication D (_Z__d_ ≥ 1.65) that bind to protein T (F1 =

${N}_{1}^{T}/{N}_{1}$). The denominator is the probability of finding drugs that do not have the predicted indication D but which bind to protein T (F2 = ${N}_{2}^{T}/{N}_{2}$). This

latter probability serves as the background probability that an arbitrary drug will bind to T. When no drug is predicted to bind to protein T, RR(D, T) is set to zero. RR(D, T) = F1/F2 >1

means that a drug having indication D is more likely to bind to T than arbitrary drugs not having the predicted indication D will bind to T. We then compute the statistical significance of

RR(D, T) by calculating a _p_ value using Fisher’s exact test70,71 on the following contingency table: $$\left(\begin{array}{cc}{N}_{1}^{T} & {N}_{1}-{N}_{1}^{T}\\ {N}_{2}^{T} &

{N}_{2}-{N}_{2}^{T}\end{array}\right)$$ (2b) We define a protein target T as predicted to be a possible MOA target for indication D if its _p_ value <0.05 because it is more likely to be

targeted by efficacious drugs than arbitrary drugs. Thus, for each of the 3608 indications, there is a list of predicted possible MOA proteins. To reduce false positive MOAs, we utilized the

human protein atlas database (https://www.proteinatlas.org/about/download, _normal_tissue.tsv_) of expression profiles for proteins in normal human tissues based on immunohistochemistry

using tissue micro arrays72 to filter those proteins that are “not detected” and not “uncertain” in all tested tissues related to an indication. To determine the tissues related to an

indication, tissues are mapped to their ICD-10 main codes and indications having the same main codes are related to the tissue. Using the input of two sets of putative MOA proteins having a

_p_ value of <0.05 calculated by Fisher’s exact test70, we calculate their Jaccard index17 J(D1,D2) (J-score) defined in Eq. 3a as

$${{{{{\rm{J}}}}}}{\mbox{-}}{{{{{\rm{score}}}}}}={N}_{s}/({{{{{\rm{N}}}}}}{{{{{\rm{D}}}}}}1+{{{{{\rm{N}}}}}}{{{{{\rm{D}}}}}}2-{N}_{s})$$ (3a) We then calculate the _p_ value for significance

by Fisher’s exact test for the contingency table70 that gives the probability of having overlap ≥_N__s_ by randomly selecting ${N}_{D2}$ out of ${N}_{t}$ proteins70,73:

$$\left(\begin{array}{cc}{N}_{s} & {N}_{D2}-{N}_{s}\\ {N}_{D1} & {N}_{t}-{N}_{D1}\end{array}\right)$$ (3b) ${N}_{D1}$, ${N}_{D2}$ are the numbers of MOA proteins/genes of disease

D1 and D2; _N__s_ is the number of overlapped MOA proteins between D1, D2, and ${N}_{t}$ is the total number of human proteins. The Jaccard index J-score is a statistical measure of the

similarity between MOA proteins of D1 and D2, and its value ranges between 0 and 1. Since the null hypothesis of ${N}_{s}$ corresponds to a hypergeometric distribution, the _p_ value of

observing the number of overlapped MOA proteins between D1, D2 $\ge$ ${N}_{s}$ can be calculated using Fisher’s exact test on the table in Eq. 3b71. We will use the J-score for

predicting comorbidity and compare it with the observed comorbidity. We note that the J-score is determined by the number of overlapped MOAs, which means that the comorbidity defined by the

J-score are not limited by diseases occurring in one individual but rather considers the effect of the malfunctioning proteins in the human population. This is especially true for

sex-specific diseases such as ovarian cancer and prostate cancer that may have overlapping MOA proteins; this may result in significant comorbidity between them. In other words, the two

diseases may share common driver proteins, although ovarian and prostate cancer could occur unless the individual has both an ovary and a prostate, which is highly unlikely. Similarly, it

can predict the comorbidity of rare and common diseases; again, whether this would occur would depend on the presence in a given person of the appropriate set of malfunctioning genes.

Therefore, though many of the LeMeDISCO comorbidity predictions are seen in one individual, others may not be. Thus, LeMeDISCO comorbidity predictions are a population-based approach. To

better control the false discovery rate (FDR) due to background noise from statistic errors, we performed the multiple testing correction to the _p_ values for disease–protein and

disease–disease associations calculated by Fisher’s exact test by computing the _q_ value using the method described in ref. 74. In large-scale disease–disease comorbidity calculations, we

use the MOAs predicted by MEDICASCY12. In addition, MOA targets between disease pairs can also be derived from experimental data; examples include differential gene expression (GE),

Mendelian or somatic mutation profiles comparing disease vs. control normal samples, better vs. worse prognosis samples, or drug-treated vs. control untreated samples75. BENCHMARKING OF

LEMEDISCO We validated LEMEDISCO’s J-score by correlating it with the observed comorbidity as quantified by (a) the logarithm of relative risk log(RR) score and (b) the φ-score (Pearson’s

correlation for binary variables)1. The relative risk (RR) is the probability that two diseases co-occur in a single individual relative to random. Since RR scales exponentially with respect

to the strength of two interacting diseases, we use log(RR) for correlation analysis. The log(RR) and φ-score are computed from US Medicare insurance claim data using1: $${{\log

}}({{{{{\rm{RR}}}}}})\,=\,{{\log }}\left(\frac{{n}_{{AB}}/{n}_{{tot}}} {({n}_{A}/{n}_{{tot}}){({n}_{B}/{n}_{{tot}})}}\right)$$ (4a) $${{\varphi }}{\mbox{-}}{{{{{\rm{score}}}}}}=({n}_{{AB}}*

{n}_{{tot}}-{n}_{A}* {n}_{B})/\sqrt{{n}_{A}* {n}_{B}* \left({n}_{{tot}}-{n}_{A}\right)* \left({n}_{{tot}}-{n}_{B}\right)}$$ (4b) where _n__tot_ = total number of patients; _n__A_, _n__B_ =

number of patients diagnosed with diseases A and B, and _n__AB_ = number of patients diagnosed with both diseases A and B. PERMUTATION TESTS Two permutation tests were performed: (a) Permute

drug–protein relationships: Randomly permute the predicted drug–protein relations (i.e., randomly replace a drug’s protein targets with another drug’s protein targets predicted by

FINDSITECOMB2.0). This acts to transfer the protein targets of a drug (possibly incorrectly) to another drug. This test evaluates the performance of LEMEDISCO if we have the correct

drug–disease relations (predicted by MEDICASCY) but the incorrect drug–protein relations. To ensure the correct drug–disease relations after permuting the drug–protein relations, MEDICASCY

was applied to the permuted drug–protein relations since MEDICASCY depends on the drug’s protein targets; (b) Permute drug–disease relationships: Randomly permute the predicted drug–disease

relations (by randomly replacing a drug’s predicted indications with another drug’s indications). This test evaluates how LEMEDISCO will perform if the drug–protein relations are correct

(predicted by FINDSITECOMB2.0), but the drug–disease relations are randomly permuted. In both cases, disease MOAs are derived using the permuted relationships and 100 runs for each test with

different random seeds were performed. A _p_ value is calculated from _z_-score = (LeMeDISCO value-average)/standard deviation to characterize the significance of the difference between

LEMEDISCO and the permutation tests. IDENTIFICATION OF KEY MOA PROTEINS AND ASSOCIATED PATHWAYS FOR DISEASE COMORBIDITY After determining the significant comorbidities for each disease, the

_p_ value weighted frequency of shared MOA proteins across the top 100 predicted comorbidities are calculated. We define a _p_ value weighted frequency of an input MOA as follows (i.e.,

CoMOAenrich score): If MOA protein T is shared by a comorbid indication D and the _p_ value of T associated with D is _P_, then the weight defined by the min(1.0,−αlog_P_) is counted as T’s

frequency. In practice, we used 10 cancer cell line data76 to optimize the coefficient α to 0.025. We further computed a _p_ value via \({e}^{-\frac{{{{{{{\mathrm{COMOAenrich}}}}}}\;

{{{{{\mathrm{score}}}}}}}}{\alpha }}\) where α = 0.025, as previously mentioned. These MOA proteins expand the number of possible molecular players driving disease pathogenesis. An

empirically derived CoMOAenrich score (normalized by the number of comorbid indications that is 100) threshold of 0.01 was used, which is equivalent to 1% of the comorbid indications having

the MOA proteins with a significant _p_ value (<4.2 × 10−18). Then, up to the top 100 comorbidity enriched MOA proteins for each disease were used in global pathway analysis via

Reactome13. The pathways with a _p_ value <0.05 were extracted. The frequency of pathways across diseases was assessed to identify common pathways of disease. LEMEDISCO USAGE As shown in

Fig. 2, LEMEDISCO can be used in two different ways: (1) MEDICASCY-driven LEMEDISCO: The comorbidities for any of the 3608 diseases from the MEDICASCY-provided MOA proteins are predicted

(Fig. 2a). (2) Pathogenic gene set driven LEMEDISCO: Input your own pathogenic gene set derived from differential gene expression, GWAS, exome analysis, or other experimental/clinical

techniques (shown in Fig. 2b). The LEMEDISCO web service allows users to query the LEMEDISCO database as well as input their own set of pathogenic genes to assess the associated

comorbidities, MOA proteins, and pathways. REPORTING SUMMARY Further information on research design is available in the Nature Research Reporting Summary linked to this article. DATA

AVAILABILITY The input data for reproducing disease–protein, disease–disease relationships, and all LEMEDISCO results as well as benchmark results are available at

https://github.com/hzhou3ga/lemedisco. The web service is freely available for academic users at http://sites.gatech.edu/cssb/LeMeDISCO. The underline data for Fig. 1 is in file

supplementary data 1. CODE AVAILABILITY The programs for reproducing disease–protein and disease–disease relationships are available at https://github.com/hzhou3ga/lemedisco. REFERENCES *

Hidalgo, C. A., Blumm, N., Barabasi, A. L. & Christakis, N. A. A dynamic network approach for the study of human phenotypes. _PLoS Comput. Biol._ 5, e1000353 (2009). Article PubMed

PubMed Central CAS Google Scholar * Somers, E. C., Thomas, S. L., Smeeth, L. & Hall, A. J. Are individuals with an autoimmune disease at higher risk of a second autoimmune disorder.

_Am. J. Epidemiol._ 169, 749–755 (2009). Article PubMed Google Scholar * Cramer, A., Waldorp, L., van der Maas, H. & Borsboom, D. Comorbidity: a network perspective. _Behav. Brain

Sci._ 33, 137–150 (2010). Article PubMed Google Scholar * Melamed, R. D., Emmett, K. J. & Madubata, C. Genetic similarity between cancers and comorbid Mendelian diseases identifies

candidate driver genes. _Nat. Commun._ 6, 7033 (2015). Article CAS PubMed Google Scholar * Lee, D.-S. et al. The implications of human metabolic network topology for disease comorbidity.

_Proc. Natl Acad. Sci. USA_ 105, 9880–9885 (2008). Article CAS PubMed PubMed Central Google Scholar * Zhou, X., Menche, J., Barabási, A.-L. & Sharma, A. Human symptoms–disease

network. _Nat. Commun._ 5, 4212 (2014). Article CAS PubMed Google Scholar * Menche, J. et al. Disease networks. Uncovering disease-disease relationships through the incomplete

interactomes. _interactome. Science_ 347, 1257601 (2015). PubMed Google Scholar * Ko, Y., Cho, M., Lee, J.-S. & Kim, J. Identification of disease comorbidity through hidden molecular

mechanisms. _Sci. Rep._ 6, 39433 (2016). Article CAS PubMed PubMed Central Google Scholar * Guo, M. et al. Analysis of disease comorbidity patterns in a large-scale China population.

_BMC Med. Genomics_ 12, 177 (2019). Article PubMed PubMed Central Google Scholar * Ramos, E. M. et al. Phenotype–Genotype Integrator (PheGenI): synthesizing genome-wide association study

(GWAS) data with existing genomic resources. _Eur. J. Hum. Genet_. 22, 144–147 (2014).. * Schriml, L. et al. Disease ontology: a backbone for disease semantic integration. _Nucleic Acids

Res._ 40, D940–D946 (2012). Article CAS PubMed Google Scholar * Zhou, H. et al. MEDICASCY: a machine learning approach for predicting small-molecule drug side effects, indications,

efficacy, and modes of action. _Mol. Pharm._ 17, 1558–1574 (2020). Article CAS PubMed PubMed Central Google Scholar * Jassal, B. et al. The reactome pathway knowledgebase. _Nucleic

Acids Res._ 48, D498–d503 (2020). CAS PubMed Google Scholar * Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. _Nature_

https://doi.org/10.1038/s41586-021-03819-2 (2021). * Carpenter, K. A., Cohen, D. S., Jarrell, J. T. & Huang, X. Deep learning and virtual drug screening. _Future Med. Chem._ 10,

2557–2567 (2018). Article CAS PubMed PubMed Central Google Scholar * Zhou, H., Gao, M. & Skolnick, J. ENTPRISE-X: predicting disease-associated frameshift and nonsense mutations.

_PLoS ONE_ 13, e0196849 (2018). Article PubMed PubMed Central CAS Google Scholar * Jaccard, P. THE distribution of the flora in the alpine zone. _N. Phytologist_ 11, 37–50 (1912).

Article Google Scholar * World Health, O. (World Health Organization, 2004). https://www.cdc.gov/nchs/icd/icd-10-cm.htm. * Rogers, F. B. Medical subject headings. _Bull. Med. Libr. Assoc._

51, 114–116 (1963). CAS PubMed PubMed Central Google Scholar * Fuster, V., Badimon, L., Badimon, J. J. & Chesebro, J. H. The pathogenesis of coronary artery disease and the acute

coronary syndromes. _N. Engl. J. Med._ 326, 310–318 (1992). Article CAS PubMed Google Scholar * Wang, L., Gao, S., Yu, M., Sheng, Z. & Tan, W. Association of asthma with coronary

heart disease: a meta analysis of 11 trials. _PLoS ONE_ 12, e0179335 (2017). Article PubMed PubMed Central CAS Google Scholar * Aronson, D. & Edelman, E. R. Coronary artery disease

and diabetes mellitus. _Cardiol. Clin._ 32, 439–455 (2014). Article PubMed PubMed Central Google Scholar * Falk, J. A. et al. Cardiac disease in chronic obstructive pulmonary disease.

_Proc. Am. Thorac. Soc._ 5, 543–548 (2008). Article PubMed PubMed Central Google Scholar * Montemezzo, M. et al. Nonalcoholic fatty liver disease and coronary artery disease: big

brothers in patients with acute coronary syndrome. _Sci. World J._ 2020, 8489238 (2020). Article Google Scholar * Cai, Q., Mukku, V. K. & Ahmad, M. Coronary artery disease in patients

with chronic kidney disease: a clinical update. _Curr. Cardiol. Rev._ 9, 331–339 (2013). Article PubMed PubMed Central Google Scholar * Beyer, C., Plank, F., Friedrich, G., Wildauer, M.

& Feuchtner, G. Effects of hyperthyroidism on coronary artery disease: a computed tomography angiography study. _Can. J. Cardiol._ 33, 1327–1334 (2017). Article PubMed Google Scholar

* Crans Yoon, A. M., Chiu, V., Rana, J. S. & Sheikh, J. Association of allergic rhinitis, coronary heart disease, cerebrovascular disease, and all-cause mortality. _Ann. Allergy Asthma

Immunol._ 117, 359–364.e351 (2016). Article PubMed Google Scholar * Zhu, L., Zhang, Y., Guo, Z. & Wang, M. Cardiovascular biology of prostanoids and drug discovery. _Arterioscler.

Thromb. Vasc. Biol._ 40, 1454–1463 (2020). Article CAS PubMed Google Scholar * Wang, J., Gareri, C. & Rockman, H. A. G-protein-coupled receptors in heart disease. _Circ. Res._ 123,

716–735 (2018). Article CAS PubMed PubMed Central Google Scholar * van der Harst, P. & Verweij, N. Identification of 64 novel genetic loci provides an expanded view on the genetic

architecture of Coronary Artery Disease. _Circulation Res._ 122, 433–443 (2018). Article PubMed CAS Google Scholar * Zandparsa, A., Habashizadeh, M., Moradi Farsani, E., Jabbari, M.

& Rezaei, R. Relationship between renal artery stenosis and severity of coronary artery disease in patients with coronary atherosclerotic disease. _Int. Cardiovasc. Res. J._ 6, 84–87

(2012). PubMed PubMed Central Google Scholar * Lee, K. J. et al. Usefulness of anterior uveitis as an additional tool for diagnosing incomplete Kawasaki disease. _Korean J. Pediatr._ 59,

174–177 (2016). Article CAS PubMed PubMed Central Google Scholar * Rudnicki, M. et al. Hypoxia response and VEGF-A expression in human proximal tubular epithelial cells in stable and

progressive renal disease. _Lab. Invest._ 89, 337–346 (2009). Article CAS PubMed Google Scholar * Paroli, M. P. et al. Increased vascular endothelial growth factor levels in aqueous

humor and serum of patients with quiescent uveitis. _Eur. J. Ophthalmol._ 17, 938–942 (2007). Article CAS PubMed Google Scholar * de-la-Torre, A. et al. Uveitis and multiple sclerosis:

potential common causal mutations. _Mol. Neurobiol._ 56, 8008–8017 (2019). Article CAS PubMed PubMed Central Google Scholar * Davoudi, S. et al. Association of genetic variants in RAB23

and ANXA11 with uveitis in sarcoidosis. _Mol. Vis._ 24, 59–74 (2018). CAS PubMed PubMed Central Google Scholar * Rouillard, A. D. et al. The harmonizome: a collection of processed

datasets gathered to serve and mine knowledge about genes and proteins. _Database_ 2016, baw100 (2016). Article PubMed PubMed Central CAS Google Scholar * Spindler, S. R. et al. Statin

treatment increases lifespan and improves cardiac health in Drosophila by decreasing specific protein prenylation. _PLoS ONE_ 7, e39581 (2012). Article CAS PubMed PubMed Central Google

Scholar * Etter, J. L. et al. Hereditary association between testicular cancer and familial ovarian cancer: a familial ovarian cancer registry study. _Cancer Epidemiol._ 53, 184–186 (2018).

Article PubMed PubMed Central Google Scholar * Srivastava, H., Shree, S., Guleria, K. & Singh, U. R. Pure primary squamous cell carcinoma of ovary - A rare case report. _J. Clin.

Diagn. Res._ 11, Qd01–qd02 (2017). PubMed PubMed Central Google Scholar * Guidozzi, F., Sonnendecker, E. W. & Wright, C. Ovarian cancer with metastatic deposits in the cervix, vagina,

or vulva preceding primary cytoreductive surgery. _Gynecol. Oncol._ 49, 225–228 (1993). Article CAS PubMed Google Scholar * Bergfeldt, K., Nilsson, B., Einhorn, S. & Hall, P. Breast

cancer risk in women with a primary ovarian cancer-a case-control study. _Eur. J. Cancer_ 37, 2229–2234 (2001). Article CAS PubMed Google Scholar * Losito, N. S. et al. Lung cancer

diagnosis on ovary mass: a case report. _J. Ovarian Res._ 6, 34 (2013). Article PubMed PubMed Central Google Scholar * Shijo, M. et al. Metastasis of ovarian cancer to the bile duct: a

case report. _Surg. Case Rep._ 5, 100 (2019). Article PubMed PubMed Central Google Scholar * Shen, T. K., Teknos, T. N., Toland, A. E., Senter, L. & Nagy, R. Salivary gland cancer in

BRCA-positive families: a retrospective review. _JAMA Otolaryngol. Head Neck Surg._ 140, 1213–1217 (2014). Article PubMed Google Scholar * Chao, W.-T. et al. Neurofibroma involving

obturator nerve mimicking an adnexal mass: a rare case report and PRISMA-driven systematic review. _J. Ovarian Res._ 11, 14 (2018). Article PubMed PubMed Central Google Scholar * Liang,

L. et al. Renal cell carcinoma metastatic to the ovary or fallopian tube: a clinicopathological study of 9 cases. _Hum. Pathol._ 51, 96–102 (2016). Article PubMed PubMed Central Google

Scholar * Brunckhorst, M. K., Xu, Y., Lu, R. & Yu, Q. Angiopoietins promote ovarian cancer progression by establishing a procancer microenvironment. _Am. J. Pathol._ 184, 2285–2296

(2014). Article CAS PubMed PubMed Central Google Scholar * Lee, C. Overexpression of Tyro3 receptor tyrosine kinase leads to the acquisition of taxol resistance in ovarian cancer cells.

_Mol. Med. Rep._ 12, 1485–1492 (2015). Article CAS PubMed Google Scholar * Yu, T. T., Wang, C. Y. & Tong, R. ERBB2 gene expression silencing involved in ovarian cancer cell

migration and invasion through mediating MAPK1/MAPK3 signaling pathway. _Eur. Rev. Med. Pharm. Sci._ 24, 5267–5280 (2020). Google Scholar * Jukonen, J. et al. Aggressive and recurrent

ovarian cancers upregulate ephrinA5, a non-canonical effector of EphA2 signaling duality. _Sci. Rep._ 11, 8856 (2021). Article CAS PubMed PubMed Central Google Scholar * Kurian, A. W.

et al. Association of ovarian cancer (OC) risk with mutations detected by multiple-gene germline sequencing in 95,561 women. _J. Clin. Oncol._ 34, 5510–5510 (2016). Article Google Scholar

* Ye, H. et al. Primary ovarian angiosarcoma: a rare and recognizable ovarian tumor. _J. Ovarian Res._ 14, 21 (2021). Article CAS PubMed PubMed Central Google Scholar * van Niekerk, C.

C., Bulten, J. & Verbeek, A. L. Epithelial ovarian cancer and the occurrence of skin cancer in the Netherlands: histological type connotations. _ISRN Obstet. Gynecol._ 2011, 617082

(2011). PubMed PubMed Central Google Scholar * Lech, A. et al. Ovarian cancer as a genetic disease. _Front. Biosci._ 18, 543–563 (2013). Article CAS Google Scholar * Kaleli, S., Calay,

Z., Ceydeli, N., Aydýnlý, K. & Kösebay, D. A huge abdominal mass mimicking ovarian cancer: p53-negative but aneuploid myxoid leiomyosarcoma of the uterus. _Eur. J. Obstet. Gynecol.

Reprod. Biol._ 100, 96–99 (2001). Article CAS PubMed Google Scholar * Tanaka, A. et al. Case report of a primary ovarian leiomyosarcoma diagnosed by H-caldesmon staining. _J.Clin.

Gynecol. Obstet_. 7, 26–29 (2018). * Macciò, A. et al. Hemoglobin levels correlate with interleukin-6 levels in patients with advanced untreated epithelial ovarian cancer: role of

inflammation in cancer-related anemia. _Blood_ 106, 362–367 (2005). Article PubMed CAS Google Scholar * Clinkard, D. J., Khalifa, M., Osborned, R. J. & Bouffet, E. Successful

management of medulloblastoma arising in an immature ovarian teratoma in pregnancy. _Gynecologic Oncol._ 120, 311–312 (2011). Article CAS Google Scholar * Hearle, N. et al. Contribution

of germline mutations in BRCA2, P16 INK4A, P14 ARF and P15 to uveal melanoma. _Invest. Ophthalmol. Vis. Sci._ 44, 458–462 (2003). Article PubMed Google Scholar * Vaz, F. et al. Mutation

of the RAD51C gene in a Fanconi anemia-like disorder. _Nat. Genet._ 42, 406–409 (2010). Article CAS PubMed Google Scholar * Futagawa, M. et al. Retroperitoneal leiomyosarcoma in a female

patient with a germline splicing variant RAD51D c.904-2A > T: a case report. _Hered. Cancer Clin. Pract._ 19, 48 (2021). Article CAS PubMed PubMed Central Google Scholar * Lorenzo,

D. et al. Role of endoscopic ultrasound in the screening and follow-up of high-risk individuals for familial pancreatic cancer. _World J. Gastroenterol._ 25, 5082–5096 (2019). Article

PubMed PubMed Central Google Scholar * The UniProt, C. UniProt: the universal protein knowledgebase in 2021. _Nucleic Acids Res._ 49, D480–D489 (2021). Article CAS Google Scholar *

Klug, A. The discovery of zinc fingers and their applications in gene regulation and genome manipulation. _Annu. Rev. Biochem._ 79, 213–231 (2010). Article CAS PubMed Google Scholar *

Irwin, J. J. & Shoichet, B. K. ZINC-a free database of commercially available compounds for virtual screening. _J. Chem. Inf. Model._ 45, 177–182 (2005). Article CAS PubMed PubMed

Central Google Scholar * Hassanzadeh, O. et al. Causal knowledge extraction through large-scale text mining. _Proc. AAAI Conf. Artif. Intell._ 34, 13610–13611 (2020). Google Scholar *

Wishart, D. S. et al. DrugBank 5.0: a major update to the DrugBank database for 2018. _Nucleic Acids Res._ 46, D1074–D1082 (2018). Article CAS PubMed Google Scholar * Zhou, H., Cao, H.

& Skolnick, J. FINDSITEcomb2.0: a new approach for virtual ligand screening of proteins and virtual target screening of biomolecules. _J. Chem. Inf. Model._ 58, 2343–2354 (2018). Article

CAS PubMed PubMed Central Google Scholar * Fisher, R. A. On the interpretation of χ2 from contingency tables, and the calculation of P. _J. R. Stat. Soc._ 85, 87–94 (1922). Article

Google Scholar * Mehta, C. R. & Patel, N. R. ALGORITHM 643: FEXACT: a FORTRAN subroutine for Fisher’s exact test on unordered r×c contingency tables. _ACM Trans. Math. Softw._ 12,

154–161 (1986). Article Google Scholar * Uhlén, M. et al. Tissue-based map of the human proteome. _Science_ 347, 1260419 (2015). Article PubMed CAS Google Scholar * Goeman, J. J. &

Bühlmann, P. Analyzing gene expression data in terms of gene sets: methodological issues. _Bioinformatics_ 23, 980–987 (2007). Article CAS PubMed Google Scholar * Benjamini, Y. &

Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. _J. R. Stat. Soc. Ser. B_ 57, 289–300 (1995). Google Scholar * Hintzsche, J. D.,

Robinson, W. A. & Tan, A. C. A survey of computational tools to analyze and interpret whole exome sequencing data. _Int. J. Genomics_ 2016, 7983236 (2016). Article PubMed PubMed

Central CAS Google Scholar * NCI-60 human tumor cell lines screen. https://dtp.cancer.gov/discovery_development/nci-60/. Download references ACKNOWLEDGEMENTS This project was funded by

R35GM118039 of the Division of General Medical Sciences of the NIH. AUTHOR INFORMATION Author notes * These authors contributed equally: Courtney Astore, Hongyi Zhou. AUTHORS AND

AFFILIATIONS * Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, 30332, USA Courtney Astore, Hongyi Zhou, Bartosz

Ilkowski, Jessica Forness & Jeffrey Skolnick Authors * Courtney Astore View author publications You can also search for this author inPubMed Google Scholar * Hongyi Zhou View author

publications You can also search for this author inPubMed Google Scholar * Bartosz Ilkowski View author publications You can also search for this author inPubMed Google Scholar * Jessica

Forness View author publications You can also search for this author inPubMed Google Scholar * Jeffrey Skolnick View author publications You can also search for this author inPubMed Google

Scholar CONTRIBUTIONS C.A., H.Z., and J.S. conceived of the method; C.A. and H.Z. implemented the method; H.Z., C.A., and J.S. analyzed the data and wrote the paper. C.A., B.I., and J.F.

created and implemented the webpage and web service. CORRESPONDING AUTHOR Correspondence to Jeffrey Skolnick. ETHICS DECLARATIONS COMPETING INTERESTS The authors declare no competing

interests. PEER REVIEW PEER REVIEW INFORMATION _Communications Biology_ thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editors:

Eirini Marouli and Gene Chong. ADDITIONAL INFORMATION PUBLISHER’S NOTE Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION DESCRIPTION OF ADDITIONAL SUPPLEMENTARY FILES SUPPLEMENTARY DATA 1 REPORTING SUMMARY RIGHTS AND PERMISSIONS OPEN ACCESS This article is

licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give

appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in

this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative

Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a

copy of this license, visit http://creativecommons.org/licenses/by/4.0/. Reprints and permissions ABOUT THIS ARTICLE CITE THIS ARTICLE Astore, C., Zhou, H., Ilkowski, B. _et al._ LeMeDISCO

is a computational method for large-scale prediction & molecular interpretation of disease comorbidity. _Commun Biol_ 5, 870 (2022). https://doi.org/10.1038/s42003-022-03816-9 Download

citation * Received: 30 September 2021 * Accepted: 08 August 2022 * Published: 25 August 2022 * DOI: https://doi.org/10.1038/s42003-022-03816-9 SHARE THIS ARTICLE Anyone you share the

following link with will be able to read this content: Get shareable link Sorry, a shareable link is not currently available for this article. Copy to clipboard Provided by the Springer

Nature SharedIt content-sharing initiative

Lemedisco is a computational method for large-scale prediction & molecular interpretation of disease comorbidity

Lemedisco is a computational method for large-scale prediction & molecular interpretation of disease comorbidity

Play all audios: