
Systematic dissection of biases in whole-exome and whole-genome sequencing reveals major determinants of coding sequence coverage
- Select a language for the TTS:
- UK English Female
- UK English Male
- US English Female
- US English Male
- Australian Female
- Australian Male
- Language selected: (auto detect) - EN
Play all audios:
Advantages and diagnostic effectiveness of the two most widely used resequencing approaches, whole exome (WES) and whole genome (WGS) sequencing, are often debated. WES dominated large-scale
resequencing projects because of lower cost and easier data storage and processing. Rapid development of 3rd generation sequencing methods and novel exome sequencing kits predicate the need
for a robust statistical framework allowing informative and easy performance comparison of the emerging methods. In our study we developed a set of statistical tools to systematically
assess coverage of coding regions provided by several modern WES platforms, as well as PCR-free WGS. We identified a substantial problem in most previously published comparisons which did
not account for mappability limitations of short reads. Using regression analysis and simple machine learning, as well as several novel metrics of coverage evenness, we analyzed the
contribution from the major determinants of CDS coverage. Contrary to a common view, most of the observed bias in modern WES stems from mappability limitations of short reads and exome probe
design rather than sequence composition. We also identified the ~ 500 kb region of human exome that could not be effectively characterized using short read technology and should receive
special attention during variant analysis. Using our novel metrics of sequencing coverage, we identified main determinants of WES and WGS performance. Overall, our study points out avenues
for improvement of enrichment-based methods and development of novel approaches that would maximize variant discovery at optimal cost.
Next-generation sequencing (NGS) is rapidly becoming an invaluable tool in human genetics research and clinical diagnostics1,2,3. Practical use of NGS methods has dramatically increased with
the development of targeted sequencing approaches, such as whole-exome sequencing (WES) or targeted sequencing of gene panels. WES emerged as an efficient alternative to whole-genome
sequencing (WGS) due to both lower sequencing cost and simplification of variant analysis and data storage4. More than 80% of all variants reported in ClinVar, and more than 89% of variants
reported to be pathogenic, come from the protein-coding part of the genome; this number increases to 99% when immediate CDS vicinity is included. Even allowing for the sampling bias, there
is an overall agreement that most heritable diseases appear to be caused by alterations in the protein-coding regions of the genome. Given this, WES has dominated the projects characterizing
human genome variation as well as clinical applications.
The pioneering 1000 Genomes project5 could not statistically characterize many of the rare variants critical to diagnostics of Mendelian disease due to a limited sample size. In an attempt
to get a representative picture of protein-coding variation in human population, 6,500 WES samples were sequenced during ESP6500 project6. When a much larger reference set of 60,706 WES
experiments was compiled and uniformly processed by the Exome Aggregation Consortium (ExAC)7, it dramatically increased the accuracy of allelic frequency (AF) estimation in general
population. This led to a surprising conclusion that up to 90% of variants reported as causative for Mendelian disease in ClinVar database are observed too often in healthy controls to
directly cause disease7. The number of available WES experiments is rapidly increasing, and the latest Genome Aggregation Database (gnomAD) collection includes 123,136 WES experiments
alongside with 15,496 WGS. Such impressive number of profiled individuals allows a much more thorough look at human coding genome variation, leading to many useful applications such as
estimation of selective pressure across protein-coding regions8.
Several published studies have concentrated on comparing the performance of different exome capture technologies, or comparison between WES and WGS. With the emergence of commercial exome
kits, three major manufacturers - Agilent, Illumina, and Nimblegen (Roche) - have become popular among users, representing the majority of all published WES studies. Early comparative
studies have focused on comparison of target intervals of various exome kits, and identified several important biases inherent to WES technology, such as coverage biases in regions with very
high or low GC content9,10,11. A later study comprised most hybridization-based capture technologies available at the time12, and showed specific features of each of the four exome kits,
including GC-content bias and differences in the distribution of coverage. Similar observations were made in one of the most recent comparative studies13. However, these and other earlier
works on the topic included very limited number of samples, often with large variation of sequencing depth, which may have interfered with consistent platform comparison. Only one of the
recent studies included larger amount of samples that allowed to identify tendencies in cross-sample coverage unevenness14.
It is often assumed that WGS offers more uniform coverage of CDS regions due to the nature of hybridization-based enrichment process used in WES. Such differences in coverage evenness
increase the costs of effective per-base coverage in WES, questioning the overall benefit from using WES instead of WGS. Hence, the issue of WES/WGS comparison has been addressed by several
studies that sought out optimal sequencing method to achieve maximum coverage of the protein-coding regions of the genome (listed in Supplementary Table S1). One of these included Agilent
and Nimblegen (Roche) WES capture technologies, that were compared with the conventional WGS approach in terms of resulting coverage per sequencing read and the efficiency of clinically
significant SNV detection15. Similarly to earlier studies9,10, it was found that WES achieves similar percentage of well-covered CDS bases only when the average coverage is 2–3 times higher,
and with a substantial sequence bias. In several more recent studies, it was repeatedly stated that WGS provides more even and unbiased coverage of coding regions and generates more
accurate variant calls13,16,17.
It is already well understood that long-read sequencing dramatically increases the power and accuracy of complex variant discovery in the human genome18. With rapid development of 3rd
generation sequencing technologies, long-read resequencing of human genomes becomes an attractive and increasingly realistic option. For example, the highest throughput Oxford Nanopore
device, PromethION, is expected to generate 30x long-read coverage of human genome for less than $1000. A recent publication has highlighted limitations of short-read technologies,
identifying “dark” regions in the protein-coding parts of the genome, including numerous disease-causing genes19. At the same time, it is unclear what combination of methods would allow the
best effective coverage for regions of interest. There is a defined need for a robust statistical framework that would allow accurate evaluation of method performance on the level of
coverage and before variant identification. Our study describes such framework, and uses it to find important determinants of coding sequence coverage in the human genome.
We started off by characterizing the efficiency of CDS interval coverage by current WES and WGS technologies. It is important to note that most modern variant calling tools ignore reads with
mapping quality (MQ) less than 10 and reads marked as PCR or optical duplicates; thus, such reads were removed when calculating coverage. All WES samples irrespective of the platform showed
50–70% efficiency of target enrichment and a similar distribution of sequencing depths across our WES dataset (Fig. 1a,b), corresponding to 38 ± 5 fold enrichment of target regions
(Supplementary Fig. S1). Interestingly, we observed a weak trend showing that libraries having higher depth of sequencing tend to show less efficient exome enrichment. The strength of the
trend depends on the particular technology: for SureSelect and TruSeq Exome kits the trend is almost absent (R2 = 0.034 and R2 = 0.026, respectively), while for MedExome and Nextera Rapid
the dependence is much more pronounced (R2 = 0.360 and R2 = 0.815) (Fig. 1c).
Coverage of target regions across WES and WGS samples. (a,b,d) Total read depth (a), target enrichment efficiency (b) and mean CDS coverage (d) for all samples for each platform. (c) A
scatterplot of enrichment efficiency plotted against total read depth. Lines are linear regression fits with 95% confidence intervals indicated as grey envelopes. (e) The distribution of the
normalized coverage for all WES technologies compared to WGS. Dotted line represents ideal case baseline, i.e. all bases covered at mean value. (f) Overall evenness (OE) scores for all four
WES technologies and WGS. Red points indicate WGS samples obtained from open sources, while grey points represent our dataset. For plots (e,f) a subset of 10 samples with similar mean
coverages was selected for all WES platforms.
Mean coverage of CDS regions in our dataset was comparable among different WES technologies (~ 70x), with exception for SureSelect, that had mean coverage of ~120x (Fig. 1d). We then
calculated profiles of normalized coverage across CDS bases (Fig. 1e). In order to characterize the overall evenness of CDS coverage (OE), we have used the score developed by Mokry et al.20
(see Methods). Normalized coverage profiles and OE scores showed that both Illumina kits perform significantly worse than SureSelect and MedExome, while all exome platforms provided less
even coverage than PCR-free WGS (Fig. 1e,f).
To dissect potential sources of coverage bias we defined two possible components of coverage evenness: coverage distribution between different CDS regions (between-interval evenness, BIE),
and uniformity of coverage within individual intervals (within-interval evenness, WIE). The latter type of coverage unevenness is inherent to WES platforms; hence, we first questioned
whether it explains the difference between WES and WGS in the OE scores. Indeed, visual inspection of coverage profiles on individual CDS regions suggests that exome platforms highly vary in
WIE (Fig. 2a). To more accurately assess the observed differences, we calculated average profiles of relative coverages and WIE scores for all CDS regions (Fig. 2b). We found that WIE
scores are well correlated with the OE (Fig. 2c), however, WIE does not completely explain differences observed in Fig. 1f. Similar results were obtained by calculation of WIE profiles
across CDS intervals including flanking regions and with exon length stratification (Supplementary Fig. S2). As anticipated, WGS did not show any noticeable within-interval unevenness,
confirming that such type of coverage bias is specific to exome sequencing. Since our results suggested WIE is not the only source of increased coverage bias in WES, we next calculated
profiles of relative mean interval coverage across all CDS regions to estimate BIE (Fig. 2d). We observed that, while WGS generally performed better than WES, exome platforms showed a
distinct pattern of between-group differences (Fig. 2e) that explains the discrepancy between OE and WIE scores, implying that the overall coverage evenness is the product of both BIE and
WIE.
Different technologies exhibit specific patterns of coverage within exons and differ coverage distribution within exons. For all plots, a subset of samples was used as described earlier. (a)
Example of sequencing coverage patterns across exons of the HNRNPD gene. Selected samples with similar mean CDS coverage are shown. (b) Distribution of relative coverage from the start to
the end of target interval, averaged over all CDS regions. (c) Within-interval coverage evenness (WIE) values calculated from distributions shown in (b) (see Methods for more details; all
capture technologies differ in pairwise U-test with Holm-Bonferroni FDR correction (adjusted p-value