The mutation significance cutoff: gene-level thresholds for variant predictions

Y Itan, L Shang, B Boisson, MJ Ciancanelli, JG Markle… - Nature …, 2016 - nature.com
Nature Methods, 2016
Next-generation sequencing (NGS) identifies about 20,000 variants per exome, of which only a few may underlie genetic diseases. Variant-level methods such as PolyPhen-2 (polymorphism phenotyping version 2), SIFT (sorting intolerant from tolerant) and CADD (combined annotation-dependent depletion) attempt to predict whether a given variant is benign or deleterious 1, 2, 3. These methods are commonly interpreted in a binary manner as a means of filtering out benign variants from NGS data, with a single significance cutoff value across all genes. CADD developers propose (but do not recommend for categorical usage) a fixed cutoff value between 10 and 20 on a scale of 1 to 99, with 99 being the most deleterious. Gene-level methods, including RVIS (residual variation intolerance score, which applies combined fixed gene- and variant-level cutoffs), de novo excess and GDI (gene damage index), are also useful 4, 5, 6. However, a uniform cutoff is unlikely to be accurate genome-wide (see Supplementary Note).
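The binary, fixed-cutoff interpretation described above can be sketched as follows. This is a minimal illustration only: the gene names, the scores, and the choice of 15 as a cutoff (a value inside the 10 to 20 range mentioned in the text) are assumptions made for the example, not values taken from the paper.

```python
# Hypothetical variants as (gene, CADD-like score) pairs.
# Gene names and scores are invented for illustration only.
variants = [
    ("GENE_A", 8.2),
    ("GENE_A", 22.5),
    ("GENE_B", 14.9),
    ("GENE_B", 31.0),
]

# Assumed single genome-wide cutoff, chosen inside the 10-20 range from the text.
FIXED_CUTOFF = 15.0


def predicted_deleterious(variants, cutoff=FIXED_CUTOFF):
    """Keep variants whose score meets or exceeds one cutoff applied to all genes."""
    return [(gene, score) for gene, score in variants if score >= cutoff]


kept = predicted_deleterious(variants)
```

Every gene is held to the same threshold here, which is the practice the text argues is unlikely to be accurate genome-wide.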
Here we describe the mutation significance cutoff (MSC), a quantitative approach that provides gene-level and gene-specific phenotypic impact cutoff values to improve the use of existing variant-level methods, and a public server for utilizing it (http://lab.rockefeller.edu/casanova/MSC). We first showed that with fixed cutoffs CADD outperformed PolyPhen-2 and SIFT (Supplementary Fig. 1a). We found that 40.84% of Human Gene Mutation Database (HGMD) 7 curated disease-associated mutations were not missense (but rather nonsense, frameshift, regulatory, etc.) (Fig. 1a), contributing to low true positive predictions with PolyPhen-2 and SIFT. The 95% confidence interval (CI) of CADD scores for disease-associated mutations of a given HGMD gene overlapped, on average, with only 37.63% (41.89% median) of the 95% mutation CIs of all other HGMD genes (Fig. 1b). The CADD scores of private disease-associated mutations were significantly higher than those of non-private disease-associated mutations (P < 10^−300, Supplementary Fig. 1b), resulting in lower overall impact prediction scores when the allele frequency of a mutation was considered (Supplementary Fig. 2).
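The CI-overlap comparison above can be illustrated with a short sketch. It rests on several assumptions: the gene names and scores are invented, and the 95% CI is computed as a normal-approximation interval around the mean score, which may differ from the procedure the authors used. It does not reproduce the MSC computation itself; it only shows how one might measure how often one gene's interval overlaps those of other genes.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical CADD scores of disease-associated mutations, grouped by gene.
# Gene names and values are invented for illustration only.
scores_by_gene = {
    "GENE_A": [24.1, 26.3, 25.0, 27.8, 23.5],
    "GENE_B": [12.2, 14.9, 13.1, 15.4, 11.8],
    "GENE_C": [25.5, 28.0, 24.7, 26.9, 27.2],
}


def ci95(values):
    """Approximate 95% CI of the mean (normal approximation; an assumption)."""
    centre = mean(values)
    half_width = 1.96 * stdev(values) / sqrt(len(values))
    return (centre - half_width, centre + half_width)


def overlaps(a, b):
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlap_fraction(gene, intervals):
    """Fraction of the other genes whose interval overlaps this gene's interval."""
    others = [g for g in intervals if g != gene]
    return sum(overlaps(intervals[gene], intervals[g]) for g in others) / len(others)


intervals = {gene: ci95(values) for gene, values in scores_by_gene.items()}
fractions = {gene: overlap_fraction(gene, intervals) for gene in intervals}
```

In this toy data the interval for GENE_B overlaps neither of the others, so a single cutoff suited to GENE_A and GENE_C would be a poor fit for GENE_B. Low overlap of this kind is the motivation the text gives for gene-specific cutoffs.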