TY - JOUR
T1 - Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight
AU - Ebbert, Mark T.W.
AU - Jensen, Tanner D.
AU - Jansen-West, Karen
AU - Sens, Jonathon P.
AU - Reddy, Joseph S.
AU - Ridge, Perry G.
AU - Kauwe, John S.K.
AU - Belzil, Veronique
AU - Pregent, Luc
AU - Carrasquillo, Minerva M.
AU - Keene, Dirk
AU - Larson, Eric
AU - Crane, Paul
AU - Asmann, Yan W.
AU - Ertekin-Taner, Nilufer
AU - Younkin, Steven G.
AU - Ross, Owen A.
AU - Rademakers, Rosa
AU - Petrucelli, Leonard
AU - Fryer, John D.
N1 - Publisher Copyright: © 2019 The Author(s).
PY - 2019/5/20
Y1 - 2019/5/20
AB - Background: The human genome contains "dark" gene regions that cannot be adequately assembled or aligned using standard short-read sequencing technologies, preventing researchers from identifying mutations within these gene regions that may be relevant to human disease. Here, we identify regions with few mappable reads that we call dark by depth, and others that have ambiguous alignment, called camouflaged. We assess how well long-read or linked-read technologies resolve these regions. Results: Based on standard whole-genome Illumina sequencing data, we identify 36,794 dark regions in 6054 gene bodies from pathways important to human health, development, and reproduction. Of these gene bodies, 8.7% are completely dark and 35.2% are ≥ 5% dark. We identify dark regions that are present in protein-coding exons across 748 genes. Linked-read or long-read sequencing technologies from 10x Genomics, PacBio, and Oxford Nanopore Technologies reduce dark protein-coding regions to approximately 50.5%, 35.6%, and 9.6%, respectively. We present an algorithm to resolve most camouflaged regions and apply it to the Alzheimer's Disease Sequencing Project. We rescue a rare ten-nucleotide frameshift deletion in CR1, a top Alzheimer's disease gene, found in disease cases but not in controls. Conclusions: While we could not formally assess the association of the CR1 frameshift mutation with Alzheimer's disease due to insufficient sample-size, we believe it merits investigating in a larger cohort. There remain thousands of potentially important genomic regions overlooked by short-read sequencing that are largely resolved by long-read technologies.
KW - 10x Genomics
KW - APOE
KW - Alzheimer's Disease Sequencing Project (ADSP)
KW - CR1
KW - Camouflaged genes
KW - Dark genes
KW - Long-read sequencing
KW - Oxford Nanopore Technologies (ONT)
KW - Pacific Biosciences (PacBio)
UR - http://www.scopus.com/inward/record.url?scp=85066014432&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85066014432&partnerID=8YFLogxK
U2 - 10.1186/s13059-019-1707-2
DO - 10.1186/s13059-019-1707-2
M3 - Article
C2 - 31104630
AN - SCOPUS:85066014432
SN - 1474-7596
VL - 20
JO - Genome Biology
JF - Genome Biology
IS - 1
M1 - 97
ER -