Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight

Mark T.W. Ebbert, Tanner D. Jensen, Karen Jansen-West, Jonathon P. Sens, Joseph S. Reddy, Perry G. Ridge, John S.K. Kauwe, Veronique Belzil, Luc Pregent, Minerva M. Carrasquillo, Dirk Keene, Eric Larson, Paul Crane, Yan W. Asmann, Nilufer Ertekin-Taner, Steven G. Younkin, Owen A. Ross, Rosa Rademakers, Leonard Petrucelli, John D. Fryer

Research output: Contribution to journal › Article › peer-review

107 Scopus citations

Abstract

Background: The human genome contains "dark" gene regions that cannot be adequately assembled or aligned using standard short-read sequencing technologies, preventing researchers from identifying mutations within these gene regions that may be relevant to human disease. Here, we identify regions with few mappable reads that we call dark by depth, and others that have ambiguous alignment, called camouflaged. We assess how well long-read or linked-read technologies resolve these regions.

Results: Based on standard whole-genome Illumina sequencing data, we identify 36,794 dark regions in 6054 gene bodies from pathways important to human health, development, and reproduction. Of these gene bodies, 8.7% are completely dark and 35.2% are ≥ 5% dark. We identify dark regions that are present in protein-coding exons across 748 genes. Linked-read or long-read sequencing technologies from 10x Genomics, PacBio, and Oxford Nanopore Technologies reduce dark protein-coding regions to approximately 50.5%, 35.6%, and 9.6%, respectively. We present an algorithm to resolve most camouflaged regions and apply it to the Alzheimer's Disease Sequencing Project. We rescue a rare ten-nucleotide frameshift deletion in CR1, a top Alzheimer's disease gene, found in disease cases but not in controls.

Conclusions: While we could not formally assess the association of the CR1 frameshift mutation with Alzheimer's disease due to insufficient sample size, we believe it merits investigation in a larger cohort. There remain thousands of potentially important genomic regions overlooked by short-read sequencing that are largely resolved by long-read technologies.
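The distinction the abstract draws between "dark by depth" (few mappable reads) and "camouflaged" (ambiguous alignment) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the function names and the three thresholds below are hypothetical, chosen only for illustration, and low mapping quality (MAPQ) is used here as a simple stand-in for ambiguous alignment.

```python
from typing import List

# Illustrative thresholds only (assumptions, not stated in the abstract).
MAX_DARK_DEPTH = 5        # positions covered by this many reads or fewer count as dark by depth
LOW_MAPQ = 10             # a read with MAPQ below this is treated as ambiguously aligned
LOW_MAPQ_FRACTION = 0.9   # fraction of low-MAPQ reads needed to call a position camouflaged


def classify(mapqs: List[int]) -> str:
    """Classify one genomic position from the MAPQ values of the reads covering it."""
    depth = len(mapqs)
    if depth <= MAX_DARK_DEPTH:
        return "dark_by_depth"
    low = sum(1 for q in mapqs if q < LOW_MAPQ)
    if low / depth >= LOW_MAPQ_FRACTION:
        return "camouflaged"
    return "ok"


def percent_dark(positions: List[List[int]]) -> float:
    """Percentage of positions in a region (e.g. a gene body) that are dark or camouflaged."""
    if not positions:
        return 0.0
    dark = sum(1 for mapqs in positions if classify(mapqs) != "ok")
    return 100.0 * dark / len(positions)
```

Under these assumptions, a gene body would count as "≥ 5% dark" when `percent_dark` returns 5.0 or more for its positions, and "completely dark" when it returns 100.0.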

Original language: English
Article number: 97
Journal: Genome Biology
Volume: 20
Issue number: 1
DOIs
State: Published - May 20 2019

Bibliographical note

Publisher Copyright:
© 2019 The Author(s).

Funding

This work was supported by the PhRMA Foundation [RSGTMT17 to M.E.]; the Ed and Ethel Moore Alzheimer's Disease Research Program of Florida Department of Health [8AZ10 and 9AZ08 to M.E., and 6AZ06 to J.F.]; the Muscular Dystrophy Association (M.E.); the National Institutes of Health [NS094137 to J.F., AG047327 to J.F., AG049992 to J.F., NS097261 to R.R., NS097273 to L.P., NS084528 to L.P., NS084974 to L.P., NS099114 to L.P., NS088689 to L.P., NS093865 to L.P.]; Department of Defense [ALSRP AL130125 to L.P.]; Mayo Clinic Foundation (L.P. and J.F.); Mayo Clinic Center for Individualized Medicine (L.P. and J.F.); Amyotrophic Lateral Sclerosis Association (M.E., L.P.); Robert Packard Center for ALS Research at Johns Hopkins (L.P.); Target ALS (L.P.); Association for Frontotemporal Degeneration (L.P.); GHR Foundation (J.F.); and the Mayo Clinic Gerstner Family Career Development Award (J.F.).

Funders (funder number)
National Institutes of Health (NIH): NS097261, NS084528, NS097273, NS088689, AG049992, NS093865, NS084974, NS094137, AG047327, NS099114
National Institute on Aging: R01AG054076
Association for Frontotemporal Degeneration
Muscular Dystrophy Association

    Keywords

    • 10x Genomics
    • APOE
    • Alzheimer's Disease Sequencing Project (ADSP)
    • CR1
    • Camouflaged genes
    • Dark genes
    • Long-read sequencing
    • Oxford Nanopore Technologies (ONT)
    • Pacific Biosciences (PacBio)

    ASJC Scopus subject areas

    • Ecology, Evolution, Behavior and Systematics
    • Genetics
    • Cell Biology
