Curated variation benchmarks for challenging medically relevant autosomal genes

Justin Wagner, Nathan D. Olson, Lindsay Harris, Jennifer McDaniel, Haoyu Cheng, Arkarachai Fungtammasan, Yih Chii Hwang, Richa Gupta, Aaron M. Wenger, William J. Rowell, Ziad M. Khan, Jesse Farek, Yiming Zhu, Aishwarya Pisupati, Medhat Mahmoud, Chunlin Xiao, Byunggil Yoo, Sayed Mohammad Ebrahim Sahraeian, Danny E. Miller, David Jáspez, José M. Lorenzo-Salazar, Adrián Muñoz-Barrera, Luis A. Rubio-Rodríguez, Carlos Flores, Giuseppe Narzisi, Uday Shanker Evani, Wayne E. Clarke, Joyce Lee, Christopher E. Mason, Stephen E. Lincoln, Karen H. Miga, Mark T.W. Ebbert, Alaina Shumate, Heng Li, Chen Shan Chin, Justin M. Zook, Fritz J. Sedlazeck

Research output: Contribution to journal > Article > peer-review

128 Scopus citations

Abstract

The repetitive nature and complexity of some medically relevant genes pose a challenge for their accurate analysis in a clinical setting. The Genome in a Bottle Consortium has provided variant benchmark sets, but these exclude nearly 400 medically relevant genes due to their repetitiveness or polymorphic complexity. Here, we characterize 273 of these 395 challenging autosomal genes using a haplotype-resolved whole-genome assembly. This curated benchmark reports over 17,000 single-nucleotide variations, 3,600 insertions and deletions and 200 structural variations each for human genome reference GRCh37 and GRCh38 across HG002. We show that false duplications in either GRCh37 or GRCh38 result in reference-specific, missed variants for short- and long-read technologies in medically relevant genes, including CBS, CRYAA and KCNE1. When masking these false duplications, variant recall can improve from 8% to 100%. Forming benchmarks from a haplotype-resolved whole-genome assembly may become a prototype for future benchmarks covering the whole genome.

Original language: English
Pages (from-to): 672-680
Number of pages: 9
Journal: Nature Biotechnology
Volume: 40
Issue number: 5
State: Published - May 2022

Bibliographical note

Publisher Copyright:
© 2022, This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply.

Funding

We thank the Genome Reference Consortium for their curation efforts of GRCh37 and GRCh38 (https://www.genomereference.org), especially V.A. Schneider and P.A. Kitts from the National Institutes of Health (NIH)/NCBI for developing the falsely duplicated regions that should be masked in GRCh38. We thank S. Miller at NIST for helping make available benchmark sets and READMEs. Certain commercial equipment, instruments or materials are identified to adequately specify experimental conditions or reported results. Such identification does not imply recommendation or endorsement by NIST, nor does it imply that the equipment, instruments or materials identified are necessarily the best available for the purpose. C.F. was funded by Instituto de Salud Carlos III (PI20/00876) and Ministerio de Ciencia e Innovación (RTC-2017-6471-1; AEI/FEDER, UE), cofinanced by the European Regional Development Fund ‘A Way of Making Europe’ from the European Union, and Cabildo Insular de Tenerife (CGIEU0000219140). J.M.L.-S. was funded by Consejería de Educación-Gobierno de Canarias and Cabildo Insular de Tenerife (BOC 163, 24/08/2017). F.J.S. and M.M. were supported by the NIH (UM1 HG008898). C.X. was supported by the Intramural Research Program of the National Library of Medicine, NIH. K.H.M. was supported by the NIH/National Human Genome Research Institute (R01 1R01HG011274-01 and U01 1U01HG010971). H.L. was supported by the NIH (R01 HG010040 and U01 HG010961). C.E.M. thanks funding from the WorldQuant Foundation, NASA (NNX14AH50G), the National Institutes of Health (R01MH117406, R01CA249054, R01AI151059, P01CA214274) and the Leukemia and Lymphoma Society (LLS) (MCL7001-18, LLS 9238-16, LLS-MCL7001-18).

Funders | Funder number
Cabildo Insular de Tenerife | CGIEU0000219140
Consejería de Educación-Gobierno de Canarias and Cabildo Insular de Tenerife | BOC 163, 24/08/2017
NCBI |
National Institutes of Health (NIH) | UM1 HG008898, R01MH117406, R01CA249054, R01AI151059, P01CA214274
National Human Genome Research Institute | R01 1R01HG011274-01, U01 1U01HG010971, R01 HG010040, U01 HG010961
U.S. National Library of Medicine |
National Aeronautics and Space Administration | NNX14AH50G
National Institute of Standards and Technology |
Leukemia and Lymphoma Society | MCL7001-18, LLS 9238-16, LLS-MCL7001-18
WorldQuant Foundation |
European Commission |
Instituto de Salud Carlos III | PI20/00876
Ministerio de Ciencia, Innovación y Universidades | RTC-2017-6471-1
European Regional Development Fund |

    ASJC Scopus subject areas

    • Biotechnology
    • Bioengineering
    • Applied Microbiology and Biotechnology
    • Biomedical Engineering
    • Molecular Medicine
