Abstract
The repetitive nature and complexity of some medically relevant genes pose a challenge for their accurate analysis in a clinical setting. The Genome in a Bottle Consortium has provided variant benchmark sets, but these exclude nearly 400 medically relevant genes due to their repetitiveness or polymorphic complexity. Here, we characterize 273 of these 395 challenging autosomal genes using a haplotype-resolved whole-genome assembly. This curated benchmark reports over 17,000 single-nucleotide variations, 3,600 insertions and deletions and 200 structural variations each for human genome reference GRCh37 and GRCh38 across HG002. We show that false duplications in either GRCh37 or GRCh38 result in reference-specific, missed variants for short- and long-read technologies in medically relevant genes, including CBS, CRYAA and KCNE1. When masking these false duplications, variant recall can improve from 8% to 100%. Forming benchmarks from a haplotype-resolved whole-genome assembly may become a prototype for future benchmarks covering the whole genome.
| Original language | English |
|---|---|
| Pages (from-to) | 672-680 |
| Number of pages | 9 |
| Journal | Nature Biotechnology |
| Volume | 40 |
| Issue number | 5 |
| DOIs | |
| State | Published - May 2022 |
Bibliographical note
Publisher Copyright: © 2022, This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply.
Funding
We thank the Genome Reference Consortium for their curation efforts of GRCh37 and GRCh38 (https://www.genomereference.org), especially V.A. Schneider and P.A. Kitts from the National Institutes of Health (NIH)/NCBI for developing the falsely duplicated regions that should be masked in GRCh38. We thank S. Miller at NIST for helping make available benchmark sets and READMEs. Certain commercial equipment, instruments or materials are identified to adequately specify experimental conditions or reported results. Such identification does not imply recommendation or endorsement by NIST, nor does it imply that the equipment, instruments or materials identified are necessarily the best available for the purpose. C.F. was funded by Instituto de Salud Carlos III (PI20/00876) and Ministerio de Ciencia e Innovación (RTC-2017-6471-1; AEI/FEDER, UE), cofinanced by the European Regional Development Fund ‘A Way of Making Europe’ from the European Union, and Cabildo Insular de Tenerife (CGIEU0000219140). J.M.L.-S. was funded by Consejería de Educación-Gobierno de Canarias and Cabildo Insular de Tenerife (BOC 163, 24/08/2017). F.J.S. and M.M. were supported by the NIH (UM1 HG008898). C.X. was supported by the Intramural Research Program of the National Library of Medicine, NIH. K.H.M. was supported by the NIH/National Human Genome Research Institute (R01 1R01HG011274-01 and U01 1U01HG010971). H.L. was supported by the NIH (R01 HG010040 and U01 HG010961). C.E.M. thanks funding from the WorldQuant Foundation, NASA (NNX14AH50G), the National Institutes of Health (R01MH117406, R01CA249054, R01AI151059, P01CA214274) and the Leukemia and Lymphoma Society (LLS) (MCL7001-18, LLS 9238-16, LLS-MCL7001-18).
| Funders | Funder number |
|---|---|
| Cabildo Insular de Tenerife | CGIEU0000219140 |
| Consejería de Educación-Gobierno de Canarias and Cabildo Insular de Tenerife | BOC 163, 24/08/2017 |
| NCBI | |
| National Institutes of Health (NIH) | UM1 HG008898, R01MH117406, R01CA249054, R01AI151059, P01CA214274 |
| National Human Genome Research Institute | U01 1U01HG010971, U01 HG010961, R01 1R01HG011274-01, R01 HG010040 |
| U.S. National Library of Medicine | |
| National Aeronautics and Space Administration | NNX14AH50G |
| National Institute of Standards and Technology | |
| Leukemia and Lymphoma Society | MCL7001-18, LLS 9238-16, LLS-MCL7001-18 |
| WorldQuant Foundation | |
| European Commission | |
| Instituto de Salud Carlos III | PI20/00876 |
| Ministerio de Ciencia, Innovación y Universidades | RTC-2017-6471-1 |
| European Regional Development Fund | |
ASJC Scopus subject areas
- Biotechnology
- Bioengineering
- Applied Microbiology and Biotechnology
- Biomedical Engineering
- Molecular Medicine