Long-Read Assembly and Annotation of Rat Genomes that are Important Models of Complex Genetic Disease

Grants and Contracts Details


The rat has long been a mainstay of investigations into organism-level biology and integrated systems physiology. Its utility has been increased by the development of disease models in rats that take advantage of naturally occurring genetic variation, similar to the variation known to drive disease risk in humans. This variation has been fixed in a variety of inbred models. Outbred models have also been developed to further advance understanding of behavior and biology. In particular, features of rat biology arising from its socialized nature and its ability to model both human addictive behavior and other disorders of brain function have made it remarkably useful. Despite this widespread use of rats in biomedical and behavioral research, and although these biological, behavioral and disease characteristics arise from genetic variation within rats, the resources available to study rat genetics have been markedly delayed in their development. The current rat reference genome was produced using genomic DNA from the inbred Brown Norway rat. While many advances in understanding the overall structure and evolution of the rat genome have resulted, the Brown Norway rat is genetically distant from the majority of rat models used in research. Furthermore, the existence of a single reference genome from this strain means that the many rat models in which both affected and control or reference strains have been developed cannot be contrasted directly with each other, but only by comparison of each with the reference genome. This leads to the confounding situation in which differences between the affected and control strains are often unknown, since the scope of analysis is limited to genomic content common across all three strains. In particular, biological differences that arise from structural variation within the genome cannot be discovered by comparing single nucleotide polymorphism differences with the reference.
Likewise, alignment of short-read whole-genome sequence data with the reference does not uncover these structural differences. However, structural differences are known to be abundant, and they are biologically important both because of this abundance and because they span functional regions or can otherwise affect their activity. These are not well interrogated by SNP analyses performed on short-read alignments. Recent advances in long-read sequencing now allow de novo assembly of entire mammalian genomes to be performed with great accuracy, with excellent contiguity and at comparatively low cost. The rat research community requires new genomic tools that target the specific rat strains in widespread use so that current impediments to resolving the genetic basis of important biological traits can be overcome. Here we propose to generate de novo genome sequences for 9 of the most commonly used inbred rat strains. We will obtain high-coverage PacBio long-read sequences for each genome. We will assemble these genomes and ensure that very high base-level accuracy is obtained. We will scaffold these assemblies using proximity ligation methods (Hi-C). Finally, we will embark on a new annotation effort that will employ long-read sequencing of mRNAs, which will provide outstanding information not only about gene expression but also about alternative splicing of expressed genes. The present project seeks to determine the extent of genetic variation in IGH in an inbred animal (rat) model of disease in which IGH genetic variation has been proven to affect disease risk. Our objectives are:
1) To generate complete long-read-based de novo genome sequence assemblies of 9 inbred rat strains: WKY, SHR, SHRSP, Dahl SS, Dahl SR, Lyon Hypertensive, Lyon Normotensive, Fawn Hooded Hypertensive and Fawn Hooded Normotensive.
2) To polish these assemblies with short-read Illumina sequencing to achieve overall base-level accuracy of >99.995% and to scaffold the polished assemblies using Arima Hi-C chromosome conformation capture sequencing libraries. The scaffolded assemblies will have near-chromosome-level contiguity and will be deposited for investigator access in the NIH-funded Rat Genome Database.
3) To develop and apply a contemporary annotation pipeline to annotate the rat genome. This will include annotation with existing gene models used by Ensembl and will add novelty by acquiring long-read PacBio Iso-Seq data that will provide a much-enriched collection of alternative splicing data across 10 tissues.
Effective start/end date: 7/12/21 to 4/30/26


  • University of Texas Health Science Center at Houston: $353,498.00

