Reference Free Query of High Throughput Genetic Data in a Distributed Environment

Grants and Contracts Details


The student will develop and provide the genomics workflow outlined in the Project Description. This workflow will be developed using open-source software and made publicly available via its own GitHub repository and a Docker image provided at the Docker Registry. The GitHub repository should include the Dockerfile, citations for the software used, installation instructions, step-by-step usage instructions, and a worked example with input and output files. The student is also expected to be available to answer questions about the workflow and documentation. The recipient and student are expected to abide by the guidelines outlined in the project Data Management Plan. The student will not be supervised by University of Arizona employees, but it is expected that their graduate mentor will be able to provide any additional technical advice or support.

Project Description: My lab is beginning to work on the software and interfaces necessary for an infrastructure that can support "Process once, query forever" analysis of high throughput sequence (HTS) data. To that end, a master's student in my lab, Mr. Kai Li, will work at 50% effort for the Fall 2023 semester to set up a query server that will allow rapid querying of the sequence data we have collected, both whole-genome sequence and RNA-Seq. The work will specifically involve the construction of k-mer indices on each HTS dataset. The most time-consuming step of a query is loading the k-mer index into memory. In queries run in our lab, loading the index takes 5 of a total of 6 minutes when querying 1.3M k-mers. To eliminate this wait, we will set up server software on a server equipped with 56 TB of solid-state storage capable of effectively maintaining the k-mer indices in memory. Once completed, we will have a server that can be queried repeatedly, in seconds per query, to genotype individual whole-genome sequences or, similarly, to calculate transcription levels in RNA-Seq datasets.
This work will provide a strong foundation for work that will be transformative by creating a distributed network of query engines of HTS data, thus providing truly FAIR access to these data.
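The "process once, query forever" idea can be illustrated with a minimal sketch, assuming a plain in-memory dictionary of k-mer counts. This is not the lab's software: `build_kmer_index` and `KmerQueryService` are hypothetical names, the reads are toy data, and a production index would likely use a compact on-disk structure rather than a Python `Counter`. The point is only that the index is built and loaded once, and every later query is a set of lookups.

```python
from collections import Counter


def build_kmer_index(reads, k):
    """Count every substring of length k (k-mer) across all reads."""
    index = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            index[read[i:i + k]] += 1
    return index


class KmerQueryService:
    """Hypothetical long-running service that holds one index in memory."""

    def __init__(self, index):
        # The expensive load happens once, when the service starts.
        self.index = index

    def query(self, kmers):
        # Each query is a series of lookups; absent k-mers report 0.
        return {kmer: self.index.get(kmer, 0) for kmer in kmers}


# Toy data standing in for one HTS dataset (an assumption for illustration).
reads = ["ACGTAC", "CGTACG"]
index = build_kmer_index(reads, k=3)
service = KmerQueryService(index)

print(service.query(["CGT", "TAC", "GGG"]))
# {'CGT': 2, 'TAC': 2, 'GGG': 0}
```

Because the service object outlives any single request, repeated calls to `query` never pay the load cost again, which is the step reported above as taking 5 of 6 minutes per query.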
Effective start/end date: 8/25/23 to 12/31/23


  • University of Arizona: $14,054.00

