TY - GEN
T1 - A robust method for transcript quantification with RNA-seq data
AU - Huang, Yan
AU - Hu, Yin
AU - Jones, Corbin D.
AU - MacLeod, James N.
AU - Chiang, Derek Y.
AU - Liu, Yufeng
AU - Prins, Jan F.
AU - Liu, Jinze
PY - 2012
Y1 - 2012
N2 - The advent of high-throughput RNA-seq technology allows deep sampling of the transcriptome, making it possible to characterize both the diversity and the abundance of transcript isoforms. Accurate abundance estimation, or transcript quantification, of isoforms is critical for downstream differential analysis (e.g. healthy vs. diseased cells), but remains a challenging problem for several reasons. First, while various types of algorithms have been developed for abundance estimation, short reads often do not uniquely identify the transcript isoforms from which they were sampled. As a result, the quantification problem may not be identifiable, i.e. it lacks a unique transcript solution even if each read maps uniquely to the reference genome. In this paper, we develop a general linear model for transcript quantification that leverages reads spanning multiple splice junctions to ameliorate identifiability. Second, RNA-seq reads sampled from the transcriptome exhibit unknown position-specific and sequence-specific biases. We extend our method to simultaneously learn bias parameters during transcript quantification to improve accuracy. Third, transcript quantification is often provided with a candidate set of isoforms, not all of which are likely to be significantly expressed in a given tissue type or condition. By solving the linear system with LASSO, our approach can infer an accurate set of dominantly expressed transcripts, whereas existing methods tend to assign positive expression to every candidate isoform. Using simulated RNA-seq datasets, our method demonstrated better quantification accuracy than existing methods. Applying our method to real data demonstrated experimentally that transcript quantification is effective for differential analysis of transcriptomes.
AB - The advent of high-throughput RNA-seq technology allows deep sampling of the transcriptome, making it possible to characterize both the diversity and the abundance of transcript isoforms. Accurate abundance estimation, or transcript quantification, of isoforms is critical for downstream differential analysis (e.g. healthy vs. diseased cells), but remains a challenging problem for several reasons. First, while various types of algorithms have been developed for abundance estimation, short reads often do not uniquely identify the transcript isoforms from which they were sampled. As a result, the quantification problem may not be identifiable, i.e. it lacks a unique transcript solution even if each read maps uniquely to the reference genome. In this paper, we develop a general linear model for transcript quantification that leverages reads spanning multiple splice junctions to ameliorate identifiability. Second, RNA-seq reads sampled from the transcriptome exhibit unknown position-specific and sequence-specific biases. We extend our method to simultaneously learn bias parameters during transcript quantification to improve accuracy. Third, transcript quantification is often provided with a candidate set of isoforms, not all of which are likely to be significantly expressed in a given tissue type or condition. By solving the linear system with LASSO, our approach can infer an accurate set of dominantly expressed transcripts, whereas existing methods tend to assign positive expression to every candidate isoform. Using simulated RNA-seq datasets, our method demonstrated better quantification accuracy than existing methods. Applying our method to real data demonstrated experimentally that transcript quantification is effective for differential analysis of transcriptomes.
KW - RNA-seq
KW - Transcript quantification
KW - Transcriptome
UR - http://www.scopus.com/inward/record.url?scp=84860803772&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84860803772&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-29627-7_12
DO - 10.1007/978-3-642-29627-7_12
M3 - Conference contribution
AN - SCOPUS:84860803772
SN - 9783642296260
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 127
EP - 147
BT - Research in Computational Molecular Biology - 16th Annual International Conference, RECOMB 2012, Proceedings
T2 - 16th Annual International Conference on Research in Computational Molecular Biology, RECOMB 2012
Y2 - 21 April 2012 through 24 April 2012
ER -