Abstract
In genetic pathway analysis and other high-dimensional data analyses, thousands or even millions of tests may be performed simultaneously. P-values from multiple tests are often presented in a negative log-transformed format. We construct a contaminated exponential mixture model for -ln(P) and propose a D_CDF test to determine whether some -ln(P) values come from tests with underlying effects. By comparing the cumulative distribution functions (CDFs) of -ln(P) under mixture models, the proposed method can detect the cumulative effect of a number of variants with small effect sizes. Weight functions and truncations can be incorporated into the D_CDF test to improve power and better control the correlation among data. By using the modified maximum likelihood estimators (MMLE), the D_CDF tests have very tractable limiting distributions under H0. A copula-based procedure is proposed to address the correlation among p-values. We also develop power and sample size calculations for the D_CDF test. Extensive empirical assessments on correlated data demonstrate that the (weighted and/or c-level truncated) D_CDF tests have well-controlled Type I error rates and high power for small effect sizes. We applied our method to gene expression data in mice and identified significant pathways related to mouse body weight.
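The idea of comparing CDFs of -ln(P) can be illustrated with a small simulation. If P is uniform on (0, 1) under H0, then -ln(P) follows a standard exponential distribution, so a sample that includes some tests with real effects should drift away from the Exp(1) CDF. The sketch below is not the D_CDF test from the paper: it uses a Kolmogorov-Smirnov-style maximum gap as a stand-in, and the names `simulate_neg_log_p` and `max_cdf_gap`, as well as the choice of an exponential with a larger mean for the contaminating component, are assumptions made only for illustration.

```python
import math
import random


def simulate_neg_log_p(n, frac_alt, mean_alt, seed=0):
    """Draw n values of -ln(P) from a two-component exponential mixture.

    Null tests give Exp(1); a fraction frac_alt comes from an exponential
    with mean mean_alt (an assumed form for the contaminating component).
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        mean = mean_alt if rng.random() < frac_alt else 1.0
        values.append(rng.expovariate(1.0 / mean))  # rate = 1 / mean
    return values


def max_cdf_gap(values):
    """Largest gap between the empirical CDF and the Exp(1) CDF.

    This is a Kolmogorov-Smirnov-style distance, used here only as a
    stand-in for a CDF-based comparison.
    """
    xs = sorted(values)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        null_cdf = 1.0 - math.exp(-x)
        # Check the empirical CDF just after and just before each jump.
        gap = max(gap, (i + 1) / n - null_cdf, null_cdf - i / n)
    return gap


if __name__ == "__main__":
    null_sample = simulate_neg_log_p(5000, frac_alt=0.0, mean_alt=3.0, seed=1)
    mixed_sample = simulate_neg_log_p(5000, frac_alt=0.3, mean_alt=3.0, seed=2)
    print("gap under H0:     ", round(max_cdf_gap(null_sample), 4))
    print("gap under mixture:", round(max_cdf_gap(mixed_sample), 4))
```

In this toy setting the gap for the pure null sample should stay close to zero, while the contaminated sample shows a clearly larger gap. The samples here are independent; the correlation among p-values that the abstract addresses with a copula-based procedure is not modelled.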
Original language | English |
---|---|
Pages (from-to) | 187-200 |
Number of pages | 14 |
Journal | Statistics and its Interface |
Volume | 7 |
Issue number | 2 |
State | Published - 2014 |
Keywords
- D_CDF test
- Mixture model
- Modified maximum likelihood estimator (MMLE)
- Negative log transformed p-values
- Weight function
- c-level truncated test
ASJC Scopus subject areas
- Statistics and Probability
- Applied Mathematics